Introduction

Current data-driven artificial intelligence (AI) has achieved remarkable success across various fields1,2,3,4,5,6,7,8, primarily through model training and evaluation using datasets9,10,11,12,13,14,15,16,17. However, most datasets contain inherent biases, leading models to learn and exploit unintended task-correlated features or shortcuts, a phenomenon referred to as shortcut learning18,19,20,21,22. This issue undermines the assessment of AI models’ true capabilities, limits our understanding of their underlying mechanisms, and hinders their explainability and robust deployment in critical areas such as healthcare and autonomous driving. As illustrated in Fig. 1a, both humans and AI models may rely on these unintended features when evaluated using such biased datasets, resulting in biased assessments that reflect models’ preferences rather than their true abilities. Given the importance of trustworthy AI applications, developing a shortcut-free evaluation methodology is vital, yet it poses a considerable challenge.

Fig. 1: Reliable evaluation of the capabilities of AI models on shortcut-free datasets.

a When individuals or AI models learn from datasets containing shortcuts, they may use features other than the intended ones to recognize the same samples, leading to misleading evaluation results. In contrast, when learning from shortcut-free datasets, different individuals or AI models will only use the intended feature to recognize the same samples, thus producing reliable evaluation results. b The curse of shortcuts encompasses two challenges. The first challenge lies in covering all possible shortcut features, as the number of features in high-dimensional data grows exponentially with the data dimensions. The second challenge is in intervening in the covered shortcut features, where the overall label is coupled with local features, making it inevitable that intervening in local features will affect the overall label. c SHL includes a model suite composed of models with different inductive biases and learns the SH of high-dimensional datasets through the intersection of features learned by each model. The more models in the model suite, the more accurate the learning of SH. The diversity in the inductive biases of the models significantly accelerates the learning speed of SH, and directly learning SH avoids intervening in the features of the data, thus addressing both challenges mentioned in (b).

Richard Feynman, in his 1974 lecture “Cargo Cult Science”, highlighted the phenomenon of shortcuts in experimental designs, particularly in psychology and education. He noted that these designs superficially adhere to scientific principles without genuine scientific validation, leading to erroneous conclusions23. To eliminate shortcuts, psychological experiments typically hypothesize all potential shortcut variables and manipulate them to observe their impact18,23. A similar approach has been applied in AI, where researchers manipulate predefined shortcut features to create out-of-distribution (OOD) datasets and use these generalization tests to determine whether models have learned shortcut features18,21,24,25. However, this method only identifies specific shortcuts and fails to diagnose the entire dataset. Unlike traditional disciplines, which often deal with low-dimensional variables, AI faces the challenge of high-dimensional data, a complexity we refer to as the curse of shortcuts. As depicted in Fig. 1b, high-dimensional data exponentially increase the number of features, making it challenging to account for all possible shortcuts. Furthermore, in such complex data, overall labels are often intertwined with local features, making it impossible to intervene in specific shortcut features without affecting the overall label.

In this article, we introduce a paradigm for diagnosing shortcuts in high-dimensional datasets—shortcut hull learning (SHL)—which effectively addresses the curse of shortcuts. Building on SHL, we design a shortcut-free evaluation framework (SFEF). By formalizing a unified representation theory of data shortcuts within a probability space, we define a fundamental indicator for determining whether a dataset contains shortcuts, termed the shortcut hull (SH)—the minimal set of shortcut features, as shown in Fig. 1c. Inspired by the existing dataset suites26,27 used to evaluate AI models, SHL incorporates a model suite composed of models with different inductive biases21,24,25,28,29,30,31 and employs a collaborative mechanism to learn the SH of high-dimensional datasets. This approach facilitates efficient and direct learning of the SH, enabling the diagnosis of dataset shortcuts and circumventing the curse of shortcuts that afflicts conventional coverage and intervention approaches.

To validate SHL and SFEF, we apply them to study global topological perceptual capabilities, which are fundamental to biological visual perception32,33,34. Notably, Minsky and Papert previously used these capabilities to examine the expressive power of neural networks35,36,37, emphasizing their importance in both understanding biological cognition and analyzing AI models. Our experiments demonstrate that SHL quickly and accurately identifies inherent shortcuts in topological datasets representing global capabilities. Using SFEF, we successfully construct a shortcut-free topological dataset and apply it to evaluate the global capabilities of mainstream deep neural networks (DNNs). Previous studies on global capabilities failed to eliminate local shortcuts, thus inherently studying models’ preferences for global versus local features. These studies concluded that convolutional neural network (CNN)-based models21,24,28,38,39 are inferior to Transformer-based models25,29,30,31,40,41,42 in global capability and that DNNs are less effective than humans at recognizing global properties14,15,16. However, using SFEF, we arrive at significantly different and more reliable conclusions: CNN-based models outperform Transformer-based models in recognizing global properties, and all DNNs surpass human capabilities, challenging previous understandings of DNNs’ abilities. More importantly, our findings confirm a critical point: models’ learning preferences do not represent their learning capabilities. Moreover, our constructed topological dataset enables a shift from Minsky and Papert’s representational analysis of neural networks on the connectedness predicate35 to an empirical investigation of their learning capacity.

Overall, SHL addresses the fundamental bottleneck in eliminating shortcuts from high-dimensional datasets, and the SFEF establishes a foundational framework for unbiased evaluation of AI models. Future development of more shortcut-free datasets using this approach promises to deepen our understanding of AI, offering new insights for research and potentially setting the groundwork for fairer comparisons between human and AI capabilities.

Results

Probabilistic formulation of data shortcuts

Data shortcuts arise from the inherent nature of the data itself and are thus independent of any specific representation; however, the same data can often be represented in various ways. For example, when data are represented as real-valued vectors, the number and semantics of dimensions can vary significantly across representations, making alignment between them challenging. This misalignment complicates the analysis of data shortcuts. To address this issue, we employ formal methods from probability theory. In probability theory, a random variable is a mapping from a sample space to the real number space, where the sample space can be considered as the space representing the data itself. Different random variables can then be viewed as different representations. Through this approach, we can formalize a unified shortcut representation in probability space, independent of the specific representation of the data.

Considering a classification problem, let \((\Omega,{{{\mathcal{F}}}},{\mathbb{P}})\) denote a probability space, where Ω denotes the sample space, \({{{\mathcal{F}}}}\) is a σ-algebra of events, and \({\mathbb{P}}\) is a probability measure on \({{{\mathcal{F}}}}\). The joint random variable of input and label is represented by the mapping \((X,Y):\Omega \to {{\mathbb{R}}}^{n}\times {\{0,1\}}^{c}\). The training event is defined as

$$\{\omega \in \Omega \,|\, X(\omega )\in B,Y(\omega )\in P\}\in {{{\mathcal{F}}}},\quad \forall B\in {{{\mathcal{B}}}}({{\mathbb{R}}}^{n}),\forall P\in {2}^{{\{0,1\}}^{c}},$$
(1)

where \({{{\mathcal{B}}}}({{\mathbb{R}}}^{n})\) denotes the Borel σ-algebra on \({{\mathbb{R}}}^{n}\), and \({2}^{{\{0,1\}}^{c}}\) denotes the power set of {0, 1}c. The probability of each training event is given by

$${{\mathbb{P}}}_{X,Y}(B,P)={\mathbb{P}}(\{\omega \in \Omega \,|\, X(\omega )\in B,Y(\omega )\in P\}),\quad \forall B\in {{{\mathcal{B}}}}({{\mathbb{R}}}^{n}),\forall P\in {2}^{{\{0,1\}}^{c}},$$
(2)

where \({{\mathbb{P}}}_{X,Y}\) is the joint probability distribution of (X, Y). The information contained in the input random variable X is represented by the σ-algebra generated by X:

$$\sigma (X)=\{E\subseteq \Omega \,|\, E={X}^{-1}(B),\forall B\in {{{\mathcal{B}}}}({{\mathbb{R}}}^{n})\}\subseteq {{{\mathcal{F}}}},$$
(3)

which includes all events of interest in \({{{\mathcal{F}}}}\). It should be noted that, given a random variable X, there may exist \({X}^{{\prime} }\) with \({X}^{{\prime} }{=}^{a.s.}X\) but \(\sigma ({X}^{{\prime} })\,\ne\, \sigma (X)\). This is because possibly \(\exists \omega \in (\Omega \setminus {{{\rm{supp}}}}({\mathbb{P}})),X(\omega )\,\ne\, {X}^{{\prime} }(\omega )\), where \({{{\rm{supp}}}}({\mathbb{P}})\) denotes the support of \({\mathbb{P}}\). Similarly, the label information is defined as the collection of all events in the label random variable Y, represented by the σ-algebra generated by Y:

$$\sigma (Y)=\{E\subseteq \Omega \,|\, E={Y}^{-1}(P),\forall P\in {2}^{{\{0,1\}}^{c}}\}\subseteq {{{\mathcal{F}}}}.$$
(4)

Since there are c categories in a classification context, the label \(Y={\{{{\mathbb{I}}}_{{A}_{j}}\}}_{j=1}^{c}\), where \({{\mathbb{I}}}_{{A}_{j}}\) is the indicator function of the set Aj, and the finite set \(C={\{{A}_{j}\}}_{j=1}^{c}\subseteq {{{\mathcal{F}}}}\) partitions the sample space Ω. Thus, the label Y acts as a classifier, linking each dimension of Y to the label of the corresponding event Aj43. Hence,

$$\sigma (Y)={2}^{C},$$
(5)

where 2C represents the power set of C.

If the label Y is learnable from the input X, this implies that all the information in Y is also contained in X, i.e., \(\sigma (Y)\subseteq \sigma (X)\). In this case, there exists a Borel measurable function f such that σ(f(X)) = σ(Y). Let \({{{\mathcal{Y}}}}=\{\sigma ({Y}^{{\prime} })| {Y}^{{\prime} }{=}^{a.s.}Y,\sigma ({Y}^{{\prime} })\subseteq \sigma (X)\}\) denote the collection of all possible partitionings of the sample space Ω induced by the label random variable Y. We denote the intended partitioning of Ω by \(\sigma ({Y}_{{{{\rm{Int}}}}})\in {{{\mathcal{Y}}}}\). The essence of shortcuts in data is that the data distribution \({{\mathbb{P}}}_{X,Y}\) admits solutions beyond the intended one. Formally, when the data distribution \({{\mathbb{P}}}_{X,Y}\) exhibits shortcuts, it holds that \(| {{{\mathcal{Y}}}}| > 1\), i.e.,

$$\exists \sigma (\,\,f(X))\in {{{\mathcal{Y}}}},\quad \sigma (f(X))\, \ne\, \sigma ({Y}_{{{{\rm{Int}}}}}).$$
(6)
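
To make Eq. (6) concrete, consider a minimal toy construction of our own (not one of the paper’s experiments): a four-outcome sample space whose inputs carry a shape feature and a texture feature, with the intended label given by shape. Because the support contains only outcomes where texture agrees with shape, a texture-based rule is also almost surely correct, so \(| {{{\mathcal{Y}}}}| > 1\) and the distribution exhibits a shortcut.

```python
from itertools import product

# Toy illustration of Eq. (6); the encoding is an assumption for illustration.
# Each outcome w = (shape, texture) has two binary features; the intended label
# Y_Int is the shape. The support E_ID contains only outcomes where texture
# agrees with shape, so a texture-based rule is also almost surely correct.

omega = [w for w in product([0, 1], repeat=2)]    # sample space: (shape, texture)
support = [(0, 0), (1, 1)]                        # E_ID: texture == shape
y_int = lambda w: w[0]                            # intended solution: shape

candidates = [lambda w: w[0],                     # shape-based rule
              lambda w: w[1],                     # texture-based rule (shortcut)
              lambda w: w[0] ^ w[1]]              # fails even on the support

partitions = set()
for f in candidates:
    if all(f(w) == y_int(w) for w in support):    # f(X) = Y almost surely
        partitions.add(frozenset(frozenset(w for w in omega if f(w) == y)
                                 for y in (0, 1)))

print(len(partitions))   # prints 2: two distinct partitionings of Omega
```

The shape-based and texture-based rules agree on the support but induce different partitionings of Ω, which is precisely the condition \(| {{{\mathcal{Y}}}}| > 1\).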

Theory for diagnosing data shortcuts with shortcut hull learning

For a Borel measurable function f, if \(f(X){=}^{a.s.}Y\), i.e., \({{\mathbb{P}}}_{X,Y}[\,\,f(X)\,\ne\, Y]=0\), then \(\sigma (f(X))\in {{{\mathcal{Y}}}}\). In practice, neural networks can be employed to learn the function f. Given a dataset \({{{\mathcal{D}}}}={{\mathbb{P}}}_{X,Y}^{d}\), classification tasks typically divide \({{{\mathcal{D}}}}\) into a training set \({{{{\mathcal{D}}}}}_{{{{\rm{train}}}}}\) and a test set \({{{{\mathcal{D}}}}}_{{{{\rm{test}}}}}\). If \(\frac{| \{(x,y)\in {{{{\mathcal{D}}}}}_{{{{\rm{test}}}}}| f(x)\ne y\}| }{| {{{{\mathcal{D}}}}}_{{{{\rm{test}}}}}| }=0\), it indicates that f has learned the data distribution \({{\mathbb{P}}}_{X,Y}\). Assuming a neural network has m layers, with the first i layers denoted by gi, then f = gm. It can be posited that each layer of the neural network performs feature extraction, resulting in the relationship \(\sigma (X)\supseteq \sigma ({g}_{1}(X))\supseteq \cdots \supseteq \sigma ({g}_{m}(X))\) by the Doob–Dynkin lemma44. When interpreting neural networks, it is common to avoid treating thresholded output labels as output features due to the potentially significant information loss. Instead, we use the soft-label outputs or features from earlier layers as the task-correlated features learned by the neural networks, denoted as g(X), implying that \(\sigma (g(X))\supseteq \sigma (f(X))\). Multiple neural networks can be employed to learn the intrinsic features of the dataset, i.e., \(\sigma (f(X))={\bigcap }_{g| \sigma (g(X))\supseteq \sigma (f(X))}\sigma (g(X))\). By equivalently rewriting Eq. (6) as \({\bigcap }_{f| {{\mathbb{P}}}_{X,Y}[f(X)\ne Y]=0}\sigma (f(X))\,\subsetneq\, \sigma ({Y}_{{{{\rm{Int}}}}})\), we obtain

$${\bigcap}_{g| \sigma (g(X))\supseteq \sigma (f(X)),{{\mathbb{P}}}_{X,Y}[f(X)\ne Y]=0}\sigma (g(X))\,\subsetneq\, \sigma ({Y}_{{{{\rm{Int}}}}}).$$
(7)

We refer to Eq. (7) as the SH, which intuitively represents the minimal shortcut. As the number of models increases, it converges exponentially (see Supplementary Note 1 for proof). This approach of using multiple models to learn the SH is termed SHL. To map the abstract SHL in the probability space onto real-valued features that are both observable and operable, and to represent features from different models within the same space, we associate the neural network features with the input X. Consequently, Eq. (7) can be reformulated as follows:

$$\exists \omega \in {{{\rm{supp}}}}({\mathbb{P}}),\quad {\bigcap}_{I:\Omega \to {2}^{{\{i\}}_{i=1}^{n}}| \sigma ({X}_{I})\supseteq \sigma (f(X)),{{\mathbb{P}}}_{X,Y}[f(X)\ne Y]=0}J(I(\omega ))\,\subsetneq\, J({I}_{{{{\rm{Int}}}}}(\omega )),$$
(8)

where \(I:\Omega \to {2}^{{\{i\}}_{i=1}^{n}}\) denotes a random subset index mapping of X. For any \(I:\Omega \to {2}^{{\{i\}}_{i=1}^{n}}\) and ω ∈ Ω, \(J(I(\omega ))=\{{\bigcup }_{{I}^{{\prime\prime} }(\omega )\subseteq {\{i\}}_{i=1}^{n}| {X}_{{I}^{{\prime\prime} }}^{-1}({X}_{{I}^{{\prime\prime} }}(\omega ))={X}_{{I}^{{\prime} }}^{-1}({X}_{{I}^{{\prime} }}(\omega ))}\{{I}^{{\prime\prime} }(\omega )\}| {I}^{{\prime} }(\omega )\subseteq I(\omega )\}\). A detailed derivation of Eq. (8) can be found in the Methods section.

Shortcut-free evaluation framework

To reliably evaluate a model’s performance using shortcut-free datasets, we propose an SFEF based on the SHL. This framework consists of five steps, as illustrated in Fig. 2. Step 1 outlines the required structure of the dataset. Steps 2 to 4 demonstrate the use of a model suite to diagnose dataset shortcuts, and Step 5 shows the reliable evaluation of models using a shortcut-free dataset.

Fig. 2: Illustration of the proposed shortcut-free evaluation framework.

Step 1. Construct a dataset targeted at a specific capability, with the dataset divided into multiple difficulty levels. The data distribution across these levels differs only in terms of difficulty. The lowest level, level 0, is also known as the diagnostic level, while the other levels are collectively referred to as evaluation levels. For instance, we consider a topological dataset designed to assess global capabilities, with further details provided in the section Design of the shortcut-free topological dataset. Step 2. Construct a model suite comprising models with various inductive biases. Step 3. Train the models in the suite using the diagnostic level data from Step 1, then select the models that can completely learn the data distribution. For classification tasks, this means achieving 100% accuracy on the test set. Step 4. Diagnose the shortcuts present in the dataset using the models selected in Step 3 via SHL. Here, the symbol  ∩ denotes the application of Eq. (8) at different hierarchies in data, and the red numbers indicate the total count of pixel position features, facilitating the diagnosis of whether the data can be used for global capability evaluation. The hierarchical structure of shortcuts is detailed in the section Diagnosing the topological dataset using shortcut hull learning. Step 5. Evaluate the models' capabilities using the datasets from the different difficulty levels established in Step 1.

In Step 1, we construct an evaluation dataset tailored to assess specific model capabilities. This dataset is structured into multiple levels of difficulty, where the distribution of data differs only in difficulty. The rationale for the multi-level difficulty approach is threefold:

  • Learning data distribution: As indicated by Eq. (7), the model must fully learn the data distribution, necessitating a low-difficulty dataset that adheres to an identical distribution.

  • Model capability assessment: If the data distribution is shortcut-free, the model’s accuracy on datasets of the same difficulty level will either be random or 100%, indicating whether the model has successfully captured the capability or failed to do so. The multi-level difficulty approach offers a clearer and more precise assessment of the model’s capability.

  • Data difficulty expansion: The multi-level difficulty approach also facilitates easier expansion of the dataset’s difficulty range.

In Step 2, considering that models with different inductive biases may favor different solutions for the same data29, we construct a suite of models with various inductive biases. Each model in the suite is trained on the lowest difficulty level dataset, referred to as the diagnostic level, to learn the data distribution. Step 3 involves selecting models from the suite that have fully learned the data distribution, such as those achieving 100% accuracy on a classification task’s test set. In Step 4, these selected models are then employed to diagnose shortcuts in the diagnostic level data using SHL. Similar to how a model’s performance is evaluated using diverse datasets to assess generalization, increasing the number and variety of models enhances SHL’s accuracy in diagnosing dataset shortcuts. Since all difficulty levels in Step 1 share the same distribution, the presence of shortcuts in the diagnostic level implies the existence of shortcuts across all difficulty levels. Finally, in Step 5, we use the shortcut-free datasets, identified in Step 4, across varying difficulty levels, to evaluate the capability of the models. This ensures a reliable assessment, free from the confounding effects of data shortcuts.
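
The five steps can be condensed into the following sketch; every name here (train, accuracy, shortcut_hull, and the structure of levels) is a hypothetical placeholder rather than an interface from this work:

```python
# A minimal sketch of the SFEF pipeline; all callables are user-supplied
# placeholders, not an implementation released with this paper.

def shortcut_free_evaluation(levels, model_suite, train, accuracy, shortcut_hull):
    """levels[0] is the diagnostic level; all levels share one distribution."""
    diagnostic = levels[0]                                   # Step 1

    learned = []
    for model in model_suite:                                # Step 2: diverse biases
        train(model, diagnostic)
        if accuracy(model, diagnostic) == 1.0:               # Step 3: perfect learners
            learned.append(model)

    if not shortcut_hull(learned, diagnostic).is_shortcut_free:  # Step 4: SHL, Eq. (8)
        raise ValueError("dataset contains shortcuts; repair the data first")

    scores = {}
    for level in levels:                                     # Step 5: reliable evaluation
        level_scores = []
        for model in model_suite:
            train(model, level)
            level_scores.append(accuracy(model, level))
        scores[level] = level_scores
    return scores
```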

Design of the shortcut-free topological dataset

To illustrate the efficacy of the proposed SHL and SFEF, we design a dataset that captures the global properties of visual perception without local shortcuts. The global nature of visual perception plays a pivotal role for both humans and AI. For humans, global features allow for rapid object detection without the need for an in-depth analysis of local characteristics, a capability crucial for survival in natural environments. Given the pronounced sensitivity of the human visual system to global features, it is posited that in visual processing, global features are processed first, guiding the interpretation of local features32,33. Similarly, in AI, the importance of the global connectedness predicate was underscored by Minsky and Papert during the early development of neural networks35,36. In contemporary large-scale models45,46, the attention mechanism40 is primarily designed to endow the model with the ability to directly perceive global information.

Global properties of visual perception can be described in terms of topological properties, which remain invariant under continuous transformations such as stretching, twisting, crumpling, and bending. These properties are fundamental to geometric objects and remain unchanged under topological transformations. When considering two-dimensional manifolds projected onto the retina, these invariant properties can be categorized into three main topological facets: connectivity, the count of holes, and inside/outside relationships32,33. While topological properties are not directly identifiable by local features, there is a risk of spurious correlation with local features during dataset construction. To address this, we design a visual topological synthesis dataset, which is crafted such that only minor and localized changes are needed to modify the topological properties of the images, without impacting the local statistics. This generation strategy enables the verification and control of local shortcuts.

In this visual topological dataset, each image consists of several closed loops, where images with different numbers of closed loops are topologically distinct. Moreover, closed loops of the same quantity may differ in their inside/outside relationships, further contributing to their topological uniqueness. In the synthesis process, images with a higher number of closed loops are generated from those with fewer closed loops through minor transformations. For simplicity, we assume the dataset contains images with either 1 or 2 closed loops, although the method can be extended to images with more loops. The synthesis process is shown in Fig. 3, where Fig. 3a–c represent different stages of synthesis, respectively. As depicted in Fig. 3a, we begin by initializing a graph G with n × n nodes, corresponding to the original image resolution of (4n + 1) × (4n + 1). Each node exclusively connects to adjacent nodes in four cardinal directions: top, bottom, left, and right. Every node represents a 3 × 3 pixel closed-loop node from the original image, with adjacent closed-loop nodes separated by one pixel.

Fig. 3: Synthesis process of visual topological dataset.

a The fine-scale synthesis process reveals the structure of the data, the initialization procedure, and the manner of a single-node step traversal. b The medium-scale synthesis process demonstrates the process of node random walk. c The coarse-scale synthesis process depicts the full generation process of a single closed loop. d Upon minor local modifications, a single closed loop gives rise to two types of closed loops, differentiated by outside/inside relations.

To illustrate the generation of a visual topological dataset, we consider a graph with a resolution of 29 × 29, i.e., 7 × 7 nodes, as an example. To create a random closed loop, Wilson’s algorithm47 is employed to generate a uniform spanning tree of the graph G. We start by randomly selecting a vertex of graph G to form a single-vertex tree T (the white closed-loop vertex of Fig. 3a), representing a closed-loop vertex in the original image. Another vertex v, chosen randomly and not in the tree T (the red closed-loop vertex in Fig. 3a), undergoes a loop-erased random walk from v until it reaches a vertex in tree T, at which point the resulting path is appended to T. Upon connecting two closed-loop vertices, the corresponding closed loops merge into a single closed loop. The process of loop-erased random walk, illustrated in Fig. 3b, starts from vertex v and at each step moves to a randomly chosen adjacent vertex, merging the two corresponding closed loops, until it reaches a vertex in T. If the random walk revisits its own path, forming a loop, the loop is erased from the path before the walk continues. Figure 3c illustrates a complete process of generating a single closed loop, resulting in a uniform spanning tree after all vertices have been included.
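
A minimal sketch of this loop-generation step is given below, assuming a pure-Python grid-graph encoding of our own; only the algorithmic steps follow Wilson’s algorithm47 as just described.

```python
import random

def wilson_uniform_spanning_tree(n, seed=None):
    """Wilson's algorithm on the n x n grid graph (illustrative sketch).

    Returns the edge list of a uniform spanning tree. In the paper's image
    encoding, drawing this tree with 3 x 3 pixel nodes and 1-pixel connecting
    edges produces a single random closed loop as the tree's outline.
    """
    rng = random.Random(seed)
    nodes = [(r, c) for r in range(n) for c in range(n)]

    def neighbors(v):
        r, c = v
        return [(r + dr, c + dc)
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if 0 <= r + dr < n and 0 <= c + dc < n]

    in_tree = {rng.choice(nodes)}            # the initial single-vertex tree T
    edges = []
    for v in nodes:
        if v in in_tree:
            continue
        path = [v]                           # loop-erased random walk from v
        while path[-1] not in in_tree:
            nxt = rng.choice(neighbors(path[-1]))
            if nxt in path:                  # the walk closed a loop: erase it
                path = path[:path.index(nxt) + 1]
            else:
                path.append(nxt)
        edges += list(zip(path, path[1:]))   # append the loop-erased path to T
        in_tree.update(path)
    return edges                             # exactly n*n - 1 edges
```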

Figure 3d shows the generation of two closed loops from a single closed loop with minor local changes, allowing for the randomization of their inside/outside relationships. By modifying only the elements within a 3 × 3 pixel region, the topological properties of the image can be altered without affecting its local statistics. Since the generated closed loop corresponds to the spanning tree T, which is a minimal connected subgraph of the graph G, every edge of T is a cut edge. Removing any edge from T results in two connected subgraphs, thereby generating the two closed loops that represent the outside relation.

However, owing to the way the closed loop is generated, the local features of the regions inside and outside the loop are identical. Therefore, the connecting edges outside the closed loop can be severed in the same manner as those inside it. This process yields two closed loops corresponding to the inside relation depicted in Fig. 3d. When partitioning the outside of a closed loop, there is only one possible outside pathway, so a closed loop inevitably forms inside. Initially, this outside pathway itself constitutes a closed loop, but the location of the closure changes during the process.

In this dataset, we categorize images into three distinct topological classes: a single connected region, two connected regions with an outside relation, and two connected regions with an inside relation. To ensure consistent difficulty in data generation across the same difficulty level (i.e., resolution), we pre-generate 1,000,000 samples for both the second and third data classes at each difficulty level. From these, we select the top 10,000 samples with the largest adjacent regions between two topological loops. These selected samples are then split into 8000 training samples and 2000 testing samples. Consequently, each difficulty level contains 30,000 samples in total. The full dataset is available at Figshare48.
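
For verification, the class of a generated image can also be recovered from the image alone via connected-component analysis. The checker below is a hypothetical sketch of ours, not the generator’s labeling code; it assumes binary images whose thick loop curves are 4-connected, as produced by the synthesis above.

```python
import numpy as np
from scipy import ndimage

def classify_topology(img):
    """Assign one of the three topological classes to a binary image.

    img: 2-D array with 1 on loop pixels and 0 on background. Returns
    0 (one loop), 1 (two loops, outside relation), 2 (two loops, inside relation).
    """
    loops, n_loops = ndimage.label(img)              # connected loop curves
    if n_loops == 1:
        return 0
    bg, _ = ndimage.label(img == 0)                  # background regions
    border = set(bg[0, :]) | set(bg[-1, :]) | set(bg[:, 0]) | set(bg[:, -1])
    outside = np.isin(bg, [i for i in border if i != 0])
    near_outside = ndimage.binary_dilation(outside)  # pixels touching the outside
    for k in range(1, n_loops + 1):
        if not np.any(near_outside & (loops == k)):  # loop k never meets the outside
            return 2                                 # nested loops: inside relation
    return 1                                         # side-by-side: outside relation
```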

Statistical characteristics of the topological dataset

The purpose of the topological dataset is to identify global topological features. Hence, we analyze potential statistical correlations in the local properties of images from different classes. In the topological data generation process, images of different classes are produced by modifying pixel values within a local 3 × 3 range of images from the same class. Our goal is to ascertain the impact of these local alterations on the statistical characteristics of the surrounding region. Therefore, we statistically assess the presence of class-distinguishable local features at various n × n pixel scales. In the generation process shown in Fig. 3d, it is evident that the statistical characteristics of 1 × 1 local patterns are consistent across different classes.

As depicted in Fig. 4a, all possible 3 × 3 local patterns are enumerated. While there are \({2}^{3\times 3}=512\) potential patterns for 3 × 3 pixels, not all of them manifest in this configuration. Additionally, certain patterns exhibit strong correlations; for instance, rotations and reflections of a single pattern can produce up to 8 variants. After excluding non-occurring and strongly correlated patterns, we present the remaining 3 × 3 local patterns and demonstrate their statistical characteristics within a 29 × 29-sized dataset through template matching.

Fig. 4: Diagnosis of the topological dataset’s shortcuts.

a Statistical characteristics of a 3 × 3 template in a 29 × 29-sized dataset. b The proportion of distinguishable classes under different template sizes in the 29 × 29 dataset. The horizontal axis represents the size of square templates, such as 1 × 1, 2 × 2, 3 × 3, etc. c Diagnosis based on the SHL by models with different inductive biases, including ResNet-5026, ViT-B/1641, RepVGG-A257, Swin-T58, PViG-S59, ResNeXt-5068, Inception-V363, ConvMixer-1024/1069, EfficientNet-B470, RegNetX-4.0GF71, and SE-ResNet-5072. The bar chart displays the count of topological dataset features for different classes within each model. The line chart indicates the count of common global features across different models, with each point on the horizontal axis representing common features for all models on the left. Error bars in both charts indicate standard deviation, capturing the variability across different samples within the dataset.

In Fig. 4b, we illustrate the proportion of distinguishable categories within a 29 × 29-sized dataset at different template scales. Similar to the 3 × 3 matching templates portrayed in Fig. 4a, templates of varying sizes are used for image matching. For each image, we search for a template of the corresponding size and match it against templates from the other two categories. If a matching template is found in the other two classes, the image is statistically indistinguishable at that scale. Conversely, the absence of a matching template indicates statistical distinguishability. The results in Fig. 4b reveal that when the template size reaches 10 × 10, only about 1% of the dataset becomes distinguishable by class. Hence, we conclude that categorizing data based on minor local changes does not introduce local statistical correlations. Due to the substantial computational demands of brute-force template matching, our analysis is limited to a 10 × 10 scale. Nonetheless, the results are conclusive for our purposes.
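
The brute-force procedure can be sketched as follows; the patch-set encoding and function names are our own, while the distinguishability criterion mirrors the description above.

```python
import numpy as np

def patches(img, k):
    """The set of distinct k x k patches of one binary image, as byte strings."""
    H, W = img.shape
    return {img[r:r + k, c:c + k].tobytes()
            for r in range(H - k + 1) for c in range(W - k + 1)}

def distinguishable_fraction(classes, k):
    """Fraction of images holding a k x k patch absent from all other classes.

    classes: a list of lists of 2-D uint8 arrays, one inner list per class.
    """
    pools = [set().union(*(patches(im, k) for im in imgs)) for imgs in classes]
    hits = total = 0
    for ci, imgs in enumerate(classes):
        others = set().union(*(pools[cj] for cj in range(len(classes)) if cj != ci))
        for im in imgs:
            total += 1
            hits += bool(patches(im, k) - others)  # some patch unique to this class
    return hits / total
```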

Diagnosing the topological dataset using shortcut hull learning

Through prior statistical analysis of the topological dataset, we demonstrate that the designed dataset is shortcut-free concerning global properties. Below, we show how the SHL diagnosis verifies the shortcut-free nature, as established by the earlier statistical analysis. Furthermore, by altering Ω, we illustrate how SHL diagnosis can reveal different hierarchies of shortcuts within the topological dataset.

To assess whether the global properties exhibit shortcuts related to local properties, we focus specifically on two types of features: global and local. This is consistent with the feature types analyzed in the earlier statistical analysis. In Eq. (8), a threshold value (thresh) is used to demarcate global and local characteristics. When \(| I(\omega )| < {{{\rm{thresh}}}}\), \(J(I(\omega ))=\{\{{I}^{{\prime} }(\omega )\subseteq {\{i\}}_{i=1}^{n}| | {I}^{{\prime} }(\omega )| < {{{\rm{thresh}}}}\}\}\), yielding \(| J(I(\omega ))|=1\). Conversely, when \(| I(\omega )| \ge {{{\rm{thresh}}}}\), \(J(I(\omega ))=\{\{{I}^{{\prime} }(\omega )\subseteq {\{i\}}_{i=1}^{n}| | {I}^{{\prime} }(\omega )| < {{{\rm{thresh}}}}\},\{{I}^{{\prime} }(\omega )\subseteq {\{i\}}_{i=1}^{n}| | {I}^{{\prime} }(\omega )| \, \ge \, {{{\rm{thresh}}}}\}\}\), and thus \(| J(I(\omega ))|=2\). As illustrated by the red numbers in Fig. 5 across different models, the values indicate the allowable range for the thresh parameter. For example, in the Class 1 case, ResNet-50 has 355 features of pixel position, implying that thresh ≤ 355. After validating all five models, the minimal number of features across them is 343, and so the threshold is constrained to thresh ≤ 343. This threshold allows us to test the global properties of the model below the 343-pixel region. It is important to note that this result is based on a single data sample. In the bar charts in Fig. 4c, we present the diagnostic results of global properties across the entire topological dataset and multiple models.

Fig. 5: An example of the shortcut hull learning.

For each data class, we select a representative image as an example. We then showcase the features of this image as processed by models with different inductive biases, including ResNet-5026, ViT-B/1641, RepVGG-A257, Swin-T58, and PViG-S59. These features are projected onto the input image using the HiResCAM56 method. After thresholding, we obtain features corresponding to the pixel positions of the image. The common features across different models are then computed using Eq. (8). Beneath each feature, the red numbers indicate the total count of pixel position features.

Previously, we diagnosed whether global properties exhibit locality. Next, we demonstrate how to diagnose shortcuts involving different global properties. In this case, we not only consider local and global features, but also distinguish between different global features. Under this setting, for any given I(ω), \(| J(I(\omega ))|={2}^{| I(\omega )| }\). As illustrated in Fig. 5, for the Class 1 example, after validation across all five models, the number of shared features is 37. This allows testing of global properties that are consistent across models below the 37-pixel region. The line chart in Fig. 4c displays the diagnostic results for common global properties across the complete topological dataset and multiple models.
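
Given per-model boolean masks of thresholded attribution maps (as in Fig. 5), both hierarchies reduce to simple set arithmetic. The snippet below illustrates the counting only and assumes such masks are already computed; it is not the exact diagnostic code.

```python
import numpy as np

def shl_counts(masks):
    """masks: dict mapping model name to a boolean pixel-position array."""
    per_model = {name: int(m.sum()) for name, m in masks.items()}
    common = np.logical_and.reduce(list(masks.values()))   # intersection, Eq. (8)
    return per_model, int(common.sum())

# Per-model counts bound the local/global threshold: thresh <= min(per_model.values());
# the intersection count is the shared-feature number of the cross-global hierarchy.
```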

Figure 4c presents statistical data on the number of features in models with different inductive biases. To ensure a fair comparison, these models have comparable parameter counts and computational costs. The bar and line charts corroborate our previous statistical analysis, showing that the local-global hierarchy is shortcut-free, as indicated by the relatively high minimum number of features for different models. This demonstrates the effectiveness of the SHL diagnostic paradigm. However, when examining the cross-global hierarchy, the line chart shows that although the number of features for different models remains high, the number of common features decreases, with the trend gradually flattening as more models are added. This suggests that SHL quickly converges when shortcuts are present.

Evaluating models on the shortcut-free topological dataset

We assess global properties of models using a standard topological dataset structured by varying difficulty levels, with each level defined by a graph G initialized with n × n nodes, where \(n\in {{\mathbb{N}}}^{+}\) represents the difficulty associated with image resolution (4n + 1) × (4n + 1). The difficulty level is defined as \(\frac{n-7}{2}\), setting a resolution of 29 × 29 as difficulty level 0. We begin with the lowest difficulty data, which ensures shortcut-free global properties, before assessing models with data from higher difficulty levels.
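
The level-resolution correspondence is a direct calculation; the small helper below (our own) reproduces the resolutions quoted in this section.

```python
def resolution(level):
    """Image resolution (4n + 1) for difficulty level (n - 7) / 2."""
    n = 2 * level + 7                   # grid nodes per side
    return 4 * n + 1

assert resolution(0) == 29              # the diagnostic level
assert resolution(12) == 125 and resolution(13) == 133
```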

The experimental results are summarized in Table 1. A comparison between ResNet and ViT models reveals that, when parameters and computational costs are comparable, CNNs, designed based on local operations, significantly outperform ViT models, which are intended for global objectives, in recognizing global properties. The Swin Transformer model, which integrates design elements from both CNNs and ViT, exhibits capability comparable to CNNs in recognizing global properties, while MLP-based models fall between CNNs and ViT. These findings challenge previous assumptions about the capabilities of DNNs14,15,16,24,30,41,42. Moreover, with an increase in the number of parameters and computational load, ResNet performance remains relatively stable, while the Swin Transformer exhibits minor degradation. This suggests that a larger model does not necessarily perform better in recognizing global topological properties. Additional experimental details can be found in Supplementary Note 2.

Table 1 Evaluation of global capabilities of various models on topological datasets

Minsky and Papert initially explored the representational capacity of feedforward neural networks by examining their ability to encode topological properties such as connectivity35. In contrast, our proposed shortcut-free topological dataset shifts the focus toward learning capability, offering a more practical and comprehensive method for evaluating how neural networks acquire and generalize topological information. Our experimental results show that the complexity that current feedforward neural networks can recognize far exceeds Minsky and Papert’s initial expectations (see Supplementary Note 4). Furthermore, the complexity that current feedforward neural networks can recognize surpasses that of the data typically used in human psychological experiments32,33, indicating to some extent that these networks have a much greater ability to recognize global properties than humans.

Discussion

This study introduces SHL, a diagnostic paradigm that formalizes shortcut representations within a unified probability space and leverages diverse models with varying inductive biases to effectively uncover and analyze shortcuts in data. Building upon SHL, we propose the SFEF—a shortcut-free evaluation framework validated through the construction of a controlled, topologically grounded dataset aimed at assessing global recognition abilities in DNNs.

Our contributions hold broad implications for both AI development and interdisciplinary work comparing humans and AI. By revealing the discrepancy between perceived and actual model capabilities—for instance, the surprising underperformance of Transformers in global tasks—our findings urge a reassessment of prevailing assumptions in AI cognition. Notably, the superior performance of CNNs in global reasoning challenges the dominant narrative surrounding Transformer architectures, suggesting that architectural biases and training dynamics interact in more nuanced ways than previously assumed.

This work also draws attention to methodological mismatches in human–AI comparative studies10,13,16,49,50,51,52. While shortcut elimination is common in biological cognition experiments, similar strategies do not always translate effectively to AI contexts. Our dataset-centered approach offers a paradigm-specific diagnostic tool tailored to AI’s learning dynamics, enhancing the validity of such comparisons and paving the way for more robust interdisciplinary research.

In contrast to existing mitigation techniques that modify model architectures or training procedures, our approach focuses on removing shortcuts at the dataset level, offering strong generalizability across models. This enables consistent and fair evaluation of learning abilities, rather than learning preferences. By constructing datasets that remove confounding correlations, we shift the emphasis from feature reliance to genuine capability assessment. This distinction is crucial for understanding models’ inductive biases and for developing reliable benchmarks in fairness and reasoning.

Our experiments demonstrate how shortcut-free evaluations, under controlled conditions, can lead to qualitatively different conclusions compared to previous studies. Whereas earlier work—often confounded by unremoved local shortcuts—suggested that DNNs are inferior to humans, and that CNNs underperform Transformers in global reasoning tasks, our results, based on evaluations using our carefully constructed dataset, reveal a contrasting picture: CNN-based models outperform Transformers, and all examined DNNs surpass human performance. While these findings are specific to our experimental setup, they illustrate the impact of shortcut-free evaluations on our understanding of model capabilities.

More importantly, our results underscore a crucial insight observable within our framework: a model’s apparent feature preferences do not necessarily align with its underlying capabilities. This reinforces the importance of using properly controlled datasets when evaluating complex reasoning behaviors—especially when drawing conclusions about architectural advantages or human–AI performance comparisons.

Despite the clear benefits of our framework, several avenues could be further improved. One assumption in our evaluation is that models achieve 100% accuracy on test sets, which enables clean capability assessment under idealized conditions. While this is consistent with methodologies in cognitive psychology, such an assumption may not hold in real-world scenarios involving noisy data and overlapping features. Exploring relaxed evaluation criteria or incorporating probabilistic interpretations could extend the applicability of our method to more complex tasks.

In addition, the dataset construction process involves a computationally intensive sample selection stage, where millions of samples per difficulty level are generated and filtered based on topological properties. While this ensures consistency and control, the volume of required samples may increase substantially for higher difficulty levels, posing challenges in scalability to large or high-dimensional datasets. Future work could explore more efficient selection mechanisms or generative modeling techniques to alleviate this bottleneck.

Several promising directions emerge for future research. First, SFEF could be extended to probe beyond global topological reasoning, enabling broader evaluations of AI capabilities across different cognitive domains. Second, scaling SHL to large-scale, real-world datasets would allow testing its robustness in complex, noisy environments. Third, integrating SHL into continual learning pipelines could support real-time shortcut identification and removal during data acquisition, pushing toward dynamic shortcut-free learning. Finally, applying SHL to multi-modal and cross-modal systems, such as vision-language models, offers a path toward understanding shortcut behavior in more complex AI systems, ultimately contributing to fairer and more trustworthy AI.

Methods

Operational formulation of shortcut hull learning

This section presents the operational formulation of SHL, bridging the abstract probabilistic definition in Eq. (7) with its concrete, feature-based expression in Eq. (8).

Eq. (7) is equivalent to

$$\exists E\in {{{\mathcal{F}}}},\quad {\bigcup}_{g| \sigma (g(X))\supseteq \sigma (f(X)),{{\mathbb{P}}}_{X,Y}[f(X)\ne Y]=0}{X}^{-1}({g}^{-1}(g(X(E)))) \supsetneq {Y}_{{{{\rm{Int}}}}}^{-1}({Y}_{{{{\rm{Int}}}}}(E)).$$
(9)

The partitioning of \({{{\rm{supp}}}}({\mathbb{P}})\) within the sample space Ω induced by Y should be unique, i.e., \(| \{\sigma ({Y}^{{\prime} }){| }_{{{{\rm{supp}}}}({\mathbb{P}})}\,| \,{Y}^{{\prime} }{=}^{a.s.}Y\} |=| \{\{E\cap {{{\rm{supp}}}}({\mathbb{P}})| E\in \sigma ({Y}^{{\prime} })\}| {Y}^{{\prime} }{=}^{a.s.}Y\}|=1\). Consequently, the existence of different solutions within the data essentially indicates different partitionings of \(\Omega \setminus {{{\rm{supp}}}}({\mathbb{P}})\). Essentially, \({{{\rm{supp}}}}({\mathbb{P}})\) represents the in-distribution (ID) sample set, denoted as \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\), while \(\Omega \setminus {{{\rm{supp}}}}({\mathbb{P}})\) represents the OOD sample set, denoted as \({{{{\mathcal{E}}}}}_{{{{\rm{OOD}}}}}\). For the same ID event \({E}_{{{{\rm{ID}}}}}\subseteq {{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\), different solutions will generalize to different OOD events \({E}_{{{{\rm{OOD}}}}}\subseteq {{{{\mathcal{E}}}}}_{{{{\rm{OOD}}}}}\). Therefore, Eq. (9) is equivalent to

$$\exists {E}_{{{{\rm{ID}}}}}\subseteq {{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}},\quad {\bigcup}_{g| \sigma (g(X))\supseteq \sigma (f(X)),{{\mathbb{P}}}_{X,Y}[f(X)\ne Y]=0}{X}^{-1}\left({g}^{-1}(g(X({E}_{{{{\rm{ID}}}}})))\right) \supsetneq {Y}_{{{{\rm{Int}}}}}^{-1}({Y}_{{{{\rm{Int}}}}}({E}_{{{{\rm{ID}}}}})),$$
(10)

where \({X}^{-1}\left({g}^{-1}(g(X({E}_{{{{\rm{ID}}}}})))\right)\) is the event generalized by g(X(EID)), and \({Y}_{{{{\rm{Int}}}}}^{-1}({Y}_{{{{\rm{Int}}}}}({E}_{{{{\rm{ID}}}}}))\) is the intended event generalized by YInt(EID).

By reflecting the abstract SHL in the probability space onto observable and operable real-valued features, we can directly diagnose shortcuts using the ID dataset itself. To represent features of different models within the same space, we map the neural network features to the input X. Typically, each dimension of the input vector is represented as a feature of the input53. In this context, each pixel position of an image is represented as a feature. For grayscale images, this is a pixel value, and for color images, it is a 3-dimensional pixel vector. Note that this choice of features is one of several possible instantiations; the feature representation is not unique, but this does not affect the conclusions derived. Let \(I:\Omega \to {2}^{{\{i\}}_{i=1}^{n}}\) denote a random index subset of the input random variable X. For ω ∈ Ω, define \({X}_{I}(\omega ):={({X}_{i}(\omega ))}_{i\in I(\omega )}\in {{\mathbb{R}}}^{| I(\omega )| }\), which represents a new subvector composed of components of X(ω) indexed by I(ω). Given a neural network feature function g(X), to reflect the SHL on the input space X, we consider a subvector XI such that σ(XI) = σ(g(X)), as illustrated in Fig. 6. We can then rewrite Eq. (10) as

$$\exists {E}_{{{{\rm{ID}}}}}\subseteq {{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}},\quad {\bigcup}_{I:\Omega \to {2}^{{\{i\}}_{i=1}^{n}}| \sigma ({X}_{I})\supseteq \sigma (f(X)),{{\mathbb{P}}}_{X,Y}[f(X)\ne Y]=0}{X}_{I}^{-1}({X}_{I}({E}_{{{{\rm{ID}}}}})) \supsetneq {X}_{{I}_{{{{\rm{Int}}}}}}^{-1}({X}_{{I}_{{{{\rm{Int}}}}}}({E}_{{{{\rm{ID}}}}})).$$
(11)
Fig. 6: An intuitive demonstration of the shortcut hull principle.

a Analyzing how the partitioning of data within the sample space \((\Omega,{{{\mathcal{F}}}},{\mathbb{P}})\) changes as it propagates through the layers of neural networks, and how these changes are reflected in the features of the input X. b When multiple solutions exist in data, different models partition the final layer in the ID sample set \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) in the same way but partition the OOD sample set \({{{{\mathcal{E}}}}}_{{{{\rm{OOD}}}}}\) differently, with these differences being mapped to distinct features in the common input X.

For a given feature representation g(X) and ω ∈ Ω, there may exist multiple index subsets I(ω) such that \({X}_{I}^{-1}({X}_{I}(\omega ))={X}^{-1}({g}^{-1}(g(X(\omega ))))\). For any \(I:\Omega \to {2}^{{\{i\}}_{i=1}^{n}}\) and ω ∈ Ω, define the set:

$$J(I(\omega ))\!:=\left\{\left.{\bigcup}_{{I}^{{\prime\prime} }(\omega )\subseteq {\{i\}}_{i=1}^{n}| {X}_{{I}^{{\prime\prime} }}^{-1}({X}_{{I}^{{\prime\prime} }}(\omega ))={X}_{{I}^{{\prime} }}^{-1}({X}_{{I}^{{\prime} }}(\omega ))}\{{I}^{{\prime\prime} }(\omega )\}\,\right\vert \,{I}^{{\prime} }(\omega )\subseteq I(\omega )\right\}.$$
(12)

Then, Eq. (11) is equivalent to stating that

$$\exists \omega \in {{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}},\quad {\bigcap}_{I:\Omega \to {2}^{{\{i\}}_{i=1}^{n}}| \sigma ({X}_{I})\supseteq \sigma (f(X)),{{\mathbb{P}}}_{X,Y}[f(X)\ne Y]=0}J(I(\omega ))\,\subsetneq\, J({I}_{{{{\rm{Int}}}}}(\omega )).$$
(13)

Morgan’s Canon of data: hierarchical principle for data shortcuts

In comparative psychology, Morgan’s Canon posits that if an animal’s behavior can be adequately explained by lower-level psychological processes, then it should not be attributed to higher-level processes54,55. Similarly, we can derive Morgan’s Canon of data fields. There are two scenarios where shortcuts might occur:

  • For data providers, the measurable space \((\Omega,{{{\mathcal{F}}}})\) represents all possible events within a given research task. In this context, \((\Omega,{{{\mathcal{F}}}})\) remains consistent across different probability spaces, while the ID sample set \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) differs. Consider two probability spaces \((\Omega,{{{\mathcal{F}}}},{{\mathbb{P}}}_{1})\) and \((\Omega,{{{\mathcal{F}}}},{{\mathbb{P}}}_{2})\), with corresponding ID sample sets \({{{{\mathcal{E}}}}}_{{{{\rm{ID1}}}}}\) and \({{{{\mathcal{E}}}}}_{{{{\rm{ID2}}}}}\), where \({{{{\mathcal{E}}}}}_{{{{\rm{ID2}}}}}\subseteq {{{{\mathcal{E}}}}}_{{{{\rm{ID1}}}}}\). Let random variables (X1, Y1) and (X2, Y2) be defined on these two probability spaces, respectively, and satisfy \(\forall \omega \in \Omega \), X1(ω) = X2(ω). If Y1(ω) = Y2(ω) holds for all \(\omega \in {{{{\mathcal{E}}}}}_{{{{\rm{ID1}}}}}\), then it must also hold for all \(\omega \in {{{{\mathcal{E}}}}}_{{{{\rm{ID2}}}}}\). That is, for \({{{{\mathcal{Y}}}}}_{1}=\{\sigma ({Y}^{{\prime} })| {Y}^{{\prime} }{=}^{a.s.}{Y}_{1},\sigma ({Y}^{{\prime} })\subseteq \sigma ({X}_{1})\}\) and \({{{{\mathcal{Y}}}}}_{2}=\{\sigma ({Y}^{{\prime} })| {Y}^{{\prime} }{=}^{a.s.}{Y}_{2},\sigma ({Y}^{{\prime} })\subseteq \sigma ({X}_{2})\}\), it holds that \({{{{\mathcal{Y}}}}}_{1}\subseteq {{{{\mathcal{Y}}}}}_{2}\). This implies that reducing \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) may introduce shortcuts. Therefore, to mitigate such shortcuts, data providers must appropriately control the probability measure \({\mathbb{P}}\), which amounts to carefully selecting \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) to ensure that Eq. (10) is not satisfied.

  • For data users, by contrast, the ID sample set \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) remains fixed across different probability spaces. However, when applying the same data to different tasks, the OOD sample set \({{{{\mathcal{E}}}}}_{{{{\rm{OOD}}}}}\) changes, implying a change in the sample space Ω. Suppose two probability spaces \(({\Omega }_{1},{{{{\mathcal{F}}}}}_{1},{{\mathbb{P}}}_{1})\) and \(({\Omega }_{2},{{{{\mathcal{F}}}}}_{2},{{\mathbb{P}}}_{2})\) share the same ID sample set \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\), with \({\Omega }_{2}\in {{{{\mathcal{F}}}}}_{1}\) and \({{{{\mathcal{F}}}}}_{2}={{{{\mathcal{F}}}}}_{1}{| }_{{\Omega }_{2}}=\{E\cap {\Omega }_{2}| E\in {{{{\mathcal{F}}}}}_{1}\}\). Let (X1, Y1) and (X2, Y2) be random variables defined on these two respective probability spaces such that \(\forall \omega \in {\Omega }_{2}\), X1(ω) = X2(ω) and \(\forall \omega \in {{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}},{Y}_{1}(\omega )={Y}_{2}(\omega )\). It follows that \(\sigma ({X}_{2})\subseteq \sigma ({X}_{1})\), and thus, for \({{{{\mathcal{Y}}}}}_{1}=\{\sigma ({Y}^{{\prime} })| {Y}^{{\prime} }{=}^{a.s.}{Y}_{1},\sigma ({Y}^{{\prime} })\subseteq \sigma ({X}_{1})\}\) and \({{{{\mathcal{Y}}}}}_{2}=\{\sigma ({Y}^{{\prime} })| {Y}^{{\prime} }{=}^{a.s.}{Y}_{2},\sigma ({Y}^{{\prime} })\subseteq \sigma ({X}_{2})\}\), it holds that \({{{{\mathcal{Y}}}}}_{2}\subseteq {{{{\mathcal{Y}}}}}_{1}\). This suggests that enlarging Ω may also introduce shortcuts. Therefore, to avoid these artifacts, data users must apply the dataset within an appropriately defined task—i.e., choose Ω carefully—to prevent Eq. (6) from being satisfied. In practice, the choice of Ω influences the function J in Eq. (12).

Hence, we can derive Morgan’s Canon of data. For data providers, the hierarchical structure is reflected in data \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\). Specifically, shortcuts at higher levels of dataset \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) do not indicate shortcuts at lower levels. For data users, the hierarchy is reflected in tasks Ω. In particular, shortcuts at lower levels of task Ω do not indicate shortcuts at higher levels.

Taking cat-dog classification as an example, Fig. 7 provides an intuitive illustration of Morgan’s Canon of data. For data providers, consider the random variables (X1, Y1) and (X2, Y2), defined on the probability spaces \(({\Omega }_{1},{{{{\mathcal{F}}}}}_{1},{{\mathbb{P}}}_{1})\) and \(({\Omega }_{1},{{{{\mathcal{F}}}}}_{1},{{\mathbb{P}}}_{2})\), respectively. Although the measurable space \((\Omega,{{{\mathcal{F}}}})\) is shared, variations in the ID sample set \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) lead to the emergence of shortcuts. For data users, consider the random variables (X1, Y1) and (X3, Y3), defined on the probability spaces \(({\Omega }_{1},{{{{\mathcal{F}}}}}_{1},{{\mathbb{P}}}_{1})\) and \(({\Omega }_{2},{{{{\mathcal{F}}}}}_{2},{{\mathbb{P}}}_{3})\), respectively. Even if \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) remains fixed, differences in Ω result in shortcuts.

Fig. 7: An intuitive demonstration of Morgan’s Canon of data in cat-dog classification.

a The boxes represent three probability spaces \(({\Omega }_{1},{{{{\mathcal{F}}}}}_{1},{{\mathbb{P}}}_{1})\), \(({\Omega }_{1},{{{{\mathcal{F}}}}}_{1},{{\mathbb{P}}}_{2})\), and \(({\Omega }_{2},{{{{\mathcal{F}}}}}_{2},{{\mathbb{P}}}_{3})\), where \({\Omega }_{2}\in {{{{\mathcal{F}}}}}_{1}\) and \({{{{\mathcal{F}}}}}_{2}={{{{\mathcal{F}}}}}_{1}{| }_{{\Omega }_{2}}=\{E\cap {\Omega }_{2}| E\in {{{{\mathcal{F}}}}}_{1}\}\). The corresponding ID sample sets within each space are \({{{{\mathcal{E}}}}}_{{{{\rm{ID1}}}}}\), \({{{{\mathcal{E}}}}}_{{{{\rm{ID2}}}}}\), and \({{{{\mathcal{E}}}}}_{{{{\rm{ID3}}}}}\) with \({{{{\mathcal{E}}}}}_{{{{\rm{ID2}}}}}={\Omega }_{1}\), and \({{{{\mathcal{E}}}}}_{{{{\rm{ID1}}}}}={{{{\mathcal{E}}}}}_{{{{\rm{ID3}}}}}={\Omega }_{2}\). In the measurable space \(({\Omega }_{1},{{{{\mathcal{F}}}}}_{1})\), the event representing dogs is defined as Sdog ∪ Tdog, where Sdog and Tdog denote shape and texture events for dogs, respectively. Similarly, the event representing cats is defined as Scat ∪ Tcat, where Scat and Tcat represent shape and texture events for cats. {Sdog ∪ Tdog, Scat ∪ Tcat} constitutes a partitioning of the sample space Ω1. In another measurable space \(({\Omega }_{2},{{{{\mathcal{F}}}}}_{2})\), the event Sdog ∩ Tdog representing dogs comprises the same samples as it does in Ω1. Likewise, the event Scat ∩ Tcat represents cats. {Sdog ∩ Tdog, Scat ∩ Tcat} constitutes a partitioning of the sample space Ω2. b The boxes denote random variables (X1, Y1), (X2, Y2), and (X3, Y3) defined respectively over these three probability spaces, where \(\forall \omega \in {\Omega }_{1}\), X1(ω) = X2(ω), and \(\forall \omega \in {\Omega }_{2}\), X1(ω) = X2(ω) = X3(ω) and Y1(ω) = Y2(ω) = Y3(ω). c An intuitive illustration of the possible partitionings \({{{{\mathcal{Y}}}}}_{1}=\{\sigma ({Y}^{{\prime} })| {Y}^{{\prime} }{=}^{a.s.}{Y}_{1},\sigma ({Y}^{{\prime} })\subseteq \sigma ({X}_{1})\}\), \({{{{\mathcal{Y}}}}}_{2}=\{\sigma ({Y}^{{\prime} })| {Y}^{{\prime} }{=}^{a.s.}{Y}_{2},\sigma ({Y}^{{\prime} })\subseteq \sigma ({X}_{2})\}\), and \({{{{\mathcal{Y}}}}}_{3}=\{\sigma ({Y}^{{\prime} })| {Y}^{{\prime} }{=}^{a.s.}{Y}_{3},\sigma ({Y}^{{\prime} })\subseteq \sigma ({X}_{3})\}\) respectively induced by Y1, Y2, and Y3, where \(| {{{{\mathcal{Y}}}}}_{1}| > 1\), \(| {{{{\mathcal{Y}}}}}_{2}|=| {{{{\mathcal{Y}}}}}_{3}|=1\). \({{{{\mathcal{Y}}}}}_{1}\) and \({{{{\mathcal{Y}}}}}_{2}\) illustrate the impact of \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) on \({{{\mathcal{Y}}}}\) under a fixed \((\Omega,{{{\mathcal{F}}}})\), while \({{{{\mathcal{Y}}}}}_{1}\) and \({{{{\mathcal{Y}}}}}_{3}\) illustrate the effect of Ω on \({{{\mathcal{Y}}}}\) when \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) remains constant.

Feature preferences of models with different inductive biases

As illustrated in Fig. 5, we utilize the HiResCAM56 method to map model features to input data with precision. We analyze features from models including ResNet-5026, ViT-B/1641, RepVGG-A257, Swin-T58, and PViG-S59 using data across three distinct classes. The results indicate that models with analogous inductive biases display remarkably similar features. For instance, CNN-based models like ResNet-50 and RepVGG-A2 show similar features across the classes. In contrast, models with divergent inductive biases exhibit pronounced feature differences. In Class 1, CNN-based models highlight the gap region between two outside relational loops, while ViT-B/16 and Swin-T emphasize the loops themselves rather than the gap. In Class 2, ViT-B/16 distinctly diverges from the other four models, focusing on the contours of the two loops rather than the gap. PViG demonstrates similarities to CNN-based models in both Class 1 and Class 2.

Analyzing model convergence on the topological dataset

Figure 8 shows that, when evaluating the same model with datasets of varying difficulty levels, model accuracy either approximates random guesses (with an accuracy close to \(33.\overline{3}\%\)) or approaches 100%, indicating that models either effectively capture global properties or fall short. Thus, categorizing by difficulty level provides a more precise and clearer assessment of model capability.

Fig. 8: The convergence curves of the ResNet26 models for the topological dataset.

The conditions include different model sizes, dataset difficulty levels, and learning rate parameters.

The analysis of Fig. 8b, a, and e reveals that the ResNet models’ convergence rate is notably sensitive to the learning rate when applied to the topological dataset, under consistent model sizes and dataset difficulty levels. A learning rate of 0.1 ensures model convergence, whereas increasing it to 1 or decreasing it to 0.01 leads to non-convergence. Additionally, an examination of Fig. 8b–d and f–h shows that ResNet models of various sizes can converge with 100% accuracy at a dataset difficulty level of 12, corresponding to a resolution of 125 × 125. However, when the difficulty level rises to 13, with a resolution of 133 × 133, models of different sizes fail to converge, achieving an accuracy rate of ~\(33.\overline{3}\%\), equivalent to random guessing. This indicates that model scale has a marginal impact on convergence on the topological dataset, and the model predominantly fluctuates between two states under these conditions: achieving 100% accuracy or resorting to random guessing.

Experimental setup

The MarkovJunior60 framework is employed to accelerate the generation of topological datasets.

To harness the knowledge of models pre-trained on the ImageNet dataset27, we resized each data instance to a dimension of 224 × 224 pixels with nearest-neighbor interpolation. As the original size of each data instance is smaller than 224 × 224, this scaling method does not result in a loss of detail or cause any changes to the data labels.

To ensure a fair comparison between each model, we utilize the cross-entropy function as the loss function for all models, avoiding any dropout-related techniques61,62. To preserve the topological properties of the dataset, data augmentation is limited to flipping and rotation, without employing label smoothing63.

Due to observations during the experiments indicating that the model either failed to converge or converged to a significantly high loss value on topological datasets (see Fig. 8 and Supplementary Note 2), we employed an optimizer without weight decay for training. Considering the higher convergence ceiling of stochastic gradient descent (SGD), we chose SGD with momentum64 and no weight decay as the training optimizer. A comparative analysis between SGD and Adam65 on the topological datasets is provided in Supplementary Note 3. To facilitate a grid search for the optimal learning rate, we adopt a constant learning rate in our experiments.
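
In PyTorch-style pseudo-configuration, the shared training setup reads as follows; the stand-in model and the exact momentum value are placeholders rather than values reported in this paper.

```python
import torch

model = torch.nn.Linear(10, 3)                 # stand-in for any model in the suite
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,            # constant rate chosen by grid search
                            momentum=0.9,      # assumed value, not reported here
                            weight_decay=0.0)  # no weight decay, as described above
criterion = torch.nn.CrossEntropyLoss()        # shared loss across all models
```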

In the context of neural networks, the resolution of the feature map decreases as the network depth increases, leading to a coarser granularity in the HiResCAM56 method’s visualizations. However, features from the later layers of the neural network tend to be more precise. Thus, we strike a balance by selecting features from intermediate layers where the resolution remains sufficiently high, ensuring detailed visualizations. Consistent with our earlier theoretical formulation, we utilize output soft labels or features from prior layers as network features, ensuring that our evaluation method remains accurate. The specific feature layers chosen for each evaluation model are detailed in Table 2.
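
A minimal HiResCAM-style sketch is shown below, assuming a PyTorch model and a chosen intermediate layer. It follows the published HiResCAM formulation (an elementwise gradient-activation product summed over channels) but is our own simplified rendering, not the exact pipeline used here.

```python
import torch
import torch.nn.functional as F

def hirescam(model, layer, x, class_idx):
    """HiResCAM-style attribution at `layer` for one target class (sketch)."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        score = model(x)[0, class_idx]       # assumes logits of shape (N, classes)
        model.zero_grad()
        score.backward()
    finally:
        h1.remove(); h2.remove()
    # Elementwise product, then sum over channels (no gradient pooling as in Grad-CAM)
    cam = torch.relu((grads["g"] * acts["a"]).sum(dim=1)).detach()
    cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                        mode="bilinear", align_corners=False)
    return cam.squeeze(1)                    # threshold this map to get pixel features
```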

Table 2 Detailed parameter configurations of the models used for SHL

No ethical approval or inclusion considerations were required for this study, as it did not involve human or animal subjects, nor did it use any sensitive or personally identifiable data.