Introduction

Current data-driven artificial intelligence (AI) has achieved remarkable success across various fields1,2,3,4,5,6,7,8, primarily through model training and evaluation using datasets9,10,11,12,13,14,15,16,17. However, most datasets contain inherent biases, leading models to learn and exploit unintended task-correlated features or shortcuts, a phenomenon referred to as shortcut learning18,19,20,21,22. This issue undermines the assessment of AI models’ true capabilities, limits our understanding of their underlying mechanisms, and hinders their explainability and robust deployment in critical areas such as healthcare and autonomous driving. As illustrated in Fig. 1a, both humans and AI models may rely on these unintended features when evaluated using such biased datasets, resulting in biased assessments that reflect models’ preferences rather than their true abilities. Given the importance of trustworthy AI applications, developing a shortcut-free evaluation methodology is vital, yet it poses a considerable challenge.

Fig. 1: Reliable evaluation of the capabilities of AI models on shortcut-free datasets.

a When individuals or AI models learn from datasets containing shortcuts, they may use features other than the intended ones to recognize the same samples, leading to misleading evaluation results. In contrast, when learning from shortcut-free datasets, different individuals or AI models will only use the intended feature to recognize the same samples, thus producing reliable evaluation results. b The curse of shortcuts encompasses two challenges. The first challenge lies in covering all possible shortcut features, as the number of features in high-dimensional data grows exponentially with the data dimensions. The second challenge is in intervening in the covered shortcut features, where the overall label is coupled with local features, making it inevitable that intervening in local features will affect the overall label. c SHL includes a model suite composed of models with different inductive biases and learns the SH of high-dimensional datasets through the intersection of features learned by each model. The more models in the model suite, the more accurate the learning of SH. The diversity in the inductive biases of the models significantly accelerates the learning speed of SH, and directly learning SH avoids intervening in the features of the data, thus addressing both challenges mentioned in (b).

Richard Feynman, in his 1974 lecture “Cargo Cult Science”, highlighted the phenomenon of shortcuts in experimental designs, particularly in psychology and education. He noted that these designs superficially adhere to scientific principles without genuine scientific validation, leading to erroneous conclusions23. To eliminate shortcuts, psychological experiments typically hypothesize all potential shortcut variables and manipulate them to observe their impact18,23. A similar approach has been applied in AI, where researchers manipulate predefined shortcut features to create out-of-distribution (OOD) datasets and use these generalization tests to determine whether models have learned shortcut features18,21,24,25. However, this method only identifies specific shortcuts and fails to diagnose the entire dataset. Unlike traditional disciplines, which often deal with low-dimensional variables, AI faces the challenge of high-dimensional data, a complexity we refer to as the curse of shortcuts. As depicted in Fig. 1b, high-dimensional data exponentially increase the number of features, making it challenging to account for all possible shortcuts. Furthermore, in such complex data, overall labels are often intertwined with local features, making it impossible to intervene in specific shortcut features without affecting the overall label.

In this article, we introduce a paradigm for diagnosing shortcuts in high-dimensional datasets—shortcut hull learning (SHL)—which effectively addresses the curse of shortcuts. Building on SHL, we design a shortcut-free evaluation framework (SFEF). By formalizing a unified representation theory of data shortcuts within a probability space, we define a fundamental indicator for determining whether a dataset contains shortcuts, termed the shortcut hull (SH)—the minimal set of shortcut features, as shown in Fig. 1c. Inspired by the existing dataset suites26,27 used to evaluate AI models, SHL incorporates a model suite composed of models with different inductive biases21,24,25,28,29,30,31 and employs a collaborative mechanism to learn the SH of high-dimensional datasets. This approach facilitates efficient and direct learning of the SH, enabling the diagnosis of dataset shortcuts and circumventing the curse of shortcuts that afflicts conventional coverage and intervention approaches.

To validate SHL and SFEF, we apply them to study global topological perceptual capabilities, which are fundamental to biological visual perception32,33,34. Notably, Minsky and Papert previously used these capabilities to examine the expressive power of neural networks35,36,37, emphasizing their importance in both understanding biological cognition and analyzing AI models. Our experiments demonstrate that SHL quickly and accurately identifies inherent shortcuts in topological datasets representing global capabilities. Using SFEF, we successfully construct a shortcut-free topological dataset and apply it to evaluate the global capabilities of mainstream deep neural networks (DNNs). Previous studies on global capabilities failed to eliminate local shortcuts, thus inherently studying models’ preferences for global versus local features. These studies concluded that convolutional neural network (CNN)-based models21,24,28,38,39 are inferior to Transformer-based models25,29,30,31,40,41,42 in global capability and that DNNs are less effective than humans at recognizing global properties14,15,16. However, using SFEF, we arrive at significantly different and more reliable conclusions: CNN-based models outperform Transformer-based models in recognizing global properties, and all DNNs surpass human capabilities, challenging previous understandings of DNNs’ abilities. More importantly, our findings confirm a critical point: models’ learning preferences do not represent their learning capabilities. Moreover, our constructed topological dataset enables a shift from Minsky and Papert’s representational analysis of neural networks on the connectedness predicate35 to an empirical investigation of their learning capacity.

Overall, SHL addresses the fundamental bottleneck in eliminating shortcuts from high-dimensional datasets, and the SFEF establishes a foundational framework for unbiased evaluation of AI models. Future development of more shortcut-free datasets using this approach promises to deepen our understanding of AI, offering new insights for research and potentially setting the groundwork for fairer comparisons between human and AI capabilities.

Results

Probabilistic formulation of data shortcuts

Data shortcuts arise from the inherent nature of the data itself and are thus independent of any specific representation; however, the same data can often be represented in various ways. For example, when data are represented as real-valued vectors, the number and semantics of dimensions can vary significantly across representations, making alignment between them challenging. This misalignment complicates the analysis of data shortcuts. To address this issue, we employ formal methods from probability theory. In probability theory, a random variable is a mapping from a sample space to the real number space, where the sample space can be considered as the space representing the data itself. Different random variables can then be viewed as different representations. Through this approach, we can formalize a unified shortcut representation in probability space, independent of the specific representation of the data.

Considering a classification problem, let \((\Omega,{{{\mathcal{F}}}},{\mathbb{P}})\) denote a probability space, where Ω denotes the sample space, \({{{\mathcal{F}}}}\) is a σ-algebra of events, and \({\mathbb{P}}\) is a probability measure on \({{{\mathcal{F}}}}\). The joint random variable of input and label is represented by the mapping \((X,Y):\Omega \to {{\mathbb{R}}}^{n}\times {\{0,1\}}^{c}\). The training event is defined as

$$\{\omega \in \Omega \,|\, X(\omega )\in B,Y(\omega )\in P\}\in {{{\mathcal{F}}}},\quad \forall B\in {{{\mathcal{B}}}}({{\mathbb{R}}}^{n}),\forall P\in {2}^{{\{0,1\}}^{c}},$$
(1)

where \({{{\mathcal{B}}}}({{\mathbb{R}}}^{n})\) denotes the Borel σ-algebra on \({{\mathbb{R}}}^{n}\), and \({2}^{{\{0,1\}}^{c}}\) denotes the power set of {0, 1}c. The probability of each training event is given by

$${{\mathbb{P}}}_{X,Y}(B,P)={\mathbb{P}}(\{\omega \in \Omega \,|\, X(\omega )\in B,Y(\omega )\in P\}),\quad \forall B\in {{{\mathcal{B}}}}({{\mathbb{R}}}^{n}),\forall P\in {2}^{{\{0,1\}}^{c}},$$
(2)

where \({{\mathbb{P}}}_{X,Y}\) is the joint probability distribution of (X, Y). The information contained in the input random variable X is represented by the σ-algebra generated by X:

$$\sigma (X)=\{E\subseteq \Omega \,|\, E={X}^{-1}(B),\forall B\in {{{\mathcal{B}}}}({{\mathbb{R}}}^{n})\}\subseteq {{{\mathcal{F}}}},$$
(3)

which includes all events of interest in \({{{\mathcal{F}}}}\). It should be noted that, given a random variable X, there may exist \({X}^{{\prime} }\) with \({X}^{{\prime} }{=}^{a.s.}X\) but \(\sigma ({X}^{{\prime} })\,\ne\, \sigma (X)\). This is because possibly \(\exists \omega \in (\Omega \setminus {{{\rm{supp}}}}({\mathbb{P}})),X(\omega )\,\ne\, {X}^{{\prime} }(\omega )\), where \({{{\rm{supp}}}}({\mathbb{P}})\) denotes the support of \({\mathbb{P}}\). Similarly, the label information is defined as the collection of all events in the label random variable Y, represented by the σ-algebra generated by Y:

$$\sigma (Y)=\{E\subseteq \Omega \,|\, E={Y}^{-1}(P),\forall P\in {2}^{{\{0,1\}}^{c}}\}\subseteq {{{\mathcal{F}}}}.$$
(4)

Since there are c categories in a classification context, the label \(Y={\{{{\mathbb{I}}}_{{A}_{j}}\}}_{j=1}^{c}\), where \({{\mathbb{I}}}_{{A}_{j}}\) is the indicator function of the set Aj, and the finite set \(C={\{{A}_{j}\}}_{j=1}^{c}\subseteq {{{\mathcal{F}}}}\) partitions the sample space Ω. Thus, the label Y acts as a classifier, linking each dimension of Y to the label of the corresponding event Aj43. Hence,

$$\sigma (Y)={2}^{C},$$
(5)

where 2C represents the power set of C.

If the label Y is learnable from the input X, this implies that all the information in Y is also contained in X, i.e., \(\sigma (Y)\subseteq \sigma (X)\). In this case, there exists a Borel measurable function f such that σ(f(X)) = σ(Y). Let \({{{\mathcal{Y}}}}=\{\sigma ({Y}^{{\prime} })| {Y}^{{\prime} }{=}^{a.s.}Y,\sigma ({Y}^{{\prime} })\subseteq \sigma (X)\}\) denote the collection of all possible partitionings of the sample space Ω induced by the label random variable Y. We denote the intended partitioning of Ω by \(\sigma ({Y}_{{{{\rm{Int}}}}})\in {{{\mathcal{Y}}}}\). The essence of shortcuts in data is that the data distribution \({{\mathbb{P}}}_{X,Y}\) admits solutions beyond the intended one. Formally, when the data distribution \({{\mathbb{P}}}_{X,Y}\) exhibits shortcuts, it holds that \(| {{{\mathcal{Y}}}}| > 1\), i.e.,

$$\exists \sigma (\,\,f(X))\in {{{\mathcal{Y}}}},\quad \sigma (f(X))\, \ne\, \sigma ({Y}_{{{{\rm{Int}}}}}).$$
(6)
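
To make Eq. (6) concrete, consider a minimal toy construction of our own (not one of the paper’s experiments): a four-outcome sample space whose inputs carry a shape feature and a texture feature, with the intended label given by shape. Because the support contains only outcomes where texture agrees with shape, a texture-based rule is also almost surely correct, so \(| {{{\mathcal{Y}}}}| > 1\) and the distribution exhibits a shortcut.

```python
from itertools import product

# Toy illustration of Eq. (6); the encoding is an assumption for illustration.
# Each outcome w = (shape, texture) has two binary features; the intended label
# Y_Int is the shape. The support E_ID contains only outcomes where texture
# agrees with shape, so a texture-based rule is also almost surely correct.

omega = [w for w in product([0, 1], repeat=2)]    # sample space: (shape, texture)
support = [(0, 0), (1, 1)]                        # E_ID: texture == shape
y_int = lambda w: w[0]                            # intended solution: shape

candidates = [lambda w: w[0],                     # shape-based rule
              lambda w: w[1],                     # texture-based rule (shortcut)
              lambda w: w[0] ^ w[1]]              # fails even on the support

partitions = set()
for f in candidates:
    if all(f(w) == y_int(w) for w in support):    # f(X) = Y almost surely
        partitions.add(frozenset(frozenset(w for w in omega if f(w) == y)
                                 for y in (0, 1)))

print(len(partitions))   # prints 2: two distinct partitionings of Omega
```

The shape-based and texture-based rules agree on the support but induce different partitionings of Ω, which is precisely the condition \(| {{{\mathcal{Y}}}}| > 1\).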

Theory for diagnosing data shortcuts with shortcut hull learning

For a Borel measurable function f, if \(f(X){=}^{a.s.}Y\), i.e., \({{\mathbb{P}}}_{X,Y}[\,\,f(X)\,\ne\, Y]=0\), then \(\sigma (f(X))\in {{{\mathcal{Y}}}}\). In practice, neural networks can be employed to learn the function f. Given a dataset \({{{\mathcal{D}}}}={{\mathbb{P}}}_{X,Y}^{d}\), classification tasks typically divide \({{{\mathcal{D}}}}\) into a training set \({{{{\mathcal{D}}}}}_{{{{\rm{train}}}}}\) and a test set \({{{{\mathcal{D}}}}}_{{{{\rm{test}}}}}\). If \(\frac{| \{(x,y)\in {{{{\mathcal{D}}}}}_{{{{\rm{test}}}}}| f(x)\ne y\}| }{| {{{{\mathcal{D}}}}}_{{{{\rm{test}}}}}| }=0\), it indicates that f has learned the data distribution \({{\mathbb{P}}}_{X,Y}\). Assuming a neural network has m layers, with the first i layers denoted by gi, then f = gm. It can be posited that each layer of the neural network performs feature extraction, resulting in the relationship \(\sigma (X)\supseteq \sigma ({g}_{1}(X))\supseteq \cdots \supseteq \sigma ({g}_{m}(X))\) by the Doob–Dynkin lemma44. When interpreting neural networks, it is common to avoid treating thresholded output labels as output features due to the potentially significant information loss. Instead, we use the soft-label outputs or features from earlier layers as the task-correlated features learned by the neural networks, denoted as g(X), implying that \(\sigma (g(X))\supseteq \sigma (f(X))\). Multiple neural networks can be employed to learn the intrinsic features of the dataset, i.e., \(\sigma (f(X))={\bigcap }_{g| \sigma (g(X))\supseteq \sigma (f(X))}\sigma (g(X))\). By equivalently rewriting Eq. (6) as \({\bigcap }_{f| {{\mathbb{P}}}_{X,Y}[f(X)\ne Y]=0}\sigma (f(X))\,\subsetneq\, \sigma ({Y}_{{{{\rm{Int}}}}})\), we obtain

$${\bigcap}_{g| \sigma (g(X))\supseteq \sigma (f(X)),{{\mathbb{P}}}_{X,Y}[f(X)\ne Y]=0}\sigma (g(X))\,\subsetneq\, \sigma ({Y}_{{{{\rm{Int}}}}}).$$
(7)

We refer to Eq. (7) as the SH, which intuitively represents the minimal shortcut. As the number of models increases, it converges exponentially (see Supplementary Note 1 for proof). This approach of using multiple models to learn the SH is termed SHL. To map the abstract SHL in the probability space onto real-valued features that are both observable and operable, and to represent features from different models within the same space, we associate the neural network features with the input X. Consequently, Eq. (7) can be reformulated as follows:

$$\exists \omega \in {{{\rm{supp}}}}({\mathbb{P}}),\quad {\bigcap}_{I:\Omega \to {2}^{{\{i\}}_{i=1}^{n}}| \sigma ({X}_{I})\supseteq \sigma (f(X)),{{\mathbb{P}}}_{X,Y}[f(X)\ne Y]=0}J(I(\omega ))\,\subsetneq\, J({I}_{{{{\rm{Int}}}}}(\omega )),$$
(8)

where \(I:\Omega \to {2}^{{\{i\}}_{i=1}^{n}}\) denotes a random subset index mapping of X. For any \(I:\Omega \to {2}^{{\{i\}}_{i=1}^{n}}\) and ω ∈ Ω, \(J(I(\omega ))=\{{\bigcup }_{{I}^{{\prime\prime} }(\omega )\subseteq {\{i\}}_{i=1}^{n}| {X}_{{I}^{{\prime\prime} }}^{-1}({X}_{{I}^{{\prime\prime} }}(\omega ))={X}_{{I}^{{\prime} }}^{-1}({X}_{{I}^{{\prime} }}(\omega ))}\{{I}^{{\prime\prime} }(\omega )\}| {I}^{{\prime} }(\omega )\subseteq I(\omega )\}\). A detailed derivation of Eq. (8) can be found in the Methods section.

Shortcut-free evaluation framework

To reliably evaluate a model’s performance using shortcut-free datasets, we propose an SFEF based on the SHL. This framework consists of five steps, as illustrated in Fig. 2. Step 1 outlines the required structure of the dataset. Steps 2 to 4 demonstrate the use of a model suite to diagnose dataset shortcuts, and Step 5 shows the reliable evaluation of models using a shortcut-free dataset.

Fig. 2: Illustration of the proposed shortcut-free evaluation framework.

Step 1. Construct a dataset targeted at a specific capability, with the dataset divided into multiple difficulty levels. The data distribution across these levels differs only in terms of difficulty. The lowest level, level 0, is also known as the diagnostic level, while the other levels are collectively referred to as evaluation levels. For instance, we consider a topological dataset designed to assess global capabilities, with further details provided in the section Design of the shortcut-free topological dataset. Step 2. Construct a model suite comprising models with various inductive biases. Step 3. Train the models in the suite using the diagnostic level data from Step 1, then select the models that can completely learn the data distribution. For classification tasks, this means achieving 100% accuracy on the test set. Step 4. Diagnose the shortcuts present in the dataset using the models selected in Step 3 via SHL. Here, the symbol  ∩ denotes the application of Eq. (8) at different hierarchies in data, and the red numbers indicate the total count of pixel position features, facilitating the diagnosis of whether the data can be used for global capability evaluation. The hierarchical structure of shortcuts is detailed in the section Diagnosing the topological dataset using shortcut hull learning. Step 5. Evaluate the models' capabilities using the datasets from the different difficulty levels established in Step 1.

In Step 1, we construct an evaluation dataset tailored to assess specific model capabilities. This dataset is structured into multiple levels of difficulty, where the distribution of data differs only in difficulty. The rationale for the multi-level difficulty approach is threefold:

  • Learning data distribution: As indicated by Eq. (7), the model must fully learn the data distribution, necessitating a low-difficulty dataset that adheres to an identical distribution.

  • Model capability assessment: If the data distribution is shortcut-free, the model’s accuracy on datasets of the same difficulty level will either be random or 100%, indicating whether the model has successfully captured the capability or failed to do so. The multi-level difficulty approach offers a clearer and more precise assessment of the model’s capability.

  • Data difficulty expansion: The multi-level difficulty approach also facilitates easier expansion of the dataset’s difficulty range.

In Step 2, considering that models with different inductive biases may favor different solutions for the same data29, we construct a suite of models with various inductive biases. Each model in the suite is trained on the lowest difficulty level dataset, referred to as the diagnostic level, to learn the data distribution. Step 3 involves selecting models from the suite that have fully learned the data distribution, such as those achieving 100% accuracy on a classification task’s test set. In Step 4, these selected models are then employed to diagnose shortcuts in the diagnostic level data using SHL. Similar to how a model’s performance is evaluated using diverse datasets to assess generalization, increasing the number and variety of models enhances SHL’s accuracy in diagnosing dataset shortcuts. Since all difficulty levels in Step 1 share the same distribution, the presence of shortcuts in the diagnostic level implies the existence of shortcuts across all difficulty levels. Finally, in Step 5, we use the shortcut-free datasets, identified in Step 4, across varying difficulty levels, to evaluate the capability of the models. This ensures a reliable assessment, free from the confounding effects of data shortcuts.
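
The five steps can be condensed into the following sketch; every name here (train, accuracy, shortcut_hull, and the structure of levels) is a hypothetical placeholder rather than an interface from this work:

```python
# A minimal sketch of the SFEF pipeline; all callables are user-supplied
# placeholders, not an implementation released with this paper.

def shortcut_free_evaluation(levels, model_suite, train, accuracy, shortcut_hull):
    """levels[0] is the diagnostic level; all levels share one distribution."""
    diagnostic = levels[0]                                   # Step 1

    learned = []
    for model in model_suite:                                # Step 2: diverse biases
        train(model, diagnostic)
        if accuracy(model, diagnostic) == 1.0:               # Step 3: perfect learners
            learned.append(model)

    if not shortcut_hull(learned, diagnostic).is_shortcut_free:  # Step 4: SHL, Eq. (8)
        raise ValueError("dataset contains shortcuts; repair the data first")

    scores = {}
    for level in levels:                                     # Step 5: reliable evaluation
        level_scores = []
        for model in model_suite:
            train(model, level)
            level_scores.append(accuracy(model, level))
        scores[level] = level_scores
    return scores
```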

Design of the shortcut-free topological dataset

To illustrate the efficacy of the proposed SHL and SFEF, we design a dataset that captures the global properties of visual perception without local shortcuts. The global nature of visual perception plays a pivotal role for both humans and AI. For humans, global features allow for rapid object detection without the need for an in-depth analysis of local characteristics, a capability crucial for survival in natural environments. Given the pronounced sensitivity of the human visual system to global features, it is posited that in visual processing, global features are processed first, guiding the interpretation of local features32,33. Similarly, in AI, the importance of the global connectedness predicate was underscored by Minsky and Papert during the early development of neural networks35,36. In contemporary large-scale models45,46, the attention mechanism40 is primarily designed to endow the model with the ability to directly perceive global information.

Global properties of visual perception can be described in terms of topological properties, which remain invariant under continuous transformations such as stretching, twisting, crumpling, and bending. These properties are fundamental to geometric objects and remain unchanged under topological transformations. When considering two-dimensional manifolds projected onto the retina, these invariant properties can be categorized into three main topological facets: connectivity, the count of holes, and inside/outside relationships32,33. While topological properties are not directly identifiable by local features, there is a risk of spurious correlation with local features during dataset construction. To address this, we design a visual topological synthesis dataset, which is crafted such that only minor and localized changes are needed to modify the topological properties of the images, without impacting the local statistics. This generation strategy enables the verification and control of local shortcuts.

In this visual topological dataset, each image consists of several closed loops, where images with different numbers of closed loops are topologically distinct. Moreover, closed loops of the same quantity may differ in their inside/outside relationships, further contributing to their topological uniqueness. In the synthesis process, images with a higher number of closed loops are generated from those with fewer closed loops through minor transformations. For simplicity, we assume the dataset contains images with either 1 or 2 closed loops, although the method can be extended to images with more loops. The synthesis process is shown in Fig. 3, where Fig. 3a–c represent different stages of synthesis, respectively. As depicted in Fig. 3a, we begin by initializing a graph G with n × n nodes, corresponding to the original image resolution of (4n + 1) × (4n + 1). Each node exclusively connects to adjacent nodes in four cardinal directions: top, bottom, left, and right. Every node represents a 3 × 3 pixel closed-loop node from the original image, with adjacent closed-loop nodes separated by one pixel.

Fig. 3: Synthesis process of visual topological dataset.

a The fine-scale synthesis process reveals the structure of the data, the initialization procedure, and the manner of a single-node step traversal. b The medium-scale synthesis process demonstrates the process of node random walk. c The coarse-scale synthesis process depicts the full generation process of a single closed loop. d Upon minor local modifications, a single closed loop gives rise to two types of closed loops, differentiated by outside/inside relations.

To illustrate the generation of a visual topological dataset, we consider a graph with a resolution of 29 × 29, i.e., 7 × 7 nodes, as an example. To create a random closed loop, Wilson’s algorithm47 is employed to generate a uniform spanning tree of the graph G. We start by randomly selecting a vertex of graph G to form a single-vertex tree T (the white closed-loop vertex of Fig. 3a), representing a closed-loop vertex in the original image. Another vertex v, chosen randomly and not in the tree T (the red closed-loop vertex in Fig. 3a), undergoes a loop-erased random walk from v until it reaches a vertex in tree T, at which point the resulting path is appended to T. Upon connecting two closed-loop vertices, the corresponding closed loops merge into a single closed loop. The process of loop-erased random walk, illustrated in Fig. 3b, starts from vertex v and at each step moves to a randomly chosen adjacent vertex, merging the two corresponding closed loops, until it reaches a vertex in T. If the random walk revisits its own path, forming a loop, the loop is erased from the path before the walk continues. Figure 3c illustrates a complete process of generating a single closed loop, resulting in a uniform spanning tree after all vertices have been included.
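
A minimal sketch of this loop-generation step is given below, assuming a pure-Python grid-graph encoding of our own; only the algorithmic steps follow Wilson’s algorithm47 as just described.

```python
import random

def wilson_uniform_spanning_tree(n, seed=None):
    """Wilson's algorithm on the n x n grid graph (illustrative sketch).

    Returns the edge list of a uniform spanning tree. In the paper's image
    encoding, drawing this tree with 3 x 3 pixel nodes and 1-pixel connecting
    edges produces a single random closed loop as the tree's outline.
    """
    rng = random.Random(seed)
    nodes = [(r, c) for r in range(n) for c in range(n)]

    def neighbors(v):
        r, c = v
        return [(r + dr, c + dc)
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if 0 <= r + dr < n and 0 <= c + dc < n]

    in_tree = {rng.choice(nodes)}            # the initial single-vertex tree T
    edges = []
    for v in nodes:
        if v in in_tree:
            continue
        path = [v]                           # loop-erased random walk from v
        while path[-1] not in in_tree:
            nxt = rng.choice(neighbors(path[-1]))
            if nxt in path:                  # the walk closed a loop: erase it
                path = path[:path.index(nxt) + 1]
            else:
                path.append(nxt)
        edges += list(zip(path, path[1:]))   # append the loop-erased path to T
        in_tree.update(path)
    return edges                             # exactly n*n - 1 edges
```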

Figure 3d shows the generation of two closed loops from a single closed loop with minor local changes, allowing for the randomization of their inside/outside relationships. By modifying only the elements within a 3 × 3 pixel region, the topological properties of the image can be altered without affecting its local statistics. Since the generated closed loop corresponds to the spanning tree T, which is a minimal connected subgraph of the graph G, every edge of T is a cut edge. Removing any edge from T results in two connected subgraphs, thereby generating the two closed loops that represent the outside relation.

However, owing to the way the closed loop is generated, the local features of the regions inside and outside the loop are identical. Therefore, the connecting edges outside the closed loop can be severed in the same manner as those inside it. This process yields two closed loops corresponding to the inside relation depicted in Fig. 3d. When partitioning the outside of a closed loop, there is only one possible outside pathway, so a closed loop inevitably forms inside. Initially, this outside pathway itself constitutes a closed loop, but the location of the closure changes during the process.

In this dataset, we categorize images into three distinct topological classes: a single connected region, two connected regions with an outside relation, and two connected regions with an inside relation. To ensure consistent difficulty in data generation across the same difficulty level (i.e., resolution), we pre-generate 1,000,000 samples for both the second and third data classes at each difficulty level. From these, we select the top 10,000 samples with the largest adjacent regions between two topological loops. These selected samples are then split into 8000 training samples and 2000 testing samples. Consequently, each difficulty level contains 30,000 samples in total. The full dataset is available at Figshare48.
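
For verification, the class of a generated image can also be recovered from the image alone via connected-component analysis. The checker below is a hypothetical sketch of ours, not the generator’s labeling code; it assumes binary images whose thick loop curves are 4-connected, as produced by the synthesis above.

```python
import numpy as np
from scipy import ndimage

def classify_topology(img):
    """Assign one of the three topological classes to a binary image.

    img: 2-D array with 1 on loop pixels and 0 on background. Returns
    0 (one loop), 1 (two loops, outside relation), 2 (two loops, inside relation).
    """
    loops, n_loops = ndimage.label(img)              # connected loop curves
    if n_loops == 1:
        return 0
    bg, _ = ndimage.label(img == 0)                  # background regions
    border = set(bg[0, :]) | set(bg[-1, :]) | set(bg[:, 0]) | set(bg[:, -1])
    outside = np.isin(bg, [i for i in border if i != 0])
    near_outside = ndimage.binary_dilation(outside)  # pixels touching the outside
    for k in range(1, n_loops + 1):
        if not np.any(near_outside & (loops == k)):  # loop k never meets the outside
            return 2                                 # nested loops: inside relation
    return 1                                         # side-by-side: outside relation
```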

Statistical characteristics of the topological dataset

The purpose of the topological dataset is to identify global topological features. Hence, we analyze potential statistical correlations in the local properties of images from different classes. In the topological data generation process, images of different classes are produced by modifying pixel values within a local 3 × 3 range of images from the same class. Our goal is to ascertain the impact of these local alterations on the statistical characteristics of the surrounding region. Therefore, we statistically assess the presence of class-distinguishable local features at various n × n pixel scales. In the generation process shown in Fig. 3d, it is evident that the statistical characteristics of 1 × 1 local patterns are consistent across different classes.

As depicted in Fig. 4a, all possible 3 × 3 local patterns are enumerated. While there are \({2}^{3\times 3}=512\) potential patterns for 3 × 3 pixels, not all of them manifest in this configuration. Additionally, certain patterns exhibit strong correlations; for instance, rotations and reflections of a single pattern can produce up to 8 variants. After excluding non-occurring and strongly correlated patterns, we present the remaining 3 × 3 local patterns and demonstrate their statistical characteristics within a 29 × 29-sized dataset through template matching.

Fig. 4: Diagnosis of the topological dataset’s shortcuts.

a Statistical characteristics of a 3 × 3 template in a 29 × 29-sized dataset. b The proportion of distinguishable classes under different template sizes in the 29 × 29 dataset. The horizontal axis represents the size of square templates, such as 1 × 1, 2 × 2, 3 × 3, etc. c Diagnosis based on the SHL by models with different inductive biases, including ResNet-5026, ViT-B/1641, RepVGG-A257, Swin-T58, PViG-S59, ResNeXt-5068, Inception-V363, ConvMixer-1024/1069, EfficientNet-B470, RegNetX-4.0GF71, and SE-ResNet-5072. The bar chart displays the count of topological dataset features for different classes within each model. The line chart indicates the count of common global features across different models, with each point on the horizontal axis representing common features for all models on the left. Error bars in both charts indicate standard deviation, capturing the variability across different samples within the dataset.

In Fig. 4b, we illustrate the proportion of distinguishable categories within a 29 × 29-sized dataset at different template scales. Similar to the 3 × 3 matching templates portrayed in Fig. 4a, templates of varying sizes are used for image matching. For each image, we search for a template of the corresponding size and match it against templates from the other two categories. If a matching template is found in the other two classes, the image is statistically indistinguishable at that scale. Conversely, the absence of a matching template indicates statistical distinguishability. The results in Fig. 4b reveal that when the template size reaches 10 × 10, only about 1% of the dataset becomes distinguishable by class. Hence, we conclude that categorizing data based on minor local changes does not introduce local statistical correlations. Due to the substantial computational demands of brute-force template matching, our analysis is limited to a 10 × 10 scale. Nonetheless, the results are conclusive for our purposes.
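
The brute-force procedure can be sketched as follows; the patch-set encoding and function names are our own, while the distinguishability criterion mirrors the description above.

```python
import numpy as np

def patches(img, k):
    """The set of distinct k x k patches of one binary image, as byte strings."""
    H, W = img.shape
    return {img[r:r + k, c:c + k].tobytes()
            for r in range(H - k + 1) for c in range(W - k + 1)}

def distinguishable_fraction(classes, k):
    """Fraction of images holding a k x k patch absent from all other classes.

    classes: a list of lists of 2-D uint8 arrays, one inner list per class.
    """
    pools = [set().union(*(patches(im, k) for im in imgs)) for imgs in classes]
    hits = total = 0
    for ci, imgs in enumerate(classes):
        others = set().union(*(pools[cj] for cj in range(len(classes)) if cj != ci))
        for im in imgs:
            total += 1
            hits += bool(patches(im, k) - others)  # some patch unique to this class
    return hits / total
```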

Diagnosing the topological dataset using shortcut hull learning

Through prior statistical analysis of the topological dataset, we demonstrate that the designed dataset is shortcut-free concerning global properties. Below, we show how the SHL diagnosis verifies the shortcut-free nature, as established by the earlier statistical analysis. Furthermore, by altering Ω, we illustrate how SHL diagnosis can reveal different hierarchies of shortcuts within the topological dataset.

To assess whether the global properties exhibit shortcuts related to local properties, we focus specifically on two types of features: global and local. This is consistent with the feature types analyzed in the earlier statistical analysis. In Eq. (8), a threshold value (thresh) is used to demarcate global and local characteristics. When \(| I(\omega )| < {{{\rm{thresh}}}}\), \(J(I(\omega ))=\{\{{I}^{{\prime} }(\omega )\subseteq {\{i\}}_{i=1}^{n}| | {I}^{{\prime} }(\omega )| < {{{\rm{thresh}}}}\}\}\), yielding \(| J(I(\omega ))|=1\). Conversely, when \(| I(\omega )| \ge {{{\rm{thresh}}}}\), \(J(I(\omega ))=\{\{{I}^{{\prime} }(\omega )\subseteq {\{i\}}_{i=1}^{n}| | {I}^{{\prime} }(\omega )| < {{{\rm{thresh}}}}\},\{{I}^{{\prime} }(\omega )\subseteq {\{i\}}_{i=1}^{n}| | {I}^{{\prime} }(\omega )| \, \ge \, {{{\rm{thresh}}}}\}\}\), and thus \(| J(I(\omega ))|=2\). As illustrated by the red numbers in Fig. 5 across different models, the values indicate the allowable range for the thresh parameter. For example, in the Class 1 case, ResNet-50 has 355 features of pixel position, implying that thresh ≤ 355. After validating all five models, the minimal number of features across them is 343, and so the threshold is constrained to thresh ≤ 343. This threshold allows us to test the global properties of the model below the 343-pixel region. It is important to note that this result is based on a single data sample. In the bar charts in Fig. 4c, we present the diagnostic results of global properties across the entire topological dataset and multiple models.

Fig. 5: An example of the shortcut hull learning.

For each data class, we select a representative image as an example. We then showcase the features of this image as processed by models with different inductive biases, including ResNet-5026, ViT-B/1641, RepVGG-A257, Swin-T58, and PViG-S59. These features are projected onto the input image using the HiResCAM56 method. After thresholding, we obtain features corresponding to the pixel positions of the image. The common features across different models are then computed using Eq. (8). Beneath each feature, the red numbers indicate the total count of pixel position features.

Previously, we diagnosed whether global properties exhibit locality. Next, we demonstrate how to diagnose shortcuts involving different global properties. In this case, we not only consider local and global features, but also distinguish between different global features. Under this setting, for any given I(ω), \(| J(I(\omega ))|={2}^{| I(\omega )| }\). As illustrated in Fig. 5, for the Class 1 example, after validation across all five models, the number of shared features is 37. This allows testing of global properties that are consistent across models below the 37-pixel region. The line chart in Fig. 4c displays the diagnostic results for common global properties across the complete topological dataset and multiple models.
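
Given per-model boolean masks of thresholded attribution maps (as in Fig. 5), both hierarchies reduce to simple set arithmetic. The snippet below illustrates the counting only and assumes such masks are already computed; it is not the exact diagnostic code.

```python
import numpy as np

def shl_counts(masks):
    """masks: dict mapping model name to a boolean pixel-position array."""
    per_model = {name: int(m.sum()) for name, m in masks.items()}
    common = np.logical_and.reduce(list(masks.values()))   # intersection, Eq. (8)
    return per_model, int(common.sum())

# Per-model counts bound the local/global threshold: thresh <= min(per_model.values());
# the intersection count is the shared-feature number of the cross-global hierarchy.
```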

Figure 4c presents statistical data on the number of features in models with different inductive biases. To ensure a fair comparison, these models have comparable parameter counts and computational costs. The bar and line charts corroborate our previous statistical analysis, showing that the local-global hierarchy is shortcut-free, as indicated by the relatively high minimum number of features for different models. This demonstrates the effectiveness of the SHL diagnostic paradigm. However, when examining the cross-global hierarchy, the line chart shows that although the number of features for different models remains high, the number of common features decreases, with the trend gradually flattening as more models are added. This suggests that SHL quickly converges when shortcuts are present.

Evaluating models on the shortcut-free topological dataset

We assess global properties of models using a standard topological dataset structured by varying difficulty levels, with each level defined by a graph G initialized with n × n nodes, where \(n\in {{\mathbb{N}}}^{+}\) represents the difficulty associated with image resolution (4n + 1) × (4n + 1). The difficulty level is defined as \(\frac{n-7}{2}\), setting a resolution of 29 × 29 as difficulty level 0. We begin with the lowest difficulty data, which ensures shortcut-free global properties, before assessing models with data from higher difficulty levels.
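
The level-resolution correspondence is a direct calculation; the small helper below (our own) reproduces the resolutions quoted in this section.

```python
def resolution(level):
    """Image resolution (4n + 1) for difficulty level (n - 7) / 2."""
    n = 2 * level + 7                   # grid nodes per side
    return 4 * n + 1

assert resolution(0) == 29              # the diagnostic level
assert resolution(12) == 125 and resolution(13) == 133
```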

The experimental results are summarized in Table 1. A comparison between ResNet and ViT models reveals that, when parameters and computational costs are comparable, CNNs, designed based on local operations, significantly outperform ViT models, which are intended for global objectives, in recognizing global properties. The Swin Transformer model, which integrates design elements from both CNNs and ViT, exhibits capability comparable to CNNs in recognizing global properties, while MLP-based models fall between CNNs and ViT. These findings challenge previous assumptions about the capabilities of DNNs14,15,16,24,30,41,42. Moreover, with an increase in the number of parameters and computational load, ResNet performance remains relatively stable, while the Swin Transformer exhibits minor degradation. This suggests that a larger model does not necessarily perform better in recognizing global topological properties. Additional experimental details can be found in Supplementary Note 2.

Table 1 Evaluation of global capabilities of various models on topological datasets

Minsky and Papert initially explored the representational capacity of feedforward neural networks by examining their ability to encode topological properties such as connectivity35. In contrast, our proposed shortcut-free topological dataset shifts the focus toward learning capability, offering a more practical and comprehensive method for evaluating how neural networks acquire and generalize topological information. Our experimental results show that the complexity that current feedforward neural networks can recognize far exceeds Minsky and Papert’s initial expectations (see Supplementary Note 4). Furthermore, the complexity that current feedforward neural networks can recognize surpasses that of the data typically used in human psychological experiments32,33, indicating to some extent that these networks have a much greater ability to recognize global properties than humans.

Discussion

This study introduces SHL, a diagnostic paradigm that formalizes shortcut representations within a unified probability space and leverages diverse models with varying inductive biases to effectively uncover and analyze shortcuts in data. Building upon SHL, we propose the SFEF—a shortcut-free evaluation framework validated through the construction of a controlled, topologically grounded dataset aimed at assessing global recognition abilities in DNNs.

Our contributions hold broad implications for both AI development and interdisciplinary work comparing humans and AI. By revealing the discrepancy between perceived and actual model capabilities—for instance, the surprising underperformance of Transformers in global tasks—our findings urge a reassessment of prevailing assumptions in AI cognition. Notably, the superior performance of CNNs in global reasoning challenges the dominant narrative surrounding Transformer architectures, suggesting that architectural biases and training dynamics interact in more nuanced ways than previously assumed.

This work also draws attention to methodological mismatches in human–AI comparative studies10,13,16,49,50,51,52. While shortcut elimination is common in biological cognition experiments, similar strategies do not always translate effectively to AI contexts. Our dataset-centered approach offers a paradigm-specific diagnostic tool tailored to AI’s learning dynamics, enhancing the validity of such comparisons and paving the way for more robust interdisciplinary research.

In contrast to existing mitigation techniques that modify model architectures or training procedures, our approach focuses on removing shortcuts at the dataset level, offering strong generalizability across models. This enables consistent and fair evaluation of learning abilities, rather than learning preferences. By constructing datasets that remove confounding correlations, we shift the emphasis from feature reliance to genuine capability assessment. This distinction is crucial for understanding models’ inductive biases and for developing reliable benchmarks in fairness and reasoning.

Our experiments demonstrate how shortcut-free evaluations, under controlled conditions, can lead to qualitatively different conclusions compared to previous studies. Whereas earlier work—often confounded by unremoved local shortcuts—suggested that DNNs are inferior to humans, and that CNNs underperform Transformers in global reasoning tasks, our results, based on evaluations using our carefully constructed dataset, reveal a contrasting picture: CNN-based models outperform Transformers, and all examined DNNs surpass human performance. While these findings are specific to our experimental setup, they illustrate the impact of shortcut-free evaluations on our understanding of model capabilities.

More importantly, our results underscore a crucial insight observable within our framework: a model’s apparent feature preferences do not necessarily align with its underlying capabilities. This reinforces the importance of using properly controlled datasets when evaluating complex reasoning behaviors—especially when drawing conclusions about architectural advantages or human–AI performance comparisons.

Despite the clear benefits of our framework, several avenues could be further improved. One assumption in our evaluation is that models achieve 100% accuracy on test sets, which enables clean capability assessment under idealized conditions. While this is consistent with methodologies in cognitive psychology, such an assumption may not hold in real-world scenarios involving noisy data and overlapping features. Exploring relaxed evaluation criteria or incorporating probabilistic interpretations could extend the applicability of our method to more complex tasks.

In addition, the dataset construction process involves a computationally intensive sample selection stage, where millions of samples per difficulty level are generated and filtered based on topological properties. While this ensures consistency and control, the volume of required samples may increase substantially for higher difficulty levels, posing challenges in scalability to large or high-dimensional datasets. Future work could explore more efficient selection mechanisms or generative modeling techniques to alleviate this bottleneck.

Several promising directions emerge for future research. First, SFEF could be extended to probe beyond global topological reasoning, enabling broader evaluations of AI capabilities across different cognitive domains. Second, scaling SHL to large-scale, real-world datasets would allow testing its robustness in complex, noisy environments. Third, integrating SHL into continual learning pipelines could support real-time shortcut identification and removal during data acquisition, pushing toward dynamic shortcut-free learning. Finally, applying SHL to multi-modal and cross-modal systems, such as vision-language models, offers a path toward understanding shortcut behavior in more complex AI systems, ultimately contributing to fairer and more trustworthy AI.

Methods

Operational formulation of shortcut hull learning

This section presents the operational formulation of SHL, bridging the abstract probabilistic definition in Eq. (7) with its concrete, feature-based expression in Eq. (8).

Eq. (7) is equivalent to

$$\exists E\in {{{\mathcal{F}}}},\quad {\bigcup}_{g| \sigma (g(X))\supseteq \sigma (f(X)),{{\mathbb{P}}}_{X,Y}[f(X)\ne Y]=0}{X}^{-1}({g}^{-1}(g(X(E)))) \supsetneq {Y}_{{{{\rm{Int}}}}}^{-1}({Y}_{{{{\rm{Int}}}}}(E)).$$
(9)

The partitioning of \({{{\rm{supp}}}}({\mathbb{P}})\) within the sample space Ω induced by Y should be unique, i.e., \(| \{\sigma ({Y}^{{\prime} }){| }_{{{{\rm{supp}}}}({\mathbb{P}})}\,| \,{Y}^{{\prime} }{=}^{a.s.}Y\} |=| \{\{E\cap {{{\rm{supp}}}}({\mathbb{P}})| E\in \sigma ({Y}^{{\prime} })\}| {Y}^{{\prime} }{=}^{a.s.}Y\}|=1\). Consequently, the existence of different solutions within the data essentially indicates different partitionings of \(\Omega \setminus {{{\rm{supp}}}}({\mathbb{P}})\). Essentially, \({{{\rm{supp}}}}({\mathbb{P}})\) represents the in-distribution (ID) sample set, denoted as \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\), while \(\Omega \setminus {{{\rm{supp}}}}({\mathbb{P}})\) represents the OOD sample set, denoted as \({{{{\mathcal{E}}}}}_{{{{\rm{OOD}}}}}\). For the same ID event \({E}_{{{{\rm{ID}}}}}\subseteq {{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\), different solutions will generalize to different OOD events \({E}_{{{{\rm{OOD}}}}}\subseteq {{{{\mathcal{E}}}}}_{{{{\rm{OOD}}}}}\). Therefore, Eq. (9) is equivalent to

$$\exists {E}_{{{{\rm{ID}}}}}\subseteq {{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}},\quad {\bigcup}_{g| \sigma (g(X))\supseteq \sigma (f(X)),{{\mathbb{P}}}_{X,Y}[f(X)\ne Y]=0}{X}^{-1}\left({g}^{-1}(g(X({E}_{{{{\rm{ID}}}}})))\right) \supsetneq {Y}_{{{{\rm{Int}}}}}^{-1}({Y}_{{{{\rm{Int}}}}}({E}_{{{{\rm{ID}}}}})),$$
(10)

where \({X}^{-1}\left({g}^{-1}(g(X({E}_{{{{\rm{ID}}}}})))\right)\) is the event generalized by g(X(EID)), and \({Y}_{{{{\rm{Int}}}}}^{-1}({Y}_{{{{\rm{Int}}}}}({E}_{{{{\rm{ID}}}}}))\) is the intended event generalized by YInt(EID).

By reflecting the abstract SHL in the probability space onto observable and operable real-valued features, we can directly diagnose shortcuts using the ID dataset itself. To represent features of different models within the same space, we map the neural network features to the input X. Typically, each dimension of the input vector is represented as a feature of the input53. In this context, each pixel position of an image is represented as a feature. For grayscale images, this is a pixel value, and for color images, it is a 3-dimensional pixel vector. Note that this choice of features is one of several possible instantiations; the feature representation is not unique, but this does not affect the conclusions derived. Let \(I:\Omega \to {2}^{{\{i\}}_{i=1}^{n}}\) denote a random index subset of the input random variable X. For ω ∈ Ω, define \({X}_{I}(\omega ):={({X}_{i}(\omega ))}_{i\in I(\omega )}\in {{\mathbb{R}}}^{| I(\omega )| }\), which represents a new subvector composed of components of X(ω) indexed by I(ω). Given a neural network feature function g(X), to reflect the SHL on the input space X, we consider a subvector XI such that σ(XI) = σ(g(X)), as illustrated in Fig. 6. We can then rewrite Eq. (10) as

$$\exists {E}_{{{{\rm{ID}}}}}\subseteq {{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}},\quad {\bigcup}_{I:\Omega \to {2}^{{\{i\}}_{i=1}^{n}}| \sigma ({X}_{I})\supseteq \sigma (f(X)),{{\mathbb{P}}}_{X,Y}[f(X)\ne Y]=0}{X}_{I}^{-1}({X}_{I}({E}_{{{{\rm{ID}}}}})) \supsetneq {X}_{{I}_{{{{\rm{Int}}}}}}^{-1}({X}_{{I}_{{{{\rm{Int}}}}}}({E}_{{{{\rm{ID}}}}})).$$
(11)
Fig. 6: An intuitive demonstration of the shortcut hull principle.

a Analyzing how the partitioning of data within the sample space \((\Omega,{{{\mathcal{F}}}},{\mathbb{P}})\) changes as it propagates through the layers of neural networks, and how these changes are reflected in the features of the input X. b When multiple solutions exist in data, different models partition the final layer in the ID sample set \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) in the same way but partition the OOD sample set \({{{{\mathcal{E}}}}}_{{{{\rm{OOD}}}}}\) differently, with these differences being mapped to distinct features in the common input X.

For a given feature representation g(X) and ω ∈ Ω, there may exist multiple index subsets I(ω) such that \({X}_{I}^{-1}({X}_{I}(\omega ))={X}^{-1}({g}^{-1}(g(X(\omega ))))\). For any \(I:\Omega \to {2}^{{\{i\}}_{i=1}^{n}}\) and ω ∈ Ω, define the set:

$$J(I(\omega ))\!:=\left\{\left.{\bigcup}_{{I}^{{\prime\prime} }(\omega )\subseteq {\{i\}}_{i=1}^{n}| {X}_{{I}^{{\prime\prime} }}^{-1}({X}_{{I}^{{\prime\prime} }}(\omega ))={X}_{{I}^{{\prime} }}^{-1}({X}_{{I}^{{\prime} }}(\omega ))}\{{I}^{{\prime\prime} }(\omega )\}\,\right\vert \,{I}^{{\prime} }(\omega )\subseteq I(\omega )\right\}.$$
(12)

Then, Eq. (11) is equivalent to stating that

$$\exists \omega \in {{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}},\quad {\bigcap}_{I:\Omega \to {2}^{{\{i\}}_{i=1}^{n}}| \sigma ({X}_{I})\supseteq \sigma (f(X)),{{\mathbb{P}}}_{X,Y}[f(X)\ne Y]=0}J(I(\omega ))\,\subsetneq\, J({I}_{{{{\rm{Int}}}}}(\omega )).$$
(13)

Morgan’s Canon of data: hierarchical principle for data shortcuts

In comparative psychology, Morgan’s Canon posits that if an animal’s behavior can be adequately explained by lower-level psychological processes, then it should not be attributed to higher-level processes54,55. Similarly, we can derive Morgan’s Canon of data fields. There are two scenarios where shortcuts might occur:

  • For data providers, the measurable space \((\Omega,{{{\mathcal{F}}}})\) represents all possible events within a given research task. In this context, \((\Omega,{{{\mathcal{F}}}})\) remains consistent across different probability spaces, while the ID sample set \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) differs. Consider two probability spaces \((\Omega,{{{\mathcal{F}}}},{{\mathbb{P}}}_{1})\) and \((\Omega,{{{\mathcal{F}}}},{{\mathbb{P}}}_{2})\), with corresponding ID sample sets \({{{{\mathcal{E}}}}}_{{{{\rm{ID1}}}}}\) and \({{{{\mathcal{E}}}}}_{{{{\rm{ID2}}}}}\), where \({{{{\mathcal{E}}}}}_{{{{\rm{ID2}}}}}\subseteq {{{{\mathcal{E}}}}}_{{{{\rm{ID1}}}}}\). Let random variables (X1, Y1) and (X2, Y2) be defined on these two probability spaces, respectively, and satisfy \(\forall \omega \in \Omega \), X1(ω) = X2(ω). If Y1(ω) = Y2(ω) holds for all \(\omega \in {{{{\mathcal{E}}}}}_{{{{\rm{ID1}}}}}\), then it must also hold for all \(\omega \in {{{{\mathcal{E}}}}}_{{{{\rm{ID2}}}}}\). That is, for \({{{{\mathcal{Y}}}}}_{1}=\{\sigma ({Y}^{{\prime} })| {Y}^{{\prime} }{=}^{a.s.}{Y}_{1},\sigma ({Y}^{{\prime} })\subseteq \sigma ({X}_{1})\}\) and \({{{{\mathcal{Y}}}}}_{2}=\{\sigma ({Y}^{{\prime} })| {Y}^{{\prime} }{=}^{a.s.}{Y}_{2},\sigma ({Y}^{{\prime} })\subseteq \sigma ({X}_{2})\}\), it holds that \({{{{\mathcal{Y}}}}}_{1}\subseteq {{{{\mathcal{Y}}}}}_{2}\). This implies that reducing \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) may introduce shortcuts. Therefore, to mitigate such shortcuts, data providers must appropriately control the probability measure \({\mathbb{P}}\), which amounts to carefully selecting \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) to ensure that Eq. (10) is not satisfied.

  • For data users, by contrast, the ID sample set \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) remains fixed across different probability spaces. However, when applying the same data to different tasks, the OOD sample set \({{{{\mathcal{E}}}}}_{{{{\rm{OOD}}}}}\) changes, implying a change in the sample space Ω. Suppose two probability spaces \(({\Omega }_{1},{{{{\mathcal{F}}}}}_{1},{{\mathbb{P}}}_{1})\) and \(({\Omega }_{2},{{{{\mathcal{F}}}}}_{2},{{\mathbb{P}}}_{2})\) share the same ID sample set \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\), with \({\Omega }_{2}\in {{{{\mathcal{F}}}}}_{1}\) and \({{{{\mathcal{F}}}}}_{2}={{{{\mathcal{F}}}}}_{1}{| }_{{\Omega }_{2}}=\{E\cap {\Omega }_{2}| E\in {{{{\mathcal{F}}}}}_{1}\}\). Let (X1, Y1) and (X2, Y2) be random variables defined on these two respective probability spaces such that \(\forall \omega \in {\Omega }_{2}\), X1(ω) = X2(ω) and \(\forall \omega \in {{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}},{Y}_{1}(\omega )={Y}_{2}(\omega )\). It follows that \(\sigma ({X}_{2})\subseteq \sigma ({X}_{1})\), and thus, for \({{{{\mathcal{Y}}}}}_{1}=\{\sigma ({Y}^{{\prime} })| {Y}^{{\prime} }{=}^{a.s.}{Y}_{1},\sigma ({Y}^{{\prime} })\subseteq \sigma ({X}_{1})\}\) and \({{{{\mathcal{Y}}}}}_{2}=\{\sigma ({Y}^{{\prime} })| {Y}^{{\prime} }{=}^{a.s.}{Y}_{2},\sigma ({Y}^{{\prime} })\subseteq \sigma ({X}_{2})\}\), it holds that \({{{{\mathcal{Y}}}}}_{2}\subseteq {{{{\mathcal{Y}}}}}_{1}\). This suggests that enlarging Ω may also introduce shortcuts. Therefore, to avoid these artifacts, data users must apply the dataset within an appropriately defined task—i.e., choose Ω carefully—to prevent Eq. (6) from being satisfied. In practice, the choice of Ω influences the function J in Eq. (12).

Hence, we can derive Morgan’s Canon of data. For data providers, the hierarchical structure is reflected in data \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\). Specifically, shortcuts at higher levels of dataset \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) do not indicate shortcuts at lower levels. For data users, the hierarchy is reflected in tasks Ω. In particular, shortcuts at lower levels of task Ω do not indicate shortcuts at higher levels.

Taking cat-dog classification as an example, Fig. 7 provides an intuitive illustration of Morgan’s Canon of data. For data providers, consider the random variables (X1, Y1) and (X2, Y2), defined on the probability spaces \(({\Omega }_{1},{{{{\mathcal{F}}}}}_{1},{{\mathbb{P}}}_{1})\) and \(({\Omega }_{1},{{{{\mathcal{F}}}}}_{1},{{\mathbb{P}}}_{2})\), respectively. Although the measurable space \((\Omega,{{{\mathcal{F}}}})\) is shared, variations in the ID sample set \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) lead to the emergence of shortcuts. For data users, consider the random variables (X1, Y1) and (X3, Y3), defined on the probability spaces \(({\Omega }_{1},{{{{\mathcal{F}}}}}_{1},{{\mathbb{P}}}_{1})\) and \(({\Omega }_{2},{{{{\mathcal{F}}}}}_{2},{{\mathbb{P}}}_{3})\), respectively. Even if \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) remains fixed, differences in Ω result in shortcuts.

Fig. 7: An intuitive demonstration of Morgan’s Canon of data in cat-dog classification.

a The boxes represent three probability spaces \(({\Omega }_{1},{{{{\mathcal{F}}}}}_{1},{{\mathbb{P}}}_{1})\), \(({\Omega }_{1},{{{{\mathcal{F}}}}}_{1},{{\mathbb{P}}}_{2})\), and \(({\Omega }_{2},{{{{\mathcal{F}}}}}_{2},{{\mathbb{P}}}_{3})\), where \({\Omega }_{2}\in {{{{\mathcal{F}}}}}_{1}\) and \({{{{\mathcal{F}}}}}_{2}={{{{\mathcal{F}}}}}_{1}{| }_{{\Omega }_{2}}=\{E\cap {\Omega }_{2}| E\in {{{{\mathcal{F}}}}}_{1}\}\). The corresponding ID sample sets within each space are \({{{{\mathcal{E}}}}}_{{{{\rm{ID1}}}}}\), \({{{{\mathcal{E}}}}}_{{{{\rm{ID2}}}}}\), and \({{{{\mathcal{E}}}}}_{{{{\rm{ID3}}}}}\) with \({{{{\mathcal{E}}}}}_{{{{\rm{ID2}}}}}={\Omega }_{1}\), and \({{{{\mathcal{E}}}}}_{{{{\rm{ID1}}}}}={{{{\mathcal{E}}}}}_{{{{\rm{ID3}}}}}={\Omega }_{2}\). In the measurable space \(({\Omega }_{1},{{{{\mathcal{F}}}}}_{1})\), the event representing dogs is defined as Sdog ∪ Tdog, where Sdog and Tdog denote shape and texture events for dogs, respectively. Similarly, the event representing cats is defined as Scat ∪ Tcat, where Scat and Tcat represent shape and texture events for cats. {Sdog ∪ Tdog, Scat ∪ Tcat} constitutes a partitioning of the sample space Ω1. In another measurable space \(({\Omega }_{2},{{{{\mathcal{F}}}}}_{2})\), the event Sdog ∩ Tdog representing dogs comprises the same samples as it does in Ω1. Likewise, the event Scat ∩ Tcat represents cats. {Sdog ∩ Tdog, Scat ∩ Tcat} constitutes a partitioning of the sample space Ω2. b The boxes denote random variables (X1, Y1), (X2, Y2), and (X3, Y3) defined respectively over these three probability spaces, where \(\forall \omega \in {\Omega }_{1}\), X1(ω) = X2(ω), and \(\forall \omega \in {\Omega }_{2}\), X1(ω) = X2(ω) = X3(ω) and Y1(ω) = Y2(ω) = Y3(ω). c An intuitive illustration of the possible partitionings \({{{{\mathcal{Y}}}}}_{1}=\{\sigma ({Y}^{{\prime} })| {Y}^{{\prime} }{=}^{a.s.}{Y}_{1},\sigma ({Y}^{{\prime} })\subseteq \sigma ({X}_{1})\}\), \({{{{\mathcal{Y}}}}}_{2}=\{\sigma ({Y}^{{\prime} })| {Y}^{{\prime} }{=}^{a.s.}{Y}_{2},\sigma ({Y}^{{\prime} })\subseteq \sigma ({X}_{2})\}\), and \({{{{\mathcal{Y}}}}}_{3}=\{\sigma ({Y}^{{\prime} })| {Y}^{{\prime} }{=}^{a.s.}{Y}_{3},\sigma ({Y}^{{\prime} })\subseteq \sigma ({X}_{3})\}\) respectively induced by Y1, Y2, and Y3, where \(| {{{{\mathcal{Y}}}}}_{1}| > 1\), \(| {{{{\mathcal{Y}}}}}_{2}|=| {{{{\mathcal{Y}}}}}_{3}|=1\). \({{{{\mathcal{Y}}}}}_{1}\) and \({{{{\mathcal{Y}}}}}_{2}\) illustrate the impact of \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) on \({{{\mathcal{Y}}}}\) under a fixed \((\Omega,{{{\mathcal{F}}}})\), while \({{{{\mathcal{Y}}}}}_{1}\) and \({{{{\mathcal{Y}}}}}_{3}\) illustrate the effect of Ω on \({{{\mathcal{Y}}}}\) when \({{{{\mathcal{E}}}}}_{{{{\rm{ID}}}}}\) remains constant.

Feature preferences of models with different inductive biases

As illustrated in Fig. 5, we utilize the HiResCAM56 method to map model features to input data with precision. We analyze features from models including ResNet-5026, ViT-B/1641, RepVGG-A257, Swin-T58, and PViG-S59 using data across three distinct classes. The results indicate that models with analogous inductive biases display remarkably similar features. For instance, CNN-based models like ResNet-50 and RepVGG-A2 show similar features across the classes. In contrast, models with divergent inductive biases exhibit pronounced feature differences. In Class 1, CNN-based models highlight the gap region between two outside relational loops, while ViT-B/16 and Swin-T emphasize the loops themselves rather than the gap. In Class 2, ViT-B/16 distinctly diverges from the other four models, focusing on the contours of the two loops rather than the gap. PViG demonstrates similarities to CNN-based models in both Class 1 and Class 2.

Analyzing model convergence on the topological dataset

Figure 8 shows that, when evaluating the same model with datasets of varying difficulty levels, model accuracy either approximates random guesses (with an accuracy close to \(33.\overline{3}\%\)) or approaches 100%, indicating that models either effectively capture global properties or fall short. Thus, categorizing by difficulty level provides a more precise and clearer assessment of model capability.

Fig. 8: The convergence curves of the ResNet26 models for the topological dataset.

The conditions include different model sizes, dataset difficulty levels, and learning rate parameters.

The analysis of Fig. 8b, a, and e reveals that the ResNet models’ convergence rate is notably sensitive to the learning rate when applied to the topological dataset, under consistent model sizes and dataset difficulty levels. A learning rate of 0.1 ensures model convergence, whereas increasing it to 1 or decreasing it to 0.01 leads to non-convergence. Additionally, an examination of Fig. 8b–d and f–h shows that ResNet models of various sizes can converge with 100% accuracy at a dataset difficulty level of 12, corresponding to a resolution of 125 × 125. However, when the difficulty level rises to 13, with a resolution of 133 × 133, models of different sizes fail to converge, achieving an accuracy rate of ~\(33.\overline{3}\%\), equivalent to random guessing. This indicates that model scale has a marginal impact on convergence on the topological dataset, and the model predominantly fluctuates between two states under these conditions: achieving 100% accuracy or resorting to random guessing.

Experimental setup

The MarkovJunior60 framework is employed to accelerate the generation of topological datasets.

To harness the knowledge of models pre-trained on the ImageNet dataset27, we resized each data instance to a dimension of 224 × 224 pixels with nearest-neighbor interpolation. As the original size of each data instance is smaller than 224 × 224, this scaling method does not result in a loss of detail or cause any changes to the data labels.

To ensure a fair comparison between each model, we utilize the cross-entropy function as the loss function for all models, avoiding any dropout-related techniques61,62. To preserve the topological properties of the dataset, data augmentation is limited to flipping and rotation, without employing label smoothing63.

Due to observations during the experiments indicating that the model either failed to converge or converged to a significantly high loss value on topological datasets (see Fig. 8 and Supplementary Note 2), we employed an optimizer without weight decay for training. Considering the higher convergence ceiling of stochastic gradient descent (SGD), we chose SGD with momentum64 and no weight decay as the training optimizer. A comparative analysis between SGD and Adam65 on the topological datasets is provided in Supplementary Note 3. To facilitate a grid search for the optimal learning rate, we adopt a constant learning rate in our experiments.
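
In PyTorch-style pseudo-configuration, the shared training setup reads as follows; the stand-in model and the exact momentum value are placeholders rather than values reported in this paper.

```python
import torch

model = torch.nn.Linear(10, 3)                 # stand-in for any model in the suite
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,            # constant rate chosen by grid search
                            momentum=0.9,      # assumed value, not reported here
                            weight_decay=0.0)  # no weight decay, as described above
criterion = torch.nn.CrossEntropyLoss()        # shared loss across all models
```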

In the context of neural networks, the resolution of the feature map decreases as the network depth increases, leading to a coarser granularity in the HiResCAM56 method’s visualizations. However, features from the later layers of the neural network tend to be more precise. Thus, we strike a balance by selecting features from intermediate layers where the resolution remains sufficiently high, ensuring detailed visualizations. Consistent with our earlier theoretical formulation, we utilize output soft labels or features from prior layers as network features, ensuring that our evaluation method remains accurate. The specific feature layers chosen for each evaluation model are detailed in Table 2.
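
A minimal HiResCAM-style sketch is shown below, assuming a PyTorch model and a chosen intermediate layer. It follows the published HiResCAM formulation (an elementwise gradient-activation product summed over channels) but is our own simplified rendering, not the exact pipeline used here.

```python
import torch
import torch.nn.functional as F

def hirescam(model, layer, x, class_idx):
    """HiResCAM-style attribution at `layer` for one target class (sketch)."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        score = model(x)[0, class_idx]       # assumes logits of shape (N, classes)
        model.zero_grad()
        score.backward()
    finally:
        h1.remove(); h2.remove()
    # Elementwise product, then sum over channels (no gradient pooling as in Grad-CAM)
    cam = torch.relu((grads["g"] * acts["a"]).sum(dim=1)).detach()
    cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                        mode="bilinear", align_corners=False)
    return cam.squeeze(1)                    # threshold this map to get pixel features
```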

Table 2 Detailed parameter configurations of the models used for SHL

No ethical approval or inclusion considerations were required for this study, as it did not involve human or animal subjects, nor did it use any sensitive or personally identifiable data.