Abstract
Do humans learn like transformers? We trained both humans (n = 530) and transformer networks on a rule-learning task in which they had to respond to a query embedded in a sequence. At test, we measured ‘in-context’ learning (generalizing the rule to novel queries) and ‘in-weights’ learning (recalling past experiences from memory). Manipulating the diversity and redundancy of examples in the training distribution, we found that humans and transformer networks respond in very similar ways. In both types of learner, redundancy and diversity trade off in driving in-weights and in-context learning, respectively, whereas a composite distribution with a balanced mix of redundancy and diversity allows the two strategies to be used in tandem. However, we also found that while humans benefit from dynamic training schedules that emphasize diverse examples early, transformers do not. So, while the same data-distributional properties promote learning in humans and transformer networks, only people benefit from curricula.
Main
The relationship between memory and reasoning is among the oldest problems in the cognitive sciences. Humans can make strong inductive inferences, allowing them to reason about novel data—for example, using the laws of calculus to compute integrals on a maths exam, or applying grammar rules to understand a sentence never heard before. However, the ability to encode and retain specific instances of past experience in memory is also a critical hallmark of healthy cognitive function. This duality was first articulated in the 1940s by Cattell, who distinguished ‘crystallized’ from ‘fluid’ intelligence—the former indexing the integrity of core skills and knowledge and the latter our ability to reason beyond extant data1. This dichotomy prefigured seminal dual-process frameworks in psychology and neuroscience, which separated heuristics from rational computation2, information integration from rule-based categorization3, associative from symbolic processes4 and model-free from model-based reinforcement learning5. However, the nature of the computations that allow humans (and perhaps other animals) to use both memory and inductive inference to solve complex problems remains an open question in psychology, neuroscience and artificial intelligence research.
Throughout the twentieth century, symbolic systems that strictly separated memory and inference remained popular6,7,8, but connectionist models have since reemerged as theories of biological cognition9,10. Neural networks can be trained either to store and retrieve information from memory or to learn generalizable patterns in data11, doing so by modifying their weights, which serves both to store information and support generalization (‘in-weights’ learning). Nevertheless, one surprising finding is that modern deep networks can be pretrained to generalize over patterns in sequential data after just a few examples, a capacity (dubbed ‘in-context’ learning) that is reminiscent of human inductive inference12,13. Rather than relying on weight updates, in-context learning arises from the networks’ internal processing: it is best understood as an emergent result of meta-learning, where training leads the network to ‘learn how to learn’ from the structure of its input, enabling it to perform few-shot learning without updating its weights14,15. In-context learning has come to prominence with the arrival of a new neural network architecture known as the transformer. Transformer networks use self-attention to compute how much each token in a sequence should influence the representation of every other token. This allows the model to integrate information across positions and build context-aware representations at each layer16. Large transformer networks trained on giant text corpora are able to generate fluent sentences, equations or code on the fly17,18,19, and it has been claimed that these networks can make inferences beyond their training data in ways that resemble human fluid intelligence20,21. Conversely, the idea that human cognition might emerge from a relatively undifferentiated neural network architecture has once again become fashionable in the neurosciences22,23. While the distinction between in-context and in-weights learning is reminiscent of dual-process frameworks, it is important to note that classical dual-system models do not make specific predictions about how learning strategies should vary with the statistical structure of the training data. This is the central focus of our work.
In a recent line of work, machine learning researchers have studied how the distributional properties of training data variously promote in-weights (memory-based) and in-context (inference-based) learning in transformer networks24,25,26,27,28,29,30,31,32,33. Using cleverly designed probes that can distinguish the two types of learning, researchers have shown that training distributions that involve lots of repetitions (redundancy) promote in-weights learning, whereas distributions that involve lots of diverse examples (diversity) promote in-context learning, with hints that a sweet spot may exist in between. Here we asked whether the results reported in these papers also hold true for human participants performing a comparable task. We found that human learners and transformers respond to the training data distribution in remarkably similar ways, and that a near-identical manipulation allows both humans and transformers to learn in-weights and in-context solutions in tandem. However, we also observed an important dissociation: humans, but not transformers, benefit from curricula that prioritize diverse examples early on in training.
Results
Transformers trade off in-context and in-weights learning depending on the training data distribution
We adapted a paradigm previously used to distinguish in-context and in-weights learning in transformers32. On each trial, the learner is prompted with a sequence of {item: label} pairs, and then a single item is queried for its label {item:?}. A real-world analogy might be learning vocabulary items in a foreign language. For example, during training the learner sees pairs such as the following:
oiseau: bird; chien: dog; chat: cat; poisson: fish; chat:?
(training trial)
At test, we can evaluate both in-context and in-weights learning by varying the novelty and familiarity of the sequences. These evaluations occur without any feedback (or gradient updates). In-context learning is indexed by zero-shot performance on previously unseen sequences with comparable structure, such as:
katze: cat; hund: dog; vogel: bird; fisch: fish; katze:?
(in-context test trial)
By contrast, in-weights learning is quantified as a tendency to repeat answers to queries previously experienced during training, ignoring any contextual information:
pferd: horse; hund: dog; vogel: bird; fisch: fish; chat:?
(in-weights test trial)
We used this approach to study how the training data distribution influences the learning strategy used by transformer networks and humans (Fig. 1a–c). Like previous studies involving transformers only, we varied the diversity and redundancy of training examples. To illustrate, consider two extremes: a fully redundant distribution in which every training trial contains the same item–label pair and a fully diverse distribution in which every trial contains entirely novel item–label pairs. We can interpolate between these extremes by sampling trials from a rank-frequency (or Zipfian) distribution parameterized by the exponent α, where α controls the skewness (Fig. 1a). At α = 0, the distribution is uniform (fully diverse); at α > 0, the distribution is skewed; and in the limit α → +∞, all probability mass is concentrated on a single example (fully redundant).
a, We studied learning in an image–label association task by manipulating the distribution of the training data. Under a uniform distribution (α = 0), all images are equally likely to appear. In skewed distributions, some images are more likely than others (α > 0). b, Example training trial. In a given trial, agents were asked to select the label corresponding to the query image, presented at the centre of the screen. Seven images and seven labels were also presented in a surrounding circle (the context). During training, a copy of the query image (the target image) was always present in the context. The correct label was always located three steps clockwise relative to the target image (the target label). c, Paradigm overview. During training, two learning strategies are available. The in-context learning strategy consists in using the context to infer the correct label—that is, using the ‘+3 steps’ rule. The in-weights learning strategy consists in learning each image–label association in memory using the feedback. Test blocks were designed to probe which strategy (or strategies) the agent is using. On in-context test blocks, novel images (depicted in grey) were presented, such that the only way to be accurate was to use information from the context—that is, the in-context strategy. On in-weights test blocks, a training image (depicted in blue) was presented as the query image, but novel images (depicted in grey) were presented in the context, such that the only way to be accurate was to use information stored in memory—that is, the in-weights strategy. On arbitrage test blocks, a training image was presented as the query image, and the context indicated a different label than the one that was presented during training. This was done to reveal the dominant strategy used by the agent when presented with conflicting evidence for the two strategies. d, A minimal transformer model, composed of two attention-only layers of one attention head each, was trained on the task. e, Accuracy curves for two example transformers trained on two different training distributions, uniform (α = 0, left) and skewed (α = 4, right). When α < 1, transformers learn in-context but not in-weights. Conversely, when α > 1, transformers learn in-weights but not in-context.
Using this task, we first attempted to replicate previously reported findings using a simple transformer architecture comprising two attention-only layers (one attention head each) followed by a classifier (Fig. 1d and Extended Data Fig. 1). Inputs were coded as vectors sampled from multidimensional Gaussian distributions (Methods). We first confirmed that transformers were able to learn the task (Fig. 2a). Indeed, on training trials, transformers learned well irrespective of the statistics of the training distribution (all accuracies near 100%, except when α = 1). However, at test, we found that performance varied sharply with the distributional properties of the training data. Transformers trained on a uniform distribution (α = 0) scored nearly perfectly on in-context test trials (accuracy of 100%), whereas those trained on a skewed distribution scored close to chance on these trials (~10% for transformers trained on α > 1). By contrast, transformers trained on a uniform distribution (α = 0) performed at chance on in-weights test trials, whereas transformers trained on a skewed distribution scored very highly (accuracy near 100% for transformers trained on α = 4). In both cases, a transition between these two regimes occurred close to α = 1, at which point approximately half of transformers learned to solve the task, and half remained at chance. These findings, which are shown in Fig. 2a, replicate previous reports that the relative balance between in-weights and in-context learning depends on the distribution of examples in the training data30,32.
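The architecture used here can be sketched as follows (a minimal PyTorch sketch, not the exact implementation: the embedding size, the residual connections and the linear readout over the ten labels are illustrative simplifications; the full model and training details are given in the Methods and Extended Data Fig. 1).

```python
import torch
import torch.nn as nn

class AttentionOnlyLayer(nn.Module):
    """A single-head, attention-only layer (no MLP block), with a residual connection."""
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.v = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_model ** -0.5

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        attn = torch.softmax(self.q(x) @ self.k(x).transpose(1, 2) * self.scale, dim=-1)
        return x + attn @ self.v(x), attn          # updated token stream and attention map

class TwoLayerAttentionClassifier(nn.Module):
    """Two attention-only layers (one head each) followed by a linear classifier."""
    def __init__(self, d_model=64, n_labels=10):
        super().__init__()
        self.layer1 = AttentionOnlyLayer(d_model)
        self.layer2 = AttentionOnlyLayer(d_model)
        self.readout = nn.Linear(d_model, n_labels)

    def forward(self, x):
        x, attn1 = self.layer1(x)
        x, attn2 = self.layer2(x)
        logits = self.readout(x[:, -1, :])         # classify the final (query) token
        return logits, (attn1, attn2)
```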
a, Training and test performances for transformers (n = 30 per training data distribution). b, Same for human participants (Exp. 1, n = 30 per training data distribution). The small dots indicate data from individual transformers/humans; the large dots indicate group averages. c, Scatter plots of the in-context versus in-weights test performances for feed-forward networks (left), LSTM networks (middle left), transformers (middle right) and humans (right). Feed-forward and LSTM networks do not learn in-context. Transformers and human participants trade off in-context and in-weights learning. Each dot indicates data from an individual model/human.
We also trained other classes of neural networks on the task, including a feed-forward architecture (a multi-layer perceptron (MLP)) and long short-term memory (LSTM) networks. In general, these architectures had no difficulty learning the task, but none showed effective in-context learning (Fig. 2c and Extended Data Fig. 2). This result replicates previous findings showing that neural architecture matters for in-context learning32. It should be noted that in-context learning is not exclusive to transformer networks—under specific conditions, both feed-forward and recurrent architectures such as LSTM networks can learn in-context34,35. However, transformers adopt this strategy more robustly and flexibly across a wider range of settings, including those used in our study. This is probably due to the attention mechanism, which explicitly provides an opportunity to integrate information present in the context when processing the query (see the mechanistic interpretability analysis below).
Humans trade off in-context and in-weights learning in a similar manner to transformers (Experiment 1)
Next, we designed a variant of the task that could be performed by human participants, recruited via an online platform. The context was composed of seven images (items) and seven numbers (labels), presented in alternation around a ring. The query item was presented centrally, inside the ring. During training, the query image was always also present in the context (for example, ‘cat’ in the example above; we call this the ‘target image’). As shown in Fig. 1b, the correct (or target) label was always located three steps clockwise from the target image. Participants responded by pressing a digit between 0 and 9 on their keyboard. They were not instructed as to the rule but learned gradually from fully informative feedback that was provided after each trial. Thus, during training, agents could use two strategies to solve the task: they could either memorize the class label for each image from the trialwise feedback (in-weights learning), or they could learn the ‘+3 steps’ rule to infer the correct label from the context of any sequence, including potentially novel sequences (in-context learning).
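To make the trial structure concrete, the logic of a single training trial can be sketched as follows (an illustrative helper rather than the experiment code: the ring is simplified to seven image–label positions, the image names are placeholders, and the placement of the non-target labels is randomized here, whereas the actual display geometry is described in Fig. 1b and the Methods).

```python
import random

IMAGE_POOL = [f"img_{i}" for i in range(2000)]   # stand-in for the 2,000 picture stimuli
LABELS = [str(d) for d in range(10)]             # responses are the digits 0-9

def make_training_trial(query_img, n_positions=7, offset=3):
    """One training trial: the query image reappears somewhere in the context ring,
    and its correct label is the one `offset` steps clockwise from that copy."""
    distractors = random.sample([im for im in IMAGE_POOL if im != query_img], n_positions - 1)
    ring_images = distractors + [query_img]
    random.shuffle(ring_images)
    ring_labels = [random.choice(LABELS) for _ in range(n_positions)]
    target_pos = ring_images.index(query_img)
    correct_label = ring_labels[(target_pos + offset) % n_positions]
    return {"query": query_img, "ring_images": ring_images,
            "ring_labels": ring_labels, "correct_label": correct_label}

trial = make_training_trial("img_42")
print(trial["correct_label"])
```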
We used a between-group design, in which four groups (n = 30 each, Experiment 1) experienced training distributions characterized by different parameters α ∈ {0, 1, 2, 4}. The results are presented in Fig. 2b. Like transformer networks, humans in all four groups learned to become proficient at the task. They had mostly reached a stable level of accuracy by the final training block (average accuracy of 85.6 ± 2.3%), and the data distribution did not impact their performance in training (effect of α on accuracy, β = 0.247 ± 0.179; P = 0.168; Bayes factor (BF), 0.042; ‘strong’ evidence in favour of an absence of effect). Thus, as for transformers, manipulating the training data distribution did not immediately affect agents’ learning or their ability to associate images with labels, as performance remained consistent regardless of α.
However, again like transformer networks, the performance of human participants at test was greatly influenced by the training data distribution. This was the case for both in-context test trials (effect of α on accuracy, β = −1.543 ± 0.208, P = 0.0, BF > 100, ‘decisive’ evidence) and in-weights test trials (effect of α on accuracy, β = 1.89 ± 0.106, P = 0.0, BF > 100, ‘decisive’ evidence). Similar to transformers, participants trained on a uniform distribution were very accurate on in-context test trials (85.7 ± 5.3% for the group trained on α = 0), while participants trained on a skewed distribution were near chance level (17.0 ± 3.9% for the group trained on α = 4). Conversely, on the in-weights test, participants trained on a uniform distribution were at chance level (7.26 ± 0.8% for the group trained on α = 0), while participants trained on a skewed distribution showed near-perfect performance (97.4 ± 0.8% for the group trained on α = 4). Once again, a transition between successful strategies occurred around α = 1. These findings are reported in Fig. 2b.
To better understand what drives performance in the in-weights test, we analysed accuracy as a function of item frequency during training (Extended Data Figs. 3 and 4). Both transformer networks and human participants performed better on frequent items, confirming that they learned from repeated exposure.
Finally, we also used a class of test that we call an ‘arbitrage’ trial, designed to disambiguate in-context and in-weights responding with a single query. Arbitrage test trials resembled in-weights test trials in that the query matched examples in the training data, and so the trial could be solved from memory. However, they also resembled in-context test trials, in that the query item was repeated in the context, so that the +3 rule could be applied. Crucially, the query item was paired with a different label in the context than the one it was paired with during training.
vogel: bird; hund: dog; chat: kitty; fisch: fish; chat:?
(arbitrage test trial)
Arbitrage trials had no inherently correct answer but allowed us to evaluate whether humans and transformer networks were using an in-context or an in-weights approach to solve the trial. We posed this type of trial to both human participants and transformer networks. Note that this condition is nearly identical to set-ups used in recent machine learning studies: the ‘ICL2’ trials in ref. 30, the ‘Flip’ condition in ref. 36 and the ‘Swap’ condition in ref. 35.
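To make the scoring of arbitrage trials concrete, a response on a single trial can be classified as follows (an illustrative helper, not code from the study):

```python
def classify_arbitrage_response(response, trained_label, context_label):
    """Classify one arbitrage response. The query was paired with `trained_label`
    during training, whereas applying the +3 rule to the context yields
    `context_label`; the two labels deliberately differ on arbitrage trials."""
    if response == context_label:
        return "in-context"
    if response == trained_label:
        return "in-weights"
    return "neither"   # the two strategy scores therefore need not sum to 1
```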
The results followed a similar pattern to those observed on in-context and in-weights test trials. Transformers trained on uniform data (α = 0) responded according to in-context learning and not in-weights learning, whereas transformers trained on skewed data (α > 1) responded the other way around. Once again, transformers traded off in-weights for in-context learning around α = 1. Similarly, human participants responded according to in-context learning when trained on a uniform distribution (α = 0) and progressively more according to in-weights learning as the skewness of the distribution increased (α > 0). Indeed, we observed a strong negative effect of α on accuracy with respect to in-context learning (β = −1.542 ± 0.183, P = 0.0, BF > 100, ‘decisive’ evidence) and a strong positive effect of α on accuracy with respect to in-weights learning (β = 1.752 ± 0.111, P = 0.0, BF > 100, ‘decisive’ evidence). Note that these two accuracies do not necessarily sum to 1, as agents can respond according to neither strategy.
To confirm the robustness of our findings, we conducted a preregistered replication of Experiment 1 with a new sample of human participants (n = 30 per training distribution; the preregistration is available at AsPredicted no. 231356, https://aspredicted.org/rqgz-rdfk.pdf). All key effects were replicated (Extended Data Fig. 4), including the trade-off between in-context learning and in-weights learning as a function of the training distribution.
In-context and in-weights learning trade off in both humans and transformer networks
In all three types of test trial, we observed a transition in learning strategies that occurred around α = 1. At this point transformers and humans seem to trade off in-context for in-weights learning. This implies that no (or very few) agents learn both strategies simultaneously. We confirmed that this was the case by plotting individual transformers’ and individual participants’ in-context test performance against their in-weights test performance (Fig. 2c). The majority of transformers were either pure in-context learners (26.7%; cluster of red points in the bottom right in Fig. 2c) or pure in-weights learners (66.7%; cluster of blue points in the top left in Fig. 2c), whereas just 6.7% learned both strategies. Similarly, most human participants were clustered in two groups, corresponding to in-context and in-weights learners (negative correlation between in-context and in-weights across the entire cohort, β = −0.286 ± 0.097, P = 0.004, BF = 6.66, ‘strong’ evidence). The majority of transformers and humans thus appear to trade off between in-context and in-weights learning, favouring one strategy depending on the data distribution.
Nevertheless, we noted that a few participants had good performance in both tests (5/127, 4%), meaning that humans can in principle learn both strategies simultaneously. Similarly, a few transformers had better-than-chance—but poor—performance in both tests (6.7%; cluster of grey points in Fig. 2c). These transformers learned some image classes in-weights but also discovered a suboptimal in-context learning strategy consisting in choosing one random label from the context, raising chance performance from 1/10 to ~1/7 and thus slightly improving performance. All these models were trained with the critical value α = 1 (on a side note, they are also the models that did not reach perfect performance at the end of training; Fig. 2a). This suggests that transformers can also in principle learn both strategies independently and at the same time, although a Zipfian distribution might not be optimal. This is what we explored in Experiment 2.
Transformers and humans learn both strategies in tandem when exposed to a non-Zipfian, composite training distribution (Experiment 2)
Experiment 1 revealed that a training distribution with maximal diversity (α = 0) promotes in-context learning, while training with high levels of redundancy (α > 1) promotes in-weights learning. Crucially, however, we see that in both humans and transformer networks, a training distribution that advantages one type of learning seems to impair the other, so that no (or very few) learners were able to acquire both an in-weights and an in-context strategy. Inspired by this result, we reasoned that a distribution that contains a mix of redundancy and diversity might favour learning both strategies at the same time. We thus moved beyond standard Zipfian distributions and created a ‘composite’ distribution where a fraction Pc of the query images are sampled from a uniform distribution (α = 0) and the remainder are sampled from a skewed distribution (αs > 0) (Fig. 3a).
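Sampling from such a composite distribution can be sketched as follows (pool size, seed and helper names are illustrative; in the human experiment the ‘uniform’ half consisted of novel images that each appeared only once, which a large item pool approximates here):

```python
import numpy as np

rng = np.random.default_rng(0)

def zipf_probs(n_items, alpha):
    """Rank-frequency probabilities p(k) proportional to k**(-alpha)."""
    ranks = np.arange(1, n_items + 1)
    weights = ranks ** (-float(alpha))
    return weights / weights.sum()

def sample_composite_queries(n_trials, n_items=2000, p_c=0.5, alpha_s=2.0):
    """On each trial, draw the query uniformly with probability p_c (diversity),
    otherwise draw it from a skewed Zipfian distribution (redundancy)."""
    skewed_draws = rng.choice(n_items, size=n_trials, p=zipf_probs(n_items, alpha_s))
    uniform_draws = rng.integers(0, n_items, size=n_trials)
    use_uniform = rng.random(n_trials) < p_c
    return np.where(use_uniform, uniform_draws, skewed_draws)

queries = sample_composite_queries(120)   # e.g. 120 training trials, as in Experiment 2
```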
a, Composite distribution, where a fraction Pc = 0.5 of the query images are sampled from a uniform distribution (α = 0) and the rest from a skewed distribution (αs = 2). This distribution contains redundant images, thus promoting in-weights learning, but also rare, diverse images, thus promoting in-context learning as well. b, Training and test performances of humans (Exp. 2, n = 50) when training query images were sampled from this composite distribution. On average, human participants became accurate in both in-context and in-weights test blocks. The small dots indicate data from individuals; the large dots indicate group averages. c, Double learning index for human participants trained on uniform (Uni, α = 0, Exp. 1, n = 30), composite (Comp, Exp. 2, n = 50) and skewed distributions (Skw, α = 2, Exp. 1, n = 30). Human participants had a greater double learning index value when trained on a composite distribution than when trained on a uniform distribution (linear regression with the group as a fixed effect, β = −0.295 ± 0.066, P = 0.0, BF > 100, ‘decisive’ evidence) or a skewed distribution (β = −0.191 ± 0.067, P = 0.005, BF > 100, ‘decisive’ evidence). **P < 0.01; ***P < 0.001. d, Scatter plots of the in-context versus in-weights test performances for transformers (left) and human participants (right). The dots indicate data from individual transformers/humans. The stars indicate group averages for uniform (blue, α = 0, Exp. 1), composite (pink, Exp. 2) and skewed distributions (red, α = 2, Exp. 1).
First, we trained the same transformer architecture on this composite distribution. The results from a full sweep of parameters are shown in Extended Data Fig. 5, but here we focus on the case where Pc = 0.5 and αs = 2. In contrast to what we observed with Zipfian distributions, under this parameterization transformers performed well in both in-context and in-weights test trials simultaneously. Plotting individual transformers’ in-context test performance against their in-weights test performance revealed a large cluster of models located in the top-right corner (~31/50, 62%; Fig. 3d, left). These models achieved high accuracy on both the in-context and the in-weights tests. This confirms that transformers are able to learn both strategies independently, if exposed to a distribution containing both redundant and diverse training examples.
Human participants trained on this composite distribution (Experiment 2; Fig. 3b) also had high levels of accuracy for both in-context test trials (65.6 ± 5.9%) and in-weights trials (57.4 ± 5.0%). Note that this does not directly imply that participants learned both strategies simultaneously, as what is true at the population level might not be reflected at the individual level—there could simply be two subgroups, one learning in-context and one learning in-weights. We thus introduced a ‘double learning index’ to quantify the amount of learning of both strategies at the individual level. Formally, it was computed as the product of each individual’s performance on in-context and in-weights test trials, scaled to account for chance level (Methods). The index varies between 0 (when the individual is at chance in either one of the two tests) and 1 (when the individual has perfect performance in both tests). We confirmed that human participants had a greater double learning index value when trained on a composite distribution (0.27 ± 0.05 a.u.) than when trained on a uniform distribution (α = 0, −0.02 ± 0.01 a.u.; difference between groups, β = −0.295 ± 0.066, P = 0.0, BF > 100, ‘decisive’ evidence) or a skewed distribution (α = 2, 0.08 ± 0.04 a.u., β = −0.191 ± 0.067, P = 0.005, BF > 100, ‘decisive’ evidence) (Fig. 3c). We further confirmed that human participants truly became ‘double learners’ by plotting individual participants’ in-context test performance against their in-weights test performance (Fig. 3d). We observed a large cluster of double-learner participants (17/50, 34%), located in the top-right corner.
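One plausible reconstruction of this index is sketched below (the exact scaling used in the study is specified in the Methods; chance level is taken as 1/10 because there were ten response options):

```python
def double_learning_index(acc_in_context, acc_in_weights, chance=0.1):
    """Rescale each test accuracy so that chance maps to 0 and perfect performance
    to 1, then take the product; values near 1 therefore require good performance
    on both tests, and chance-level performance on either test pulls the index to 0."""
    rescale = lambda acc: (acc - chance) / (1.0 - chance)
    return rescale(acc_in_context) * rescale(acc_in_weights)

# Applied to the Experiment 2 group means (0.656, 0.574) this gives ~0.33; the value
# reported in the text (0.27) is instead the average of per-participant indices.
print(double_learning_index(0.656, 0.574))
```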
Humans, but not transformers, benefit from curricula that prioritize diverse samples early on in training (Experiment 3)
We have so far investigated static, unstructured training regimes, where examples are sampled independently and identically across training. Next, we asked whether a dynamic training curriculum would improve learning in transformers and humans. The question was whether the order of presentation of the trials would influence performance—for example, because learning one strategy interacts with the learning of the other strategy. To do so, we used the same composite distribution as before, which promotes the learning of both strategies, but manipulated the order of the skewed and uniform trials across training.
Specifically, we designed two training curricula for transformers. The first curriculum (C1) involved maximally diverse exemplars in the first half of the training (the ‘uniform part’ of the composite distribution, α = 0) and then more redundant exemplars in the second half of the training (the ‘skewed part’ of the composite distribution, αs > 0). The second curriculum (C2) reversed this ordering (Fig. 4a). Transformers trained on these curricula failed to become double learners. Indeed, the double learning index was near zero for all transformers, and this was true for a wide range of αs, as shown in Fig. 4d. Even in an extremely skewed regime (αs = 4), transformers did not become good double learners. In fact, learning during the initial trials interferes strongly with subsequent learning. For example, when αs = 4, 92% of the trials are dominated by one item–label pair and >99% by the first five item–label pairs, so in-weights learning should be straightforward. Nevertheless, when initially trained on a uniform distribution (C1), transformer networks failed to learn this task. These data are illustrated in Fig. 4e and Extended Data Fig. 6, which show the test performance of transformers trained on C1 or C2 as training progresses. During the first part of the C1 training, transformers become pure in-context learners (the red curve goes to the bottom-right corner). In the second part of the C1 training, transformers progressively forget the in-context strategy as they learn in-weights (the red curve goes to the top-left corner). A double-learning transformer would keep high performance for in-context trials while learning in-weights (the red curve would go to the top-right corner). We observed the same pattern in opposite directions for transformers trained on C2 (the blue curve in Fig. 4e). Thus, transformers converge towards one strategy during the first part of the training according to the training distribution, but then forget this strategy, showing a form of catastrophic interference37,38.
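The two curricula can be sketched as two orderings of the same pool of composite-distribution trials (a schematic sketch with illustrative trial counts and item pool; the actual block structure is described in the Methods):

```python
import numpy as np

rng = np.random.default_rng(0)

def zipf_probs(n_items, alpha):
    """Rank-frequency probabilities p(k) proportional to k**(-alpha)."""
    ranks = np.arange(1, n_items + 1)
    weights = ranks ** (-float(alpha))
    return weights / weights.sum()

def build_curricula(n_trials=120, n_items=2000, alpha_s=2.0):
    """C1 presents the diverse (uniform) half first and the skewed half second;
    C2 reverses the order. Both curricula contain exactly the same trials."""
    half = n_trials // 2
    uniform_half = list(rng.choice(n_items, size=half, replace=False))   # maximally diverse queries
    skewed_half = list(rng.choice(n_items, size=half, p=zipf_probs(n_items, alpha_s)))
    return uniform_half + skewed_half, skewed_half + uniform_half        # C1, C2

c1, c2 = build_curricula()
```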
a, Training curricula based on the composite distribution (Pc = 0.5, αs > 0). The first curriculum (C1) involved maximally diverse exemplars in the first half of the training (the uniform part of the composite distribution, α = 0) and then more redundant exemplars in the second half of the training (the skewed part of the composite distribution, αs > 0). The second curriculum (C2) reversed this ordering. b, Two groups of human participants (Exp. 3, n = 50 per group) were exposed to two training curricula, C1 and C2 (composite distribution, Pc = 0.5, αs = 2). Human participants trained on C1 showed better performance on in-context trials than participants trained on C2 (logistic regression with the group as a fixed effect, β = −2.635 ± 0.784, P = 0.019, BF = 5.3, ‘substantial’ evidence, P value Bonferroni corrected). They also responded more using the in-context strategy in arbitrage trials (β = −2.676 ± 0.711, P = 0.004, BF = 12.1, ‘strong’ evidence, P value Bonferroni corrected). However, both groups had similar performance on in-weights trials (β = 0.096 ± 0.428, P = 1.0, BF = 0.019, ‘strong’ evidence, P value Bonferroni corrected). The small dots indicate data from individuals; the large dots indicate group averages. NS, P > 0.05; *P < 0.05; **P < 0.01. NS, not significant; prer., preregistered contrasts. c, Double learning index of human participants (linear regression with the group as a fixed effect, β = −0.162 ± 0.076, P = 0.036, BF = 0.969). *P < 0.05. The small dots indicate data from individuals; the large dots indicate group averages. d, Double learning index for transformers trained on the C1 curriculum (left) and the C2 curriculum (right). Transformers were trained with different values of αs for the skewed part of the composite distribution. e, Test performances over the course of training of transformers trained on C1 (red) and C2 (blue). The bold lines indicate group averages (n = 20 transformers per curriculum). The arrows were manually added to emphasize the direction of the trajectories.
We next used a similar approach to investigate this question in humans (Experiment 3). Training was composed of four blocks: two blocks where query images were sampled from a uniform distribution (α = 0) and two blocks from a skewed distribution (αs = 2). We then defined a curriculum as a permutation of the block order, denoted C1 and C2 (Fig. 4a). We used a between-group design, in which two groups of human participants (n = 50 per group) each experienced one curriculum. Both groups thus experienced the same trials but not in the same order. We preregistered our predictions prior to data collection (AsPredicted no. 173550, https://aspredicted.org/yhvp-6y3y.pdf, hypothesis H1). On the basis of pilot data, we predicted that C1 would favour in-context learning while not impairing in-weights learning relative to C2. The results are shown in Fig. 4b,c and reveal that, in line with our predictions, participants trained on C1 showed better performance on in-context trials than participants trained on C2 (difference between groups, β = −2.635 ± 0.784, P = 0.019, BF = 5.3, ‘substantial’ evidence, P value Bonferroni corrected). This was also the case in arbitrage trials, where participants trained on C1 responded more using the in-context strategy than participants trained on C2 (difference between groups, β = −2.676 ± 0.711, P = 0.004, BF = 12.1, ‘strong’ evidence). However, participants in both groups had the same performance on in-weights trials (difference between groups, β = 0.096 ± 0.428, P = 1.0, BF = 0.019, ‘strong’ evidence).
These results suggest that, in line with our preregistered predictions, a human curriculum that prioritizes diverse examples early on in training (C1) is beneficial for in-context learning while not impairing in-weights learning. We believe this reveals an asymmetry between in-context and in-weights learning in humans. Participants can still learn image–label associations even when they have discovered the in-context rule (C1) but have trouble discovering the in-context rule if they are first exposed to a training regime that favours in-weights learning (C2). For completeness, we tested all permutations of the block order as well as two ‘interleaved’ curricula where uniform and skewed distributions alternate during training (C3 and C4). The results are presented in Extended Data Fig. 7 and show that no other group contrasts were statistically significant (all P > 0.05, Bonferroni corrected; Extended Data Table 1).
Transformers and humans use an induction mechanism for in-context learning (Experiment 4)
One limitation of our comparison between transformers and humans is that it offers little insight into the mechanisms by which in-context learning is happening. To better understand the similarities between transformers and humans, we studied the inference process as it unfolds, using a mixture of tools from the emerging field of mechanistic interpretability (in transformers)39 and a behavioural mouse-tracking study (in humans)40. The results suggest that both humans and transformers solve the task using a two-step process composed of a binding operation followed by a searching operation.
For transformers, we first trained a transformer on the α = 0 distribution to create a pure in-context learning model. We then investigated the attention patterns of its two attention heads during an in-context learning test trial. Attention patterns can be illustrated as square matrices that plot how the transformer weights information about each item i when predicting each other item j. First, in attention head 1, the transformer associates each item with its corresponding label, which is located three positions ahead: we observed in Fig. 5a (matrix of attention head 1) that the attention weight for each item is concentrated on the token that is three positions ahead. This reflects a binding operation, where the attention head writes information about each item into the embedding of its corresponding label41,42,43. Crucially for the next step, it writes information about the target item into the embedding of the target label. Second, in attention head 2, the transformer searches for a match between the query item and the preceding context tokens. Since attention head 1 has already written information about the target item into the embedding of the target label, the match occurs at the target label’s location: in Fig. 5a (matrix of attention head 2, last column), we see that the attention weights for the query item are concentrated on the target label. The model then reads the information stored at this label. This computational architecture has been previously described in detail in refs. 41,42,44 and is referred to as an ‘induction head’. The two attention heads are essentially implementing a minimal induction operation of the form [A][B]…[A] → [B]. This copying operation indeed solves our in-context learning task ‘item; label; … ; item:?’.
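The induction computation can be made explicit with a toy symbolic example (purely illustrative: the trained network operates on continuous embeddings and learned attention weights, and the item–label offset here is simplified to adjacent tokens):

```python
# Toy sequence of alternating items and labels, followed by a repeated query item.
tokens = ["item1", "label1", "item2", "label2", "item3", "label3", "item2"]

# Step 1 (binding, attention head 1): write each item's identity into the
# position of its corresponding label (here, simply the next token).
bound_item = {}
for pos, tok in enumerate(tokens[:-1]):
    if tok.startswith("label"):
        bound_item[pos] = tokens[pos - 1]

# Step 2 (searching, attention head 2): the query attends to the position whose
# bound item matches it, and reads out the label stored there.
query = tokens[-1]
match_pos = next(pos for pos, item in bound_item.items() if item == query)
print(tokens[match_pos])   # -> "label2", i.e. [A][B]...[A] -> [B]
```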
a, Right, schematic representation of the computations realized by a two-layer transformer performing in-context learning. Left, attention matrices of both layers for the example sequence. The transformer binds the representations of the images and the labels in attention head 1 and searches for the target image in the context in attention head 2 (the induction head). b, Cursor trajectories of participants revealing their attention patterns. Top, trajectories in the in-context test block for human participants trained on a uniform (α = 0) distribution. Participants search for the target image in the context and then associate it with the target label. Bottom, trajectories in the in-context test block for participants trained on a skewed (α = 2) distribution (Exp. 4, n = 20 per group). Trajectories were aligned trial-by-trial to a common frame where the target image is located on the top of the context circle. The small lines are individual average trajectories; the diamonds are group average trajectories.
For humans, we trained a new in-person group of participants (Experiment 4, n = 20) on a uniform distribution (α = 0) to induce in-context learning, alongside a control group (n = 20) who encountered a skewed distribution (α = 2). For these participants, unlike in the previous experiments, we used a mouse-tracking paradigm to reveal the computational processes underlying human in-context inference as it unfolds (Methods, Fig. 5b and Extended Data Fig. 8). In test trials, the display was blurred and obscured, so that the locations of the images and labels could be seen but not their content. Participants were allowed to move a sharp aperture with their mouse to reveal part of the screen. Thus, similar to an eye-tracking device, tracking mouse position allowed us to track which information participants were viewing on the screen.
First, we confirmed that the participants trained on α = 0 became in-context learners, whereas the participants trained on α = 2 did not, replicating once again the results of Experiment 1. Indeed, the training data strongly influenced performance on in-context test trials (effect of α on accuracy, β = −2.437 ± 0.636, P = 0.0, BF = 44.3, ‘strong’ evidence), in-weights test trials (β = 2.262 ± 0.129, P = 0.0, BF > 100, ‘decisive’ evidence) and arbitrage test trials (effect of α on accuracy with respect to in-context learning, β = 2.002 ± 0.264, P = 0.0, BF > 100, ‘decisive’ evidence). As in Experiment 1, we confirmed that the training distribution did not directly influence the performance at the end of training (β = −0.09 ± 0.408, P = 0.824, BF = 0.027, ‘strong’ evidence) but only the strategy used by the participants.
Mouse trajectories are depicted in Fig. 5b (top). In step 1, after looking at the query image, participants search for the target image in the context. In step 2, once they have found the target image, they aim for the target label located at +3 steps clockwise and give a response. Note that these two steps correspond exactly to the two attention heads of the transformer: step 1 is implemented by attention head 2 (the searching operation), and step 2 is implemented by attention head 1 (the binding operation). We quantified the occurrence of these two steps in humans by counting the number of times the participant’s trajectory hit the target image and the target label on in-context test trials. We confirmed that participants trained on α = 0 hit the target image more often (84.5 ± 5.4%) than those trained on α = 2 (43.0 ± 9.8%) (effect of α on the probability of a hit, β = −2.117 ± 0.591, P = 0.0, BF = 17.3, ‘strong’ evidence). Similarly, participants trained on α = 0 hit the target label more often (82.8 ± 4.1%) than those trained on α = 2 (32.6 ± 8.4%) (effect of α on the probability of a hit, β = −2.433 ± 0.582, P = 0.0, BF > 100, ‘decisive’ evidence). The mouse-tracking data thus suggested that participants trained on a uniform distribution (α = 0) were using a two-step process, perhaps implementing an induction head similar to transformer networks. However, one difference between humans and transformer networks is that transformers bind all items with their corresponding labels in the context, while humans only bind the target image with the target label. This is because transformers are parallel architectures, applying the same operation to all the tokens at the same time.
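The hit measure can be sketched as a simple region-of-interest test on the recorded cursor samples (an illustrative helper with arbitrary pixel coordinates; the actual aperture and screen geometry are described in the Methods):

```python
import numpy as np

def roi_hit(trajectory_xy, roi_centre, roi_radius):
    """Return True if any cursor sample falls within `roi_radius` pixels of the
    region of interest (e.g. the target image or the target label)."""
    distances = np.linalg.norm(np.asarray(trajectory_xy, dtype=float)
                               - np.asarray(roi_centre, dtype=float), axis=1)
    return bool((distances < roi_radius).any())

trajectory = [(400, 300), (520, 310), (610, 260)]                   # hypothetical cursor samples
print(roi_hit(trajectory, roi_centre=(600, 250), roi_radius=50))    # -> True
```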
Finally, to test whether our findings generalize to more abstract forms of reasoning, we trained transformers on a transitive inference task. In this task, the model had to infer A > C from examples such as A > B and B > C presented in the context. As in the main task, performance depended on the training distribution: models trained on a uniform distribution (α = 0) solved the task using in-context learning, while models trained on a skewed distribution (α > 1) relied on in-weights learning. These results confirm that the link between training distribution and learning strategy holds even in tasks requiring more abstract generalization (Extended Data Fig. 9).
Discussion
Transformers are feed-forward neural networks augmented with self-attention that process long sequences of inputs in parallel. By contrast, the brain more closely resembles a recurrent neural network, where inputs are necessarily processed over sequential time steps. A priori, there is little reason to believe that humans and transformer networks would learn in comparable ways. We were thus quite surprised to find that their sensitivity to the distributional properties of the training data was so similar. Both humans and transformer networks show the same sensitivity to increasing skewness of the training distribution, with a transition between in-weights and in-context learning occurring in both cases at α = 1. Both humans and transformer networks traded in-weights for in-context learning when the training distribution was Zipfian, but both became double learners when trained on a composite distribution that jointly prioritized both diversity and redundancy in the training samples. Finally, both humans and transformer networks appear to use a binding-plus-searching operation to solve the task, as revealed by mechanistic interpretability analysis (in transformers) and analysis of viewing trajectories (in humans).
Previous studies using a similar methodology have argued that α = 1 represents a ‘sweet spot’ at which both in-weights and in-context learning are possible in transformers. We show here that what seems to be true at the level of the population is not true at the individual model level, as no single network learned both strategies in tandem using Zipfian distributions. At α = 1, some models converge to in-context learning and some to in-weights learning, but every model trades off one strategy for the other. We tried different model sizes and confirmed that this was also the case with larger and deeper models, with and without interleaved feed-forward layers between attention layers (up to four attention heads per layer, up to ten layers; Extended Data Figs. 10 and 11). Furthermore, we used mechanistic interpretability to confirm that attention heads were performing either in-context learning or in-weights learning but never both. To test this, we quantified the similarity between idealized attention patterns for in-context and in-weights learning and observed the attention patterns of models trained on different Zipfian distributions. The results are displayed in Extended Data Fig. 12 and show that models trained on α < 1 are similar to in-context learning heads, while models trained on α > 1 are similar to in-weights learning heads. Conversely, and in line with ref. 30, we show that composite, non-Zipfian distributions promote the learning of both strategies in tandem in transformers. While our results are based on relatively small transformer models trained from scratch, prior work suggests that many such behaviours generalize to larger-scale settings32,41,45. We nonetheless caution that scaling and pretraining introduce additional factors that may alter the dynamics of learning strategy selection.
Despite these striking similarities, transformers did not benefit from curricula that prioritized either diversity or redundancy in examples, whereas humans clearly did. This difference probably reflects a well-known limitation of neural networks: catastrophic interference. Once transformers settle on a strategy, they often forget earlier information—especially when training is blocked. In humans, early diversity boosts generalization, even when redundancy comes later. In transformer networks, later training tends to overwrite earlier strategies, making them less flexible to curriculum structure.
However, the broader failure of neural networks to benefit from structured training remains a puzzle in machine learning. For example, the BabyLM challenge (https://babylm.github.io/) is a competition in which machine learning researchers attempt to train language models with fewer than 100 million words. In its first iteration, many of the entrants attempted to use some sort of curriculum, but none were particularly successful46. Recent theoretical work suggests that curricula can help neural networks trained with gradient descent by guiding learning dynamics early on, especially by increasing diversity in input directions during the initial phase of training. This early diversity helps steer the model towards useful solutions more efficiently47. This implies that overparameterized deep neural networks (which typically already begin with a very high-dimensional initialization in weight space) are unlikely to benefit from curricula. However, this problem remains unsolved, and how to structure training examples to train neural networks more efficiently and effectively remains an open question.
Our findings have two potentially important implications for how people learn. The first is that for humans, as for transformers, a curriculum that promotes both redundancy and diversity allows people to learn strategies that rely on both memory and inference. This speaks to a long-standing debate in education research, which has asked whether schools should emphasize rote learning or critical thinking48. The answer implied by our data is that both are important. Presenting diverse examples that teach students how to tackle new problems is crucial, but being able to retrieve information about past experiences requires repetition. Of course, we cannot know whether insights from the simple, stylized setting employed here would translate to the classroom, but at least our work sets up a hypothesis that could be tested in more translational settings.
The second finding provides an interesting caveat to this claim: in humans, it is beneficial to provide diverse training examples early on. Early diversity does not seem to be overwritten by repetition that occurs later in training, whereas people that start learning from repeated examples never quite master the task. It is likely that early redundancy encourages learners to overfit to a specific strategy, making it more difficult to later embrace generalities. This result aligns with recent findings on asymmetries between in-context and in-weights learning. Specifically, Singh et al.49 showed that in-context learning tends to give way to in-weights learning asymptotically, but not the reverse. Furthermore, Singh et al.36 showed that once a model adopts an in-weights learning strategy, it struggles to recover in-context learning—while the reverse transition remains possible. We observed a similar pattern in humans: participants trained first on skewed data (favouring in-weights learning) failed to adopt in-context learning later, but those trained first on uniform data (favouring in-context learning) could shift strategies. These findings suggest that early learning conditions constrain later flexibility. We find this observation interesting, but we are unsure about its generality. It would be interesting to test whether this result replicates in other tasks involving a mixture of in-weights and in-context learning.
Our work compares humans and transformer networks. We found that in one interesting respect—the emergence of in-weights learning and in-context learning in response to the training data distribution—they show some striking similarities. However, this should not be taken to imply overlap between humans and transformers at the algorithmic level. Indeed, other classes of neural network, including simple multi-layer perceptrons, may in principle be capable of in-context learning34,35. Transformers are feed-forward networks with a highly structured architecture based on self-attention, diverging sharply from the recurrent, feedback-driven and biologically grounded computations of the human brain. Nevertheless, the way that they trade off memory-based strategies and inference-based strategies exhibits surprising commonalities with how this happens in human cognition.
Methods
Stimuli and paradigm
Participants
In total, we collected data from 530 participants (121 for Experiment 1, 50 for Experiment 2, 199 for Experiment 3, 40 for Experiment 4 and 120 for the replication of Experiment 1). The participants were recruited on the crowdsourcing platform Prolific (https://app.prolific.co/). The inclusion criteria included being between 18 and 40 years old, reporting no neurological condition, being an English speaker, being located in the USA or the UK, not having participated in another version of the task, having a minimum approval rate of 90% on Prolific and having a minimum of five previous submissions on Prolific. Participants received on average £10 per hour for their time and effort, including a performance-based bonus (£8.50 per hour for chance-level performance and £10.50 per hour for perfect performance). All experiments were approved by the Medical Sciences Research Ethics Committee of the University of Oxford (approval reference no. R50750/RE005). Before starting the experiment, informed consent was obtained through an online form, and the participants indicated that they understood the goals of the study, how to raise any questions, how their data would be handled and that they were free to withdraw from the experiment at any time.
Stimuli
We selected 2,000 pictures from the Common Objects in Context dataset50. The pictures represented a large variety of items (animals, people, landscapes, food and objects). The images were cropped and scaled to 300 × 300 pixels.
Procedure
JavaScript online experiments
The experiments were written in JavaScript, using jsPsych (version 7.3.1, https://www.jspsych.org/7.3/)51, and hosted on a web server. The scripts are available at https://osf.io/xb43k.
Instructions
The participants were instructed that the task was deterministic. The exact instructions were “This task is a learning task. You may have poor performances at the beginning but you will improve over the course of the experiment. On each trial, you will see a sequence of images and numbers. Your task is to press on the correct number on your keyboard, from 0 to 9. The rule determining which number you have to choose is 100% deterministic. This means that once you have discovered the rule, you will have 100% of correct responses.”
Main task
On each trial, the participants were presented with an image at the centre of the screen (the query image) surrounded by seven images and seven labels arranged in a ring (the context). The participants were asked to select the correct label associated with the query image by pressing one of ten possible keys on their keyboard: {‘0’, ‘1’, ‘2’, ‘3’, ‘4’, ‘5’, ‘6’, ‘7’, ‘8’, ‘9’}. Trials consisted of the following events: (1) a black loading screen for 500 ms, (2) stimulus presentation (query image and context) and response recording until a response was made, and (3) trialwise feedback for 1,000 ms. The stimuli remained visible on screen during feedback. For trials without trialwise feedback, a black screen was presented for 1,000 ms instead of the feedback screen. In training blocks, the participants received blockwise feedback on their performance in the previous block in addition to trialwise feedback.
Four block types were presented:
-
Training blocks. The query images were sampled from a Zipfian distribution with parameter α (see below). A copy of the query image (the target image) was always present in the context. The location of this target image was sampled uniformly from the seven possible locations. The six other images in the context were sampled uniformly from our pool of 2,000 images. The correct label was always located three steps clockwise from the target image (the target label). The other six labels in the context also followed the same rule: each was located three steps clockwise from its corresponding context image. The three-steps-clockwise rule and the use of seven context images were chosen on the basis of pilot data to avoid trivial or symmetry-based rules that led to rapid learning. Fully informative trialwise feedback was provided during training: after each trial, the participants were shown whether their response was correct or incorrect, as well as the correct response. Blockwise feedback was also given after each block during training. The mapping between images and labels was arbitrary and not semantically meaningful.
-
In-context test blocks. The query images were novel images sampled uniformly from unseen images during training. A copy of this query image (the target image) was always present in the context. The location of the target image was sampled uniformly from the seven possible locations. The six other images in the context were sampled uniformly from our pool of 2,000 images. The correct label was always located three steps clockwise from the target image (the target label). The other six labels in the context were sampled uniformly between 0 and 9. No feedback was given during in-context test blocks.
-
In-weights test blocks. The query images were old images sampled from the same Zipfian distribution as the training. No target image was present in the context. The seven images in the context were sampled uniformly from our pool of 2,000 images. The seven labels in the context were sampled uniformly between 0 and 9. No feedback was given during in-weights test blocks.
-
Arbitrage test blocks. The query images were old images sampled from the same Zipfian distribution as the training. A copy of the query image (the target image) was always present in the context. The location of this target image was sampled uniformly from the seven possible locations. The six other images in the context were sampled uniformly from our pool of 2,000 images. The seven labels in the context were sampled uniformly between 0 and 9. No feedback was given during arbitrage test blocks.
Rank-frequency (Zipfian) distribution
In training blocks, query images were sampled from a rank-frequency (Zipfian) distribution of parameter α. A Zipfian distribution on N elements assigns to the element of rank k (counting from 1) the probability

$$p(k) = \frac{k^{-\alpha}}{H_{N,\alpha}}, \qquad H_{N,\alpha} = \sum_{i=1}^{N} i^{-\alpha},$$

where HN,α is a normalization constant and is equal to the Nth generalized harmonic number. When α = 0, the distribution is the uniform distribution. When α > 0, the distribution is skewed, with larger values of α associated with a higher degree of skewness. On 150 trials, the frequency rankings were as follows:
-
For α = 0, the distribution was uniform, and the frequency of the images was [1, 1, 1, …, 1, 1] (all images are novel and appear once).
-
For α = 1, the distribution was skewed, and the frequency of the images sorted in decreasing order was [25, 13, 9, 7, 5, 5, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, …].
-
For α = 2, the distribution was skewed, and the frequency of the images sorted in decreasing order was [92, 23, 11, 6, 4, 3, 2, 2, 2, 1, 1, …].
-
For α = 4, the distribution was highly skewed, and the frequency of the images sorted in decreasing order was [139, 9, 2].
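For reference, the expected per-rank counts over 150 trials can be approximated directly from the formula above (a sketch; the pool size used for normalization is illustrative, and rounding means the realized frequencies listed above can differ slightly):

```python
import numpy as np

def expected_zipf_counts(alpha, n_trials=150, n_items=150):
    """Expected number of appearances of the k-th most frequent image under
    p(k) = k**(-alpha) / H(N, alpha), rounded to whole trials."""
    ranks = np.arange(1, n_items + 1)
    p = ranks ** (-float(alpha))
    p /= p.sum()
    return np.round(n_trials * p).astype(int)

print(expected_zipf_counts(alpha=2.0)[:10])   # approximately [92, 23, 10, 6, 4, 3, 2, 1, 1, 1]
```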
Experiment 1 and Experiment 1 replication
In Experiment 1, training consisted of five blocks of 30 trials (150 training trials in total). Participants were assigned randomly to one of four groups (between-participant design), corresponding to four distributions of the query images during training: a Zipfian distribution with α ∈ {0, 1, 2, 4}. After training, the participants performed the three test blocks: one in-context test block of 30 trials, one in-weights test block (30 trials) and one arbitrage test block (30 trials). The order of the test blocks was randomized across participants.
Experiment 2
In Experiment 2, training consisted of four blocks of 30 trials (120 training trials in total). Query images during training were sampled from a composite distribution—that is, 60 trials with query images sampled from a uniform distribution (α = 0) and 60 trials with query images sampled from a skewed distribution (α = 2). The order of all trials was shuffled for each participant, meaning both distributions were fully interleaved. After training, the participants performed the three test blocks: one in-context test block of 30 trials, one in-weights test block (30 trials) and one arbitrage test block (30 trials). Query images in the in-weights and arbitrage test blocks were sampled from the skewed distribution (α = 2). The order of the test blocks was randomized across participants.
Experiment 3
In Experiment 3, training consisted of four blocks of 30 trials (120 training trials in total). Two types of training blocks were presented: training blocks with query images sampled from a uniform distribution (α = 0) and training blocks with query images sampled from a skewed distribution (α = 2). Participants were assigned randomly to one of four groups (between-participant design), corresponding to four training curricula: C1 (the first block is skewed, the second block is skewed, the third block is uniform and the fourth block is uniform), C2 (uniform, uniform, skewed, skewed), C3 (skewed, uniform, skewed, uniform) and C4 (uniform, skewed, uniform, skewed). After training, the participants performed the three test blocks: one in-context test block of 30 trials, one in-weights test block (30 trials) and one arbitrage test block (30 trials). Query images in the in-weights and arbitrage test blocks were sampled from the skewed distribution (α = 2). The order of the test blocks was randomized across participants.
Experiment 4
In Experiment 4, training consisted of five blocks of 30 trials (150 training trials in total). Participants were assigned randomly to one of two groups (between-participant design), corresponding to two distributions of the query images during training: a Zipfian distribution with α ∈ {0, 2}. After training, the participants performed the three test blocks: one in-context test block of 30 trials, one in-weights test block (30 trials) and one arbitrage test block (30 trials). The order of the test blocks was randomized across participants. During the test blocks, we used MouseView.js52 to track participants' attention on the screen during stimulus presentation. To do so, the display was blurred and obscured so that the locations of images and labels could be seen but not their content. The participants could move a sharp (unblurred) aperture with their mouse to reveal part of the screen. We used the default parameter values of MouseView.js, with an aperture size of 15% (roughly the size of an image on the screen).
Preregistrations
The replication of Experiment 1 was preregistered on AsPredicted (no. 231356, https://aspredicted.org/rqgz-rdfk.pdf). Experiment 3 was also preregistered on AsPredicted (no. 173550, https://aspredicted.org/yhvp-6y3y.pdf). All hypotheses and planned analyses are publicly available in the corresponding preregistration documents.
Neural networks
Our model was largely based on the work of Reddy30, which investigated the mechanistic basis of in-context learning in transformers.
Stimuli
The network was trained to predict the label ‘labelq’ of a query item ‘itemq’ given an alternating sequence of N images and N labels:
item1; label1; item2; label2; …; itemN; labelN; itemq; ?
The images and labels were embedded in D + P dimensions. The first D dimensions encoded content, while the remaining P dimensions encoded positional information. Position was encoded as a one-hot P-dimensional vector. Images were D-dimensional vectors sampled independently and identically from a D-dimensional Gaussian distribution with mean 0 and variance 1. Each of the K images was assigned one of the L labels (L ≤ K). Labels were drawn prior to training and were also sampled independently and identically from a D-dimensional Gaussian distribution with mean 0 and variance 1.
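A minimal sketch (ours) of how such a token sequence could be constructed: Gaussian content vectors concatenated with one-hot position codes. The number of positional dimensions P (set here to the sequence length, 2N + 1 = 15) is an illustrative assumption, and the sketch shows only the embedding format, not the full trial constraints.

```python
import numpy as np

rng = np.random.default_rng(0)

D, N, L, K = 8, 7, 10, 214        # content dimensions, context pairs, labels, items (values per the text)
P = 2 * N + 1                     # positional dimensions: one per token (illustrative assumption)

item_vectors = rng.normal(0.0, 1.0, size=(K, D))    # content vectors drawn i.i.d. from N(0, 1)
label_vectors = rng.normal(0.0, 1.0, size=(L, D))   # label vectors drawn i.i.d. from N(0, 1) before training
item_to_label = rng.integers(0, L, size=K)          # each item assigned one of the L labels

def one_hot(i: int, size: int) -> np.ndarray:
    v = np.zeros(size)
    v[i] = 1.0
    return v

def build_sequence(context_items: np.ndarray, query_item: int) -> np.ndarray:
    """Return a (2N + 1, D + P) matrix of alternating item/label tokens followed by the query item."""
    tokens = []
    for pos, item in enumerate(context_items):
        tokens.append(np.concatenate([item_vectors[item], one_hot(2 * pos, P)]))
        tokens.append(np.concatenate([label_vectors[item_to_label[item]], one_hot(2 * pos + 1, P)]))
    tokens.append(np.concatenate([item_vectors[query_item], one_hot(2 * N, P)]))
    return np.stack(tokens)

seq = build_sequence(rng.integers(0, K, size=N), query_item=3)
print(seq.shape)    # (15, 23): 2N + 1 tokens in D + P dimensions
```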
Architecture
The inputs were passed through a two-layer attention-only network of intrinsic dimensionality DM followed by a classifier. Each attention layer had one attention head with a causal mask. The classifier was composed of two fully connected layers with ReLU activations and DM hidden units each. The last layer was a fully connected layer that predicted the probabilities of the L labels.
We also tested an interleaved MLP model (Extended Data Fig. 11), where each attention layer was followed by a feed-forward (MLP) block consisting of two dense layers with DM units (with ReLU activation), a residual connection and a layer normalization step.
Mimicking our human experiment, the dimensions of the problem were set to L = 10 and N = 7. The dimensions of the inputs were set to K = 214 and D = 8. The dimension of the model was set to DM = 16.
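A minimal PyTorch sketch of this architecture with the stated dimensions (DM = 16, L = 10), assuming an input size of D + P = 23 as in the sketch above. The linear input projection, the residual connections around attention and the readout from the final (query) token are our assumptions rather than details given in the text.

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Single-head self-attention with a causal mask (no MLP block: 'attention-only')."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (batch, seq, d_model)
        seq_len = x.shape[1]
        scores = self.q(x) @ self.k(x).transpose(1, 2) / x.shape[-1] ** 0.5
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        attn = torch.softmax(scores.masked_fill(mask, float('-inf')), dim=-1)
        return x + attn @ self.v(x)                           # residual connection (our assumption)

class AttentionOnlyClassifier(nn.Module):
    def __init__(self, d_in: int = 23, d_model: int = 16, n_labels: int = 10):
        super().__init__()
        self.embed = nn.Linear(d_in, d_model)                 # project D + P inputs to D_M (assumed)
        self.attn1 = CausalSelfAttention(d_model)
        self.attn2 = CausalSelfAttention(d_model)
        self.classifier = nn.Sequential(                      # two hidden layers with ReLU, then label logits
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, n_labels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (batch, 2N + 1, D + P)
        h = self.attn2(self.attn1(self.embed(x)))
        return self.classifier(h[:, -1, :])                   # logits read out at the query token
```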
Training
The network was trained using a cross-entropy loss. For training, we used a batch size of 128 and the Adam optimizer with a learning rate of 0.01. The models were trained for 5,000 steps.
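A sketch of the corresponding training loop with the stated hyperparameters (batch size 128, Adam, learning rate 0.01, 5,000 steps); sample_batch is a hypothetical data generator returning sequences and integer query-label targets, not part of the original code.

```python
import torch
import torch.nn.functional as F

def train(model: torch.nn.Module, sample_batch, n_steps: int = 5000,
          batch_size: int = 128, lr: float = 0.01) -> None:
    """Cross-entropy training on the query label. `sample_batch(batch_size)` is assumed to
    return (sequences of shape (B, 2N + 1, D + P), integer label targets of shape (B,))."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for step in range(n_steps):
        x, y = sample_batch(batch_size)
        logits = model(x)
        loss = F.cross_entropy(logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % 500 == 0:
            accuracy = (logits.argmax(dim=-1) == y).float().mean().item()
            print(f"step {step}: loss {loss.item():.3f}, accuracy {accuracy:.2f}")
```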
Alternative models
We compared the performance of the transformer network with two other architectures, keeping the number of layers and total number of parameters fixed: a two-layer feed-forward fully connected network with ReLU activations, and a two-layer LSTM network. All models were trained on the same data and evaluated using the same procedure as the transformer, including positional encodings in their input representations.
The feed-forward model received the entire input sequence flattened into a single vector. The standard LSTM received inputs one item at a time, with the query presented last, matching the set-up used for transformers and human participants. We also tested a query-first LSTM variant, where the query appeared at the start of the sequence, followed by the context items. This was designed to test whether knowing the target query early would help the model focus on relevant context and learn an in-context strategy. Despite these variations, none of the models showed reliable in-context learning (Extended Data Fig. 2).
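To make the input formats concrete, here is a brief illustrative sketch (names and shapes are ours) of how the same trial sequence would be presented to each baseline: flattened into one vector for the feed-forward network, and with the query token moved to the front for the query-first LSTM variant.

```python
import torch

def mlp_input(seq: torch.Tensor) -> torch.Tensor:
    """Feed-forward baseline: flatten the (batch, seq_len, D + P) sequence into one vector per trial."""
    return seq.reshape(seq.shape[0], -1)

def query_first(seq: torch.Tensor) -> torch.Tensor:
    """Query-first LSTM variant: move the final (query) token to the front of the sequence."""
    return torch.cat([seq[:, -1:, :], seq[:, :-1, :]], dim=1)

seq = torch.randn(128, 15, 23)                          # batch of 2N + 1 = 15 tokens in D + P = 23 dimensions
print(mlp_input(seq).shape, query_first(seq).shape)     # torch.Size([128, 345]) torch.Size([128, 15, 23])
```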
Transitive inference task
We designed a second modelling task to test whether the effects of training distribution on learning strategy generalize beyond the image–label association setup. In this task, each training environment consisted of six unique images, each with an implicit rank. The model received ten training triplets per trial, each expressing a one-step comparison between images (for example, ‘image 4 > image 3’), followed by a query that required a two-step transitive inference (for example, ‘image 4 ? image 2’).
On each trial, the model received:
- A context of one-step comparisons between image pairs from a single environment (for example, ‘image 4 > image 3’, ‘image 3 > image 2’).
- A query requiring a two-step inference (for example, ‘image 4 ? image 2’), where the model had to choose the correct relational symbol (‘>’ or ‘<’).
We manipulated the training distribution by varying the skewness (Zipf exponent α) of how often each environment appeared during training, following the same logic as in our main task. At test, three types of blocks were used:
- In-context test: the context came from a novel environment, so the only way to respond correctly was to use in-context learning.
- In-weights test: the query pair had been seen during training, but the context came from a novel environment; accuracy relied on memorized pair–label associations.
- Arbitrage test: environments were reused from training but with reversed item orders (for example, ‘image 4 < image 3’), to probe which strategy dominated when in-context and in-weights learning gave conflicting answers.
We used the same architecture, training set-up and evaluation metrics as in the image–label association task. The full results are presented in Extended Data Fig. 9.
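As an illustration of the trial structure (not the authors' generator), the following sketch builds one transitive-inference trial: the ten one-step comparisons of a six-image environment as context, plus a two-step query and its correct relational symbol.

```python
import random

random.seed(0)

def make_trial(env_images: list) -> tuple:
    """Build one trial for a six-image environment whose list order defines the implicit ranks."""
    rank = {img: r for r, img in enumerate(env_images)}
    # Context: all one-step comparisons, in both directions (5 adjacent pairs x 2 = 10 triplets).
    context = []
    for low, high in zip(env_images[:-1], env_images[1:]):
        context.append((high, '>', low))
        context.append((low, '<', high))
    random.shuffle(context)
    # Query: a two-step comparison, e.g. 'image 4 ? image 2', in a random direction.
    i = random.randrange(len(env_images) - 2)
    a, b = env_images[i + 2], env_images[i]
    if random.random() < 0.5:
        a, b = b, a
    answer = '>' if rank[a] > rank[b] else '<'
    return context, (a, '?', b), answer

context, query, answer = make_trial([10, 11, 12, 13, 14, 15])   # six image IDs from one environment
print(len(context), query, answer)                              # 10 triplets, a two-step query and its answer
```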
Statistical analysis
Outliers
No outliers were removed from the analyses.
Model selection
Statistical analyses were done using R version 4.4.2 (ref. 53) and the package lme4 (ref. 54). For all analyses, model complexity was monitored using the Bayesian information criterion (BIC), a standard measure to arbitrate between complexity and accuracy. The reported P values are Satterthwaite approximations. We also report the Bayes factor (BF) for each effect, approximated from the difference between the BIC of the model including the effect (BIC1) and that of the model without it (BIC0), as BF = exp((BIC0 − BIC1)/2). The BF quantifies the support of the data in favour of an effect. We followed ref. 55 for the interpretation of its values: BF > 3, BF > 10 and BF > 100 were respectively taken as substantial, strong and decisive evidence in favour of an effect (BF < 0.3, BF < 0.1 and BF < 0.01 as evidence in favour of the absence of an effect).
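For concreteness, a small Python helper (ours; the analyses themselves were run in R) implementing this BIC-based Bayes factor approximation, with hypothetical BIC values.

```python
import math

def bic_bayes_factor(bic_without: float, bic_with: float) -> float:
    """BF = exp((BIC0 - BIC1) / 2), where BIC1 is the model with the effect and BIC0 the model without it."""
    return math.exp((bic_without - bic_with) / 2)

# Hypothetical values: the model including the effect improves the BIC by 10 points.
print(bic_bayes_factor(1010.0, 1000.0))   # ≈ 148.4, i.e. BF > 100, 'decisive' evidence
```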
Accuracy
In Experiment 1 and Experiment 4, the probability of being correct (0, incorrect; 1, correct) was modelled as an independent logistic regression for each block type, with α as a fixed effect and one random intercept per participant.
In Experiment 3, the probability of being correct was modelled as an independent logistic regression for each block type and each group contrast, with the group as a fixed effect and one random intercept per participant. We applied a Bonferroni correction to correct for multiple comparisons.
Power analysis
The sample size for the replication of Experiment 1 was determined via power simulations based on data from Experiment 1, assuming a 50% smaller effect size than observed in that study. The simulations suggested a minimum of 10–20 participants per group, depending on the block. To ensure robust power across all analyses, we conservatively set the sample size to 30 per group.
Double learning index
We defined a double learning index as a value between 0 (no double learning) and 1 (perfect performance in both in-context and in-weights test trials). For each participant, it was defined as:

$$\text{double learning index} = \mathrm{scale}(m_{\mathrm{IC}}) \times \mathrm{scale}(m_{\mathrm{IW}}),$$

where mIC is the average performance of the participant in the in-context test trials, mIW is the average performance of the participant in the in-weights test trials and ‘scale’ is a linear mapping accounting for chance level (‘chance’, here 10%). Because it is a product, this index is 0 if either of the two performances is at chance (and thus non-zero only if both performances are above chance).
In Experiment 2 and Experiment 3, the double learning index was modelled as a linear regression with the group as a fixed effect.
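A minimal sketch of this index, assuming that ‘scale’ is the linear mapping sending chance-level accuracy (10%) to 0 and perfect accuracy to 1, with below-chance values clipped to 0; this specific form is our reading of the definition above, not a quoted implementation.

```python
def scaled(m: float, chance: float = 0.1) -> float:
    """Linear mapping sending chance-level accuracy to 0 and perfect accuracy to 1 (assumed form);
    below-chance accuracy is clipped to 0 so the index stays in [0, 1]."""
    return max(0.0, (m - chance) / (1.0 - chance))

def double_learning_index(m_ic: float, m_iw: float, chance: float = 0.1) -> float:
    """Product of the scaled in-context (m_ic) and in-weights (m_iw) test accuracies."""
    return scaled(m_ic, chance) * scaled(m_iw, chance)

print(double_learning_index(0.8, 0.6))   # ≈ 0.43: above chance on both tests
print(double_learning_index(0.9, 0.1))   # 0.0: in-weights performance at chance
```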
Mouse trajectories
For visualization purposes, trial-by-trial cursor trajectories were first rotated into a common frame in which the target image was located at the top of the context circle and then resampled to 100 time points between the start and the end of the trial using linear interpolation. A ‘hit’ trial was defined as one in which the cursor came within 20% of the screen height of the target image at least once during the trial. In Experiment 4, the probability of a hit (0, no hit; 1, hit) was modelled as a logistic regression for in-context test trials, with α as a fixed effect and one random intercept per participant.
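The sketch below illustrates this preprocessing (resampling to 100 time points, rotation into a common frame and the hit criterion); the function names, the rotation convention and the coordinate frame are our assumptions for illustration.

```python
import numpy as np

def resample_trajectory(xy: np.ndarray, n_points: int = 100) -> np.ndarray:
    """Linearly resample a (T, 2) cursor trajectory to n_points between trial start and end."""
    t_old = np.linspace(0.0, 1.0, len(xy))
    t_new = np.linspace(0.0, 1.0, n_points)
    return np.column_stack([np.interp(t_new, t_old, xy[:, 0]),
                            np.interp(t_new, t_old, xy[:, 1])])

def rotate_about(xy: np.ndarray, centre: np.ndarray, angle: float) -> np.ndarray:
    """Rotate a trajectory around the context-circle centre by `angle` radians
    (chosen per trial so that the target image ends up at the top of the circle)."""
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    return (xy - centre) @ rot.T + centre

def is_hit(xy: np.ndarray, target_pos: np.ndarray, screen_height: float) -> bool:
    """Hit: the cursor comes within 20% of the screen height of the target at least once."""
    return bool(np.any(np.linalg.norm(xy - target_pos, axis=1) <= 0.2 * screen_height))
```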
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The anonymized data, materials and preregistration documents are all available via OSF at https://osf.io/xb43k.
Code availability
The scripts for stimulus presentation and data analysis are available via OSF at https://osf.io/xb43k.
References
Brown, R. E. Hebb and Cattell: the genesis of the theory of fluid and crystallized intelligence. Front. Hum. Neurosci. 10, 606 (2016).
Stanovich, K. E. & West, R. F. Individual differences in reasoning: implications for the rationality debate? Behav. Brain Sci. 23, 645–665 (2000).
Ashby, F. G. & Maddox, W. T. Human category learning. Annu. Rev. Psychol. 56, 149–178 (2005).
Sloman, S. A. The empirical case for two systems of reasoning. Psychol. Bull. 119, 3–22 (1996).
Dolan, R. J. & Dayan, P. Goals and habits in the brain. Neuron 80, 312–325 (2013).
Pylyshyn, Z. W. in The Foundations of Cognitive Science (ed. Posner, M. I.) 51–92 (MIT Press, 1989); https://doi.org/10.7551/mitpress/3072.003.0004
Newell, A. Physical symbol systems. Cogn. Sci. 4, 135–183 (1980).
Fodor, J. A. The Language of Thought (Harvard Univ. Press, 1975).
Summerfield, C. Natural General Intelligence: How Understanding the Brain Can Help Us Build AI (Oxford Univ. Press, 2022); https://doi.org/10.1093/oso/9780192843883.001.0001
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Krotov, D. & Hopfield, J. J. Dense associative memory for pattern recognition. Adv. Neural Inf. Process. Syst. 29, 1172–1180 (2016).
Wang, J. X. et al. Prefrontal cortex as a meta-reinforcement learning system. Nat. Neurosci. 21, 860–868 (2018).
von Oswald, J. et al. Transformers learn in-context by gradient descent. In Proc. International Conference on Machine Learning (eds Krause, A. et al.) 35151–35174 (PMLR, 2023).
Wang, J. X. Meta-learning in natural and artificial intelligence. Curr. Opin. Behav. Sci. 38, 90–95 (2021).
Binz, M. et al. Meta-learned models of cognition. Behav. Brain Sci. 47, e147 (2023).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017).
Brown, T., Mann, B. & Ryder, N. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
Achiam, J. et al. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).
Anil, R. et al. Gemini: a family of highly capable multimodal models. Preprint at https://arxiv.org/abs/2312.11805 (2023).
Bubeck, S. et al. Sparks of artificial general intelligence: early experiments with GPT-4. Preprint at https://arxiv.org/abs/2303.12712 (2023).
Lake, B. M. & Baroni, M. Human-like systematic generalization through a meta-learning neural network. Nature 623, 115–121 (2023).
Saxe, A., Nelli, S. & Summerfield, C. If deep learning is the answer, what is the question? Nat. Rev. Neurosci. 22, 55–67 (2021).
Richards, B. A. et al. A deep learning framework for neuroscience. Nat. Neurosci. 22, 1761–1770 (2019).
Min, S. et al. Rethinking the role of demonstrations: what makes in-context learning work? In Proc. 2022 Conference on Empirical Methods in Natural Language Processing (eds Goldberg, Y. et al.) 11048–11064 (Association for Computational Linguistics, 2022); https://doi.org/10.18653/v1/2022.emnlp-main.759
Ravaut, M. et al. A comprehensive survey of contamination detection methods in large language models. Preprint at https://arxiv.org/abs/2404.00699 (2024).
Bender, E. M., Gebru, T., McMillan-Major, A. & Shmitchell, S. On the dangers of stochastic parrots: can language models be too big? In Proc. 2021 ACM Conference on Fairness, Accountability, and Transparency 610–623 (ACM, 2021); https://doi.org/10.1145/3442188.3445922
Lampinen, A. K., Chan, S. C. Y. & Hermann, K. Learned feature representations are biased by complexity, learning order, position, and more. Preprint at https://arxiv.org/abs/2405.05847 (2024).
Garg, S., Tsipras, D., Liang, P. & Valiant, G. What can transformers learn in-context? A case study of simple function classes. Preprint at https://arxiv.org/abs/2208.01066 (2022).
Yun, C., Bhojanapalli, S., Rawat, A. S., Reddi, S. J. & Kumar, S. Are transformers universal approximators of sequence-to-sequence functions? Preprint at https://arxiv.org/abs/1912.10077 (2019).
Reddy, G. The mechanistic basis of data dependence and abrupt learning in an in-context classification task. Preprint at https://arxiv.org/abs/2312.03002 (2023).
Chan, S. C. Y. et al. Transformers generalize differently from information stored in context vs in weights. Preprint at https://arxiv.org/abs/2210.05675 (2022).
Chan, S. C. Y. et al. Data distributional properties drive emergent in-context learning in transformers. Adv. Neural Inf. Process. Syst. 35, 18878–18891 (2022).
Raventós, A., Paul, M., Chen, F. & Ganguli, S. Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression. Adv. Neural Inf. Process. Syst. 36, 14228–14246 (2023).
Lee, I., Jiang, N. & Berg-Kirkpatrick, T. Is attention required for ICL? Exploring the relationship between model architecture and in-context learning ability. Preprint at https://arxiv.org/abs/2310.08049 (2023).
Tong, W. L. & Pehlevan, C. MLPs learn in-context on regression and classification tasks. Preprint at https://arxiv.org/abs/2405.15618 (2024).
Singh, A. K. et al. Strategy coopetition explains the emergence and transience of in-context learning. Preprint at https://arxiv.org/abs/2503.05631 (2025).
McCloskey, M. & Cohen, N. J. in Psychology of Learning and Motivation Vol. 24 (ed. Bower, G. H.) 109–165 (Elsevier, 1989).
Ratcliff, R. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychol. Rev. 97, 285–308 (1990).
Rai, D., Zhou, Y., Feng, S., Saparov, A. & Yao, Z. A practical review of mechanistic interpretability for transformer-based language models. Preprint at https://arxiv.org/abs/2407.02646 (2024).
Spivey, M. J. & Dale, R. Continuous dynamics in real-time cognition. Curr. Dir. Psychol. Sci. 15, 207–211 (2006).
Olsson, C. et al. In-context learning and induction heads. Preprint at https://arxiv.org/abs/2209.11895 (2022).
Elhage, N., Nanda, N., Olsson, C., Henighan, T. & Joseph, N. A mathematical framework for transformer circuits. Transformer Circuits Thread https://transformer-circuits.pub/2021/framework/index.html (2021).
Wang, K., Variengien, A., Conmy, A., Shlegeris, B. & Steinhardt, J. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. Preprint at https://arxiv.org/abs/2211.00593 (2022).
Bietti, A., Cabannes, V., Bouchacourt, D., Jegou, H. & Bottou, L. Birth of a transformer: a memory viewpoint. Adv. Neural Inf. Process. Syst. 36, 1560–1588 (2023).
Wortsman, M. et al. Small-scale proxies for large-scale transformer training instabilities. Preprint at https://arxiv.org/abs/2309.14322 (2023).
Warstadt, A., Mueller, A. & Choshen, L. Proc. BabyLM Challenge at the 27th Conference on Computational Natural Language Learning (CoNLL, 2023).
Mannelli, S. S., Ivashynka, Y., Saxe, A. & Saglietti, L. Tilting the odds at the lottery: the interplay of overparameterisation and curricula in neural networks. J. Stat. Mech. 2024, 114001 (2024).
Mayer, R. E. Rote versus meaningful learning. Theory Pract. 41, 226–232 (2002).
Singh, A. et al. The transient nature of emergent in-context learning in transformers. Adv. Neural Inf. Process. Syst. 36, 27801–27819 (2023).
Lin, T.-Y. et al. Microsoft COCO: common objects in context. In Proc. European Conference on Computer Vision (eds Fleet, D. et al.) 740–755 (Springer, 2014).
de Leeuw, J. R. jsPsych: a JavaScript library for creating behavioral experiments in a Web browser. Behav. Res. Methods 47, 1–12 (2015).
Anwyl-Irvine, A. L., Armstrong, T. & Dalmaijer, E. S. MouseView.js: reliable and valid attention tracking in web-based experiments using a cursor-directed aperture. Behav. Res. Methods 54, 1663–1687 (2022).
R Core Team R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2024); https://www.R-project.org/
Bates, D., Mächler, M., Bolker, B. & Walker, S. Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67, 1–48 (2015).
Kass, R. E. & Raftery, A. E. Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995).
Acknowledgements
We thank J. Drevet for enhancing the clarity and aesthetics of the figures. This work was supported by the Fondation Pour l’Audition RD-2021-2 (J.P.L.); the Institute for Language, Communication, and the Brain (J.P.L.); European Research Council Consolidator Grant No. 725937—CQR01290.CQ001 (C.S.); and an ATRAE award from the Spanish Ministry of Education to C.S. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author information
Authors and Affiliations
Contributions
Conceptualization: J.P.L. and C.S. Data curation: J.P.L. Formal analysis: J.P.L. Funding acquisition: J.P.L. and C.S. Investigation: J.P.L. Methodology: J.P.L. and C.S. Project administration: C.S. Resources: C.S. Supervision: C.S. Visualization: J.P.L. Writing—original draft: J.P.L. and C.S. Writing—review and editing: J.P.L. and C.S.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Human Behaviour thanks Andrew Lampinen, Tom Verguts and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Example learning curves for multiple transformer networks.
Accuracy curves for multiple example transformer networks trained on different training distributions: uniform (α = 0, top row), moderately skewed (α = 1, middle row) and skewed (α = 2, bottom row). In-context test performance and arbitrage test performance (with respect to in-context learning) strongly overlap. Over the course of training, in-context test performance trades off against in-weights test performance.
Extended Data Fig. 2 Feed-forward networks and LSTM networks do not become in-context learners in the same task.
a. (top) 2-layer feed-forward fully-connected network. (bottom) Scatter plots of the in-context vs in-weights test performances after training. b. (top) 2-layer LSTM network. (bottom) Scatter plots of the in-context vs in-weights test performances after training. Each dot is an individual network (N = 30 per training data distribution for each architecture).
Extended Data Fig. 3 Performance as a function of the image frequency during training.
a,b, Training and test performances for transformers (top, N = 30 per training data distribution) and human participants (bottom, Exp. 1, N = 30 per training data distribution) as a function of the frequency of the image during training. For each value of α, test items were grouped by how often they appeared during training. For example, in α = 2: ‘top 1’ corresponds to the image that was seen 92 times during training, ‘top 2–4’ to images that were seen ~13 times, and ‘top 5–10’ to images that were seen ~2 times. Large dots are group average. Errors are s.e.m.
Extended Data Fig. 4 Replication of Experiment 1.
a. Training and test performances of human participants (bottom, replication of Exp. 1, N = 30 per training data distribution). Small dots are individuals, large dots are group average. Our pre-registered effects (AsPredicted #231356, https://aspredicted.org/rqgz-rdfk.pdf) were all confirmed. In particular, there was a negative effect of α on accuracy in the in-context test block (β = −1.145 ± 0.208, p < 0.001, BF > 100, ‘decisive’ evidence), a positive effect of α on accuracy in the in-weights test block (β = 1.786 ± 0.138, p < 0.001, BF > 100, ‘decisive’ evidence), a negative effect of α on accuracy with respect to in-context learning in arbitrage blocks (β = −1.097 ± 0.168, p < 0.001, BF > 100, ‘decisive’ evidence) and a positive effect of α on accuracy with respect to in-weights learning in arbitrage blocks (β = 1.669 ± 0.128, p < 0.001, BF > 100, ‘decisive’ evidence). b. Training and test performances as a function of the frequency of the image during training. c. Scatter plots of the in-context vs in-weights test performances. Each dot is an individual model/human.
Extended Data Fig. 5 Performance of transformers trained on a wide range of composite distributions.
Scatter plots of the in-context vs in-weights test performances of transformers after training on different values of Pc (proportion of in-context trials during training) and αs (the rest of the trials). Dots are individual models.
Extended Data Fig. 6 Transformers do not benefit from structured curricula.
a. Test performances over the course of training of transformers trained on C1 (red) and C2 (blue). Bold lines are group average (N = 20 transformers per curriculum). Arrows were manually added to emphasise the direction of the trajectories. b. Accuracy curves for one example transformer network trained on C1. c. Accuracy curves for one example transformer network trained on C2.
Extended Data Fig. 7 Performance of human participants in all curricula.
a. Four groups of human participants (Exp. 3, N = 50 per group) were exposed to a composite distribution (Pc = 0.5, αs = 2) with different training curricula, that is, different block orders, denoted C1 to C4 (‘uniform’, α = 0; ‘skewed’, αs = 2). b. Performance during training per curriculum. c. Double learning index per curriculum. n.s. p > 0.05, * p < 0.05, ** p < 0.01, *** p < 0.001. d. Training and test performances for humans per curriculum. A curriculum that promotes learning first in-context and then in-weights improves the in-context performance without impairing in-weights learning. Small dots are individuals, large dots are group average. n.s. p > 0.05, * p < 0.05, ** p < 0.01, *** p < 0.001.
Extended Data Fig. 8 Cursor trajectories and performances of participants in all test blocks (Exp. 4, N = 20 per group).
a. Trajectories for participants trained on a uniform distribution (α = 0) in the (left) in-context test block, (middle) in-weights test block and (right) arbitrage block. b. Same for participants trained on a skewed distribution (α = 2). Trajectories were aligned trial-by-trial to a common frame where the target image is located on the top of the context circle. Small lines are individual average trajectories, diamonds are group average trajectories. c. Training and test performances. Small dots are individuals, large dots are group average.
Extended Data Fig. 9 Modelling results in a transitive inference task.
a. We replicated our modelling results in a distinct task probing transitive inference. As in the image–label association task (Fig. 1), we manipulated the distribution of the training data: under a uniform distribution (α = 0), all environments are equally likely; under a skewed distribution (α >> 0), some environments are more frequent. Each environment consisted of six images ordered along an underlying dimension. b. Example training trial. The context presented ten triplets, each comprising two images and a symbol, corresponding to all one-step comparisons within a given environment (for example, ‘image 4 > image 3’). The query consisted of a two-step comparison (for example, ‘image 4 ? image 2’), and the model had to select the correct relational symbol (‘>’ or ‘<’). c. Paradigm overview. During training, two learning strategies are available. The ‘in-context’ learning strategy consists in using the local comparisons given in the context to infer the correct relational symbol via transitive inference (for example, relying on ‘image 4 > image 3’ and ‘image 3 > image 2’ to infer ‘image 4 > image 2’). The ‘in-weights’ learning strategy consists in learning the associations between pairs of images and relational symbols in memory using the feedback. Test blocks were designed to probe which strategy(ies) the model is using. On in-context test blocks, images from novel environments (depicted in grey) were presented, such that the only way to be accurate is to use information from the context, a.k.a. the in-context strategy. On in-weights test blocks, a training pair (depicted in blue) was presented as the query pair but images from novel environments (depicted in grey) were presented in the context, such that the only way to be accurate is to use information stored in memory, a.k.a. the in-weights strategy. On ‘arbitrage’ test blocks, a trained environment was presented but the order of the images was reversed (for example, ‘image 4 < image 3’). d. Training and test performances for transformers (N = 30 per training data distribution). Small dots are individual transformers, large dots are group average.
Extended Data Fig. 10 Performance of transformers with varying architecture sizes.
Scatter plots of the in-context vs in-weights test performances for transformers with varying numbers of layers, numbers of heads per layer and training distributions. Each dot represents a model trained with a specific number of layers, attention heads and training data distribution. Dot colour indicates the α exponent of the training distribution. Dotted lines indicate chance-level performance.
Extended Data Fig. 11 Performance of transformers with interleaved MLP with varying architecture sizes.
The MLP blocks consist of two dense layers with a ReLU activation, followed by a residual connection and layer normalization. Scatter plots of the in-context vs in-weights test performances for transformers with varying numbers of layers, numbers of heads per layer and training distributions. Each dot represents a model trained with a specific number of layers, attention heads and training data distribution. Dot colour indicates the α exponent of the training distribution. Dotted lines indicate chance-level performance.
Extended Data Fig. 12 Similarity score with respect to idealised attention patterns.
(left) Similarity score between observed attention patterns (N = 10 transformers per training distribution) and idealised attention patterns performing in-context learning. (right) Same with idealised attention patterns performing in-weights learning. The similarity score was a dot product normalised by the ℓ1-norm of the idealised head. Models trained on α < 1 were similar to in-context learning heads, while models trained on α > 1 were similar to in-weights learning heads. Results were less clear for in-weights learning head #1 because these heads tended to have more diverse patterns (attention spread to all tokens, or restricted to some tokens, and most of the time restricted to the last token).
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Pesnot Lerousseau, J., Summerfield, C. Shared sensitivity to data distribution during learning in humans and transformer networks. Nat Hum Behav (2025). https://doi.org/10.1038/s41562-025-02359-3
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41562-025-02359-3