Fig. 1: Paradigm.
From: Shared sensitivity to data distribution during learning in humans and transformer networks

a, We studied learning in an image–label association task by manipulating the distribution of the training data. Under a uniform distribution (α = 0), all images are equally likely to appear. In skewed distributions, some images are more likely than others (α > 0).

b, Example training trial. In a given trial, agents were asked to select the label corresponding to the query image, presented at the centre of the screen. Seven images and seven labels were also presented in a surrounding circle (the context). During training, a copy of the query image (the target image) was always present in the context. The correct label was always located three steps clockwise relative to the target image (the target label).

c, Paradigm overview. During training, two learning strategies are available. The in-context learning strategy consists of using the context to infer the correct label—that is, applying the ‘+3 steps’ rule. The in-weights learning strategy consists of learning each image–label association in memory using the feedback. Test blocks were designed to probe which strategy (or strategies) the agent is using. On in-context test blocks, novel images (depicted in grey) were presented, such that the only way to be accurate was to use information from the context—that is, the in-context strategy. On in-weights test blocks, a training image (depicted in blue) was presented as the query image, but novel images (depicted in grey) were presented in the context, such that the only way to be accurate was to use information stored in memory—that is, the in-weights strategy. On arbitrage test blocks, a training image was presented as the query image, and the context indicated a different label than the one presented during training. This was done to reveal the dominant strategy used by the agent when presented with conflicting evidence for the two strategies.
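The trial structure of panels a–c can be sketched as follows. This is a minimal illustration, not the paper's exact generator: the rank-frequency form p(k) ∝ (k + 1)^(−α) for the skewed distribution, and the rule that every context image's paired label sits three steps clockwise from it (the caption only guarantees this for the target image), are assumptions made here for concreteness.

```python
import random

def skewed_probs(n_items, alpha):
    """Assumed rank-frequency distribution p(k) ∝ (k+1)^(-alpha); alpha=0 is uniform."""
    weights = [(k + 1) ** -alpha for k in range(n_items)]
    total = sum(weights)
    return [w / total for w in weights]

def make_trial(images, labels, alpha, rng):
    """Build one training trial: 7-slot context circle, query image, correct label."""
    probs = skewed_probs(len(images), alpha)
    query = rng.choices(images, weights=probs, k=1)[0]
    # The context always contains a copy of the query (the target image)
    ctx_images = [query] + rng.sample([i for i in images if i != query], 6)
    rng.shuffle(ctx_images)
    # Assumption: each context image's label is placed 3 steps clockwise from it
    ctx_labels = [None] * 7
    for pos, img in enumerate(ctx_images):
        ctx_labels[(pos + 3) % 7] = labels[img]
    # The correct answer is the label 3 steps clockwise from the target image
    target_pos = ctx_images.index(query)
    correct = ctx_labels[(target_pos + 3) % 7]
    return ctx_images, ctx_labels, query, correct
```

With α = 0 the query is drawn uniformly; with large α a few images dominate training, which is the manipulation panel a describes.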
d, A minimal transformer model, composed of two attention-only layers with one attention head each, was trained on the task. e, Accuracy curves for two example transformers trained on two different training distributions, uniform (α = 0, left) and skewed (α = 4, right). When α < 1, transformers learn in-context but not in-weights. Conversely, when α > 1, transformers learn in-weights but not in-context.
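The "two attention-only layers, one head each" architecture of panel d can be sketched as a forward pass. All specifics here (embedding width, sequence layout, random parameters, residual connections, absence of LayerNorm and MLPs) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def attention_layer(x, Wq, Wk, Wv):
    """Single-head self-attention with a residual connection; no MLP block."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Numerically stable softmax over the sequence axis
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return x + attn @ v

rng = np.random.default_rng(0)
d_model = 32                              # hypothetical embedding width
tokens = rng.normal(size=(15, d_model))   # e.g. 7 images + 7 labels + query
layers = [tuple(rng.normal(scale=d_model**-0.5, size=(d_model, d_model))
                for _ in range(3))
          for _ in range(2)]              # two layers, one head each

h = tokens
for Wq, Wk, Wv in layers:
    h = attention_layer(h, Wq, Wk, Wv)
readout = h[-1]                           # predict the label from the query position
```

Depending on how such a model is trained, the second layer can attend from the query back to context positions (supporting the in-context ‘+3 steps’ rule), or the prediction can rely on weight-stored associations, which is the contrast panel e measures.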