Fig. 4: Humans, but not transformers, benefit from a training curriculum promoting in-context learning first (Experiment 3).
From: Shared sensitivity to data distribution during learning in humans and transformer networks

a, Training curricula based on the composite distribution (Pc = 0.5, αs > 0). The first curriculum (C1) presented maximally diverse exemplars in the first half of training (the uniform part of the composite distribution, α = 0) and then more redundant exemplars in the second half (the skewed part of the composite distribution, αs > 0). The second curriculum (C2) reversed this ordering.
b, Two groups of human participants (Exp. 3, n = 50 per group) were exposed to the two training curricula, C1 and C2 (composite distribution, Pc = 0.5, αs = 2). Participants trained on C1 performed better on in-context trials than participants trained on C2 (logistic regression with the group as a fixed effect, β = −2.635 ± 0.784, P = 0.019, BF = 5.3, ‘substantial’ evidence, P value Bonferroni corrected). They also used the in-context strategy more often in arbitrage trials (β = −2.676 ± 0.711, P = 0.004, BF = 12.1, ‘strong’ evidence, P value Bonferroni corrected). However, the two groups performed similarly on in-weights trials (β = 0.096 ± 0.428, P = 1.0, BF = 0.019, ‘strong’ evidence, P value Bonferroni corrected). The small dots indicate data from individuals; the large dots indicate group averages. NS, not significant (P > 0.05); *P < 0.05; **P < 0.01; prer., preregistered contrasts.
c, Double learning index of human participants (linear regression with the group as a fixed effect, β = −0.162 ± 0.076, P = 0.036, BF = 0.969). *P < 0.05. The small dots indicate data from individuals; the large dots indicate group averages.
d, Double learning index for transformers trained on the C1 curriculum (left) and the C2 curriculum (right). Transformers were trained with different values of αs for the skewed part of the composite distribution.
e, Test performance over the course of training for transformers trained on C1 (red) and C2 (blue). The bold lines indicate group averages (n = 20 transformers per curriculum). The arrows were added manually to emphasize the direction of the trajectories.
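The ordering described in panel a can be illustrated with a minimal sketch. It is not the authors' code. It assumes that the skewed part follows a Zipf-like rank-frequency law, p(k) ∝ 1/k^α, so that α = 0 gives the uniform part, and that each part fills exactly half of the training trials. The function names `zipf_probs` and `make_curriculum`, and the item and trial counts in the usage note, are hypothetical.

```python
import numpy as np


def zipf_probs(n_items, alpha):
    """Zipf-like probabilities over item ranks (assumed form: p(k) ~ 1 / k**alpha).

    alpha = 0 yields a uniform distribution; larger alpha yields more skew.
    """
    ranks = np.arange(1, n_items + 1)
    weights = ranks ** (-float(alpha))
    return weights / weights.sum()


def make_curriculum(n_items, n_trials, alpha_s, order="C1", seed=0):
    """Sample a sequence of item indices for one training curriculum.

    C1: uniform (diverse) half first, then skewed (redundant) half.
    C2: the same two halves in the reverse order.
    """
    rng = np.random.default_rng(seed)
    half = n_trials // 2
    uniform_part = rng.choice(n_items, size=half, p=zipf_probs(n_items, 0.0))
    skewed_part = rng.choice(
        n_items, size=n_trials - half, p=zipf_probs(n_items, alpha_s)
    )
    if order == "C1":
        return np.concatenate([uniform_part, skewed_part])
    if order == "C2":
        return np.concatenate([skewed_part, uniform_part])
    raise ValueError("order must be 'C1' or 'C2'")
```

For example, `make_curriculum(n_items=100, n_trials=1000, alpha_s=2.0, order="C1")` returns 1,000 item indices whose first 500 entries are drawn uniformly and whose last 500 are concentrated on a few low-rank items. With the same seed, `order="C2"` returns the same two halves swapped, which isolates the effect of ordering from the effect of the sampled items.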