Extended Data Fig. 1: Example learning curves for multiple transformer networks.
From: Shared sensitivity to data distribution during learning in humans and transformer networks

Accuracy curves for multiple example transformer networks trained on different training distributions: uniform (α = 0, top row), moderately skewed (α = 1, middle row) and skewed (α = 2, bottom row). In-context test performance and arbitrage test performance (with respect to in-context learning) strongly overlap. Over the course of training, in-context test performance trades off against in-weights test performance.