Extended Data Fig. 10: Performance of transformers with varying architecture sizes.

From: Shared sensitivity to data distribution during learning in humans and transformer networks

Scatter plots of in-context versus in-weights test performance for transformers with varying numbers of layers, varying numbers of attention heads per layer, and varying training distributions. Each dot represents a model trained with a specific number of layers, number of attention heads, and training data distribution. Dot color indicates the α exponent of the training distribution. Dotted lines indicate chance-level performance.
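The caption does not define how the α exponent parameterizes the training distribution; a minimal sketch, assuming a rank-frequency (Zipfian) class distribution in which the probability of the k-th most frequent class is proportional to k^(−α), is given below. The function name, class count, and α values are illustrative assumptions, not the authors' code.

```python
import numpy as np

def zipfian_class_probabilities(num_classes: int, alpha: float) -> np.ndarray:
    """Rank-frequency (Zipfian) distribution: p(k) ∝ k**(-alpha).

    alpha = 0 gives a uniform distribution; larger alpha concentrates
    probability mass on the most frequent (lowest-rank) classes.
    """
    ranks = np.arange(1, num_classes + 1, dtype=float)
    weights = ranks ** (-alpha)
    return weights / weights.sum()

# Illustrative use: draw training classes under distributions of varying alpha.
rng = np.random.default_rng(0)
for alpha in (0.0, 1.0, 2.0):  # hypothetical sweep over skew parameters
    probs = zipfian_class_probabilities(num_classes=1000, alpha=alpha)
    sample = rng.choice(1000, size=5, p=probs)
    print(f"alpha={alpha}: {sample}")
```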