Extended Data Fig. 11: Performance of transformers with interleaved MLP with varying architecture sizes.
From: Shared sensitivity to data distribution during learning in humans and transformer networks

The MLP blocks consist of two dense layers with a ReLU activation, followed by a residual connection and layer normalization. Scatter plots show the in-context vs. in-weights test performance for transformers with varying numbers of layers, varying numbers of attention heads per layer, and varying training distributions. Each dot represents a model trained with a specific number of layers, number of attention heads, and training data distribution. Dot color indicates the α exponent of the training distribution. Dotted lines indicate chance-level performance.
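As a rough illustration of the block structure described above (not the authors' implementation), a minimal PyTorch-style sketch is given below; the module name MLPBlock, the widths d_model and d_hidden, and the use of PyTorch are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    """Illustrative sketch: two dense layers with a ReLU activation,
    followed by a residual connection and layer normalization."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)   # first dense layer
        self.fc2 = nn.Linear(d_hidden, d_model)   # second dense layer, back to model width
        self.norm = nn.LayerNorm(d_model)          # layer normalization after the residual

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.fc2(torch.relu(self.fc1(x)))      # two dense layers with ReLU in between
        return self.norm(x + h)                    # residual connection, then layer norm
```

In a transformer with interleaved MLPs, one such block would follow each attention block; the exact placement of normalization (pre- vs. post-norm) is not specified here and is a design choice in this sketch.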