Extended Data Fig. 12: Similarity score with respect to idealised attention patterns.
From: Shared sensitivity to data distribution during learning in humans and transformer networks

(left) Similarity score between observed attention patterns (N = 10 transformers per training distribution) and idealised attention patterns performing in-context learning. (right) Same, but with idealised attention patterns performing in-weights learning. The similarity score was a dot product normalised by the ℓ1-norm of the idealised head. Models trained on distributions with α < 1 were similar to in-context learning heads, whereas models trained on α > 1 were similar to in-weights learning heads. Results were less clear for in-weights learning head #1 because these heads tended to show more diverse patterns (attention spread across all tokens, restricted to a subset of tokens, or, most often, restricted to the last token).
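The similarity score described above can be sketched as follows; this is a minimal illustration, not the authors' code, and the function name, array shapes, and toy values are illustrative assumptions:

```python
import numpy as np

def attention_similarity(observed, idealised):
    """Similarity between an observed attention pattern and an
    idealised one: their dot product normalised by the l1-norm of
    the idealised head (hypothetical helper; shapes are illustrative)."""
    observed = np.asarray(observed, dtype=float).ravel()
    idealised = np.asarray(idealised, dtype=float).ravel()
    return observed @ idealised / np.abs(idealised).sum()

# Toy example over a 4-token context: an idealised head that
# attends only to the last token, and an observed pattern that
# mostly attends there.
ideal_last_token = np.array([0.0, 0.0, 0.0, 1.0])
observed = np.array([0.05, 0.05, 0.10, 0.80])
score = attention_similarity(observed, ideal_last_token)  # 0.8
```

With this normalisation, an observed pattern identical to the idealised one scores 1 whenever the idealised head's weights are non-negative and sum to 1, and scores fall as attention mass drifts away from the idealised positions.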