Fig. 8: Histograms of PL exponents and Log Alpha Norm for weight matrices from models of different sizes in the GPT2 architecture series.

(Plots omit the first two (embedding) layers because they are normalized differently, which gives anomalously large values.)