Table 2 Average values of the layer-averaged Log Norm and Weighted Alpha metrics for pretrained OpenAI GPT and GPT2 models.

From: Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data

| Series | # | \(\langle \log \Vert \mathbf{W} \Vert_F \rangle\) | \(\langle \log \Vert \mathbf{W} \Vert_\infty \rangle\) | \(\hat{\alpha}\) | \(\langle \log \Vert \mathbf{X} \Vert_\alpha^\alpha \rangle\) |
| --- | --- | --- | --- | --- | --- |
| GPT | 49 | 1.64 | 1.72 | 7.01 | 7.28 |
| GPT2-small | 49 | 2.04 | 2.54 | 9.62 | 9.87 |
| GPT2-medium | 98 | 2.08 | 2.58 | 9.74 | 10.01 |
| GPT2-large | 146 | 1.85 | 1.99 | 7.67 | 7.94 |
| GPT2-xl | 194 | 1.86 | 1.92 | 7.17 | 7.51 |

  1. Column # refers to the number of layers treated. Averages do not include the first embedding layer(s) because they are not (implicitly) normalized. GPT has 12 layers, with 4 Multi-head Attention Blocks, giving 48 layer Weight Matrices W. Each Block has 2 components, the Self-Attention (attn) and the Projection (proj) matrices. The self-attention matrices are larger, of dimension (2304 × 768) or (3072 × 768). The projection layer concatenates the self-attention results into a vector (of dimension 768). This gives 50 large matrices. Because GPT and GPT2 are trained on different data sets, the initial Embedding matrices differ in shape. GPT has initial Token and Positional Embedding layers of dimension (40478 × 768) and (512 × 768), respectively, whereas GPT2 has input Embeddings of shape (50257 × 768) and (1024 × 768), respectively. The OpenAI GPT2 (English) models are GPT2-small, GPT2-medium, GPT2-large, and GPT2-xl, having 12, 24, 36, and 48 layers, respectively, with increasingly larger weight matrices.
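The column headers abbreviate layer averages over the N weight matrices counted in column #. One consistent reading (the precise definitions are given in the main text; here \(\mathbf{X}_{l} = \mathbf{W}_{l}^{T}\mathbf{W}_{l}\) is the layer correlation matrix, \(\lambda_{max,l}\) its largest eigenvalue, and \(\alpha_{l}\) the power-law exponent fit to its eigenvalue spectrum) is

\(\langle \log \Vert \mathbf{W} \Vert_F \rangle = \frac{1}{N}\sum_{l=1}^{N} \log \Vert \mathbf{W}_{l} \Vert_F\),  \(\langle \log \Vert \mathbf{W} \Vert_\infty \rangle = \frac{1}{N}\sum_{l=1}^{N} \log \Vert \mathbf{W}_{l} \Vert_\infty\),

\(\hat{\alpha} = \frac{1}{N}\sum_{l=1}^{N} \alpha_{l}\,\log \lambda_{max,l}\),  \(\langle \log \Vert \mathbf{X} \Vert_\alpha^\alpha \rangle = \frac{1}{N}\sum_{l=1}^{N} \log \sum_{i} \lambda_{i,l}^{\alpha_{l}}\).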
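The footnote's counting and averaging conventions can be made concrete with a short script. The sketch below is a minimal illustration, not the authors' pipeline: it assumes the Hugging Face transformers GPT2-small checkpoint ("gpt2"), assumes base-10 logarithms, and averages the log Frobenius and log spectral norms over all 2-D weight matrices other than the Token and Positional Embeddings. The exact set of matrices treated (column #) may differ slightly from this selection, and the \(\hat{\alpha}\) and \(\langle \log \Vert \mathbf{X} \Vert_\alpha^\alpha \rangle\) columns additionally require power-law fits to each layer's eigenvalue spectrum (e.g. with the authors' weightwatcher package), which are not computed here.

```python
# Minimal sketch (not the authors' code): approximate the first two columns of
# Table 2 for GPT2-small, assuming the Hugging Face "gpt2" checkpoint and
# base-10 logarithms.
import numpy as np
import torch
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")  # GPT2-small

log_fro, log_spec = [], []
for name, param in model.named_parameters():
    # Skip 1-D parameters (biases, LayerNorm) and the Token (wte) / Positional
    # (wpe) Embedding layers, which the footnote says are excluded from averages.
    if param.ndim != 2 or name.startswith(("wte", "wpe")):
        continue
    W = param.detach()
    log_fro.append(torch.log10(torch.linalg.matrix_norm(W, ord="fro")).item())
    log_spec.append(torch.log10(torch.linalg.matrix_norm(W, ord=2)).item())

print(f"matrices treated: {len(log_fro)}")          # cf. column # in Table 2
print(f"<log ||W||_F>   ~ {np.mean(log_fro):.2f}")  # cf. 2.04 for GPT2-small
print(f"<log ||W||_inf> ~ {np.mean(log_spec):.2f}") # cf. 2.54 for GPT2-small
```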