Table 1 Downstream task results

	ARC-E	ARC-C	WinoGrande	HellaSwag	LAMBADA	LAMBADA	PIQA	WikiText-2	Average	Average
	acc ↑	acc ↑	acc ↑	acc ↑	ppl ↓	acc ↑	acc ↑	ppl ↓	acc ↑	ppl ↓
Public software model	43.81	22.70	51.62	31.14	35.15	45.96	62.89	37.37	43.02	36.26
Software model trained from scratch	42.34	23.46	50.20	29.73	46.39	41.56	61.48	41.25	41.46	43.82
Linear hardware model	42.80	23.46	52.41	30.31	51.69	38.10	61.21	39.79	41.38	45.74
Nonlinear hardware model with adaptation	42.09	22.87	50.51	30.10	76.59	31.61	61.53	42.34	39.79	59.47
Nonlinear hardware model with adaptation and fine-tuning	43.94	22.78	51.14	30.18	43.08	40.16	62.62	39.97	41.80	41.52
Public software model-XL	58.29 (+14.48)	28.50 (+5.80)	58.33 (+6.71)	50.89 (+19.75)	9.68 (−25.47)	63.87 (+17.91)	70.84 (+7.95)	20.38 (−16.99)	55.12 (+12.10)	15.03 (−21.23)
Software model trained from scratch-XL	53.82 (+11.48)	25.76 (+2.30)	53.75 (+3.55)	42.54 (+12.81)	14.82 (−31.57)	56.33 (+14.77)	68.71 (+7.23)	24.98 (−16.27)	50.15 (+8.69)	19.90 (−23.92)
Linear hardware model-XL	54.08 (+11.28)	27.47 (+4.01)	57.93 (+5.52)	45.51 (+15.20)	12.32 (−39.37)	58.54 (+20.44)	68.01 (+6.80)	23.26 (−16.53)	51.92 (+10.54)	17.79 (−27.95)
Nonlinear hardware model-XL	53.79 (+9.85)	27.30 (+4.52)	54.70 (+3.56)	46.70 (+16.52)	12.17 (−30.91)	59.48 (+19.32)	68.17 (+5.55)	22.29 (−17.68)	51.69 (+9.89)	17.23 (−24.29)

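The metrics are the percentage of accurate word predictions (acc) and the perplexity (ppl), a measure of prediction uncertainty; ↑ marks metrics where higher is better and ↓ marks metrics where lower is better. The last two columns give, for each model, the average of the accuracy results and the average of the perplexity results, respectively. Values in parentheses (±x) give the change of each XL model relative to its smaller counterpart (GPT-2-XL result − GPT-2 result), so a positive difference in accuracy and a negative difference in perplexity both indicate an improvement. The differences reported for the nonlinear hardware model-XL are consistent with the adapted and fine-tuned smaller model as the baseline. Rows in bold correspond to our results.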
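As a worked check of how the last two columns and the parenthesised differences can be derived, the sketch below recomputes them for the linear hardware model from the per-task values in Table 1. The names `base`, `xl`, `mean` and `summarize` are illustrative and do not come from the original work. The sketch assumes a plain unweighted mean over the six accuracy columns and over the two perplexity columns, which is consistent with the reported figures.

```python
# Per-task values copied from the "Linear hardware model" and
# "Linear hardware model-XL" rows of Table 1.
# Accuracy order: ARC-E, ARC-C, WinoGrande, HellaSwag, LAMBADA, PIQA.
# Perplexity order: LAMBADA, WikiText-2.
base = {
    "acc": [42.80, 23.46, 52.41, 30.31, 38.10, 61.21],
    "ppl": [51.69, 39.79],
}
xl = {
    "acc": [54.08, 27.47, 57.93, 45.51, 58.54, 68.01],
    "ppl": [12.32, 23.26],
}


def mean(values):
    """Unweighted arithmetic mean (assumed averaging scheme)."""
    return sum(values) / len(values)


def summarize(scores):
    """Return (average accuracy, average perplexity), rounded to 2 decimals."""
    return round(mean(scores["acc"]), 2), round(mean(scores["ppl"]), 2)


base_avg = summarize(base)  # (41.38, 45.74)
xl_avg = summarize(xl)      # (51.92, 17.79)

# Difference as defined in the caption: XL result minus smaller-model result.
delta = tuple(round(x - b, 2) for x, b in zip(xl_avg, base_avg))  # (10.54, -27.95)

print("base averages:", base_avg)
print("XL averages:  ", xl_avg)
print("XL - base:    ", delta)
```

The recomputed values agree with the Average columns of the two rows and with the differences (+10.54) and (−27.95) shown in parentheses.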
