Table 1 Downstream task results

From: Analog in-memory computing attention mechanism for fast and energy-efficient large language models

 

| Model | ARC-E (acc) | ARC-C (acc) | WinoGrande (acc) | HellaSwag (acc) | LAMBADA (ppl) | LAMBADA (acc) | PIQA (acc) | WikiText-2 (ppl) | Average (acc) | Average (ppl) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Public software model | 43.81 | 22.70 | 51.62 | 31.14 | 35.15 | 45.96 | 62.89 | 37.37 | 43.02 | 36.26 |
| Software model trained from scratch | 42.34 | 23.46 | 50.20 | 29.73 | 46.39 | 41.56 | 61.48 | 41.25 | 41.46 | 43.82 |
| Linear hardware model | 42.80 | 23.46 | 52.41 | 30.31 | 51.69 | 38.10 | 61.21 | 39.79 | 41.38 | 45.74 |
| Nonlinear hardware model with adaptation | 42.09 | 22.87 | 50.51 | 30.10 | 76.59 | 31.61 | 61.53 | 42.34 | 39.79 | 59.47 |
| Nonlinear hardware model with adaptation and fine-tuning | 43.94 | 22.78 | 51.14 | 30.18 | 43.08 | 40.16 | 62.62 | 39.97 | 41.80 | 41.52 |
| Public software model-XL | 58.29 (+14.48) | 28.50 (+5.80) | 58.33 (+6.71) | 50.89 (+19.75) | 9.68 (−25.47) | 63.87 (+17.91) | 70.84 (+7.95) | 20.38 (−16.99) | 55.12 (+12.10) | 15.03 (−21.23) |
| Software model trained from scratch-XL | 53.82 (+11.48) | 25.76 (+2.30) | 53.75 (+3.55) | 42.54 (+12.81) | 14.82 (−31.57) | 56.33 (+14.77) | 68.71 (+7.23) | 24.98 (−16.27) | 50.15 (+8.69) | 19.90 (−23.92) |
| Linear hardware model-XL | 54.08 (+11.28) | 27.47 (+4.01) | 57.93 (+5.52) | 45.51 (+15.20) | 12.32 (−39.37) | 58.54 (+20.44) | 68.01 (+6.80) | 23.26 (−16.53) | 51.92 (+10.54) | 17.79 (−27.95) |
| Nonlinear hardware model-XL | 53.79 (+9.85) | 27.30 (+4.52) | 54.70 (+3.56) | 46.70 (+16.52) | 12.17 (−30.91) | 59.48 (+19.32) | 68.17 (+5.55) | 22.29 (−17.68) | 51.69 (+9.89) | 17.23 (−24.29) |
  1. The metrics are the percentage of accurate word predictions (acc), and the perplexity (ppl), a measure of prediction uncertainty. The last two columns average the accuracy results and the perplexity results for each model, respectively. Values in parentheses (±x) indicate the improvement of XL models relative to their smaller counterparts (GPT-2-XL results − GPT-2 results). Rows in bold correspond to our results.
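The arithmetic described in the footnote can be checked with a short script. The following is a minimal sketch, not code from the article: the names `COLUMNS`, `averages` and `xl_deltas` are hypothetical, and the only data used are the two "Public software model" rows copied from Table 1. It assumes the Average columns are plain arithmetic means over the six accuracy columns and the two perplexity columns, which matches the published values for these rows.

```python
# Column layout of Table 1 as (task, metric), excluding the two Average columns.
COLUMNS = [
    ("ARC-E", "acc"), ("ARC-C", "acc"), ("WinoGrande", "acc"),
    ("HellaSwag", "acc"), ("LAMBADA", "ppl"), ("LAMBADA", "acc"),
    ("PIQA", "acc"), ("WikiText-2", "ppl"),
]

# Values copied from the "Public software model" rows of Table 1.
public_small = [43.81, 22.70, 51.62, 31.14, 35.15, 45.96, 62.89, 37.37]
public_xl = [58.29, 28.50, 58.33, 50.89, 9.68, 63.87, 70.84, 20.38]


def averages(row):
    """Return (mean accuracy, mean perplexity) for one table row."""
    acc = [v for (_, metric), v in zip(COLUMNS, row) if metric == "acc"]
    ppl = [v for (_, metric), v in zip(COLUMNS, row) if metric == "ppl"]
    return round(sum(acc) / len(acc), 2), round(sum(ppl) / len(ppl), 2)


def xl_deltas(xl_row, small_row):
    """Return per-column differences (XL minus small), rounded to 2 decimals."""
    return [round(x - s, 2) for x, s in zip(xl_row, small_row)]


print(averages(public_small))               # matches the Average columns: 43.02, 36.26
print(averages(public_xl))                  # matches the Average columns: 55.12, 15.03
print(xl_deltas(public_xl, public_small))   # matches the values in parentheses
```

Because the inputs are already rounded to two decimals, a recomputed average can differ from the published one in the last digit for other rows. Applying the same subtraction to the "Nonlinear hardware model-XL" row reproduces its parenthesized values when the "with adaptation and fine-tuning" row is used as the smaller counterpart.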