Extended Data Table 1 Hyperparameters underlying the results presented in Fig. 1

From: Efficient scaling of large language models with mixture of experts and 3D analog in-memory computing

| Parameter | Dense | MoE |
| --- | --- | --- |
| Num. decoder layers | 48 | 48 |
| Tiles | {98, 194, 581, 773} | {98, 194, 581, 773} |
| Tiers | 256 | 256 |
| Tier shape | (512, 512) | (512, 512) |
| Num. DPUs | 5 | 5 |
| Num. MHA Units | 4 | 4 |
| Num. sequences (batch size) | 4 | 4 |
| Context length | 64 | 64 |
| Num. generated tokens per input sequence | 8 | 8 |
| Dense Model-Specific | | |
| d_model | {256, 512, 768, 1'024} | – |
| n_head | 8 | – |
| Vocab. size | 50'000 | – |
| Share embedding | False | – |
| MoE Model-Specific | | |
| d_model | – | {256, 512, 768, 1'024} |
| n_head | – | 8 |
| Vocab. size | – | 50'000 |
| MoE layer frequency | – | 1 |
| Num. experts | – | {4, 8, 16, 64, 128} |
| top-k | – | {1, 2, 3, 4} |
| Share embedding | – | False |
| Additional Settings | | |
| Map strategy | Greedy (in-order) | Greedy (in-order) |
| Split FFN | True | True |
| Stack embedding | False | False |

1. Hyperparameters used for the inference-time comparison between dense and MoE models.
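For reference only, the listing below is a minimal sketch (not the authors' code or simulator API) that collects the same hyperparameters as plain Python dictionaries, assuming the Dense and MoE models share the common rows of the table and differ only in the model-specific ones. All names are illustrative; the braced entries are the swept values from the table.

```python
# Hypothetical configuration sketch of Extended Data Table 1 (illustrative names,
# not the authors' implementation).

SHARED = {
    "num_decoder_layers": 48,
    "tiles": [98, 194, 581, 773],            # swept tile counts
    "tiers": 256,
    "tier_shape": (512, 512),
    "num_dpus": 5,
    "num_mha_units": 4,
    "batch_size": 4,                          # num. sequences
    "context_length": 64,
    "generated_tokens_per_sequence": 8,
    # additional settings
    "map_strategy": "greedy (in-order)",
    "split_ffn": True,
    "stack_embedding": False,
}

DENSE = {
    **SHARED,
    "d_model": [256, 512, 768, 1024],         # swept model widths
    "n_head": 8,
    "vocab_size": 50_000,
    "share_embedding": False,
}

MOE = {
    **DENSE,                                  # MoE reuses the dense settings above
    "moe_layer_frequency": 1,
    "num_experts": [4, 8, 16, 64, 128],       # swept expert counts
    "top_k": [1, 2, 3, 4],                    # swept routing top-k values
}
```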