Extended Data Table 1 Hyperparameters underlying the results presented in Fig. 1
From: Efficient scaling of large language models with mixture of experts and 3D analog in-memory computing
| Parameter | Dense | MoE |
|---|---|---|
| Num. decoder layers | 48 | 48 |
| Tiles | {98, 194, 581, 773} | {98, 194, 581, 773} |
| Tiers | 256 | 256 |
| Tier shape | (512, 512) | (512, 512) |
| Num. DPUs | 5 | 5 |
| Num. MHA units | 4 | 4 |
| Num. sequences (batch size) | 4 | 4 |
| Context length | 64 | 64 |
| Num. generated tokens per input sequence | 8 | 8 |
| **Dense model-specific** | | |
| d_model | {256, 512, 768, 1'024} | – |
| n_head | 8 | – |
| Vocab. size | 50'000 | – |
| Share embedding | False | – |
| **MoE model-specific** | | |
| d_model | – | {256, 512, 768, 1'024} |
| n_head | – | 8 |
| Vocab. size | – | 50'000 |
| MoE layer frequency | – | 1 |
| Num. experts | – | {4, 8, 16, 64, 128} |
| top-k | – | {1, 2, 3, 4} |
| Share embedding | – | False |
| **Additional settings** | | |
| Map strategy | Greedy (in-order) | Greedy (in-order) |
| Split FFN | True | True |
| Stack embedding | False | False |
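
For readers who want to enumerate the sweep behind Fig. 1, the settings above can be written down as a plain configuration grid. The sketch below is a minimal illustration only: the class and variable names (`SharedConfig`, `dense_runs`, `moe_runs`) are hypothetical, and the pairing of each d_model with a tile count, as well as the Cartesian product over the MoE knobs, are assumptions about how the grid was enumerated, not the authors' simulation code.

```python
from dataclasses import dataclass
from itertools import product
from typing import Tuple


@dataclass
class SharedConfig:
    """Settings common to the dense and MoE runs in the table (hypothetical names)."""
    num_decoder_layers: int = 48
    tiers_per_tile: int = 256
    tier_shape: Tuple[int, int] = (512, 512)
    num_dpus: int = 5
    num_mha_units: int = 4
    batch_size: int = 4
    context_length: int = 64
    generated_tokens_per_sequence: int = 8
    n_head: int = 8
    vocab_size: int = 50_000
    share_embedding: bool = False
    map_strategy: str = "greedy-in-order"
    split_ffn: bool = True
    stack_embedding: bool = False


# Swept values taken directly from the table. Pairing d_model with a tile count
# and taking the product over MoE knobs is an assumption, not the authors' script.
D_MODEL = (256, 512, 768, 1_024)
NUM_TILES = (98, 194, 581, 773)
NUM_EXPERTS = (4, 8, 16, 64, 128)
TOP_K = (1, 2, 3, 4)

dense_runs = [{"d_model": d, "tiles": t} for d, t in zip(D_MODEL, NUM_TILES)]

moe_runs = [
    {"d_model": d, "tiles": t, "num_experts": e, "top_k": k, "moe_layer_frequency": 1}
    for (d, t), e, k in product(zip(D_MODEL, NUM_TILES), NUM_EXPERTS, TOP_K)
]

print(f"{len(dense_runs)} dense runs and {len(moe_runs)} MoE runs share {SharedConfig()}")
```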