Extended Data Table 6 Hyperparameters used for training FP-32 dense and MoE models
From: Efficient scaling of large language models with mixture of experts and 3D analog in-memory computing
| Parameter | MoE FP-32 | MoE HW-aware | Dense FP-32 | Dense HW-aware |
|---|---|---|---|---|
| Learning rate | 0.00025 | 0.00025 | 0.00025 | 0.00025 |
| Weight decay | 0.0 | 0.0 | 0.0 | 0.0 |
| Seed | 0 | 0 | 0 | 0 |
| FP-32 | True | False | True | False |
| Batch size | 64 | 56 | 64 | 56 |
| Max. gradient norm | 0.25 | 0.25 | 0.25 | 0.25 |
| Dropout | 0.1 | 0.1 | 0.1 | 0.1 |
| d_model | 410 | 410 | 410 | 410 |
| d_ffn | 2,053 | 2,053 | 2,053 | 2,053 |
| Num. decoder layers | 16 | 16 | 16 | 16 |
| Num. attention heads | 10 | 10 | 10 | 10 |
| Context length | 512 | 512 | 512 | 512 |
| Warmup steps | 0 | 0 | 0 | 0 |
| Num. experts | 16 | 16 | – | – |
| Expert size | 128 | 128 | – | – |
| top-k | 4 | 4 | – | – |
| MoE dropout | 0.0 | 0.0 | – | – |
| Router activation | Sigmoid | Sigmoid | – | – |
| Activation after top-k | False | False | – | – |
| MoE bias | False | False | – | – |
| Expert dropout | 0.0 | 0.0 | – | – |
| Weight std scale | 1.0 | 1.0 | – | – |
| Routing regularization | 0.001 | 0.001 | – | – |
| Num. MoE layers | 16 | 16 | – | – |
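For reference, the table above can be captured as a configuration object. The sketch below is a minimal, hypothetical illustration (the field and class names are not taken from the authors' code base); defaults follow the MoE FP-32 column, and the HW-aware variant is obtained by overriding the two fields that differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MoETrainingConfig:
    # Optimizer and training schedule (values from the "MoE FP-32" column).
    lr: float = 2.5e-4
    weight_decay: float = 0.0
    seed: int = 0
    fp32: bool = True            # False for the HW-aware runs
    batch_size: int = 64         # 56 for the HW-aware runs
    max_grad_norm: float = 0.25
    dropout: float = 0.1

    # Transformer backbone (shared with the dense baseline).
    d_model: int = 410
    d_ffn: int = 2053
    num_decoder_layers: int = 16
    num_attention_heads: int = 10
    context_length: int = 512
    warmup_steps: int = 0

    # MoE-specific settings ("–" in the dense columns, i.e. unused there).
    num_experts: Optional[int] = 16
    expert_size: Optional[int] = 128
    top_k: Optional[int] = 4
    moe_dropout: Optional[float] = 0.0
    router_activation: Optional[str] = "sigmoid"
    activation_after_top_k: bool = False
    moe_bias: bool = False
    expert_dropout: Optional[float] = 0.0
    weight_std_scale: Optional[float] = 1.0
    routing_regularization: Optional[float] = 0.001
    num_moe_layers: Optional[int] = 16


# Hardware-aware MoE configuration: only FP-32 precision and batch size change.
moe_hw_aware = MoETrainingConfig(fp32=False, batch_size=56)
```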