Extended Data Table 6 Hyperparameters used for training FP-32 dense and MoE models

From: Efficient scaling of large language models with mixture of experts and 3D analog in-memory computing

| Parameter               | MoE FP-32 | MoE HW-aware | Dense FP-32 | Dense HW-aware |
|-------------------------|-----------|--------------|-------------|----------------|
| LR                      | 0.00025   | 0.00025      | 0.00025     | 0.00025        |
| Weight decay            | 0.0       | 0.0          | 0.0         | 0.0            |
| Seed                    | 0         | 0            | 0           | 0              |
| FP-32                   | True      | False        | True        | False          |
| Batch size              | 64        | 56           | 64          | 56             |
| Max. gradient norm      | 0.25      | 0.25         | 0.25        | 0.25           |
| Dropout                 | 0.1       | 0.1          | 0.1         | 0.1            |
| d_model                 | 410       | 410          | 410         | 410            |
| d_ffn                   | 2,053     | 2,053        | 2,053       | 2,053          |
| Num. decoder layers     | 16        | 16           | 16          | 16             |
| Num. attention heads    | 10        | 10           | 10          | 10             |
| Context length          | 512       | 512          | 512         | 512            |
| Warmup steps            | 0         | 0            | 0           | 0              |
| Num. experts            | 16        | 16           | –           | –              |
| Expert size             | 128       | 128          | –           | –              |
| top-k                   | 4         | 4            | –           | –              |
| MoE dropout             | 0.0       | 0.0          | –           | –              |
| Router activation       | Sigmoid   | Sigmoid      | –           | –              |
| Activation after top-k  | False     | False        | –           | –              |
| MoE bias                | False     | False        | –           | –              |
| Expert dropout          | 0.0       | 0.0          | –           | –              |
| Weight std scale        | 1.0       | 1.0          | –           | –              |
| Routing regularization  | 0.001     | 0.001        | –           | –              |
| Num. MoE layers         | 16        | 16           | –           | –              |

Hyperparameters for the training procedure of the models studied in Fig. 6.
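
For reference, the hyperparameters above can be grouped into a single configuration object. The sketch below is illustrative only: the class and field names (TrainConfig, MoEConfig, and so on) are assumptions and do not reflect the authors' actual training code; only the numerical values are taken from the table.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Hyperparameters shared by the dense and MoE models (values from the table above).
    lr: float = 2.5e-4
    weight_decay: float = 0.0
    seed: int = 0
    fp32: bool = True            # False for the hardware-aware (HW-aware) runs
    batch_size: int = 64         # 56 for the HW-aware runs
    max_grad_norm: float = 0.25
    dropout: float = 0.1
    d_model: int = 410
    d_ffn: int = 2053
    num_decoder_layers: int = 16
    num_attention_heads: int = 10
    context_length: int = 512
    warmup_steps: int = 0


@dataclass
class MoEConfig:
    # MoE-specific hyperparameters (identical for the FP-32 and HW-aware MoE runs).
    num_experts: int = 16
    expert_size: int = 128
    top_k: int = 4
    moe_dropout: float = 0.0
    router_activation: str = "sigmoid"
    activation_after_top_k: bool = False
    moe_bias: bool = False
    expert_dropout: float = 0.0
    weight_std_scale: float = 1.0
    routing_regularization: float = 0.001
    num_moe_layers: int = 16


# Example: configuration corresponding to the MoE HW-aware column.
moe_hw_aware = (TrainConfig(fp32=False, batch_size=56), MoEConfig())
```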