Table 2 Experimental details with VGG-16 as the teacher network.

From: Counterclockwise block-by-block knowledge distillation for neural network compression

| Training config | VGG-16 on Tiny-ImageNet-200 | VGG-16 on CIFAR-10 |
|---|---|---|
| Base learning rate | 1e-3 | 2e-3 |
| Weight decay | 0.05 | 0.05 |
| Batch size | 48 | 100 |
| Training epochs (CBKD) | 150, 200, 200, 200 | 3, 8, 20, 20 |
| Learning rate schedule | Cosine decay | Cosine decay |
| Thaw training epochs | 300 | 20 |
| Warmup epochs | max(training epochs × 0.05, 1) | max(training epochs × 0.05, 1) |
| Training epochs (teacher) | 180 | 30 |
| Training epochs (student) | 300 | 30 |
| Training epochs (KD) | 300 | 30 |
| Training epochs (FitNets) | 100, 300 | 10, 20 |
| Training epochs (RKD) | 300 | 30 |
| Training epochs (DKD) | 300 | 30 |
| Training epochs (L-S-KD) | 300 | 30 |
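The warmup rule in the table, max(training epochs × 0.05, 1), together with the cosine-decay schedule, fixes the per-epoch learning rate. Below is a minimal Python sketch of one plausible reading of that schedule, using the Tiny-ImageNet-200 setting (base learning rate 1e-3, 300 student epochs). The table does not specify the warmup shape or the final learning rate, so the linear ramp and the decay toward zero are assumptions here, and `lr_at_epoch` is an illustrative helper, not a function from the paper.

```python
import math


def lr_at_epoch(epoch, total_epochs, base_lr=1e-3, min_lr=0.0):
    """Learning rate under linear warmup followed by cosine decay.

    Warmup length follows the table's rule: max(total_epochs * 0.05, 1).
    The linear ramp and min_lr=0.0 endpoint are assumptions, since the
    table only names the warmup length and the cosine schedule.
    """
    warmup_epochs = max(int(total_epochs * 0.05), 1)
    if epoch < warmup_epochs:
        # Assumed: ramp linearly from near 0 up to base_lr over warmup.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr to min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(total_epochs - warmup_epochs, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 300  # student/KD epochs on Tiny-ImageNet-200 (Table 2)
    for e in (0, 14, 15, 150, 299):
        print(f"epoch {e:3d}: lr = {lr_at_epoch(e, total):.6f}")
```

With total_epochs = 300 the warmup lasts max(int(15), 1) = 15 epochs, so the rate reaches 1e-3 at epoch 14 and then decays along the cosine curve for the remaining 285 epochs.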