Table 2 Experimental details with VGG-16 as the teacher network.
From: Counterclockwise block-by-block knowledge distillation for neural network compression
| Training config | VGG-16 on Tiny-ImageNet-200 | VGG-16 on CIFAR-10 |
|---|---|---|
| Base learning rate | 1e-3 | 2e-3 |
| Weight decay | 0.05 | 0.05 |
| Batch size | 48 | 100 |
| Training epochs (CBKD) | 150, 200, 200, 200 | 3, 8, 20, 20 |
| Learning rate schedule | Cosine decay | Cosine decay |
| Thaw training epochs | 300 | 20 |
| Warmup epochs | max(training epochs × 0.05, 1) | max(training epochs × 0.05, 1) |
| Training epochs (Teacher) | 180 | 30 |
| Training epochs (Student) | 300 | 30 |
| Training epochs (KD) | 300 | 30 |
| Training epochs (FitNets) | 100, 300 | 10, 20 |
| Training epochs (RKD) | 300 | 30 |
| Training epochs (DKD) | 300 | 30 |
| Training epochs (L-S-KD) | 300 | 30 |
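For reference, the schedule implied by the table (cosine decay with a warmup of max(training epochs × 0.05, 1)) can be sketched as below. This is a minimal illustration, not the authors' code: the linear warmup ramp, the `min_lr=0` floor, and the helper name `lr_at_epoch` are assumptions; only the warmup-length rule, the base learning rates, and "cosine decay" come from the table.

```python
import math

def lr_at_epoch(epoch, total_epochs, base_lr, min_lr=0.0):
    """Cosine-decay learning rate with warmup (sketch).

    Warmup length follows the table's rule: max(total_epochs * 0.05, 1).
    The linear ramp and the min_lr=0 floor are assumptions; the paper's
    table only specifies the warmup length and "cosine decay".
    """
    warmup_epochs = max(total_epochs * 0.05, 1)
    if epoch < warmup_epochs:
        # Assumed: linear ramp up to base_lr over the warmup epochs.
        return base_lr * min((epoch + 1) / warmup_epochs, 1.0)
    # Cosine decay from base_lr down to min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(total_epochs - warmup_epochs, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: a 30-epoch CIFAR-10 run with the table's base LR of 2e-3.
for e in (0, 1, 5, 15, 29):
    print(e, f"{lr_at_epoch(e, total_epochs=30, base_lr=2e-3):.2e}")
```

With 30 total epochs the warmup rule gives max(1.5, 1) = 1.5 epochs, so the rate reaches 2e-3 early in epoch 1 and then decays along the cosine curve toward zero by epoch 29.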