Table 3 Comparison with state-of-the-art methods on AVA [10] and the teaching action detection sub-dataset of MM-TBA.
From: A Multi-Modal Dataset for Teacher Behavior Analysis in Offline Classrooms
| Model | Frame sampling strategy | Backbone | Pretrain | mAP(A) | mAP(T) |
|---|---|---|---|---|---|
| Slowfast [41] | 4 × 16 × 1 | ResNet50 | Kinetics-400 [11] | 24.32 | 29.45 |
| Slowfast | 4 × 16 × 1 | ResNet50 (with context) | Kinetics-400 | 25.34 | 29.95 |
| Slowfast | 8 × 8 × 1 | ResNet50 | Kinetics-400 | 25.8 | 28.84 |
| Slowfast | 8 × 8 × 1 | ResNet50 (temporal-max) | Kinetics-400 | 26.41 | 29.27 |
| VideoMAE [40] | 16 × 4 × 1 | ViT Base | Kinetics-400 | 33.6 | 27.52 |
| Slowonly [41] | 4 × 16 × 1 | ResNet50 | Kinetics-400 | 20.72 | 28.96 |
| Slowonly | 4 × 16 × 1 | ResNet50 (NonLocalEmbedGauss) | Kinetics-400 | 21.55 | 28.77 |
| Slowonly | 8 × 8 × 1 | ResNet50 (NonLocalEmbedGauss) | Kinetics-400 | 23.77 | 29.88 |
| Slowonly | 8 × 8 × 1 | ResNet101 | Kinetics-400 | 24.83 | 29.7 |
| Slowonly | 4 × 16 × 1 | ResNet50 | Kinetics-700 | 25.87 | 29.6 |
| Slowonly | 8 × 8 × 1 | ResNet50 (+context) | Kinetics-700 | 28.31 | 29.84 |
| Slowonly | 8 × 8 × 1 | ResNet50 (+temporal max pooling) | Kinetics-700 | 28.48 | 30.04 |
| Slowonly | 8 × 8 × 1 | ResNet50 (+focal loss) | Kinetics-700 | 30.33 | 30.2 |
| ACRN | 8 × 8 × 1 | ResNet50 | Kinetics-400 | 27.65 | 30.51 |