Table 3 Comparison with state-of-the-art methods on AVA¹⁰ and the teaching action detection sub-dataset of MM-TBA.

Model	Frame sampling strategy	Backbone	Pretrain	mAP(A)	mAP(T)
Slowfast⁴¹	4 × 16 × 1	ResNet50	Kinetics-400¹¹	24.32	29.45
Slowfast	4 × 16 × 1	ResNet50 (with context)	Kinetics-400	25.34	29.95
Slowfast	8 × 8 × 1	ResNet50	Kinetics-400	25.8	28.84
Slowfast	8 × 8 × 1	ResNet50 (temporal-max)	Kinetics-400	26.41	29.27
VideoMAE⁴⁰	16 × 4 × 1	ViT Base	Kinetics-400	33.6	27.52
Slowonly⁴¹	4 × 16 × 1	ResNet50	Kinetics-400	20.72	28.96
Slowonly	4 × 16 × 1	ResNet50 (NonLocalEmbedGauss)	Kinetics-400	21.55	28.77
Slowonly	8 × 8 × 1	ResNet50 (NonLocalEmbedGauss)	Kinetics-400	23.77	29.88
Slowonly	8 × 8 × 1	ResNet101	Kinetics-400	24.83	29.7
Slowonly	4 × 16 × 1	ResNet50	Kinetics-700	25.87	29.6
Slowonly	8 × 8 × 1	ResNet50 (+context)	Kinetics-700	28.31	29.84
Slowonly	8 × 8 × 1	ResNet50 (+temporal max pooling)	Kinetics-700	28.48	30.04
Slowonly	8 × 8 × 1	ResNet50 (+focal loss)	Kinetics-700	30.33	30.2
ACRN	8 × 8 × 1	ResNet50	Kinetics-400	27.65	30.51
