Table 3. Comparison with state-of-the-art methods on AVA [10] and the teaching action detection sub-dataset of MM-TBA.

From: A Multi-Modal Dataset for Teacher Behavior Analysis in Offline Classrooms

| Model | Frame sampling strategy | Backbone | Pretrain | mAP(A) | mAP(T) |
|---|---|---|---|---|---|
| Slowfast [41] | 4 × 16 × 1 | ResNet50 | Kinetics-400 [11] | 24.32 | 29.45 |
| Slowfast | 4 × 16 × 1 | ResNet50 (with context) | Kinetics-400 | 25.34 | 29.95 |
| Slowfast | 8 × 8 × 1 | ResNet50 | Kinetics-400 | 25.8 | 28.84 |
| Slowfast | 8 × 8 × 1 | ResNet50 (temporal-max) | Kinetics-400 | 26.41 | 29.27 |
| VideoMAE [40] | 16 × 4 × 1 | ViT Base | Kinetics-400 | 33.6 | 27.52 |
| Slowonly [41] | 4 × 16 × 1 | ResNet50 | Kinetics-400 | 20.72 | 28.96 |
| Slowonly | 4 × 16 × 1 | ResNet50 (NonLocalEmbedGauss) | Kinetics-400 | 21.55 | 28.77 |
| Slowonly | 8 × 8 × 1 | ResNet50 (NonLocalEmbedGauss) | Kinetics-400 | 23.77 | 29.88 |
| Slowonly | 8 × 8 × 1 | ResNet101 | Kinetics-400 | 24.83 | 29.7 |
| Slowonly | 4 × 16 × 1 | ResNet50 | Kinetics-700 | 25.87 | 29.6 |
| Slowonly | 8 × 8 × 1 | ResNet50 (+context) | Kinetics-700 | 28.31 | 29.84 |
| Slowonly | 8 × 8 × 1 | ResNet50 (+temporal max pooling) | Kinetics-700 | 28.48 | 30.04 |
| Slowonly | 8 × 8 × 1 | ResNet50 (+focal loss) | Kinetics-700 | 30.33 | 30.2 |
| ACRN | 8 × 8 × 1 | ResNet50 | Kinetics-400 | 27.65 | 30.51 |
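
The table does not define the "Frame sampling strategy" notation. A minimal sketch follows, assuming it uses the common clip length × frame interval × number of clips convention (so 4 × 16 × 1 would mean one clip of 4 frames taken 16 frames apart). The function name `sample_frame_indices`, the centering on a keyframe, and the clamping at video boundaries are illustrative assumptions, not the authors' pipeline.

```python
def sample_frame_indices(center, clip_len, frame_interval, total_frames):
    """Pick clip_len frame indices, frame_interval apart, around a keyframe.

    Hypothetical helper: assumes the clip is centered on `center` and that
    out-of-range indices are clamped to the first or last frame.
    """
    span = clip_len * frame_interval           # temporal extent covered by the clip
    start = center - span // 2                 # left edge of the sampling window
    indices = [start + i * frame_interval for i in range(clip_len)]
    # Clamp so indices stay inside [0, total_frames - 1].
    return [min(max(i, 0), total_frames - 1) for i in indices]


# The three strategies listed in the table, applied to a keyframe at index 100.
for clip_len, frame_interval in [(4, 16), (8, 8), (16, 4)]:
    print(clip_len, frame_interval,
          sample_frame_indices(100, clip_len, frame_interval, 1000))
```

Under this reading, all three strategies cover the same 64-frame window and differ only in how densely they sample it.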