Table 3 Performance comparison with state-of-the-art computer vision backbones on the VisDrone 2019 validation set.

From: End to end polysemantic cooperative mixed task trainer for UAV target detection

| Backbone | Res. | Params | FLOPs | Accuracy (ALL, top-1) | Accuracy (ALL, top-5) |
|---|---|---|---|---|---|
| ResNet-50 | 224 | 24.1M | 4.0G | 31.39 | 48.23 |
| LR-net-50 | 224 | 22.2M | 4.2G | 31.39 | 48.23 |
| Stand-alone\(\uparrow\) | 224 | 16.9M | 3.5G | 31.42 | – |
| AA-ResNet-50 | 224 | 24.7M | 4.1G | 31.43 | 48.25 |
| Botnet-s1-50 | 224 | 19.7M | 4.2G | 31.43 | – |
| VIT-b/16 | 382 | – | – | 31.45 | – |
| San19 | 224 | 19.4M | 3.2G | 31.48 | 48.53 |
| Lambda-ResNet-50\(\uparrow\) | 224 | 13.9M | – | 31.5 | – |
| PoT-50 | 224 | 21.1M | 3.2G | 33.83 | 49.13 |
| PoT-50\(\uparrow\) | 224 | 21.1M | 3.2G | 33.89 | 49.17 |
| SE-PoT-50 | 224 | 22.0M | 4.0G | 33.89 | 49.15 |
| SE-PoT-50\(\uparrow\) | 224 | 22.0M | 4.0G | 33.96 | 49.83 |
| ResNet-101 | 224 | 43.5M | 7.8G | 33.13 | 48.83 |
| LR-net-101 | 224 | 40.9M | 7.9G | 33.13 | 48.84 |
| AA-ResNet-101 | 224 | 44.3M | 8.0G | 33.15 | 48.85 |
| PoT-101 | 224 | 37.2M | 6.0G | 34.63 | 49.53 |
| PoT-101\(\uparrow\) | 224 | 37.2M | 6.0G | 34.72 | 49.57 |
| SE-PoT-101 | 224 | 39.8M | 8.4G | 34.68 | 49.55 |
| SE-PoT-101\(\uparrow\) | 224 | 39.8M | 8.4G | 36.03 | 50.23 |

  1. Backbones are grouped by network depth (50/101) so that comparisons are made within the same depth. \(\uparrow\) indicates that an exponential moving average was used in training. The accuracy reported here is the combined value (ALL) over the 10 categories of the dataset.
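
The exponential moving average marked by \(\uparrow\) is not specified further in the table. As a minimal sketch, assuming it refers to the common practice of keeping an averaged copy of the model weights during training, the update could look like the following. The function name `ema_update`, the decay value, and the toy "optimizer step" are illustrative assumptions, not the authors' implementation.

```python
def ema_update(shadow, params, decay=0.999):
    """Blend current weights into the averaged copy.

    shadow <- decay * shadow + (1 - decay) * params
    `decay` is a hypothetical hyperparameter; the source does not state one.
    """
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]


# Toy example: two scalar "weights" standing in for a model's parameters.
params = [0.0, 1.0]
shadow = list(params)  # the averaged copy starts at the initial weights

for step in range(3):
    # Stand-in for an optimizer step (assumption: weights drift by +0.1).
    params = [p + 0.1 for p in params]
    shadow = ema_update(shadow, params, decay=0.9)

print(params)  # raw weights after three steps
print(shadow)  # smoothed weights, lagging behind the raw ones
```

In setups of this kind, the averaged copy (`shadow`) rather than the raw weights is typically the one evaluated, which is one plausible reading of why the \(\uparrow\) rows score slightly higher than their plain counterparts.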