Fig. 1

Overall architecture of the audio-visual detector (AVDor). (a) the feature extraction part, comprising a vision backbone (ResNet [15]) combined with FPN [17], and an audio backbone (VGGish [16]); (b) the feature fusion module, TPAVI+; (c) the detection head, which outputs object locations and classes; (d) the post-processing part, which uses Soft-NMS to filter the detection results.
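
To illustrate the post-processing step in (d), the following is a minimal sketch of Soft-NMS in pure Python. It assumes the Gaussian decay variant; the caption does not state which variant, decay parameter, or score threshold AVDor uses, so `sigma`, `score_thresh`, and the function names `iou` and `soft_nms` are illustrative choices and not taken from the paper.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS (assumed variant; sigma and score_thresh are illustrative).

    Returns a list of (original_index, rescored_score) in selection order.
    """
    candidates = [(i, float(s)) for i, s in enumerate(scores)]
    kept = []
    while candidates:
        # Select the remaining detection with the highest current score.
        best_pos = max(range(len(candidates)), key=lambda k: candidates[k][1])
        best_idx, best_score = candidates.pop(best_pos)
        if best_score < score_thresh:
            break
        kept.append((best_idx, best_score))

        # Decay the scores of the remaining boxes according to their overlap
        # with the selected box, instead of discarding them outright.
        rescored = []
        for idx, score in candidates:
            overlap = iou(boxes[best_idx], boxes[idx])
            score *= math.exp(-(overlap ** 2) / sigma)
            if score >= score_thresh:
                rescored.append((idx, score))
        candidates = rescored
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms(boxes, scores)
```

In this toy input, the second box overlaps the first heavily, so its score is reduced and it is ranked below the non-overlapping third box; unlike hard NMS, it is not removed unless its decayed score falls under the threshold.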