Introduction

Deep learning technologies have facilitated rapid development across a wide range of industries. As computer vision technology matures, more and more researchers are focusing on applying specific techniques in campus settings. The most prominent feature of typical classroom scenarios is crowding, which is especially evident in large classrooms where hundreds of students may attend class simultaneously. For faculty, managing such a large number of students is difficult. For example, it can be hard to notice students who are whispering or making noise in the classroom, or to see that a student is answering a question from his or her seat.

Audio and visual information are related. On the temporal level, they often occur simultaneously, and on the spatial level, they both come from the same location in the video frame. Therefore, they are similar in feature space. The goal of audio-visual detection (AVD) is to identify the location and category of specific events in classroom videos by using audio and visual information. In terms of practical application requirements and task logic, AVD needs to find the sound-emitting students or other objects in the classroom, each of which represents an event. From another point of view, AVD can therefore be seen as a multimodal object detection task that utilizes both audio and visual information.

Many state-of-the-art detectors can be applied to various scenarios that require accurate object detection. There have been many applied studies based on object detection in intelligent education systems, such as student counting1 and student behavior detection2. One of the difficulties with object detection in the classroom is that the image features are not distinct. For example, when a student is answering or asking a question from his or her seat, it is difficult to capture robust visual information from the image alone, and in this case, it is beneficial to include sound features.

To efficiently combine object detection with sound, we propose an audio-visual detector (AVDor) based on multimodal learning. In this study, the main contributions are summarized as follows: (1) For intelligent education applications, we introduce the audio-visual detection (AVD) task which combines audio and visual information for detecting events of interest in classroom scenes. (2) To achieve AVD, we propose a novel multimodal-based AVDor that receives audio and visual information as input and outputs the object location and class. (3) For evaluation, we construct a benchmark for AVD, which provides object-level annotations based on the sound sources in the videos.

Related work

In the field of audio-visual multimodal research, existing methods fall into two broad categories according to the granularity of their localization results. Research such as audio-visual correspondence3 (AVC), audio-visual event localization4 (AVEL), and audio-visual video parsing5 (AVVP), which segments videos into events based on audio and visual information, can be considered coarse-grained. For classroom applications, these methods only match audio and video clips and therefore do not meet the needs of teachers. Other, fine-grained methods, such as sound source localization6 (SSL) and audio-visual segmentation7 (AVS), can locate the region of the video frame where the sound is produced or the pixels of a specific instance in the video. SSL methods, however, localize only the sound-emitting region rather than a specific instance, which is still not enough to solve the problem.

In the majority of cases, finding a sounding object in a classroom scene, as in an object detection task, is sufficient. AVS is therefore unnecessarily fine-grained, as it segments objects down to their exact shapes. Moreover, the instances segmented by AVS do not carry categories, so their output does not reflect the type of the event.

As a reference task for the realization of AVD, a superior object detector also plays an important role. As a fundamental task in computer vision, object detection has been well developed. There are two main types of object detectors: (1) One-stage detectors, such as the YOLO (You Only Look Once) series detectors8 and SSD (Single-Shot Detector)9; (2) Two-stage detectors, such as Faster R-CNN10 and Cascade R-CNN11, etc. These mainstream models have demonstrated satisfactory performance on natural image data.

Object detection is also widely used in smart education, especially in classroom scenarios. Liu et al.1 proposed a student counting system based on object detection, which focuses on solving the problem of small object detection. Rao and Chen12 developed a YOLOv5-based method for real-time student counting in classrooms, improving detection accuracy for small objects such as students' heads through optimized anchor boxes; their method achieved 91.59% accuracy on the CUHK Occlusion dataset. Lv et al.2 proposed a student behavior detection system that uses an improved SSD9 to recognize student behavior. Gan et al.13 explored the integration of IoT and multimodal analysis in smart education, combining audio, video, and sensor data, and demonstrated the potential of comprehensive data fusion for learning analytics. Tian et al.14 proposed a Transformer-based framework for audio-visual event localization in videos. These studies highlight the potential of multimodal approaches in enhancing the effectiveness of smart education systems.

Audio-visual detector

In this section, we provide an overview of the AVDor and describe its architecture in detail. The AVDor is designed to address the challenging task of combining audio and visual information for robust detection of objects or events in Japanese-language teaching scenarios.

Figure 1 shows the overall architecture of the proposed AVDor. From the perspective of object detection combined with audio information, the audio can be considered as an enhancement to vision. Thus, in our study, the AVDor is designed based on a typical one-stage object detector paradigm, which is composed of (1) a feature extraction part, (2) a feature fusion module, and (3) a detection head, followed by post-processing.

Fig. 1
figure 1

Overall architecture of the audio-visual detector (AVDor). (a) The feature extraction part, including a vision backbone (ResNet15) combined with FPN17, and an audio backbone (VGGish16); (b) the feature fusion module, TPAVI+; (c) the detection head, which outputs the object location and class; (d) the post-processing part, which uses Soft-NMS to filter the detection results.

Feature extraction

We use a common visual backbone network (i.e., ResNet15) to extract visual features from images, and an audio backbone (i.e., VGGish16) to extract audio features from sounds. The ResNet is designed stage-wise, and the output of each stage is a feature map (for $T$ images) with a different resolution ($H \times W$) and number of channels ($C_i$). As in other common detectors, the backbone features are then fused by a feature pyramid network17 (FPN) to better detect objects at different scales.

For audio feature extraction, we use VGGish16, which employs a deep convolutional neural network that processes input audio spectrograms to extract high-level representations for various audio analysis tasks.
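For illustration, the following minimal sketch shows how the multi-scale visual features and the clip-level audio embedding could be produced with standard torchvision building blocks. It is not our exact implementation: the layer names and shapes are assumptions based on common ResNet-50/FPN settings, and the random tensor stands in for the output of the pretrained VGGish encoder.

```python
from collections import OrderedDict

import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor
from torchvision.ops import FeaturePyramidNetwork

# Visual branch: ResNet-50 stages C2-C5 followed by an FPN with 256 output channels.
backbone = create_feature_extractor(
    resnet50(),
    return_nodes={"layer1": "c2", "layer2": "c3", "layer3": "c4", "layer4": "c5"},
)
fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024, 2048], out_channels=256)

T = 5                                  # frames sampled per event (see the Dataset section)
frames = torch.randn(T, 3, 800, 800)   # T RGB frames resized to 800 x 800
stage_feats = OrderedDict(backbone(frames))   # stage-wise feature maps of the T frames
pyramid = fpn(stage_feats)                    # e.g. 'c2': (T, 256, 200, 200) ... 'c5': (T, 256, 25, 25)

# Audio branch: a VGGish-style encoder maps the log-mel spectrogram of the whole
# sound clip to a 128-dim embedding per frame step; a random tensor stands in
# for the pretrained VGGish output here.
audio_embedding = torch.randn(T, 128)
```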

Feature fusion

For common 2D object detection, RGB images are fed to the detector, which processes 3-channel data. To incorporate audio information, which is single-channel data, the AVDor needs a feature fusion module. As shown in the feature fusion part of Fig. 1, the image feature and the audio feature are fed together into the TPAVI+ module, an upgraded version of the TPAVI7 module that encodes audio-visual relations through a non-local18 block. The detailed structure of the TPAVI+ module is illustrated in Fig. 2. Compared with TPAVI, TPAVI+ adds two enhancers, whose structure is illustrated in Fig. 3.

Fig. 2
figure 2

Illustration of the Non-Local, TPAVI, and TPAVI+ structures. Note that we use the same color to represent the same type of feature map. The reshape operation is omitted for simplicity.

Fig. 3
figure 3

Structure of the enhancer.

The features that go through the enhancer can be expressed by Eq. 1.

$$F^{\prime } = F_{eh} \times F$$
(1)

in which $F$ is the input feature of size $Th_iw_i \times Th_iw_i$ or $T \times h_i \times w_i \times C$, $F_{eh}$ is the enhanced feature, and $F'$ is the output feature of the enhancer. $F_{eh}$ is calculated by Eq. 2.

$$F_{eh} = {\text{Sigmoid}}({\text{ReLU}}({\text{Conv}}({\text{MP}}(F))))$$
(2)

where MP denotes the max pooling layer, Conv denotes a 1 × 1 convolution, and ReLU and Sigmoid are activation functions. The enhancer controls the relative importance of the audio-visual features, and thereby their contribution to the final prediction. The ReLU–Sigmoid combination scales the feature importance between 0.5 and 1, preventing features from being drastically suppressed during training.
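A minimal sketch of one possible enhancer implementation is given below, assuming global max pooling and a channel-wise gate; the class name and the exact pooling choice are illustrative rather than our exact code.

```python
import torch
import torch.nn as nn


class Enhancer(nn.Module):
    """Sketch of the enhancer in Eqs. (1)-(2).

    The feature map is squeezed by max pooling, passed through a 1x1 convolution,
    and turned into a gate in (0.5, 1) via ReLU followed by Sigmoid. The gate
    rescales the input feature (Eq. 1), so no feature is suppressed to zero.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveMaxPool2d(1)           # MP: global max pooling
        self.conv = nn.Conv2d(channels, channels, 1)  # 1 x 1 convolution
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, C, H, W); here B folds the T frames of one sample.
        f_eh = self.sigmoid(self.relu(self.conv(self.pool(f))))  # Eq. (2)
        return f_eh * f                                          # Eq. (1)
```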

The shape of the image feature varies across stages, while the dimension of the audio feature is fixed ($T \times 128$). Therefore, the audio feature $A$ is first transformed by a linear layer to the same number of channels as the image feature $V_i$ ($T \times 128$ to $T \times C$), and then reshaped to the same shape as $V_i$ by duplication ($T \times C$ to $T \times h_i \times w_i \times C$). In TPAVI+, the audio feature and the image feature of the $i$th stage are fused in a non-local paradigm. The fusion can be expressed by Eq. 3.

$$Z_{i} = V_{i} + E_{1} \mu (E_{2} \alpha_{i} g(V_{i} ))$$
(3)

where $g$ and $\mu$ are 1 × 1 × 1 convolutions and $Z_i \in \mathbb{R}^{T \times h_i \times w_i \times C}$. $E_1$ and $E_2$ are the two enhancers. $\alpha_i$ denotes the audio-visual similarity, which can be calculated by Eq. 4.

$$\alpha_{i} = \frac{{\theta (V_{i} )\phi (\hat{A})^{ \top }} }{N}$$
(4)

in which $\theta$ and $\phi$ are 1 × 1 × 1 convolutions, $\hat{A}$ is the projected and duplicated audio feature, and $N$ is a normalization factor equal to $T \times h_i \times w_i$. Through the TPAVI+ module, the audio information is fused with the image feature, and the audio-visual relation is encoded.
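The following sketch illustrates how Eqs. (3)-(4) could be implemented, reusing the Enhancer sketch above; the linear layers stand in for the 1 × 1 × 1 convolutions applied to flattened tokens, and all names are illustrative rather than the exact module code. In practice the fusion is applied at coarse feature resolutions, since the similarity matrix grows quadratically with the number of spatio-temporal tokens.

```python
import torch
import torch.nn as nn


class TPAVIPlus(nn.Module):
    """Sketch of the TPAVI+ fusion in Eqs. (3)-(4); illustrative, not the exact code.

    The visual feature V_i (T, h, w, C) and the projected audio feature (T, C)
    are fused by a non-local block, with two enhancers E1 and E2 gating the
    similarity-weighted features before they are added back to V_i.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Linear layers on flattened tokens are equivalent to 1 x 1 x 1 convolutions.
        self.theta = nn.Linear(channels, channels)
        self.phi = nn.Linear(channels, channels)
        self.g = nn.Linear(channels, channels)
        self.mu = nn.Linear(channels, channels)
        self.e1 = Enhancer(channels)   # reuses the Enhancer sketch above
        self.e2 = Enhancer(channels)

    def forward(self, v: torch.Tensor, a_hat: torch.Tensor) -> torch.Tensor:
        # v: (T, h, w, C); a_hat: (T, C), already projected to C channels.
        T, h, w, C = v.shape
        tokens = v.reshape(T * h * w, C)
        a_tokens = a_hat[:, None, None, :].expand(T, h, w, C).reshape(T * h * w, C)

        # Eq. (4): audio-visual similarity, normalized by N = T * h * w.
        alpha = self.theta(tokens) @ self.phi(a_tokens).t() / (T * h * w)

        # Eq. (3): Z = V + E1(mu(E2(alpha g(V)))).
        fused = alpha @ self.g(tokens)                                    # (Thw, C)
        fused = self.e2(fused.reshape(T, h, w, C).permute(0, 3, 1, 2))    # (T, C, h, w)
        fused = self.mu(fused.permute(0, 2, 3, 1).reshape(T * h * w, C))  # (Thw, C)
        fused = self.e1(fused.reshape(T, h, w, C).permute(0, 3, 1, 2))    # (T, C, h, w)
        return v + fused.permute(0, 2, 3, 1)                              # (T, h, w, C)
```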

Loss function

The loss function is designed to reflect the relationship between sound and vision. In addition to the loss on the predicted boxes output by the detection head, it should also contain a term that measures the distance between the event represented by the predicted box and the sound. We therefore compute the loss function in two parts, as illustrated by Eq. 5.

$$L = L_{det} + \lambda L_{avd}$$
(5)

where $L_{det}$ is the common loss of the object detection task, which contains a regression loss and a classification loss, and $\lambda$ is a balance weight. $L_{avd}$ is the audio-visual relation loss. We follow the loss function of AVS7, which uses the Kullback–Leibler (KL) divergence to measure the similarity between audio and visual features; the difference is that our losses are calculated and summed head-wise. $L_{avd}$ is calculated by Eq. 6.

$$L_{avd} = \sum\limits_{i = 1}^{n} {KL} (AvgPool(F_{i}^{tpavi+} ), A_{i} )$$
(6)

where $n$ is the number of detection heads, AvgPool is the average pooling layer, $F_i^{tpavi+}$ is the output of the $i$th TPAVI+ module, $A_i$ is the audio feature of the $i$th stage, and KL denotes the Kullback–Leibler divergence.
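A hedged sketch of Eqs. (5)-(6) is shown below; the softmax normalization before the KL divergence and the value of $\lambda$ are assumptions that follow the AVS-style matching loss rather than our exact training code.

```python
import torch
import torch.nn.functional as F


def avd_loss(det_loss: torch.Tensor,
             fused_feats: list,   # F_i^{tpavi+}: one (T, C, h_i, w_i) tensor per head
             audio_feats: list,   # A_i: one (T, C) tensor per stage
             lam: float = 0.1) -> torch.Tensor:
    """Sketch of Eqs. (5)-(6): L = L_det + lambda * L_avd.

    L_avd is the head-wise sum of KL divergences between the spatially pooled
    fused features and the corresponding audio features; lam = 0.1 is an
    illustrative value for the balance weight lambda.
    """
    l_avd = det_loss.new_zeros(())
    for f_i, a_i in zip(fused_feats, audio_feats):
        pooled = F.adaptive_avg_pool2d(f_i, 1).flatten(1)   # AvgPool -> (T, C)
        # KL(AvgPool(F_i) || A_i), comparing the two as distributions over channels.
        l_avd = l_avd + F.kl_div(F.log_softmax(pooled, dim=1),
                                 F.softmax(a_i, dim=1),
                                 reduction="batchmean")
    return det_loss + lam * l_avd
```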

Benchmark

To the best of our knowledge, there is no previous work or benchmark for AVD in Japanese-language teaching rooms. Therefore, we construct a benchmark for AVD by collecting a dataset and designing evaluation metrics.

Dataset

The dataset is collected from our Japanese-language classroom in Xi’an International Studies University, Xi’an 710128, China. Some sample images are shown in Figure S1. The dataset contains 4500 annotated images, which are divided into 3500 training images and 1000 test images. We define a total of 6 behaviors associated with sounds, which are sneaky talk, clapping, organizing stuff, laughing, answering, and teacher speaking. The detailed statistics of the dataset are shown in Table 1.

Table 1 Detailed statistics of the dataset.

The format of the dataset follows the needs of the AVDor. Each event that produces a sound corresponds to a video clip and a sound clip. We sample $T$ frames at equal intervals from the video clip as visual information and use the whole sound clip as audio information; the audio information is repeated $T$ times to match the visual information. We manually annotate the bounding box of the object that emits the sound and the event category of the object.
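The sketch below illustrates one possible layout of a training sample as described above; all field names are hypothetical and only show the expected tensor shapes.

```python
import torch
from torch.utils.data import Dataset


class AVDDataset(Dataset):
    """Illustrative layout of AVD training samples (field names are hypothetical).

    Each event provides T frames sampled at equal intervals from its video clip,
    a single embedding of the whole sound clip (repeated T times), and the
    bounding box plus event category of the sounding object.
    """

    def __init__(self, records, T: int = 5):
        self.records = records   # list of dicts loaded from the annotation file
        self.T = T

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        frames = rec["frames"]                         # (T, 3, H, W) float tensor
        audio = rec["audio_embedding"]                 # (128,) clip-level embedding
        audio = audio.unsqueeze(0).repeat(self.T, 1)   # repeat T times -> (T, 128)
        target = {
            "boxes": rec["boxes"],    # (K, 4) boxes in x1, y1, x2, y2
            "labels": rec["labels"],  # (K,) event categories (6 classes)
        }
        return frames, audio, target
```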

Evaluation metric

For evaluation, we design a metric for the AVD task, which is based on the common metric for object detection, mean average precision (mAP), and audio-visual similarity.

In object detection, the mAP is calculated by Eq. 7.

$$mAP = \frac{1}{n}\sum\limits_{i = 1}^{n} {AP_{i}}$$
(7)

where $n$ is the number of classes and AP is the average precision of each class, calculated as the area under the precision–recall (PR) curve. Specifically, precision and recall are computed by Eq. 8.

$$Precision = \frac{{\text{TP}}}{{{\text{TP}} + {\text{FP}}}},\quad Recall = \frac{{\text{TP}}}{{{\text{TP}} + {\text{FN}}}}$$
(8)

where TP, FP, and FN are the numbers of true positive, false positive, and false negative boxes, respectively.

Specifically, for AVD, the accuracy of audio-visual matching is important. For each sound clip $S_i$, we calculate the accuracy of the detected events corresponding to this clip. We therefore use Eq. 9 to calculate the audio-visual match rate (AVMR).

$$AVMR = \frac{1}{{N_{s} }}\sum\limits_{i = 1}^{{N_{s} }} {\left( {\frac{{N_{TP} }}{{N_{Pred} }} \cdot \frac{{N_{Pred} }}{{N_{gt} }}} \right)}$$
(9)

where $N_s$ denotes the number of sound clips, and $N_{TP}$, $N_{Pred}$, and $N_{gt}$ are the numbers of true positive, predicted, and ground truth boxes, respectively. A predicted box whose IoU and confidence score exceed the thresholds is counted as a true positive.

Combining mAP and AVMR, we use mean average matching precision (mAMP) as the evaluation metric, which is calculated as in Eq. 10.

$$mAMP = mAP \times AVMR$$
(10)

from which we can conclude that the closer the mAMP is to 1, the better the performance of the AVDor.
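A small sketch of Eqs. (9)-(10) follows, assuming the per-clip counts of true positive, predicted, and ground-truth boxes have already been collected at the chosen IoU and confidence thresholds; the function names are illustrative.

```python
def avmr(per_clip_counts):
    """Sketch of Eq. (9): audio-visual match rate averaged over N_s sound clips.

    per_clip_counts: list of (n_tp, n_pred, n_gt) tuples, one per sound clip,
    where a prediction counts as a true positive only if both its IoU and its
    confidence score exceed the chosen thresholds.
    """
    total = 0.0
    for n_tp, n_pred, n_gt in per_clip_counts:
        if n_pred == 0 or n_gt == 0:
            continue  # clips with no predictions or no ground truth contribute 0
        total += (n_tp / n_pred) * (n_pred / n_gt)
    return total / len(per_clip_counts)


def mamp(map_value: float, avmr_value: float) -> float:
    """Eq. (10): mean average matching precision."""
    return map_value * avmr_value
```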

Experimental results and discussion

In this section, we introduce the implementation details and evaluate the proposed AVDor based on the constructed benchmark.

Implementation details

The experiments were carried out on a computer equipped with 2 NVIDIA RTX 3090 GPUs (24 GB memory). We use PyTorch-1.8.019 to implement the AVDor. We train the AVDor for 24 epochs using the AdamW optimizer with a batch size of 16 and an initial learning rate of 0.0001. The image size is set to 800 × 800.

As for the composition of the training data, for each event corresponding to a video with sound, we extract $T = 5$ frames at equal intervals. The entire sound clip is used as audio information. The five frames and the entire sound clip are paired as one training sample.

Comparison experiment

First, we compare the performance of the AVDor with other state-of-the-art object detectors that do not use audio information. The results are shown in Table 2. We report mAP@0.5:0.95 and $AVMR_{thresh=0.5}$ to show the performance of the methods. We can observe that the AVDor outperforms the other detectors by a large margin, even with a simple ResNet-50 backbone. The results show that audio information is helpful for object detection in the classroom.

Table 2 Comparison experiment of the AVDor with other state-of-the-art object detectors.

Then, we compare the proposed TPAVI+ module with the original TPAVI module and with simple feature addition. The results are shown in Table 3. It is evident that the TPAVI+ module outperforms the original version: with the enhancers, the audio-visual relationship is better encoded, which benefits the AVDor. From these experiments, we conclude that the AVDor performs better than other object detectors in the classroom, which demonstrates the feasibility of AVD.

Table 3 Comparison experiment of TPAVI+ and TPAVI.

Discussion

Through the experiments, we constructed a benchmark and designed the AVDor to demonstrate the feasibility of AVD. Combining audio information improves object detection performance, reaching 56.19% mAP and 52.54% AVMR. For a more intuitive illustration, Figure S2 visualizes some detection results selected from the test set. As observed, the AVDor is able to accurately detect various sound-producing events. This capability makes AVD particularly valuable in classroom scenarios, where instructors cannot monitor all of this information themselves. The application of multimodal AI algorithms contributes to the advancement of smart education systems.

We also acknowledge the limitations of the current study. The dataset remains relatively constrained in scale and may not fully represent the wide spectrum of classroom settings across different schools or teaching styles. Moreover, the model’s ability to distinguish overlapping or concurrent sound sources can be further improved. To address these issues, training on larger and more diverse datasets could significantly enhance both detection accuracy and resilience to noise.

Conclusion

In this study, we propose a novel multimodal task for intelligent education, namely audio-visual detection. AVD can be used to locate sound-emitting objects with unclear sources in online or physical classrooms. To accomplish AVD, we propose a new multimodal AVDor that receives audio and visual input and outputs the object location and class. We also construct a benchmark for AVD, which provides object-level annotations and an evaluation metric based on the sound sources in the videos. Through experiments, we demonstrate that the proposed AVDor detects sound-producing persons or events in classroom settings better than common object detectors, thereby effectively assisting teaching as a component of an intelligent education system.