Introduction

Intelligent connected vehicles are a cornerstone of intelligent transportation systems, and environmental perception is the first step toward their smooth operation. Existing intelligent connected vehicles rely on sensors installed on the vehicles themselves to perceive the surrounding environment. However, in diverse and complex environments involving near-range vehicle occlusion, mid-range infrastructure blind spots, and long-range small targets, they are prone to missed and false detections that can lead to dangerous situations.

For example, when a vehicle approaches an intersection, its lateral field of view may be obstructed by other vehicles, creating significant visual blind spots. In such moments, a pedestrian suddenly crossing the road can lead to a severe collision, a scenario commonly referred to as a "sudden pedestrian crossing" incident. In response, cooperative perception systems that exchange data between vehicles and infrastructure sensors via communication devices have become a research hotspot in intelligent transportation systems, as shown in Fig. 1a. Compared to onboard sensors, infrastructure sensors offer several advantages, including reduced occlusion, higher resolution, and a wider field of view. They can provide multi-angle heterogeneous perception data, which significantly enhances the vehicle's ability to detect obstacles in blind spots and small distant targets. Currently, the perception accuracy of sensors such as LiDAR and cameras continues to improve1, while the communication bandwidth of wireless devices is gradually expanding2,3. These new technologies greatly enhance the practicality of vehicle-infrastructure cooperative perception.

Figure 1

Advantages of cooperative perception. (a) represents the cooperative perception process; (b) and (c) illustrate point cloud views from the vehicle's perspective and the fused perspective, respectively.

Existing cooperative perception strategies can be categorized into three types based on the content transmitted: early fusion of raw point clouds, intermediate fusion of point cloud features, and late fusion of perception results. To balance transmission bandwidth against perception accuracy, the intermediate fusion strategy, which transmits pre-processed features, has become the mainstream research direction in cooperative perception. Typically, intermediate fusion schemes consist of two parts: feature extraction and data fusion. In the feature extraction phase, point cloud data from both the vehicle and infrastructure ends are input into CNN-based voxelization methods such as VoxelNet4 or 2D pseudo-image algorithms such as PointPillar5 to extract features for transmission. In the data fusion phase, attention mechanisms6 or Vision Transformer7 architectures are commonly employed to fuse the heterogeneous vehicle and infrastructure data, ultimately generating 3D bounding boxes.

However, in real-world scenarios, the sparsity and incompleteness of onboard sensor point clouds can lead to bounding boxes that do not accurately cover vehicle contours, as shown in Fig. 1b. In contrast, fused-perspective point clouds that incorporate infrastructure data provide clear and complete vehicle contour information, as illustrated in Fig. 1c. Nonetheless, obtaining fused point clouds requires transmitting a large amount of point cloud data, which degrades real-time detection performance. Furthermore, existing feature fusion mechanisms merely merge the extracted features at the output stage, so the differential information present in the vehicle and infrastructure features is extracted and utilized inefficiently.

To address these issues, we propose a discrepancy-information cooperative perception network based on knowledge distillation, which comprises two main modules. To tackle the weak feature extraction from vehicle-side sensor point clouds, a knowledge distillation point cloud feature extractor based on sparse convolution is introduced. During training, the fused-perspective model provides prior feature knowledge to the vehicle model, allowing the single-view vehicle model's output point cloud features to approximate those of the fused-perspective model and thereby improving the accuracy of the predicted bounding boxes. To address the loss of infrastructure-specific information when fusing heterogeneous vehicle and infrastructure features, and inspired by the infrared and visible image fusion algorithm ATFusion8, a Discrepancy Feature Attention Fusion (DFAF) module is designed to merge heterogeneous features from vehicles and infrastructure. It consists of a Discrepancy Information Attention Module (DIAM) and a Shared Information Attention Module (AIAM) that effectively extract and utilize the heterogeneous differential information between vehicles and infrastructure, thereby improving the efficiency of point cloud information fusion. In summary, the core contributions of this paper are as follows:

  1. The introduction of an SKD feature extractor that distills prior knowledge from fused-perspective point clouds, addressing the feature blurriness of single-view point cloud feature extraction models.

  2. The design of the DFAF mechanism for heterogeneous information attention fusion, which resolves the loss of infrastructure-specific information in traditional fusion algorithms and improves the extraction and utilization of differential information between vehicles and infrastructure.

  3. Validation on two datasets: the real-world DAIR-V2X and the simulated V2XSet. Experimental results demonstrate that the proposed algorithm effectively improves the accuracy of cooperative perception.

Related works

Cooperative object detection

Currently, enhancing the perception performance of individual vehicles using infrastructure devices has garnered increasing attention from researchers. Among the various strategies, intermediate fusion methods have emerged as a popular direction in cooperative perception, effectively balancing scene information loss and transmission bandwidth requirements. For example, the F-cooper9 algorithm utilizes voxel methods to extract point cloud features for transmission and employs the Maxout10 method to fuse the voxel features of vehicles and infrastructure. The OPV2V11 algorithm introduces a self-attention mechanism to learn the interactions of features at the same spatial location. The V2VNet12 algorithm leverages Graph Neural Networks (GNN)13 to explore the transfer of features among multiple vehicles. Additionally, the V2X-ViT14 algorithm addresses the spatiotemporal misalignment caused by real-world communication delays, utilizing a ViT architecture composed of multi-agent self-attention layers and multi-scale window self-attention layers to fuse features from the vehicle and infrastructure perspectives. The Where2comm15 algorithm establishes a spatial confidence map, allowing agents to share only the spatially sparse information that is crucial for perception tasks. The MPDA16 algorithm introduces the inter-domain gap among different agents for the first time and employs a Transformer architecture to obtain fused cross-agent features. TransIFF17 designs an instance-set feature transmission algorithm to reduce transmission bandwidth and incorporates a cross-domain adaptation module to mitigate the inter-domain gap among agents. The SCOPE18 algorithm enhances the vehicle's object detection capability by integrating temporal semantic information from preceding and succeeding frames with key spatial semantic information from surrounding heterogeneous agents. V2VFormer++19 establishes a multimodal vehicle-infrastructure cooperative perception framework, employing a dynamic channel fusion algorithm to merge multimodal data and position-aware fusion (PAF) to fuse vehicle and infrastructure information. Finally, the CoBEVFlow20 algorithm addresses the information mismatch caused by asynchronous communication between agents, compensating for agent motion in the bird's-eye view to align asynchronous messages among multiple agents.

The aforementioned algorithms enhance detection performance by strengthening interactions between features; however, they overlook the intrinsic feature deficiency of single-view data, which limits the upper bound of fusion-based object detection accuracy. In contrast, the SKD module proposed in this work fundamentally improves the feature extraction capability for point cloud data, while also offering considerable advantages in model complexity and inference speed.

Knowledge distillation

Knowledge distillation networks21,22,23 use soft targets produced by complex but accurate teacher networks to guide the training of streamlined student networks that are better suited for inference. This method is widely used in semantic segmentation24, object detection25, and object re-identification26. Based on the type of soft target transmitted, existing knowledge distillation networks can be divided into two categories, feature imitation and prediction mimicking, as shown in Fig. 2a, b, respectively. DiscoNet27 employs a feature-imitation distillation network, whose smooth output distributions facilitate the student network's learning of features but constrain the student model to remain consistent with the teacher model in the latent feature space. In response, this paper adopts an asymmetric knowledge distillation network, as shown in Fig. 2c. During training, fused-perspective data are easy to acquire, so the teacher network that consumes them processes a larger data volume, while the single-view student network requires fewer parameters. In the testing phase, the student network, despite its smaller parameter count, can still extract point cloud feature representations close to those of the fused perspective.

Figure 2

Comparison of distillation modes. (a) and (b) represent prediction mimicking and feature imitation, respectively. (c) Illustrates the asymmetric knowledge distillation structure. Different colors represent different data flows.

3D point cloud object detection

3D object detection based on LiDAR takes raw point cloud data as input and outputs the position and category of objects. Existing algorithms can be divided into three categories based on the smallest unit of point cloud processing: point-based object detection28,29; 3D voxel-based detection methods30,31,32; and methods based on 2D pseudo-image features33,34,35. Point-based detection algorithms36,37,38,39 aggregate features to key points and use these features to predict bounding box classifications. Due to their simple detection architecture, they retain rich point cloud information and generally achieve high accuracy. 3D voxel grid algorithms40,41 divide space into spatial voxels and use sparse 3D convolution to extract features, but 3D convolution has high computational complexity and long inference times. Methods using 2D pseudo-image features convert 3D point clouds into 2D pseudo-image features and use low-complexity 2D convolution to extract features. However, due to the low information density within 2D pseudo-image grids, feature extraction efficiency is low. In response, this paper designs a plug-and-play network, Sparse Pillar, centered on sparse 2D convolution, which significantly reduces runtime while improving object detection accuracy.

Methodology

The cooperative perception network CPD-KD proposed in this work, illustrated in Fig. 3, focuses on fusing vehicle-infrastructure differential features. It consists primarily of a teacher network trained with fused-perspective data and a student network trained with single-perspective data. The teacher network outputs prior feature information about detected objects to guide the student network in producing high-quality object features. During the vehicle-infrastructure feature fusion stage, traditional attention fusion mechanisms tend to lose vehicle- and infrastructure-specific information. To address this, we propose DFAF, which focuses on the differences between vehicle and infrastructure features and effectively improves the cooperative efficiency of object detection information between vehicles and infrastructure. The following sections present the network architecture and data processing details.

Figure 3

Pipeline of our proposed CPD-KD. It consists of two networks: a teacher network and a student network. The solid and dotted arrows represent the forward and backward propagation processes, respectively. The details of each individual component are illustrated in Section "Methodology".

Student network

Consider each agent i \(\in\) {1,..., N} within the communication range, with agent category \({{\varvec{c}}}_i\) \(\in\) {V, I}, where V and I represent vehicles and infrastructure equipment, respectively. We assume synchronized data transmission; each agent i has position \({{\varvec{P}}}_i\) and LiDAR data \({{\varvec{L}}}_i\). Agent ego is selected as the central agent and receives the positions of the surrounding agents. Through coordinate-system transformation, the central agent ego can receive the original point clouds or features from the surrounding agents.

Feature extraction

The point cloud feature extraction employs the 2D pillar grid feature extraction algorithm Sparse Pillar, whose main structure is shown in Fig. 4. Using an "Encoder-Neck" architecture in the bird's-eye view (BEV), it retains the traditional 2D pillar grid representation while adopting a sparse convolution method better suited to point cloud features, which further reduces inference latency and enhances feature extraction capability.

Figure 4

Pipeline of Sparse Pillar. Yellow represents sparse convolution, while blue represents dense convolution.

The overall structure consists of three components: a 2D pseudo-image generation module, an encoder, and a bottleneck block (Neck). The Point Feature Net converts the point cloud into a stacked pillar tensor, which is then projected into a 2D pseudo-image of size H\(\times\) W\(\times\) C, where H and W represent the height and width of the pseudo-image canvas, respectively, and C indicates the number of channels. The 2D pseudo-image is then fed into the encoder, which adopts a VGGNet-style architecture. Its objective is to extract sparse pillar features at varying depths from the projected sparse 2D pillars and to feed all sparse pillar features of varying scales into the bottleneck block. Since feature compression has already been performed, the sparse pillar features can be fused using standard dense 2D convolutions. The specific procedure is given by formula (1):

$$\begin{aligned} F_i^s = student(L_i),F_i^s\in {\mathbb {R}}^{{\bar{H}}\times {\bar{W}}\times {\bar{C}}} \end{aligned}$$
(1)

where \(student(\bullet )\) represents the student feature extraction network, and \({\mathbb {R}}^{{\bar{H}}\times {\bar{W}}\times {\bar{C}}}\) indicates the size of the feature space after convolution. Sparse Pillar employs sparse convolution to reduce computational complexity during feature extraction. In the feature upsampling stage, dense convolution is used to integrate high-level abstract semantics with low-level fine-grained spatial features, improving detection accuracy for large objects. Different feature extraction structures are used for the teacher and student models; the structural differences are detailed in the experimental section.
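For concreteness, the following minimal PyTorch sketch shows how per-pillar feature vectors can be scattered onto the H \(\times\) W pseudo-image canvas before the encoder; the tensor names, shapes, and the toy example are illustrative assumptions rather than the exact Sparse Pillar implementation.

```python
import torch

def scatter_pillars(pillar_features, pillar_coords, H, W):
    """Scatter per-pillar features onto a dense BEV pseudo-image.

    pillar_features: (P, C) feature vector of each non-empty pillar
    pillar_coords:   (P, 2) integer (row, col) index of each pillar on the canvas
    Returns a (C, H, W) pseudo-image; empty pillars remain zero (sparse canvas).
    """
    C = pillar_features.shape[1]
    canvas = torch.zeros(C, H * W, dtype=pillar_features.dtype)
    flat_idx = pillar_coords[:, 0] * W + pillar_coords[:, 1]   # flatten (row, col)
    canvas[:, flat_idx] = pillar_features.t()                  # place pillar features
    return canvas.view(C, H, W)

# Toy example: 5 non-empty pillars with 64-dim features on a 200 x 704 canvas
feats = torch.randn(5, 64)
coords = torch.stack([torch.randint(0, 200, (5,)),
                      torch.randint(0, 704, (5,))], dim=1)
bev = scatter_pillars(feats, coords, H=200, W=704)   # (64, 200, 704)
```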

Compression and decompression

To further reduce the required bandwidth for data transmission, each agent performs compression before transmitting. We use a \(1 \times 1\) convolutional layer to compress the features along the channel dimension, as shown in formula (2):

$$\begin{aligned} F_i^{s'}= & Enc_{com}(F_i^s), F_i^{s'}\in {\mathbb {R}}^{{\bar{H}}\times {\bar{W}}\times \bar{{\bar{C}}}},\bar{{\bar{C}}}\ll {\bar{C}} \end{aligned}$$
(2)
$$\begin{aligned} & Data_i\leftarrow (F_i^{s'},P_i) \end{aligned}$$
(3)

where \(Enc_{com}(\bullet )\) represents the compression function; formula (3) defines the transmitted data packet \(Data_i\), which contains the compressed feature \(F_i^{s'}\) and the position \(P_{i}\). After receiving the packet, the other agents decompress the compressed feature \(F_i^{s'}\); the specific process is shown in formula (4):

$$\begin{aligned} F_i^{s''}=Dec_{com}(F_i^{s'}),F_i^{s''}\in \mathbb {R}^{{\bar{H}}\times {\bar{W}}\times {\bar{C}}} \end{aligned}$$
(4)

where \(Dec_{com}(\bullet )\) represents the decompression function corresponding to the compression process, and the decoded feature lies in the space \({\mathbb {R}}^{{\bar{H}}\times {\bar{W}}\times {\bar{C}}}\). The decompressed feature \(F_i^{s''}\) is then passed to the feature fusion module.
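A minimal PyTorch sketch of this channel compression and decompression step is given below; the channel counts are illustrative assumptions, since the text only requires the compressed channel count to be much smaller than the original.

```python
import torch
import torch.nn as nn

class ChannelCodec(nn.Module):
    """1x1-conv channel compression/decompression for feature transmission (formulas 2 and 4)."""
    def __init__(self, c_full=256, c_small=16):
        super().__init__()
        self.enc = nn.Conv2d(c_full, c_small, kernel_size=1)   # Enc_com
        self.dec = nn.Conv2d(c_small, c_full, kernel_size=1)   # Dec_com

    def forward(self, feat):
        compressed = self.enc(feat)        # (B, c_small, H, W), sent over the V2X link
        restored = self.dec(compressed)    # (B, c_full, H, W), used for fusion
        return compressed, restored

codec = ChannelCodec()
f = torch.randn(1, 256, 100, 352)          # BEV feature of one agent (toy sizes)
f_sent, f_restored = codec(f)
```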

Teacher network

Multi-view data fusion

The teacher network takes fused-perspective data as input, so the point clouds must be fused before being fed into the model. The fusion process is shown in formula (5):

$$\begin{aligned} \begin{aligned} L_{mix} = A((L_{ego},P_{ego}),(L_1,P_1),(L_2,P_2),\dots ,(L_N,P_N)) \end{aligned} \end{aligned}$$
(5)

where \(L_{mix}\) represents the fused point cloud, N indicates the number of agents within transmission range, and \(A(\bullet )\) represents the aggregation of the surrounding point cloud data. To align the input data of the fused-view teacher network with that of the single-view student network, the fused point cloud is cropped and its coordinates are transformed into the coordinate system centered on agent ego.
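As an illustration, the following sketch aggregates neighbor point clouds into the ego frame under the assumption that each pose \(P_i\) is available as a 4 \(\times\) 4 homogeneous transform; the cropping bounds follow the perception range used later in the experiments.

```python
import numpy as np

def aggregate_point_clouds(ego_cloud, ego_pose, neighbors):
    """Fuse neighbor point clouds into the ego coordinate frame (formula 5).

    ego_cloud: (N, 3) xyz points of the ego agent
    ego_pose:  (4, 4) ego pose in the world frame (assumed representation of P_ego)
    neighbors: list of (cloud_i, pose_i) tuples from surrounding agents
    """
    world_to_ego = np.linalg.inv(ego_pose)
    merged = [ego_cloud]
    for cloud, pose in neighbors:
        hom = np.hstack([cloud, np.ones((cloud.shape[0], 1))])   # (M, 4) homogeneous
        in_ego = (world_to_ego @ pose @ hom.T).T[:, :3]          # agent -> world -> ego
        merged.append(in_ego)
    mixed = np.vstack(merged)
    # Crop to the ego perception range (values follow the DAIR-V2X setting).
    mask = (np.abs(mixed[:, 0]) <= 102.4) & (np.abs(mixed[:, 1]) <= 38.4) \
           & (mixed[:, 2] >= -3.0) & (mixed[:, 2] <= 1.0)
    return mixed[mask]
```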

Feature extraction

After collecting the multi-view data, we feed it into the network for feature extraction. The process of feature extraction for the teacher network is similar to that for the student network, but the input data differ, as shown in formula (6):

$$\begin{aligned} F_{ego}^t = teacher(L_{mix}),F_{ego}^t\in {\mathbb {R}}^{{\bar{H}}\times {\bar{W}}\times {\bar{C}}} \end{aligned}$$
(6)

\(teacher(\bullet )\) represents the feature extraction network of the teacher's point cloud, \(F_{ego}^t\) indicates the feature extracted from the fused point cloud centered on agent ego, and \({\mathbb {R}}^{{\bar{H}}\times {\bar{W}}\times {\bar{C}}}\) represents the same feature space as that of the student network.

Knowledge distillation loss function

During training, the teacher network converges more easily than the student network; therefore, the teacher and student networks can be jointly trained from randomly initialized parameters. In this paper, we jointly train the model using an object detection loss and a distillation loss. The total loss \({{\mathcal {L}}}_{total}\) to be minimized is given in formula (7):

$$\begin{aligned} {{\mathcal {L}}}_{total} = {\lambda }_{det}{{\mathcal {L}}}_{det}+{\lambda }_{KD}{{\mathcal {L}}}_{KD} \end{aligned}$$
(7)

where hyperparameters \({\lambda }_{det}\) and \({\lambda }_{KD}\) control the weights of the object detection loss \({{\mathcal {L}}}_{det}\) and the knowledge distillation loss \({{\mathcal {L}}}_{KD}\), respectively. The object detection loss is shown in formula (8) and comprises a classification loss \({{\mathcal {L}}}_{class}(\bullet )\) and a regression loss \({{\mathcal {L}}}_{reg}(\bullet )\). The classification loss \({{\mathcal {L}}}_{class}(\bullet )\) uses focal loss between the predicted classification \(Y_{class}\) and the ground-truth classification label \({\hat{Y}}_{class}\), which determines whether the object in the detection box is background or a target. The regression loss \({{\mathcal {L}}}_{reg}(\bullet )\) uses the smooth \({\ell }_1\) loss between the predicted regression value \(Y_{reg}\) and the ground-truth regression value \({\hat{Y}}_{reg}\), which determines the detection box's position, size, and heading angle.

$$\begin{aligned} {{\mathcal {L}}}_{det} = {{\mathcal {L}}}_{reg}(Y_{reg},{\hat{Y}}_{reg})+{{\mathcal {L}}}_{class}(Y_{class},{\hat{Y}}_{class}) \end{aligned}$$
(8)
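A hedged sketch of this detection loss follows, using torchvision's sigmoid focal loss and PyTorch's smooth-L1 loss; the anchor-major tensor layout and the focal-loss hyperparameters are assumptions for illustration, not values specified in the paper.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def detection_loss(cls_pred, cls_gt, reg_pred, reg_gt, pos_mask):
    """Focal classification loss + smooth-L1 regression loss (formula 8).

    cls_pred/cls_gt: (B, A, H, W) objectness logits and 0/1 labels
    reg_pred/reg_gt: (B, A*7, H, W) box parameters (x, y, z, w, l, h, theta), anchor-major
    pos_mask:        (B, A, H, W) 1 for anchors matched to a ground-truth box
    """
    l_class = sigmoid_focal_loss(cls_pred, cls_gt.float(),
                                 alpha=0.25, gamma=2.0, reduction="mean")
    reg_mask = pos_mask.repeat_interleave(7, dim=1).bool()   # expand mask to 7 box params
    if reg_mask.any():
        l_reg = F.smooth_l1_loss(reg_pred[reg_mask], reg_gt[reg_mask])
    else:
        l_reg = reg_pred.sum() * 0.0   # no positive anchors in this batch
    return l_reg + l_class
```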

The distillation loss \({{\mathcal {L}}}_{KD}(F_{ego}^t,F_{ego}^s)\) is shown in formula (9), where \(KL(p(x)\,\Vert \,q(x))\) denotes the Kullback-Leibler (KL) divergence, which measures the difference between the distributions p(x) and q(x). \(V(\bullet )\) denotes the operation that retains the channel dimension \({\hat{c}}\) and reduces each \({\hat{H}}\times {\hat{W}}\) feature map to a one-dimensional feature vector. The \(\tau (\bullet )\) function represents the softmax operation with distillation temperature T applied over the \({\hat{c}}\) channels, as shown in formula (10).

$$\begin{aligned} & {{\mathcal {L}}}_{KD}(F_{ego}^t,F_{ego}^s) = KL(\tau (V(F_{ego}^t))||\tau (V(F_{ego}^s))) \end{aligned}$$
(9)
$$\begin{aligned} & \tau (\bullet ) \leftarrow \frac{exp(z_i/T)}{\sum _jexp(z_j/T)} \end{aligned}$$
(10)

where \(z_i\) represents the i-th channel’s characteristics, and the hyperparameter T represents the distillation temperature.
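A minimal sketch of this distillation loss is given below. Since the exact reduction performed by \(V(\bullet )\) is not fully specified, each channel's spatial map is average-pooled to a single value here as an illustrative choice; the temperature softmax and KL divergence follow formulas (9) and (10).

```python
import torch
import torch.nn.functional as F

def kd_loss(teacher_feat, student_feat, T=10.0):
    """Channel-wise KL distillation loss (formulas 9-10), a minimal sketch.

    teacher_feat, student_feat: (B, C, H, W) BEV features from the two branches.
    """
    t = teacher_feat.mean(dim=(2, 3))             # (B, C) per-channel summary, assumed V(.)
    s = student_feat.mean(dim=(2, 3))             # (B, C)
    p_teacher = F.softmax(t / T, dim=1)           # tau(V(F^t)) over channels
    log_q_student = F.log_softmax(s / T, dim=1)   # log tau(V(F^s))
    # KL(teacher || student), matching the order in formula (9)
    return F.kl_div(log_q_student, p_teacher, reduction="batchmean")

# Usage inside the total loss of formula (7):
# loss_total = lambda_det * loss_det + lambda_kd * kd_loss(F_t, F_s, T=10.0)
```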

Discrepancy feature attention fusion module

Existing attention mechanisms, such as cross-attention, focus on common information while neglecting the extraction and utilization of differential information. Additionally, a single Transformer structure cannot fully extract the shared information between vehicles and infrastructures. These limitations result in inefficient fusion of vehicle-to-infrastructure point cloud data (Fig. 5).

Figure 5

Illustration of the Discrepancy Feature Attention Fusion (DFAF) module. The purple color indicates the overall data flow of the DFAF module, the orange color represents the structure of the DIAM module, and the blue color denotes the structure of the AIAM module.

Inspired by ATFusion8, this paper designs DIAM and AIAM, as shown in Fig. 5. The DIAM module computes differential feature information between the ego vehicle and other agents using an attention mechanism, while the AIAM module fuses the common feature information through a cross-attention mechanism.

In the specific implementation, the DIAM module first acquires the common features \(C_{QV}\) shared between the ego vehicle and the other agents. The features Q of the other agents and the features K of the ego vehicle are multiplied (dot product) to obtain an attention matrix of shared responses, which is then multiplied by the ego vehicle's features V to yield the common feature information, as shown in equation (11).

$$\begin{aligned} C_{QV}=softmax \left(\frac{QK^T}{\sqrt{d_K}} \right)V \end{aligned}$$
(11)

where \(d_K\) represents the channel scaling factor. Subsequently, by removing the common information between the two, the differing information \(DI_{QV}\) is obtained, as shown in equation (12).

$$\begin{aligned} DI_{QV}=Linear(V-C_{QV}) \end{aligned}$$
(12)

where \(Linear(\bullet )\) denotes a linear layer that enhances the feature representation capability. The differing information is then injected into the other agents' features Q, as shown in equations (13) and (14).

$$\begin{aligned} F_{add}= & DI_{QV}+Q \end{aligned}$$
(13)
$$\begin{aligned} F_{DIAM}= & MLP(LN(F_{add}))+F_{add} \end{aligned}$$
(14)

Ultimately, the differing information between the two agents is obtained. Subsequently, the cross-attention mechanism within the AIAM module is employed to merge the common features from the two perspectives. The process is similar to that of the DIAM, but without the differing-information acquisition component. The detailed process is as follows:

$$\begin{aligned} C_{QV}= & softmax \left(\frac{QK^T}{\sqrt{d_K}} \right)V \end{aligned}$$
(15)
$$\begin{aligned} F_{add}= & C_{QV}+Q \end{aligned}$$
(16)
$$\begin{aligned} F_{AIAM}= & MLP(LN(F_{add}))+F_{add} \end{aligned}$$
(17)

Finally, the AIAM process is applied once more, generating fused feature information by injecting both the differing and the common features between the vehicle and the infrastructure. Traditional information fusion modules based on the cross-attention mechanism merely concatenate the common features of the two perspectives and lack the capacity to integrate their differing features. As a result, they output a single representation of the vehicle-infrastructure interaction, missing specific details from the other perspective's features, and are therefore unsuitable for fusion in scenarios with pronounced feature differences between the vehicle and the infrastructure. In response, this paper applies the DIAM and AIAM modules within the proposed discrepancy feature attention fusion module, effectively addressing the shortcomings of previous attention mechanisms. The final output is the fused feature \(M_{ego}\), which is fed into the detection head module.
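The following minimal PyTorch sketch implements the DIAM computation of equations (11)-(14), assuming single-head attention over flattened BEV positions with the ego features serving as both K and V; the AIAM branch follows the same pattern without the discrepancy subtraction. The toy feature sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DIAM(nn.Module):
    """Discrepancy Information Attention Module (equations 11-14), a minimal sketch.

    q:  features of the other agent, shape (B, N, d)
    kv: features of the ego agent, shape (B, N, d), used as both K and V
    where N is the number of flattened BEV positions.
    """
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(d, d)                        # Linear() in eq. (12)
        self.norm = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, q, kv):
        attn = torch.softmax(q @ kv.transpose(1, 2) / kv.shape[-1] ** 0.5, dim=-1)
        c_qv = attn @ kv                                   # common features, eq. (11)
        di_qv = self.proj(kv - c_qv)                       # discrepancy information, eq. (12)
        f_add = di_qv + q                                  # inject into the other agent, eq. (13)
        return self.mlp(self.norm(f_add)) + f_add          # eq. (14)

diam = DIAM(d=32)
q_other = torch.randn(1, 64, 32)   # toy number of BEV positions and channels
kv_ego = torch.randn(1, 64, 32)
f_diam = diam(q_other, kv_ego)      # (1, 64, 32)
```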

3D detection head

After obtaining the final fused features \(M_{ego}\), two \(1\times 1\) convolutional layers are employed to generate the classification and regression predictions that form the predicted bounding boxes, as shown in equations (18) and (19):

$$\begin{aligned} Y_{class}= & {\xi }_{class}(M_{ego}) \end{aligned}$$
(18)
$$\begin{aligned} Y_{reg}= & {\xi }_{reg}(M_{ego}) \end{aligned}$$
(19)

where \({\xi }_{class}(\bullet )\) represents the classification layer, and \(Y_{class}\) is a score indicating whether the candidate box is an object or background. \({\xi }_{reg}(\bullet )\) represents the regression layer, and \(Y_{reg}\) has seven dimensions \((x,y,z,w,l,h,\theta )\), where x, y, and z represent the position of the predicted box, w, l, and h represent its size, and \(\theta\) represents its heading angle.
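A minimal sketch of this detection head is shown below; the input channel count and the number of anchors per location are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Two 1x1 convolutional heads applied to the fused BEV feature M_ego (equations 18-19)."""
    def __init__(self, c_in=256, num_anchors=2):
        super().__init__()
        self.cls_head = nn.Conv2d(c_in, num_anchors, kernel_size=1)       # objectness score
        self.reg_head = nn.Conv2d(c_in, num_anchors * 7, kernel_size=1)   # (x, y, z, w, l, h, theta)

    def forward(self, m_ego):
        return self.cls_head(m_ego), self.reg_head(m_ego)

head = DetectionHead()
m_ego = torch.randn(1, 256, 100, 352)            # fused BEV feature (toy sizes)
y_class, y_reg = head(m_ego)                     # (1, 2, 100, 352), (1, 14, 100, 352)
```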

Additionally, to enhance the clarity of the data processing workflow, we provide the pseudocode of the overall CPD-KD algorithm, as illustrated in Algorithm 1. To ensure reproducibility, detailed information on the network architecture of the CPD algorithm is presented in Table 1.

Algorithm 1

CPD-KD process

Table 1 Detailed architectural specifications for CPD-KD.

Experiments

Datasets

Real-world dataset

The DAIR-V2X dataset is the first real-world vehicle-to-infrastructure cooperative perception dataset. The infrastructure unit is equipped with a 300-line solid-state LiDAR, while the vehicle unit features a 40-line rotating LiDAR, both operating at a sampling frequency of 10 Hz. The dataset comprises 6616 frames of cooperative data, randomly divided into training, test, and validation sets in a 7:2:1 ratio: 4500 frames for training, 1500 for testing, and 616 for validation. For annotations, we use the 360-degree detection annotation set supplemented by Lu et al.42, which provides additional external-viewpoint annotations missing from the vehicle-side data. The perception range of the vehicle and infrastructure LiDARs is defined as \(x\in [-102.4,+102.4]\) m, \(y\in [-38.4,+38.4]\) m, and \(z\in [-3,+1]\) m. Owing to the extended detection range of the infrastructure LiDAR, the communication distance between the infrastructure and vehicle units is set to 100 meters.

Simulated dataset

The V2XSet6 dataset is a large-scale cooperative perception simulation dataset generated using CARLA43 and OpenCDA44. It includes 11,447 frames, each containing multiple intelligent agents. The number of intelligent vehicles in each scene varies randomly, yielding a total of 33,081 agent perception samples. The training, validation, and test sets consist of 6,694, 1,820, and 2,833 frames, respectively. The perception range is defined as \(x\in [-140.8,+140.8]\) m, \(y\in [-38.4,+38.4]\) m, and \(z\in [-3,+1]\) m, and the communication distance between any two agents is set to 70 meters; agents beyond this distance are disregarded.

Settings

This paper adopts the standard 3D object detection evaluation protocol, using Average Precision (AP) at Intersection-over-Union (IoU) thresholds of 0.5 and 0.7 as the evaluation metric. Since all algorithms utilize voxel-based feature extraction networks, the dimensions of a single voxel are set to 0.4 m, 0.4 m, and 4 m in length, width, and height, respectively. During training, all comparative methods employ the AdamW optimizer for 60 epochs with an initial learning rate of 0.001, which is multiplied by 0.1 every 20 epochs. The algorithm is implemented on Ubuntu in Python using the PyTorch deep learning framework. All models are trained on a system equipped with an NVIDIA GeForce RTX 3090 GPU and an AMD 5900X CPU.
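The optimizer and learning-rate schedule described above can be reproduced with the following minimal PyTorch sketch; the model object and the per-epoch step are placeholders, not the released implementation.

```python
import torch

model = torch.nn.Linear(10, 10)   # stand-in for the CPD-KD network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(60):
    # Placeholder step; the real loop runs a full pass minimizing the total loss of formula (7).
    loss = model(torch.randn(4, 10)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()   # lr: 1e-3 -> 1e-4 (epoch 20) -> 1e-5 (epoch 40)
```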

Comparison methods

The baseline method is the No Fusion algorithm, which uses only single-vehicle point cloud data. The fusion strategies comprise three types of algorithms. Early fusion aggregates the raw point clouds from surrounding agents. Late fusion collects the perception results from all surrounding agents and applies non-maximum suppression to obtain the final results. The focus of this experiment is on intermediate fusion strategies, evaluating six recent methods: the F-cooper, V2VNet, DiscoNet, OPV2V, V2X-ViT, and SCOPE algorithms. To ensure fairness, the same feature extraction method is used for evaluation, testing both the PointPillar method and the proposed Sparse Pillar method as the foundational backbone.

Table 2 Detection performance comparison on V2XSet and DAIR-V2X. We report average precision (AP) at IoU = 0.5 and 0.7 with the PointPillar and Sparse Pillar backbones, respectively.

Main results

Table 2 presents the comparative experimental results of existing 3D object detection algorithms. Since the proposed Sparse Pillar module is plug-and-play, results are reported with both PointPillar and Sparse Pillar as backbone networks. The experiments indicate that intermediate fusion methods outperform single-vehicle detection and late fusion algorithms, while early fusion strategies, which can access the complete raw point clouds, yield better results than intermediate fusion strategies. At IoU = 0.7, the proposed CPD-KD algorithm achieves the best results: compared to the strongest baseline, SCOPE, it improves accuracy by 1.0% and 2.6% with the PointPillar and Sparse Pillar backbones, respectively. The Sparse Pillar module also offers higher perception accuracy than the PointPillar module: at IoU = 0.7, the results of the F-cooper, DiscoNet, OPV2V, V2X-ViT, and SCOPE methods increase by 5.7%, 3.1%, 9.0%, 3.2%, and 5.5%, respectively.

Distillation temperature comparison

The knowledge distillation framework uses soft training targets. Compared to hard labels from manual annotation, soft target labels contain more information, increasing the variance between different features and enabling the student network to acquire more valuable information from the teacher network. The distillation temperature is a crucial hyperparameter in knowledge distillation. Table 3 shows the model's average precision (AP) under different distillation temperatures. At low temperatures, detection accuracy improves as the temperature increases, and the highest accuracy is achieved at a distillation temperature of 10; beyond this point, further increases in temperature lead to a decline in accuracy. The analysis indicates that as the distillation temperature rises, the class differences between soft targets diminish, hindering effective feature learning. We therefore select an optimal temperature of 10.

In the real-world dataset, the target detection results are illustrated in Fig. 6. This study compares the results across four typical work scenarios using F-Cooper, OPV2V, V2VNet, V2X-ViT, SCOPE, and our proposed CPD-KD algorithm. Green and red boxes represent the actual and predicted results, respectively. Even compared to the optimal algorithm SCOPE, our algorithm maintains an advantage. Overall, our method shows better alignment between predicted and actual boxes compared to other algorithms, demonstrating superior accuracy as seen in Scenes 1 and 2. In more complex scenarios, the advantages of the CPD-KD algorithm are more pronounced. For example, in Scene 3, where the vehicle and infrastructure are at a considerable distance, other algorithms exhibit varying degrees of missed detections within the red circle, whereas our CPD-KD algorithm successfully detects all targets. Scene 4 illustrates a close-range scenario between the vehicle and infrastructure, where our algorithm accurately identifies all targets without any false detections.

Figure 6

The qualitative analysis results. We selected four different intersection scenarios to compare existing algorithms: F-Cooper, OPV2V, V2VNet, V2X-ViT, SCOPE, and CPD-KD. Green boxes indicate ground truth detection boxes, while red boxes indicate algorithm detection boxes.

Model parameter and inference time comparison

Overall performance is determined jointly by the number of model parameters, the inference time, and the prediction accuracy. Figure 7 depicts the model parameter counts and inference times. The inference time of the proposed CPD-KD algorithm is 12 ms, slightly higher than the 8 ms of the OPV2V algorithm but far lower than the 85 ms of the SCOPE algorithm and the 91 ms of the V2X-ViT algorithm, which offer similar accuracy. The algorithm's strong real-time performance is largely due to its small network size: the figure shows that our algorithm uses only 5.8 M parameters, significantly fewer than comparable algorithms. Notably, even the state-of-the-art SCOPE algorithm requires as many as 39.7 M parameters, yet its detection performance remains inferior to that of CPD-KD. This underscores the effectiveness of our detection framework, which achieves superior performance with a much smaller parameter budget.

Figure 7

The figure depicting model parameter count and inference time. The orange line represents the number of model parameters, while the green line indicates the inference time for a single scene.

Table 3 Comparison of AP at different distillation temperatures on V2XSet and DAIR-V2X.

Ablation experiment

Ablation experiments quantify the accuracy improvement contributed by each module of the CPD-KD framework, as shown in Table 4. The baseline is OPV2V, which uses PointPillar-based feature extraction and an intermediate fusion strategy with self-attention. The individual modules SparsePillar and SKD improve AP at IoU = 0.7 by 9% and 7.2%, respectively, confirming that the plug-and-play feature extraction network and the asymmetric distillation architecture significantly enhance perception accuracy. The DFAF module improves accuracy by 9.7%, indicating that even at a high accuracy level, the DFAF fusion module still enhances feature representation compared to the self-attention mechanism. Overall, the proposed method improves over the baseline by 7.8% at IoU = 0.5 and 11.2% at IoU = 0.7. The larger gain at IoU = 0.7 is attributed to the single view acquiring multi-view features, which meets the stricter accuracy requirement. These ablation results validate the effectiveness of each component.

Table 4 Ablation experimental results of different components in the V2XSet and DAIR-V2X dataset.

To further illustrate how each module improves the results, we present intermediate feature maps from the vehicle and infrastructure perspectives in Fig. 8. In Scene 1, the proposed CPD-KD uses the SparsePillar-based SKD point cloud feature extraction module, resulting in clearer vehicle features: the highlighted yellow and orange points are more concentrated than those of traditional algorithms, whose features are more blurred and lead to multiple undetected targets in the final fusion results. In Scene 2, both algorithms detect multiple vehicle targets from the infrastructure perspective at the intersection, but the vehicle perspective fails to detect them; our CPD-KD algorithm ultimately identifies three vehicles in the final detection results, while the SOTA algorithm fails to detect any. This demonstrates that the proposed heterogeneous information attention fusion module, DFAF, effectively retains distinct information when faced with conflicting features from the vehicle and infrastructure perspectives, unlike traditional self-attention mechanisms that discard such differences.

Figure 8

Ablation experiment figure. We selected Scene 1 and Scene 2 to illustrate the intermediate point cloud feature maps from the vehicle and infrastructure perspectives. The object detection results display the final target detection after perspective fusion. Green boxes indicate ground truth detection boxes, while red boxes indicate the algorithm's detection boxes.

Conclusion

In this work, we propose CPD-KD, a novel differential feature fusion cooperative perception algorithm based on knowledge distillation networks. To address target feature ambiguity from the vehicular perspective, we introduce the SKD feature extractor, which injects prior knowledge of point cloud features from the fused viewpoint and thereby mitigates the ambiguity of single-view point cloud features. To tackle the loss of specificity information during feature fusion, we design DFAF, which improves the cooperative efficiency of exploiting vehicle-infrastructure differential information. The effectiveness of the proposed CPD-KD algorithm is validated on both real-world and simulated datasets, with quantitative and qualitative analyses demonstrating its efficacy.