Abstract
The environmental perception of intelligent connected vehicles is a new perception paradigm centered on roadside sensing devices and 5G wireless communication. This approach expands the perception range of onboard sensors from localized observation to a global view, significantly enhancing the accuracy of vehicle perception. Traditional perception algorithms often suffer, on real-world datasets, from blurred target features in the vehicle’s view and from the loss of view-specific information when vehicle and infrastructure features are fused. To address these issues, we propose a Cooperative Perception network for Discrepancy feature fusion through Knowledge Distillation, named CPD-KD. The contributions of this work are threefold. First, we introduce a Sparse convolution-based Knowledge Distillation network (SKD) that alleviates the blurring of single-view point cloud features by incorporating prior knowledge from fused-view point cloud features. Second, we design a Discrepancy Feature Attention Fusion module to improve the cooperative use of discrepancy information between vehicles and infrastructure. Finally, we validate the CPD-KD algorithm on the real-world DAIR-V2X dataset and the simulated V2XSet dataset. Experimental results show that the proposed CPD-KD algorithm effectively enhances the accuracy of vehicle-infrastructure cooperative perception.
Introduction
Intelligent connected vehicles are the cornerstone of intelligent transportation systems, and environmental perception is the first step toward their smooth operation. Existing intelligent connected vehicles rely on sensors installed on the vehicles themselves to perceive the surrounding environment. However, these sensors are prone to missed and false detections in diverse and complex environments, such as near-distance vehicle occlusions, medium-distance infrastructure blind spots, and long-distance small-target detection, which can lead to dangerous situations.
For example, when a vehicle approaches an intersection, its lateral field of view may be obstructed by other vehicles, resulting in significant visual blind spots. Pedestrians emerging from these blind spots are at high risk of severe collisions, commonly referred to as “sudden pedestrian crossing” incidents. In response, cooperative perception systems that exchange data between vehicle and infrastructure sensors over communication devices have become a research hotspot in intelligent transportation systems, as shown in Fig. 1a. Compared to onboard sensors, infrastructure sensors offer several advantages, including reduced occlusion, higher resolution, and a wider field of view. They can provide multi-angle heterogeneous perception data, which significantly enhances the vehicle’s ability to detect obstacles in blind spots and small distant targets. Currently, the perception accuracy of sensors such as LiDAR and cameras continues to improve1, while the communication bandwidth of wireless devices is gradually expanding2,3. The application of these new technologies greatly enhances the practicality of vehicle-infrastructure cooperative perception.
Existing cooperative perception strategies can be categorized into three types based on the content transmitted: early fusion of raw point clouds, intermediate fusion of point cloud features, and late fusion of perception results. To balance the trade-off between transmission bandwidth and perception accuracy, the intermediate fusion strategy, which transmits pre-processed features, has become the mainstream research direction in cooperative perception. Typically, intermediate fusion schemes consist of two parts: feature extraction and data fusion. In the feature extraction phase, point cloud data from both the vehicle and infrastructure ends are fed into CNN-based voxelization methods such as VoxelNet4 or 2D pseudo-image algorithms such as PointPillar5 to extract features for data transmission. In the data fusion phase, attention mechanisms6 or Vision Transformer7 architectures are commonly employed to fuse the heterogeneous vehicle and infrastructure data and ultimately generate 3D bounding boxes.
However, in real-world scenarios, the sparsity and incompleteness of onboard sensor point clouds can lead to bounding boxes that do not accurately cover the vehicle’s contours, as shown in Fig. 1b. In contrast, fused-perspective point clouds that incorporate infrastructure data provide clear and complete vehicle contour information, as illustrated in Fig. 1c. Nonetheless, obtaining fused point clouds requires transmitting a large amount of raw point cloud data, which degrades real-time performance. Furthermore, existing feature fusion mechanisms can only fuse the output features, resulting in inefficient extraction and utilization of the discrepancy information present in both vehicle and infrastructure features.
To address these issues, we propose a discrepancy-information cooperative perception network based on knowledge distillation, which comprises two main modules. To tackle the poor feature extraction from sparse vehicle-side point clouds, a knowledge distillation point cloud feature extractor based on sparse convolution is introduced. During training, the fused-perspective model provides prior feature knowledge to the vehicle model, allowing the point cloud features output by the single-view vehicle model to approximate those of the fused-perspective model and thereby improving the accuracy of the predicted bounding boxes. To address the loss of infrastructure-specific information when fusing heterogeneous vehicle and infrastructure features, and inspired by the infrared and visible image fusion algorithm ATFusion8, a Discrepancy Feature Attention Fusion (DFAF) module is designed to merge heterogeneous features from vehicles and infrastructure. It comprises a Discrepancy Information Attention Module (DIAM) and a Shared Information Attention Mechanism (AIAM), which effectively extract and utilize the heterogeneous discrepancy information between vehicles and infrastructure, thereby improving the efficiency of point cloud information fusion. In summary, the core contributions of this paper are as follows:
1. The introduction of a SKD module that incorporates prior knowledge from fused-perspective point clouds, addressing the feature blurriness of single-view point cloud feature extraction models.
2. The design of a DFAF mechanism for heterogeneous information attention fusion, which resolves the loss of infrastructure-specific information in traditional fusion algorithms and improves the extraction and utilization of discrepancy information between vehicles and infrastructure.
3. Validation on two types of datasets: the real-world DAIR-V2X and the simulated V2XSet. Experimental results demonstrate that the proposed algorithm effectively improves the accuracy of cooperative perception.
Related works
Cooperative object detection
Currently, enhancing the perception performance of individual vehicles using infrastructure devices has garnered increasing attention from researchers. Among the various strategies, intermediate fusion methods have emerged as a popular direction in cooperative perception, effectively balancing scene information loss and transmission bandwidth requirements. For example, the F-cooper9 algorithm utilizes voxel methods to extract point cloud features for transmission and employs the Maxout10 method to fuse the voxel features of vehicles and infrastructure. The OPV2V11 algorithm introduces a self-attention mechanism to learn the interactions of features at the same spatial location. The V2VNet12 algorithm leverages Graph Neural Networks (GNNs)13 to explore the transfer of features among multiple vehicles. Additionally, the V2X-ViT14 algorithm addresses the spatiotemporal misalignment caused by real-world communication delays, utilizing a ViT architecture composed of multi-agent self-attention layers and multi-scale window self-attention layers to fuse features from the vehicle and infrastructure perspectives. The Where2comm15 algorithm establishes a spatial confidence map, allowing agents to share only the spatially sparse information that is crucial for perception tasks. The MPDA16 algorithm introduces the inter-domain gap among different agents for the first time and employs a Transformer architecture to obtain fused features across agents. TransIFF17 designs an instance-level feature transmission algorithm to reduce transmission bandwidth and incorporates a cross-domain adaptation module to mitigate the inter-domain gap among different agents. The Scope18 algorithm enhances the vehicle’s object detection capability by integrating temporal semantic information from preceding and succeeding vehicle frames with key spatial semantic information from surrounding heterogeneous agents. V2VFormer++19 establishes a multimodal vehicle-infrastructure cooperative perception framework, employing a dynamic channel fusion algorithm to merge multimodal data and position-aware fusion (PAF) to fuse vehicle and infrastructure information. Finally, the CoBEVFlow20 algorithm addresses the information mismatch caused by asynchronous communication between agents, compensating for agent motion in the bird’s-eye view to align asynchronous messages among multiple agents.
The aforementioned algorithms enhance detection performance by strengthening interactions between features; however, they overlook the intrinsic feature deficiency of single-view data, which limits the upper bound of fusion-based object detection accuracy. In contrast, the SKD module proposed in this work fundamentally improves the feature extraction capability for point cloud data, while also offering considerable advantages in model complexity and inference speed.
Knowledge distillation
Knowledge distillation networks21,22,23 use the soft targets output by complex but accurate teacher networks to guide the training of streamlined student networks better suited for inference. This method is widely used in semantic segmentation24, object detection25, and object re-identification26. Based on the type of soft target transmitted, existing knowledge distillation networks can be divided into two categories, feature imitation and prediction mimicking, as shown in Fig. 2a, b, respectively. DiscoNet27 employs a feature imitation distillation network, whose smooth output distributions facilitate the student network’s learning of features but constrain the student model to match the teacher model in the latent feature space. In response, this paper utilizes an asymmetric knowledge distillation network, as shown in Fig. 2c. During training, fused-perspective data are easy to acquire, so the teacher network trained on them sees a larger data volume, while the single-view student network requires fewer parameters. In the testing phase, the student network, despite its smaller parameter size, can still extract point cloud feature representations close to those of the fused perspective.
3D point cloud object detection
3D object detection based on LiDAR takes raw point cloud data as input and outputs the position and category of objects. Existing algorithms can be divided into three categories based on the smallest unit of point cloud processing: point-based object detection28,29; 3D voxel-based detection methods30,31,32; and methods based on 2D pseudo-image features33,34,35. Point-based detection algorithms36,37,38,39 aggregate features onto key points and use these features to predict bounding box classification; with their direct point-based architecture, they retain rich point cloud information and generally achieve high accuracy. 3D voxel grid algorithms40,41 divide space into voxels and use sparse 3D convolution to extract features, but 3D convolution has high computational complexity and long inference times. Methods using 2D pseudo-image features convert 3D point clouds into 2D pseudo-images and extract features with low-complexity 2D convolution; however, due to the low information density within 2D pseudo-image grids, feature extraction efficiency is low. In response, this paper designs a plug-and-play network, Sparse Pillar, centered on sparse 2D convolution, which significantly reduces runtime while improving object detection accuracy.
Methodology
The cooperative perception network CPD-KD proposed in this work, as illustrated in Fig. 3, focuses on feature fusion for vehicle-infrastructure discrepancies. It primarily consists of a teacher network trained with fused-perspective data and a student network trained with single-perspective data. The teacher network outputs prior feature information of detected objects to guide the student network in producing high-quality object features. During the vehicle-infrastructure feature fusion stage, traditional attention fusion mechanisms tend to lose vehicle- and infrastructure-specific information. To address this, we propose DFAF, which focuses on the discrepancies between vehicle and infrastructure features, effectively enhancing the cooperative efficiency of object detection information between vehicles and infrastructure. The following sections present the network architecture and data feature processing details.
Pipeline of our proposed CPD-KD. It consists of two networks: a teacher network and a student network. The single arrow solid line and dotted line represent the forward propagation and backward propagation process, respectively. The details of each individual component are illustrated in Section "Methodology".
Student network
Each agent \(i \in \{1,\ldots ,N\}\) within the communication distance has a category \(c_i \in \{V, I\}\), where V and I represent vehicles and infrastructure equipment, respectively. We assume that data transmission is synchronized, so each agent i has a position \(P_i\) and LiDAR data \(L_i\). Assuming that agent ego is selected as the central agent, ego receives the positions of the surrounding agents. The central agent ego can then receive the raw point clouds or features from the surrounding agents through coordinate-system transformation.
Feature extraction
The point cloud feature extraction employs the 2D pillar grid feature extraction algorithm, Sparse Pillar, with its main structure shown in Fig. 4. Using an “Encoder-Neck” architecture from a bird’s eye view (BEV), it retains the traditional 2D pillar grid algorithm while utilizing a sparse convolution method better suited for point cloud features. This approach further reduces inference latency and enhances feature extraction capabilities.
The overall structure consists of three components: a 2D pseudo-image generation module, an encoder, and a bottleneck block (Neck). The Point Feature Net converts the point cloud into a stacked pillar tensor, which is then projected into a 2D pseudo-image of scale H\(\times\) W\(\times\) C, where H and W represent the height and width of the pseudo-image canvas, respectively, and C indicates the number of channels. The 2D pseudo-image is then fed into the encoder, which adopts the VGGNet architecture. Its objective is to extract sparse pillar features of varying depths from the projected sparse 2D pillars and to feed all sparse pillar features of varying scales into the bottleneck block. Since feature compression has already been performed, the sparse pillar features can be fused using standard dense 2D convolutions. The specific procedure is illustrated by formula (1):
where \(student(\bullet )\) represents the student feature extraction network, and \({\mathbb {R}}^{{\bar{H}}\times {\bar{W}}\times {\bar{C}}}\) indicates the scale of the feature space after convolution. SparsePillar employs sparse convolution to reduce computational complexity during the feature extraction stage. In the feature upsampling procedure, dense convolution is used to integrate high-level abstract semantics and fine-grained low-level spatial features, improving the accuracy for large objects. We use different feature extraction structures for the teacher and student models; the structural differences are detailed in the experimental section.
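Based on the definitions above, formula (1) plausibly takes a form along the lines of \(F_i^{s} = student(L_i) \in {\mathbb {R}}^{{\bar{H}}\times {\bar{W}}\times {\bar{C}}}\), i.e., the student network maps agent i’s LiDAR data to a BEV feature in the stated feature space (a reconstruction from the surrounding text, not the authors’ exact notation).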
Compression and decompression
To further reduce the required bandwidth for data transmission, each agent needs to perform compression before data transmission. We use \(1 \times 1\) convolutional layer to compress the features in the channel direction, as shown in formula (2):
where \(Enc_{Com}(\bullet )\) represents the compression function. Formula (3) defines the transmitted data packet \(Data_i\), which contains the compressed feature \(F_i^{s'}\) and the position \(P_{i}\). After other agents receive the compressed feature \(F_i^{s'}\), it is decompressed; the specific process is shown in formula (4):
where \(Dec_{Com}(\bullet )\) represents the decompression function corresponding to the compression process, and the decoded feature space becomes \({\mathbb {R}}^{{\bar{H}}\times {\bar{W}}\times {\bar{C}}}\). The decompressed feature \(F_i^{s''}\) will be transmitted to the feature fusion part.
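As a minimal sketch of this compression/decompression step, the following PyTorch snippet pairs two \(1 \times 1\) convolutions to squeeze and restore the channel dimension; the channel counts and compression ratio are illustrative assumptions, not values reported here.

```python
import torch
import torch.nn as nn

class ChannelCompressor(nn.Module):
    """Sketch of the 1x1-convolution channel compression/decompression
    described in formulas (2)-(4). The ratio is an assumed hyperparameter."""
    def __init__(self, channels: int = 256, ratio: int = 4):
        super().__init__()
        # Enc_Com: squeeze the channel dimension before transmission.
        self.enc = nn.Conv2d(channels, channels // ratio, kernel_size=1)
        # Dec_Com: restore the original channel dimension after reception.
        self.dec = nn.Conv2d(channels // ratio, channels, kernel_size=1)

    def compress(self, feat: torch.Tensor) -> torch.Tensor:
        return self.enc(feat)   # F_i^{s'}

    def decompress(self, feat: torch.Tensor) -> torch.Tensor:
        return self.dec(feat)   # F_i^{s''}

# Usage: an agent compresses its BEV feature, transmits it with its pose,
# and the receiver decompresses it back into the shared feature space.
comp = ChannelCompressor()
f_s = torch.randn(1, 256, 100, 252)          # assumed (N, C, H, W) BEV layout
packet = {"feature": comp.compress(f_s), "pose": torch.eye(4)}
f_rec = comp.decompress(packet["feature"])
```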
Teacher network
Multi-view data fusion
The teacher network takes fused-perspective data as input, so the point clouds must be fused before being fed into the model. The point cloud fusion process is shown in formula (5):
where \(L_{mix}\) represents the fused point cloud, N indicates the number of agents within the transmission range, and \(A(\bullet )\) represents the aggregation of the surrounding point cloud data. To align the fused-view input of the teacher network with the single-view input of the student network, the fused point cloud is cropped and then transformed into the coordinate system centered on agent ego.
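The snippet below is a hedged sketch of this aggregation step \(A(\bullet )\): each agent’s points are transformed into the ego frame, concatenated, and cropped to the ego perception range. The function and argument names are illustrative, and the 4×4 pose convention is an assumption.

```python
import numpy as np

def aggregate_point_clouds(clouds, poses, ego_idx, crop_range):
    """Transform each agent's LiDAR points into the ego frame, concatenate
    them, and crop to the ego perception range (sketch of formula (5))."""
    ego_inv = np.linalg.inv(poses[ego_idx])          # world -> ego transform
    fused = []
    for pts, pose in zip(clouds, poses):
        homo = np.hstack([pts[:, :3], np.ones((len(pts), 1))])   # (N, 4)
        in_ego = (ego_inv @ pose @ homo.T).T[:, :3]              # agent -> ego
        fused.append(in_ego)
    fused = np.vstack(fused)
    (x0, x1), (y0, y1), (z0, z1) = crop_range
    mask = ((fused[:, 0] >= x0) & (fused[:, 0] <= x1) &
            (fused[:, 1] >= y0) & (fused[:, 1] <= y1) &
            (fused[:, 2] >= z0) & (fused[:, 2] <= z1))
    return fused[mask]
```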
Feature extraction
After collecting the multi-view data, we feed it into the network for feature extraction. The process of feature extraction for the teacher network is similar to that for the student network, but the input data differ, as shown in formula (6):
where \(teacher(\bullet )\) represents the teacher’s point cloud feature extraction network, \(F_{ego}^t\) denotes the feature extracted from the fused point cloud centered on agent ego, and \({\mathbb {R}}^{{\bar{H}}\times {\bar{W}}\times {\bar{C}}}\) represents the same feature space as that of the student network.
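From these definitions, formula (6) plausibly reads \(F_{ego}^{t} = teacher(L_{mix}) \in {\mathbb {R}}^{{\bar{H}}\times {\bar{W}}\times {\bar{C}}}\), again a reconstruction from the surrounding text rather than the authors’ exact equation.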
Knowledge distillation loss function
During training, the teacher network converges more easily than the student network; therefore, the teacher and student networks can be jointly trained from randomly initialized parameters. In this paper, we jointly train the model using an object detection loss and a distillation loss. The total loss \({{\mathcal {L}}}_{total}\) to be minimized is shown in formula (7):
where the hyperparameters \({\lambda }_{det}\) and \({\lambda }_{KD}\) control the weights of the object detection loss \({{\mathcal {L}}}_{det}\) and the knowledge distillation loss \({{\mathcal {L}}}_{KD}\), respectively. The object detection loss is shown in formula (15) and includes a classification loss \({{\mathcal {L}}}_{class}(\bullet )\) and a regression loss \({{\mathcal {L}}}_{reg}(\bullet )\). The classification loss \({{\mathcal {L}}}_{class}(\bullet )\) uses focal loss to compare the classification output \(Y_{class}\) with the label classification value \({\hat{Y}}_{class}\) and judges whether the object in a detection box is background or target. The regression loss \({{\mathcal {L}}}_{reg}(\bullet )\) uses the smooth \({\ell }_1\) loss to compare the regression value \(Y_{reg}\) with the label regression value \({\hat{Y}}_{reg}\), which determines the detection box’s position, size, and heading angle.
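For illustration, a minimal PyTorch sketch of this detection loss is given below; it assumes torchvision’s focal loss implementation and a boolean positive-anchor mask, neither of which is specified in the text.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def detection_loss(y_class, y_class_gt, y_reg, y_reg_gt, pos_mask):
    """Sketch of L_det: focal loss for object/background classification plus
    smooth-L1 loss on the 7-dim box regression over positive anchors."""
    cls_loss = sigmoid_focal_loss(y_class, y_class_gt, reduction="mean")
    reg_loss = F.smooth_l1_loss(y_reg[pos_mask], y_reg_gt[pos_mask])
    return cls_loss + reg_loss
```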
The distillation loss \({{\mathcal {L}}}_{KD}(F_{ego}^t,F_{ego}^s)\) is shown in formula (9), where \(KL(p(x)\,\Vert \,q(x))\) represents the Kullback-Leibler (KL) divergence, which describes the difference between the distributions p(x) and q(x). \(V(\bullet )\) denotes retaining the channel dimension \({\hat{c}}\) while flattening each \({\hat{H}}\times {\hat{W}}\) feature map into a one-dimensional vector. The \(\tau (\bullet )\) function represents the softmax operation with distillation temperature T taken over the \({\hat{c}}\) channels, as shown in formula (10):
where \(z_i\) represents the i-th channel’s characteristics, and the hyperparameter T represents the distillation temperature.
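The following PyTorch sketch illustrates formulas (7), (9), and (10) under the assumptions stated in the comments; in particular, the normalization axis and the \(T^2\) rescaling are inferred from common knowledge distillation practice rather than spelled out in the text.

```python
import torch
import torch.nn.functional as F

def kd_loss(f_teacher: torch.Tensor, f_student: torch.Tensor,
            temperature: float = 10.0) -> torch.Tensor:
    """Sketch of the distillation loss: flatten each (H, W) map per channel
    as V(.), apply the temperature softmax tau(.) over channels, and take
    KL(teacher || student)."""
    n, c, h, w = f_teacher.shape
    t = f_teacher.reshape(n, c, h * w)
    s = f_student.reshape(n, c, h * w)
    p = F.softmax(t / temperature, dim=1)          # teacher soft targets
    log_q = F.log_softmax(s / temperature, dim=1)  # student log-probabilities
    # T^2 rescaling follows standard KD practice (an assumption here).
    return F.kl_div(log_q, p, reduction="batchmean") * temperature ** 2

def total_loss(l_det: torch.Tensor, l_kd: torch.Tensor,
               lam_det: float = 1.0, lam_kd: float = 1.0) -> torch.Tensor:
    # Formula (7): weighted sum of detection and distillation losses.
    return lam_det * l_det + lam_kd * l_kd
```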
Discrepancy feature attention fusion module
Existing attention mechanisms, such as cross-attention, focus on common information while neglecting the extraction and utilization of differential information. Additionally, a single Transformer structure cannot fully extract the shared information between vehicles and infrastructures. These limitations result in inefficient fusion of vehicle-to-infrastructure point cloud data (Fig. 5).
Inspired by8, this paper designs DIAM and AIAM, as shown in Fig. 5. The DIAM module computes the discrepancy feature information between the ego vehicle and the other agents using attention mechanisms, while the AIAM module fuses the common feature information through a cross-attention mechanism.
In the specific implementation, the DIAM module first acquires the common features \(C_{QV}\) shared between the ego vehicle and the other agents. The features Q of the other agents and the features K of the ego vehicle are combined via a dot product to obtain a common-feature attention matrix; multiplying this matrix by the ego vehicle’s features V yields the common feature information, as shown in equation (11).
where the scaling factor in equation (11) denotes the channel ratio factor. Subsequently, by eliminating the common information between the two, the discrepancy information \(DI_{QV}\) is obtained, as shown in equation (12).
where \(Linear(\bullet )\) denotes a linear layer that enhances the feature representation capability. Subsequently, the discrepancy information is injected into the other agents’ features Q, as shown in equations (13) and (14).
Ultimately, the discrepancy information between the two agents is obtained. Subsequently, the cross-attention mechanism within the AIAM module is employed to merge the common features of the two feature maps. The specific process is similar to that of DIAM, with the discrepancy-information acquisition component removed. The detailed process is as follows:
Finally, the AIAM process is applied once more, generating the fused feature by injecting both the discrepancy and the common features between the vehicle and the infrastructure. Traditional fusion modules based on the cross-attention mechanism merely concatenate the common features of the two inputs and often lack the capacity to integrate the discrepancy features between the vehicle and the infrastructure. As a result, they output a single vehicle-infrastructure representation that misses specific details from the other modality’s features and is unsuitable for fusion in scenarios with pronounced vehicle-infrastructure feature differences. In response, this paper applies the DIAM and AIAM modules within the proposed discrepancy feature attention fusion module, effectively addressing the shortcomings of previous attention mechanisms. The final output is the fused feature \(M_{ego}\), which is then fed into the detection head module.
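A hedged sketch of the DIAM step described above follows; because equations (11)-(14) are not reproduced here, the scaling constant and the final injection rule are assumptions, and the class is illustrative rather than the authors’ implementation.

```python
import torch
import torch.nn as nn

class DIAM(nn.Module):
    """Sketch of the Discrepancy Information Attention Module: a scaled
    dot-product between the other agent's Q and the ego K selects the common
    feature C_QV from the ego V; subtracting it and passing the result
    through a linear layer yields the discrepancy DI_QV, which is injected
    back into Q."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.scale = dim ** -0.5          # assumed channel ratio factor
        self.linear = nn.Linear(dim, dim)

    def forward(self, q_other: torch.Tensor, kv_ego: torch.Tensor):
        # q_other, kv_ego: (B, HW, C) flattened BEV features.
        attn = torch.softmax(q_other @ kv_ego.transpose(1, 2) * self.scale, dim=-1)
        c_qv = attn @ kv_ego                   # common information (eq. 11)
        di_qv = self.linear(q_other - c_qv)    # discrepancy information (eq. 12)
        return q_other + di_qv                 # assumed injection (eqs. 13-14)
```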
3D detection head
After obtaining the final fused feature \(M_{ego}\), two \(1\times 1\) convolutional layers are employed to generate the classification and regression predictions, forming the predicted bounding boxes as shown in equations (18) and (19):
where \({\xi }_{class}(\bullet )\) represents the classification layer, and \(Y_{class}\) outputs a score indicating whether a candidate box is an object or background. \({\xi }_{reg}(\bullet )\) represents the regression layer, and \(Y_{reg}\) outputs seven dimensions \((x,y,z,w,l,h,\theta )\), where x, y, and z represent the position of the predicted box, w, l, and h represent its size, and \(\theta\) represents its heading angle.
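A minimal sketch of this head is shown below; the channel width and the number of anchors per location are assumptions not stated in the text.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of equations (18)-(19): two 1x1 convolutions map the fused
    feature M_ego to a classification score and a 7-dim box regression
    (x, y, z, w, l, h, theta) per anchor."""
    def __init__(self, channels: int = 256, num_anchors: int = 2):
        super().__init__()
        self.cls_head = nn.Conv2d(channels, num_anchors, kernel_size=1)
        self.reg_head = nn.Conv2d(channels, num_anchors * 7, kernel_size=1)

    def forward(self, m_ego: torch.Tensor):
        y_class = self.cls_head(m_ego)   # object vs. background score
        y_reg = self.reg_head(m_ego)     # box parameters per anchor
        return y_class, y_reg
```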
Additionally, to enhance the clarity of the data processing workflow, we provide the pseudocode of the overall CPD-KD algorithm, as illustrated in Algorithm 1. To ensure reproducibility, detailed information on the network architecture of the CPD algorithm is presented in Table 1.
Experiments
Datasets
Real-world dataset
The DAIR-V2X dataset is the first real-world vehicle-to-infrastructure cooperative perception dataset. The infrastructure unit is equipped with a 300-line solid-state LiDAR, while the vehicle unit features a 40-line rotating LiDAR, both operating at a sampling frequency of 10 Hz. The dataset comprises a total of 6616 frames of cooperative data, randomly divided into training, test, and validation sets in an approximate 7:2:1 ratio: 4500 frames for training, 1500 frames for testing, and 616 frames for validation. For annotations, we use the 360-degree detection annotation set supplemented by Lu et al.42, which provides additional external-viewpoint annotations missing from the vehicle-side data. The perception range of the vehicle and infrastructure LiDARs is defined as \(x\in [-102.4,+102.4]\) m, \(y\in [-38.4,+38.4]\) m, and \(z\in [-3,+1]\) m. Because of the extended detection range of the infrastructure LiDAR, the communication distance between the infrastructure and vehicle units is set to 100 meters.
Simulated dataset
The V2XSet6 dataset is a large-scale cooperative perception dataset generated from simulated scenarios using CARLA43 and OpenCDA44. The dataset includes 11,447 frames, each containing multiple intelligent agents; the number of agents per scene varies randomly, yielding a total of 33,081 agent perception samples. The training, validation, and test sets consist of 6,694, 1,820, and 2,833 frames, respectively. The perception range is defined as \(x\in [-140.8,+140.8]\) m, \(y\in [-38.4,+38.4]\) m, and \(z\in [-3,+1]\) m, and the communication distance between any two agents is set to 70 meters; agents beyond this distance are disregarded.
Settings
This paper adopts the standard evaluation protocol for 3D object detection, reporting Average Precision (AP) at Intersection-over-Union (IoU) thresholds of 0.5 and 0.7. Since all algorithms utilize voxel-based feature extraction networks, the dimensions of a single voxel are set to 0.4 m, 0.4 m, and 4 m for length, width, and height, respectively. During training, all comparative methods employ the AdamW optimizer, training for 60 epochs with an initial learning rate of 0.001, which is multiplied by 0.1 every 20 epochs. The algorithm is implemented on the Ubuntu operating system in Python, using the PyTorch deep learning framework for network training. All models are trained on a system equipped with an NVIDIA GeForce RTX 3090 GPU and an AMD 5900X CPU.
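A minimal sketch of this optimization schedule is given below; the model here is a placeholder standing in for the CPD-KD network, and the dummy loss exists only to make the snippet self-contained.

```python
import torch
import torch.nn as nn

# Placeholder model; in practice this is the CPD-KD network.
model = nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# Learning rate multiplied by 0.1 every 20 epochs, as described above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(60):
    # ... iterate over DAIR-V2X / V2XSet batches and compute L_total, then:
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).pow(2).mean()  # dummy loss for the sketch
    loss.backward()
    optimizer.step()
    scheduler.step()
```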
Comparison methods
The baseline method is the No Fusion algorithm, which uses only single-vehicle point cloud data. The fusion strategies cover three types of algorithms. Early fusion aggregates the raw point clouds from surrounding agents. Late fusion acquires the perception results from all surrounding agents and uses non-maximum suppression to obtain the final results. The focus of this experiment is on intermediate fusion strategies, evaluating six recent methods: F-cooper, V2VNet, DiscoNet, OPV2V, V2X-ViT, and SCOPE. To ensure fairness, the same feature extraction method is used for evaluation, testing both the PointPillar method and the proposed SparsePillar method as the backbone.
Main results
Table 2 presents the comparative experimental results of existing 3D object detection algorithms. Since the proposed Sparse Pillar module is plug-and-play, results are shown with both PointPillar and Sparse Pillar as the backbone network. The experiments indicate that intermediate fusion methods outperform single-vehicle detection and late fusion algorithms, while early fusion strategies, which access the complete raw point clouds, yield better results than intermediate fusion strategies. At IoU = 0.7, the proposed CPD-KD algorithm achieves the best results: compared to the strongest baseline, SCOPE, it improves accuracy by 1.0% and 2.6% with the PointPillar and Sparse Pillar backbones, respectively. The Sparse Pillar module also offers higher perception accuracy than the PointPillar module; at IoU = 0.7, the results for the F-cooper, DiscoNet, OPV2V, V2X-ViT, and SCOPE methods increase by 5.7%, 3.1%, 9.0%, 3.2%, and 5.5%, respectively.
Distillation temperature comparison
The knowledge distillation framework uses soft training targets. Compared to hard target labels from manual annotations, soft target labels contain more information, increasing the variance between different features and enabling the student network to acquire more valuable information from the teacher network. The distillation temperature is a crucial hyperparameter in knowledge distillation. Table 3 shows the model’s average precision (AP) under different distillation temperatures. When the distillation temperature is low, detection accuracy improves as the temperature increases; the highest accuracy is achieved at a distillation temperature of 10, beyond which further increases lead to a decline in accuracy. The analysis indicates that as the distillation temperature rises, the class differences between soft targets diminish, hindering effective feature learning. We therefore select 10 as the optimal temperature.
In the real-world dataset, the target detection results are illustrated in Fig. 6. This study compares the results across four typical work scenarios using F-Cooper, OPV2V, V2VNet, V2X-ViT, SCOPE, and our proposed CPD-KD algorithm. Green and red boxes represent the actual and predicted results, respectively. Even compared to the optimal algorithm SCOPE, our algorithm maintains an advantage. Overall, our method shows better alignment between predicted and actual boxes compared to other algorithms, demonstrating superior accuracy as seen in Scenes 1 and 2. In more complex scenarios, the advantages of the CPD-KD algorithm are more pronounced. For example, in Scene 3, where the vehicle and infrastructure are at a considerable distance, other algorithms exhibit varying degrees of missed detections within the red circle, whereas our CPD-KD algorithm successfully detects all targets. Scene 4 illustrates a close-range scenario between the vehicle and infrastructure, where our algorithm accurately identifies all targets without any false detections.
Model parameter and inference time comparison
Performance is determined jointly by the number of model parameters, the inference time, and the prediction accuracy. Figure 7 depicts the relationship among model parameters, inference time, and accuracy. The inference time of the proposed CPD-KD algorithm is 12 ms, slightly larger than the 8 ms of the OPV2V algorithm but much lower than the 85 ms of the SCOPE algorithm and the 91 ms of the V2X-ViT algorithm, which offer similar performance. The algorithm’s high real-time performance is largely due to its small network parameter size: as the figure shows, our algorithm uses only 5.8 MB of parameters, significantly fewer than comparable algorithms. Notably, even the state-of-the-art SCOPE algorithm has a parameter volume as high as 39.7 M, yet its detection performance remains inferior to that of CPD-KD. This underscores the effectiveness of our detection framework, which achieves superior performance despite the reduced parameter volume.
Ablation experiment
Ablation experiments demonstrate the accuracy improvements contributed by each module of the CPD-KD framework, as shown in Table 4. The baseline comparison method is OPV2V, utilizing a PointPillar-based feature extraction method with an intermediate fusion strategy using self-attention. The individual modules, SparsePillar and SKD, improve AP by 9% and 7.2% at IoU = 0.7, respectively. The results confirm that the plug-and-play feature extraction network and asymmetric distillation architecture significantly enhance perception accuracy. The DFAF module improves accuracy by 9.7%, indicating that even with high perception accuracy, the DFAF fusion module can still enhance feature representation compared to the self-attention mechanism. Overall, the proposed method improves over the baseline by 7.8% at IoU = 0.5 and 11.2% at IoU = 0.7. The greater improvement at IoU = 0.7 is attributed to the acquisition of multi-view features from a single view, meeting higher accuracy requirements. These ablation results validate the effectiveness of each component.
To further demonstrate the mechanism by which each module enhances results, we present intermediate feature maps from different vehicle and infrastructure perspectives, as shown in Fig. 8. In Scene 1, our proposed CPD-KD utilizes the SKD point cloud feature extraction module based on SparsePillar, resulting in clearer vehicle features within the scene. The highlighted yellow and orange points are more concentrated compared to traditional algorithms, which display more blurred features. This leads to multiple undetected targets in the final fusion results. In Scene 2, both algorithms detect multiple vehicle targets from the infrastructure perspective at the intersection. However, since the vehicle perspective failed to detect them, our CPD-KD algorithm ultimately identifies three vehicles in the final detection results, while the SOTA algorithm fails to detect any. This demonstrates that our proposed heterogeneous information attention fusion module, DFAF, effectively retains distinct information when faced with conflicting features from vehicle and infrastructure perspectives, unlike traditional self-attention mechanisms that discard such differences (Fig. 8).
Ablation experiment figure. We selected Scene 1 and Scene 2 to illustrate the vehicle-view and infrastructure-view feature maps, showcasing the intermediate point cloud features from both perspectives. The object detection results display the final target detection after perspective fusion. Green boxes indicate ground truth detection boxes, while red boxes indicate the algorithm’s detection boxes.
Conclusion
In this work, we propose CPD-KD, a novel discrepancy feature fusion cooperative perception algorithm based on knowledge distillation networks. To address the ambiguity of target features from the vehicular perspective, we introduce the SKD feature extractor, which incorporates prior point cloud feature knowledge from the fused viewpoint, thereby mitigating the ambiguity of single-view point cloud features. To tackle the loss of specific information during feature fusion, we design DFAF, which enhances the cooperative use of vehicle-infrastructure discrepancy information. The effectiveness of the proposed CPD-KD algorithm is validated on both real-world and simulated datasets, with quantitative and qualitative analyses demonstrating its efficacy.
Data availability
The V2XSet dataset used in this study can be accessed here: https://ucla.app.box.com/v/UCLA-MobilityLab-V2XVIT/folder/280422450461. The DAIR-V2X dataset used in this study can be accessed here: https://drive.google.com/drive/folders/1gnrw5llXAIxuB9sEKKCm6xTaJ5HQAw2e.
References
Tobin, R., Halimi, A., McCarthy, A., Soan, P. J. & Buller, G. S. Robust real-time 3d imaging of moving scenes through atmospheric obscurant using single-photon lidar. Sci. Rep. 11, 11236 (2021).
Si, H. et al. Unsupervised localization toward crowdsourced trajectory data: A deep reinforcement learning approach. IEEE Trans. Wireless Commun. https://doi.org/10.1109/TWC.2025.3563766 (2025).
Guo, X. et al. Automated valet parking and charging: A novel collaborative ai-empowered architecture. IEEE Commun. Mag. 63, 131–137 (2025).
Zhou, Y. & Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 4490–4499 (2018).
Lang, A. H. et al. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 12697–12705 (2019).
Vaswani, A. Attention is all you need. Advances in Neural Information Processing Systems (2017).
Dong, X. et al. Cswin transformer: A general vision transformer backbone with cross-shaped windows. 12124–12134 (2022).
Jian, L. et al. Rethinking cross-attention for infrared and visible image fusion. arXiv preprint arXiv:2401.11675 (2024).
Chen, Q. et al. F-cooper: Feature based cooperative perception for autonomous vehicle edge computing system using 3d point clouds. 88–100 (2019).
Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A. & Bengio, Y. Maxout networks. 1319–1327 (PMLR, 2013).
Xu, R. et al. Opv2v: An open benchmark dataset and fusion pipeline for perception with vehicle-to-vehicle communication. 2583–2589 (2022).
Wang, T.-H. et al. V2vnet: Vehicle-to-Vehicle Communication for Joint Perception and Prediction 605–621 (Springer, 2020).
Schlichtkrull, M. et al. Modeling Relational Data with Graph Convolutional Networks 593–607 (Springer, 2018).
Xu, R. et al. V2x-vit: Vehicle-to-Everything Cooperative Perception with Vision Transformer 107–124 (Springer, 2022).
Hu, Y., Fang, S., Lei, Z., Zhong, Y. & Chen, S. Where2comm: Communication-efficient collaborative perception via spatial confidence maps. Adv. Neural. Inf. Process. Syst. 35, 4874–4886 (2022).
Xu, R., Li, J., Dong, X., Yu, H. & Ma, J. Bridging the domain gap for multi-agent perception. 6035–6042 (IEEE, 2023).
Chen, Z., Shi, Y. & Jia, J. Transiff: An instance-level feature fusion framework for vehicle-infrastructure cooperative 3d detection with transformers. 18205–18214 (2023).
Yin, H. et al. V2VFormer++: Multi-modal vehicle-to-vehicle cooperative perception via global-local transformer. IEEE Trans. Intell. Transp. Syst. 25, 2153 (2023).
Wei, S. et al. Asynchrony-robust collaborative perception via bird’s eye view flow. Adv. Neural Inf. Process. Syst. 36, 28462 (2024).
Yang, K. et al. Spatio-temporal domain awareness for multi-agent collaborative perception. 23383–23392 (2023).
Anil, R. et al. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235 (2018).
Hinton, G. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
Romero, A. et al. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550 (2014).
Liu, Y. et al. Structured knowledge distillation for semantic segmentation. 2604–2613 (2019).
Wang, Y., Fathi, A., Wu, J., Funkhouser, T. & Solomon, J. Multi-frame to single-frame: Knowledge distillation for 3d object detection. arXiv:2009.11859 (2020).
Jin, X., Lan, C., Zeng, W. & Chen, Z. Uncertainty-aware multi-shot knowledge distillation for image-based object re-identification. Proc. Conf. Artif. Intell. 34, 11165–11172 (2020).
Li, Y. et al. Learning distilled collaboration graph for multi-agent perception. Adv. Neural. Inf. Process. Syst. 34, 29541–29552 (2021).
Qi, C. R., Su, H., Mo, K. & Guibas, L. J. Pointnet: Deep learning on point sets for 3d classification and segmentation. 652–660 (2017).
Qi, C. R., Yi, L., Su, H. & Guibas, L. J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems 30 (2017).
Le, T. & Duan, Y. Pointgrid: A deep network for 3d shape understanding. 9204–9214 (2018).
Maturana, D. & Scherer, S. Voxnet: A 3d convolutional neural network for real-time object recognition. 922–928 (IEEE, 2015).
Yan, Y., Mao, Y. & Li, B. Second: Sparsely embedded convolutional detection. Sensors 18, 3337 (2018).
Wang, B., An, J. & Cao, J. Voxel-fpn: Multi-scale voxel feature aggregation in 3d object detection from point clouds. arXiv:1907.05286 (2019).
Ye, M., Xu, S. & Cao, T. Hvnet: Hybrid voxel network for lidar based 3d object detection. 1631–1640 (2020).
Noh, J., Lee, S. & Ham, B. Hvpr: Hybrid voxel-point representation for single-stage 3d object detection. 14605–14614 (2021).
Shi, S. et al. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. 10529–10538 (2020).
Shi, S., Wang, X. & Li, H. Pointrcnn: 3d object proposal generation and detection from point cloud. 770–779 (2019).
Yang, Z., Sun, Y., Liu, S. & Jia, J. 3dssd: Point-based 3d single stage object detector. 11040–11048 (2020).
Yang, Z., Sun, Y., Liu, S., Shen, X. & Jia, J. Std: Sparse-to-dense 3d object detector for point cloud. 1951–1960 (2019).
Deng, J. et al. Voxel r-cnn: Towards high performance voxel-based 3d object detection. 35, 1201–1209 (2021).
He, C., Zeng, H., Huang, J., Hua, X.-S. & Zhang, L. Structure aware single-stage 3d object detection from point cloud. 11873–11882 (2020).
Lu, Y. et al. Robust collaborative 3d object detection in presence of pose errors. 4812–4818 (IEEE, 2023).
Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A. & Koltun, V. Carla: An open urban driving simulator. 1–16 (PMLR, 2017).
Xu, R. et al. Opencda: An open cooperative driving automation framework integrated with co-simulation. 1155–1162 (IEEE, 2021).
Funding
This work was supported in part by the National Natural Science Foundation of China under Grants 52472433 and 52225212.
Author information
Authors and Affiliations
Contributions
C.H.: Contributed to the conceptualization and design of the CPD-KD algorithm, focusing on the integration of knowledge distillation techniques for feature extraction and the overall framework of the cooperative perception network. H.W.: Served as the lead author, overseeing the research direction, methodology development, and coordination of the team. He was instrumental in the validation of the algorithm through extensive experiments and data analysis. T.L.: Assisted in the implementation of the Discrepancy Feature Attention Fusion (DFAF) mechanism and contributed to the optimization of the algorithm, enhancing its performance in real-world scenarios. S.Z.: Played a key role in data collection and preprocessing, ensuring the robustness of the datasets used for validation. He also contributed to the experimental setup and evaluation of the results. L.C.: Conducted comparative analysis with existing methods, providing critical insights into the performance improvements offered by the CPD-KD algorithm. His contributions were essential for contextualizing the findings within the broader field of intelligent transportation systems. Y.C.: Contributed to the writing and revision of the manuscript, ensuring clarity and coherence in presenting the research findings. He also assisted in the literature review, situating the work within the current state of research in cooperative perception strategies.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
He, C., Wang, H., Luo, T. et al. CPD-KD: a cooperative perception network for discrepancy feature fusion through knowledge distillation. Sci Rep 15, 27258 (2025). https://doi.org/10.1038/s41598-025-08482-5