Abstract
With the advance of the industrial Internet of Things, automated environments such as smart cities are foreseeable. Vehicle re-identification (ReID) aims to recall images of the same vehicle from large-scale image datasets. The main challenges of vehicle ReID are intra-identity differences caused by different views of the same vehicle and inter-identities similarity caused by the same view of similar vehicles. To deal with these challenges, we propose a novel Transformer-based Vehicle ReID model with View Information (TVRVI). First, we annotate and divide vehicle images into five different view parts and train a parsing network with view information to segment images and output their corresponding view labels. Second, we design a dual-branch transformer-based network to extract features of different views separately and reduce their entanglement. With the help of view information, the local branch of the transformer network learns fine-grained representations of different views, while the overall feature of the whole image is learned in the global branch. By comparing features from the same view and leaving out features of different views, TVRVI makes good use of common-view features and avoids interference from uncommon-view features, which helps reduce intra-identity differences and inter-identities similarities between vehicle images. Experiments on four public vehicle datasets show that our method achieves state-of-the-art results, and ablation studies are specifically designed to verify the effectiveness of view information.
Introduction
With the advance of technology, as well as social and environmental incentives, vehicles are gradually being recognized as intelligent industrial Internet of Things devices in smart cities. Vehicle re-identification (ReID) plays an increasingly important role in smart transportation, travel time estimation, vehicle behavior modeling, and criminal investigation. Given a query image, vehicle ReID tries to recall all the images of a particular vehicle identity from large-scale image datasets. Different from person identities in person re-identification1,2, vehicle identities are factory-line products, so the intra-identity difference and inter-identities similarity challenges have not been well solved. As can be seen in Fig. 1, which is selected from the VeRi-776 dataset3, images 1 and 3 look more like the same car than images 1 and 2, even though images 1 and 2 both show vehicle 0742. The most direct solution is to use license plates for ReID3,4,5, but license plates are invisible in some images and are not always accessible for privacy reasons, as presented in Fig. 1. Therefore, vehicle ReID methods need to focus on other image features.
To tackle these challenges in vehicle ReID, a variety of methods based on image features other than license plates have been proposed. For example, Qian et al. conducted a systematic review on transformer-based vehicle ReID and discussed its applications, challenges, and evaluation strategies6. Kishore et al. integrated vehicle pose estimation to enhance feature extraction and the classification of vehicle attributes such as color and type, making it invariant to camera viewpoints7. Qian et al. used the transformer to combine deep feature-based methods with attention-based methods, aiming to improve the accuracy of vehicle re-identification8. Li et al. combined the advantages of CNNs and transformers to build a transformer-based vehicle-graph ReID model9. Qian et al. proposed an Unstructured Feature Decoupling Network (UFDN) for vehicle ReID10, which addresses the misalignment of features caused by pose and viewpoint variances without the need for additional manual annotation. Lu et al. designed a Mask-Aware Reasoning Transformer (MART) for vehicle re-identification11. These methods all try to learn view-specific features based on the transformer model, which is the most popular approach in recent research. Besides transformers, there are also many other deep learning methods for vehicle ReID. Khorramshahi et al. designed a self-supervised residual generation network based on an encoder-decoder architecture for vehicle ReID12. Meanwhile, Cui et al. fused features from multiple deep convolutional neural networks (DCNNs) with different objectives13. Peng et al. eliminated cross-camera bias through an image-to-image translation model14, and Yao et al. implemented a data augmentation method by simulating content-consistent vehicle datasets through a graphics engine15. These approaches mainly focus on the global features of images.
Vehicle ReID is essentially an image retrieval problem, and fine-grained features are vital in the ReID procedure in addition to global features. Wang et al. annotated 20 vehicle landmarks covering most corners of the vehicle body lines and aggregated them within a deep learning framework16. Khorramshahi et al. used a two-stack hourglass network to classify the view location of the landmarks and devised a dual-path model with adaptive attention covering those 20 landmarks, but did not include any other discriminative features17. Liu et al. divided vehicle images into three overlapping regions (top, middle, and bottom) and made use of the model, color, maker, etc., to obtain a region-aware deep model for identifying vehicles18. Zhou et al. introduced a GAN mechanism that can infer the invisible views of input images19. Bai et al. designed an "odd-one-out" adversarial method to learn discriminative view-independent features; this scheme uses two images from the same view and one image from a different view20. He et al. made use of three near-duplicate vehicle parts (window, headlight, brand) for ReID; this method suits front and back views rather than side views21. Chen et al. and Meng et al. divided vehicles according to camera views (front, back, side, etc.) and employed view-aware masks to learn local features, but features of different views remain entangled. Besides the global feature of vehicle images, these methods go a step further into local features22,23.
With the development of intelligent manufacturing, it becomes possible to use aerial devices such as unmanned aerial vehicles (UAVs) to monitor and track vehicles in addition to fixed surveillance cameras24,25. Compared with traditional fixed roadside cameras, UAVs are more flexible, and aerial datasets like VRAI24 have more view variance and wider resolution ranges, which poses a greater challenge for vehicle ReID. When views change, entangled local features from uncommon views between images may also cause interference during the ReID procedure.
All of the above methods are based on convolutional neural network (CNN) structures. Recently, the transformer has been widely used in computer vision tasks26. Previous CNN-based networks introduce image-specific inductive biases such as translation equivariance and locality, whereas vision transformers (ViT) learn the relevant patterns directly from data. Research has shown that convolutional inductive biases are beneficial for smaller datasets; however, learning directly from data proves to be sufficient and often more advantageous26. In addition, some methods have been proposed to reduce the reliance of ViT on large-scale datasets27 and to reduce its computational complexity28. Meanwhile, thanks to the wide deployment of Industrial Internet of Things (IIoT) devices, vehicle ReID datasets are usually large3,24,29,30,31 and are therefore adequate for downstream ViT tasks.
With these observations, we design a novel transformer-based model with view information (TVRVI) for vehicle ReID, as displayed in Fig. 2. Specifically, we first divide and annotate vehicle images into five separate views, namely front, left-side, back, right-side, and top views. Then, we train a view parsing network based on the view annotations to output the corresponding view labels, as displayed in Fig. 3. Third, we design a dual-branch transformer-based network with view information to extract global and local image features. In the global branch, images are separated into small equal-sized patches and linearly projected into the transformer network. In the local branch, each image is divided and shuffled into five patches according to the five different views. To reduce the entanglement of features from different views, we introduce the soft nearest neighbor loss (SNNL)32 in the local branch, which measures the entanglement of class manifolds in the learning space. As a result, our TVRVI can separate local view features and avoid interference from uncommon-view features. The contributions of our work can be summarized as follows:
-
Firstly, we divide images into five different view parts and train a view parsing network to segment images and output their view labels. The outputs of the view parsing network are treated as view information.
-
Then we design a Transformer-based Vehicle Re-identification model with View Information (TVRVI). In the global branch of TVRVI, images are separated into small patches and linearly projected as usual. In the local branch, images are divided and shuffled into five patches with the help of the view information output by the view parsing network, and SNNL is introduced to reduce the entanglement of features from different views.
-
Through extensive experiments on four commonly used datasets, we prove the effectiveness of our TVRVI model, which achieves state-of-the-art results and has better generalization ability.
Related work
Vehicle ReID is a multi-target-multi-camera tracking task. Previous work on vehicle ReID can be coarsely classified into two kinds of methods, namely supervised feature representation learning and unsupervised feature representation learning.
Supervised feature representation learning
The first main type is supervised feature representation learning. This approach mainly focuses on developing feature construction strategies. Liu et al. matched features between a query set and a gallery set in a deep learning approach3. Chen et al. pretrained a DCNN on a balanced VeRi-7763 subset with the same number of images in each viewpoint and extracted the centers of feature representations, so it could generate three vehicle foreground masks for part feature extraction22. Meng et al. parsed vehicle local features into four views (front, back, side, and top) and employed a common visible attention scheme to calculate local feature distance23; however, they treated the left side and right side as the same view, and the view features remain entangled. The part-regularized near-duplicate model contains two components: a global CNN-based module to conduct ReID categorization and a local part-regularizing module based on YOLO to encourage the correct classification of four identified parts21. Zhang et al. used a fixed number of image local parts and combined a part-guided attention network to extract both local and global features35. Feature representation learning methods mainly focus on feature learning and network design. To address the challenges of intra-instance difference and inter-instance similarity in vehicle ReID, Li et al. designed a super-resolution-based part collaboration network (SPCN) by utilizing discriminative local parts and applying super-resolution techniques to enhance their quality48. Kuang et al.46 addressed the domain generalization problem in vehicle ReID by creating a specialized dataset and employing dual disentanglement techniques to enhance the generalization ability of ReID models. Lian et al. proposed a Multi-Branch Enhanced Discriminative Network (MED) for vehicle ReID, which is separated into three branches: a global branch, a horizontal branch, and a vertical branch36. Similarly, Chen et al. constructed a Global-Local Discriminative Representation Learning Network (GLNet) for viewpoint-aware vehicle ReID37.
However, most existing methods focus primarily on visual representation learning and neglect the potential of semantic features during training, which often leads to poor generalization capability when adapted to new domains. Recent works have begun to explore multi-modal feature learning to overcome these limitations. For instance, Xiang et al. proposed a unified perspective for visual-semantic embedding learning in generalizable person re-identification, demonstrating the effectiveness of combining visual and semantic cues49. Similarly, Singh et al. introduced FLAVA, a foundational language and vision alignment model that targets vision, language, and their multi-modal combination simultaneously, showing impressive performance across a wide range of tasks50. These approaches highlight the importance of multi-modal learning and provide strong motivation for our work, which integrates view information with transformer-based architectures to improve vehicle re-identification.
Unsupervised feature representation learning
Unlike supervised feature representation learning, unsupervised feature representation learning assumes that the data has no labels or that only a few labels are available. Shen et al. designed a Triplet Contrastive Representation Learning (TCRL) framework for unsupervised vehicle re-identification38, which consists of a feature encoder module, a clustering module, and three memory banks for storing updated features. Zhu et al. introduced two novel self-attention modules, namely a static self-attention module and a dynamic self-attention module, designed to capture distinct dependencies within vehicle images39. The static self-attention module aims to capture global long-range dependencies, while the dynamic self-attention module focuses on capturing local location-dependent dependencies. By incorporating both modules into the feature learning process, this approach effectively addresses the prevalent issue of feature misalignment between vehicle image embeddings. Khorramshahi et al. proposed a training framework that incorporates self-supervisory signals to enrich the learning of a ReID model without explicit attention mechanisms, achieving a balance of accuracy and efficiency40. To deal with the challenge of distinguishing vehicles with different IDs under various perspectives and backgrounds, Lu et al. designed three modules: Multi-Attentional Feature Enhancement (MAFE), Adaptive Threshold Neighborhood Consistency (ATNC), and Compact Loss (CL), to improve the feature representation and filter out pseudo-label noise41. In addition to local features, vehicle images can also be processed in a coarse-to-fine manner. Huang et al. proposed a coarse-to-fine sparse self-attention mechanism (CFSA) for vehicle ReID42. They first introduced a patch-level pairwise self-attention mechanism, which reduces computation and memory consumption by transforming dense connections into sparse connections. Additionally, they employed a coarse-to-fine strategy to capture global and local information, improving the discriminative power of the attention map.
Proposed method
Vehicle ReID plays a significant role in smart cities, and the intra-identity difference and inter-identities similarity challenges have not been well solved. In this section, we explain our transformer-based model with view information (TVRVI) for vehicle ReID. The overall structure of TVRVI is shown in Fig. 2. It mainly includes three steps: parsing vehicle images into five views and outputting their corresponding view labels, using a dual-branch transformer network to extract image features, and dividing and shuffling every input image into five patches in the local branch with the help of the view parsing output. By comparing features from the same view and leaving out features from different views, TVRVI makes good use of common-view features and avoids interference from uncommon-view features, which helps reduce intra-identity difference and inter-identities similarity between vehicle images.
View information parsing
In the realm of transportation, there exists a vast array of vehicle types, each exhibiting its unique characteristics and functionalities. However, despite their apparent diversity, vehicles can be conceptually abstracted as polyhedrons composed of multiple surfaces. This abstraction allows us to categorize vehicle images based on different camera perspectives. Given that IIoT supervisory cameras are typically positioned at elevated vantage points relative to vehicles, an image captured by such a camera often encompasses three or more distinct views of a particular vehicle, as depicted in Fig. 1. In addition to fixed roadside cameras, UAV-mounted cameras are much more flexible, enabling them to capture nearly all possible vehicle views within a single image, as illustrated in the input image in Fig. 3. Consequently, there exists at least one shared view between different images.
During the procedure of ReID, we can focus on common-view features and leave out uncommon-view features that may introduce interference. In this paper, we define vehicle surfaces into five distinct views, namely front, left-side, back, right-side, and top. Notably, the bottom surface is excluded from consideration as it is typically unobservable. By adopting this categorization, we can extract and compare local features of vehicles based on their respective views, facilitating more accurate and discriminating ReID analyses.
View information annotation
Based on the annotation work in PVEN23, which annotated vehicle images into four views (front, back, top, and side), we further separate the side view in PVEN into left-side and right-side views and add images captured by UAVs, which are not limited to roadside cameras. Concretely, we carefully curated a subset of the VeRi-776 dataset3 for annotation using the Labelme tool. The selection process aimed to encompass a wide range of vehicle views and enhance the robustness of our parsing network. To achieve this, we specifically targeted vehicle identities that exhibited significant variations in their observed views. Furthermore, we selected aerial-view images from the VRAI24 dataset and combined them with the VeRi-776 subset, effectively expanding the coverage of our annotated dataset. For rare vehicle types such as pickups, trucks, or cement tankers, we annotated all of the images of a particular vehicle identity. During the annotation procedure, we followed a consistent labeling protocol for the five views: front (grille, headlights, front windshield), back (taillights, trunk, rear window), left/right side (side windows, doors, wheels, rearview mirrors), and top (roof, sunroof). Images with ambiguous views were discussed among the authors to reach a consensus.
In total, our annotation efforts resulted in the labeling of 4305 vehicle images. From this annotated dataset, we divided the images into two distinct sets: a training set and a testing set. We used 3505 images for training our view parsing network, while the remaining 800 images were reserved for testing the network's performance and evaluating its effectiveness in view parsing tasks. This partitioning ensures that the network learns from a substantial number of labeled examples while preserving a separate, independent dataset for unbiased evaluation.
View parsing network
Once the vehicle view information was annotated, we trained a view parsing network based on the U-net segmentation model43. The primary objective of this network is to accurately segment vehicle images and assign the appropriate view label to each segmented region, as illustrated in Fig. 3. The labels 1, 2, 3, 4, and 5 in Fig. 3 correspond to the front, left-side, back, right-side, and top views of a vehicle image. To enhance the network's performance, we incorporate batch normalization layers, replace the activation function in the last layer with a Rectified Linear Unit (ReLU)44, and modify the loss function to employ the mean squared error (MSE) loss. After training, we evaluate the effectiveness of the view parsing network on a separate testing set. The results are highly promising, with an achieved Intersection over Union (IoU)45 score of 84.3%, which indicates the network's ability to accurately delineate and separate vehicle views from the input images. Additionally, the view classification accuracy reaches 98.6%, demonstrating the network's proficiency in assigning the correct view labels to the segmented regions.
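The following minimal sketch illustrates these modifications under stated assumptions: the U-Net encoder-decoder (producing `feats`) comes from any standard implementation, the channel sizes are illustrative, and the MSE loss is read here as regressing the integer view-label map (a per-channel MSE on one-hot view maps would be an equally valid variant).

```python
import torch
import torch.nn as nn

class ViewParsingHead(nn.Module):
    """Modified output head: batch normalization and a final ReLU activation."""
    def __init__(self, in_channels: int = 64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),        # added batch normalization
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=1),
            nn.ReLU(inplace=True),     # ReLU replaces the usual final activation
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Single-channel map regressed toward the view labels
        # (0 = background, 1-5 = front, left-side, back, right-side, top).
        return self.head(feats)

criterion = nn.MSELoss()

def parsing_train_step(unet, head, optimizer, images, label_maps):
    feats = unet(images)                  # (B, C, H, W) decoder features
    pred = head(feats).squeeze(1)         # (B, H, W) predicted label map
    loss = criterion(pred, label_maps.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```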
These performance metrics demonstrate the network’s effectiveness in extracting and representing view information. Consequently, the transformer embeddings that integrate such view information are reliable and suitable for subsequent stages of analysis and processing.
Transformer based feature extraction
After training the view parsing network, we design a dual-branch transformer network that incorporates both global features and local view features, leveraging the obtained view information. As the backbone of our network, we use the swin-transformer (SwinT) structure28, which has demonstrated superior performance in various computer vision tasks. Input images are split into patches and embedded as usual, and we augment these patch embeddings by concatenating them with an additional vehicle ID class token, enabling the network to incorporate identity information. In the global branch of our network, we introduce layer normalization after the Multi-Layer Perceptron (MLP) head in the last swin-transformer block. This normalization step helps stabilize the training process and improves the network's overall performance.
As for the loss function, we optimize the global branch of our network using a combination of different loss components. First, we employ the ID loss \(L_{ID}\), which is akin to the standard cross-entropy loss for vehicle ID classification. This loss component ensures that the network learns to accurately classify vehicle identities. Additionally, we incorporate a triplet loss to train the global features. For a triplet of examples \(\left\{ anchor, positive, negative \right\}\), we employ a soft-margin triplet loss30 \(L_T\) with batch-hard mining.
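A standard batch-hard, soft-margin formulation consistent with this description is

$$L_{T}=\frac{1}{B}\sum_{a=1}^{B}\log\Bigl(1+\exp\bigl(\max_{p}\lVert f_{a}-f_{p}\rVert_{2}-\min_{n}\lVert f_{a}-f_{n}\rVert_{2}\bigr)\Bigr),$$

where, for each anchor in the batch, the hardest positive and the hardest negative are selected.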
This triplet loss compares the features of an anchor example \(f_a\), a positive example \(f_p\), and a negative example \(f_n\) to increase the intra-identity compactness and inter-identities separability of the learned embeddings. The soft margin allows a degree of flexibility in the triplet loss calculation. However, the triplet loss only measures the relative distances within a triplet of images. To minimize the absolute distance between image pairs, we introduce the center loss \(L_C\)30, which enhances intra-identity compactness and brings the features of instances belonging to the same identity closer in the embedding space.
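A standard center-loss formulation consistent with the variables defined below is

$$L_{C}=\frac{1}{2}\sum_{j=1}^{B}\bigl\lVert f_{j}-c_{j}\bigr\rVert_{2}^{2},$$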
where B is the batch size, \(c_j\) is the feature center of the images with the same identity as the jth image, and \(f_j\) is the embedding feature of the jth image.
By incorporating these loss functions, our dual transformer-based network can effectively learn both global features and local view features, enabling a comprehensive representation of the vehicles in our dataset.
View feature separation
In the local branch of our network, we employ a view-based approach to capture the local view features of the vehicles. To achieve this, we divide and shuffle each input image into five patches based on the output of the view parsing network, as depicted in the right portion of Fig. 2. Unlike the global branch, where the vehicle images are evenly divided into patches of the same size, we adopt a different strategy in the local branch: we use the view parsing results to separate each vehicle image into five distinct parts. These parts are then projected into \(32 \times 32\) patches through an average pooling layer, ensuring that the patch size matches the input size of the final block of the swin-transformer.
In cases where a vehicle image has fewer than five views, we fill the missing views with patches containing all zeros. This maintains consistency in the input dimensions and facilitates uniform processing across all vehicle images. The separated five view patches are then embedded into the final block of the swin-transformer, incorporating an absolute position embedding that corresponds to their respective view labels. This embedding allows the network to associate spatial information with specific views, aiding in the discrimination of local features.
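As an illustration of this step, the sketch below (with hypothetical helper names and tensor shapes; the actual implementation may differ) builds the five view patches from a parsed label map: each view region is masked, average-pooled to \(32 \times 32\), zero-filled if the view is absent, and paired with its view label for the positional embedding.

```python
import torch
import torch.nn.functional as F

NUM_VIEWS = 5  # 1: front, 2: left-side, 3: back, 4: right-side, 5: top

def build_view_patches(image: torch.Tensor, view_map: torch.Tensor):
    """Split one image into five 32x32 view patches guided by the parsed view map.

    image:    (C, H, W) input image tensor
    view_map: (H, W) integer map from the view parsing network (0 = background)
    Returns:  patches (5, C, 32, 32) and the corresponding view labels (5,)
    """
    c, h, w = image.shape
    patches, labels = [], []
    for view_id in range(1, NUM_VIEWS + 1):
        mask = (view_map == view_id).float()          # binary mask of this view
        if mask.sum() == 0:
            # Missing view: an all-zero patch keeps the input dimensions fixed.
            patches.append(torch.zeros(c, 32, 32))
        else:
            masked = image * mask.unsqueeze(0)        # keep only this view's pixels
            # Project the (irregular) view region to a fixed 32x32 patch
            # via adaptive average pooling.
            patches.append(F.adaptive_avg_pool2d(masked, (32, 32)))
        labels.append(view_id)                        # used as hard positional embedding
    return torch.stack(patches), torch.tensor(labels)
```

Returning the view labels alongside the patches keeps the five views in a fixed, label-aligned order, so the last swin-transformer block always receives patch \(i\) with the positional embedding of view \(i\).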
With the help of view information, we can reduce the entanglement of vehicle local features by using the soft nearest neighbor loss (SNNL)32. This loss function leverages the added view information to encourage the network to identify similar local features across different vehicle instances while minimizing the interference caused by dissimilar features. By incorporating the SNNL, our network can effectively reduce feature entanglement and enhance the discriminative power of the learned representations. The SNNL32 is computed over each training batch of B image samples.
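Using the notation defined below, the standard form given by Frosst et al.32, which we assume here, is

$$l_{sn}(x,y,T)=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\sum_{j\neq i,\; y_j=y_i}\exp\bigl(-\lVert x_i-x_j\rVert^{2}/T\bigr)}{\sum_{k\neq i}\exp\bigl(-\lVert x_i-x_k\rVert^{2}/T\bigr)},$$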
where x is the input embedding of the first layer of the last block in the local branch or the feature representation of a hidden layer, y is the view label of x, and T is a temperature parameter that adjusts the weight given to the distances between inputs. Because the hidden layers of the swin-transformer are high-dimensional, we replace the Euclidean distance with the cosine distance \(\left( 1-\cos \left( x_i,x_j \right) \right)\) to ensure stable calculation.
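Under this substitution (the rest of the expression is assumed unchanged), the loss becomes

$$l_{sn}(x,y,T)=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\sum_{j\neq i,\; y_j=y_i}\exp\bigl(-\bigl(1-\cos(x_i,x_j)\bigr)/T\bigr)}{\sum_{k\neq i}\exp\bigl(-\bigl(1-\cos(x_i,x_k)\bigr)/T\bigr)}.$$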
To simplify the training process, we minimize the SNNL over all temperatures to eliminate the temperature as a hyperparameter.
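Following Frosst et al.32, this corresponds to

$$l_{sn}^{\prime}(x,y)=\min_{T}\; l_{sn}(x,y,T).$$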
In the final layer of the last block of the swin-transformer, we reduce the entanglement of the five local view features by minimizing \(l_{sn}^{\prime}\) in Eq. (6). In light of findings that suggest the benefit of diverse feature representations in hidden layers for model robustness and generalization32, we design a total SNNL loss \(L_S\) by minimizing \(l_{sn}^{\prime}\) in the last layer and maximizing \(l_{sn}^{\prime}\) in the hidden layers.
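A form consistent with this description and the parameters defined below is

$$L_{S}=l_{sn}^{\prime}\bigl(f^{k},y\bigr)+\alpha\sum_{i}l_{sn}^{\prime}\bigl(f^{i},y\bigr),$$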
where \(\alpha\) is a negative parameter, \(f^k\) is the feature extracted from the final layer, \(f^i\) are features extracted from hidden layers. In our experiments, \(\alpha\) is set to -0.01. As a result, our TVRVI can focus on common-view features and avoid interference from uncommon-view features, which helps to reduce intra-identity differences and inter-identities similarities between vehicle images.
Finally, the learning loss of our TVRVI is:
within which \(\rho\) is set to 0.05. ID Loss and Center Loss work together to improve intra-class compactness while Triplet Loss enhances inter-class separability, and SNNL further disentangles view-specific features in the local branch.
The detailed procedure of our proposed TVRVI model is described in Algorithm 1.
Experimental results
In this section, we explain our experimental settings and evaluate our TVRVI model on three commonly used datasets: VeRi-7763, VRAI24, and VERI-Wild29, and test its generalization ability on the VehicleID dataset30. Ablation studies are also conducted to prove the effectiveness of our approach.
Datasets
VeRi-776
This dataset is collected from 20 cameras within an area of 1 km² in a real city scene, and each vehicle is captured by 2 to 18 fixed roadside cameras from different views. Therefore, the view changes of the images are quite rich. VeRi-776 includes 49,357 images of 776 vehicles in total; its training set includes 37,778 images of 576 vehicles, and its testing set includes 11,579 images of 200 vehicles. All the images in its testing set are contained in its gallery set, and its query set consists of 1678 images extracted from the testing set.
VRAI
The images in the VRAI dataset are captured simultaneously by two DJI Phantom 4 UAVs with non-overlapping fields of view, and the flight height of the UAVs ranges from 15 m to 80 m. The training set of VRAI includes 66,113 images of 6302 vehicles, and its testing set includes 71,500 images of 6720 vehicles. Its gallery set consists of 75% of its testing set, and its query set consists of the remaining 25%. Compared with roadside cameras, UAV-mounted cameras are much more agile, which brings greater challenges for vehicle ReID.
VERI-Wild
This large-scale vehicle image dataset was photographed by 174 roadside cameras covering more than 200 km² of city districts continuously for one month. Its training set includes 277,797 images of 30,671 vehicles, and its testing set includes 228,517 images of 10,000 vehicles. The testing set is further divided into small, medium, and large subsets, which contain 41,816 images of 3000 vehicles, 69,389 images of 5000 vehicles, and 138,517 images of 10,000 vehicles, respectively.
VehicleID
Captured by 12 cameras distributed in a small city in China during the daytime, VehicleID includes 221,763 images of 26,267 vehicles in total. Images in VehicleID are in front and back views only; compared with the three datasets above, its view changes are simpler. There are 110,178 images in its training set and 111,585 images in its testing set, and the testing set is further divided into small, medium, and large subsets, with 7332 images of 800 vehicles, 12,995 images of 1600 vehicles, and 20,038 images of 2400 vehicles, respectively.
A description of these four datasets is presented in Table 1.
Evaluation metrics
To assess the effectiveness of our TVRVI approach, we employ widely-used standard metrics, namely the mean average precision (mAP) and the cumulative matching curve (CMC). These metrics have been extensively utilized in previous works within the field of vehicle ReID.
The mAP metric provides a comprehensive evaluation of the overall performance of our TVRVI system. It takes into account both the precision and recall values across all possible rank positions. By considering the precision-recall trade-off, the mAP metric offers a holistic assessment of the system’s ability to accurately match and identify vehicles based on their view information. The CMC is reported in terms of Rank-1, Rank-5, and Rank-10 accuracies, which indicate the percentages of correct matches found within the top-ranked results. The CMC demonstrates the system’s performance across various ranks, providing insights into its ability to accurately match and rank vehicles based on their view information.
By utilizing these standard metrics, we can evaluate the effectiveness of our TVRVI approach in a consistent and widely accepted manner. These metrics offer objective measures of performance that allow for meaningful comparisons with other state-of-the-art methods in the field of vehicle ReID.
Experiments setup
Training
We use the U-net43 structure to train our view parsing network for 64 epochs on the 3505 annotated images with a batch size of 8 and the Adam optimizer. The learning rate is set to 1e-4 for the first 40 epochs and decayed to 1e-5 for the remaining 24 epochs. The trained parsing network reaches an IoU score of 84.3% and a view classification accuracy of 98.6% on the 800 testing images.
We initialize the dual transformer network with weights pretrained on ImageNet-22K and set the learning rate to 0.01. AdamW is employed as the optimizer for 300 epochs of training. In each training batch, we select P vehicle identities and K images for every vehicle identity1, so the batch size is \(B=P \times K=8 \times 4=32\). We warm up the learning rate linearly over the first 20 epochs and then use a cosine decay learning rate scheduler. Input images are resized to \(224 \times 224\), and standard data augmentation methods such as padding, random cropping, and random erasing are also used. In the global branch, we add the relative position embedding28 to the image patches and concatenate them with an ID class token. In the local branch, we use the view labels output by the parsing network as the hard positional embedding of the local view patches.
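A sketch of the batch construction and learning-rate schedule described above is given below; the identity-to-image indexing, the sampling with replacement for identities with fewer than K images, and the exact decay endpoint are illustrative assumptions.

```python
import math
import random

P, K = 8, 4                              # identities per batch x images per identity
BASE_LR, WARMUP_EPOCHS, EPOCHS = 0.01, 20, 300

def sample_pk_batch(ids_to_images: dict) -> list:
    """Pick P identities and K images per identity (batch-hard style sampling)."""
    ids = random.sample(list(ids_to_images.keys()), P)
    batch = []
    for vid in ids:
        imgs = ids_to_images[vid]
        # Sample with replacement if an identity has fewer than K images.
        batch += random.choices(imgs, k=K) if len(imgs) < K else random.sample(imgs, K)
    return batch                         # B = P * K = 32 image indices

def lr_at_epoch(epoch: int) -> float:
    """Linear warm-up for the first 20 epochs, then cosine decay until epoch 300."""
    if epoch < WARMUP_EPOCHS:
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))
```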
All of our experiments are carried out on the Ubuntu 20.04 operating system with an NVIDIA RTX 4090 GPU using the PyTorch toolbox, and Labelme is used to annotate our view parsing data, as described in Table 2.
Inference
At inference time, we use the trained dual-branch network to extract both the global features and the local view features of images and calculate Euclidean distances between these features to re-identify vehicle images.
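A combination consistent with the variables defined below is

$$D=D_{g}+\frac{1}{N}\sum_{i=1}^{N}D_{l_i},$$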
where \(D_g\) is the global feature distance, \(D_{l_i}\) is the local feature distance of the ith view, and N is the number of matched view labels. Here, we only use the common-view features to calculate the local distance and leave out uncommon-view features in order to prevent the interference introduced by those uncommon views, as explained in Fig. 4. We can observe from Fig. 4 that the global feature is captured by our TVRVI network from the whole image directly. The common views of the two images are the top and right-side views, so the view-specific features of the top and right sides are extracted in the local branch. During the inference stage, the global and local features are combined according to Eq. (8).
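The common-view selection in Eq. (8) can be sketched as follows; the equal weighting of the global and averaged local distances and the fallback when no view is shared are assumptions for illustration.

```python
import torch

def reid_distance(global_q, global_g, local_q, local_g, views_q, views_g):
    """Combine the global distance with local distances over common views only.

    global_q, global_g: (D,) global feature vectors of the query / gallery image
    local_q, local_g:   (5, D) local view features, indexed by view id - 1
    views_q, views_g:   sets of view ids present in each image (from the parser)
    """
    d_global = torch.dist(global_q, global_g)      # Euclidean distance
    common = views_q & views_g                     # shared (common) views only
    if not common:
        return d_global                            # no common view: use global only
    d_local = sum(torch.dist(local_q[v - 1], local_g[v - 1]) for v in common)
    return d_global + d_local / len(common)        # N = number of matched views
```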
Experiments on VeRi-776 dataset
To evaluate the performance of our TVRVI model, we conduct experiments on the VeRi-776 dataset and measure its effectiveness using three standard evaluation metrics: mAP, Rank-1, and Rank-5. Rank-10 is not included, as it is not commonly reported in previous works.
Table 3 presents the results of our TVRVI and several other latest methods, from which we can observe that transformer-based methods perform better than most CNN-based methods. Compared with existing transformer-based methods, our TVRVI has an improvement of 2.4% in mAP and 0.7% in Rank-1, reaching state-of-the-art results.
Both TransReID47 and TVRVI benefit from embedding view information and leverage transformer architectures for ReID; however, they differ significantly in the way features are learned. Unlike TransReID, which implicitly incorporates viewpoint and camera information via Side Information Embedding (SIE) through additive embeddings, TVRVI employs an explicit view parsing network to segment vehicle images into five distinct views and assigns hard view labels. This allows TVRVI to explicitly disentangle view-specific features in a structured manner. Furthermore, TVRVI utilizes a dual-branch architecture in which the local branch processes view-separated patches guided by parsed labels instead of simply shuffling patches as in the Jigsaw Patch Module (JPM), and introduces the SNNL to explicitly minimize feature entanglement across views. During inference, TVRVI only compares features from common views, significantly reducing cross-view interference. These innovations enable TVRVI to achieve stronger generalization, particularly in challenging scenarios with large view variations, such as aerial datasets like VRAI, where it demonstrates superior performance over TransReID.
As for comparison with MART11, while MART utilizes mask-aware reasoning to handle background clutter and occluded parts via a unified transformer with GCN-based refinement, TVRVI introduces a dedicated view parsing network that explicitly annotates and separates vehicle images into five distinct views. This allows TVRVI to extract view-specific local features in its local branch, reducing feature entanglement across different views via SNNL, which is not present in MART. Furthermore, TVRVI’s dual-branch design enables simultaneous learning of global features and view-aware local features, and during inference, it strategically compares only common-view features while ignoring uncommon views, thereby minimizing intra-identity variance and inter-identities similarity. In contrast, MART relies on mask-based part segmentation and cross-image reasoning without explicit view separation, making TVRVI more effective in handling viewpoint variations and achieving state-of-the-art performance across multiple vehicle ReID datasets.
Experiments on VRAI dataset
The VRAI dataset stands out as the representative aerial dataset available so far, differentiating it from traditional datasets such as VeRi-776, VERI-Wild, and VehicleID. One of its key characteristics is its wide variety of view changes, which presents additional challenges for vehicle ReID. Due to its recent release, there have been limited experiments conducted on the VRAI dataset, resulting in a relatively small body of research in this domain. We compare the mAP, Rank-1, Rank-5, and Rank-10 results on the VRAI dataset in Table 4.
In Table 4, "VRAI" refers to the multi-task method used in24, and "VRAI+DP" indicates adding the manually annotated discriminative parts into their model24, because they annotated some discriminative parts within vehicle images after collecting the VRAI dataset. In the "VIT Baseline" method, we use a bare swin-transformer network without the local branch of TVRVI. From Table 4 we can observe that, by comparing common-view features, TVRVI effectively improves the mAP and Rank-1 on VRAI by 4.5% and 4.14%, respectively, compared to the vision transformer baseline. This improvement is particularly notable given the larger intra-identity difference and inter-identities similarity presented in images captured by UAVs.
The results underscore the effectiveness of TVRVI in leveraging common-view features to enhance the performance of vehicle re-identification on the VRAI dataset. The challenges posed by the larger variations in view and the unique characteristics of aerial images are successfully addressed by TVRVI, demonstrating its capability to tackle the complexities introduced by UAV-based vehicle surveillance scenarios.
Experiments on VERI-Wild dataset
Similar to the VeRi-776 dataset, the VERI-Wild dataset captures real-world city scenes using fixed roadside cameras but on a significantly larger scale. To evaluate the performance of our TVRVI model, we conduct tests on all three testing subsets of VERI-Wild, namely the small, medium, and large subsets. Table 5 presents the comparison results of mAP, and Table 6 presents the comparison results of Rank-1 and Rank-5.
Analyzing the results in Table 5 and Table 6, we can observe that the improvements of PVEN and our TVRVI prove the effectiveness of local view information. However, our TVRVI model outperforms PVEN by achieving higher scores. This improvement can be attributed to the reduction in the entanglement of local view features through the integration of outputs from the view parsing network. Furthermore, TVRVI effectively narrows the performance gap between the small and large subsets of VERI-Wild. Specifically, compared to PVEN, it reduces the mAP gap from 12.8 to 11.33 between the small and large subsets and from 7.3 to 5.56 between the medium and large subsets; it reduces the Rank-1 gap from 3.3 to 2.8 between the small and large subsets and from 1.3 to 0.7 between the small and medium subsets; and it reduces the Rank-5 gap from 1.4 to 1.1 between the small and large subsets and from 1.0 to 0.7 between the medium and large subsets.
These results provide compelling evidence that our TVRVI model exhibits robust generalization ability. This enhanced performance can be attributed to the learning of view-independent features within the hidden layers, as described by Eq. (6) in our methodology. The ability to extract and utilize view-independent features contributes to the model’s robustness and enables it to generalize effectively across different subsets of the VERI-Wild dataset. These findings highlight the effectiveness of our TVRVI model in leveraging local view information and demonstrate its superior performance compared to existing methods such as PVEN.
Ablation study
The effectiveness of view feature separation
To validate the effectiveness of view feature separation using the outputs from the view parsing network, we design comparison experiments involving two methods. The first method removes the local branch from the transformer backbone, while the second crops input images into five randomly sized patches, projects them to \(32 \times 32\), and embeds them into the local branch without view labels. The ablation results on the VeRi-776 dataset, presented in Table 7, provide evidence supporting the effectiveness of local feature separation.
Table 7 showcases the outcomes of the ablation experiments and their impact on the performance of the vehicle ReID task. By removing the local branch from the transformer backbone, the model loses the ability to separate view-specific features, resulting in a decline in performance. This demonstrates the importance of the local branch in effectively handling view variations and extracting discriminative features. Furthermore, when input images are divided into multiple random-sized patches and embedded into the local branch without view labels, the performance of the model also suffers. This indicates that without the guidance of view labels provided by the view parsing network, the model struggles to effectively separate and utilize view-specific features.
We plot the view feature disentanglement in the final layer before the classifier in Fig. 5 to further present the effectiveness of SNNL in the local branch. It can be observed from Fig. 5 that the local view features flock together without SNNL, whereas with SNNL they are separated and easier to classify.
The effectiveness of view label embedding
In the local branch of our TVRVI model, we utilize the view labels obtained from our view parsing network as hard positional embeddings for local view patches. This approach aids the last block of our network in reducing the entanglement of view features. To assess the effectiveness of these hard positional embeddings, we conduct comparison experiments by replacing the view labels with randomly initialized values during the embedding of the local view patches. The results of these experiments are presented in Table 8. The results from Table 8 affirm the effectiveness of incorporating view labels as hard positional embedding for the local view patches in our TVRVI model. By leveraging the parsing output view labels, we can effectively guide the network in separating view features and mitigate the challenges associated with vehicle re-identification. This approach proves beneficial in handling the complexities introduced by varying views and contributes to the improved performance of our model.
The effectiveness of maximizing feature entanglement in hidden layer
The setting of maximizing view feature entanglement in hidden layers while minimizing it in the final layer may initially seem counterintuitive. To substantiate its effectiveness, as described in Eq. (6), we conduct experiments that focus only on minimizing view feature entanglement in the final layer; in these experiments, we set the hidden-layer term in Eq. (6) to zero. The results from Table 9 validate the effectiveness of the proposed setting, where view feature entanglement is maximized in hidden layers and minimized in the final layer. By allowing higher entanglement in the earlier layers, the model can capture and leverage more comprehensive and discriminative information related to different views. This approach enhances the model's ability to generalize across varying views and improves its robustness in real-world vehicle re-identification scenarios.
Generalization ability
Intra-identity differences are caused by changes of view for a particular vehicle, and inter-identities similarities are caused by images captured from the same camera view of similar vehicle identities. The robustness of a vehicle ReID model therefore depends mainly on its generalization ability between datasets with different view distributions. Compared with the other three datasets used in this paper, almost all of the images in VehicleID are in front or back view, so it has less view variance. To measure the generalization ability of TVRVI, we conduct a cross-dataset transfer experiment by evaluating the trained TVRVI models on the testing set of VehicleID directly, without training on its training set. The cross-dataset transfer results are shown in Table 10.
In Table 10, the methods not marked with "trained on" are all trained and tested on the VehicleID dataset, whereas we test and compare our method on the small testing set of VehicleID without training on its training set. Compared with PVEN trained on VERI-Wild, our TVRVI attains an improvement of 6.89% on Rank-1, which proves it more robust to cross-dataset transfer and view distribution changes. Both PVEN and TVRVI benefit from view parsing, but TVRVI performs better when transferring because it can learn view-independent features and reduce the entanglement of different view features at the same time. VANet obtains the highest score on the VehicleID dataset, but when both are trained on VeRi-776, our TVRVI achieves much better scores, with improvements of 18.36%, 8.02%, and 2.71% on mAP, Rank-1, and Rank-5, respectively, as can be observed from Table 3.
Conclusion
In this paper, we propose a novel transformer-based model with view information (TVRVI) to deal with the intra-identity difference and inter-identities similarity challenges in vehicle ReID. Firstly, we annotate vehicle images into five view parts and train a view parsing network based on the annotations. Then we divide and shuffle vehicle images into five separate patches with the help of view labels output from the view parsing network. With the benefits of embedding view labels and reducing the entanglement of features from different views, TVRVI reaches state-of-the-art results on four commonly used datasets. Extensive experiments prove that the generalization ability of TVRVI is stronger than previous view parsing methods.
In the future, we will conduct further research using surveillance videos. By utilizing temporal information from video sequences, it becomes possible to capture and analyze vehicle appearance and motion patterns over time, leading to more reliable and consistent identification results. Another important area of focus is cross-domain ReID, particularly addressing the challenges posed by varying illumination conditions between day and night. Furthermore, integrating 3D reconstruction techniques into vehicle ReID can offer valuable insights and improve the accuracy of identification. By leveraging depth information and reconstructing the 3D structure of vehicles, it becomes possible to extract additional discriminative features that are invariant to viewpoint changes and occlusions.
Data availability
The data used in this paper are all available for non-commercial research purposes and properly referred to. i.e., VeRi-7763 (https://github.com/JDAI-CV/VeRidataset), VRAI24 (https://github.com/JiaoBL1234/VRAI-Dataset), VERI-Wild29(https://github.com/PKU-IMRE/VERI-Wild), and VehicleID30 dataset(https://www.pkuml.org/resources/pku-vehicleid.html)
References
Hermans, A., Beyer, L. & Leibe, B. In defense of the triplet loss for person re-identification. ArXiv Preprint ArXiv:1703.07737 (2017).
Luo, H., Gu, Y., Liao, X., Lai, S. & Jiang, W. Bag of tricks and a strong baseline for deep person re-identification. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2019).
Liu, X., Liu, W., Mei, T. & Ma, H. A deep learning-based approach to progressive vehicle re-identification for urban surveillance. Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part II 14 869–884 (2016).
Kong, X. et al. A federated learning-based license plate recognition scheme for 5G-enabled internet of vehicles. IEEE Trans. Ind. Inf. 17, 8523–8530 (2021).
Lubna, Mufti, N. & Shah, S. Automatic number plate recognition: A detailed survey of relevant algorithms. Sensors 21, 3028 (2021).
Qian, Y., Barthélemy, J., Du, B. & Shen, J. Paying Attention to Vehicles: A Systematic Review on Transformer-Based Vehicle Re-Identification (ACM Transactions On Multimedia Computing, Communications And Applications, 2024).
Kishore, R., Aslam, N. & Kolekar, M. PATReId: Pose apprise transformer network for vehicle re-identification. IEEE Transactions On Emerging Topics In Computational Intelligence (2024).
Qian, Y., Barthelemy, J., Iqbal, U. & Perez, P. V2reid: Vision-outlooker-based vehicle re-identification. Sensors 22, 8651 (2022).
Li, Z. et al. TVG-ReID: Transformer-based vehicle-graph re-identification. IEEE Trans. Intell. Vehic. 8(11), 4644–52 (2023).
Qian, W., Luo, H., Peng, S., Wang, F., Chen, C. & Li, H. Unstructured feature decoupling for vehicle re-identification. European Conference On Computer Vision 336–353 (Springer Nature Switzerland, Cham, 2022)
Lu, Z., Lin, R. & Hu, H. MART: Mask-aware reasoning transformer for vehicle re-identification. IEEE Trans. Intell. Transp. Syst. 24, 1994–2009 (2022).
Khorramshahi, P., Peri, N., Chen, J. & Chellappa, R. The devil is in the details: Self-supervised attention for vehicle re-identification. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16 369–386 (2020).
Cui, C., Sang, N., Gao, C. & Zou, L. Vehicle re-identification by fusing multiple deep neural networks. 2017 Seventh International Conference on Image Processing Theory, Tools and Applications (IPTA) 1–6 (2017).
Peng, J., Jiang, G., Chen, D., Zhao, T., Wang, H. & Fu, X. Eliminating cross-camera bias for vehicle re-identification. Multimedia Tools And Applications 1–17 (2022).
Yao, Y., Zheng, L., Yang, X., Naphade, M. & Gedeon, T. Simulating content consistent vehicle datasets with attribute descent. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16 775–791 (2020).
Wang, Z., Tang, L., Liu, X., Yao, Z., Yi, S., Shao, J., Yan, J., Wang, S., Li, H. & Wang, X. Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification. Proceedings of the IEEE International Conference on Computer Vision 379–387 (2017).
Khorramshahi, P., Kumar, A., Peri, N., Rambhatla, S., Chen, J. & Chellappa, R. A dual-path model with adaptive attention for vehicle re-identification. Proceedings of the IEEE/CVF International Conference on Computer Vision 6132–6141 (2019).
Liu, X., Zhang, S., Huang, Q. & Gao, W. Ram: A region-aware deep model for vehicle re-identification. 2018 IEEE International Conference on Multimedia and Expo (ICME) 1–6 (2018).
Zhou, Y. & Shao, L. Aware attentive multi-view inference for vehicle re-identification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 6489–6498 (2018).
Bai, Y., Lou, Y., Dai, Y., Liu, J., Chen, Z., Duan, L. & Pillar, I. Disentangled feature learning network for vehicle re-identification. IJCAI 474–480 (2020).
He, B., Li, J., Zhao, Y. & Tian, Y. Part-regularized near-duplicate vehicle re-identification. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 3997–4005 (2019).
Chen, T., Liu, C., Wu, C. & Chien, S. Orientation-aware vehicle re-identification with semantics-guided part attention network. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16 330–346 (2020).
Meng, D., Li, L., Liu, X., Li, Y., Yang, S., Zha, Z., Gao, X., Wang, S. & Huang, Q. Parsing-based view-aware embedding network for vehicle re-identification. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 7103–7112 (2020).
Wang, P., Jiao, B., Yang, L., Yang, Y., Zhang, S., Wei, W. & Zhang, Y. Vehicle re-identification in aerial imagery: Dataset and approach. Proceedings of the IEEE/CVF International Conference on Computer Vision 460–469 (2019).
Jiao, B., Yang, L., Gao, L., Wang, P., Zhang, S. & Zhang, Y. Vehicle re-identification in aerial images and videos: Dataset and approach. IEEE Transactions on Circuits and Systems for Video Technology (2023).
Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv Preprint ArXiv:2010.11929 (2020).
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A. & Jégou, H. Training data-efficient image transformers and distillation through attention. International Conference on Machine Learning 10347–10357 (2021).
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S. & Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision 10012–10022 (2021).
Lou, Y., Bai, Y., Liu, J., Wang, S. & Duan, L. Veri-wild: A large dataset and a new method for vehicle re-identification in the wild. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 3235–3243 (2019).
Liu, H., Tian, Y., Yang, Y., Pang, L. & Huang, T. Deep relative distance learning: Tell the difference between similar vehicles. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2167–2175 (2016).
Liu, X., Liu, W., Ma, H. & Fu, H. Large-scale vehicle re-identification in urban surveillance videos. 2016 IEEE International Conference on Multimedia and Expo (ICME) 1–6 (2016).
Frosst, N., Papernot, N. & Hinton, G. Analyzing and improving representations with the soft nearest neighbor loss. International Conference on Machine Learning 2012–2020 (2019).
Schroff, F., Kalenichenko, D. & Philbin, J. Facenet: A unified embedding for face recognition and clustering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 815–823 (2015).
Chu, R., Sun, Y., Li, Y., Liu, Z., Zhang, C. & Wei, Y. Vehicle re-identification with viewpoint-aware metric learning. Proceedings of the IEEE/CVF International Conference on Computer Vision 8282–8291 (2019).
Zhang, X. et al. Part-guided attention learning for vehicle instance retrieval. IEEE Trans. Intell. Transp. Syst. 23, 3048–3060 (2020).
Lian, J., Wang, D., Wu, Y. & Zhu, S. Multi-branch enhanced discriminative network for vehicle re-identification. IEEE Trans. Intell. Transp. Syst. 25(2), 1263–1274 (2023).
Chen, X., Yu, H., Zhao, F., Hu, Y. & Li, Z. Global-local discriminative representation learning network for viewpoint-aware vehicle re-identification in intelligent transportation. IEEE Trans. Instrum. Meas. 72, 1–13 (2023).
Shen, F., Du, X., Zhang, L., Shu, X. & Tang, J. Triplet contrastive representation learning for unsupervised vehicle re-identification. ArXiv Preprint ArXiv:2301.09498 (2023).
Zhu, W. et al. A dual self-attention mechanism for vehicle re-identification. Pattern Recogn. 137, 109258 (2023).
Khorramshahi, P., Shenoy, V. & Chellappa, R. Robust and scalable vehicle re-identification via self-supervision. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 5295–5304 (2023).
Lu, Z., Lin, R., He, Q. & Hu, H. Mask-aware pseudo label denoising for unsupervised vehicle re-identification. IEEE Trans. Intell. Transp. Syst. 24, 4333–4347 (2023).
Huang, F., Lv, X. & Zhang, L. Coarse-to-fine sparse self-attention for vehicle re-identification. Knowl.-Based Syst. 270, 110526 (2023).
Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18 234–241 (2015).
Wang, X., Jing, S., Dai, H. & Shi, A. High-resolution remote sensing images semantic segmentation using improved UNet and SegNet. Comput. Electr. Eng. 108, 108734 (2023).
Beers, F., Lindström, A., Okafor, E. & Wiering, M. Deep neural networks with intersection over union loss for binary image segmentation. Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods 438–445 (2019).
Kuang, Z., He, C., Huang, Y., Ding, X. & Li, H. Joint image and feature levels disentanglement for generalizable vehicle re-identification. IEEE Trans. Intell. Transp. Syst. 24(12), 15259–15273 (2023).
He, S., Luo, H., Wang, P., Wang, F., Li, H. & Jiang, W. Transreid: Transformer-based object re-identification. Proceedings of the IEEE/CVF International Conference on Computer Vision 15013–15022 (2021).
Li, J., Cong, Y., Zhou, L., Tian, Z. & Qiu, J. Super-resolution-based part collaboration network for vehicle re-identification. World Wide Web. 26, 519–538 (2023).
Xiang, S. et al. Learning Visual-Semantic Embedding for Generalizable Person Re-identification: A Unified Perspective (ACM Transactions on Multimedia Computing, Communications and Applications, 2025).
Singh, A., Hu, R., Goswami, V., Couairon, G., Galuba, W., Rohrbach, M., & Kiela, D. Flava: A foundational language and vision alignment model. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 15638–15650 (2022).
Funding
This work was supported by Henan Province Science and Technology Research Project [252102211028, 232102210094].
Author information
Authors and Affiliations
Contributions
M.Z. conceived the manuscript and designed the methodology. Q.F. contributed to the experiments development and validation process. All authors contributed to the article and approved the submitted version.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhu, M., Feng, Q. Transformer-based vehicle re-identification with view information. Sci Rep 15, 40576 (2025). https://doi.org/10.1038/s41598-025-24392-y