Multi-Attention-Based Soft Partition Network for Vehicle Re-Identification

Vehicle re-identification (Re-ID) determines whether vehicle images depict the same vehicle or different ones. It is challenging due to significant intra-instance differences between identical vehicles captured from different views and subtle inter-instance differences between similar vehicles. Researchers have tried to address this problem by extracting features robust to variations in viewpoint and environment. More recently, they have tried to improve performance by using additional metadata such as key points, orientation, and temporal information. Although these attempts have been relatively successful, they all require expensive annotations. Therefore, this paper proposes a novel deep neural network called a multi-attention-based soft partition (MUSP) network to solve this problem. This network does not use metadata and relies only on multiple soft attentions to identify specific vehicle areas, a function performed by metadata in previous studies. Experiments verified that MUSP achieved state-of-the-art (SOTA) performance on the VehicleID dataset without any additional annotations and comparable performance on VeRi-776 and VERI-Wild.


Introduction
Vehicle re-identification (Re-ID) identifies the same vehicle across a large number of images: given a query image, it finds images of the same car in a gallery. This task has received considerable attention recently because Re-ID technology can be used to analyze traffic flow for smart cities and is an essential technology for surveillance systems. Vehicle Re-ID is particularly challenging because vehicle exteriors can be captured in a wide variety of environments, and different lighting conditions and viewpoints can cause significant intra-instance differences. Conversely, different vehicles can look similar due to matching colors and common vehicle types.
Recent studies [34,29,37,10,20,8,13] have used convolutional neural networks (CNNs) and metric learning methods. In metric learning, vehicle images are encoded to a representative vector in embedding space, and distances between the vectors are compared. Thus, it is critical to select robust features to accommodate variations in environments, light conditions, and viewpoints.
Previous studies [30,29,27,8] adopted metadata attributes (e.g., orientation, color, type, key point(s), viewpoint, and spatio-temporal information) to identify the same vehicles. More recent studies semantically divided a vehicle into parts to extract features. He et al. [3] proposed a part detection model and extracted features from the part area. Chen et al. [2] leveraged vehicle orientation and mask, using a model to predict the vehicle mask, with each vehicle part segmented differently depending on orientation. Meng et al. [24] used part segmentation, separating the vehicle into four parts and extracting view-aware features from segmentation regions.
These methods can compare not only global appearance but also individual vehicle parts, so they can embed and compare subtle part-level details. However, they have one major drawback: they require extensive image annotation. In particular, labeling vehicle parts, including segmentation and bounding box creation, takes much more time than labeling whole images. According to a report [15], segmenting takes 15 times longer than spotting object locations and 60 times longer than image labeling. Therefore, we propose a multi-attention-based soft partition (MUSP) network to identify vehicles efficiently without additional annotation work. As illustrated in Figure 1, we introduce multiple attentions to obtain weighted feature maps focusing on different vehicle regions. Each weighted feature map is abstracted to a feature vector using average pooling. We also introduce soft partitioning of vehicle images based on the soft attention method. In contrast to hard attention approaches, where the region mask is Boolean, our model produces continuous region-mask values in [0, 1], allowing softer partitioning. Thus, the activated region of a feature can include any area without restrictions. MUSP takes as input the feature map extracted from a backbone network such as ResNet. Therefore, it can be applied to any backbone network, and performance can be significantly improved by merely attaching MUSP at the end of a backbone. Our study has three main contributions:
• We propose a multi-attention-based soft partition network called MUSP to provide part-aware attention weights and extract more representative and robust features for vehicle Re-ID.
• In contrast to previous approaches, our method does not require any additional annotation for vehicle parts.
To the best of our knowledge, ours is the first study that exploits part-aware features without additional annotations or metadata attributes.
• Our approach achieved state-of-the-art (SOTA) performance on the VehicleID dataset and, compared with methods that use additional annotations, comparable performance on the VeRi-776 and VERI-Wild datasets.

Related Works
Vehicle Re-ID technology has advanced enormously, strongly driven by access to several large datasets [21,9,18] that enable models to be trained and tested in conditions that more closely resemble real-world environments. Deep learning and metric learning have been widely used for the vehicle Re-ID task. To increase metric learning performance, more representative features must be extracted when embedding vehicle images in the feature space. Consequently, many methods have been introduced that use vehicle metadata, such as orientation, color, type, key points, viewpoint, and spatio-temporal data.
Temporal data have been adopted by several studies [27,18,8]. Shen et al. [27] use temporal information to track gradual vehicle changes across different cameras, enabling them to recognize the same vehicle even when it looks different and to overcome the limitations of methods that use only spatial information. However, this has a disadvantage: a continuous stream of images is required. Liu et al. [18] perform re-ranking using temporal information after detecting vehicles in images; this approach requires the temporal information of each vehicle even in the inference stage. Jiang et al. [8] also use temporal information together with spatial information for re-ranking.
Vehicle key points are used by several previous studies [30,10]. Wang et al. [30] estimated orientation using key points and extracted orientation-invariant features to improve performance; they also used temporal information. Khorramshahi et al. [10] used key points to exploit local features. The key-point-based approach has two disadvantages: it is difficult to cope with vehicle types that do not exist in the training data, and additional key point labels are required.
Recent studies [20,2,24] introduced methods of segmenting and comparing vehicle parts using metadata. This is similar to the way humans identify objects: segmenting the parts of a vehicle and comparing each part separately. Liu et al. [20] used a detection model to segment the vehicle parts. Chen et al. [2] proposed a model that segments the parts of a vehicle in a weakly-supervised manner using the vehicle's orientation to improve performance. Meng et al. [24] use a supervised segmentation model to divide vehicles. These methods have achieved performance improvements but have the drawback of requiring additional annotations or models; detection and segmentation demand substantial data resources and heavy models.
Finally, various methods using generative adversarial networks (GANs) have also been proposed [36,37]. However, a large gap remains between the generated features and reality because of the limited generation ability of existing GANs and the lack of adversarial samples.
Our approach follows the latest part recognition methodology [28,24,2], except for soft partitioning. We introduced multiple soft attentions for soft partitioning and recognition to obtain weighted feature maps focused on various vehicle regions. Because no annotation is required, it is cost-effective while improving performance through part recognition.

Proposed Method
The proposed multi-attention network comprises a backbone network that encodes a convolutional feature map for a given image and an attention-based network that extracts a set of weighted feature vectors, each of which focuses on a specific vehicle region. The attention-based network consists of two modules: a spatial attention module for soft partitioning of vehicle regions and a channel-wise attention module based on the squeeze-and-excitation (SE) method. The weighted feature vectors are used to compare distances between images for metric learning and are fed to classifiers to predict the vehicle ID. Each classifier includes batch normalization (BN) [7] and linear layers [23]; one classifier is applied to each of the n − 1 weighted feature vectors (the background vector is excluded). The overall architecture of MUSP is depicted in Figure 2, and its components are described in more detail in the following subsections.

Feature extractor
We selected ResNet-50 [4] as the backbone for feature extraction, removed the last fully connected (FC) layer, and used the last convolution layer's output. Thus, the feature extraction process is

M = CNN_B(I),

where CNN_B is the base network, M ∈ R^{h×w×d} is the feature map extracted from CNN_B, and h, w, d are the dimensions of M, which depend on the feature extractor and the input image I.

Spatial attention module
We use vehicle partitioning to extract subtle vehicle parts for vehicle Re-ID [28,24,2], with an attention method to refine the embedded features. Khorramshahi et al. [11] proposed detecting and re-cropping the vehicle during preprocessing to reduce background regions; they used a detection model and bounding box annotations to suppress noisy backgrounds. We assume that the same function can be performed within the deep learning model without an additional model or manual intervention. Figure 1(b) illustrates that the vehicle area is accurately recognized without additional annotation or detection. Meng et al. [24] found that subtle vehicle components significantly impact part division. However, such components cannot be captured accurately with a single attention because attention focuses on easily compared features, such as headlights and bumpers. Therefore, we use multiple attentions that are spatially separated and focus on different vehicle areas. This distributed attention can consider different parts, so the model can see and compare more vehicle details. Consequently, we designed a spatial multiple attention mechanism.
We apply convolution layers to the feature map M encoded by the backbone network to extract two feature maps: attention weights A and values V. The attention feature map has n channels of size h × w, each corresponding to a vehicle part. The value feature map has c channels of size h × w. The attention weights, after passing through softmax, are multiplied by the corresponding values to obtain n weighted value maps, to which average pooling is applied to extract the final weighted feature vectors {f_i}_{i=0,...,n}.
We compute softmax along the attention-channel dimension n, rather than the spatial dimension hw, so the attention weights have exclusive activation at each spatial point of the value map. We discard the final weighted feature vector f_n. Due to the properties of softmax, activation is also assigned to the background; however, if the final vector is discarded, the model is trained to assign the background region, which is noise, to that discarded vector, as depicted in Figure 3. The entire process can be summarized as

A = CNN_AE(M), V = CNN_VE(M),
f_i = AvgPool(σ(A)_i ⊙ V), F = {f_i}_{i=0,...,n}, F_d = F \ {f_n},

where CNN_VE and CNN_AE are the value and attention extractors, i.e., simple single convolution layers with 3 × 3 kernels and 1 × 1 padding, F is the set of extracted feature vectors, f is a single vector in F, F_d is the final feature set with the last (background) feature discarded, and σ is the softmax operation.
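The spatial attention module described above can be sketched in PyTorch roughly as follows; all module and variable names are ours, and the channel sizes are illustrative:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Multiple soft spatial attentions: softmax over the n attention
    channels at every spatial location, value map weighted per channel,
    average-pooled, with the last (background) vector discarded."""
    def __init__(self, d=2048, c=2048, n=5):
        super().__init__()
        self.n = n
        self.attn = nn.Conv2d(d, n, kernel_size=3, padding=1)   # CNN_AE
        self.value = nn.Conv2d(d, c, kernel_size=3, padding=1)  # CNN_VE

    def forward(self, M):                     # M: (B, d, h, w)
        A = self.attn(M).softmax(dim=1)       # (B, n, h, w), exclusive over n
        V = self.value(M)                     # (B, c, h, w)
        # Weight the value map by each attention channel, then average-pool.
        feats = [(A[:, i:i + 1] * V).mean(dim=(2, 3)) for i in range(self.n)]
        F_all = torch.stack(feats, dim=1)     # (B, n, c)
        return F_all[:, :-1], A               # drop background vector f_n
```

Because softmax runs across the n channels, the attention weights at each spatial point sum to 1, which is what makes the partition "soft" yet exclusive.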

Channel-wise attention module
The set of weighted feature vectors F_d from spatial attention is recalibrated by capturing and applying channel-wise attention, as in the SE block that modulates channel activation [6]. Channel-wise attention adjusts the activation intensity according to each channel's importance, reducing unnecessary feature element intensities and hence their influence on distance calculations. Because each feature vector relates to a feature map highlighted on a specific vehicle area, channel-wise attention must be applied to all n − 1 feature vectors, in contrast to the original SE, which controls one feature with FC layers.
We propose a channel-wise attention network based on an extended SE (ESE) algorithm. We reshape a set of the weighted feature vectors F d into one vector and then feed that vector to the SE block to modulate channel activation.
The ESE module comprises two linear layers, where the first layer is followed by a rectified linear unit and the second by a sigmoid operation. The ESE input dimension is 2,048, and its output dimensions are 128 and 2,048 in sequence. The result from ESE is a channel-wise attention vector, which is then multiplied by the original F_d. Thus, ESE can be summarized as

f_d = reshape(F_d), f_e = ρ(MLP(f_d)) ⊙ f_d, F_e = reshape(f_e),

where r = |c × (n − 1)| is the flattened input dimension, MLP is the multi-layer perceptron described above, ρ is the sigmoid operation, and F_e is the set of n − 1 final recalibrated features. f_d and f_e are 1-D vectors, while F_d and F_e are 2-D matrices.
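A minimal sketch of the ESE recalibration under these assumptions (flatten, squeeze, expand back, sigmoid gate, rescale); the class name and default dimensions are illustrative:

```python
import torch
import torch.nn as nn

class ESE(nn.Module):
    """Extended SE: the (n-1) weighted feature vectors are flattened into
    one vector of length r = c * (n - 1), squeezed to a bottleneck,
    expanded back, gated by a sigmoid, and used to rescale the features."""
    def __init__(self, r=2048, bottleneck=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(r, bottleneck),
            nn.ReLU(inplace=True),
            nn.Linear(bottleneck, r),
            nn.Sigmoid(),
        )

    def forward(self, F_d):              # F_d: (B, n-1, c)
        B, k, c = F_d.shape
        f_d = F_d.reshape(B, k * c)      # flatten to one vector
        gate = self.mlp(f_d)             # channel-wise attention in (0, 1)
        f_e = gate * f_d                 # recalibrate
        return f_e.reshape(B, k, c)      # F_e
```

Since the gate lies in (0, 1), recalibration can only attenuate feature elements, never amplify them.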

Distance computation
The feature vector set F_e extracted from the attention-based network and one feature vector f_g, obtained by global average pooling of M, are used to calculate losses. We apply triplet loss to each feature vector separately to train the model and adopt a multi-feature re-weighting function called the co-occurrence attentive module (CAM) [2], with some modifications, to calculate the distance by integrating these features for inference.
The distance weights w_{(a,b),i} between two vehicles a and b are calculated from the attention area ratios, where AR_{a,i} is the area ratio of the i-th attention weight for the a-th image, computed by averaging the attention weights. The original paper used a weight of 1 for the global feature, whereas we use 1/(n − 1) for the global feature weight w_{(a,b),g}. Hence, the distance between two vehicles is

d(a, b) = Σ_{i=1}^{n−1} w_{(a,b),i} ||f_{a,i} − f_{b,i}||_2 + w_{(a,b),g} ||f_{a,g} − f_{b,g}||_2,

where f_{a,i} is the i-th feature and f_{a,g} is the global feature for the a-th image, and ||·||_2 is the Euclidean distance.
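The integrated distance can be sketched as below; the per-part weights `w` are passed in as placeholders standing in for the CAM-derived weights (whose exact formula we do not reproduce here), and only the 1/(n − 1) global weight follows the text:

```python
import torch

def pair_distance(F_a, F_b, f_ga, f_gb, w):
    """Re-weighted distance between two vehicles.
    F_a, F_b: (n-1, c) part features; f_ga, f_gb: (c,) global features;
    w: (n-1,) placeholder per-part weights (hypothetical stand-in for CAM)."""
    n_parts = F_a.shape[0]
    part = sum(w[i] * torch.dist(F_a[i], F_b[i]) for i in range(n_parts))
    # Global feature contributes with fixed weight 1/(n-1).
    global_term = (1.0 / n_parts) * torch.dist(f_ga, f_gb)
    return part + global_term
```

With identical inputs the distance is zero, and each part's contribution scales linearly with its weight.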

Loss Function
We use three loss functions to train the model: cross-entropy loss for vehicle ID prediction (L_id), triplet loss for distance learning (L_tri), and a spatial diversity loss to separate the attention features (L_div). The overall loss function is

L = L_id + L_tri + L_div.

Cross-entropy loss
We apply cross-entropy loss following the vehicle ID prediction layer:

L_id = − Σ_{l=1}^{n} Σ_{i=1}^{K} Σ_{j=1}^{C} y_{ijl} log ŷ_{ijl},

where n is the number of features, K is the number of images in a mini-batch, C is the number of classes, y_{ijl} is the j-th element of the one-hot encoded ground-truth vector for the i-th sample in the mini-batch and l-th feature vector, and ŷ_{ijl} is the j-th element of the output vector of the softmax FC layer for the i-th image and l-th feature vector.

Triplet loss
The proposed network is optimized with triplet loss for metric learning, which trains the network to minimize the distance between features of the same identity and simultaneously maximize the distance between features of different identities. In a mini-batch containing P identities and Q images per identity, each image (anchor) has Q − 1 images of the same identity (positives) and (P − 1) × Q images of different identities (negatives). Triplet loss is defined as [5]:

L_tri = Σ_{i=1}^{P} Σ_{a=1}^{Q} [ m + max_{p=1..Q} ||v_{a,i} − v_{p,i}||_2 − min_{j=1..P, j≠i; b=1..Q} ||v_{a,i} − v_{b,j}||_2 ]_+,

where v_{a,i} is the prediction vector for the a-th image of the i-th identity group, and m is the margin controlling the difference between positive- and negative-pair distances, which helps cluster the distribution more densely.
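The batch-hard variant of triplet loss [5] described above can be sketched in PyTorch as follows (a generic implementation, not the paper's released code):

```python
import torch

def triplet_loss(feats, labels, margin=0.3):
    """Batch-hard triplet loss: for each anchor, take the farthest
    same-identity sample and the closest different-identity sample."""
    dist = torch.cdist(feats, feats)                    # pairwise Euclidean
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # identity mask
    # Hardest positive: maximum distance among same-identity pairs.
    pos = (dist * same.float()).max(dim=1).values
    # Hardest negative: minimum distance among different-identity pairs.
    neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return torch.clamp(pos - neg + margin, min=0).mean()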

Spatial diversity loss
We adopt spatial diversity loss [2] to restrict overlapping areas and hence ensure each attention weight acts on a different position:

L_div = Σ_i Σ_{m≠n} Σ_{h,w} a_i^m ⊙ a_i^n,

where a_i^n is the n-th attention weight for the i-th image in the mini-batch. Spatial diversity loss is the summation over all spatial positions of the element-wise products of the attention weights.
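Reading the loss as the spatial sum of pairwise element-wise products of the attention maps (our interpretation of the text), a sketch might be:

```python
import torch

def diversity_loss(A):
    """Spatial diversity loss over attention maps A of shape (B, n, h, w):
    sum of element-wise products of every pair of distinct attention maps."""
    B, n, h, w = A.shape
    flat = A.reshape(B, n, h * w)
    gram = torch.bmm(flat, flat.transpose(1, 2))           # (B, n, n) overlaps
    # Keep only off-diagonal entries (overlap between different maps).
    off_diag = gram.sum(dim=(1, 2)) - gram.diagonal(dim1=1, dim2=2).sum(dim=1)
    return off_diag.mean()
```

Perfectly exclusive attention maps (no spatial overlap) give zero loss; overlapping maps are penalized in proportion to their shared activation.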

Implementation details
Preprocessing resizes all images to 256 × 256 pixels and applies random erasing and translation. We use the Adam [12] optimizer with a weight decay of 5e-4 and a momentum of 0.9. The proposed model was trained with a batch size of 64 containing 16 unique vehicle IDs, for 90 epochs, with an initial learning rate of 0.00035, divided by 10 at epochs 30 and 60; a warmup over the first 10 epochs increased the learning rate from 0.000035 to 0.00035. Label smoothing was also applied to avoid overfitting. Training required 6 and 2 hours on the VehicleID and VeRi-776 datasets, respectively, using an NVIDIA Quadro RTX 6000 GPU. The training code was written in PyTorch [25].
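The learning-rate schedule described above (linear warmup over the first 10 epochs, then step decay at epochs 30 and 60) can be expressed as a simple function; this is our reading of the text, not released code:

```python
def lr_at_epoch(epoch, base_lr=3.5e-4):
    """Learning rate for a given epoch: linear warmup from base_lr/10
    to base_lr over epochs 0-9, then /10 at epoch 30 and again at 60."""
    if epoch < 10:
        return base_lr * (0.1 + 0.9 * epoch / 10)  # warmup 3.5e-5 -> 3.5e-4
    if epoch < 30:
        return base_lr
    if epoch < 60:
        return base_lr / 10
    return base_lr / 100
```

In practice this would be attached to the Adam optimizer via a per-epoch LambdaLR-style scheduler.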
The training phase used weighted feature vectors from the spatial attention module and vehicle ID prediction vector as described in Section 4 in the loss function. The inference phase only used F e and f g with a re-weighting method to compute distances between vehicles.

Experiments on VERI-Wild dataset
The VERI-Wild dataset is the largest vehicle Re-ID dataset and, in contrast to the previous two datasets, includes various weather environments. Like VehicleID, VERI-Wild also defines small, medium, and large test sets, with 3,000, 5,000, and 10,000 vehicle IDs, respectively. Table 3 compares the performance of MUSP, the baseline, and various relevant previous models using mAP. MUSP exhibits a remarkable performance improvement compared with the baseline, achieving 4.8%, 5.6%, and 6.6% improvement for the small, medium, and large test sets, respectively. As with the VehicleID dataset, the performance improvement is particularly noticeable for complex test sets with many IDs. SOTA performance was achieved with 2.1%, 2.6%, and 2.9% improvement over PVEN, the current SOTA method, for the small, medium, and large test sets. Compared with both the baseline and PVEN, the MUSP improvement increases with test set size. Table 4 compares the performance of MUSP, the baseline, and various relevant previous models using the CMC metric. MUSP achieves 0.9%, 1.2%, and 1.8% improvement on CMC@1 for the small, medium, and large test sets, respectively, compared with the baseline, and performance comparable to PVEN [24], the current SOTA. Thus, MUSP consistently improves performance across all test sets, confirming that increasing the model's representation capability is sufficient to achieve comparable or superior performance.

Number of attentions
Experiments were conducted to determine how the number of attentions affects performance. Table 5 shows optimal performance with five attentions. Increasing the number of attentions increases the number of areas into which vehicles can be segmented, which can improve model performance. Fewer than three attentions produce significantly lower performance than four or more: it is difficult to segment a vehicle semantically with a small number of attentions. The experiments illustrate that the desired part recognition and comparison can only be performed when four or more attentions are used.

Activation functions of spatial attention module
CBAM [31] and SENet [6] used sigmoid-based attention modules, whereas the proposed spatial attention module is based on softmax. Softmax satisfies our spatial partition purposes more closely because it has a normalization effect that sets the sum of the dimension elements equal to 1, and combining softmax with spatial diversity loss produces exclusive spatial activation. Gradient vanishing can also occur with the sigmoid approach as training progresses, degrading performance. We compared the softmax- and sigmoid-based attention modules to verify that softmax is the more suitable activation function. The spatial attention module discards the last attention weight, so we retained four attentions for the sigmoid-based module and five for the softmax-based module. Table 7 illustrates that softmax-based attention achieves 0.8%, 0.3%, and 0.2% improvement over sigmoid-based attention for the mAP, CMC@1, and CMC@5 metrics, respectively. Thus, the overall performance of softmax-based attention is superior to that of sigmoid-based attention.

Cross-domain experiment
The preceding experiments confirm that the proposed MUSP performs particularly well for larger test sets closer to real-world environments. Another problem that arises in real-world environments is recognizing previously unseen vehicles. Therefore, we conducted a cross-domain experiment comparing RAM [20] and EALN [22], trained and tested on VehicleID, with PVEN [24] and MUSP trained on VERI-Wild and tested on VehicleID. Table 8 presents overwhelming MUSP performance in cross-domain tests: the attention partition operates effectively even for vehicles not previously learned. MUSP outperformed models trained on the same dataset and achieved 6% and 2.5% improvement over the baseline and PVEN, respectively. These results are consistent with Figure 5, where vehicle parts in the VeRi-776 and VERI-Wild datasets were identified equally well even though MUSP was trained on the VehicleID dataset.

Conclusions
In this paper, we proposed MUSP, a network that divides vehicle areas and extracts features using attention, without metadata. The visualization in Figure 5 illustrates that the attention parts selected by the spatial attention module also operate effectively on unseen data and are invariant to orientation. The spatial and channel-wise attention modules are vital MUSP components and were verified experimentally on three datasets. The experiments demonstrated that the proposed method is comparable or superior to current SOTA methods.
For future research, we will consider applying MUSP to feature maps extracted from various layer levels of a backbone network, as in SENet [6]. MUSP is currently applied only to the feature map of the last backbone layer, but it can also be applied to feature maps from intermediate layers. We expect this approach could further improve performance.