Multi-Attention-Based Soft Partition Network for Vehicle Re-Identification

Vehicle re-identification distinguishes images of the same vehicle from images of other vehicles. It is challenging because of significant intra-instance differences between images of an identical vehicle captured from different views and subtle inter-instance differences between similar vehicles. To address this issue, researchers have extracted view-aware or part-specific features via spatial attention mechanisms, which usually produce noisy attention maps or otherwise require expensive additional annotation of metadata, such as key points, to improve their quality. Meanwhile, based on researchers' insights, various handcrafted multi-attention architectures for specific viewpoints or vehicle parts have been proposed. However, this approach does not guarantee that the number and nature of the attention branches will be optimal for real-world re-identification tasks. To address these problems, we propose a new vehicle re-identification network based on a multiple soft attention mechanism that captures various discriminative regions from different viewpoints more efficiently. Furthermore, this model significantly reduces the noise in spatial attention maps through a new method that creates an attention map for insignificant regions and then excludes it from generating the final result. We also combine a channel-wise attention mechanism with a spatial attention mechanism to efficiently select the semantic attributes important for vehicle re-identification. Our experiments showed that the proposed model achieved state-of-the-art performance among attention-based methods without metadata and was comparable to approaches using metadata on the VehicleID and VERI-Wild datasets.


Introduction
Vehicle re-identification (Re-ID) identifies the same vehicle across numerous images: it searches gallery images for the car depicted in a given query image. This task has received considerable attention recently because Re-ID technology can be used to analyse traffic flow, which helps build smart cities, and is an essential technology for surveillance systems. Vehicle Re-ID is particularly challenging because vehicle exteriors can be captured in numerous environments, and different lighting and viewpoints can cause significant intra-instance differences. Different vehicles can also look similar because of matching colours and general vehicle types.
Recent studies have used convolutional neural networks (CNNs) and metric learning methods (Jiang et al., 2018; Khorramshahi et al., 2019a; Lee et al., 2022; Liu et al., 2018; Zhang et al., 2017; Zhou & Shao, 2018). In the latter, vehicle images are encoded as representative vectors in an embedding space, and the distances between the vectors are compared. Thus, it is crucial to select features robust to variations in environment, lighting conditions, and viewpoint.
To effectively identify the same vehicles, several studies have adopted metadata attributes (e.g., orientation, colour, type, keypoint, viewpoint, and spatiotemporal information) (Jiang et al., 2018; Shen et al., 2017; Tang et al., 2017; Wang et al., 2017). Recent studies have semantically divided vehicles into their parts using image segmentation and masking; features are then extracted from the segmented regions (Chen et al., 2020b; He et al., 2019; Meng et al., 2020). These methods can compare not only the global appearance but also vehicle parts; thus, subtle vehicle parts can be embedded and compared. However, they have a major drawback: they require extensive image annotation. In particular, labelling vehicle parts, such as creating segmentations and bounding boxes, requires significantly more time than labelling whole images. According to one report (Lin et al., 2014), segmenting takes 15 times longer than spotting object locations and 60 times longer than image labelling.
In recent years, the combination of deep learning and visual attention mechanisms has been explored to increase the performance of vehicle Re-ID tasks. Visual attention is formed by training deep neural networks to learn which areas of each new image need to be focused on. Based on their insights, many researchers have proposed various handcrafted spatial attention-based models equipped with multi-attention branches for specific viewpoints and vehicle parts. For instance, some researchers have developed multi-view attention networks (MVANs) in which each branch learns viewpoint-specific features for three to five views (e.g., front, rear, top, and side views) (Chen et al., 2020a, b; Teng et al., 2021; Zhang et al., 2020; Zhou & Shao, 2018). Others have devised part-oriented attention networks that localize various salient vehicle parts, such as the windshield and car head, for hard part-level attention and then provide an additional soft attention refinement at the pixel level (Guo et al., 2019; Zhang et al., 2020). However, as shown in Figs. 1(a) and (c), spatial attention mechanisms are usually hampered by noisy attention maps or require expensive additional annotation of metadata attributes, such as keypoints, to improve quality (Chen et al., 2020a).

Figure 1: Activation heat maps, associated with attention weights, visualized by Grad-CAM (Zhang et al., 2020) for (a) the baseline model with single attention, (b) the proposed model with double attentions, (c) the baseline model with four attentions, and (d) the proposed model with five attentions. The right-most heat maps in (b) and (d) represent the attention map for background or insignificant areas, and the associated feature map is discarded from producing the output results. Compared with the baseline, the proposed model suppresses noise in the spatial attention maps and captures subtle discriminative regions useful for distinguishing vehicles.
Moreover, the handcrafted spatial attention architecture does not ensure that the number and types of attention branches are optimal for real-world Re-ID tasks.
To solve these problems, we proposed the multi-attention-based soft partition (MUSP) network, which uses multiple soft attention mechanisms in the spatial and channel-wise directions, as shown in Fig. 2. Multi-attention enables the model to learn diverse features from various discriminative regions and different viewpoints without introducing any artificial part-specific or view-dependent attention branches. Soft attention gives continuous values for the region mask; its weights can be learned through backpropagation during training, as it is fully differentiable. We also combined a channel-wise attention mechanism with a spatial attention mechanism to select the semantic attributes that are meaningful for vehicle Re-ID. Finally, we introduced a novel method for removing noise from spatial attention maps. In our framework, as shown in Fig. 2, we first apply multiple spatial attention weights to feature maps encoded from a vehicle image, followed by channel-wise attention weights. Here, the last spatially attended feature map is excluded from the input of the channel-wise attention branch. Thus, the edge from the node of the final feature map to the next step in the computation graph is dropped. Then, through backpropagation training, the spatial attention weights of the indiscriminative regions are aggregated into the last attention map, while the spatial attention weights of the salient areas contributing to the final result are collected in the other attention maps. Figure 1(c) shows the heat maps of the resulting attention weights for a four-attention-based model in which all spatial attention maps are used for training and inference; the attention maps contain substantial noise. In contrast, as shown in Fig. 1(d), our model's first to fourth attention maps are clean, and the fifth map, which is excluded from further computation in the pipeline, contains attention weights that are highly activated on the background or irrelevant parts.
Our experiments revealed that our proposed model achieved a state-of-the-art (SOTA) performance among the attention-based methods without metadata and was comparable to the approaches using metadata for the VehicleID (Liu et al., 2016c) and VERI-Wild (Liu et al., 2016a) datasets.
In summary, the main contributions of our approach are as follows: (i) We proposed a novel MUSP network for vehicle Re-ID. With multiple spatial attention and channel-wise attention mechanisms, the model can learn to highlight the most discriminative regions and suppress the distraction of irrelevant parts without introducing any artificial part-specific or view-dependent attention branches. (ii) We developed a novel method that significantly reduces noise in spatial attention maps, thereby solving the problem of spatial attention mechanisms generating noisy maps. (iii) Our approach achieved SOTA performance among the attention-based approaches that did not use metadata and showed comparable performance to the methods that used metadata in experiments on the VehicleID (Liu et al., 2016c) and VERI-Wild (Liu et al., 2016a) datasets.

Figure 2: Proposed MUSP pipeline. First, the image is fed into the backbone network (ResNet50). The features extracted from the backbone network pass through the attention-based network, which is sequentially composed of spatial and channel-wise attention modules. The n − 1 vectors obtained from the attention-based network and a global vector obtained using average pooling are used to train the model with triplet, spatial diversity, and cross-entropy losses. In the inference stage, a total of n vectors are combined using the co-occurrence attentive module.
The rest of the paper is organized as follows: Section 2 reviews recent related studies in detail, particularly attention-based approaches. Section 3 describes the architecture of the proposed system and the detailed algorithms for each step. Section 4 discusses the loss functions for training the model. Section 5 demonstrates our approach's effectiveness and competitiveness on numerous challenging vehicle Re-ID benchmarks through extensive experiments. Finally, Section 6 concludes this paper.

Related Work
Vehicle Re-ID technology has advanced enormously, strongly driven by access to several large datasets (Kanaci et al., 2018; Liu et al., 2016c; Lou et al., 2019b), which enable models to be trained and tested under conditions closer to real-world environments. Deep learning and metric learning have been used for vehicle Re-ID tasks. More representative features must be extracted when embedding vehicle images in the feature space to increase metric learning performance. Consequently, numerous attempts to use vehicle metadata have been introduced, such as orientation, colour, type, key points, viewpoint, and spatiotemporal data. Additionally, most researchers have combined deep learning and visual attention mechanisms to extract features of the distinguishing regions and thereby improve the accuracy of the vehicle Re-ID task. The problem here is to identify which type of attention mechanism is more efficient and adaptive among the various types of attention networks. Recently, various methods using generative adversarial networks (GANs) have also been proposed (Zhou & Shao, 2017, 2018); however, there is a large gap between the generated features and reality due to the limited generation ability of existing GANs and the lack of adversarial samples.

Metadata
Temporal data have been adopted by several studies for efficient vehicle Re-ID (Jiang et al., 2018; Liu et al., 2016c; Shen et al., 2017). Shen et al. (2017) used temporal information to track gradual vehicle changes across different cameras, which enabled them to recognize the same vehicle even when it looked different and to overcome the limitations of methods using only spatial information. However, this method has a disadvantage: a continuous stream of images is required. Liu et al. (2016c) re-ranked the images using temporal information after vehicle detection. This approach requires the temporal information of each vehicle even in the inference stage. Furthermore, Jiang et al. (2018) used both temporal and spatial information for re-ranking.
Previous studies used vehicle key points for vehicle Re-ID (Khorramshahi et al., 2019a; Wang et al., 2017). Wang et al. (2017) not only used temporal information but also estimated orientation using key points and extracted orientation-invariant features to improve the performance of vehicle Re-ID. Khorramshahi et al. (2019a) used key points to exploit local features. The key-point-based method has the following disadvantages: it cannot cope with vehicle types that do not exist in the training data, and additional key point labels are required.
Recent studies (Chen et al., 2020b; Liu et al., 2018; Meng et al., 2020) introduced methods for segmenting and comparing vehicle parts using metadata. These methods resemble how humans identify objects: the parts of a vehicle are segmented, and each part is compared separately. Liu et al. (2018) used a detection model to segment vehicle parts. Chen et al. (2020b) proposed a model that segments vehicle parts in a weakly supervised manner using the orientation of the vehicle to improve the performance of vehicle Re-ID. Meng et al. (2020) used a supervised segmentation model to divide vehicles. These methods improve performance but have the disadvantage of requiring additional annotations or models. Detection and segmentation require numerous data resources, and the resulting models are heavy.

Visual attention
The visual attention mechanism enables deep neural networks to learn the areas that must be focused on during training. Figure 1 shows the learnt attention maps of several vehicle images, where the highlighted areas correspond to the subtle and discriminative regions; the attention mechanism automatically extracts these salient regions. Table 1 compares recent attention-based models, including, e.g., the method of Yang et al. (2021), SSIA (Li et al., 2021), and CAL (Rao et al., 2021), in terms of the number of attention branches, the attention type (spatial and/or channel-wise, soft or hard), required metadata, training losses (e.g., triplet and ID losses), and backbone network (typically ResNet50). Typical attention-based models comprise a trunk branch to learn a global feature representation and multiple part- or viewpoint-specific branches to learn feature representations dependent on parts or viewpoints (Chen et al., 2020b; Guo et al., 2019; Zhang et al., 2020).
Most models have spatial attention, whereas others have channel-wise attention. Since a channel-wise feature map is essentially for detecting the corresponding semantic attributes, channel-wise attention can be regarded as the process of selecting semantic attributes that are meaningful or potentially helpful for achieving the goal. Therefore, channel-wise feature maps are usually used to detect discriminative vehicle parts, such as the windshield or tires, while spatial attention feature maps are used to extract viewpoint-aware features (Chen et al., 2020a).
Soft attention uses soft shading to focus on regions, whereas hard attention uses image cropping to focus on regions. Moreover, soft attention can be learnt using gradient descent, whereas hard attention cannot be trained because there is no derivative for the procedure 'crop the image here'. All models use soft attention by default, and hard attention is used when the model has vehicle part branches.
Many models attempt to improve performance using metadata, which may require additional annotations. The triplet and cross-entropy losses are used for basic training, and if any extra label is used, its cross-entropy loss is added. The most commonly used backbone network is ResNet50, which is also used in our model. Table 1 shows that recent attention-based approaches usually adopt multiple attention branches for viewpoints or discriminative parts and introduce spatial and channel-wise attention modules, which are connected in serial or in parallel depending on the developer's choice. However, there are still opportunities to improve existing approaches, as the number of attention branches, the associated viewpoints and parts, and the metadata usage are hyperparameters or developer choices.
We now classify the models listed in Table 1 into three groups and describe them: viewpoint-centric, part-centric, and miscellaneous attention models. Zhou and Shao (2018) proposed a viewpoint-aware attentive multi-view inference (VAMI) model and an adversarial training architecture to produce attention maps for different viewpoints. They further generated multi-view features from a single-view input image. However, VAMI requires the viewpoint and attribute information of the vehicles to train the network for single-view feature extraction. Additionally, because of the lack of direct supervision on the generated attention maps, the attention outcomes are noisy and unfavourably affect the learning of the network. Chen et al. (2020a) proposed the viewpoint-aware channel-wise attention mechanism (VCAM), which uses the viewpoint of a captured image to generate attentive weights. They assumed that the viewpoint of the image determines the visibility of the vehicle parts and that each channel-wise attentive weight is related to the visibility of the corresponding part. The Re-ID feature extraction module is incorporated into the channel-wise attention mechanism to extract the viewpoint-aware feature for Re-ID matching. However, this method requires viewpoint annotations on vehicle images for supervised learning of the viewpoint estimation layer. Meng et al. (2020) proposed a parsing-based view-aware embedding network (PVEN) to achieve view-aware feature alignment and enhance vehicle Re-ID. A network parses a vehicle into four different views and then aligns the features using mask average pooling to provide a fine-grained representation of the vehicle. Furthermore, common-visible attention was designed to focus on the common-visible views, which not only shortens the distance among intra-instances but also enlarges their discrepancy.
Thus, PVEN can capture stable and discriminative information about the same vehicle and outperforms previous methods by a large margin. However, it requires additional annotation for training the vehicle part parser, which consumes considerable labour time. Chen et al. (2020b) proposed a dedicated semantics-guided part-attention network (SPAN) that generates attention masks for three different vehicle views (i.e., front, side, and rear). The part-attention masks enable the network to separately extract discriminative features from each part (or view). Moreover, to correctly recognize two positive images, the features of the co-occurrence view are emphasized when evaluating the feature distance of two images. However, to train the part-attention network, viewpoint semantic labels must be provided, although they are image-level rather than pixel-level labels. Teng et al. (2021) proposed an MVAN, where each branch learns a viewpoint-specific feature for the front, rear, and side views. Furthermore, a spatial attention model is introduced into each branch to learn specific local cues for different viewpoints. The viewpoint-specific features may outperform the general features learned using a uniform network because they can focus on a limited range of views. However, to train the model to estimate the viewpoint of an input vehicle image, the viewpoints must be annotated on the image dataset. Guo et al. (2019) proposed a two-level attention network supervised by a multi-grain ranking loss (TAMR) to learn efficient feature embeddings for vehicle Re-ID tasks. The two-level attention network consists of hard part-level and soft pixel-level attention. The former is designed to localize the salient vehicle parts, such as the windscreen and car head. The latter provides additional attention refinement at the pixel level to focus on the distinctive characteristics within each part.
The model consists of two groups of three branches: (i) a trunk branch for learning a global feature representation with soft attention modules and (ii) two salient part branches with two-level hard and soft attention modules. Additionally, they proposed a coarse-to-fine multi-grain ranking loss to further enhance the discriminative ability of the learned features. However, as the windscreen and car head are only visible from the front, the part branches do not provide useful information from the rear or side views. Zhang et al. (2020) introduced a part-guided attention network (PGAN) that combines part-guided bottom-up and top-down attention, global features, and partial visual features in an end-to-end framework. First, the PGAN detects the locations of the various parts and salient regions and generates hard attention on them. Then, soft attention weights are learned for candidate parts to highlight the most discriminative regions and suppress the distraction of irrelevant parts. Finally, PGAN aggregates the global appearance and local features to improve the feature performance.

However, we must annotate vehicle images with bounding boxes and class labels to train an object detector to recognise vehicle parts or regions.

Miscellaneous attention models

Teng et al. (2018) developed a spatial and channel attention network (SCAN), which embeds spatial and channel attention branches behind convolutional layers to highlight the outputs in discriminative regions and channels, respectively. The attention branches and convolutional layers are jointly trained using triplet and cross-entropy losses. However, since this method only focuses on the global feature map regardless of viewpoints, viewpoint-centred or part-focused local attention is crucial. Khorramshahi et al. (2020) proposed self-supervised attention for vehicle re-identification (SAVER), which generates an attention map using a variational auto-encoder (VAE). First, SAVER generates a coarse reconstruction image; then, it computes the pixel-wise difference from the original image to construct a residual image. The residual image contains crucial details required for Re-ID and acts as a pseudo-saliency or pseudo-attention map that highlights discriminative regions in the image. Finally, the convex combination (with trainable parameter α) of the reconstructed and residual images is calculated and passed through the Re-ID backbone for deep feature extraction. SAVER benefits from self-supervised attention generation and eliminates the requirement for extra annotations of vehicle key points and part bounding boxes, which are otherwise used to train specialized detectors. However, it only learns instance-specific discriminative features and ignores significant viewpoint changes between images of identical vehicles. Zheng et al. (2020) proposed a multi-scale attention framework (MSA), which includes multi-branch subnetworks that generate different scales of feature maps using bilinear interpolation. Each subnetwork has a spatial-channel attention block, as in SCAN.
The multi-scale mechanism at the feature map level helps supply the information lost through pooling operations. However, this framework does not consider view-dependent attention, which is crucial for vehicle Re-ID. Li et al. (2021) first introduced self-supervised representation learning to discover geometric features and applied an interpretable attention module to condense these features. Unlike other attention models, which are automatically learned using the backpropagation algorithm, the proposed attention model takes the local spatial maxima and channel-wise maxima to construct an attention map. By sharing the encoder of the self-supervised learning module with the attention module, the model can discover discriminative features from automatically found geometric locations without corresponding supervision. The discovered locations are mainly the corners of the vehicle body. Rao et al. (2021) presented the counterfactual attention learning (CAL) method, based on causal inference, to learn more effective attention. They designed a framework to quantify the quality of attention by comparing the effects of facts and counterfactuals on the final prediction. Furthermore, they proposed maximizing this difference to encourage the network to learn more effective visual attention.

Proposed Method
The proposed multi-attention network comprises a backbone network, which encodes a convolutional feature map for a given image, and an attention-based network, which extracts a set of weighted feature vectors, each focusing on a specific vehicle region. The latter consists of two modules: the spatial attention module, for soft partitioning of vehicle regions, and the channel-wise attention module, which is based on the squeeze-and-excitation (SE) method (Hu et al., 2020). The weighted feature vectors are used to compare the distance between images for metric learning and are fed to a classifier to predict the vehicle ID. The classifier includes batch normalization (Ioffe & Szegedy, 2015) and linear layers (Luo et al., 2019). n − 1 classifiers are applied to the n − 1 weighted feature vectors, excluding the background vector. Figure 2 depicts the overall architecture of the MUSP, and its components are described in detail in the rest of this section.

Feature extraction from vehicle images
We selected ResNet50 (He et al., 2016) as the backbone for feature extraction, removed the last fully connected (FC) layer, and used the output of the last convolution layer. Thus, the feature extraction process is as follows:

M = CNN_B(I), M ∈ R^(h × w × d),

where CNN_B is the base network, M is the feature map extracted from I, and h, w, and d are the dimensions of M, which depend on the feature extractor and input image I.

Spatial multi-attention mechanism
We use vehicle partitioning to extract subtle vehicle parts for vehicle Re-ID (Chen et al., 2020b; Meng et al., 2020; Sun et al., 2018) and an attention method to refine the embedded features. Khorramshahi et al. (2019b) proposed a method for detecting and re-cropping a vehicle during pre-processing to reduce background regions. They used a detection model and bounding box annotations to suppress the noisy background. We assume that the same function can be performed within the deep learning model without additional models or artificial intervention. Figure 1(b) illustrates that the vehicle area was accurately recognized without additional annotation or detection. Meng et al. (2020) found that subtle vehicle components significantly impact the division of parts. However, they cannot be captured accurately using single attention because attention focuses on easily comparable features, such as headlights and bumpers. Therefore, we used spatially separated multiple attention to focus on different vehicle areas. This distributed attention can consider different parts; thus, the model can see and compare more vehicle details. Consequently, we designed a spatial multiple attention mechanism. We apply convolution layers to the feature map M, which is encoded by the backbone network, to extract two feature maps for attention weights (A) and values (V). The attention feature map has n channels of size h × w; each channel corresponds to a vehicle part. The value feature map has c channels of size h × w. Attention weights are normalized by the softmax function and multiplied by the corresponding values to obtain n weighted values, to which average pooling is applied to extract the final weighted feature vectors {f_i}, i = 1, …, n. Note that the softmax function is applied over the channel dimension n, rather than the spatial dimension hw, so the attention weights are exclusive at each spatial point of the value map. We discard the final weighted feature vector f_n.
If the final vector is not used in the subsequent layers of the network, only the remaining vectors are trained to pay attention to the discriminative regions. Therefore, the attention for the indiscriminative regions, such as the background, is assigned to the discarded vector, which absorbs the noise (Fig. 3). The entire process can be summarized as follows:

V = CNN_VE(M), A = σ(CNN_AE(M)),
f_i = AvgPool(A_i ⊙ V), F = {f_1, …, f_n}, F_d = {f_1, …, f_(n−1)},

where CNN_VE and CNN_AE are the value and attention extractors, respectively, i.e., simple single convolution layers with a 3 × 3 kernel and 1 × 1 padding; F is the set of extracted feature vectors; f_i is a single vector in the feature set F; F_d is the final feature set with the last (background) feature discarded; | · | is the matrix size; and σ is the softmax operation.
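The soft spatial partitioning above can be sketched in NumPy. This is a minimal sketch, not the trained model: the 3 × 3 convolutions CNN_VE and CNN_AE are replaced by hypothetical per-pixel linear projections `W_val` and `W_att` for brevity, and the shapes follow the text (M is h × w × d, A has n maps, V has c channels).

```python
import numpy as np

def spatial_multi_attention(M, W_att, W_val):
    """Soft-partition a feature map M (h, w, d) into n weighted vectors.

    W_att: (d, n) projection producing n attention maps.
    W_val: (d, c) projection producing the value map.
    Returns F of shape (n, c); the caller discards the last row (background).
    """
    h, w, d = M.shape
    A = M.reshape(-1, d) @ W_att              # (h*w, n) attention logits
    A = np.exp(A - A.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)      # softmax over the n maps per pixel
    V = M.reshape(-1, d) @ W_val              # (h*w, c) value map
    # Weighted average pooling: each f_i attends to its own soft region.
    F = (A.T @ V) / (h * w)                   # (n, c)
    return F
```

Because the softmax is taken over the n maps at every spatial location, the attention weights at each pixel sum to one, so the n pooled vectors form an exclusive soft partition of the value map.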

Channel-wise attention mechanism
The set of weighted feature vectors F_d with spatial attention is recalibrated by capturing and applying channel-wise attention, as in the SE block that modulates channel activation (Hu et al., 2020). Channel-wise attention adjusts the activation intensity according to the importance of each channel. This attention reduces the intensities of unnecessary feature elements, thereby reducing their influence on image distance calculations. Because each feature vector relates to a feature map that is highlighted on a specific vehicle area, channel-wise attention should be applied to all n − 1 feature vectors, in contrast to the original SE, which controls one feature using FC layers. We propose a channel-wise attention network based on an extended SE (ESE) algorithm. We first reshape the set of weighted feature vectors F_d into one vector and then feed the vector to the SE block to modulate channel activation. The ESE module comprises two linear layers: the first is followed by a rectified linear unit and the second by a sigmoid operation. The ESE input dimension is 2048, and its output dimensions are 128 and 2048, in sequence. The result of the ESE is the channel-wise attention, which is multiplied by the original F_d. Thus, the ESE can be summarized as follows:

f_d = Reshape(F_d) ∈ R^r, f_e = ρ(MLP(f_d)) ⊙ f_d, F_e = Reshape(f_e),

where r is |c × (n − 1)|, MLP is a multi-layer perceptron, ρ is the sigmoid operation, and F_e is the set of n − 1 final recalibrated features. f_d and f_e are one-dimensional (1D) vectors, and F_d and F_e are 2D matrices.
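A minimal NumPy sketch of the ESE recalibration follows; random matrices stand in for the two learned linear layers, and the 2048/128 dimensions from the text are shrunk for illustration.

```python
import numpy as np

def ese_recalibrate(F_d, W1, b1, W2, b2):
    """Extended squeeze-and-excitation over the n-1 stacked part vectors.

    F_d: (n-1, c) spatially attended features. The matrix is flattened to
    one vector, squeezed through a ReLU bottleneck, expanded back to the
    input dimension, and the resulting sigmoid gates rescale each element.
    """
    f_d = F_d.reshape(-1)                       # (c*(n-1),) flattened input
    hidden = np.maximum(0.0, W1 @ f_d + b1)     # ReLU bottleneck (e.g., 128-d)
    gates = 1.0 / (1.0 + np.exp(-(W2 @ hidden + b2)))  # sigmoid, input-sized
    return (gates * f_d).reshape(F_d.shape)     # recalibrated F_e
```

Since every gate lies in (0, 1), the recalibration can only attenuate feature elements, matching the description of suppressing unnecessary channels.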

Distance computation
The feature vector set F_e extracted from the attention-based network and a feature vector f_g obtained by global average pooling of M are used to calculate the losses. We separately apply triplet loss to each feature vector to train the proposed model, and we adopt a multi-feature re-weighting function called the co-occurrence attentive module (Chen et al., 2020b), with some modifications, to calculate the feature distance by integrating these features for inference. The weight of the distance between two vehicles a and b for the i-th feature is calculated as follows:

w_(a,b),i = AR_(a,i) × AR_(b,i),

where AR_(a,i) is the area ratio of the i-th attention weight for the a-th image, calculated by averaging the attention weights. The original study (Chen et al., 2020b) used weight = 1 for the global feature, whereas we use 1/(n − 1) for the global feature weight w_(a,b),g. Hence, the distance between two vehicles is denoted as follows:

D(a, b) = w_(a,b),g · ||f_(a,g) − f_(b,g)||_2 + Σ_(i=1)^(n−1) w_(a,b),i · ||f_(a,i) − f_(b,i)||_2,

where f_(a,i) is the i-th feature for the a-th image, f_(a,g) is the global feature, and || · ||_2 is the Euclidean distance.
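The re-weighted distance can be sketched in NumPy as follows. This is a hedged sketch: it assumes the co-occurrence weight is the product of the two images' area ratios and a fixed global weight of 1/(n − 1), which is our reading of the text rather than a verbatim reimplementation of the co-occurrence attentive module.

```python
import numpy as np

def reweighted_distance(Fa, Fb, fg_a, fg_b, AR_a, AR_b):
    """Distance between two images from part features and a global feature.

    Fa, Fb: (n-1, c) part features; fg_a, fg_b: (c,) global features;
    AR_a, AR_b: (n-1,) area ratios (mean attention weight per map).
    Assumed weighting: w_i = AR_a[i] * AR_b[i], global weight 1/(n-1).
    """
    n_minus_1 = Fa.shape[0]
    w = AR_a * AR_b                             # co-occurrence weights
    part = np.linalg.norm(Fa - Fb, axis=1)      # per-part Euclidean distances
    glob = np.linalg.norm(fg_a - fg_b)          # global feature distance
    return (w * part).sum() + glob / n_minus_1
```

The product weighting down-weights parts that are barely visible in either image, so only regions visible in both images dominate the comparison.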

Loss Function
We use three loss functions to train the model: cross-entropy loss for vehicle ID prediction (L_id), triplet loss for distance learning (L_tri), and production loss to separate the attention features (L_div). The overall loss function is given as follows:

L = L_id + L_tri + L_div.

Cross-entropy loss
We apply cross-entropy loss following the vehicle ID prediction layer:

L_id = − Σ_(l=1)^(n) Σ_(i=1)^(K) Σ_(j=1)^(C) y_(i,j,l) log ŷ_(i,j,l),

where n is the number of features, K is the number of images in a mini-batch, C is the number of classes, y_(i,j,l) is the j-th element of the one-hot encoded vector describing the ground truth for the i-th sample in the mini-batch and the l-th feature vector, and ŷ_(i,j,l) is the j-th element of the output vector of the softmax FC layer for the i-th image and l-th feature vector.
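A NumPy sketch of the per-branch ID loss is given below; averaging over branches and batch is our convention for illustration, and the paper's normalization may differ.

```python
import numpy as np

def id_loss(logits, labels):
    """Cross-entropy over n feature branches.

    logits: (n, K, C) classifier outputs for n branches, K images, C classes.
    labels: (K,) ground-truth vehicle IDs in [0, C).
    Applies a numerically stable softmax per branch and averages the
    negative log-likelihood of the correct class.
    """
    n, K, C = logits.shape
    z = logits - logits.max(axis=2, keepdims=True)   # stabilize the softmax
    p = np.exp(z) / np.exp(z).sum(axis=2, keepdims=True)
    nll = -np.log(p[:, np.arange(K), labels])        # (n, K) per-sample losses
    return nll.mean()
```

With uniform logits the loss equals log C, the entropy of a uniform prediction, which is a handy sanity check when wiring up the classifier heads.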

Triplet loss
The proposed network is optimized using triplet loss for metric learning, which trains the network to minimize the distance between features from the same image class and simultaneously maximize the distance between features from different image classes. In a mini-batch that contains P identities and Q images per identity, each image (anchor) has Q − 1 images of the same identity (positives) and (P − 1) × Q images of different identities (negatives). The batch-hard triplet loss (Hermans et al., 2017) is defined as follows:

L_tri = Σ_(i=1)^(P) Σ_(a=1)^(Q) [ m + max_(p=1…Q) ||v_(a,i) − v_(p,i)||_2 − min_(j≠i, b=1…Q) ||v_(a,i) − v_(b,j)||_2 ]_+,

where v_(a,i) is the prediction vector for the a-th image of the i-th identity group, and m is the margin controlling the difference between positive and negative pair distances, which helps cluster the distribution of the same vehicle images more densely.
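The batch-hard mining of Hermans et al. (2017) can be sketched in NumPy; the mean reduction over anchors is our convention for this sketch.

```python
import numpy as np

def batch_hard_triplet(features, ids, margin=0.3):
    """Batch-hard triplet loss (Hermans et al., 2017), NumPy sketch.

    features: (B, c) embeddings; ids: (B,) identity labels.
    For each anchor, picks the hardest positive (farthest same-ID sample)
    and hardest negative (closest different-ID sample) and applies a
    hinge at `margin`.
    """
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)       # (B, B) pairwise distances
    same = ids[:, None] == ids[None, :]
    pos = np.where(same, dist, -np.inf).max(axis=1)   # hardest positive per anchor
    neg = np.where(~same, dist, np.inf).min(axis=1)   # hardest negative per anchor
    return np.maximum(0.0, margin + pos - neg).mean()
```

When all positives are already closer than all negatives by more than the margin, the loss is zero, which is the clustering behaviour described above.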

Spatial diversity loss
We adopt spatial diversity loss (Chen et al., 2020b) to restrict overlapping areas, ensuring that each attention weight acts on a different position:

L_div = Σ_i Σ_{m≠n} Σ_{x,y} a_i^m(x, y) a_i^n(x, y),

where a_i^n is the n-th attention weight for the i-th image in the mini-batch; that is, the spatial diversity loss is the summation of the position-wise products of the attention weights.
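The overlap penalty can be sketched directly from the formula; the (K, n, H, W) layout and normalization by batch size are assumptions for illustration.

```python
import numpy as np

def diversity_loss(attn):
    """L_div sketch: penalize spatial overlap between attention maps.
    attn: (K, n, H, W) attention weights for a mini-batch. For each image,
    sum the position-wise products of every pair of distinct maps, so the
    loss is zero only when the maps activate on disjoint regions."""
    K, n, _, _ = attn.shape
    loss = 0.0
    for i in range(K):
        for m in range(n):
            for q in range(m + 1, n):
                loss += (attn[i, m] * attn[i, q]).sum()
    return loss / K
```

Because each term is a product of non-negative weights at the same position, the loss can only vanish when no two maps are simultaneously active anywhere, which is exactly the intended spatial partition.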

Implementation details
During pre-processing, we resized all images to 256 × 256 pixels and applied random erasing and translation. We used the Adam optimizer (Kingma & Ba, 2014) with a weight decay of 5e−4 and a momentum of 0.9. The proposed model was trained with a batch size of 64 (16 unique vehicle IDs per batch) for 90 epochs with an initial learning rate of 3.5e−4, which was divided by 10 at epochs 30 and 60; we used a warm-up over the first 10 epochs, increasing the learning rate from 3.5e−5 to 3.5e−4. Furthermore, we applied label smoothing to avoid overfitting. Training required 6 h and 2 h on the VehicleID and VeRi-776 datasets, respectively, on an NVIDIA Quadro RTX 6000 GPU. The training code was written in PyTorch.
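The learning-rate schedule just described (warm-up, then step decay) can be sketched as a plain function; the linear shape of the warm-up is an assumption, since the text only gives its start and end values.

```python
def learning_rate(epoch, base_lr=3.5e-4, warmup_epochs=10,
                  milestones=(30, 60), gamma=0.1):
    """Schedule sketch matching the described settings: warm-up from
    base_lr/10 to base_lr over the first 10 epochs (assumed linear),
    then division by 10 at epochs 30 and 60 (90 epochs total)."""
    if epoch < warmup_epochs:
        start = base_lr / 10.0
        return start + (base_lr - start) * epoch / warmup_epochs
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

In PyTorch this is typically expressed as a `MultiStepLR` scheduler combined with a warm-up wrapper, but the closed form above makes the epoch-to-rate mapping explicit.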
In the training phase, we used the weighted feature vectors from the spatial attention module and the vehicle ID prediction vectors, as described in Section 4, for the loss functions. In the inference phase, however, we used only F_e and f_g with the re-weighting method to compute the distances between vehicles.

Baseline and compared methods
We set our baseline as the strong baseline with bag-of-tricks proposed by Luo et al. (2019). Our proposed model uses the attention modules to extract n weighted feature vectors, whereas the baseline replaces the attention modules with an average pooling layer; the pre-processing, training procedure, and remaining architecture are identical.
We compared our method with several SOTA methods. As selected and briefly described by Meng et al. (2020), BOW-CN (Zheng et al., 2015) first adopted the BOW model based on the colour name (CN). Local maximal occurrence representation (LOMO; Liao et al., 2015) is robust to varied lighting conditions. The fusion of attributes and colour features (FACT; Liu et al., 2016b) combines low-level colour and high-level semantic features. BOW-CN, LOMO, and FACT are handcrafted-feature-based methods; the remainder are deep learning-based methods. GoogLeNet (Yang et al., 2015a) is a GoogLeNet model fine-tuned on the CompCars (Yang et al., 2015b) dataset. Plate-SNN uses number-plate features to enhance vehicle retrieval. Siamese+Path (Shen et al., 2017) proposed a visual-spatial-temporal path to exploit temporal restrictions. GSTE (Bai et al., 2018) proposed group-sensitive triplet embedding to elegantly model intra-class variance. VAMI (Zhou & Shao, 2017) generated features of different views using a GAN, while the feature distance adversarial network (FDA-Net; Lou et al., 2019b) generated hard negative samples in the feature space. EALN (Lou et al., 2019c) proposed an adversarial network that can generate samples localized in the embedding space. OIFE used 20 pre-defined key points to roughly align vehicle features. RAM horizontally split the image into three parts. PRN (He et al., 2019) detected the window, lights, and brand of a vehicle to capture differences between vehicle instances. AAVER (Khorramshahi et al., 2019a) proposed an attention mechanism based on vehicle key points and orientation. UMTS (Jin et al., 2020) introduced an uncertainty-aware multi-shot network composed of teacher and student networks. HPGN (Shen et al., 2022) used a hybrid pyramidal graph network to explore the spatial significance of feature maps.
We added the following recent representative attention-based deep learning methods, as discussed in Section 2.2: SCAN (Teng et al., 2018), VCAM (Chen et al., 2020a), SPAN (Chen et al., 2020b), SAVER (Khorramshahi et al., 2020), PVEN (Meng et al., 2020), PGAN (Zhang et al., 2020), MSA (Zheng et al., 2020), MVAN (Teng et al., 2021), PSA (Yang et al., 2021), SSIA (Li et al., 2021), CAL (Rao et al., 2021), and MUSP. Table 2 compares VehicleID dataset outcomes for the baseline, MUSP, and existing attention-based vehicle Re-ID models using the CMC@1 and CMC@5 metrics. The highest value without additional metadata is shown in bold, whereas the overall highest value is underlined in italics. The MUSP improved on the baseline for all CMC metrics, by 2.5% and 1.1%, respectively. Table 3 reports model performance (mAP, CMC@1, and CMC@5) on the VERI-Wild dataset.

Number of spatial attention maps
According to the interpretation of Chen et al. (2020a), the spatial attention mechanism extracts viewpoint-aware features and the channel-wise attention mechanism detects discriminative vehicle parts. Experiments were conducted to observe how the number of attention maps affects performance and which number is optimal for the experimental datasets. Generally, increasing the number of attention maps increases the number of views, thereby improving model performance, whereas reducing it has the opposite effect. Table 5 shows that five attention maps yielded the best performance, and using fewer than three produced significantly lower performance than using four or more. This suggests that too few viewpoint-aware attention maps cannot capture the discriminative features of a vehicle from sufficiently diverse angles; the experiments indicate that the desired part recognition and comparison can only be performed with four or more spatial attention maps. Figure 4 shows the heat maps corresponding to the weights of the spatial attention maps trained on the VehicleID dataset. Figures 4(a), (c), and (e) illustrate the attention distribution when all four spatial attention maps are used for inference, and Figs. 4(b), (d), and (f) show the distribution when four out of five attention maps are used. As the figures show, excluding one attention map from the computational pipeline during training considerably reduces the noise. We also compared the performance of the two models on the VeRi-776 and VehicleID datasets. As shown in Table 6, the model using four out of five attention maps outperforms the model using all four spatial attention maps for both training and inference. Table 7 compares the MUSP performance with and without the channel-wise attention module on the VehicleID dataset. The ESE is the central element of the channel-wise attention module.
It improved CMC@1 and CMC@5 by 0.015 each, verifying that it successfully recalibrated channel information, which is helpful for distance computation and embedding. The module receives n − 1 features representing discriminative vehicle part regions from an arbitrary viewpoint and adjusts the channel activation by considering the overall system performance; when some channel information is unnecessary for vehicle comparison, the ESE reduces that channel's activation to suppress its effect. Figure 5 compares the ESE impact across epochs: the ESE improved performance at every epoch and accelerated convergence. Figure 6 shows the results of applying the MUSP to three primary vehicle Re-ID datasets; the model was trained on the VehicleID dataset and tested on three different datasets, and each column corresponds to one attention map. Regardless of orientation and vehicle type, each attention map has high activation on one or more specific regions. For example, the first attention map reacts to the area below the headlights, the second reacts to the headlights, and the third reacts to the front or rear window; the fourth has high activation on the lower parts of the window and vehicle, and the fifth covers the background. The attention maps are invariant to domain and orientation attributes even for unseen domains not used in training (e.g., the VERI-Wild and VeRi-776 datasets). CBAM (Woo et al., 2018) and SENet (Hu et al., 2020) used sigmoid-based attention modules, whereas the proposed spatial attention module is based on softmax. Softmax satisfies our spatial partition purpose more closely because it has a normalization effect that constrains the elements along the attention dimension to sum to one; combining softmax with spatial diversity loss produces mutually exclusive spatial activations. Moreover, gradient vanishing can occur with the sigmoid approach as training progresses, degrading performance.
We compared the softmax- and sigmoid-based attention modules to verify that the former is the more suitable activation function. Because the spatial attention module discards the last attention weight, we retained four attention maps for the sigmoid-based module and five for the softmax-based module. Table 8 shows that the softmax-based attention improved on the sigmoid-based attention by 0.8%, 0.3%, and 0.2% for the mAP, CMC@1, and CMC@5 metrics, respectively, demonstrating superior overall performance.
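The softmax-based partition with a discarded background map can be sketched as follows. This is a simplified NumPy illustration: the 1×1-convolution weight `proj`, the function name, and all shapes are assumptions, not the paper's architecture code.

```python
import numpy as np

def softmax_attention_features(feature_map, proj, n=5):
    """Softmax spatial attention with the last ("background") map dropped.
    feature_map: (C, H, W) backbone output; proj: (n, C) weight producing
    n attention logits per position. Softmax across the n maps makes the
    per-position weights sum to one, so suppressing the background map
    removes its noise from the n-1 retained part features."""
    C, H, W = feature_map.shape
    logits = proj @ feature_map.reshape(C, H * W)        # (n, H*W)
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    attn = (e / e.sum(axis=0, keepdims=True)).reshape(n, H, W)
    feats = []
    for i in range(n - 1):                               # discard last map
        w = attn[i] / (attn[i].sum() + 1e-12)            # per-map pooling
        feats.append((feature_map * w[None]).sum(axis=(1, 2)))
    return np.stack(feats), attn                         # (n-1, C), (n, H, W)
```

Because the softmax normalizes across maps at every position, any activation absorbed by the background map is automatically subtracted from the part maps, which is the noise-reduction mechanism discussed above.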

Cross-domain experiments
These experiments confirm that the proposed MUSP performs particularly well on larger test datasets close to real-world environments. Recognizing previously unseen vehicles is another problem that emerges in real-world environments. Therefore, we conducted a cross-domain experiment comparing RAM (Lin et al., 2014) and EALN (Lou et al., 2019a), which were trained and tested on the VehicleID, with PVEN (Meng et al., 2020) and the MUSP trained on the VERI-Wild and tested on the VehicleID. Table 9 shows a decisive MUSP advantage in the cross-domain tests; the attention partition operates effectively even for vehicles never seen during training. The MUSP exceeded the models trained on the same dataset, improving by 6% and 2.5% over the baseline and PVEN, respectively. These results are consistent with Fig. 6, in which vehicle parts on the VeRi-776 and VERI-Wild datasets were identified equally well although the MUSP was trained on the VehicleID dataset.

Conclusions
In this study, we proposed the MUSP network based on multiple soft attentions for vehicle Re-ID. This model learns to highlight the most discriminative regions and suppress distraction from irrelevant parts using multiple spatial attention and channel-wise attention mechanisms, without introducing any handcrafted part-specific or view-dependent attention branches. We introduced a novel method that considerably reduces noise in spatial attention maps, thereby solving the noise problem of spatial attention mechanisms. This architecture achieved significant performance improvement without using additional metadata. The visualization in Fig. 6 illustrates that the parts selected by the spatial attention module also operate effectively on unseen data and are invariant to orientation. The spatial and channel-wise attention modules are vital MUSP components, as experimentally verified on the VehicleID, VERI-Wild, and VeRi-776 datasets. The experiments showed that the proposed method was comparable or superior to current SOTA methods. In particular, among the attention-based models that do not use additional metadata, our model achieved a SOTA performance and was comparable to the methods that use metadata. As this performance is achieved without metadata attributes or post-processing, such as re-ranking, the performance of the model could improve significantly if they were used.
For future research, we will consider applying the MUSP to feature maps extracted from various levels of a backbone network, such as the SENet. The MUSP is currently applied only to the feature map produced by the last layer of the backbone; applying it to feature maps extracted from the middle layers may further improve performance. Furthermore, the proposed multiple soft attention method can be applied to various areas (Mohammed et al., 2022; Shi et al., 2022; Yan, 2018), such as monitoring traffic (Baek & Lee, 2022; Jain et al., 2019), production lines (Park & Lee, 2022), or intelligent human-machine systems (Eom & Lee, 2015; Lee & Yoon, 2020).