The present disclosure relates to the technical field of image retrieval, and specifically, to a method for re-recognizing an object image based on multi-feature information capture and correlation analysis.
In recent years, artificial intelligence, computer vision, and other booming technologies have been widely used in various fields. With the continuous development of the information era, combining computer vision with object sales and management has become a topic of wide current concern. Given a queried object image, object image re-recognition can retrieve all images of the same object captured by a plurality of different cameras. The object re-recognition technology not only can improve people's shopping experience, but also can save costs, improve productivity, and reduce the loss rate of objects. An object image re-recognition system is also widely used, not only in retail stores, supermarkets, and other retail settings, but also in logistics companies, warehouses, and other large places.
The existing image re-recognition methods include image re-recognition methods based on hand-crafted features and image re-recognition methods based on deep learning. The methods based on hand-crafted features use inherent attributes of an image to re-recognize the image. However, these methods are limited to certain image types, and thus have a poor generalization capability and are time-consuming. Among the methods based on deep learning, some focus on global information, fail to capture subtle differences between features, and ignore the importance of local information; others can capture only part of the important information and cannot take overall information into account, resulting in low accuracy of image re-recognition.
In order to overcome the shortcomings of the above technologies, the present disclosure provides a method for improving efficiency and accuracy of object re-recognition.
The technical solution used in the present disclosure to resolve the technical problem thereof is as follows:
A method for re-recognizing an object image based on multi-feature information capture and correlation analysis includes:
Further, step b) includes the following steps:
Further, the transformer encoder in step b-4) includes the multi-head attention mechanism and a feedforward layer. The multi-head attention mechanism is composed of a plurality of self-attention mechanisms. A weight Attention(hl,i) of an ith value in the sequence hl∈Rn×d is calculated according to a formula Attention(hl,i)=softmax(QiKiT/√d)Vi, where Qi represents an ith queried vector, T represents transposition, Ki represents a vector of a correlation between ith queried information and other information, and Vi represents a vector of the ith queried information. A new output embedding SA(hl) of the multi-head attention mechanism is calculated according to a formula SA(hl)=Proj(Concati=1m(Attention(hl,i))). An input h′ of the feedforward layer is calculated according to a formula h′=ωLN(hl+SA(hl)). An output y of the encoder is calculated according to a formula y=ωLN(h′+FFN(h′)), where Proj(·) represents a linear mapping, Concat(·) represents a stitching operation, FFN(·) represents the feedforward layer calculated according to a formula FFN(h′)=∂(h′W1+c1)W2+c2, ∂ represents a GELU activation function, W1 and W2 are learnable weights, c1 and c2 are learnable offsets, ω represents a ratio, and LN represents a normalization operation. The feature output from the first feature branch network and the feature y output from the second feature branch network are stitched into a feature vector of the object image.
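For illustration only, the following is a minimal PyTorch sketch of one self-attention head under the common assumption that the weight takes the scaled form softmax(QiKiT/√d)Vi; the dimensions and random projection matrices are placeholders rather than the disclosure's exact configuration.

import torch

def single_head_attention(h_l, w_q, w_k, w_v):
    # h_l: (n, d) sequence of embeddings; w_q, w_k, w_v: (d, d_head) projections
    q = h_l @ w_q                                          # queried vectors Qi
    k = h_l @ w_k                                          # key vectors Ki (correlation with other tokens)
    v = h_l @ w_v                                          # value vectors Vi
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # Qi Ki^T / sqrt(d)
    weights = torch.softmax(scores, dim=-1)                # attention weight of each value
    return weights @ v                                     # weighted combination of the values

n, d, d_head = 256, 768, 64                                # assumed sizes
h_l = torch.randn(n, d)
w_q, w_k, w_v = torch.randn(3, d, d_head)                  # random projections for the sketch
out = single_head_attention(h_l, w_q, w_k, w_v)            # (n, d_head)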
Further, in step c), a cross-entropy loss VID is calculated according to a formula VID=−Σi=1n gi log(pi), where gi represents an indicator variable, n represents the number of classes in the training data set, and pi represents a predicted probability of a class-i image. The triplet loss function Vt is calculated according to a formula Vt=[∥νa−νp∥2−∥νa−νn∥2+α]+, where α represents a margin, νa represents a sample of a class marker learned by a transformer, νp represents a positive sample of the class marker learned by the transformer, νn represents a negative sample of the class marker learned by the transformer, [d]+ is max(d,0), and d=∥νa−νp∥2−∥νa−νn∥2+α.
The present disclosure has the following beneficial effects: An input feature map is weighted by using a convolutional layer with a spatial attention mechanism and a channel attention mechanism, such that channel and spatial information is effectively combined, which can not only focus on an important feature and suppress an unnecessary feature, but also improve the representation of the feature of interest to obtain a better feature. A transformer is used, whose multi-head attention mechanism can better process features after an image is divided into blocks, capture more abundant feature information, and consider a correlation between features, thereby obtaining good performance and improving efficiency of object image retrieval. The convolutional layer with the channel attention mechanism and the spatial attention mechanism and the transformer with the multi-head attention mechanism are combined to globally focus on the important feature and better capture a fine-grained feature, thereby improving performance of re-recognition.
The present disclosure will be described in detail below with reference to the accompanying drawings and specific embodiments.
A method for re-recognizing an object image based on multi-feature information capture and correlation analysis includes the following steps:
An input feature map is weighted by using a convolutional layer with a spatial attention mechanism and a channel attention mechanism, such that channel and spatial information is effectively combined, which can not only focus on an important feature and suppress an unnecessary feature, but also improve the representation of the feature of interest to obtain a better feature. A transformer is used, whose multi-head attention mechanism can better process features after an image is divided into blocks, capture more abundant feature information, and take into account a correlation between features, thereby obtaining good performance and improving efficiency of object image retrieval. The convolutional layer with the channel attention mechanism and the spatial attention mechanism and the transformer with the multi-head attention mechanism are combined to globally focus on the important feature and better capture a fine-grained feature, thereby improving performance of re-recognition.
Step b) includes the following steps:
The transformer encoder in step b-4) includes the multi-head attention mechanism and a feedforward layer. The multi-head attention mechanism is composed of a plurality of self-attention mechanisms. A weight Attention(hl,i) of an ith value in the sequence hl∈Rn×d is calculated according to a formula Attention(hl,i)=softmax(QiKiT/√d)Vi, where Qi represents an ith queried vector, T represents transposition, Ki represents a vector of a correlation between ith queried information and other information, and Vi represents a vector of the ith queried information. A new output embedding SA(hl) of the multi-head attention mechanism is calculated according to a formula SA(hl)=Proj(Concati=1m(Attention(hl,i))). An input h′ of the feedforward layer is calculated according to a formula h′=ωLN(hl+SA(hl)). An output y of the encoder is calculated according to a formula y=ωLN(h′+FFN(h′)), where Proj(·) represents a linear mapping, Concat(·) represents a stitching operation, FFN(·) represents the feedforward layer calculated according to a formula FFN(h′)=∂(h′W1+c1)W2+c2, ∂ represents a GELU activation function, W1 and W2 are learnable weights, c1 and c2 are learnable offsets, ω represents a ratio, and LN represents a normalization operation. The feature output from the first feature branch network and the feature y output from the second feature branch network are stitched into a feature vector of the object image. The residual value is rescaled by using a smaller value of ω, which helps to enhance the residual connection, and y represents the output of the encoder. After the attention coefficients are learned by the multi-head attention mechanism composed of the self-attention mechanisms, more abundant feature information is captured, and a degree of attention to each feature is obtained. In addition, the residual design and layer normalization are added to prevent gradient vanishing and accelerate convergence. The new feature on this branch is obtained by using a plurality of encoders.
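As a rough sketch of the encoder structure described above (multi-head attention, a GELU feedforward layer, and ω-scaled residual connections with layer normalization), the following PyTorch code may help; the embedding size, head count, feedforward width, ω value, and encoder depth are assumptions, and nn.MultiheadAttention stands in for the concatenation and projection of the individual self-attention heads.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d=768, heads=8, d_ff=3072, omega=0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)  # Proj(Concat of heads)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)
        self.omega = omega  # ratio that rescales the residual branch (value assumed)

    def forward(self, h_l):                                  # h_l: (batch, n, d)
        sa, _ = self.attn(h_l, h_l, h_l)                     # SA(h_l)
        h_prime = self.omega * self.ln1(h_l + sa)            # h' = omega * LN(h_l + SA(h_l))
        return self.omega * self.ln2(h_prime + self.ffn(h_prime))  # y = omega * LN(h' + FFN(h'))

blocks = nn.Sequential(*[EncoderBlock() for _ in range(4)])  # stack of encoders (depth assumed)
y = blocks(torch.randn(2, 257, 768))                         # class marker + 256 block embeddings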
In step c), a cross-entropy loss VID is calculated according to a formula VID=−Σi=1n gi log(pi), where gi represents an indicator variable, n represents the number of classes in the training data set, and pi represents a predicted probability of a class-i image. The triplet loss function Vt is calculated according to a formula Vt=[∥νa−νp∥2−∥νa−νn∥2+α]+, where α represents a margin, νa represents a sample of a class marker learned by the transformer, νp represents a positive sample of the class marker learned by the transformer, νn represents a negative sample of the class marker learned by the transformer, [d]+ is max(d,0), and d=∥νa−νp∥2−∥νa−νn∥2+α.
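A minimal sketch of the two training losses, assuming the standard forms implied above: cross-entropy over the class predictions and a margin-based triplet loss on the learned class-marker features; the batch size, class count, feature dimension, and margin value are placeholders.

import torch
import torch.nn.functional as F

def id_loss(logits, labels):
    # V_ID = -sum_i g_i * log(p_i), with g_i the indicator and p_i the softmax probability
    return F.cross_entropy(logits, labels)

def triplet_loss(v_a, v_p, v_n, alpha=0.3):
    # V_t = [ ||v_a - v_p||_2 - ||v_a - v_n||_2 + alpha ]_+
    d_pos = torch.norm(v_a - v_p, p=2, dim=-1)
    d_neg = torch.norm(v_a - v_n, p=2, dim=-1)
    return torch.clamp(d_pos - d_neg + alpha, min=0).mean()

# Example with assumed shapes: 16 samples, 100 identity classes, 768-dim features.
logits = torch.randn(16, 100)
labels = torch.randint(0, 100, (16,))
v_a, v_p, v_n = torch.randn(3, 16, 768)
total_loss = id_loss(logits, labels) + triplet_loss(v_a, v_p, v_n)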
Re-recognition of a transport vehicle in a large industrial park is taken as an example. An implementation of the present disclosure is as follows: At first, a plurality of images of the transport vehicle in the industrial park are collected to construct a re-recognition database, ID information of a vehicle image in the database is labeled, and the database is divided into a training set and a test set.
Then, a vehicle image re-recognition model based on multi-feature information capture and correlation analysis is established. The model is divided into a first feature branch network and a second feature branch network. In the first feature branch network, a vehicle image h in the training set is input, where h∈Re×w×3, R represents real number space, e represents the number of horizontal pixels of the vehicle image (e=256), w represents the number of vertical pixels of the vehicle image (w=256), and 3 represents the number of channels of each RGB image. The vehicle image is processed by using a convolutional layer to obtain a feature map f of the vehicle image, as shown in
After that, the feature map f of the vehicle image is processed by using a channel attention mechanism. Global average pooling and global maximum pooling are performed on the feature map to obtain two one-dimensional vectors, and the two one-dimensional vectors are normalized through convolution, a ReLU activation function, 1×1 convolution, and sigmoid function operations to weight the feature map. Maximum pooling and average pooling are then performed on all channels at each position in the weighted feature map by using a spatial attention mechanism, to obtain a maximum-pooled feature map and an average-pooled feature map, which are stitched together. A 7×7 convolution is performed on the stitched feature map, and the result is normalized by using a batch normalization layer and a sigmoid function. The normalized stitched feature map is multiplied by the feature map f to obtain a new feature.
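The channel and spatial weighting described above is close in spirit to a CBAM-style block; the sketch below illustrates the idea in PyTorch, with the channel-reduction ratio, kernel sizes, and feature-map size taken as assumptions rather than the exact configuration of the disclosure.

import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel attention: global avg/max pooling -> shared 1x1 convs -> sigmoid
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1))
        # Spatial attention: 7x7 conv over stacked avg/max maps -> batch norm -> sigmoid
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.BatchNorm2d(1))

    def forward(self, f):                                    # f: (B, C, H, W)
        avg = self.mlp(f.mean(dim=(2, 3), keepdim=True))     # global average pooling branch
        mx = self.mlp(f.amax(dim=(2, 3), keepdim=True))      # global maximum pooling branch
        f = f * torch.sigmoid(avg + mx)                      # channel-weighted feature map
        s = torch.cat([f.mean(dim=1, keepdim=True),          # per-position average over channels
                       f.amax(dim=1, keepdim=True)], dim=1)  # per-position maximum over channels
        return f * torch.sigmoid(self.spatial(s))            # spatially weighted feature map

f = torch.randn(2, 256, 64, 64)                              # assumed feature-map size
new_feature = ChannelSpatialAttention(256)(f)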
In the second feature branch network, the vehicle image h in the training set is input and divided into n two-dimensional vehicle image blocks (n=256). Embeddings of the two-dimensional vehicle image blocks are represented as a one-dimensional feature vector hl∈Rn×(p2·3), where p represents a side length in pixels of each vehicle image block.
An average embedding ha of all the vehicle image blocks is calculated according to a formula ha=(1/n)Σi=1n hi, where hi represents an embedding of an ith vehicle image block, obtained through initialization based on a Gaussian distribution, and i∈{1, . . . , n}. An attention coefficient ai of the ith vehicle image block is calculated according to a formula ai=qTσ(W1h0+W2hi+W3ha), where qT represents a weight, σ represents the sigmoid function, h0 represents a class marker of the vehicle image block, and W1, W2, and W3 are weights. A new embedding hl of the vehicle image blocks is calculated according to a formula hl=Σi=1n aihi.
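As a minimal sketch of this aggregation, assuming that the new embedding hl is the attention-weighted sum of the block embeddings and that the class-marker fusion h0′=W4[h0∥hl] described next is a learned linear layer over the concatenation, the computation can be written as follows; all layer sizes are placeholders.

import torch
import torch.nn as nn

class ClassMarkerUpdate(nn.Module):
    def __init__(self, d=768):
        super().__init__()
        self.w1 = nn.Linear(d, d, bias=False)   # applied to the class marker h0
        self.w2 = nn.Linear(d, d, bias=False)   # applied to each block embedding hi
        self.w3 = nn.Linear(d, d, bias=False)   # applied to the average embedding ha
        self.q = nn.Linear(d, 1, bias=False)    # scoring weight q^T
        self.w4 = nn.Linear(2 * d, d)           # fuses [h0 || hl] into the new class marker

    def forward(self, h0, h_blocks):            # h0: (B, d), h_blocks: (B, n, d)
        ha = h_blocks.mean(dim=1)                                         # ha = (1/n) sum_i hi
        scores = self.q(torch.sigmoid(
            self.w1(h0).unsqueeze(1) + self.w2(h_blocks) + self.w3(ha).unsqueeze(1)))
        a = scores.squeeze(-1)                                            # attention coefficient ai
        hl = (a.unsqueeze(-1) * h_blocks).sum(dim=1)                      # assumed: hl = sum_i ai * hi
        return self.w4(torch.cat([h0, hl], dim=-1))                       # h0' = W4 [h0 || hl]

h0, h_blocks = torch.randn(2, 768), torch.randn(2, 256, 768)
h0_new = ClassMarkerUpdate()(h0, h_blocks)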
A new class marker h0′ of the vehicle image block is calculated according to a formula h0′=W4[h0∥hl], where W4 represents a weight. The new class marker h0′ of the vehicle image block and a sequence hl∈Rn×d with an input size of the vehicle image are input into the transformer encoder, which includes the multi-head attention mechanism and a feedforward layer. A weight Attention(hl,i) of an ith value in the sequence hl∈Rn×d is calculated according to a formula Attention(hl,i)=softmax(QiKiT/√d)Vi,
where Qi represents a vector of an ith queried vehicle image block, T represents transposition, Ki represents a vector of a correlation between ith queried vehicle image block information and other vehicle image block information, and Vi represents a vector of the ith queried vehicle image block information. A feature embedding SA(hl) of a new vehicle image output by the multi-head attention mechanism is calculated according to a formula SA(hl)=Proj(Concati=1m(Attention(hl,i))). An input h′ of the feedforward layer is calculated according to a formula h′=ωLN(hl+SA(hl)). An output y of the encoder is calculated according to a formula y=ωLN(h′+FFN(h′)), where Proj(·) represents a linear mapping, Concat(·) represents a stitching operation, FFN(·) represents the feedforward layer calculated according to a formula FFN(h′)=∂(h′W1+c1)W2+c2, ∂ represents a GELU activation function, W1 and W2 are learnable weights, c1 and c2 are learnable offsets, ω represents a ratio, and LN represents a normalization operation. Finally, the feature output from the first feature branch network and the feature y output from the second feature branch network are stitched into a feature vector of the vehicle image to complete the establishment of the vehicle image re-recognition model.
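A brief sketch of this final fusion step, assuming the convolutional branch feature map is globally pooled to a vector before being stitched with the transformer class-marker feature; the feature sizes are placeholders.

import torch

cnn_feature = torch.randn(2, 256, 64, 64)          # output of the first (convolutional) branch
cnn_vector = cnn_feature.mean(dim=(2, 3))          # assumed: global pooling to a (B, 256) vector
transformer_feature = torch.randn(2, 768)          # class-marker output y of the second branch
image_descriptor = torch.cat([cnn_vector, transformer_feature], dim=-1)  # (B, 1024) feature vector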
Then, the vehicle image re-recognition model is optimized by using a cross-entropy loss function and a triplet loss function. A cross-entropy loss VID is calculated according to a formula VID=−Σi=1n gi log(pi), where gi represents an indicator variable, n represents the number of vehicle image classes in the training data set, and pi represents a predicted probability of a class-i vehicle image. The triplet loss function Vt is calculated according to a formula Vt=[∥νa−νp∥2−∥νa−νn∥2+α]+, where α represents a margin, νa represents a sample of a class marker that is in the vehicle image and learned by the transformer, νp represents a positive sample of the class marker that is in the vehicle image and learned by the transformer, νn represents a negative sample of the class marker that is in the vehicle image and learned by the transformer, [d]+ is max(d,0), and d=∥νa−νp∥2−∥νa−νn∥2+α.
A trained vehicle image re-recognition model is obtained after the optimization by using the loss functions and is stored.
A to-be-retrieved vehicle image is input into the trained vehicle image re-recognition model to obtain a feature of the to-be-retrieved vehicle image.
Finally, the feature of the to-be-retrieved vehicle image is compared with features of the vehicle images in the test set, and the comparison results are sorted by similarity measurement. A retrieval result is shown in
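A short sketch of the retrieval step, assuming cosine similarity as the similarity measurement; the feature dimension and gallery size are placeholders.

import torch
import torch.nn.functional as F

query = torch.randn(1, 1024)                          # feature of the to-be-retrieved vehicle image
gallery = torch.randn(500, 1024)                      # features of the test-set vehicle images
similarity = F.cosine_similarity(query, gallery)      # one similarity score per gallery image
ranked = torch.argsort(similarity, descending=True)   # most similar gallery images first
top10 = ranked[:10]                                   # indices of the top retrieval results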
Finally, it should be noted that the above descriptions are only preferred embodiments of the present disclosure and are not intended to limit the present disclosure. Although the present disclosure is described in detail with reference to the foregoing embodiments, a person skilled in the art can still make modifications to the technical solutions described in the foregoing embodiments or make equivalent replacement of some technical features therein. Any modifications, equivalent substitutions, improvements, and the like made within the spirit and principle of the present disclosure should be included within the protection scope of the present disclosure.
This application is a continuation-in-part of International Patent Application No. PCT/CN2022/070929, filed on Jan. 10, 2022, which claims priority to Chinese Patent Application No. 202110732494.2, filed on Jun. 29, 2021. The disclosures of the aforementioned applications are hereby incorporated by reference in their entireties.