The invention relates to the technical field of image data processing, and in particular to an image recognition method and system using a convolutional neural network based on global detail supplement, which is especially suitable for fine-grained image classification.
This part only provides background technical information related to the invention and does not necessarily constitute prior art.
In recent years, fine-grained image classification has been widely applied and has attracted the attention of many researchers. Different from traditional image recognition and classification tasks, the focus of fine-grained image classification is to further classify images into subcategories that fall within a single category.
Traditional image classification methods can be roughly divided into methods based on manual feature annotation and methods based on deep learning. The methods based on manual feature annotation have limited ability to express features and require substantial manpower and material resources, so they are not widely used. Compared with traditional manual feature annotation, a deep neural network has strong feature expression and learning ability. At present, methods based on deep learning have become the mainstream methods of image recognition.
The inventor found that the current fine-grained image classification task is a challenge to deep learning models. In fine-grained image classification, images of different categories have very similar appearance and features, resulting in small differences between fine-grained images of different categories. In addition, there is interference from pose, acquisition perspective, lighting, occlusion, background and other factors within the same category, resulting in large intra-category differences among fine-grained images of the same category. Large intra-category differences and small inter-category differences together increase the difficulty of fine-grained image classification. When extracting features, most existing deep learning methods focus on learning a better target representation while ignoring the learning of different targets and their details, which makes it difficult to distinguish between different fine-grained images and limits the improvement of classification performance.
In order to solve the above problems, the invention provides an image recognition method and system using a convolutional neural network based on global detail supplement, constructs the convolutional neural network based on global detail supplement, and uses progressive training for fine-grained image classification, further improving the precision of fine-grained classification.
To achieve the above purpose, the invention adopts the following technical solution:
One or more embodiments provide an image recognition method using a convolutional neural network based on global detail supplement, which includes the following steps:
One or more embodiments provide an image recognition system using a convolutional neural network based on global detail supplement, which comprises the following:
Compared with the prior art, the invention has the following advantages:
The invention obtains detail features, including texture detail information, through detail feature learning, and supplements these detail features to the high-level features obtained through the feature extraction network, so as to make up for the lack of detail information at the high-level stage and supplement the texture detail information to the global structure features. Classification is then carried out based on the features after global detail supplement, which improves the classification effect on fine-grained images.
The advantages of the invention and additional aspects are described in detail in the following specific embodiments.
The accompanying drawings, which form part of the invention, are intended to provide a further understanding of the invention. The schematic embodiments of the invention and their descriptions are used to explain the invention and do not constitute any limitation of the invention.
The invention is further described in combination with the drawings and embodiments.
It should be noted that the following detailed descriptions are exemplary and are intended to provide a further description of the invention. Unless otherwise specified, all technical and scientific terms used herein have the same meanings as those commonly understood by persons of ordinary skill in the technical field to which the invention belongs.
It should be noted that the terms used here are only intended to describe the specific embodiments and are not intended to limit the exemplary embodiments according to the invention. The singular is also intended to include the plural unless the context otherwise expressly indicates. Furthermore, it should be understood that when the terms “include” and/or “comprise” are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof. It also should be noted that, in the case of no conflict, the embodiments and the features in the embodiments of the invention can be combined with each other. The embodiments are described in detail below in combination with the drawings.
As shown in
Although the traditional feature extraction network can obtain global structure features rich in semantic information, it ignores the texture detail information in the global structure. The embodiment obtains detail features, including texture detail information, through detail feature learning, and supplements these detail features to the high-level features obtained through the feature extraction network, so as to make up for the lack of detail information at the high-level stage and supplement the texture detail information to the global structure features. Classification is then carried out based on the features after global detail supplement, which improves the classification effect on fine-grained images.
Optionally, before feature extraction, the image data is preprocessed. Specifically, the image data is resized to a uniform size, and part of the image data is augmented by horizontal flipping, translation and noise addition.
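As an illustration only, the following is a minimal preprocessing sketch using PyTorch and torchvision; the target image size, flip probability, translation range and noise level are assumptions that the embodiment does not fix.

```python
import torch
from torchvision import transforms

# Minimal preprocessing sketch: resize to a uniform size, then randomly flip,
# translate and add light noise to part of the training images.
train_transform = transforms.Compose([
    transforms.Resize((448, 448)),                             # uniform scale (assumed size)
    transforms.RandomHorizontalFlip(p=0.5),                    # horizontal flip on part of the data
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # small random translation
    transforms.ToTensor(),
    transforms.Lambda(lambda t: (t + 0.01 * torch.randn_like(t)).clamp(0.0, 1.0)),  # additive noise
])
```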
Step 1: feature extraction is carried out on the image to be tested to obtain the feature corresponding to each feature extraction stage. The steps are as follows:
Optionally, the feature extraction network is a convolutional neural network, which can be a deep learning network, a VGG network or a residual network; specifically, it can be resnet18 or resnet50.
Resnet50 is taken as the example in this embodiment. Resnet50 includes five stages; each stage includes 10 layers, 50 layers in total, and each stage can output the extracted feature map.
The feature extraction network includes multiple cascaded stage networks. Each stage network includes multiple layers and outputs the features corresponding to that stage. Each stage network includes a convolution layer, an activation layer and a pooling layer connected in turn. After the image data is input into the network (VGG, resnet18, resnet50, etc.), it first passes through the convolution layer, then an activation function is used to increase nonlinearity, and then the data enters the pooling layer for feature extraction. This process is repeated until the stage feature map is obtained.
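As a sketch only, the per-stage feature maps of a torchvision resnet50 backbone could be exposed as below; the grouping of the layers into stages and the choice of the last three stages follow the description above, but the exact stage indexing is an assumption.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class StageExtractor(nn.Module):
    """Return the feature maps of the last three stages (F3, F4, F5) of resnet50."""
    def __init__(self):
        super().__init__()
        net = resnet50()  # backbone; pretrained weights could also be loaded
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stage2 = net.layer1   # mapping of layers to stages is an assumption
        self.stage3 = net.layer2
        self.stage4 = net.layer3
        self.stage5 = net.layer4

    def forward(self, x):
        x = self.stem(x)
        x = self.stage2(x)
        f3 = self.stage3(x)    # lower-level stage feature map
        f4 = self.stage4(f3)
        f5 = self.stage5(f4)   # last-stage feature map (global structure)
        return f3, f4, f5

# Example: stage feature maps for a batch of preprocessed images.
f3, f4, f5 = StageExtractor()(torch.randn(2, 3, 448, 448))
```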
Step 1.2: the obtained feature map is convolved to obtain the feature vector of the corresponding feature map.
Specifically, the feature map Fl is input into the convolution module, and the feature map is converted into a feature vector Vl, l ∈ {3, 4, 5}, containing salient features;
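The internal structure of the convolution module is not detailed here; below is a minimal sketch, assuming a 1x1 convolution followed by global max pooling, that converts a stage feature map Fl into a feature vector Vl.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Sketch of the convolution module: map a stage feature map Fl to a feature vector Vl.
    The 1x1 convolution plus global max pooling design is an assumption."""
    def __init__(self, in_channels, vec_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, vec_dim, kernel_size=1),
            nn.BatchNorm2d(vec_dim),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveMaxPool2d(1)

    def forward(self, feat_map):
        v = self.pool(self.conv(feat_map))   # (B, vec_dim, 1, 1)
        return v.flatten(1)                  # feature vector Vl of shape (B, vec_dim)
```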
Specifically, the feature map Fe obtained from the last layer of the feature extraction network is used as the Q and K inputs of the self-attention S(·). The detail feature map Id obtained through detail feature learning is used as the V input of the self-attention. The global feature and the detail feature are fused through the self-attention to obtain the feature map Gd after global detail supplement;
The global feature is the feature map obtained from the last layer of the feature extraction network. In this embodiment, the Q input, K input and V input of the self-attention are Fe, Fe and Id, respectively.
The global detail supplement of this embodiment is realized through detail feature learning, the feature map of the last layer of the feature extraction network, and self-attention fusion. The self-attention fuses the feature map that captures the global structure with the detail feature map that contains the texture details of the input image, making up for the lack of detail information at the high-level stage.
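A minimal sketch of this fusion is given below, taking Q and K from the last-stage feature map Fe and V from the detail feature map Id; the 1x1 projections, the scaled dot-product form and the requirement that Fe and Id share the same shape are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalDetailAttention(nn.Module):
    """Sketch: fuse the global feature map Fe (Q and K) with the detail feature
    map Id (V) through scaled dot-product self-attention to obtain Gd."""
    def __init__(self, channels):
        super().__init__()
        self.q_proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.k_proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.v_proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, fe, id_map):
        b, c, h, w = fe.shape
        q = self.q_proj(fe).flatten(2).transpose(1, 2)       # (B, HW, C)
        k = self.k_proj(fe).flatten(2)                        # (B, C, HW)
        v = self.v_proj(id_map).flatten(2).transpose(1, 2)    # (B, HW, C)
        attn = F.softmax(torch.bmm(q, k) / (c ** 0.5), dim=-1)
        gd = torch.bmm(attn, v).transpose(1, 2).reshape(b, c, h, w)
        return gd   # feature map Gd after global detail supplement
```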
Step 4: the global detail features are fused with the features of each feature extraction stage. The features of each feature extraction stage refer to the features output by the stages other than the last stage. Optionally, multi-resolution feature fusion can be used.
Specifically, the multi-resolution feature fusion method may include the following steps:
Optionally, in this embodiment, the resnet50 network can be used, and the feature maps of the last three stages of the feature extraction network are extracted, wherein the feature map of the last stage is the feature map after global detail supplement. After the feature maps are input into the convolution block and expanded into feature vectors Vl, the three groups of feature vectors are cascaded to obtain the fused feature Vconcat.
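For illustration, the cascade step can be sketched as a simple channel-wise concatenation of the three stage feature vectors; the vector dimensions used here are assumptions.

```python
import torch

# Sketch of the cascade (fusion) step: concatenate the three stage feature
# vectors to obtain the fused feature Vconcat. Dimensions are assumptions.
v3 = torch.randn(2, 512)   # feature vector of stage 3
v4 = torch.randn(2, 512)   # feature vector of stage 4
v5 = torch.randn(2, 512)   # feature vector of the global-detail-supplemented stage
v_concat = torch.cat([v3, v4, v5], dim=1)   # fused feature Vconcat, shape (2, 1536)
```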
Step 5: the fused features are input into the classification module to obtain the category prediction results yconcat after fusion.
Optionally, the classification module includes two fully connected layers and one softmax layer. The results obtained through the convolution module are processed by the classification module to obtain the classification prediction results of this stage, wherein the category label corresponding to the maximum value of yconcat is the classification result of the image.
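A minimal sketch of such a classification module is given below; the hidden width and the number of classes are assumptions, and the final category is taken as the argmax of yconcat as described above.

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Sketch of the classification module: two fully connected layers and one softmax layer."""
    def __init__(self, in_dim, num_classes, hidden=1024):   # hidden width is an assumption
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Linear(hidden, num_classes)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, v):
        return self.softmax(self.fc2(self.relu(self.fc1(v))))

# The category label with the maximum value of yconcat is the classification result.
y_concat = Classifier(in_dim=1536, num_classes=200)(torch.randn(2, 1536))
predicted_label = y_concat.argmax(dim=1)
```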
In this embodiment, the network model realizing the above steps is shown in
Progressive training is used for the feature extraction network. A start stage n of the feature extraction network training is set, and training is carried out stage by stage from stage n to the last stage. From stage n+1 onwards, the training parameters obtained in the previous stage are used as the initial parameters, until the last stage is trained, so that the trained feature extraction network is obtained. As shown in
Wherein, the loss between the real tags and the predicted tags is calculated as the cross entropy loss.
Optionally, training is carried out from the set start stage n of the feature extraction network training to the stage before the last stage. The training process of each stage is as follows:
Step S12: the data of the dataset is input into the feature extraction network for feature extraction, and the feature map of the set stage n is obtained;
Step S13: the obtained feature map is convolved to obtain the feature vector of the corresponding feature map.
The method of this step is the same as that of Step 1.2.
The results obtained through convolution are classified, and the classification prediction results at this stage n are obtained;
Step S16.6: the loss after the final network fusion is taken as the final loss, and training continues until the number of training rounds reaches the set value. The feature extraction network corresponding to the minimum loss value is taken as the trained feature extraction network.
Specifically, in this embodiment, the dataset is input into the backbone network (resnet50 is taken as the example) to obtain the feature map of the third stage of the feature extraction network. At this stage, the feature map is expanded into the feature vector V3, which is input into the classification module to obtain prediction tags. The loss between the real tags and the prediction tags is calculated through the cross entropy function, and back propagation and training are carried out continuously until the loss becomes stable. The training parameters of the first three stages are retained as the initial parameters of the next training.
The results Vl obtained through the convolution module are processed by the classification module Cclass to obtain the classification prediction results of this stage according to yl = Cclass(Vl), l ∈ {3, 4, 5}.
The training parameters of the previous stage are taken as the initial parameters. The feature map obtained in stage 4 is expanded into the feature vector V4, which is input into the classification module to obtain prediction tags. The loss between the real tags and the prediction tags is calculated through the cross entropy function, and back propagation and training are carried out continuously until the loss becomes stable. The training parameters of the first four stages are retained as the initial parameters of the next training.
The training parameters of the previous stage are taken as the initial parameters. The feature map obtained in stage 5 is input into the global detail supplement module. After the resulting feature map is expanded into the feature vector V5, it is cascaded with the feature vector V3 obtained in stage 3 and the feature vector V4 obtained in stage 4. The cascaded feature is then input into the classification module to obtain the prediction tag of the cascade operation, the cross entropy loss is calculated, and training continues until the loss is stable.
This embodiment adopts a progressive training network. The improved network can increase the diversity of the acquired information: it captures the low-level subtle discriminant information and also integrates the global structure of the target object learned at the high-level stages, so that the local discriminant information can be integrated with the global structure. The feature maps obtained in the last three stages of the network are each processed by a convolution module and a classification module, and then the cross entropy loss (CELoss) between the prediction tags and the actual tags obtained at this stage is calculated. In progressive training, the third-from-last stage is trained first, and then new training stages are gradually added; in each step, the obtained CELoss is used to update the parameters. Because the receptive field of the lower stages (such as the third-from-last stage of the resnet50 network) is small, the subtle discriminant information in local areas can be obtained. As more stages are added, the global structure of the target can be obtained at the high-level stages. The progressive training method can thus fuse the local discriminant information with the global structure.
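A condensed sketch of this progressive training schedule is given below: stages are trained in order from the set start stage, each new step starts from the parameters learned so far, and each step minimizes the cross entropy between the real tags and the predicted tags. The optimizer, learning rate, epoch count and the hypothetical module names (backbone, conv3, head3, etc.) are assumptions.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()   # loss between real tags and predicted tags
                                    # (applied to pre-softmax class scores)

def train_step(modules, forward_fn, loader, epochs=10, lr=1e-3):
    """Train one progressive step: optimize the given modules for the set number of rounds."""
    params = [p for m in modules for p in m.parameters()]
    optimizer = torch.optim.SGD(params, lr=lr, momentum=0.9)
    for _ in range(epochs):                 # continue until the loss becomes stable
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(forward_fn(images), labels)
            loss.backward()                 # back propagation
            optimizer.step()                # learned parameters are kept for the next step

# Step 1: train up to stage 3 and classify from V3.
# train_step([backbone, conv3, head3], lambda x: head3(conv3(backbone(x)[0])), loader)
# Step 2: keep the previous parameters, add stage 4 and classify from V4.
# train_step([backbone, conv4, head4], lambda x: head4(conv4(backbone(x)[1])), loader)
# Final step: add stage 5 with global detail supplement, cascade V3/V4/V5 and
# classify from Vconcat; the loss after fusion is taken as the final loss.
```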
Based on embodiment 1, this embodiment provides an image recognition system using a convolutional neural network based on global detail supplement, which comprises the following:
It should be noted that each module in this embodiment corresponds to each step in embodiment 1, and the specific implementation process is the same, which will not be repeated here.
The embodiment provides an electronic device, which comprises a memory, a processor, and computer instructions stored in the memory and runnable on the processor. The steps of the method described in embodiment 1 are completed when the computer instructions are run by the processor.
The above are only preferred embodiments of the invention and are not intended to limit the invention. Those skilled in the art can change or vary the invention in different ways. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the invention shall fall within the protection scope of the invention.
Although the above content describes the specific embodiments of the invention in combination with the drawings, it does not limit the scope of protection of the invention. Those skilled in the art should understand that various modifications or changes can be made based on the technical solution of the invention without any creative effort, and such modifications or changes shall fall within the scope of protection of the invention.
Foreign application priority data: Number 202210500255.9; Date: May 2022; Country: CN; Kind: national.