The present invention relates to generation of a feature amount.
In recent years, the accuracy of image recognition techniques such as image classification, object detection, and object tracking has remarkably improved due to the advent of the deep neural network (DNN). There are various DNN structures, and in image recognition, a convolutional neural network (CNN) in which convolution operations are performed in multiple layers is mainly used. On the other hand, Dosovitskiy et al., “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”, arXiv: 2010.11929, 2020 (non-patent literature 1) proposes a Vision Transformer (ViT) that applies the Transformer used in natural language processing to image recognition. The Transformer is a structure that represents the relationship between words using Attention in natural language processing. However, in ViT, the number of parameters and the calculation amount are large.
In Yu et al., “Metaformer is Actually What You Need for Vision”, CVPR, 2021 (non-patent literature 2), a method of changing Multi-head Self Attention (MSA) that is the key of operations in ViT to lighter processing is proposed. More specifically, MSA is changed to processing such as Pooling or Multi-Layer Perceptron (MLP). Also, in Liu et al., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”, arXiv: 2103.14030, 2021 (non-patent literature 3), there is proposed a method of dividing feature amounts into several rectangular windows and performing MSA for each window.
The above-described MSA, Pooling, or MLP is processing of performing mixing (mix) of feature amounts at token level. If all feature amounts are efficiently mixed in this processing, various patterns can readily be recognized, and as a result, the recognition accuracy improves.
In the above-described conventional techniques, however, it is difficult to efficiently mix all feature amounts. For example, in non-patent literature 3, window division is performed while shifting the window by ½ its size for each layer, thereby causing tokens that belong to different groups in a certain layer to belong to the same group in another layer. However, since tokens of ½ the window size overlap tokens mixed in the preceding layer, it is difficult to efficiently mix a larger number of types of tokens. On the other hand, if a larger number of types of tokens are to be mixed, the number of parameters and the calculation amount of Attention increase.
According to one aspect of the present invention, an information processing apparatus comprises one or more memories storing instructions and one or more processors that execute the instructions to: obtain input data; generate a feature amount from the obtained input data; and irregularly mix a plurality of tokens included in the feature amount in a spatial direction of the generated feature amount.
The present invention makes it possible to perform a task with higher accuracy while suppressing an increase in the number of parameters or the calculation amount.
Further features of the present invention will become apparent from the following description of exemplary embodiments (with reference to the attached drawings).
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
Hereinafter, embodiments will be described in detail with reference to the attached drawings. Note, the following embodiments are not intended to limit the scope of the claimed invention. Multiple features are described in the embodiments, but limitation is not made to an invention that requires all such features, and multiple such features may be combined as appropriate. Furthermore, in the attached drawings, the same reference numerals are given to the same or similar configurations, and redundant description thereof is omitted.
As the first embodiment of an information processing apparatus according to the present invention, an information processing apparatus that performs an object detection task using a neural network (NN) will be described below as an example. Note that in the first embodiment, the object detection task means a task for detecting the position and size of an object from an image. In the following explanation, processing of performing mixing (mix) of feature amounts at token level is called “token mix”. MSA, Pooling, and MLP described in “BACKGROUND” are kinds of token mix.
An input device 109, an output device 110, a network 111 such as the Internet, and a camera 112 are connected to the information processing apparatus 100. Note that the connection method is not limited to that shown in
The input device 109 is a device configured to accept user input to the information processing apparatus 100. The input device may be, for example, a pointing device or a keyboard. The output device 110 is a device such as a monitor capable of displaying images and characters, and displays data held by the information processing apparatus 100, data supplied by user input, and program execution results. The camera 112 is an image capturing device capable of obtaining a captured image. The camera 112 may obtain successive captured images at a predetermined interval Δt, which are to be input to, for example, an image obtaining unit 201 to be described later.
A CPU 101 is a central processing unit that controls the entire information processing apparatus 100. The CPU 101 executes various kinds of software (computer programs) stored in, for example, an external storage device 104, thereby causing the information processing apparatus 100 to implement various kinds of functions and operations to be described later. A ROM 102 is a read only memory configured to store programs and parameters which do not need to be changed. A RAM 103 is a random access memory configured to temporarily store programs and data supplied from an external device or the like.
The external storage device 104 is a storage device readable by the information processing apparatus 100, and stores programs and data over the long term. The external storage device 104 may be, for example, a hard disk or a memory card stationarily installed in the information processing apparatus 100. Alternatively, for example, the external storage device 104 may be a flexible disk (FD), an optical disk such as a compact disk (CD), a magnetic or optical card, an IC card, or a memory card, which is detachable from the information processing apparatus 100.
An input device interface 105 is an interface to the input device 109. An output device interface 106 is an interface to the output device 110. A communication interface 107 is an interface to be connected to the network 111 such as the Internet or the camera 112. A system bus 108 is a bus that communicably connects the units in the information processing apparatus 100.
Note that
As described above, programs that implement various kinds of functions and operations are stored in the external storage device 104. When executing a program, the CPU 101 reads out the program to the RAM 103. The CPU 101 executes the program, thereby implementing various kinds of functions and operations. Note that various kinds of programs and setting data sets are assumed to be stored in the external storage device 104, but these may be stored in an external server (not shown). In this case, the information processing apparatus 100 obtains the programs and the setting data sets from the external server via, for example, the network 111.
The information processing apparatus 100 includes the image obtaining unit 201, a parameter obtaining unit 202, the feature amount generation unit 203, the feature amount processing unit 204, and the post-processing unit 205. As described above, these function units are implemented by the CPU 101 performing a program. These function units are communicably connected to a storage unit 206. Note that
The image obtaining unit 201 obtains image data of an object captured by the image capturing device. The object is, for example, an object such as a person or an animal. A case where a person is detected will be described below. The parameter obtaining unit 202 obtains parameters associated with the NN.
The feature amount generation unit 203 generates feature amounts from the image data obtained from the image obtaining unit 201. The feature amounts are generated using a CNN or the like. The feature amount processing unit 204 performs an operation for the feature amounts obtained from the feature amount generation unit 203, thereby mixing the feature amounts.
A group division unit 401 divides feature amounts into several groups. A token mix unit 402 performs processing (token mix) such as MLP or MSA for each feature amount divided by the group division unit 401. A group division cancel unit 403 returns the position of each token of the token-mixed feature amounts to the original position (a position in a spatial direction before group division). A channel mix unit 404 mixes the feature amounts in a channel direction. The detailed operations of the function units shown in
Based on the feature amounts output from the feature amount processing unit 204, the post-processing unit 205 forms a bounding box (BB) representing the position and size of the person that is the object and outputs the BB.
In step S501, the image obtaining unit 201 obtains image data of a captured object. Note that the image obtaining unit 201 may obtain image data generated by the camera 112 connected to the information processing apparatus 100, or may obtain image data stored in the external storage device 104.
In step S502, the parameter obtaining unit 202 obtains parameters necessary for the operation of the NN. More specifically, parameters of an operation such as CNN or MLP and parameters necessary for group division to be described later are obtained.
In step S503, the feature amount generation unit 203 generates feature amounts from the image data obtained by the image obtaining unit 201. The feature amounts can be generated using, for example, a CNN.
In step S504, the feature amount processing unit 204 processes the feature amounts obtained by the feature amount generation unit 203. As described above, the feature amount processing unit 204 is formed by a plurality of layers (three layers 301 to 303 shown in
In step S601, the group division unit 401 irregularly divides feature amounts into groups. As an example of irregular division to groups, details of random group division processing will be described with reference to
In step S602, the token mix unit 402 performs token mix for each divided group. Examples of the token mix method are MLP in which a plurality of Fully-Connected (FC) layers are connected, as shown in
Note that token mix by MSA is implemented by equation (1) below. Q, K, and V are obtained via FC for features divided into groups.
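Equation (1) is not reproduced in this text. As a minimal sketch, the following assumes the standard scaled dot-product attention, softmax(QKᵀ/√d)V, applied to a single group with a single head; MSA would run several such heads in parallel. The sizes, the random weights, and the helper names (`softmax`, `matmul`, `linear`, `attention`) are hypothetical and chosen only for illustration.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def linear(x, w):
    """Bias-free FC layer: x (n x d_in) times w (d_in x d_out)."""
    return matmul(x, w)

def attention(q, k, v):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights

rng = random.Random(0)
n_tokens, dim = 4, 3  # one group of 4 tokens with 3 channels each (assumed sizes)
group = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_tokens)]
# Assumed random FC weights that produce Q, K, and V from the grouped features.
w_q, w_k, w_v = ([[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
                 for _ in range(3))

q, k, v = linear(group, w_q), linear(group, w_k), linear(group, w_v)
mixed, weights = attention(q, k, v)  # mixed: one output token per input token
```

Because attention is computed only inside a group, the score matrix is as large as the group squared rather than the whole token count squared.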
In step S603, the group division cancel unit 403 returns the elements token-mixed in each group by the token mix unit 402 to the original positions concerning the spatial direction.
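Steps S601 to S603 can be sketched as a random permutation, a per-group mix, and the inverse permutation. This is an illustration under stated assumptions, not the embodiment's implementation: tokens are scalars, the group count divides the token count evenly, and `mean_mix` is a hypothetical stand-in for MLP or MSA.

```python
import random

def divide_into_groups(tokens, num_groups, rng):
    """S601: randomly permute token positions and cut them into equal-size groups.
    Assumes len(tokens) is divisible by num_groups."""
    perm = list(range(len(tokens)))
    rng.shuffle(perm)
    size = len(tokens) // num_groups
    groups = [[tokens[i] for i in perm[g * size:(g + 1) * size]]
              for g in range(num_groups)]
    return groups, perm

def mean_mix(group):
    """S602 stand-in: add the group mean to every token (not MLP or MSA)."""
    mean = sum(group) / len(group)
    return [t + mean for t in group]

def cancel_division(groups, perm):
    """S603: put every token back at the spatial position it had before division."""
    flat = [t for g in groups for t in g]
    restored = [None] * len(flat)
    for slot, original_pos in enumerate(perm):
        restored[original_pos] = flat[slot]
    return restored

rng = random.Random(42)
tokens = [float(i) for i in range(16)]  # a flattened 4x4 map of scalar tokens
groups, perm = divide_into_groups(tokens, num_groups=4, rng=rng)
identity = cancel_division(groups, perm)                       # no mixing applied
mixed = cancel_division([mean_mix(g) for g in groups], perm)   # mixed, then restored
```

Keeping the permutation returned by the division step is what makes the cancel step possible; with no mixing in between, the round trip reproduces the input exactly.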
In step S604, the channel mix unit 404 mixes in the channel direction (the depth direction in
In step S505, the post-processing unit 205 specifies a Bounding Box (to be abbreviated as BB hereinafter) representing the position/size of a person included in the input image, based on the feature amounts obtained in step S504. As the method of specifying the BB of the person from the output of the NN, a method described in literature A below can be used.
As described above, according to the first embodiment, feature amounts are divided irregularly (at random) into groups in the spatial direction, and token mix is performed for each divided group. With this configuration, the tokens mixed in the layers (layers 301 to 303) overlap little. It is therefore possible to mix a larger number of types of tokens while suppressing an increase of the number of parameters or a calculation amount. As a result, various patterns can easily be detected, and the detection accuracy can be improved. “Irregular” means not only “random” obtained from a random function but also a changeable rule that changes every time division is performed, a rule without periodicity, or a rule that is not prepared in advance by calculation.
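The claim that tokens mixed in different layers overlap little can be checked on a toy case. The sketch below draws an independent random division for each of two layers and counts the token pairs that share a group in both. The token count, group count, and seeds are arbitrary assumptions, and a seeded permutation is only one of the forms of "irregular" division listed above.

```python
import random
from itertools import combinations

def random_groups(num_tokens, num_groups, seed):
    """Assign token indices to equal-size groups via a seeded random permutation."""
    perm = list(range(num_tokens))
    random.Random(seed).shuffle(perm)
    size = num_tokens // num_groups
    return [perm[g * size:(g + 1) * size] for g in range(num_groups)]

def co_grouped_pairs(groups):
    """Set of unordered token pairs that share a group."""
    return {frozenset(p) for g in groups for p in combinations(g, 2)}

num_tokens, num_groups = 64, 4
layer_a = co_grouped_pairs(random_groups(num_tokens, num_groups, seed=1))
layer_b = co_grouped_pairs(random_groups(num_tokens, num_groups, seed=2))

repeated = len(layer_a & layer_b)  # pairs mixed together in both layers
covered = len(layer_a | layer_b)   # distinct pairs mixed across the two layers
print(f"pairs per layer: {len(layer_a)}, repeated: {repeated}, covered: {covered}")
```

With 4 groups of 16 tokens, each layer mixes 480 pairs. For independent random divisions, a pair grouped together in one layer is grouped together in the next with probability 15/63, so in expectation roughly a quarter of the pairs repeat and the rest are new.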
In the description of the first embodiment, in step S601, the group division unit 401 divides feature amounts into groups at random concerning only the spatial direction. However, group division may be performed at random in both the spatial direction and the channel direction. Since tokens of more patterns can readily be mixed, the recognition accuracy improves.
In the description of the first embodiment, in the processing of each layer in the feature amount processing unit 204, the group division unit 401 divides input feature amounts into groups. However, input feature amounts may first undergo predetermined processing and then be divided into groups. For example, as such predetermined processing, local token mix can be performed.
In step S1201, the local mix unit 1101 divides feature amounts into groups, as shown in
In step S1202, the local mix unit 1101 performs, using CNN, token mix for the feature amounts output in step S1201 (
When local token mix is thus performed in advance, the group division unit 401 can readily mix more tokens. As a result, recognition can be performed in consideration of both the local relationship and the global relationship of feature amounts.
In the description of the first embodiment, in the processing of each layer in the feature amount processing unit 204, token mix in the spatial direction is performed, and after that, token mix in the channel direction is performed. However, token mix in the spatial direction and that in the channel direction may be performed in combination.
In step S1501, the channel division unit 1401 divides feature amounts to a predetermined number in the channel direction. The predetermined number can be decided empirically.
In step S1502, the group division unit 401 divides the predetermined number of feature amounts divided in the channel direction into groups. The group division method may change for each feature amount divided in the channel direction. For example, irregular group division (
In step S602, the token mix unit 402 performs token mix for each group divided in the channel direction.
In step S603, the group division cancel unit 403 returns the token-mixed elements to the original positions concerning the spatial direction.
In step S1503, the channel connection unit 1402 connects the feature amounts output in step S603 in the channel direction and cancels division in the channel direction.
In step S604, the channel mix unit 404 performs mix in the channel direction for the feature amounts obtained in step S1503.
In this way, after the feature amounts are divided in the channel direction, a different group division method is combined, thereby increasing division patterns in group division. Since tokens of more patterns can readily be mixed, the recognition accuracy improves.
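The sequence S1501, S1502, S602, S603, S1503 can be sketched as follows. This is a simplified illustration, not the embodiment's implementation: each channel chunk gets its own random division from a shared generator, the token mix is a hypothetical group-mean replacement rather than MLP or MSA, and channel mix (S604) is omitted.

```python
import random

def split_channels(features, num_chunks):
    """S1501: cut each token's channel vector into equal chunks."""
    width = len(features[0]) // num_chunks
    return [[tok[c * width:(c + 1) * width] for tok in features]
            for c in range(num_chunks)]

def mix_in_random_groups(chunk, num_groups, rng):
    """S1502, S602, S603: random grouping, stand-in token mix, positions restored."""
    perm = list(range(len(chunk)))
    rng.shuffle(perm)  # the shared rng advances, so each chunk is divided differently
    size = len(chunk) // num_groups
    out = [None] * len(chunk)
    for g in range(num_groups):
        members = perm[g * size:(g + 1) * size]
        # Stand-in token mix: replace each token by the per-channel group mean.
        mean = [sum(chunk[i][ch] for i in members) / size
                for ch in range(len(chunk[0]))]
        for i in members:
            out[i] = list(mean)  # written back at the original spatial index
    return out

def concat_channels(chunks):
    """S1503: rejoin the channel chunks token by token."""
    return [sum((chunk[i] for chunk in chunks), []) for i in range(len(chunks[0]))]

rng = random.Random(7)
# 8 tokens with 4 channels each (assumed toy values).
features = [[float(t), float(t) * 10.0, 1.0, -1.0] for t in range(8)]
chunks = split_channels(features, num_chunks=2)
mixed = concat_channels([mix_in_random_groups(c, num_groups=2, rng=rng)
                         for c in chunks])
```

Because each chunk is grouped independently, a given token is mixed with one set of partners in its first channels and another set in the rest, which is the increase in division patterns described above.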
In the second embodiment, an information processing apparatus that performs a tracking task for detecting a specific target object from an image and tracking this will be described. The hardware configuration of the information processing apparatus is the same as in the first embodiment (
A tracking target designation unit 1601 decides a tracking target in an image in accordance with an instruction designated by an input device 109. For example, a target image is displayed on a touch panel display, and user touch on a tracking target in the displayed image is accepted, thereby deciding the tracking target. Note that instead of using the designation from the user, a main object or the like in the image may automatically be detected and decided. The decision may be made based on both the designation by the user and an object detection result in the image. As a method of automatically detecting a main object/object from an image, for example, a method described in Japanese Patent No. 6556033 or literature C below can be used.
In step S1701, an image obtaining unit 201 obtains image data (template image) in which a tracking target exists.
In step S1702, the image obtaining unit 201 cuts out an image on the periphery of the tracking target in the template image based on the position/size of the tracking target obtained by the tracking target designation unit 1601, and resizes the image. For example, a region whose size is a constant multiple of the size of the tracking target is cut out with respect to the position of the tracking target as the center. A partial region 1802 shown in
In step S1703, the image obtaining unit 201 obtains image data (search image) that is the target to search for the tracking target. An image 1805 shown in
In step S1704, the image obtaining unit 201 cuts out an image on the periphery of the tracking target in the search image, and resizes the image. For example, an image whose size is a constant multiple of the size of the tracking target is cut out, centered on the position of the tracking target one frame before (or a predetermined time before). A partial region 1806 shown in
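The cut-out in steps S1702 and S1704 amounts to computing a box that is a constant multiple of the target size and centered on the target. A minimal sketch follows; the function name `crop_box`, the scale value, and the clipping to the image bounds are assumptions, since the text does not state how regions that extend past the image edge are handled. Resizing is left out.

```python
def crop_box(center_x, center_y, target_w, target_h, scale, image_w, image_h):
    """Return (left, top, right, bottom) of a region `scale` times the target size,
    centered on the target and clipped to the image bounds (clipping is assumed)."""
    half_w = target_w * scale / 2.0
    half_h = target_h * scale / 2.0
    left = max(0, int(round(center_x - half_w)))
    top = max(0, int(round(center_y - half_h)))
    right = min(image_w, int(round(center_x + half_w)))
    bottom = min(image_h, int(round(center_y + half_h)))
    return left, top, right, bottom

# A 50x100 target at the center of a 640x480 image, cut out at twice its size.
box = crop_box(320, 240, 50, 100, scale=2.0, image_w=640, image_h=480)
# The same target near the top-left corner: the box is clipped at the border.
edge = crop_box(10, 10, 50, 100, scale=2.0, image_w=640, image_h=480)
```

For the search image, the center passed in would be the target position estimated for the earlier frame.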
In step S503, feature amounts are generated for the template image and the search range image, as in the first embodiment. In step S504, the feature amounts are processed for each of the template image and the search range image, as in the first embodiment. That is, token mix based on irregular group division is performed.
In step S1705, a matching unit 1602 estimates the position/size of the tracking target in the search image based on the feature amounts of the cutout image of the template image and the feature amounts of the cutout image of the search image.
In step S505, a BB representing the position/size of the person included in the input image is specified based on the feature amounts obtained by the matching unit 1602, as in the first embodiment.
As described above, according to the second embodiment, for each of a template image and a search image, feature amounts are divided irregularly (at random) into groups in the spatial direction, and token mix is performed for each divided group. With this configuration, various patterns can easily be recognized. It is therefore possible to easily cope with a change of the posture or background of the tracking target, and the tracking accuracy improves.
In the third embodiment, an information processing apparatus that performs a class classification task for classifying an image into a preset class will be described. The hardware configuration of the information processing apparatus is the same as in the first embodiment (
In step S2101, the identification unit 2001 outputs feature amounts for class classification from feature amounts obtained from the feature amount processing unit 204.
In step S2102, the post-processing unit 2002 outputs an index of a dimension having the highest value out of the feature amounts output in step S2101 (S2202) as the index of the class classification result. For example, if the third (index=“3”) feature amount has the highest value, and the third class is “dog”, the target object is classified to “dog”.
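The selection in step S2102 is an argmax over the classification feature amounts. A short sketch, assuming a hypothetical class list and hypothetical scores, and using 1-based indices to match the example in the text:

```python
def classify(scores, class_names):
    """Return (index, name) of the highest score; the index is 1-based."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best + 1, class_names[best]

class_names = ["cat", "bird", "dog", "horse"]  # hypothetical class list
scores = [0.1, 0.3, 2.4, -0.5]                 # hypothetical output of step S2101
index, name = classify(scores, class_names)    # third score is highest
```

If several scores tie for the maximum, this sketch returns the first of them; the text does not specify tie handling.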
As described above, according to the third embodiment, feature amounts are divided irregularly (at random) into groups in the spatial direction, and token mix is performed for each divided group. With this configuration, various patterns can easily be detected, and therefore, the accuracy of class classification improves.
In the above-described first to third embodiments, processing (object detection, tracking, and class classification) associated with image recognition has been described. However, the information processing apparatus can be applied not only to image recognition but also to prediction/recognition using time-series data and natural language processing using text data.
Embodiment(s) of the present invention can also be realized by a computer of a system or apparatus that reads out and executes computer executable instructions (e.g., one or more programs) recorded on a storage medium (which may also be referred to more fully as a ‘non-transitory computer-readable storage medium’) to perform the functions of one or more of the above-described embodiment(s) and/or that includes one or more circuits (e.g., application specific integrated circuit (ASIC)) for performing the functions of one or more of the above-described embodiment(s), and by a method performed by the computer of the system or apparatus by, for example, reading out and executing the computer executable instructions from the storage medium to perform the functions of one or more of the above-described embodiment(s) and/or controlling the one or more circuits to perform the functions of one or more of the above-described embodiment(s). The computer may comprise one or more processors (e.g., central processing unit (CPU), micro processing unit (MPU)) and may include a network of separate computers or separate processors to read out and execute the computer executable instructions. The computer executable instructions may be provided to the computer, for example, from a network or the storage medium. The storage medium may include, for example, one or more of a hard disk, a random-access memory (RAM), a read only memory (ROM), a storage of distributed computing systems, an optical disk (such as a compact disc (CD), digital versatile disc (DVD), or Blu-ray Disc (BD)™), a flash memory device, a memory card, and the like.
While the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.
This application claims the benefit of Japanese Patent Application No. 2023-010487, filed Jan. 26, 2023, which is hereby incorporated by reference herein in its entirety.
Number | Date | Country | Kind |
---|---|---|---|
2023-010487 | Jan 2023 | JP | national |