The present invention relates to an information processing device and an information processing method.
There is a conventional technique of dividing input data into an important part (features) and an unimportant part (background). For example, a technique using deep learning ignores the background of image data and detects only the features, thereby enabling an analysis of the features. This technique has the following two advantages.
The above technique is applicable to, for example, the analysis of an object, e.g., a person, an animal, a moving object, or the like that appears in an image or a video captured by a monitoring camera.
In addition, an EDRAM (Enriched Deep Recurrent Visual Attention Model) is known as a technique of analyzing an object appearing in a video or an image as described above. The EDRAM is a technique of moving a frame for capturing an object part in an input image or video, and analyzing the cut-out region of the frame each time the frame is moved.
Here, for an image, the frame can move in two directions, vertical and horizontal; for a video, it can move in three directions, with the time axis added to the vertical and horizontal ones. Further, the frame may move to a position such that it includes an object in the image or video. Here, the cut-out region of the frame is analyzed, for example, by the following classification and crosschecking of the object. Note that the following is an example of classification and crosschecking when the object is a person.
Note that the above classification includes estimating a variety of information and states related to the person, such as motion of the person, in addition to estimating the attributes of the person.
Further, the EDRAM is composed of, for example, the following four neural networks (NNs).
In the initialization NN of the EDRAM, when an image 101 including a person, for example, is acquired, the first frame for the image 101 is determined and cut out. Then, the position of the frame cut out (e.g., the first frame illustrated in
After that, in the movement NN, the frame is moved to an optimum position. For example, in the movement NN, the frame is moved to the position of the second frame illustrated in
After that, in the movement NN, the frame is moved to a still better position. For example, in the movement NN, the frame is moved to the position of the third frame illustrated in
With the EDRAM repeating such processes, the frame is narrowed down gradually so that it finally converges on the whole body of the person in the image 101. Therefore, in the EDRAM, it is important that the frame generated by the initialization NN includes a person in order for the frame to converge on the whole body of the person in the image. In other words, if the frame (first frame) generated by the initialization NN does not include a person, it is difficult to find a person no matter how many times the frame is narrowed down in the movement NN.
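As a rough illustration only, the initialize-then-refine loop described above can be sketched as follows in Python. The Frame type and the init_nn and move_nn functions are hypothetical stand-ins for the initialization NN and movement NN of the EDRAM; the actual networks of NPL 1 are learned models, not the fixed rules used here.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Frame:
    x: float  # horizontal center of the frame
    y: float  # vertical center of the frame
    w: float  # frame width
    h: float  # frame height

def init_nn(image: np.ndarray) -> Frame:
    """Hypothetical initialization NN: proposes the first frame."""
    h, w = image.shape[:2]
    return Frame(x=w / 2, y=h / 2, w=w * 0.8, h=h * 0.8)

def move_nn(image: np.ndarray, frame: Frame) -> Frame:
    """Hypothetical movement NN: shifts/shrinks the frame toward the person."""
    # A learned movement NN would predict an offset; here the frame only shrinks.
    return Frame(frame.x, frame.y, frame.w * 0.7, frame.h * 0.7)

def edram_attend(image: np.ndarray, steps: int = 3) -> Frame:
    frame = init_nn(image)   # the first frame must already contain the person
    for _ in range(steps):   # each movement step narrows the frame down
        frame = move_nn(image, frame)
    return frame             # ideally converged on the whole body

frame = edram_attend(np.zeros((480, 640, 3)))  # toy example on a blank image
```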
Here, an experiment was conducted, and the result showed that, when an image group to be handled in the EDRAM has the multi-scale property, the initialization of a frame including a person often fails. The multi-scale property here is a property in which the size (scale) of an appearing person differs from image to image. For example, as illustrated in
When an image group to be handled in the EDRAM has the multi-scale property, the initialization of a frame including a person may fail, and as a result, the analysis accuracy of persons in the image may be reduced.
This will be described with reference to
Note that the reason why the initialization of a frame including a person fails when an image group to be handled in the EDRAM has the multi-scale property is believed to be as follows.
For example, as illustrated in images 201, 202 and 203 of data set B in
[NPL 1] Artsiom Ablavatski, Shijian Lu, Jianfei Cai, “Enriched Deep Recurrent Visual Attention Model for Multiple Object Recognition”, IEEE WACV 2017, 12 Jun. 2017
In an analysis device that extracts features from input data and analyzes them, such as the above-described EDRAM, when the input data has the multi-scale property, the initialized first frame may not include the features. Therefore, it may not be possible to accurately analyze the input data. Accordingly, an object of the present invention is to solve the above-described problem and to accurately analyze the features of input data even when the input data has the multi-scale property.
In order to solve the above-described problem, the present invention is an information processing device that performs pre-processing on data used in an analysis device that extracts and analyzes features of data. The information processing device includes an input unit that accepts an input of the data; a prediction unit that predicts a ratio of the features to the data; a division method determination unit that determines a division method for the data according to the predicted ratio; and a division execution unit that executes division for the data based on the determined division method.
According to the present invention, even when the input data has the multi-scale property, it is possible to accurately analyze the features of the input data.
Hereinafter, embodiments of the present invention will be described with reference to the drawings. To begin with, an overview of a system including an information processing device of an embodiment will be described with reference to
The system includes an information processing device 10 and an analysis device 20. The information processing device 10 pre-processes data (input data) to be handled by the analysis device 20. The analysis device 20 analyzes the input data pre-processed by the information processing device 10. For example, the analysis device 20 extracts features of the input data on which the pre-processing has been performed by the information processing device 10, and analyzes the extracted features.
For example, when the input data is image data, the features of the input data are, for example, a person part of the image data. In this case, the analysis device 20 extracts a person part from the image data that has been pre-processed by the information processing device 10, and analyzes the extracted person part (e.g., estimates the gender, age, etc. of the person corresponding to the person part). The analysis device 20 performs analysis using, for example, the above-described EDRAM. Note that, when the input data is image data, the features of the input data may be other than a person part, and may be, for example, an animal or a moving object.
Note that the input data may be video data, text data, audio data, or time-series sensor data, other than image data. In the following description, a case where the input data is image data will be described.
In accordance with the above-described EDRAM, the analysis device 20, for example, initializes the frame based on the input data pre-processed by the information processing device 10, stores the previous frames as memory, narrows down and analyzes the frame based on the memory, and updates the parameters of each NN based on errors in the position of the frame and in the analysis. An NN is used for each process, and the processing results of each NN propagate forward and backward, for example, as illustrated in
Note that, instead of or in addition to the above-described EDRAM, the analysis device 20 may extract and analyze the features from the input data by a sliding window method (described later), YOLO (You Only Look Once, described later), or the like.
Here, the information processing device 10 divides the input data based on a prediction result of the ratio (scale) that the features occupy in the input data.
For example, the information processing device 10 predicts the ratio (scale) of the features to the input data, and if the predicted scale is equal to or smaller than a predetermined value (e.g., if the person part serving as the features in the image data is small), the information processing device 10 performs a predetermined division on the input data. Then, the information processing device 10 outputs the divided input data to the analysis device 20. On the other hand, if the predicted scale is larger than the predetermined value (e.g., if the person part serving as the features in the image data is large), the information processing device 10 outputs the input data to the analysis device 20 without performing division.
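A minimal sketch of this branching in Python, assuming a hypothetical predict_scale function and a threshold chosen purely for illustration:

```python
SCALE_THRESHOLD = 10.0  # assumed threshold; the actual value is a design choice

def preprocess(input_data, predict_scale, divide):
    """Divide the input only when the predicted scale is small."""
    scale = predict_scale(input_data)  # ratio of the features to the data
    if scale <= SCALE_THRESHOLD:       # features are small -> divide
        return divide(input_data)      # list of divided pieces
    return [input_data]                # features are large -> pass through as-is
```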
Thus, the variations in the scales of the data input to the analysis device 20 can be reduced as much as possible, so that the analysis device 20 can accurately analyze the features of the input data.
Subsequently, a configuration of the information processing device 10 will be described with reference to
The input unit 11 accepts an input of input data. The scale prediction unit 12 predicts the ratio (scale) of the features to the input data accepted by the input unit 11. For example, if the input data (image data) includes a person, the scale prediction unit 12 predicts at what scale the person is likely to appear. For this scale prediction, machine learning is used, for example; as the machine learning, an NN is used, for example. By learning from pairs of input data and their scales, the NN can predict the scale of unknown input data more accurately.
Here, an example of training data used for learning with the NN will be described with reference to
Here, for the ratio (scale, R) of the features (person part) to the image data, as an example, a data set of three categories is prepared: R∈[15, 30] (category 1: scale "Large"), R∈[10, 15] (category 2: scale "Medium"), and R∈[5, 10] (category 3: scale "Small"). Then, the scale prediction unit 12 updates the parameters of the NN so as to fit this data set, and predicts the scale by determining to which of scale "Large", scale "Medium", and scale "Small" described above the input data (image data) to be predicted belongs.
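As one conceivable realization of such a three-category classifier, the following PyTorch sketch trains an NN on (image, category) pairs. The network architecture, the image size, and the 0-indexed category mapping are illustrative assumptions, not the actual configuration of the scale prediction unit 12.

```python
import torch
import torch.nn as nn

def scale_category(r: float) -> int:
    """Map a scale R to a 0-indexed category per the data set above (assumed boundaries)."""
    if 15 <= r <= 30:
        return 0  # category 1: scale "Large"
    if 10 <= r < 15:
        return 1  # category 2: scale "Medium"
    if 5 <= r < 10:
        return 2  # category 3: scale "Small"
    raise ValueError("R outside the prepared categories")

# Tiny illustrative classifier: 64x64 grayscale image -> 3 scale categories.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 64, 128), nn.ReLU(),
    nn.Linear(128, 3),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(images: torch.Tensor, categories: torch.Tensor) -> float:
    """images: (N, 1, 64, 64) float tensor; categories: (N,) long tensor."""
    optimizer.zero_grad()
    loss = loss_fn(model(images), categories)
    loss.backward()    # the error between prediction and truth updates the parameters
    optimizer.step()
    return loss.item()
```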
For example, consider cases where the input data is image data indicated by reference numeral 301 and image data indicated by reference numeral 302 in
Note that the scale prediction unit 12 may directly predict the value of the scale (R) without categorizing the scale (R) of the input data into large, medium, small, and the like.
Note that, when the input data is image data including a background, it is assumed that the NN implementing the scale prediction unit 12 determines whether the image data was captured by wide-angle photography or telephotography based on, for example, the size of a building or the like in the background of the features, and uses the result for accurate scale prediction.
The division method determination unit 13 in
For example, as illustrated in
On the other hand, as illustrated in
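For instance, the division method determination unit 13 could map the predicted scale category to a number of divisions as in the sketch below, operating on numpy image arrays. The mapping (no division for "Large", halves for "Medium", quarters for "Small") is an assumed example, not the method of the figures.

```python
def determine_division(category: str) -> int:
    """Return how many tiles to cut the image into (1 = no division).
    The mapping is an illustrative assumption."""
    return {"Large": 1, "Medium": 2, "Small": 4}[category]

def divide_image(image, n_tiles: int):
    """Cut an (H, W, C) numpy image into n_tiles equal tiles (1, 2, or 4 assumed)."""
    h, w = image.shape[:2]
    if n_tiles == 1:
        return [image]
    if n_tiles == 2:  # split into left and right halves
        return [image[:, : w // 2], image[:, w // 2 :]]
    return [image[: h // 2, : w // 2], image[: h // 2, w // 2 :],  # 2x2 grid
            image[h // 2 :, : w // 2], image[h // 2 :, w // 2 :]]
```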
Note that the scale prediction unit 12 may be implemented by an NN, as described above. In this case, the scale prediction unit 12 receives an error between the scale it predicted and the actual scale, and adjusts the parameters used for scale prediction based on the error. Repeating this process enables the scale prediction unit 12 to predict the scale of the input data more accurately.
The division execution unit 14 in
The output unit 15 outputs the input data output from the division execution unit 14 and the division method determination unit 13 to the analysis device 20. For example, the output unit 15 outputs the image data 402 (see reference numeral 403 in
Next, a processing procedure of the system will be described with reference to
If it is determined in S3 that the input data accepted in S1 is not to be divided ("not divide" in S4), the division method determination unit 13 outputs the input data to the analysis device 20 via the output unit 15 (S6: output the data). On the other hand, if it is determined in S3 that the input data accepted in S1 is to be divided ("divide" in S4), the division execution unit 14 performs a predetermined division on the input data based on the determination result of the division method determination unit 13 (S5) and outputs the divided input data to the output unit 15, which then outputs it to the analysis device 20 (S6: output the data). After S6, the analysis device 20 analyzes the data output from the information processing device 10 (S7).
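Putting steps S1 to S7 together, an end-to-end sketch of the procedure might look as follows; all component functions are hypothetical stand-ins for the units 11 to 15 and the analysis device 20.

```python
def run_system(raw_data, scale_predictor, method_determiner, divider, analyzer):
    data = raw_data                           # S1: input unit 11 accepts the data
    scale = scale_predictor(data)             # S2: unit 12 predicts the scale
    n_div = method_determiner(scale)          # S3: unit 13 determines the division method
    if n_div > 1:                             # S4: divide or not
        pieces = divider(data, n_div)         # S5: unit 14 executes the division
    else:
        pieces = [data]
    return [analyzer(p) for p in pieces]      # S6 -> S7: output, then analysis
```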
In such an information processing device 10, if the scale of the input data is equal to or smaller than the predetermined value, it is possible to perform division according to the scale and then output the data to the analysis device 20. Thus, even when the input data group has the multi-scale property, it is possible to make the scales of the data input to the analysis device 20 as uniform as possible. As a result, the analysis device 20 can improve the analysis accuracy of the features in the input data.
Note that, when the input data is image data having a sense of depth as illustrated in
Further, the analysis device 20 is not limited to the above-described device using the EDRAM as long as it can extract features from the input data and analyze them. For example, the analysis device 20 may be a device that extracts features from the input data and analyzes them by the sliding window method, YOLO, or the like.
For example, when the analysis device 20 is a device that extracts features (person part) from the input data (e.g., image data) by the sliding window method, the analysis device 20 extracts the person part from the image data and analyzes it as follows.
That is, the analysis device 20 using the sliding window method prepares frames (windows) of several sizes, slides each frame over the image data, and performs a full scan to detect and extract a person part. In this way, the analysis device 20 detects and extracts, for example, the first, second, and third person parts from the image data illustrated in
In the sliding window method, since no processing to adjust the sizes of the frames is performed, a person who appears large in the image cannot be detected unless a large frame is used, and a person who appears small cannot be detected unless a small frame is used. Unsuccessful detection of the person part then results in reduced analysis accuracy of the person part.
Accordingly, the analysis device 20 using the sliding window method accepts pieces of data (image data) with scales as uniform as possible from the information processing device 10 described above, thereby making it easy to prepare a frame of an appropriate size for the image data. As a result, the analysis device 20 easily detects the person part from the image data, and thus it is possible to improve the analysis accuracy of the person part in the image data. Further, since the analysis device 20 does not need to prepare frames of various sizes for the image data, it is possible to reduce the processing load required when detecting a person part from the image data.
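A minimal sliding window scan in Python, assuming a hypothetical detect(patch) callback that scores a patch for the person part; the window sizes, stride, and threshold are illustrative.

```python
import numpy as np

def sliding_window_scan(image, detect, sizes=((64, 64), (128, 128)), stride=32):
    """Full scan with frames of several sizes; returns hits (x, y, w, h, score)."""
    hits = []
    H, W = image.shape[:2]
    for wh, ww in sizes:                       # one full pass per frame size
        for y in range(0, H - wh + 1, stride):
            for x in range(0, W - ww + 1, stride):
                score = detect(image[y:y + wh, x:x + ww])
                if score > 0.5:                # assumed detection threshold
                    hits.append((x, y, ww, wh, score))
    return hits
```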
Further, for example, when the analysis device 20 is a device that uses YOLO to extract the person part, which is the features, from the input data (e.g., image data) and analyze it, the analysis device 20 extracts and analyzes the person part from the image data as follows.
That is, the analysis device 20 using YOLO divides the image data into grids to look for a person part as illustrated in
In YOLO, each grid cell predicts only a limited number of objects, so a person part that is very small relative to a cell, or that shares a cell with other person parts, may fail to be detected. Accordingly, the analysis device 20 using YOLO accepts pieces of data (image data) with scales as uniform as possible from the information processing device 10 described above, thereby making it easy to detect a person part from the image data. As a result, it is possible to improve the analysis accuracy of the person part in the image data.
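Conceptually, the grid division of YOLO assigns each object to the grid cell that contains its center, as in the sketch below. This is only a schematic of the grid idea; it is not YOLO's actual network or output encoding.

```python
def assign_to_grid(boxes, image_w, image_h, S=7):
    """Assign each box (cx, cy, w, h) to the SxS grid cell holding its center."""
    cells = {}
    for cx, cy, w, h in boxes:
        col = min(int(cx / image_w * S), S - 1)
        row = min(int(cy / image_h * S), S - 1)
        cells.setdefault((row, col), []).append((cx, cy, w, h))
    return cells  # each cell is responsible for the objects centered in it
```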
Further, as described above, the input data to be handled in the system may be video data, text data, audio data, or time-series sensor data, other than the image data.
For example, when the input data is text data, the features are, for example, a specific word, phrase, expression, or the like in the text data. Therefore, when the input data is text data, the information processing device 10 uses, as the scale of the input data, for example, the ratio of the number of characters in the above-described features to the number of all characters in the entire text data.
Then, the information processing device 10 divides the text data as necessary so that the ratio (scale) of the number of characters of the above-described features to the number of all characters of the entire text data is as uniform as possible, and outputs the divided data to the analysis device 20.
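For text data, the scale computation reduces to a character-count ratio. A sketch, assuming the feature spans are given as already-extracted substrings:

```python
def text_scale(text: str, feature_spans: list[str]) -> float:
    """Ratio (%) of feature characters to all characters in the text."""
    feature_chars = sum(len(span) for span in feature_spans)
    return 100.0 * feature_chars / len(text) if text else 0.0
```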
In this way, when the analysis device 20 is an analysis device that analyzes a specific word, phrase, expression, or the like in text data, it is possible to improve the analysis accuracy.
Further, for example, when the input data is audio data, the features include, for example, a human voice in audio data containing background noise, a specific word or phrase in the audio data, the voice of a specific person, a specific frequency band, and the like. Therefore, when the input data is audio data, the information processing device 10 uses, as the scale of the input data, for example, the SN ratio (Signal-to-Noise ratio) of the human voice to the audio data, or the length of time of a specific word or phrase relative to the total length of the entire audio data. Further, when a specific frequency band in the audio data is used, the information processing device 10 uses, as the scale of the input data, for example, the width of the specific frequency band with respect to all bars of a histogram indicating the appearance frequency of each frequency band included in the audio data (see
Then, the information processing device 10 divides the audio data as necessary so that the ratio (scale) of the features (the SN ratio of a human voice, the length of time of a specific word or phrase, or the width of a specific frequency band) to the entire audio data is as uniform as possible, and outputs the divided data to the analysis device 20.
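A sketch of the audio scales described above, assuming the voice and noise components (or the measured durations) are already available as inputs:

```python
import numpy as np

def snr_db(voice: np.ndarray, noise: np.ndarray) -> float:
    """Signal-to-noise ratio (dB) of the human voice relative to the noise."""
    p_signal = float(np.mean(voice.astype(np.float64) ** 2))
    p_noise = float(np.mean(noise.astype(np.float64) ** 2)) + 1e-12  # avoid /0
    return 10.0 * np.log10(p_signal / p_noise)

def time_ratio(feature_seconds: float, total_seconds: float) -> float:
    """Ratio (%) of the specific word/phrase duration to the whole audio."""
    return 100.0 * feature_seconds / total_seconds
```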
In this way, when the analysis device 20 analyzes a human voice, a specific word or phrase, a voice of a specific person, a specific frequency band, and the like in audio data, it is possible to improve the analysis accuracy.
Further, when the input data is time-series sensor data, the features include, for example, a sensor value pattern indicating some abnormality. As an example, the sensor value itself may be in a normally possible range (normal range), while a pattern peculiar to an abnormality is repeated (see
Therefore, when the input data is time-series sensor data, the information processing device 10 uses, as the scale of the input data, for example, the frequency of a part of the time-series sensor data in which the sensor value itself is in the normal range but which indicates a pattern peculiar to an abnormality (see
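One conceivable way to quantify "within the normal range but showing an abnormal pattern" is to slide a known anomaly template over the series and count near-matches, as sketched below; the template, tolerance, and normal range are all assumptions for illustration.

```python
import numpy as np

def abnormal_pattern_frequency(series, template, normal_range=(0.0, 1.0), tol=0.1):
    """Fraction of windows that stay in the normal range yet match the template."""
    series = np.asarray(series, dtype=float)
    template = np.asarray(template, dtype=float)
    n, m = len(series), len(template)
    matches = 0
    for i in range(n - m + 1):
        window = series[i:i + m]
        in_range = window.min() >= normal_range[0] and window.max() <= normal_range[1]
        if in_range and np.abs(window - template).mean() < tol:
            matches += 1  # normal-range values, but the anomaly pattern appears
    return matches / max(n - m + 1, 1)
```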
In this way, when the analysis device 20 detects and analyzes an abnormality from time-series sensor data, it is possible to improve the analysis accuracy.
Further, the input data may be a video (a sequence of image frames). In this case, the features include, for example, frames in the video in which a person makes a specific motion. Then, the information processing device 10 divides the frames of the video as necessary so that the ratio (scale) of the features (the frames in which a person makes the specific motion) to the number of all frames of the entire video is as uniform as possible, and outputs the divided frames to the analysis device 20.
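For video, the scale is the ratio of feature frames to all frames, and division operates along the time axis. A sketch, with a hypothetical is_feature_frame predicate:

```python
def video_scale(frames, is_feature_frame) -> float:
    """Ratio (%) of frames showing the specific motion to all frames."""
    hits = sum(1 for f in frames if is_feature_frame(f))
    return 100.0 * hits / len(frames) if frames else 0.0

def divide_video(frames, n_segments: int):
    """Split the frame sequence into at most n_segments along the time axis."""
    if not frames:
        return []
    k = -(-len(frames) // n_segments)  # ceiling division: frames per segment
    return [frames[i:i + k] for i in range(0, len(frames), k)]
```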
In this way, when the analysis device 20 analyzes frames in a video in which a person makes a specific motion, it is possible to improve the analysis accuracy.
Further, the functions of the information processing device 10 described in the above embodiment can be implemented by installing a program that realizes these functions on a desired information processing device (computer). For example, by causing the information processing device to execute the above-described program, provided as packaged software or online software, the information processing device can function as the information processing device 10. The information processing device referred to here includes a desktop or laptop personal computer, a rack-mounted server computer, and the like. The information processing device also includes a mobile communication terminal such as a smartphone, a mobile phone, or a PHS (Personal Handyphone System), as well as a PDA (Personal Digital Assistant) and the like. Further, the information processing device 10 may be implemented on a cloud server.
An example of a computer that executes the above program (information processing program) will be described with reference to
The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. For example, a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.
Here, as illustrated in
Then, the CPU 1020 loads the program module 1093 and the program data 1094 stored in the hard disk drive 1090 into the RAM 1012 as necessary, and executes the processes in the above-described procedures.
Note that the program module 1093 and the program data 1094 according to the above information processing program are not limited to being stored in the hard disk drive 1090. For example, the program module 1093 and the program data 1094 according to the above program may be stored in a removable storage medium and read out by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 according to the above program may be stored in another computer connected via a network such as a LAN or a WAN (Wide Area Network), and read out by the CPU 1020 via the network interface 1070. Further, the computer 1000 may execute the processing using a GPU (Graphics Processing Unit) instead of the CPU 1020.
Priority: Japanese Patent Application No. 2018-050181, filed March 2018 (JP, national).
International filing: PCT/JP2019/010714, filed Mar. 14, 2019 (WO).