The present invention relates to learning processing based on training data.
In recent years, research on deep learning, including convolutional neural networks (CNNs), has been advancing, and with improvements in implementation techniques, processing speeds, and the like, deep learning is being applied to many real-world problems. At present, many applications for image recognition, voice recognition, text recognition, and the like are being developed using machine learning models that can be used in the cloud, in embedded devices, and in PCs. In general, for pattern recognition tasks (in image recognition, tasks such as image classification, object detection, semantic region segmentation, and the like), training data and validation data are prepared for each task. If there is unevenness in the variation of training images and ground truth labels in the training data, sufficiently good performance cannot be achieved when accuracy is evaluated using the validation data.
Japanese Patent Laid-Open No. 2010-204966 (Patent Document 1) discloses a method for reducing unevenness in the number of pieces of data per class included in training data such that even when there are three or more types (classes) of ground truth labels included in the training data, discrimination results for classes from learning are not biased toward a particular class. Japanese Patent No. 6567488 (Patent Document 2) discloses a method for obtaining high identification accuracy even for attributes having a small number of pieces of data in training data when training an identification model that estimates attributes, which are ground truth labels, from features of target data. Specifically, the identification model is trained in a state where a number of pieces of data for attributes that do not have a maximum number of pieces of data are replicated and increased until reaching the number of pieces of data for the attribute having the maximum number of pieces of data.
However, although the conventional techniques reduce data unevenness by focusing only on the number of ground truth labels, these techniques may not necessarily be capable of reducing data unevenness when applied to tasks such as semantic region segmentation. Here, “semantic region segmentation” is a task of cropping regions of various classes (people, automobiles, roads, buildings, plants, the sky, and so on) from an image. For example, when using the method of Patent Document 1, unevenness in the number of pieces of data between classes will be reduced, but unevenness in shape variation will not necessarily be reduced. Additionally, when using the method of Patent Document 2, unevenness in the number of pieces of data between attributes will be reduced, but unevenness in shape variation will not necessarily be reduced. When these methods are applied to a task such as semantic region segmentation, appropriate learning cannot be performed. As a result, image processing based on the learning results also cannot be performed appropriately.
According to one aspect of the present invention, there is provided a learning apparatus that trains an estimator that executes a recognition task, the learning apparatus comprising: an obtaining unit configured to obtain a plurality of training data items including input data and supervisory data corresponding to the input data; a calculating unit configured to calculate statistic information relating to a predetermined perspective in the plurality of training data items obtained by the obtaining unit; a determining unit configured to determine a degree of importance of each training data item included in the plurality of training data items based on the statistic information; and a control unit configured to control training of the estimator based on the degree of importance determined by the determining unit, wherein the determining unit determines the degree of importance of each training data item such that unevenness of the plurality of training data items with respect to the predetermined perspective is reduced.
According to the present invention, learning can be performed appropriately even when unevenness is present in data contained in a training data set.
Further features of the present invention will become apparent from the following description of exemplary embodiments (with reference to the attached drawings).
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
Hereinafter, embodiments will be described in detail with reference to the attached drawings. Note, the following embodiments are not intended to limit the scope of the claimed invention. Multiple features are described in the embodiments, but limitation is not made to an invention that requires all such features, and multiple such features may be combined as appropriate. Furthermore, in the attached drawings, the same reference numerals are given to the same or similar configurations, and redundant description thereof is omitted.
Overview
In data used for training in tasks such as semantic region segmentation, it is necessary to reduce unevenness in the number of pieces of data for ground truth labels as well as unevenness regarding variations in the shapes included in the training data. Specifically, it is necessary to homogenize the data such that there is no unevenness in the data, from complex to simple shapes.
In a region detection task, which detects regions of various classes (people, automobiles, roads, buildings, plants, sky, and the like) from an image, these regions are cropped along the boundaries between one target object and a different target object. Ground truth labels, which indicate the region of the target object, are assigned in units of pixels for the target object in the image, and can take on various shapes depending on the target object to be detected. For this reason, variation in the shapes of target objects in the training data is important in semantic region segmentation tasks.
Accordingly, a first embodiment will describe a method that reduces unevenness in the number of pieces of data for ground truth labels as well as unevenness with respect to variation in shapes included in training data. Specifically, a learning apparatus that learns by homogenizing data such that there is no unevenness in the data, from complex to simple shapes, will be described. An image processing apparatus that uses the learning results will also be described.
Apparatus Configuration
First, the functional configuration and the hardware configuration of an image processing apparatus 1000 and a learning apparatus 5000 will be described with reference to
Functional Configuration as Image Processing Apparatus
An image obtainment unit 1100 obtains input images. An estimation unit 1200 uses an estimator 600, which will be described later with reference to
The image obtainment unit 1100, the estimation unit 1200, and the output unit 1300 may be implemented in the same computer, or may be configured as independent modules. These may also be implemented as programs that run on the computer. These may also be implemented as circuitry or programs inside an image capturing apparatus such as a camera or the like.
Functional Configuration as Learning Apparatus
A training data storage unit 5100 stores the training data prepared in advance. The training data includes training images and supervisory information. A data obtainment unit 2100 obtains the training data from the training data storage unit 5100. A class mixing ratio calculation unit 2200 calculates class mixing ratio supervisory information for each identification image from class label supervisory information input for each pixel. This will be described in detail later with reference to
A statistic information calculation unit 2300 calculates statistic information pertaining to a predetermined perspective to ascertain unevenness of the training data, based on the training images, the supervisory information, and the supervisory information of the class mixing ratio for each identification image. Here, the statistic information calculation unit 2300 calculates the class mixing ratio for a range including the periphery of the identification image. This will be described in detail later with reference to
An importance changing unit 2400 changes a degree of importance during learning for each piece of the training data based on the statistic information calculated by the statistic information calculation unit 2300. This will be described in detail later with reference to
A learning unit 2500 uses the training images, the supervisory information, and the degree of importance thereof to perform learning for the estimator 600, which will be described later with reference to
The data obtainment unit 2100, the class mixing ratio calculation unit 2200, the statistic information calculation unit 2300, the importance changing unit 2400, and the learning unit 2500 may be implemented in the same computer, or may be configured as independent modules. These may also be implemented as programs that run on the computer. The training data storage unit 5100 and the estimator storage unit 5200 of the learning apparatus can be realized using storage inside or outside the computer.
The image processing apparatus 1000 and the learning apparatus 5000 described above may be realized on the same computer, or on separate computers. The estimator storage units 5200 provided in the learning apparatus 5000 and the image processing apparatus 1000 may be the same storage, or may be different storage. When different storage is used, the estimator stored in the estimator storage unit 5200 by the learning apparatus is assumed to be copied or moved to the estimator storage unit 5200 provided in the image processing apparatus.
Hardware Configuration
A processor 101 is a CPU, for example, and controls the operations of the computer as a whole. Memory 102 is RAM, for example, and temporarily stores programs, data, and the like. A computer-readable storage medium 103 is, for example, a hard disk, a CD-ROM, or the like, and stores programs, data, and the like on a long-term basis. In the present embodiment, programs that realize the functions of each unit, which are stored by the storage medium 103, are read out to the memory 102. Then, the processor 101 operates according to the programs in the memory 102 to realize the functions of each unit.
An input interface 104 is an interface for obtaining information from an external apparatus. An output interface 105 is an interface for outputting information to an external apparatus. A bus 106 connects the various units described above to enable the exchange of data.
Apparatus Operations
The present embodiment will describe a semantic region segmentation task using a CNN as an example. As described in the Description of the Related Art, “semantic region segmentation” is a task of cropping regions of various classes from an image, such as people, automobiles, plants, sky, and the like. The present embodiment assumes the use of a method for recognizing a mixing state of classes in a partial region image (an identification image) in a captured image, as described in Japanese Patent Laid-Open No. 2019-16114.
Operations as Image Processing Apparatus
In step S1100, the image obtainment unit 1100 obtains the input image for which the mixing state is to be estimated. The image obtainment unit 1100 can also obtain image data before development, obtained from an image capturing apparatus.
The image obtainment unit 1100 sets a plurality of partial regions for which the mixing state is to be estimated at predetermined positions in the input image according to a predetermined setting pattern. An image of this partial region will be called an “identification image” hereinafter. Although details will be given later, the identification image is a partial image of a predetermined size according to a unit of identification, such as a 4×4 pixel rectangular partial image, for example, and the setting method thereof is not particularly limited.
In step S1200, the estimation unit 1200 loads, from the estimator storage unit 5200, an estimator that has been pre-trained by the learning apparatus described later. 600 in
Various forms can be used as the configuration of the CNN. Typically, a CNN is a neural network that performs recognition tasks by gradually collecting local features of an input signal by iterating through convolutional layers and pooling layers to obtain information that is robust with respect to deformation, misalignment, and the like. For example, the CNNs described in Document A can be used.
Document A: A. Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, Proc. Advances in Neural Information Processing Systems 25 (NIPS 2012).
The feature extraction unit 610 is constituted by a plurality of convolutional layers (a layer 611, a layer 613, and a layer 615) and a plurality of pooling layers (a layer 612 and a layer 614), and extracts features from an input image 630.
In the convolutional layer, a plurality of channels of filters of sizes such as 3×3 or 5×5 are set for the input image or a feature map, and convolution operations are performed centered on a pixel of interest to output a plurality of feature maps corresponding to the plurality of channels. Here, a convolutional layer 1 (the layer 611) has four channels with a filter size of 3×3, a convolutional layer 2 (the layer 613) has 12 channels with a filter size of 3×3, and a convolutional layer 3 (the layer 615) has 24 channels with a filter size of 3×3.
The pooling layers reduce the feature map output from the convolutional layers. If pooling is performed in a 2×2 range, the feature map is reduced by a factor of ½×½. Methods such as maximum value pooling and average value pooling can be used. Here, a pooling layer 1 (the layer 612) and a pooling layer 2 (the layer 614) both perform pooling in a 2×2 range. In the example in
The output layer 620 generates an output map 640, which is information indicating the mixing state of classes in the identification image based on the features obtained from the feature extraction unit 610.
The network structure of the CNN is not limited to that illustrated in
In step S1300, the output unit 1300 outputs the estimation result obtained in step S1200. The processing performed by the output unit 1300 depends on the method for using the estimation result, and is not particularly limited. Information indicating the mixing state can be used for various processing, such as image correction, focus control, exposure control, and the like, as described in Document B.
Document B: O. Ronneberger et al., “U-Net: Convolutional Networks for Biomedical Image Segmentation”, International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015
In step S2100, the data obtainment unit 2100 obtains the training images and the supervisory information pertaining to the region as the training data from the training data storage unit 5100. The training data storage unit 5100 stores, in advance, a plurality of training images as well as information associated with the training images, such as supervisory information pertaining to the region and the camera parameters at the time of shooting.
“Training images” are images used to train the estimator. The training image can be, for example, image data captured by a digital camera or the like. The format of the image data is not particularly limited and can be, for example, JPEG, PNG, BMP, or the like. In the following, the number of prepared training images is assumed to be N, and an nth training image is denoted as In (where n=1, . . . , N).
The supervisory information pertaining to the region is the input class information for each pixel of the training image. This supervisory information is prepared in advance and can be created, for example, by a human looking at the training images.
A training image 500 contains trees (plants) in a central part of the image, grass (plants) in a bottom part of the screen, and sky in a top part of the screen, each of which can be classified into a different class. Supervisory information 501 indicates the class labels for the “plant” region corresponding to the training image 500, and a class label is created for each pixel of the training image. White regions in the image represent regions where plants are present, and black regions represent regions where plants are absent (non-plant regions). Meanwhile, supervisory information 502 indicates the class labels for the “sky” region corresponding to the training image 500, and a class label is created for each pixel of the training image. White regions in the image represent regions where sky is present, and black regions represent regions where sky is absent (non-sky regions).
A region of a person's entire body, a region of a person's skin, a region of a person's hair, a region of an animal such as a dog, a cat, or the like, an artificial object such as an automobile or a building, and the like can be given as examples of class labels aside from “plant” and “sky”. Class labels can also be used to indicate specific objects, such as component A or component B used in a factory. In addition, each pixel may be classified into a main subject region and a background region. The classification may also be performed based on differences in surface properties, such as glossy surfaces or matte surfaces, or on differences in materials, such as metal surfaces or plastic surfaces.
A training image 500a contains grass (plants) in the bottom part of the screen, trees (plants) in the upper-left and right-center of the screen, and sky in a top part of the screen. Supervisory information 501a is supervisory information for the “plant” region, and supervisory information 502a is the supervisory information for the “sky” region.
A training image 500b is an image that contains only flowers (plants) throughout. Supervisory information 501b is supervisory information for the “plant” region, and supervisory information 502b is the supervisory information for the “sky” region. Note that because the training image 500b does not contain any “sky” regions, the entirety thereof is an image with black regions.
The present embodiment will describe an example of a task that identifies two classes, namely a “plant region” class, and a “non-plant region” class that represents regions aside from plant regions. Note that information accompanying the training image, such as camera parameters at the time of shooting, is not used in the first embodiment, but will be described in detail later in a third embodiment.
In step S2200, the class mixing ratio calculation unit 2200 calculates supervisory information of the mixing state of classes in predetermined units of identification images from the supervisory information of class labels for each pixel, obtained from the data obtainment unit 2100 in step S2100. In the present embodiment, the identification image is set as a 4×4-pixel rectangular partial image.
The method for setting the identification image is not particularly limited. For example, a plurality of identification images can be set in the input image according to a predetermined setting pattern. As a specific example, the training image can be divided into a plurality of rectangular partial images of a predetermined size (e.g., 4×4 pixels) and each rectangular partial image can be treated as an identification image. The method described in Document C can also be used to divide the image into irregularly-shaped small regions (superpixels) and treat each small region as an identification image.
Document C: R. Achanta et al., “SLIC Superpixels”, EPFL Technical Report 149300, 2010.
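The class mixing ratio calculation of step S2200 can be sketched as follows. This is an illustrative sketch, not code from the specification: the function name, the list-of-lists label format, and the binary plant/non-plant labeling (1 = plant, 0 = non-plant) are assumptions for the two-class example of the present embodiment.

```python
# Illustrative sketch of step S2200: computing class mixing ratio
# supervisory information for 4x4-pixel identification images from
# per-pixel class labels (1 = plant, 0 = non-plant).

def class_mixing_ratios(label_map, block=4):
    """Divide a per-pixel label map into block x block identification
    images and return the fraction of positive-class pixels in each."""
    h = len(label_map)
    w = len(label_map[0])
    ratios = []
    for by in range(0, h, block):
        row = []
        for bx in range(0, w, block):
            pixels = [label_map[y][x]
                      for y in range(by, by + block)
                      for x in range(bx, bx + block)]
            row.append(sum(pixels) / len(pixels))
        ratios.append(row)
    return ratios

# An 8x8 label map: left half plant (1), right half non-plant (0).
labels = [[1, 1, 1, 1, 0, 0, 0, 0] for _ in range(8)]
print(class_mixing_ratios(labels))
# Each 4x4 block is either entirely plant (ratio 1.0) or entirely
# non-plant (ratio 0.0); boundary blocks would yield intermediate values.
```

An identification image straddling a plant/non-plant boundary would receive an intermediate ratio between 0 and 1, which is the mixing-state supervisory information used in the subsequent steps.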
In step S2300, the statistic information calculation unit 2300 calculates the statistic information based on the class mixing ratio information obtained from the class mixing ratio calculation unit 2200. The specific flow of the processing in step S2300 is illustrated in
In step S2310, the statistic information calculation unit 2300 calculates the statistic information of the identification images. Specifically, the statistic information is calculated based on the values of the class mixing ratios for all identification images (partial regions) in all N training images. In the present embodiment, the training image is divided into identification images of 22 horizontal blocks × 16 vertical blocks, and thus 352 identification images are created for each image. This applies to all N images, and thus the total amount of information is N×352 blocks.
As can be seen by referring to the supervisory information 501, 501a, and 501b, in the identification images delimited by the rectangles, plant regions at 100% (class mixing ratio = 1) and plant regions at 0% (class mixing ratio = 0) make up the majority. Because the training data includes many images in which plant regions do not exist (negative data) in order to prevent false positives in regions where plants do not exist, the number of pieces of data in which the class mixing ratio = 0 is particularly high.
The class mixing ratio takes an intermediate value (a value greater than 0 and less than 1) when the identification image is a boundary part between a plant region and a non-plant region, or when plant regions and non-plant regions (e.g., sky) are mixed together, such as the leaves of plants. The frequency at which such identification images occur is relatively low.
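The statistic information of step S2310 can be sketched as a histogram accumulated over the class mixing ratios of all identification images. The bin count and bin layout below are illustrative assumptions; the embodiment does not prescribe a specific number of bins.

```python
# Illustrative sketch of step S2310: accumulating a histogram of class
# mixing ratios over all identification images of all training images.

def mixing_ratio_histogram(all_ratios, bins=5):
    """all_ratios: flat iterable of class mixing ratios in [0, 1].
    Returns per-bin frequencies Di used later for the correction
    coefficients."""
    counts = [0] * bins
    for r in all_ratios:
        i = min(int(r * bins), bins - 1)  # ratio 1.0 falls in the last bin
        counts[i] += 1
    return counts

# Mostly ratio-0 (negative) data, some ratio-1 data, few intermediate
# values -- the uneven distribution described in the embodiment.
ratios = [0.0, 0.0, 0.0, 1.0, 1.0, 0.5, 0.1]
print(mixing_ratio_histogram(ratios))
```

The resulting frequency per bin corresponds to the value Di referenced in Formula (1) below.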
As illustrated in
In step S2320, the statistic information calculation unit 2300 calculates the class mixing ratio in a range including the information in the periphery of the identification image. As in step S2310, the values of the class mixing ratios are calculated for all identification images in all N training images. The reason for calculating the class mixing ratio within a range that includes the information in the periphery will be described below.
The class mixing ratio of the identification images corresponds to the output result of the CNN (a response variable). The information input to the CNN to produce the output result (an explanatory variable) is information from a broader range. Accordingly, by understanding not only the statistic information of the supervisory information corresponding to the output result, but also the statistic information of the supervisory information in the range corresponding to the information input to the CNN, unevenness in the variation of the training images can be evaluated more correctly.
For example, when the class mixing ratio of a given identification image is high and the image is filled with plant regions, there is a strong trend that the class mixing ratio will also be high in the periphery of the identification image. Such cases occur frequently. On the other hand, in the case of isolated plants, the class mixing ratio of the identification image will be high, but the class mixing ratios in the periphery thereof will be low. Such cases occur infrequently. Unevenness in the variation of such data cannot be grasped simply by grasping the class mixing ratio of the identification image. Accordingly, by grasping the class mixing ratio of a range including the periphery of the identification image in addition to the class mixing ratio of the identification image, unevenness in the variation of the data can be appropriately grasped and reflected in the learning.
The range of information input to the CNN will be described next. As described earlier, in the present embodiment, the estimation processing is performed using the CNN illustrated in
If a region 710 in
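The range of input information feeding into one output value can be estimated with standard receptive-field arithmetic. The sketch below assumes stride-1 convolutions and stride-2 poolings for the layer configuration described earlier (three 3×3 convolutions with two 2×2 poolings between them); the strides are assumptions, as the specification does not state them explicitly.

```python
# Illustrative sketch: estimating how wide a peripheral range of the
# input image contributes to one output of the CNN in the embodiment.
# Assumed strides: 1 for convolutions, 2 for poolings.

def receptive_field(layers):
    """layers: list of (kernel_size, stride) from input to output."""
    rf, jump = 1, 1  # receptive field and cumulative stride ("jump")
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# conv1 (3x3), pool1 (2x2), conv2 (3x3), pool2 (2x2), conv3 (3x3)
cnn = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
print(receptive_field(cnn))  # → 18
```

Under these assumptions, one output value depends on an 18×18-pixel range of the input, which is considerably wider than the 4×4-pixel identification image, motivating the peripheral statistic of step S2320.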
When obtaining the class mixing ratio of a range including information from the periphery as well, there is a method which obtains a uniform average value over all the peripheral pixels, and a method which obtains a weighted average value in which weights increase with proximity to the center. When an input image is input to the CNN, information closer to the center of the range is incorporated more often through the successive operations of the CNN, and thus finding an average in which weights increase with proximity to the center can be said to make it possible to calculate an average value more in line with the operations of the CNN.
Specifically, as illustrated in
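The center-weighted average described above can be sketched as follows. The weight profile (inverse of the Chebyshev distance plus one) and the one-block peripheral radius are illustrative assumptions; the embodiment only requires that the weights increase with proximity to the center.

```python
# Illustrative sketch of step S2320: a weighted average of class mixing
# ratios over a range including the periphery of the identification
# image, with weights increasing toward the center.

def weighted_peripheral_ratio(ratio_grid, cy, cx, radius=1):
    """Average the mixing ratios of blocks within `radius` of block
    (cy, cx), weighting each block by 1 / (1 + Chebyshev distance)."""
    total, weight_sum = 0.0, 0.0
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            if 0 <= y < len(ratio_grid) and 0 <= x < len(ratio_grid[0]):
                w = 1.0 / (1 + max(abs(y - cy), abs(x - cx)))
                total += w * ratio_grid[y][x]
                weight_sum += w
    return total / weight_sum

# An isolated plant block: central ratio is 1.0, all neighbors 0.0.
grid = [[0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0]]
print(weighted_peripheral_ratio(grid, 1, 1))
```

For the isolated-plant example above, the peripheral ratio is low even though the central block's own ratio is 1.0, which is exactly the distinction (isolated plants versus plant-filled surroundings) that the identification image's own ratio alone cannot capture.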
A graph in which the frequencies are plotted along a dotted line connecting the lower-left and the upper-right of the above-described graph (the left side of
In step S2400, the importance changing unit 2400 obtains a correction coefficient based on the statistic information obtained from the statistic information calculation unit 2300. This correction coefficient is used when optimizing weighting coefficients of the CNN using error back propagation in step S2500, described later. A higher correction coefficient indicates a greater contribution to the learning, whereas a lower correction coefficient indicates a smaller contribution to the learning. In other words, controlling the correction coefficient makes it possible to change the degree of importance of the data in the learning.
Accordingly, by increasing the correction coefficient for data having a low frequency of occurrence in
Ci=α×DMAX/Di (1)
α is a hyperparameter. i represents the ID of the histogram bin, and if the total number of bins is I, correction coefficients are calculated in the range i=1 to I, respectively. The magnitude of the correction coefficient is inversely proportional to the frequency of occurrence, such that the coefficient is lower when the frequency is higher, and the coefficient is higher when the frequency is lower.
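Formula (1) can be sketched directly. The treatment of empty bins (skipped with a coefficient of 0) is an assumption here, as the specification does not address bins with zero frequency.

```python
# Illustrative sketch of Formula (1): Ci = alpha * DMAX / Di, where Di
# is the frequency of histogram bin i and DMAX the maximum frequency.

def correction_coefficients(frequencies, alpha=1.0):
    """Return a correction coefficient per histogram bin, inversely
    proportional to the bin's frequency of occurrence."""
    d_max = max(frequencies)
    return [alpha * d_max / d if d > 0 else 0.0 for d in frequencies]

# Di per bin: the majority of data sits in the first bin (ratio = 0).
freq = [400, 0, 50, 20, 100]
print(correction_coefficients(freq))
# Low-frequency bins receive large coefficients; high-frequency bins
# receive small ones, reducing the unevenness of the contributions.
```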
The left side of
A graph in which the correction coefficients are plotted along a dotted line connecting the lower-left and the upper-right of the above-described graph (the left side of
In step S2500, the learning unit 2500 performs learning based on the training image obtained from the data obtainment unit 2100, the supervisory information expressing the class mixing state obtained from the class mixing ratio calculation unit 2200, and the correction coefficient obtained from the importance changing unit 2400. More specifically, the learning unit 2500 learns the parameters of the feature extraction unit 610 and the output layer 620 of the estimator 600 indicated in
As described earlier, in the convolutional layers of the CNN, a convolution operation is performed on the input image 630, using a filter of a predetermined size and centered on a pixel of interest, and a feature map is output. For example, the filter size is 3×3 and the number of channels is four in the convolutional layer 1 (the layer 611) in
When performing learning for the estimator 600, the learning unit 2500 compares a supervisory signal with the value of an output signal obtained from the output layer 620 when an identification image obtained from a predetermined position j of a training image In is input to the CNN, and obtains error. The CNN can be trained by sequentially back propagating the error obtained in this manner from the output layer to the input layer using error back propagation. As the initial values for the weighting coefficients of the CNN, random values can be used, or weighting coefficients obtained from learning related to some task may be used.
When sequentially back propagating error from the output layer to the input layer using error back propagation, changing the correction coefficient for each identification image makes it possible to control the contribution to the learning. The following will describe an example of controlling the contribution using the correction coefficient.
Assume that Xnj represents the feature obtained by inputting, to the feature extraction unit 610, the identification image obtained from the predetermined position j in the training image In, and y(Xnj) represents the output signal obtained by inputting this to the output layer 620. Additionally, of the output map 640, the supervisory data corresponding to the predetermined position j is represented by Tnj. In this case, error E between the output signal and the supervisory information is calculated as indicated by Formula (2).
E(n,j)=(y(Xnj)−Tnj) (2)
Assuming that the correction coefficient of the identification image obtained from the predetermined position j in the training image In is Cnj, error Ec which takes into account the correction coefficient is calculated as indicated by Formula (3).
Ec(n,j)=Cnj×E(n,j) (3)
The relationship between the correction coefficient Ci for the bin ID i and the correction coefficient Cnj for the identification image is determined in advance. Although Formula (3) indicates an example of multiplying the error by the correction coefficient, other operations, such as addition, may be used instead.
As described earlier with reference to step S2400, when there is unevenness in the variation of the data, the contribution is increased for relatively small data numbers, and the contribution is reduced for relatively large data numbers. The correction coefficient may be increased to increase the contribution, and the correction coefficient may be reduced to reduce the contribution. In the learning unit, error back propagation is performed by referring to the value of the correction coefficient calculated for each identification image in step S2400.
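Formulas (2) and (3) can be sketched together as follows. This is a scalar illustration of how the correction coefficient scales a sample's contribution before back propagation, not the full CNN weight update; the numeric values are illustrative assumptions.

```python
# Illustrative sketch of Formulas (2) and (3): the error of each
# identification image is scaled by its correction coefficient Cnj, so
# rare data contributes more strongly to the weight updates.

def corrected_error(y, t, c):
    """E(n,j) = y(Xnj) - Tnj, then Ec(n,j) = Cnj * E(n,j)."""
    return c * (y - t)

# Two identification images with the same raw error but different
# correction coefficients: a majority-pattern one (c = 1) and a
# rare-pattern one (c = 4).
print(corrected_error(0.8, 0.5, 1.0))  # majority data
print(corrected_error(0.8, 0.5, 4.0))  # minority data, 4x the contribution
```

Because the corrected error Ec is what is back-propagated, raising Cnj for rare patterns is equivalent to letting those samples drive larger parameter updates, which is how the degree of importance controls the learning.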
According to the first embodiment described thus far, a statistic amount is obtained by taking into consideration not only the class mixing ratio of the identification image, but also the class mixing ratio of a range including the periphery of the identification image. Based on the obtained statistic amount, the correction coefficient is increased for relatively small numbers of data, and the correction coefficient is reduced for relatively large numbers of data. Through this, learning can be performed for the estimator, taking into account unevenness in variation included in the training data.
The recognition accuracy for a variety of target objects can be improved by reducing not only unevenness in the number of data between classes, attributes, and the like of the training data, but also unevenness in the variation of the training data within the same class, the same attribute, and so on.
Although the first embodiment describes an example of application to a semantic region segmentation task, the embodiment is applicable to other tasks as well. For example, the embodiment can be applied to other tasks such as object detection, scene recognition, and the like.
First Variation
In the first embodiment, unevenness in the variation of the training data is calculated using two indicators, namely the class mixing ratio of the identification image and the class mixing ratio of a range including the identification image and the periphery thereof. In a first variation, the variation of data is evaluated based on orientation information, i.e., in which direction class labels are more prevalent, in a range including the identification image and the periphery thereof.
The functional configuration of the learning apparatus 5000 according to the first variation is similar to that in
Although the processing flow according to the first variation is similar to that in
Step S2300 includes step S2320, and the statistic information calculation unit 2300 calculates the statistic information based on orientation information of the class label data in a peripheral range of the identification image. The specific flow of the processing in step S2300 is illustrated in
Next, a center of gravity position of the label is obtained for each class within the peripheral range. Note that when obtaining the center of gravity position for the label, the weight may be increased with proximity to the center, as illustrated in
Enlarged versions of the rectangular ranges surrounded by dotted lines 801 to 803 in
The rectangular ranges surrounded by dotted lines 804 to 806 in
The orientation information is obtained for all the identification images in the N training images, and a distribution of the frequencies thereof is illustrated in
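The centroid-based orientation computation described above can be sketched as follows. This is a minimal illustration only: the function and parameter names, the inverse-distance weighting, and the eight-bin quantization are assumptions, not details taken from the embodiment.

```python
import numpy as np

def orientation_histogram(label_masks, num_bins=8):
    """For each identification image, compute a weighted center of gravity
    of one class's labels over the image plus its periphery, take the angle
    from the patch center to that center of gravity as the orientation, and
    histogram the quantized orientations over all identification images."""
    hist = np.zeros(num_bins, dtype=np.int64)
    for mask in label_masks:  # binary mask over the patch and its periphery
        ys, xs = np.nonzero(mask)
        if len(ys) == 0:
            continue  # class absent from this range
        h, w = mask.shape
        cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
        # weight label pixels more heavily the closer they lie to the center
        weight = 1.0 / (1.0 + np.hypot(ys - cy, xs - cx))
        gy = np.sum(ys * weight) / np.sum(weight)
        gx = np.sum(xs * weight) / np.sum(weight)
        angle = np.arctan2(gy - cy, gx - cx)  # in (-pi, pi]
        hist[int((angle + np.pi) / (2 * np.pi) * num_bins) % num_bins] += 1
    return hist
```

A patch whose labels lie to the right of center, for instance, falls into the bin covering an angle of zero.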
In step S2400, the importance changing unit 2400 calculates the correction coefficient based on the statistic information obtained in step S2300. As illustrated in
In step S2500, the learning unit 2500 performs learning while changing the correction coefficient based on the orientation of each identification image when applying error back propagation, similar to the first embodiment. Performing control in this manner makes it possible to reduce unevenness in variation in the orientation information of the class label data.
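Changing the contribution of each identification image during error back propagation can be thought of as scaling that image's loss term by the correction coefficient of its statistic bin before averaging. The following sketch illustrates this; the weighted-mean normalization and all names here are assumptions:

```python
import numpy as np

def weighted_loss(per_sample_losses, bin_ids, coeffs):
    """Scale each identification image's loss by the correction coefficient
    of its statistic bin, then take a weighted mean; gradients computed from
    this loss are scaled by the same per-image factors during back propagation."""
    losses = np.asarray(per_sample_losses, dtype=float)
    w = np.asarray([coeffs[i] for i in bin_ids], dtype=float)
    return float(np.sum(w * losses) / np.sum(w))
```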
As described thus far, in the first variation, the orientation information is obtained based on the center of gravity position of the labels in a range including information of the periphery of the identification image. This makes it possible to reduce unevenness in variation in the orientation information of the class label data.
Second Variation
A second variation will describe a method for reducing variation in the data using statistic information on the class mixing ratios of the identification images.
In the histogram of the class mixing ratios of the identification images illustrated in
Accordingly, in the first embodiment, data unevenness is reduced by increasing the correction coefficient for the minority data and reducing the correction coefficient for the majority data. However, when the difference between the numbers of pieces of majority data and minority data is too large, determining the correction coefficient faithfully according to the ratio of those numbers can be detrimental to the overall performance, for the following reason.
In other words, because the distribution of validation data for performance validation tends to be generally similar to that of the training data, the majority data in the training data will also be the majority in the validation data. Accordingly, if the degree of importance of the minority data is increased faithfully according to the number of pieces of data in the training data, the performance may be degraded in the majority data occupying the majority of the validation data, resulting in lower overall performance.
Accordingly, in the second variation, the statistic information of the training data is obtained, and the ratio of the correction coefficients between the highest-frequency data and other data is prevented from becoming excessively high. Specifically, the ratio of the correction coefficients is determined so as to be smaller than the ratio between the number of pieces of the highest-frequency training data and the number of pieces of training data with a lower occurrence frequency. This makes it possible to suppress excessively large contributions from data which does not have the maximum frequency. The performance on minority data can thus be improved while maintaining the performance on majority data, which dominates the overall performance.
An example of the configuration of the learning apparatus 5000 according to the second variation is similar to that in
Although the processing flow according to the second variation is similar to that described in the first embodiment with reference to
In step S2400, the importance changing unit 2400 obtains the correction coefficient based on the frequency of occurrence of the identification images. The number of pieces of data in the bin containing the class mixing ratio of 0.0, which has the largest number of pieces of data, is assumed to be DMAX. At this time, the correction coefficient Ci for data having a frequency of Di is calculated through Formula (4).
Ci=βi×DMAX/Di (4)
βi<1.0 (5)
βi in Formula (5) is a hyperparameter. i represents the ID of the histogram bin, and if the total number of bins is I, correction coefficients are calculated for each of i = 1 to I.
Setting βi to be less than 1 makes it possible to suppress excessively large contributions from data which does not have a maximum frequency.
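Formulas (4) and (5) can be read, for example, as follows: the highest-frequency bin keeps a coefficient of 1, and β < 1 shrinks the up-weighting of every other bin, so that the coefficient ratio stays below the raw data-count ratio. This reading, the use of a single constant β for all bins, and all names are assumptions for illustration:

```python
import numpy as np

def correction_coefficients(frequencies, beta=0.5):
    """C_i = beta * D_MAX / D_i (Formula (4)) for bins below the maximum
    frequency; the maximum-frequency bin keeps a coefficient of 1, so the
    coefficient ratio C_i / C_max = beta * D_MAX / D_i stays smaller than
    the data-count ratio D_MAX / D_i because beta < 1 (Formula (5))."""
    freqs = np.asarray(frequencies, dtype=float)
    d_max = freqs.max()
    coeffs = np.ones_like(freqs)
    for i, d in enumerate(freqs):
        if 0 < d < d_max:  # skip empty bins and the maximum-frequency bin
            coeffs[i] = beta * d_max / d
    return coeffs
```

With frequencies of 1000 and 10, for example, the coefficient ratio is 50 rather than the raw count ratio of 100.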
In step S2500, the learning unit 2500 performs learning while changing the correction coefficient, based on the class mixing ratio for each identification image.
As described thus far, according to the second variation, the ratio of the correction coefficients between the data having the highest frequency and other data is determined so as to be smaller than the ratio between the number of pieces of the highest-frequency training data and the number of pieces of training data with a lower occurrence frequency. This makes it possible to reduce performance degradation accompanying unevenness in the variation in data while maintaining the overall performance.
Third Variation
In the first embodiment, a correction coefficient for error back propagation is used to control contributions to learning in order to reduce unevenness in the frequency with which data occurs. A third variation will describe a method of padding low-frequency data instead of (or along with) using a correction coefficient for error back propagation.
The configuration of the learning apparatus 5000 according to the third variation is similar to that in
In step S2410, the importance changing unit 2400 obtains a data padding amount based on the statistic information obtained from the statistic information calculation unit 2300. This data padding amount is used when optimizing weighting coefficients of the CNN using error back propagation in step S2500, described later. As the data padding amount increases, the contribution to the learning increases, whereas as the data padding amount decreases, the contribution to the learning decreases. In other words, controlling the data padding amount makes it possible to change the degree of importance of the data in the learning.
By increasing the data padding amount for data having a lower frequency of occurrence in
Wi=γ×DMAX/Di (6)
γ is a hyperparameter. i represents the ID of the histogram bin, and if the total number of bins is I, padding amounts are calculated for each of i = 1 to I. The data padding amount is inversely proportional to the frequency of occurrence: the padding amount is lower when the frequency is higher, and higher when the frequency is lower.
A graph in which the data padding amounts are plotted along a dotted line connecting the lower-left and the upper-right of the above-described graph (the left side of
In step S2500, the learning unit 2500 performs learning based on the training image obtained from the data obtainment unit 2100, the supervisory information expressing the class mixing state obtained from the class mixing ratio calculation unit 2200, and the data padding amount obtained from the importance changing unit 2400. When using error back propagation to back propagate error obtained by comparing the value of the output signal obtained from the output layer 620 and the supervisory signal, changing the data padding amount for each identification image makes it possible to control the contribution to the learning.
To increase the contribution of identification images for which the data padding amount is high, learning is performed after copying the data of those identification images to increase the number of pieces of data. In other words, the number of copies is controlled in accordance with the magnitude of the padding amount. Note that when the padding amount is low, no copies may be made at all.
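Formula (6) and the copy-count control can be sketched as follows. The rounding rule and all names are assumptions; the embodiment specifies only that more copies are made for larger padding amounts and that none may be made for small ones:

```python
import numpy as np

def padded_dataset(samples, bin_ids, frequencies, gamma=0.1):
    """W_i = gamma * D_MAX / D_i (Formula (6)). Each sample appears
    round(W_i) times in the padded set (at least once, i.e. the original
    itself); for high-frequency bins W_i rounds low and no copy is made."""
    freqs = np.asarray(frequencies, dtype=float)
    d_max = freqs.max()
    out = []
    for sample, i in zip(samples, bin_ids):
        w = gamma * d_max / freqs[i]        # padding amount for this bin
        copies = max(1, int(round(w)))      # keep at least the original
        out.extend([sample] * copies)
    return out
```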
As described thus far, according to the third variation, the contribution to learning is controlled by controlling the data padding amount. Through this, similar to the first embodiment, learning can be performed for the estimator, taking into account unevenness in variation included in the training data.
The first embodiment described an example in which unevenness in the variation of data is reduced by using class mixing ratios of an identification image and a range including the periphery thereof. A second embodiment will describe reducing unevenness in the variation of data using information on subcategories of classes.
Apparatus Configuration
The configuration of the learning apparatus 5000 according to the second embodiment is similar to that in
The second embodiment uses information on subcategories of classes. Specifically, assume that three subcategories, namely “tree”, “grass”, and “flower”, are added to the “plant” class.
The plant class labels indicated by the supervisory information 501 in
Apparatus Operations
Although the processing flow according to the second embodiment is similar to that in
Note that when a single identification image is given a class label having a plurality of subcategories, a fractional value based on the mixing ratios may be added to the frequency of each of those subcategories, or 1 may be added only to the subcategory having the highest ratio.
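Both counting options just described can be sketched as follows; representing each identification image as a dictionary of subcategory mixing ratios is an assumption made for illustration:

```python
def subcategory_histogram(patches, fractional=True):
    """Each patch carries its subcategory mixing ratios, e.g.
    {"tree": 0.7, "grass": 0.3}. Either spread the count across the
    subcategories by their ratios, or add 1 only to the dominant one."""
    hist = {}
    for ratios in patches:
        if fractional:
            for sub, r in ratios.items():
                hist[sub] = hist.get(sub, 0.0) + r
        else:
            top = max(ratios, key=ratios.get)
            hist[top] = hist.get(top, 0.0) + 1.0
    return hist
```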
In step S2400, the importance changing unit 2400 calculates the correction coefficient based on the statistic information obtained in step S2300.
In step S2500, the learning unit 2500 performs learning while changing the correction coefficient, based on the class mixing ratio and the subcategory for each identification image. By performing learning as described above, unevenness in the variation of the data is reduced according to the nature of the subcategories. Additionally, when there is unevenness in the number of pieces of data in the subcategories of the training data, that unevenness can also be reduced.
Although plants are described as an example in the foregoing, subcategory information can be used for other classes as well. For example, when detecting a sky region, the region can be divided into subcategories such as “cloudy sky”, “blue sky”, “sunset sky”, and the like. Automobiles, meanwhile, can be divided into “sedan”, “minivan”, “SUV”, “bus”, “truck”, and the like.
Additionally, although a one-dimensional histogram pertaining to the class mixing ratio of the identification image is used for each subcategory, a two-dimensional histogram pertaining to the class mixing ratio calculated for a range including the periphery of the identification image and the class mixing ratio of the identification image may be used, similar to the first embodiment. Orientation information may be used as well, as in the first variation.
As described thus far, according to the second embodiment, statistic amounts are calculated for each subcategory, and correction coefficients are obtained and used in the learning. When the characteristics differ from subcategory to subcategory within each class, some subcategories are prone to data unevenness while others are not, and thus performing such calculations makes it possible to perform learning for the estimator while taking into account unevenness in the variation of the training data.
The second embodiment described an example of reducing unevenness in the variation of data using information on subcategories of classes. A third embodiment will describe reducing unevenness in the variation of data by using information on camera parameters used in shooting and other information added to the image.
Apparatus Configuration
The configuration of the learning apparatus 5000 according to the third embodiment is similar to that in
Apparatus Operations
Although the processing flow in the learning apparatus 5000 according to the third embodiment is similar to that in
In step S2300, the statistic information calculation unit 2300 calculates the statistic information based on camera parameters at the time of shooting the captured image data, which is the input data. Here, the “camera parameters” relate to parameters from the time of shooting the image, such as the brightness of the subject (information indicating the brightness of the shot scene), called “Bv”; the shutter speed; the aperture value; the depth value; GPS information; shooting date/time; and the like. In the present embodiment, Bv will be used as an example. Bv can be calculated through Formulas (7) to (10), from the exposure time (seconds) T, the aperture value (F value) A, and the ISO sensitivity value Sx, which are image capturing conditions.
Bv=Tv+Av−Sv (7)
Tv=−log2T (8)
Av=2·log2A (9)
Sv=log2(0.32·Sx) (10)
As a guide, the value of Bv is approximately 7 to 10 during the day outdoors, around 5 in a bright setting indoors, about 1 to 2 in a dark setting indoors, −1 outdoors at night, and so on. The brightness at which a subject has been shot can be grasped from the value of Bv. The information on the camera parameters from the time of shooting is stored in the training data storage unit 5100 as information added to the training image in advance.
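Formulas (7) to (10) translate directly into code; for example (the function and argument names are illustrative):

```python
import math

def bv(exposure_time_s, f_number, iso):
    """Bv = Tv + Av - Sv (Formula (7)), from the image capturing conditions."""
    tv = -math.log2(exposure_time_s)   # Formula (8): Tv = -log2(T)
    av = 2.0 * math.log2(f_number)     # Formula (9): Av = 2*log2(A)
    sv = math.log2(0.32 * iso)         # Formula (10): Sv = log2(0.32*Sx)
    return tv + av - sv                # Formula (7)
```

A setting of 1/250 s at F8 and ISO 100, for instance, gives Bv of approximately 9, consistent with the daytime-outdoor range of 7 to 10 given above.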
The specific flow of the processing in step S2300 is illustrated in
In step S2400, the importance changing unit 2400 calculates the correction coefficient based on the statistic information obtained in step S2300. In other words, there is unevenness in the data, and thus a correction coefficient is obtained to reduce that unevenness.
Although an example of the Bv value has been described here, using information on, for example, the shooting date/time makes it possible to grasp unevenness in the data, such as which months have more data and which months have less data. If such monthly data indicates that there are few images of plants in the winter, increasing the correction coefficient for training images from that period makes it possible to improve the detection performance of plants in the winter, for which there are few images.
Additionally, using GPS data makes it possible to grasp which areas have more images shot and which areas have fewer images shot. For example, if there are more images from Europe and fewer images from Asia, increasing the correction coefficient for images from Asia makes it possible to improve the detection performance for the areas from which there are fewer images.
Information added to images, aside from the camera parameters (e.g., information on the person who added the class labels, information on the continuous working time required for class labeling, and so on), can also be used.
For example, if there is variation in the quality of the class labeling by each person, unevenness will arise in the quality of the class labels when there is only data which has been labeled by a specific person. Increasing the correction coefficient for data labeled by a person who has labeled a lower number of pieces of data makes the quality of the class labels more uniform.
Additionally, labeling classes is a task which requires concentration, and thus the longer work is done continuously before the labeling of the training data in question is added, the more the quality of the class labels may drop. Accordingly, the continuous working time and the quality of the class labels may also be correlated. Suppressing unevenness in the continuous working time makes the quality of the class labels more uniform.
As described thus far, according to the third embodiment, a statistic amount is calculated based on camera parameters, other information added to the images, and so on, and a correction coefficient is obtained and used in the learning. Performing such calculations makes it possible to perform learning for the estimator while taking into account unevenness in the camera parameters, the quality of the class labels, and so on included in the training data.
In the first to third embodiments, a method of recognizing mixing states of classes in identification images obtained by dividing an input image into rectangular regions is used as a semantic region segmentation task. In the fourth embodiment, a method of performing class determination for each pixel in the input image is used. The statistic information calculated is the same as the statistic information of the orientation information in a range including the periphery of a pixel of interest, used in the first variation.
Apparatus Configuration
The configuration of the image processing apparatus 1000 according to the fourth embodiment is similar to that in
The configuration of a learning apparatus 6000 according to the fourth embodiment is illustrated in
Apparatus Operations
In step S1250, the estimation unit 1200 estimates class labels at the same resolution as the resolution of the input image. In other words, the resolution of the input image and the resolution of the output map are the same, and the class label having the highest likelihood is estimated for each pixel in the input image.
A network structure called U-Net, described above in Document B, can be used as the estimator, for example. This network can achieve highly-accurate class label estimation at the same resolution as the resolution of the input image by performing processing for reducing the resolution in the pooling layers and then increasing the resolution in upsampling layers. There is also a method called “skip connection”, which inputs the feature map prior to pooling into a convolutional layer having the same resolution after the upsampling. However, the estimator which can be used in the present embodiment is not limited to U-Net.
In step S2100, the data obtainment unit 2100 obtains the training images and the supervisory information pertaining to the region as the training data from the training data storage unit 5100. In the present embodiment, supervisory information on the class label for each pixel is used directly as supervisory information.
In step S2300, the statistic information calculation unit 2300 obtains the supervisory information on the class label for each pixel from the data obtainment unit 2100, and calculates statistic information. The specific flow of the processing in step S2300 is illustrated in
Step S2300 includes step S2340, and the statistic information calculation unit 2300 calculates the statistic information for each pixel in the input image. The statistic information is calculated for all pixels in all N of the training images. The present embodiment assumes that the size of the training images is the QVGA size, i.e. 320×240, and there are thus 76,800 pixels in each image. This applies to the N images, and thus the total amount of information is N×76,800 pixels.
The present embodiment uses the statistic information of the orientation information in a range including the periphery of a pixel of interest, used in the first variation. In other words, a center of gravity position of the class label is obtained in a range including a peripheral region of the pixel of interest, as described with reference to
In step S2400, the importance changing unit 2400 obtains a correction coefficient based on the statistic information obtained from the statistic information calculation unit 2300. When the orientation information is obtained for all pixels in the N training images and a distribution of the frequencies is found, frequencies similar to those in
In step S2500, the learning unit 2500 learns parameters for feature extraction and parameters of the output layer for the estimator, exemplified by U-Net as described above. First, the value of the output signal from the estimator is compared with the supervisory signal, and error is obtained. The estimator can be trained by sequentially propagating the error obtained in this manner from the output layer to the input layer using error back propagation. When applying error back propagation, performing learning while changing the correction coefficient based on the orientation for each pixel makes it possible to reduce unevenness in the variation of the orientation information in the class label data.
As described thus far, the fourth embodiment has described performing class determination on each pixel of an input image as a semantic region segmentation task. Even in such a task, the correction coefficient is increased for relatively small numbers of data, and the correction coefficient is reduced for relatively large numbers of data. Through this, learning can be performed for the estimator, taking into account unevenness in variation included in the training data.
The first to fourth embodiments described examples of the present invention being applied to region detection tasks. A fifth embodiment will describe an example of the present invention being applied to an object detection task. An object detection task is a task of outputting a range in which an object is present in an image as a rectangular frame. A variety of targets can be given as targets for detection, such as a person's entire body, a person's face, a person's eyes, animals, vehicles, and the like. The present embodiment will describe the task of detecting a person's face as an example.
Apparatus Configuration
The configuration of the image processing apparatus 1000 according to the fifth embodiment is similar to that in
The learning apparatus according to the fifth embodiment has a configuration similar to the learning apparatus 6000 of the fourth embodiment, illustrated in
Apparatus Operations
In step S1280, the estimation unit 1200 loads a pre-trained estimator from the estimator storage unit 5200, and estimates a center map and size maps. The estimator 600 that uses the CNN illustrated in
In step S1290, the output unit 1300 calculates and outputs an object detection frame based on the center map and the size maps obtained in step S1280. Specifically, a peak position of a feature on the center map is calculated, and that position is taken as the center position of the object. The values of the size maps are obtained at the same position as the center position of the object, and those values are taken as the size of the object. The size in the X direction is obtained from the X direction size map, and the size in the Y direction is obtained from the Y direction size map.
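The decoding in step S1290 can be sketched as follows, assuming the size maps are normalized so that a value of 1 corresponds to a maximum size of 400 pixels, as in the supervisory-signal description used during learning; all names are illustrative:

```python
import numpy as np

def decode_detection(center_map, size_map_x, size_map_y, max_size=400.0):
    """Take the peak of the center map as the object center, read the
    normalized sizes at that position, and de-normalize by max_size."""
    y, x = np.unravel_index(np.argmax(center_map), center_map.shape)
    w = float(size_map_x[y, x]) * max_size  # X direction size
    h = float(size_map_y[y, x]) * max_size  # Y direction size
    return (int(x), int(y), w, h)
```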
In step S2300, the statistic information calculation unit 2300 calculates the statistic information. A detailed processing flowchart of step S2300 is illustrated in
As illustrated in
By calculating the length of the diagonal of the object detection frame for all of the supervisory information included in the training data, statistic information on the lengths of the diagonals of the object detection frames is obtained.
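The diagonal statistic can be gathered as in the following sketch; the bin width and maximum diagonal are assumed values, and the frame representation is illustrative:

```python
import math

def diagonal_histogram(frames, bin_width=20, max_diag=400):
    """`frames` holds (width, height) of each ground-truth object detection
    frame; the diagonal length sqrt(w^2 + h^2) is histogrammed to expose
    unevenness in the sizes occurring in the training data."""
    num_bins = max_diag // bin_width
    hist = [0] * num_bins
    for w, h in frames:
        d = math.hypot(w, h)                       # diagonal length
        b = min(int(d // bin_width), num_bins - 1)  # clip oversized frames
        hist[b] += 1
    return hist
```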
In step S2400, the importance changing unit 2400 obtains a correction coefficient based on the statistic information obtained from the statistic information calculation unit 2300. As illustrated in
In step S2500, the learning unit 2500 performs learning based on the training images obtained from the data obtainment unit 2100 and the correction coefficient obtained from the importance changing unit 2400. More specifically, the learning unit 2500 learns the parameters of the feature extraction unit 610 and the output layer 620 of the estimator 600 indicated in
The supervisory signal of the center map gives a value of “1” to a center coordinate position of the object detection frame. The supervisory signal of a size map gives a value of the size to the range of the object detection frame assuming that the maximum size of “400” is normalized to “1”. For example, when the X direction size is 100 pixels, a value of 0.25 (=100/400) is given to the X direction size map.
When performing learning for the estimator 600, the learning unit 2500 compares a supervisory signal with the value of an output signal obtained from the output layer 620 when an image obtained from a predetermined position of a training image is input to the CNN, and obtains error. The error is calculated for both the center map and the size maps. The CNN can be trained by sequentially back propagating the error obtained in this manner from the output layer to the input layer using error back propagation.
When sequentially back propagating error from the output layer to the input layer using error back propagation, changing the correction coefficient for each identification image makes it possible to control the contribution to the learning. The method for controlling the contribution using the correction coefficient is similar to that in the first embodiment, and will therefore not be described. The correction coefficient becomes large for small faces and large faces, which occur less frequently, and thus learning is performed appropriately even for small faces and large faces, which improves the recognition accuracy.
As described thus far, according to the fifth embodiment, learning is performed through correction using statistic information. Accordingly, a drop in the recognition accuracy caused by unevenness in the data can be reduced even in object detection tasks.
Fourth Variation
Although the foregoing fifth embodiment described using statistic information based on the size of a frame around a person's face, a fourth variation will describe an example of using statistic information based on the orientation of a person's face. Note that the configurations of the learning apparatus and the image processing apparatus are similar to those described in the fifth embodiment, and will therefore not be described here. Additionally, operations of the image processing apparatus are similar to those described in the fifth embodiment, and will therefore not be described here.
Apparatus Operations
The flowchart illustrating the operations of the learning apparatus according to the fourth variation is similar to that of the fifth embodiment (FIG. 15A).
In step S2100, the data obtainment unit 2100 obtains the training images and the supervisory information pertaining to the object frame as the training data from the training data storage unit 5100.
In step S2300, the statistic information calculation unit 2300 calculates the statistic information. A detailed processing flowchart of step S2300 is illustrated in
In step S2400, the importance changing unit 2400 obtains a correction coefficient based on the statistic information obtained from the statistic information calculation unit 2300. As illustrated in
In step S2500, the learning unit 2500 performs learning based on the training images obtained from the data obtainment unit 2100 and the correction coefficient obtained from the importance changing unit 2400. The details of the learning are similar to those in the fifth embodiment, and will therefore not be described here. The correction coefficient is large for face orientations for which the absolute value of the angle is high, which appear with lower frequency, and thus learning is performed appropriately even for such face orientations, which improves the recognition accuracy.
Fifth Variation
The foregoing fifth embodiment described a person's face as the detection target, but a fifth variation will describe an animal as the detection target. Additionally, an example in which information on subcategories of animals, and information on backgrounds in which detection targets are present (scene information), is used as the statistic information will be described. Note that the configurations of the learning apparatus and the image processing apparatus are similar to those described in the fifth embodiment, and will therefore not be described here. Additionally, operations of the image processing apparatus are similar to those described in the fifth embodiment, and will therefore not be described here.
Apparatus Operations
The flowchart illustrating the operations of the learning apparatus according to the fifth variation is similar to that of the fifth embodiment (
In step S2100, the data obtainment unit 2100 obtains the training images and the supervisory information pertaining to the object frame as the training data from the training data storage unit 5100.
As illustrated in
Furthermore, information on the background where the subject (the animal) is captured (scene information) is added. Specifically, “indoors” is added to a frame 942 surrounding a cat 941, illustrated in
In step S2300, the statistic information calculation unit 2300 calculates the statistic information. A detailed processing flowchart of step S2300 is illustrated in
In step S2400, the importance changing unit 2400 obtains a correction coefficient based on the statistic information obtained from the statistic information calculation unit 2300. As illustrated in
In step S2500, the learning unit 2500 performs learning based on the training images obtained from the data obtainment unit 2100 and the correction coefficient obtained from the importance changing unit 2400. The details of the learning are similar to those in the fifth embodiment, and will therefore not be described here. The correction coefficient increases for combinations which appear less frequently, and thus learning is performed appropriately even for such combinations, which improves the recognition accuracy.
The first to fourth embodiments and the first to third variations described examples of applications to region detection tasks. Additionally, the fifth embodiment and the fourth and fifth variations described examples of applications to object detection tasks. In this manner, the present invention can be applied to a variety of recognition tasks, and can be applied to scene recognition tasks, image classification tasks, authentication tasks, and the like as well, for example.
Additionally, although the first to fifth embodiments and the first to fifth variations described examples in which two-dimensional image data is used as the input data, the input data to which the present invention is applicable is not limited to image data.
For example, the present invention can also be applied in voice recognition using voice data, which is one-dimensional information. When collecting data for voice recognition, statistic information on attribute information, such as the age, gender, and the like of the person who produced the voice, can be used, and unevenness in the variation thereof causes differences in the performance. For example, if the number of pieces of data for a man in his thirties is lower than that of other data, the performance will drop for that insufficient data. By applying the present invention, the correction coefficient used during learning can be increased for the insufficient data, which reduces unevenness in the variation and leads to an improvement in the overall performance.
Embodiment(s) of the present invention can also be realized by a computer of a system or apparatus that reads out and executes computer executable instructions (e.g., one or more programs) recorded on a storage medium (which may also be referred to more fully as a ‘non-transitory computer-readable storage medium’) to perform the functions of one or more of the above-described embodiment(s) and/or that includes one or more circuits (e.g., application specific integrated circuit (ASIC)) for performing the functions of one or more of the above-described embodiment(s), and by a method performed by the computer of the system or apparatus by, for example, reading out and executing the computer executable instructions from the storage medium to perform the functions of one or more of the above-described embodiment(s) and/or controlling the one or more circuits to perform the functions of one or more of the above-described embodiment(s). The computer may comprise one or more processors (e.g., central processing unit (CPU), micro processing unit (MPU)) and may include a network of separate computers or separate processors to read out and execute the computer executable instructions. The computer executable instructions may be provided to the computer, for example, from a network or the storage medium. The storage medium may include, for example, one or more of a hard disk, a random-access memory (RAM), a read only memory (ROM), a storage of distributed computing systems, an optical disk (such as a compact disc (CD), digital versatile disc (DVD), or Blu-ray Disc (BD)™), a flash memory device, a memory card, and the like.
While the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.
This application claims the benefit of Japanese Patent Application No. 2021-206255, filed Dec. 20, 2021, and Japanese Patent Application No. 2022-148404, filed Sep. 16, 2022 which are hereby incorporated by reference herein in their entirety.