This application is based on and claims the benefit of priority from Japanese Patent Application No. 2013-235810, filed Nov. 14, 2013, the disclosure of which is incorporated herein in its entirety by reference.
1. Technical Field
The present invention relates to a discrimination apparatus, a discrimination program, and a discrimination method based on supervised learning. In particular, the present invention relates to a discrimination apparatus, a discrimination program, and a discrimination method that use a discriminative model generated through a learning process of training data that has been expanded.
2. Related Art
To construct a discriminator based on supervised learning, training data accompanied by target values is required to be collected. The relationships between input and output of the training data are then required to be learned within the framework of machine learning. The target value refers to the desired output for a piece of training data. During a learning process, when a certain piece of training data is inputted, a search for learning parameters is performed so that the output from the discriminator becomes closer to the target value corresponding to that piece of training data.
A discriminator that is obtained through the learning process described above performs, during operation, discrimination of unknown data that is not included in the training data but is similar in pattern. Discriminative capability for the unknown data that is an object for such discrimination is referred to as generalization capability. The discriminator is required to have high generalization capability.
In general, as the amount of training data increases, the generalization capability of the discriminator trained using such training data increases. However, personnel cost is incurred when collecting training data. Therefore, it is demanded that high generalization capability be achieved with a small amount of training data. In other words, a measure against low distribution density of training data is required.
Here, a heuristic method referred to as data expansion has been proposed. Data expansion is described in P. Y. Simard, D. Steinkraus, J. C. Platt, “Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis”, ICDAR 2003 (hereinafter referred to as P. Y. Simard et al.) and Ciresan, et al., “Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition”, Neural Computation 2010 (hereinafter referred to as Ciresan, et al.). Data expansion refers to increasing the types of data by subjecting data provided as a sample to parametric deformation. However, these deformations must not compromise the unique features of the class to which the original data belongs.
In P. Y. Simard et al., research on handwritten digit recognition using a convolutional neural network (CNN) is described. Here, training data undergoes a transformation referred to as “elastic distortion”. A large amount of data is artificially generated as a result (data expansion). The generated data is then learned. It is described that, as a result of learning such as this, discriminative capability significantly higher than when data expansion is not performed can be achieved.
In addition, in Ciresan, et al., research on handwritten digit recognition using a neural network is described. Here, data expansion is performed by transformation of rotation and scale, in addition to elastic distortion. It is described that extremely high recognition capability can be achieved as a result.
In this way, in P. Y. Simard, et al. and Ciresan, et al., regarding the issue of handwritten digit recognition, deformations such as localized elastic distortions, minute rotations, and minute scale changes are applied. As a result, data expansion that does not compromise the features of the digits becomes possible. Generalization capability higher than that when data is not expanded can be successfully achieved. Performing discrimination of unknown data after learning with data expansion is common practice, particularly in the field of image recognition.
It is thus desired to improve the discriminative capability of a discriminator when discrimination of unknown input data is performed based on a learning process using data expansion of training data.
A first exemplary embodiment of the present disclosure provides a discriminator based on supervised learning. The discriminator includes a data expanding unit and a discriminating unit. The data expanding unit performs data expansion on unknown data, which is an object to be discriminated, in such a manner that a plurality of pieces of pseudo unknown data are generated. The discriminating unit applies the expanded plurality of pieces of pseudo unknown data to a discriminative model so as to discriminate the expanded plurality of pieces of pseudo unknown data. The discriminating unit then integrates the discrimination results of the expanded plurality of pieces of pseudo unknown data to perform class classification such that the unknown data is classified into classes.
In this configuration, the unknown data is expanded in such a manner that the plurality of pieces of pseudo unknown data are generated. The discrimination results of the plurality of pieces of pseudo unknown data are integrated, and then, the class classification of the unknown data is performed based on the integrated discrimination results. Therefore, discriminative capability is improved compared to when discrimination is performed on the unknown data itself.
In the exemplary embodiment, the data expanding unit may perform data expansion on the unknown data using the same method as the data expansion performed on training data when the discriminative model is generated. In this configuration, the unknown data is expanded by the same method as that for expansion of the training data when the discriminative model is generated. Therefore, the probability that the distribution of the pseudo unknown data overlaps the class posterior distribution increases. Discriminative capability is improved in cases in which data expansion of the training data is performed when the discriminative model is generated.
In the exemplary embodiment, the discriminator may perform the class classification based on expected values derived by applying the plurality of pieces of pseudo unknown data to the discriminative model.
In this configuration, class classification is performed using, as a decision rule, minimization of an objective function (also called, e.g., an error function or a cost function) used when the discriminative model is generated. Therefore, discriminative capability is improved in cases in which data expansion of the training data is performed when the discriminative model is generated.
In the exemplary embodiment, the discriminating unit may perform the class classification without applying the unknown data to the discriminative model. In this configuration, class classification of the unknown data is performed without the unknown data itself being used for discrimination.
In the exemplary embodiment, the data expanding unit may perform data expansion on the unknown data using random numbers. In this configuration, the unknown data is expanded using random numbers. Therefore, the probability that the distribution of the pseudo unknown data overlaps the class posterior distribution increases. Discriminative capability is improved in cases in which data expansion of the training data is performed when the discriminative model is generated.
A second exemplary embodiment of the present disclosure provides a computer-readable storage medium storing a discrimination program that enables a computer to function as a discriminator based on supervised learning. The discriminator includes a data expanding unit and a discriminating unit. The data expanding unit performs data expansion on unknown data which is an object to be discriminated in such a manner that a plurality of pieces of pseudo unknown data are generated. The discriminating unit applies the expanded plurality of pieces of pseudo unknown data to a discriminative model so as to discriminate the expanded plurality of pieces of pseudo unknown data. The discriminating unit then integrates the discrimination results of the expanded plurality of pieces of pseudo unknown data to perform class classification such that the unknown data is classified into classes.
In this configuration as well, the unknown data is expanded in such a manner that the plurality of pieces of pseudo unknown data are generated. The discrimination results of the pieces of pseudo unknown data are integrated, and class classification of the unknown data is then performed based on the integrated discrimination results. Therefore, discriminative capability is improved compared to when discrimination is performed on the unknown data itself.
A third exemplary embodiment of the present disclosure provides a discrimination method based on supervised learning. In the method, by a data expanding unit, data expansion is performed on unknown data which is an object to be discriminated in such a manner that a plurality of pieces of pseudo unknown data are generated. By a discriminating unit, the plurality of pieces of pseudo unknown data that have been generated by the data expanding unit are applied to a discriminative model so as to discriminate the expanded plurality of pieces of pseudo unknown data. Then, by the discriminating unit, the discrimination results of the expanded plurality of pieces of pseudo unknown data are integrated to perform class classification such that the unknown data is classified into classes.
In this configuration as well, the unknown data is expanded such that a plurality of pieces of pseudo unknown data are generated. The discrimination results of the pieces of pseudo unknown data are integrated, and class classification of the unknown data is then performed based on the integrated discrimination results. Therefore, discriminative capability is improved compared to when discrimination is performed on the unknown data itself.
As described above, in the first to third exemplary embodiments, unknown data is expanded. Their discrimination results are then integrated and class classification is performed. Therefore, discriminative capability is improved compared to when discrimination is performed on the unknown data itself.
A learning apparatus and a discriminator according to an embodiment of the present invention will hereinafter be described with reference to the drawings. The embodiment described below gives an example when the present invention is carried out. The embodiment does not limit the present invention to specific configurations described hereafter. When carrying out the present invention, specific configurations based on the implementation may be used accordingly.
An embodiment of the present invention will hereinafter be described, giving as an example a pattern discriminator and a learning apparatus. The pattern discriminator performs class classification of unknown data, such as image data. The learning apparatus is used to learn a discriminative model used by the pattern discriminator. In addition, an instance in which a feed-forward multilayer neural network is used as the discriminative model will be described. Other models, such as a convolutional neural network, may also be used as the discriminative model.
The training data storage unit 11 stores therein training data (hereinafter also referred to as a “data sample”) accompanied by target values. The transformation parameter generating unit 13 generates transformation parameters. The transformation parameters are used by the data expanding unit 12 to expand the training data stored in the training data storage unit 11. The data expanding unit 12 performs data expansion by performing parametric transformation on the training data stored in the training data storage unit 11 using the transformation parameters generated by the transformation parameter generating unit 13.
The learning unit 14 performs a learning process using the training data that has been expanded by the data expanding unit 12. The learning unit 14 thereby generates a discriminative model to be used by the discriminator of the present embodiment. The learning unit 14 determines the weight W of each layer, which is a parameter of the multilayer neural network.
The data expansion performed by the data expanding unit 12 will be described.
The data expanding unit 12 increases the number of pieces of data by transforming the training data. The transformation is parametric transformation near data points on a manifold of the data. The transformation includes, for example, localized distortions in an image, localized changes in luminance, affine transformations, and noise superposition.
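As a concrete illustration, the following is a minimal sketch of such parametric transformation for a 2-D image sample, assuming NumPy arrays; the particular transformations and parameter ranges are illustrative, not those of the embodiment.

```python
import numpy as np

def expand(image, rng):
    """Generate one piece of pseudo data from a single image sample."""
    h = image.shape[0]
    out = image.astype(float)
    # Minute translation (a simple affine transformation component).
    dy, dx = rng.integers(-2, 3, size=2)
    out = np.roll(np.roll(out, dy, axis=0), dx, axis=1)
    # Localized change in luminance: rescale a random horizontal band.
    top = int(rng.integers(0, h))
    out[top:top + max(1, h // 4)] *= rng.uniform(0.9, 1.1)
    # Noise superposition.
    return out + rng.normal(0.0, 0.01, size=out.shape)

rng = np.random.default_rng(0)
x0 = np.zeros((28, 28))                       # one training sample
pseudo = [expand(x0, rng) for _ in range(8)]  # eight pseudo samples from one
```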
When one or more transformation parameters (e.g., M transformation parameters θ1, θ2, . . . , θM) are collectively represented by θ and the transformation is denoted by u(x0;θ), and when a sufficiently large (in the limit, infinite) number of pieces of pseudo data are generated from a single piece of training data, the pieces of pseudo data have a distribution expressed by the following expression.
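The expression itself is not reproduced above. One plausible form, assuming the pseudo data density is obtained by pushing the parameter distribution p(θ) through the transformation u, is:

```latex
q(x) = \int \delta^{(D)}\bigl(x - u(x_0;\theta)\bigr)\, p(\theta)\, d\theta
```

where $\delta^{(D)}$ denotes the D-dimensional Dirac delta function.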
Here, D denotes the dimension of the data and corresponds to the dimension of the space of the data distribution of the class C1 shown in the drawings.
The learning unit 14 learns the expanded training data. As described above, according to the present embodiment, the learning unit 14 learns a feed-forward multilayer neural network as a discriminative model.
As shown in the drawings, the learning unit 14 uses an objective function (also called, e.g., a cost function) that takes a lower value as the output value and the target value become closer. The learning unit 14 uses the objective function to search for parameters of the discriminative model that minimize the objective function. The learning unit 14 thereby decides on a discriminative model that has high generalization capability as a result of the search. According to the present embodiment, cross entropy is used as the objective function.
First, the definitions of the symbols are shown in the drawings.
Here, f1 is a differentiable (subdifferentiable) monotonically non-decreasing or non-increasing function.
In addition, the number of dimensions of the output is the number of classes. The target values are set such that one of the units of the output layer has the value 1, and the remaining units have the value 0. In a two-class classification, the output may be one-dimensional. In this case, the target value is 0 or 1.
First, a learning process when data expansion is not performed will be described below. A learning process when data expansion is performed according to the present embodiment will subsequently be described in comparison with the learning process when data expansion is not performed.
The objective function when data expansion is not performed is expressed by the following expressions (1) and (1′).
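Expressions (1) and (1′) themselves are not reproduced above. From the description that follows (a softmax output followed by cross entropy), a plausible reconstruction is given below; the pre-softmax activations $a_c$ are introduced here purely for illustration:

```latex
y_c(x_{0i};W) = \frac{\exp a_c(x_{0i};W)}{\sum_{c'} \exp a_{c'}(x_{0i};W)} \qquad (1)

G_i(W) = -\sum_{c} t_{ic} \ln y_c(x_{0i};W) \qquad (1')
```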
Here, Gi (W) denotes the objective function, i denotes the index of the training data, and c denotes the class label.
In this way, a softmax function is applied to the output of the neural network, converting the values to positive values and normalizing the vector. Cross entropy defined by expression (1′) is then applied to the vector. As a result, poor classification of a certain training sample is quantified. In the instance of a one-dimensional output y(x0i;W), the expressions (1) and (1′) can be applied by substitution of variables so that y1(x0i;W)=y(x0i;W), y2(x0i;W)=1−y(x0i;W), t1=t, and t2=1−t.
The following gradient of the objective function Gi (W) is calculated.
A gradient obtained by the sum of a plurality of data samples is used to update the elements W0, W1, W2, . . . , WL of the weight W as in expression (2) below through stochastic gradient descent (SGD).
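Expression (2) is likewise not reproduced above. Judging from the description of RPE that follows, a plausible reconstruction of the update, with a hypothetical learning rate $\varepsilon$, is:

```latex
W \leftarrow W - \varepsilon \sum_{i \in \mathrm{RPE}} \frac{\partial G_i(W)}{\partial W} \qquad (2)
```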
The update is repeatedly performed until the elements W0, W1, W2, . . . , WL of the weight W (the weight of each layer) converge. Here, RPE in expression (2) is an acronym for “randomly picked example”, and refers to randomly selecting a data sample for each repetition.
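As a concrete illustration of this update loop, the following is a minimal sketch, assuming a gradient routine grad_G obtained by error back-propagation; the function names, the learning rate, and the batch size are illustrative, not those of the embodiment.

```python
import numpy as np

def sgd_rpe(W, data, targets, grad_G, epsilon=0.01, batch=32, steps=1000, seed=0):
    """Stochastic gradient descent over randomly picked examples (RPE)."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        # Randomly pick a set of data samples for this repetition.
        picked = rng.integers(0, len(data), size=batch)
        # Sum the gradients of the objective over the picked samples,
        # then apply the update of expression (2).
        g = sum(grad_G(W, data[i], targets[i]) for i in picked)
        W = W - epsilon * g
    return W
```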
Next, an instance according to the present embodiment in which data expansion is performed will be described. The objective function Gi (W) according to the present embodiment is as expressed by the following expressions (3) and (3′).
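Expressions (3) and (3′) are not reproduced above. Given the description below (pseudo data u(x0i;θ) in place of the training data itself, and an expected value of the cross entropy over the transformation parameters), a plausible reconstruction of (3′) is:

```latex
G_i(W) = -\,E_{\theta}\Bigl[\sum_{c} t_{ic} \ln y_c\bigl(u(x_{0i};\theta);W\bigr)\Bigr] \qquad (3')
```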
Unlike in expression (1′), the training data itself is not inputted into the learning unit 14 in expression (3′). Rather, the data expanding unit 12 generates pseudo data and inputs the pseudo data into the learning unit 14. The pseudo data is artificial data derived by transformation from the training data. In addition, unlike in expression (1′), an expected value of cross entropy for the transformation parameter is obtained. The learning unit 14 uses stochastic gradient descent as the method for optimizing the objective function.
A specific procedure is as follows. The data expanding unit 12 selects a single piece of training data stored in the training data storage unit 11. In addition, the data expanding unit 12 samples a plurality of transformation parameters from the transformation parameter generating unit 13 using random numbers based on an appropriate probability distribution. The data expanding unit 12 performs transformation of the training data using the parameters. The data expanding unit 12 thereby expands the single piece of training data into a plurality of pieces of data.
The learning unit 14 uses the plurality of pieces of pseudo data to calculate the following gradient.
The learning unit 14 uses the gradient that is the sum over the plurality of data samples and updates the elements W0, W1, W2, . . . , WL of the weight W as in expression (4) below through stochastic gradient descent.
The update is repeatedly performed until the elements W0, W1, W2, . . . , WL of the weight W (the weight of each layer) converge. Here, RPERD in expression (4) is an acronym for “randomly picked example with random distortion”, and refers to selecting a data sample from data samples that have been deformed using random numbers.
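A minimal sketch of this RPERD loop follows, reusing the hypothetical expand and grad_G helpers from the earlier sketches; the number of transformation samples used to approximate the expectation over θ is illustrative.

```python
import numpy as np

def sgd_rperd(W, data, targets, grad_G, expand, epsilon=0.01, batch=32,
              n_transforms=4, steps=1000, seed=0):
    """SGD over randomly picked examples with random distortion (RPERD)."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        picked = rng.integers(0, len(data), size=batch)
        g = 0.0
        for i in picked:
            # Expand one training sample into several pseudo samples and
            # approximate the expectation over theta by their average.
            pseudo = [expand(data[i], rng) for _ in range(n_transforms)]
            g = g + sum(grad_G(W, x, targets[i]) for x in pseudo) / n_transforms
        W = W - epsilon * g  # update of expression (4)
    return W
```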
Ordinarily, an error back-propagation method is used to update the parameters of the weight W of the multilayer neural network. The error back-propagation method applies a gradient method in sequence from the output layer side to the input layer side via the at least one hidden layer, as shown in the drawings.
Next, a discriminator according to the present embodiment will be described.
A discriminator 200 includes a data input unit 21, a data expanding unit 22, a transformation parameter generating unit 23, and a discriminating unit 24. The discriminator 200 is actualized by a computer. The computer includes an auxiliary storage unit, a temporary storage unit, a computation processing unit, an input/output unit, and the like. The data input unit 21 is actualized by, for example, the input/output unit. In addition, the data expanding unit 22, the transformation parameter generating unit 23, and the discriminating unit 24 are actualized by the computation processing unit running a discrimination program according to the embodiment of the present invention.
Data that is unknown and is not used for learning is inputted into the data input unit 21.
Therefore, according to the present embodiment, the discriminator 200 performs data expansion even for discrimination, using a method similar to that for learning. The discriminator 200 then appropriately integrates the discrimination results from the expanded data. In this way, as a result of the data being expanded using random numbers during discrimination as well, the possibility of the distribution overlapping the class posterior distribution increases. Therefore, the possibility increases of a correct discrimination in cases that were previously discriminated incorrectly. The reason therefor will be described in detail below.
When data expansion is not performed, the most appropriate class classification method when a certain piece of data is inputted is to select a class c that satisfies the following expression (5).
$y_c(x_0;W) \geq y_{c' \neq c}(x_0;W)$ (5)
This decision rule minimizes the objective function (1′) for when data expansion is not performed, and is theoretically optimal.
Conventionally, the decision rule for when data expansion is not performed has been used even when data expansion is performed. In other words, even when learning is performed using expression (3′), discrimination (class classification) is performed using the decision rule in expression (5), which is theoretically optimal only when data expansion is not performed. However, the theoretically optimal decision rule differs between when data expansion is performed and when data expansion is not performed. That is, the above-described decision rule in expression (5) minimizes the objective function Gi (W) in expression (1′) for when data expansion is not performed, but does not minimize the objective function Gi (W) in expression (3′) for when data expansion is performed.
When data expansion is performed, the optimal class classification method is to select a class c that satisfies the following expression (6).
$E_\theta[\ln y_c(u(x_0;\theta);W)] \geq E_\theta[\ln y_{c' \neq c}(u(x_0;\theta);W)]$ (6)
This decision rule minimizes the objective function Gi (W) in expression (3′), and is theoretically optimal.
As described above, in the conventional method, even though the objective function for data expansion is minimized during learning, the decision rule in expression (5) is applied during discrimination. Therefore, theoretically optimal class classification cannot be performed. Conversely, the discriminator 200 according to the present embodiment performs discrimination by obtaining the expected value of the logarithm of the output over the transformation parameters during discrimination as well.
Specifically, the discriminator 200 performs processes at the following steps, i.e., a data expanding step and a discriminating step according to a discrimination method of the present embodiment.
First, the discriminator 200 performs a process at the data expanding step. In the process, the data expanding unit 22 transforms unknown data inputted into the data input unit 21 using the transformation parameters generated by the transformation parameter generating unit 23. The data expanding unit 22 thereby generates a plurality of pieces of pseudo unknown data. The transformation parameters used by the data expanding unit 22 are stochastically generated from the distribution p(θj) that has been used for learning to generate the discriminative model.
Subsequently, the discriminator 200 performs a process at the discriminating step. In the process, the discriminating unit 24 performs the expected value calculation in expression (6). The discriminating unit 24 then selects the class label for which the expected value of the logarithm of the output over the transformation parameters is the highest. In this way, through use of the optimal decision rule for data expansion, discriminative capability higher than that in the past can be achieved even when the amount of collected data is the same and data expansion is performed in the same manner.
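A minimal sketch of these two steps follows, assuming a trained model exposed through a hypothetical predict_proba function that returns the softmax output y(x;W), and reusing the hypothetical expand helper; the number of pseudo samples is illustrative.

```python
import numpy as np

def discriminate(x, predict_proba, expand, n_samples=16, seed=0):
    """Classify unknown data x by the decision rule in expression (6)."""
    rng = np.random.default_rng(seed)
    # Data expanding step: generate pseudo unknown data from x.
    pseudo = [expand(x, rng) for _ in range(n_samples)]
    # Discriminating step: expected value of the logarithm of the output,
    # averaged over the sampled transformation parameters (clipped to avoid log 0).
    expected_log = np.mean(
        [np.log(np.clip(predict_proba(p), 1e-12, 1.0)) for p in pseudo], axis=0)
    # Note that the unknown data x itself is never applied to the model.
    return int(np.argmax(expected_log))
```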
According to the present embodiment, cross entropy is used as the objective function. However, the objective function is not limited to cross entropy. A decision rule when the objective function is the total sum of squared errors will be described below. The objective function Gi (W) when data expansion is not performed is expressed by the following expressions (7) and (7′).
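Expressions (7) and (7′) are not reproduced above; a plausible reconstruction of the sum-of-squared-errors objective for a single training sample is:

```latex
G_i(W) = \sum_{c} \bigl(y_c(x_{0i};W) - t_{ic}\bigr)^2 \qquad (7')
```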
The following gradient of the objective function Gi (W) is calculated.
A gradient obtained by the sum of a plurality of data samples is used to update the elements of the weight W as in expression (8) below through stochastic gradient descent (SGD). The update is repeatedly performed until the elements of the weight W converge.
Next, an instance according to the present embodiment in which data expansion is performed in the above-described example will be described. The objective function Gi(W) according to the present embodiment is as expressed by the following expressions (9) and (9′).
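Expressions (9) and (9′) are not reproduced above; by analogy with (3′), a plausible reconstruction of (9′) is:

```latex
G_i(W) = E_{\theta}\Bigl[\sum_{c} \bigl(y_c\bigl(u(x_{0i};\theta);W\bigr) - t_{ic}\bigr)^2\Bigr] \qquad (9')
```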
In this way, unlike in expression (7′), in expression (9′) the expected value of the total sum of squared errors is taken over the transformation parameters.
Conventionally, the following expression (10) has been used as the decision rule.
$y_c(x_0;W) \geq y_{c' \neq c}(x_0;W)$ (10)
This decision rule minimizes the objective function Gi (W) in expression (7′) for when data expansion is not performed. However, it does not minimize the objective function Gi (W) in expression (9′) for when data expansion is performed. Therefore, when data expansion is performed, a decision rule that minimizes the expected value of the total sum of squared errors over the transformation parameters is used, as in the following expression (11).
$E_\theta[y_c(u(x_0;\theta);W)] \geq E_\theta[y_{c' \neq c}(u(x_0;\theta);W)]$ (11)
When data expansion is performed, discriminative capability that is higher than that in the past can be achieved in a manner similar to the above-described embodiment through use of the decision rule in expression (11).
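Under the squared-error objective, the integration step averages the raw outputs rather than their logarithms. A minimal sketch, again with the hypothetical predict_proba and expand helpers:

```python
import numpy as np

def discriminate_sse(x, predict_proba, expand, n_samples=16, seed=0):
    """Decision rule (11): select the class maximizing E_theta[y_c]."""
    rng = np.random.default_rng(seed)
    pseudo = [expand(x, rng) for _ in range(n_samples)]
    # Average the raw outputs over the pseudo unknown data.
    expected = np.mean([predict_proba(p) for p in pseudo], axis=0)
    return int(np.argmax(expected))
```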
As described above, in the discriminator 200 according to the present embodiment, the data expanding unit 22 performs data expansion on unknown data using a method similar to that for data expansion for learning. The data expanding unit 22 thereby generates pseudo unknown data. The discriminating unit 24 then performs class classification based on the expected values of the pseudo unknown data.
In other words, the discriminator 200 does not perform class classification of the unknown data itself. Rather, the discriminator 200 performs class classification by expanding the unknown data and integrating the results of class classification of the expanded unknown data. That is, the discriminator 200 performs class classification based on a decision rule that minimizes the objective function used for learning.
As a result, when a discriminative model is generated through learning of provided training data after data expansion, discriminative capability higher than that of a conventional method can be achieved when the amount of collected training data is the same and the training data is expanded in the same manner.
In related art, the same decision rule related to class classification of unknown input data that is classified into classes is used both when data expansion is performed and when data expansion is not performed. As described above, in the present embodiment, based on the understanding that theoretically optimal decision rules differ between when data expansion is performed and when data expansion is not performed, improvements in data expansion have been made in the discriminator.
In the present embodiment, the decision rules related to class classification of unknown input data that is classified into classes are improved as described above. Thus, the discriminative capability of the discriminator can be improved when discrimination of the unknown input data is performed based on a learning process using data expansion of the training data.
A test conducted using the learning apparatus and the discriminator according to the present embodiment will be described below. The following conditions were set for the test. A handwritten digit data set (refer to MNIST, http://yann.lecun.com/exdb/mnist, and the drawings) was used.
As the learning condition of the learning apparatus, the same data expansion was applied for both when discrimination is performed by the conventional method and when discrimination according to the embodiment of the present invention is performed.
In addition, a derivative was calculated only once from a generated sample. No derivative was calculated from the original sample. As the discrimination condition of the discriminator, in the conventional method, only the original sample was discriminated. In the discriminator according to the present embodiment, the expected values were evaluated from a plurality of generated samples. The original sample itself was not used for the expected values in the discriminator according to the present embodiment.
The test results are shown in the drawings. From the results, it can be seen that the discriminator according to the present embodiment achieves discriminative capability higher than that of the conventional method under the same learning conditions.
In the present embodiment, unknown data is expanded. The discrimination results of the expanded unknown data are integrated and class classification is performed. Therefore, the present invention is useful as, for example, a discrimination apparatus that uses a discriminative model generated through a learning process of training data that has been expanded. The discrimination apparatus achieves an effect in which discriminative capability is improved compared to when discrimination is performed on the unknown data itself.