The present invention relates to an information estimation apparatus and information estimation method for performing an estimation process using a neural network. The present invention particularly relates to an information estimation apparatus and information estimation method that provide a dropout layer in a neural network to obtain a variance representing a confidence interval for an estimation result.
Compared with other estimators, estimators using neural networks (NN) can perform estimation while processing large amounts of information, such as images and sensor signal data, as input data, and so are expected to be used in various fields.
A neural network has a structure in which layers for processing data are arranged. Data is supplied to each layer and subjected to computation set in the layer, and then the processed data is output. In detail, input data from an observed object is first supplied to an input layer of the neural network, processed, and output. The data is then supplied to subsequent layers (intermediate layers) in sequence as input data, processed, and output. The process in each layer is thus repeatedly performed to propagate the data in the neural network. Data eventually output from an output layer which is the last layer is an estimation result. The input data from the observed object is n-dimensional vector data of the observation target to be estimated. For example, input data for a monochrome camera image of 10 pixels by 10 pixels is vector data of 10×10=100 dimensions (i.e. n=100) having elements corresponding to the respective pixels.
Each layer in the neural network can be set so that the number of dimensions of input vector data and the number of dimensions of output vector data are different from each other. In other words, the number of dimensions of vector data may increase or decrease when the vector data passes through each layer. Moreover, the number of dimensions of vector data output from the output layer varies depending on what the designer wants to estimate. For example, in the case of estimating a value such as “speed” or “score”, the output from the output layer is scalar data of n=1 dimension. In the case of classifying an input image as any of “pedestrian”, “car”, and “bicycle” (i.e. in the case of performing 3-class classification), the output from the output layer is vector data of n=3 dimensions storing “score” indicating which of the three classes the input image corresponds to.
Processes executed by an estimator for performing an estimation process using a neural network include a learning phase and an estimation phase.
In the learning phase, the designer prepares training data and uses it to train the neuron weights of the neural network so that input data having a specific pattern produces a desired specific output.
In the estimation phase, unknown new data, i.e. test data, is supplied to the neural network with the weights learned in the learning phase to perform estimation. If the learning has been successful, the neural network produces an estimation result according to the learned concept.
A main difference of a conventional estimator using a neural network from other estimators using probabilistic approaches such as Bayesian estimation is that, in the neural network, the estimation result is output merely as a “value”, and a variance representing a confidence interval for the estimation result cannot be computed.
Thus, the variance representing the confidence interval cannot be computed in the neural network. This makes it impossible to, for example, set a threshold and adopt only estimation results whose reliability is not less than a predetermined level, so that estimation results with a high possibility of erroneous determination may be adopted. For example, in the case of using a neural network in an environment where high safety is required, such as when estimating a car's surroundings, a serious accident may ensue if the estimation result contains an erroneous determination.
Non Patent Document 1 listed below proposes a method of computing an output value and its variance in a neural network. The computation method disclosed in Non Patent Document 1 is described below.
In the variance computation method in Non Patent Document 1, dropout which is normally used to prevent overfitting during learning is also used during estimation, to compute the variance of the estimation result. Dropout is a technique of providing a dropout layer in the layers of a neural network and independently setting each element of input vector data supplied to the dropout layer to zero with probability pdrop set by the designer beforehand, as disclosed in Patent Document 1 as an example.
For example, suppose the input vector data has 100 dimensions, i.e., is composed of 100 elements. Each element is independently subjected to the determination of whether or not to set the value included in the element to zero with probability pdrop (the value in the original element is unchanged in the case of not setting the value to zero). This results in statistically 100×pdrop elements being zero from among the 100 elements. Thus, dropout causes computation to be performed in the state where the number of elements corresponding to probability pdrop are missing (set to zero).
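As a concrete illustration, a minimal NumPy sketch of this element-wise dropout is given below; the 100-dimensional vector and the value of pdrop are arbitrary assumptions for the example.

    import numpy as np

    p_drop = 0.2                       # dropout probability set by the designer
    x = np.random.randn(100)           # 100-dimensional input vector data

    # Each element is independently kept (z=1) or set to zero (z=0).
    z = (np.random.rand(100) >= p_drop).astype(x.dtype)
    x_dropped = x * z                  # statistically 100*p_drop elements are zero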
During learning, the weights are computed so as to minimize the difference between an output result obtained in the state where elements are missing with probability pdrop and the desired correct solution data. This computation is repeated many times during learning. In detail, each element of other vector data supplied to the dropout layer is independently subjected to the determination of whether or not to set its value to zero with probability pdrop, computation is performed for the other vector data in the state where the number of elements corresponding to probability pdrop are missing, and the weights are computed so as to minimize the difference from the desired correct solution data. By repeatedly performing learning using dropout for input vector data in this way, the neural network learns to output the same correct solution data as an estimation result irrespective of which elements of the vector data are missing.
This computation method using dropout has conventionally been employed only during learning. In other words, dropout has conventionally been used during learning, but not during estimation.
Non Patent Document 1 introduces a technique whereby, during estimation computation, too, computation involving dropout is repeatedly performed many times on input vector data from the same object to obtain an output value and its variance. Such estimation using dropout is called Monte Carlo (MC) dropout in Non Patent Document 1. The pattern of the group of elements of input vector data set to zero with probability pdrop in the dropout layer differs at each estimation computation due to element missing, so that the final estimation result after passing through the subsequent layers also differs each time. In this specification, the phenomenon that the output estimation result differs for each input of vector data is also referred to as “fluctuation” of the estimation result.
In Non Patent Document 1, computation is performed MC times to collect MC (about 200 or less) values of final output vector data which varies each time, and the variance of these values is computed according to the following expression. The variance yielded from this expression is defined as uncertainty with respect to input data.
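The expression, as given in Non Patent Document 1, is the following.

Var(y*) ≈ τ−1ID + (1/T)Σt=1..T ŷt(x*)⊤ŷt(x*) − E(y*)⊤E(y*)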
In this expression, x* is an input, y* is an output, T is the number of computations (i.e. T=MC), and the left side is the variance of output y*. As shown in the expression, the left side (variance) is represented as the sum of constant term τ−1ID (the first term in the right side) relating to initial variance and the outcome of subtracting the square of the mean of output y* (the third term in the right side) from the variance of output y* (the second term in the right side).
Such computation is intuitively expressed as follows. An estimate of the neural network for the same object is computed many times. At each computation, the value of input vector data to the dropout layer is randomly set to zero to create missing elements randomly in the group of elements of the vector data, thus intentionally fluctuating output data from the dropout layer. If, even in the case of intentionally fluctuating the output data from the dropout layer as mentioned above, the final estimation result output from the output layer does not fluctuate, i.e., the variance is small, the neural network can be regarded as producing an estimate with high reliability. If the final estimation result output from the output layer fluctuates greatly, i.e., the variance is large, the neural network can be regarded as producing an estimate with low reliability.
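For reference, the repeated computation of the conventional technique can be sketched as follows. This is a minimal illustration, assuming a hypothetical function stochastic_forward that performs one forward pass of the network with dropout active; it is not part of the present invention.

    import numpy as np

    def mc_dropout_estimate(stochastic_forward, x, T=200):
        # Perform T forward passes with dropout active; each pass fluctuates.
        samples = np.stack([stochastic_forward(x) for _ in range(T)])
        # The mean is the estimate; the variance is its uncertainty.
        return samples.mean(axis=0), samples.var(axis=0)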
Patent Document 1: International Publication WO 2014105866 A1
Non Patent Document 1: “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning”, Yarin Gal, Zoubin Ghahramani, Jun. 6, 2015 (available from https://arxiv.org/pdf/1506.02142v1.pdf)
Non Patent Document 2: “ON THE VARIANCE OF THE SAMPLE MEAN FROM FINITE POPULATION”, Syed Shakir Ali Ghazali, Journal of Scientific Research, Volume XXXIV No. 2, October 2005
However, to obtain the variance for one observation target, the computation subsequent to the dropout layer in the neural network needs to be repeated many times, as mentioned above; for example, MC (about 200 or less) times. If the number of computations is reduced, the obtained probability density distribution of the output value does not have a smooth distribution profile, making it difficult to estimate the variance accurately. If the number of computations is increased, on the other hand, more accurate variance estimation is possible, but the vast number of computations takes time and labor. This causes a heavy computational burden in actual use.
To solve the above-mentioned problem, the present invention has an object of providing an information estimation apparatus and information estimation method for performing an estimation process using a neural network, with which a variance as a confidence interval for an estimation result can be computed stably and fast without a vast number of computations.
To achieve the object stated above, the present invention also provides an information estimation apparatus for performing an estimation process using a neural network including an integrated layer that combines a dropout layer for dropping out a part of input data and an FC layer for computing a weight, the information estimation apparatus including: a data analysis unit configured to determine, based on a numerical distribution of terms formed by respective products of each vector element of input data to the integrated layer having a multivariate distribution and the weight, a data type of each vector element of output data from the integrated layer having a multivariate distribution; and an estimated confidence interval computation unit configured to apply an approximate computation method associated with the data type determined by the data analysis unit to computation in the integrated layer, to analytically compute a variance of each vector element of the output data from the integrated layer based on the input data to the integrated layer.
To achieve the object stated above, the present invention provides an information estimation method for performing an estimation process using a neural network including an integrated layer that combines a dropout layer for dropping out a part of input data and an FC layer for computing a weight, the information estimation method including: a data analysis step of determining, based on a numerical distribution of terms formed by respective products of each vector element of input data to the integrated layer having a multivariate distribution and the weight, a data type of each vector element of output data from the integrated layer having a multivariate distribution; and an estimated confidence interval computation step of applying an approximate computation method associated with the data type determined in the data analysis step to computation in the integrated layer, to analytically compute a variance of each vector element of the output data from the integrated layer based on the input data to the integrated layer.
The present invention relates to an estimation technique using a neural network, and has an advantageous effect of computing a variance as a confidence interval for an estimation result stably and fast. The present invention thus accelerates and facilitates the determination of the reliability of the estimation result by the neural network. Moreover, for example, whether or not to adopt the estimation result and whether or not to merge the estimation result with another estimation result by Bayesian estimation or the like can be determined depending on reliability. This greatly widens the range of application of the neural network.
The following describes an embodiment of the present invention with reference to drawings. The process in each layer of a neural network and the notation, which are necessary in the description of the embodiment of the present invention, are explained first.
A neural network is composed of a plurality of layers. Input data supplied to each layer is subjected to a computation process defined in the layer, and the processing result is output as output data. The output data is supplied to the next layer as input data to the next layer. In the next layer, the data is equally subjected to a computation process defined in the layer, and the processing result is output as output data. Input, computation, and output are thus repeatedly performed in the layers in sequence, to propagate the data in the neural network. An estimation result is eventually output from an output layer.
In this specification, it is assumed that input data to given layer l of the neural network is nXinl-dimensional random variable vector Xinl, and output data from layer l is nXoutl-dimensional random variable vector Xoutl, which are written as follows. In this specification, nXinl denotes that the subscript of n is Xinl, and nXoutl denotes that the subscript of n is Xoutl.
Xinl=(xinl1, xinl2, . . . , xinlnXinl)
Xoutl=(xoutl1, xoutl2, . . . , xoutlnXoutl)
These random variable vectors Xinl and Xoutl are expressed as follows, according to probability density functions Hinl and Houtl of multivariate distributions having any complex profiles.
Xinl˜Hinl
Xoutl˜Houtl
For example, if probability density functions Hinl and Houtl are Gaussian distributions, the following expressions apply.
Xinl˜Hinl=Gauss(μXinl,ΣXinl)
Xoutl˜Houtl=Gauss(μXoutl,ΣXoutl)
Here, μXinl is an nXinl-dimensional vector representing a mean, and ΣXinl is a variance-covariance matrix of nXinl×nXinl size. Meanwhile, μXoutl is an nXoutl-dimensional vector representing a mean, and ΣXoutl is a variance-covariance matrix of nXoutl×nXoutl size. In this specification, μXinl denotes that the subscript of μ is Xinl, ΣXinl denotes that the subscript of Σ is Xinl, μXoutl denotes that the subscript of μ is Xoutl, and ΣXoutl denotes that the subscript of Σ is Xoutl.
According to the present invention, the law of total probability is used to represent each probability density by a mixture of M conditional probability density distributions as shown below.
Xinl˜Hinl=P(Hinl|A1)P(A1)+ . . . +P(Hinl|AM)P(AM)
Xoutl˜Houtl=P(Houtl|B1)P(B1)+ . . . +P(Houtl|BM)P(BM)
The sum of the probabilities of all conditions is 1, and is expressed as follows.

P(A1)+P(A2)+ . . . +P(AM)=1
P(B1)+P(B2)+ . . . +P(BM)=1
As an example, if multivariate distributions Hinl and Houtl are each a mixture of conditional multivariate Gaussian distributions Gauss, the following expressions apply.
Xinl˜Hinl=Gauss(μ1Xinl,Σ1Xinl)P(A1)+ . . . +Gauss(μMXinl,ΣMXinl)P(AM)
Xoutl˜Houtl=Gauss(μ1Xoutl,Σ1Xoutl)P(B1)+ . . . +Gauss(μMXoutl,ΣMXoutl)P(BM)
Here, data Xinl or Xoutl of a “random variable obeying a multivariate distribution” simply means data “expressed in general form”. This covers the following: in the case of a “uni”variate distribution, the data can be a 1-dimensional variable of nXinl=1, nXoutl=1; in the case where variance-covariance ΣXinl, ΣXoutl is zero, the data can be a fixed value rather than a random variable.
How such multivariate distribution data is computed in each layer of the neural network is briefly described next. The process of each layer is individually described below.
The following describes the computation process in dropout layer D. Let input data to dropout layer D be nXinD-dimensional random variable vector XinD, and output data from dropout layer D be nXoutD-dimensional random variable vector XoutD. In this specification, nXinD denotes that the subscript of n is XinD, and nXoutD denotes that the subscript of n is XoutD.
Dropout is represented using indicator function z={0, 1}. Here, z is a random variable obeying the Bernoulli distribution, where z=0 with dropout probability pdrop and z=1 with non-dropout probability (1−pdrop). Each of the nXinD elements of input data XinD is multiplied by z that is independently set to z=0 or z=1. Since the overall sum value decreases due to dropout, the scale of the overall value is increased by multiplication by a given constant c.
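Written out, the computation in dropout layer D is the following, where the choice c=1/(1−pdrop) for the scale constant is the conventional one and is an assumption here, not stated above.

zj˜Bernoulli(1−pdrop) (1≤j≤nXinD)
xoutDj=c·zj·xinDj (1≤j≤nXinD)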
The following describes the computation process in fully connected (FC) layer F. Let input data to FC layer F be nXinF-dimensional random variable vector XinF, and output data from FC layer F be nXoutF-dimensional random variable vector XoutF. In this specification, nXinF denotes that the subscript of n is XinF, and nXoutF denotes that the subscript of n is XoutF.
The parameters of FC layer F are defined as follows. Let WF (size: nXoutF×nXinF) be a matrix representing a weight, and bF (size: nXoutF×1) be a vector representing a bias. It is assumed that their optimum values have already been acquired in the learning phase.
The process of computing output data XoutF from input data XinF in FC layer F is performed using the following expression.

XoutF=WFXinF+bF
The following describes the computation process in activation layer A. Let input data to activation layer A be nXinA-dimensional random variable vector XinA, and output data from activation layer A be nXoutA-dimensional random variable vector XoutA. In this specification, nXinA denotes that the subscript of n is XinA, and nXoutA denotes that the subscript of n is XoutA.
An activation function is, for example, a sigmoid function or a rectified linear unit (ReLU) function. When the activation function is denoted as function f, the process of computing output data XoutA from input data XinA in activation layer A is performed using the following expression.
XoutA=f(XinA)
A characteristic process according to the present invention is performed in the case where input data which is a random variable obeying a multivariate distribution passes through the above-mentioned dropout layer D, enters some FC layer F, and finally passes through activation layer A, as described later. Assuming a layer (FC layer F with dropout) that integrates dropout layer D and FC layer F as integrated layer DF, the following describes the process in integrated layer DF.
Let input data to integrated layer DF be nXinDF-dimensional random variable vector XinDF, and output data from integrated layer DF be nXoutDF-dimensional random variable vector XoutDF, as shown below. In this specification, nXinDF denotes that the subscript of n is XinDF, and nXoutDF denotes that the subscript of n is XoutDF.
XinDF=(xinDF1, . . . , xinDFj, . . . , xinDFnXinDF)
XoutDF=(xoutDF1, . . . , xoutDFi, . . . , xoutDFnXoutDF)
In this integrated layer DF, the process of computing output data XoutDF from input data XinDF involves computation in the part of dropout layer D of integrated layer DF and computation in the part of FC layer F of integrated layer DF. In detail, the computation in integrated layer DF is performed according to the following expressions.

z=(z1, . . . , zj, . . . , znXinDF), zj˜Bernoulli(1−pdrop)
XoutDF=WDF(z∘XinDF)+bDF
For simplicity's sake, the given constant c multiplied in scale adjustment for dropout may be assumed to be incorporated in weight WDF.
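A single stochastic realization of this computation can be sketched as follows (a minimal NumPy illustration; the shapes and the explicit constant c are assumptions of the example).

    import numpy as np

    def integrated_layer_sample(x_in, W, b, p_drop):
        # One stochastic forward pass through integrated layer DF.
        z = (np.random.rand(x_in.size) >= p_drop).astype(x_in.dtype)
        c = 1.0 / (1.0 - p_drop)       # conventional rescaling constant (assumed)
        return W @ (c * z * x_in) + b  # dropout part, then FC part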
In particular, the i-th element xoutDFi (1≤i≤nXoutDF) in XoutDF is expressed as follows.
xoutDFi=xinDF1z1Wi,1+xinDF2z2Wi,2+ . . . +xinDFnXinDFznXinDFWi,nXinDF+bi

This is the total sum of the list of terms xinDFjzjWi,j (1≤j≤nXinDF) and bias bi.
Random variable xinDFj is input data, derives from a transformation of the Bernoulli distribution with dropout in a layer before integrated layer DF, and takes any distribution depending on the structure of the previous layer. Random variable zj={0, 1} derives from dropout in this integrated layer DF, and obeys the Bernoulli distribution where zj=0 with probability pdrop. These two types of random variables are therefore independent.
Consider the case of computing the value of xoutDFi. The number of xinDFjzjWi,j terms is nXinDF, and each zj independently takes the value 0 or 1, so that there are 2^nXinDF branch combinations; computing the exact distribution of xoutDFi requires the total sum for every one of these combinations. A typical neural network has about nXinDF=1024 neurons, and so 2^1024 total sums need to be computed. Such computation requires enormous processing, and cannot be completed within practical time.
The present invention proposes a technique of computing, by an analytical approach, in a single computation process a variance that has conventionally required a vast number of computation processes. According to the present invention, the value of output data that fluctuates at each computation due to dropout is regarded as a “random variable”. By determining the original “probability density distribution” from which the random variable derives, it is possible to directly find how the distribution profile of the probability density distribution changes through the computation process in each layer. Then, determining the distribution profile of the probability density distribution of the data output from the output layer and computing its variance yields the confidence interval for the estimation result, i.e. the variance.
The following describes the structure of an information estimation apparatus (estimator for performing an estimation process using a neural network) in the embodiment of the present invention, with reference to a block diagram.
The block diagram used in the description of the apparatus structure in the embodiment of the present invention merely represents the functions related to the present invention, and the functions may be actually implemented by hardware, software, firmware, or any combination thereof. The functions implemented by software may be stored in any computer-readable medium as one or more instructions or codes that are executable by a hardware-based processing unit such as a central processing unit (CPU). The functions related to the present invention may be implemented by various devices including integrated circuits (ICs) and IC chipsets.
The estimated confidence interval computation unit 20 is configured to, in addition to performing a computation process on input data in each layer and outputting an estimation result as in a conventional neural network, compute the distribution profile of the distribution with which the input data propagates through and is output from each layer under dropout, and compute the variance output from the final output layer as the confidence interval. If the variance of the estimation result output from the final output layer is large, the estimation result fluctuates greatly, that is, its reliability is low. If the variance is small, the reliability of the estimation result is high. The estimated confidence interval computation unit 20 particularly has a function of performing approximate computation (e.g. approximate computation corresponding to any of the below-mentioned “type 1”, “type 2”, and “mixed type”) corresponding to an approximate computation method determined by the data analysis unit 30, to compute the distribution profile of the data.
For example, the estimated confidence interval computation unit 20 is capable of performing a process of applying the approximate computation method associated with the data type determined by the data analysis unit 30 to the computation in integrated layer DF to thus analytically compute the variance of each vector element of output data from integrated layer DF based on input data to integrated layer DF.
The data analysis unit 30 is configured to analyze the property of data computed in and output from each layer of the neural network, determine an optimum approximate computation method for computing its distribution (data type determination), and notify the approximate computation method to the estimated confidence interval computation unit 20. The data analysis unit 30 particularly has a function of analyzing input data to integrated layer DF combining dropout layer D and FC layer F in the neural network and notifying the estimated confidence interval computation unit 20 of the optimum approximate computation method (e.g. any of the below-mentioned “type 1”, “type 2”, and “mixed type”) for the input data.
For example, the data analysis unit 30 is capable of performing a process of determining, based on the numerical distribution of terms formed by respective products of each vector element of input data to integrated layer DF having a multivariate distribution and a weight, the data type of each vector element of output data from the integrated layer having a multivariate distribution.
The processes in the estimated confidence interval computation unit 20 and the data analysis unit 30 are described in detail below.
The process in the data analysis unit 30 is described first. In integrated layer DF, output data XoutDF is computed based on input data XinDF, as mentioned above. The i-th element xoutDFi (1≤i≤nXoutDF) in XoutDF is expressed as follows.
xoutDFi=xinDF1z1Wi,1+xinDF2z2Wi,2+ . . . +xinDFnXinDFznXinDFWi,nXinDF+bi
The data analysis unit 30 analyzes the property of the xinDFjWi,j terms (1≤j≤nXinDF) obtained by excluding zj from the nXinDF number of xinDFjzjWi,j terms.
The following describes the property of the xinDFjWi,j terms (1≤j≤nXinDF) included in the i-th element xoutDFi of XoutDF.
Actually, xinDFj is another independent random variable. However, for example, xinDFj may be substituted by its mean μxinDFj so that the value of the xinDFjWi,j term is treated as a fixed value of μxinDFjWi,j. In this specification, μxinDFj denotes that the subscript of μ is xinDFj.
The data analysis unit 30 analyzes the absolute value |xinDFjWi,j| of each of the nXinDF number of xinDFjWi,j terms (1≤j≤nXinDF). A term having an exceptionally larger absolute value than other terms is referred to as “peak term”, and the other terms as “non-peak terms”, in this specification. For example, the data analysis unit 30 computes standard deviation σμW of all of the nXinDF number of xinDFjWi,j terms (1≤j≤nXinDF), and regards any xinDFjWi,j term larger than or equal to the value σμW·Dratio obtained by multiplying σμW by a predetermined number (ratio Dratio) set by the designer beforehand, as an exceptional peak term. For example, suppose the value of the xinDFjWi,j term is treated as a fixed value of μxinDFjWi,j. Then, the condition for an exceptional peak term is defined by the following expression.
|μxinDFjWi,j|≥σμW·Dratio
All peak terms satisfying this condition from among the nXinDF number of μxinDFjWi,j terms (1≤j≤nXinDF) are computed. Of these peak terms, a predetermined number (set by the designer beforehand, e.g. several, such as 5) of peak terms with greater exceptions are stored as a peak list. Here, the predetermined number indicates the maximum number of peak terms stored as the peak list. There may be a large number of peak terms, or only a few or no peak terms. For example, in the case where the number of peak terms is less than the predetermined number, fewer peak terms than the predetermined number are stored as the peak list. In the case where the number of peak terms is more than the predetermined number, the predetermined number of peak terms are extracted in descending order of exception, and stored as the peak list. The number of peak terms stored as the peak list is hereafter denoted by npeak (npeak<<nXinDF). Here, npeak takes a value less than or equal to the predetermined number (the maximum number of peak terms stored as the peak list). In the case where there is no peak term, “type 2” is determined as described later, and the peak list does not need to be stored.
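The peak-list extraction can be illustrated as follows (a minimal NumPy sketch of our own; mu_x, w, and the parameter names are assumptions, with max_peaks corresponding to the predetermined number, e.g. 5).

    import numpy as np

    def extract_peak_list(mu_x, w, d_ratio, max_peaks=5):
        """Return indices of peak terms among mu_x[j]*w[j], largest first."""
        terms = mu_x * w                     # fixed-value terms mu_xinDFj * W_ij
        sigma = terms.std()                  # standard deviation of all terms
        peaks = np.where(np.abs(terms) >= sigma * d_ratio)[0]
        # Keep at most max_peaks peak terms, in descending order of |term|.
        order = np.argsort(-np.abs(terms[peaks]))
        return peaks[order][:max_peaks]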
The data analysis unit 30 determines “type 1” in the case where the peak terms are a few (npeak in number) and the values of the other remaining (nXinDF−npeak) number of terms are small enough to be regarded as zero. This is a distribution in which, of the values of the xinDFjWi,j terms, a few terms (npeak in number) project like a peak of δ function and the other remaining (nXinDF−npeak) number of terms are substantially zero.
In the case where xoutDFi is determined as “type 1”, the estimated confidence interval computation unit 20 considers only these exceptional peak terms (i.e. npeak number of xinDFjWi,j terms) while approximating the remaining terms to zero. The estimated confidence interval computation unit 20 can thus compute the distribution of xoutDFi by examining only 2^npeak branching combinations of these peak terms, with no need to examine all 2^nXinDF branches. The distribution computation method by the estimated confidence interval computation unit 20 in “type 1” will be described later.
Various methods are available to determine whether or not the remaining (nXinDF−npeak) number of terms are small enough to be regarded as zero, and the determination method is not limited. As an example, the mean and variance for the distribution of the remaining (nXinDF−npeak) number of terms other than the npeak number of peak terms are computed. In the case where a condition that the mean is less than a first predetermined value (close to zero) and the variance is less than a second predetermined value (small variation) is satisfied, the remaining (nXinDF−npeak) number of terms other than the peak terms can be determined as being small enough to be regarded as zero. In the case where the condition is not satisfied, the remaining (nXinDF−npeak) number of terms other than the peak terms can be determined as not being small enough to be regarded as zero.
The data analysis unit 30 determines “type 2” in the case where there is no peak term. Simply stated, this is the case (such as uniform distribution or Gaussian distribution) where all xinDFjWi,j values are similar as a whole without any significant exception.
Actually, the above-mentioned “type 1” property and “type 2” property are often mixed in the values of the xinDFjWi,j terms.
In such a case, the data analysis unit 30 determines “mixed type” which is a mixture of “type 1” and “type 2”. In “mixed type”, the estimated confidence interval computation unit 20 first acquires the peak terms assuming “type 1”, and treats these values not as random variables but as conditional fixed values (e.g. μxinDFjWi,j). Regarding the remaining (nXinDF−npeak) number of terms other than the peak terms, the estimated confidence interval computation unit 20 can compute the distribution assuming conditional “type 2”. The distribution computation method by the estimated confidence interval computation unit 20 in “mixed type” will be described later.
The process in the estimated confidence interval computation unit 20 is described next. The following describes the distribution computation method by the estimated confidence interval computation unit 20 in each of “type 1”, “type 2”, and “mixed type” determined by the data analysis unit 30, in detail.
The distribution computation method in “type 1” is described first. In the case where the data analysis unit 30 determines the property of the xinDFjWi,j terms included in the i-th element xoutDFi of XoutDF computed in integrated layer DF as “type 1”, the estimated confidence interval computation unit 20 computes the distribution of xoutDFi using only the npeak number of peak terms stored as the peak list.
First, consider the simplest case where the number of peak terms is 1, i.e., npeak=1. In this case, exactly one of the xinDFjWi,j terms has an exceptionally large absolute value and the other terms are substantially zero.
Suppose the j=peaki-th term (1≤peaki≤nXinDF) in xoutDFi is exceptional. When this peak term is denoted by xinDFpeakizpeakiWi,peaki, xoutDFi is expressed as follows.

xoutDFi=xinDF1z1Wi,1+ . . . +xinDFpeakizpeakiWi,peaki+ . . . +xinDFnXinDFznXinDFWi,nXinDF+bi
Regarding the xinDFjWi,j terms excluding zj of these terms, in the case where one term (the j=peaki-th term) has a large value and the other terms are small enough to be regarded as zero, xoutDFi is approximated by the following expression.

xoutDFi≈xinDFpeakizpeakiWi,peaki+bi
Since random variable zpeaki={0, 1}, xoutDFi has a value with two branches as shown below.

xoutDFi=xinDFpeakiWi,peaki+bi (when zpeaki=1, with probability 1−pdrop)
xoutDFi=bi (when zpeaki=0, with probability pdrop)
The probability density function obeyed by xoutDFi given by this expression is the following, when using δ function and simplifying X=xoutDFi.

PDF(X)=(1−pdrop)δ(X−(xinDFpeakiWi,peaki+bi))+pdropδ(X−bi)
Likewise, δ function can be used in the case where the number of peak terms is 2 or more. In the case where two terms (two peak terms) of the xinDFjWi,j terms each have a large value and the other terms are small enough to be regarded as zero, the two corresponding random variables zpeak each take the value 0 or 1, so that xoutDFi has 2^2=4 branches and its probability density function is a sum of four δ functions weighted by the corresponding branch probabilities.
The distribution computation method in “type 2” is described next. As in the above-mentioned case, the i-th element xoutDFi of output XoutDF corresponding to input XinDF is expressed as follows.
xoutDFi=xinDF1z1Wi,1+xinDF2z2Wi,2+ . . . +xinDFnXinDFznXinDFWi,nXinDF+bi
In “type 2”, the values of the xinDFjWi,j terms excluding zj from among the xoutDFi terms are similar as a whole, with no exceptional peak term.
Suppose xinDFj is not a random variable, but merely a fixed value μxinDFj. zj is a random variable of the Bernoulli distribution. Given that zj=0 with probability pdrop and zj=1 otherwise as mentioned earlier, the part of the sum of the xinDFjzjWi,j terms can be regarded as sampling M=nXinDF(1−pdrop) number of terms from the population of nXinDF number of xinDFjWi,j terms and computing their sum.
Therefore, xoutDFi is a value obtained by adding bias term bi to the sum. Each time this sampling is performed, M number of different xinDFjWi,j are selected, and the value of xoutDFi which is the sum of the M number of different xinDFjWi,j varies each time while forming a distribution function. This is the “fluctuation of sample sum error”.
In “type 2”, no exceptional peak term is included in the nXinDF number of xinDFjWi,j (1≤j≤nXinDF) terms. Hence, the distribution of the values of the nXinDF number of xinDFjWi,j (1≤j≤nXinDF) terms as the population is weak in kurtosis and skewness, so that the central limit theorem holds from Lyapunov's theorem. The sum value xoutDFi fluctuating at each sampling can therefore be regarded as obeying the Gaussian distribution.
Since the distribution of xoutDFi can be regarded as the Gaussian distribution as described above, the distribution profile can be identified once its mean E[xoutDFi] and variance Var(xoutDFi) are known.
In the case where the central limit theorem holds, the variance is typically called “variance of sample sum error”, and can be analytically computed as described in Non Patent Document 2; with N=nXinDF and M=N(1−pdrop), it takes the following form.

Var(xoutDFi)=M((N−M)/(N−1))Varpopulation
Here, Varpopulation is the variance of the nXinDF number of xinDFjWi,j (1≤j≤nXinDF) terms of the population where zj=1.
Meanwhile, mean μDFi is simply obtained as the expected value of the sum. Since xinDFj is fixed value μxinDFj and the mean of zj is (1−pdrop), mean μDFi can be computed according to the following expression.

μDFi=E(xoutDFi)=(1−pdrop)(μxinDF1Wi,1+ . . . +μxinDFnXinDFWi,nXinDF)+bi
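The “type 2” computation can be illustrated with the following minimal NumPy sketch (our own naming; mu_x is the vector of fixed values μxinDFj, w is the i-th weight row, b_i is the bias, and the variance formula is the sample-sum form given above).

    import numpy as np

    def type2_gaussian(mu_x, w, b_i, p_drop):
        """Approximate x_outDFi by a Gaussian via the central limit theorem."""
        terms = mu_x * w                       # population of x_inDFj * W_ij terms
        n = terms.size                         # N = n_XinDF
        m = n * (1.0 - p_drop)                 # expected number of sampled terms
        mean = (1.0 - p_drop) * terms.sum() + b_i
        var_pop = terms.var()                  # population variance (denominator N)
        var = m * (n - m) / (n - 1) * var_pop  # variance of sample sum error
        return mean, var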
Typically, given that xinDFj is not μXinDFj but a random variable of a multivariate distribution, the expression in Non Patent Document 2 is further expanded so that the variance of the sum in the case of the random variable with the value of population also obeying the multivariate distribution is expressed as follows. A proof of this expression is given in supplementary note 1 at the end of this specification. The covariance computation method is also described in supplementary note 2.
Since the distribution with this mean and variance is Gaussian, the probability density function of data output value xoutDFi is expressed as follows.
xoutDFi˜Gauss(μDFi,ΣDFi)
In the embodiment of the present invention, the case where the central limit theorem holds is “type 2”, which is distinguished from “type 1” where the central limit theorem does not hold. “type 2” is mainly found in input data to integrated layer DF near the final output layer of the neural network.
The distribution computation method in “mixed type” which is a mixture of “type 1” and “type 2” is described next.
There are actually instances where “type 1” and “type 2” are mixed: of the xinDFjWi,j terms excluding zj, a few terms, i.e. npeak (npeak<<nXinDF) number of terms, have exceptionally larger absolute values than the other values, and the other remaining (nXinDF−npeak) number of terms cannot be regarded as zero. In these instances, the distribution cannot be computed by focusing only on a few branches from among the 2^nXinDF branches as in “type 1”, or by comprehensively treating the sum of the xinDFjzjWi,j terms as a Gaussian distribution as in “type 2”.
In such a case, in the embodiment of the present invention, first the property is regarded as “type 1” to extract the peak terms and identify branches regarding peak term combinations, and then computation is performed for each branch as conditional “type 2”. This is described below.
First, consider the simplest case where the number of peak terms is 1, i.e., npeak=1. The i-th element xoutDFi of vector XoutDF of the output data is concerned here, as in the foregoing case, and is expressed as follows.

xoutDFi=xinDF1z1Wi,1+ . . . +xinDFpeakizpeakiWi,peaki+ . . . +xinDFnXinDFznXinDFWi,nXinDF+bi
Here, supposing that only the j=peaki-th term (1≤peaki≤nXinDF) in xoutDFi is exceptionally large as in “type 1”, the term is denoted by xinDFpeakizpeakiWi,peaki.
If this peak term xinDFpeakizpeakiWi,peaki were handled as a random variable as it is, the distribution of xoutDFi could not be computed comprehensively. Accordingly, for the exceptional peak term xinDFpeakizpeakiWi,peaki, the value of xinDFpeaki is treated not as a random variable but as a conditional fixed value (e.g. mean μxinDFpeaki), as in “type 1”.
The expression of xoutDFi is divided into two parts as shown below.
xoutDFi=xWDFi+biasDFi
where
xWDFi=xinDF1z1Wi,1+ . . . +xinDFpeaki−1zpeaki−1Wi,peaki−1+xinDFpeaki+1zpeaki+1Wi,peaki+1+ . . . +xinDFnXinDFznXinDFWi,nXinDF
biasDFi=xinDFpeakizpeakiWi,peaki+bi
Here, xWDFi is the part of the sum of the (nXinDF−1) number of xinDFjzjWi,j terms other than the peak term (j≠peaki), and biasDFi is the part consisting of peak term xinDFpeakizpeakiWi,peaki and bias bi.
When zpeaki=1, that is, when peak term xinDFpeakizpeakiWi,peaki is selected, p(zpeaki=1)=1−pdrop, and the above-mentioned two parts are expressed as follows.

xWDFi=xinDF1z1Wi,1+ . . . (j≠peaki) . . . +xinDFnXinDFznXinDFWi,nXinDF
biasDFi=xinDFpeakiWi,peaki+bi
These indicate sampling from a population of a finite number of xinDFjWi,j terms and computing their sum as in “type 2”. The population in this case is the N=(nXinDF−1) number of xinDFjWi,j terms (1≤j≤nXinDF, j≠peaki), whose variance is denoted by Varpopulation. The computation can be regarded as sampling M=N(1−pdrop) number of terms from the population.
As mentioned above, the value xoutDFi of the sum fluctuates at each sampling while forming the Gaussian distribution of mean μ1DFi and variance-covariance Σ1DFi as shown below.

xoutDFi˜Gauss(μ1DFi,Σ1DFi) (when zpeaki=1)
When zpeaki=0, that is, when peak term xinDFpeakizpeakiWi,peaki is dropped out, p(zpeaki=0)=pdrop, and the two parts are expressed as follows.
xWDFi=xinDF1z1Wi,1+ . . . (j≠peaki) . . . +xinDFnXinDFznXinDFWi,nXinDF
biasDFi=bi
The population is equally the N=(nXinDF−1) number of xinDFjWi,j terms, and the sum value fluctuates at each sampling while forming the Gaussian distribution of mean μ2DFi and variance-covariance Σ2DFi.

xoutDFi˜Gauss(μ2DFi,Σ2DFi) (when zpeaki=0)
Thus, in these two cases, the part of xWDFi is the Gaussian distribution, and bias term biasDFi differs between zpeaki=1 and zpeaki=0. When simplifying X=xoutDFi, the probability density function of the value of xoutDFi is the following.

PDF(X)=(1−pdrop)Gauss(X|μ1DFi,Σ1DFi)+pdropGauss(X|μ2DFi,Σ2DFi)
This Gaussian mixture distribution consists of two conditional Gaussian distributions, one for each branch of zpeaki.
The same computation can be performed in the case where the number of peak terms is 2 or more. In the case of two peak terms, for example, the two corresponding random variables zpeak yield 2^2=4 branch conditions, and the output distribution is a mixture of four conditional Gaussian distributions.
As described above, in “mixed type” which is a mixture of “type 1” and “type 2”, the probability density distribution of the output data is represented by (2 raised to the (number of peak terms)-th power) number of Gaussian mixture distributions.
This can be written in general form as follows. In the case where data xoutDFi has npeak (npeak<<nXinDF) number of peak terms xinDFpeakiWi,peaki, there are 2^npeak branch conditions conk (1≤k≤2^npeak), as each peak term corresponds to two cases of being dropped out (zpeaki=0) and not being dropped out (zpeaki=1).
As a result, data X=xoutDFi is defined by a probability density function according to the following conditional Gaussian mixture distribution. In this specification, Xconk denotes that the subscript of X is conk.
X˜PDF(X)=Gauss(Xcon1|con1)P(con1)+ . . . +Gauss(Xcon2^npeak|con2^npeak)P(con2^npeak)
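The enumeration of branch conditions can be sketched as follows (a minimal NumPy illustration of our own, reusing the hypothetical type2_gaussian helper from the earlier sketch; peak_idx is the stored peak list).

    import numpy as np
    from itertools import product

    def mixed_type_components(mu_x, w, b_i, p_drop, peak_idx):
        """Enumerate the 2^n_peak branch conditions of the peak terms and
        return (probability, mean, variance) of each conditional Gaussian."""
        rest = np.setdiff1d(np.arange(mu_x.size), peak_idx)
        components = []
        for bits in product([0, 1], repeat=len(peak_idx)):
            prob = np.prod([1 - p_drop if z else p_drop for z in bits])
            # Peak terms enter as conditional fixed values (z fixed per branch).
            bias = b_i + sum(z * mu_x[j] * w[j] for z, j in zip(bits, peak_idx))
            mean, var = type2_gaussian(mu_x[rest], w[rest], bias, p_drop)
            components.append((prob, mean, var))
        return components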
In the neural network having the multilayer structure, data X needs to be processed individually for each function corresponding to a separate condition in each subsequent layer through which the output data propagates. Besides, in each integrated layer DF, these conditions branch further, increasing the number of functions that need to be computed individually. The number of dropout layers D in one neural network is, however, 3 or less in most cases, so that the technique proposed in the embodiment of the present invention can achieve a practical computation process.
In activation layer A, output data XoutA resulting from input data XinA through activation function f is computed. In detail, the process in activation layer A involves computation according to the following expression.
XoutA=f(XinA)
Input data is a random variable obeying a multivariate distribution. When supplied to activation layer A, it is output as a multivariate distribution distorted by nonlinear activation function f. It is usually difficult to compute what kind of function results when an arbitrary complex function is distorted in this way. If the input function is a known function such as a Gaussian distribution or a δ function, however, the resulting function can be mathematically determined by approximation to a certain extent. To do so, in the embodiment of the present invention, the above-mentioned representation of a mixture of a plurality of “conditional” probability density functions PDF(Xconk|conk) is employed, with each function being expressed by a Gaussian distribution or a δ function for which the computation method is known. This enables the transformation by activation function f to be computed.
Hence, in activation layer A, it suffices to compute transformed f(PDF(Xconk|conk)) by activation function f with each conditional probability density function as shown below.
X˜f(PDF(X))=f(PDF(Xcon1|con1))P(con1)+ . . . +f(PDF(Xcon2^npeak|con2^npeak))P(con2^npeak)
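For example, when activation function f is the ReLU function and a conditional component is a Gaussian distribution, the first moment of the transformed component has a known closed form (a standard result added here for illustration; it is not taken from the description above).

    import math

    def relu_gauss_mean(mu, sigma):
        """E[max(0, X)] for X ~ N(mu, sigma^2), in closed form."""
        if sigma == 0.0:
            return max(0.0, mu)          # degenerate (delta function) case
        a = mu / sigma
        cdf = 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))
        pdf = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
        return mu * cdf + sigma * pdf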
If the layers subsequent to integrated layer DF have no activation layer A and only include simple linear transformation layers, the processes in the subsequent layers may be performed by approximating the mixture distribution to one distribution of up to second moment. In the case where some Gaussian functions in the Gaussian mixture overlap (e.g. individual distributions are similar), too, a speed-up technique such as combining into one Gaussian function may be employed.
In detail, suppose the multivariate Gaussian mixture distribution is expressed as follows.

X˜PDF(X)=Gauss(Xcon1|con1)P(con1)+ . . . +Gauss(Xcon2^npeak|con2^npeak)P(con2^npeak)
Regarding the k1-th Gaussian function Gauss(Xconk1|conk1) and the k2-th Gaussian function Gauss(Xconk2|conk2), in the case where their means and variances are close in value, for example, merging into one Gaussian function Gauss(Xconk_1_2|conk_1_2) as shown below can reduce the number of mixture distributions and lighten the computation process. In this specification, Xconk1 denotes that the subscript of X is conk1, Xconk2 denotes that the subscript of X is conk2, and Xconk_1_2 denotes that the subscript of X is conk_1_2.
Gauss(Xconk1|conk1)P(conk1)+Gauss(Xconk2|conk2)P(conk2)≈Gauss(Xconk_1_2|conk_1_2)(P(conk1)+P(conk2))
For example, two Gaussian functions can be merged by the computation process. When the mean and deviation of Gaussian function Gauss(Xconk1|conk1) before merging are denoted respectively by μk1 and σk1 and the mean and deviation of Gaussian function Gauss(Xconk2|conk2) before merging are denoted respectively by μk2 and σk2, then the mean μk_1_2 and deviation σk_1_2 of Gaussian function Gauss(Xconk_1_2|conk_1_2) after merging can be computed as follows.
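A standard moment-matching computation consistent with this description (our assumption; w1 and w2 denote the normalized weights P(conk1)/(P(conk1)+P(conk2)) and P(conk2)/(P(conk1)+P(conk2))) is the following.

μk_1_2=w1·μk1+w2·μk2
σk_1_2^2=w1·(σk1^2+μk1^2)+w2·(σk2^2+μk2^2)−μk_1_2^2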
In any case, eventually the mixed multivariate distribution of data output from the output layer of the neural network is approximated to one distribution function of up to second moment, and its variance is computed as the confidence interval of the final estimation output result.
The following describes the process in the information estimation apparatus 10, with reference to a flowchart of the overall process.
The input data to the neural network is supplied to the estimated confidence interval computation unit 20 in the information estimation apparatus 10 (step S11). The estimated confidence interval computation unit 20 is configured to perform the process in the order of the plurality of layers constituting the neural network. The input data is accordingly supplied to the input layer which is the first layer, to start the process in the neural network (step S12).
The estimated confidence interval computation unit 20 determines whether or not the layer supplied with the input data is an FC layer with dropout (integrated layer DF) (step S13). In the case where the layer is an integrated layer DF, the estimated confidence interval computation unit 20 performs a data analysis and computation process in cooperation with the data analysis unit 30 (step S14); the process in step S14 will be described later. In the case where the layer is not an integrated layer DF, the estimated confidence interval computation unit 20 performs the computation process defined in the layer as usual (step S15).
After the computation process in step S14 or S15 is completed, the output data resulting from the computation process is supplied to the next layer as the input data to the next layer (step S16). In the case where the next layer is the final output layer (step S17: “YES”), the variance of conditionally separated multivariate distributions is computed as one combined variance, and output from the output layer (step S18). In the case where the next layer is not the final output layer (step S17: “NO”), the process returns to step S13 to perform the computation process in the next layer.
The following describes the data analysis and computation process in step S14 in detail.
In the data analysis and computation process, input data XinDF to integrated layer DF, weight WDF, and bias bDF are first acquired (step S141).
The estimated confidence interval computation unit 20 and the data analysis unit 30 then perform a type determination and computation process for each element from i=1 to i=nXoutDF (i.e. all rows from the first to nXoutDF-th rows), for the i-th element xoutDFi of vector XoutDF of the output data computed using input data XinDF, weight WDF, and bias bDF as described above. In detail, the estimated confidence interval computation unit 20 and the data analysis unit 30 first set i=1 (step S142), and perform the type determination and computation process for the i-th element xoutDFi from among the nXoutDF number of elements (step S143). The type determination and computation process in step S143 will be described later.
After the type determination and computation process in step S143 is completed, in the case where the processing target xoutDFi is the last row (i.e. i=nXoutDF) (step S144: “YES”), the data analysis and computation process ends. In the case where the processing target xoutDFi is not the last row (step S144: “NO”), i is incremented (i.e. i=i+1) (step S145), and the process returns to step S143 to perform the type determination and computation process for xoutDFi of the next row.
The following describes the type determination and computation process in step S143 in detail.
The data analysis unit 30 first computes the nXinDF number of μxinDFjWi,j terms (1≤j≤nXinDF) for the i-th element xoutDFi and their standard deviation σμW, and determines whether or not any term satisfies the peak term condition described above.
In the case where no term satisfies |μxinDFjWi,j|≥σμW·Dratio (step S1435: “NO”), the data analysis unit 30 determines the i-th element xoutDFi as “type 2”, and the estimated confidence interval computation unit 20 performs the computation process using the distribution computation method in “type 2” (step S1436). The computation process in “type 2” in step S1436 is as described above, and the multivariate sampling error sum is computed for all of the nXinDF number of xinDFjWi,j terms.
In the case where any term satisfies |μxinDFjWi,j|≥σμW·Dratio (step S1435: “YES”), a predetermined number of terms at the maximum are extracted in descending order of |μxinDFjWi,j| and stored as a peak list (step S1437). The data analysis unit 30 then determines whether or not the remaining terms other than the peak terms stored as the peak list are small enough to be regarded as zero (step S1438).
In the case where the remaining terms are small enough to be regarded as zero (step S1438: “YES”), the data analysis unit 30 determines the i-th element xoutDFi as “type 1”, and the estimated confidence interval computation unit 20 performs the computation process using the distribution computation method in “type 1” (step S1439). The computation process in “type 1” in step S1439 is as described above. For example, the computation is performed concerning all of the 2^npeak cases at the maximum involving, for each of the maximum npeak number of μxinDFjWi,j terms stored as the peak list, the case where the term is dropped out and the case where the term is not dropped out.
In the case where the remaining terms are not small enough to be regarded as zero (step S1438: “NO”), the data analysis unit 30 determines the i-th element xoutDFi as “mixed type”, and the estimated confidence interval computation unit 20 performs the computation process using the distribution computation method in “mixed type” (step S1440). The computation process in “mixed type” in step S1440 is as described above. For example, the computation is performed concerning all of the 2^npeak cases at the maximum involving, for each of the maximum npeak number of μxinDFjWi,j terms stored as the peak list, the case where the term is dropped out and the case where the term is not dropped out. Further, the multivariate sampling error sum is computed for all remaining xinDFjWi,j terms.
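Putting the branches of steps S1435 to S1440 together (a minimal sketch reusing the helper functions introduced in the earlier sketches; the zero-check thresholds eps_mean and eps_var are placeholder parameters of our own):

    import numpy as np
    from itertools import product

    def determine_and_compute(mu_x, w, b_i, p_drop, d_ratio,
                              eps_mean=1e-6, eps_var=1e-6):
        """Type determination and computation for one output element x_outDFi,
        returning (probability, mean, variance) mixture components."""
        peak_idx = extract_peak_list(mu_x, w, d_ratio)
        if peak_idx.size == 0:
            # Step S1436 ("type 2"): one Gaussian from the sampling-error sum.
            return [(1.0,) + type2_gaussian(mu_x, w, b_i, p_drop)]
        rest = np.setdiff1d(np.arange(mu_x.size), peak_idx)
        rest_terms = mu_x[rest] * w[rest]
        if abs(rest_terms.mean()) < eps_mean and rest_terms.var() < eps_var:
            # Step S1439 ("type 1"): delta functions over the peak branches.
            comps = []
            for bits in product([0, 1], repeat=len(peak_idx)):
                prob = np.prod([1 - p_drop if z else p_drop for z in bits])
                mean = b_i + sum(z * mu_x[j] * w[j]
                                 for z, j in zip(bits, peak_idx))
                comps.append((prob, mean, 0.0))
            return comps
        # Step S1440 ("mixed type"): conditional Gaussians over the branches.
        return mixed_type_components(mu_x, w, b_i, p_drop, peak_idx)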
The following describes an experiment conducted using the technique proposed in the embodiment of the present invention described above.
With the conventional technique, the fluctuation of the value of y obtained by performing estimation computation MC times for every input x is produced as the variance. Such variance is unstable. With the technique proposed in the embodiment of the present invention, on the other hand, the variance is computed analytically, so that a stable and smooth variance can be produced.
Suppose population yi (1≤i≤N) is a random variable obeying an N-dimensional multivariate Gaussian distribution, as shown below. Here, μy is an N-dimensional vector indicating a mean, and Σy is an N×N variance-covariance matrix.
Y=(y1, . . . yi, . . . yN)˜Gauss(μy,Σy) (1≤i≤N)
The variance of the sample mean error in the case of sampling n samples from this population is computed, where indicator ai={0, 1} denotes whether or not yi is sampled, and the sample mean is (1/n)(a1y1+ . . . +aNyN).
The variance of the sample mean error is expressed as follows.
Since yi is a random variable, yi cannot be taken out of variance Var and covariance Cov. Given that ai and yi are independent, the following expression holds.
Hence, yi as a random variable is represented by expected value E. The following expression is used as in the above.
The part of the first term of the variance of the sample mean error is expressed as follows.
Moreover, the following relational expression holds.
Using this relational expression, the part of the second term of the variance of the sample mean error is expressed as follows.
The mean of yi as a random variable is E(yi). This is a value relating to index i. The mean for all indexes, i.e. the mean of the means, is the following.
Combining these two parts yields the following variance of the sample mean error.
The variance of the sum of the sample population is defined by the following Expression (Formula) 1 using expected value E of yi.
There is no problem with Expression 1. However, Expression 1 uses expected value E of population data yi (1≤i≤N), which is not convenient. It is desirable to represent this using variance Var(yi) and covariance Cov(yi, yj) of each individual value yi of the population data as a random variable. Moreover, while population data yi is a random variable, it is desirable to also use variance Varpopulation(y) as a whole that is expressed as follows if population data yi is a fixed value (employing mean E(yi)).
In view of these demands, the variance of the sum of the sample population is expressed using Varpopulation(y) and variance Var(yi) and covariance Cov(yi, yj) of the population data as a random variable. This leads to the following expression. A proof that this expression is equivalent to Expression 1 is given below.
First, the expression is modified using the following expressions.
The expression is then modified using the following expressions.
The expression is further modified using the following expressions.
The coefficients of the first and second terms in the expression given above are as follows.
Using the above, the expression is modified as follows.
This expression is equal to the variance of the sum of the sample population in Expression 1 computed above; that is, the proposed expression and Expression 1 satisfy the relationship Proposed Var = Expression 1 Var.
The conclusion is summarized below. Suppose there are N finite number of population data yi (1≤i≤N), and these data yi are not fixed values but random variables obeying N-dimensional multivariate Gaussian distribution as shown below. Here, μy is an N-dimensional vector indicating a mean, and Σy is an N×N variance-covariance matrix.
Y=(y1, . . . , yi, . . . , yN)˜Gauss(μy,Σy) (1≤i≤N)
The variance of sample sum error in the case of sampling n random variables from the population of N random variables is as follows.
Here, variance Var(yi) and covariance Cov(yi, yj) are the variance-covariance of the population of random variables obtained from the variance-covariance matrix. Variance Varpopulation(y) is assumed to be the variance of the sample sum error in the case where each element of the population is not a random variable (its value being assumed to be mean E(yi)), and is expressed as follows.
The covariance can be computed in the same way as the variance. Suppose two populations Y1 and Y2 are random variables obeying N-dimensional multivariate Gaussian distribution as shown below. Here, μ1y and μ2y are each an N-dimensional vector indicating a mean, and Σ1y and Σ2y are each an N×N variance-covariance matrix.
Y1=(y11, . . . , y1i, . . . y1N)˜Gauss(μ1y,Σ1y) (1≤i≤N)
Y2=(y21, . . . , y2i, . . . y2N)˜Gauss(μ2y,Σ2y) (1≤i≤N)
Covariance cov(Y1, Y2) of the sample mean error in the case of sampling n samples in the state where Y1 and Y2 are synchronized with respect to index i (i.e. when y1i is sampled, y2i is sampled, too) is computed.
The covariance can be represented by the variance using the following expression.

Cov(Y1,Y2)=(Var(Y1+Y2)−Var(Y1)−Var(Y2))/2
Var(Y1) and Var(Y2) are the variances of the sample mean errors mentioned above for populations Y1 and Y2 respectively, and so are computable.
Var(Y1+Y2) is the variance of the sample mean error from new population Y_1_2 expressed as follows, obtained by adding the corresponding terms of populations Y1 and Y2 together.
Y_1_2=(y11+y21, . . . , y1i+y2i, . . . , y1N+y2N)˜Gauss(μ_1_2y,Σ_1_2y) (1≤i≤N)
The variance of the sample mean error from this population can be computed by the above-mentioned method, by regarding the respective terms as one term y_1_2i, where y_1_2i=y1i+y2i.
The present invention enables stable and fast computation of a variance representing a confidence interval for an estimation result in an estimation apparatus using a neural network, and is applicable to all neural network-related technologies. The present invention also achieves a wider range of application of neural networks, and is highly effective in environments where fast and reliable processing is required, e.g. estimation for mobile objects such as cars or pedestrians.