The present application claims the benefit of priority from Japanese Patent Application No. 2019-127103 filed on Jul. 8, 2019. The entire disclosure of the above application is incorporated herein by reference.
The present disclosure relates to a technique for performing neural network learning.
A neural network is a type of machine learning. In machine learning, sample data derived from a sensor, a database, or the like is input and analyzed, and useful rules, knowledge expressions, determination criteria, or the like are extracted from the data to develop an algorithm. Neural network learning is often performed by providing correct answer data (supervised learning) and gradually learning the parameters of the neural network so as to minimize errors with respect to the correct answer data (error back propagation method).
In a method or a device for learning of a neural network, a mathematical expression that represents an output with respect to an input in each layer of the neural network may be calculated, and may be expressed by F(X) = K(W^T X), where the output is defined as F, the input is defined as X, K is defined as a nonlinear conversion, and W is defined as a parameter matrix. Multiple eigenvalues of a matrix obtained by inputting the parameter matrix to the input of the mathematical expression and squaring the resulting matrix may be calculated as multiple square eigenvalues.
The above and other features and advantages of the present disclosure will become more apparent from the following detailed description made with reference to the accompanying drawings.
When supervised learning is performed by the error back propagation method, especially when a deep neural network is learned (deep learning), an error (gradient) may vanish (gradient vanishment) or may become excessively large (gradient explosion) in the process of propagating the error to be minimized through the deep hierarchy. If the gradient vanishment or the gradient explosion occurs, learning of the neural network may not succeed.
One example of the present disclosure provides a technique for diagnosing a gradient vanishment or a gradient explosion in neural network learning. Another example of the present disclosure provides a technique for preventing the gradient vanishment or the gradient explosion from occurring during learning.
According to one example embodiment, a diagnostic method includes: calculating, in learning of a neural network, a mathematical expression that represents an output for an input in each layer of the neural network and is expressed by the following mathematical expression (1) when the output is defined as F, the input is defined as X, nonlinear conversion is defined as K, and a parameter matrix is defined as W; calculating, as multiple square eigenvalues, multiple eigenvalues of a matrix obtained by inputting the parameter matrix to the input of the mathematical expression and squaring the resulting matrix; and determining a gradient vanishment or a gradient explosion based on a distribution of the multiple square eigenvalues.
F(X) = K(W^T X)   [Mathematical expression (1)]
The present inventor has found that the eigenvalues of a conversion matrix of each layer can be used to determine whether the parameters are in a state that causes a gradient vanishment or a gradient explosion. In the present disclosure, the determination is not performed based on the gradient itself. Instead, the conversion matrix is used to determine whether the parameters of the neural network are in a state that causes the gradient vanishment or the gradient explosion. Here, since the conversion matrix is a matrix obtained by inputting the parameter matrix W to the input X, the conversion matrix is expressed by the following mathematical expression (2).
Σ_{K,W} = F(W) = K(W^T W)   [Mathematical expression (2)]
Since the nonlinear conversion K is applied to the conversion matrix, the signs of its eigenvalues are unknown. Therefore, in the present disclosure, the eigenvalues (called square eigenvalues) of the matrix obtained by squaring the conversion matrix are defined, and the gradient vanishment and the gradient explosion are diagnosed based on the distribution of the square eigenvalues.
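As a concrete illustration of the quantities defined above, the following is a minimal numerical sketch, assuming the nonlinear conversion K is applied elementwise (ReLU is used here purely as an example); the function names and matrix dimensions are illustrative and not part of the disclosure.

```python
import numpy as np

def relu(x):
    """Example nonlinear conversion K, applied elementwise."""
    return np.maximum(x, 0.0)

def square_eigenvalues(W, K=relu):
    """Square eigenvalues of one layer with parameter matrix W.

    Conversion matrix (mathematical expression (2)): Sigma = K(W^T W).
    The square eigenvalues are the eigenvalues of Sigma squared, which are
    non-negative, so their distribution can be examined regardless of the
    signs of Sigma's own eigenvalues.
    """
    sigma = K(W.T @ W)                   # conversion matrix, expression (2)
    squared = sigma @ sigma              # squared conversion matrix
    return np.linalg.eigvalsh(squared)   # symmetric when K is elementwise

# Illustrative 4x3 parameter matrix.
W = np.random.default_rng(0).normal(size=(4, 3))
print(square_eigenvalues(W))
```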
According to another example embodiment, a learning method learns a neural network model and includes repeatedly: calculating a mathematical expression that represents an output for an input in each layer of a neural network and is expressed by the following mathematical expression (3) when the output is defined as F, the input is defined as X, nonlinear conversion is defined as K, and a parameter matrix is defined as W; calculating multiple eigenvalues of a matrix obtained by inputting the parameter matrix to the input of the mathematical expression and squaring the matrix as multiple square eigenvalues; and learning the neural network model by utilizing a loss function including a penalty for controlling the multiple square eigenvalues.
F(X) = K(W^T X)   [Mathematical expression (3)]
With the inclusion of a penalty for controlling the square eigenvalues in the loss function in this manner, the square eigenvalues can be controlled, and learning with reduced occurrence of the gradient vanishment or the gradient explosion can be implemented.
According to the present disclosure, learning with reduced occurrence of the gradient vanishment or gradient explosion can be implemented.
Hereinafter, a diagnostic method and a learning method according to an embodiment of the present disclosure will be described. In the following description, a method for diagnosing the occurrence of a gradient vanishment and a learning method in which the occurrence of the gradient vanishment is reduced will be described.
(Neural Network)
A neural network has one or more layers between an input layer and an output layer, and has a structure in which an output from each layer is input to a next layer.
A middle node group Z1 outputs a value corresponding to the input value. In those nodes, a value corresponding to the input value is output by nonlinear conversion utilizing a sigmoid function, a ReLU function, or the like. The nonlinear conversion is expressed by K(X). The function used in this case is not limited to the sigmoid function and the ReLU function, and various functions such as a truncated power function and a step function can be used.
Therefore, the input/output conversion performed in each layer of the neural network can be expressed by the following mathematical expression (4).
F(X) = K(W^T X)   [Mathematical expression (4)]
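For reference, the layer conversion of the mathematical expression (4) can be written directly in code; the following is a small sketch using a sigmoid nonlinear conversion and illustrative dimensions.

```python
import numpy as np

def sigmoid(x):
    """Example nonlinear conversion K (sigmoid)."""
    return 1.0 / (1.0 + np.exp(-x))

def layer_forward(W, X, K=sigmoid):
    """Input/output conversion of one layer, F(X) = K(W^T X) (expression (4))."""
    return K(W.T @ X)

# Illustrative dimensions: 3 input nodes, 2 output nodes, 5 samples per batch.
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 2))        # parameter matrix
X = rng.normal(size=(3, 5))        # column-wise input samples
print(layer_forward(W, X).shape)   # (2, 5)
```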
(Diagnostic Methods)
In the diagnostic method according to the embodiment, first, a conversion expression (the above mathematical expression (1)) of input and output in each layer of the neural network model being learned is obtained, and a conversion matrix (the above mathematical expression (2)) is obtained from the conversion expression (S10). Next, the eigenvalues of the matrix obtained by inputting the parameter matrix W to the input X of the conversion expression and squaring the resulting matrix are obtained as square eigenvalues (S11), and it is determined whether a gradient vanishment occurs based on the distribution of the square eigenvalues (S12). Multiple square eigenvalues exist for the conversion matrix of each layer. When the square eigenvalues are widely distributed from a large value to a small value, the parameters of the corresponding layer are not degenerated and the gradient vanishment is unlikely to occur. Conversely, when the values of all the square eigenvalues become too small and the parameters are degenerated, the gradient vanishment is likely to occur.
In the present embodiment, the following criteria are used in order to determine the distribution of the square eigenvalues.
(1) Ratio of Square Eigenvalues
As the ratio of the square eigenvalues, for example, the ratio of the maximum square eigenvalue to the minimum square eigenvalue may be taken, and whether the ratio is larger than a predetermined threshold may be determined. When the ratio is larger than the predetermined threshold, it may be determined that the square eigenvalues are widely distributed.
(2) Absolute Value of Square Eigenvalue
As an absolute value of the square eigenvalue, the absolute value of the maximum square eigenvalue may be used. When the maximum square eigenvalue is larger than a predetermined threshold, it is determined that the square eigenvalues are widely distributed. The smallest square eigenvalue may also be used to determine whether it is very close to 0. When the smallest square eigenvalue is very close to 0, the column vectors of the linear conversion are not linearly independent, so that the gradient vanishment occurs. Whether the square eigenvalue is very close to 0 can be determined by whether the difference between the square eigenvalue and 0 is equal to or less than a predetermined threshold.
(3) Variance of Square Eigenvalue
When the variance of the square eigenvalues is larger than a predetermined threshold, it may be determined that the square eigenvalues are widely distributed.
(4) Average of square eigenvalues
When an average of the square eigenvalues is larger than a predetermined threshold, it may be determined that the square eigenvalues are widely distributed.
Although an example of the determination criterion for determining the distribution of the square eigenvalues has been described above, other criteria for determining whether the square eigenvalues are widely distributed are also conceivable.
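As one way to put criteria (1) to (4) above into practice, the following sketch evaluates them for a given set of square eigenvalues; every threshold value here is an illustrative assumption, not a value specified by the disclosure.

```python
import numpy as np

def distribution_checks(square_eigenvalues,
                        ratio_threshold=1e3,
                        max_threshold=1e-2,
                        zero_threshold=1e-8,
                        var_threshold=1e-4,
                        mean_threshold=1e-4):
    """Evaluate criteria (1)-(4) for one layer; True means 'widely distributed'."""
    ev = np.sort(np.asarray(square_eigenvalues))
    return {
        # (1) ratio of the maximum square eigenvalue to the minimum one
        "ratio": ev[-1] / max(ev[0], zero_threshold) > ratio_threshold,
        # (2) absolute value: maximum large enough and minimum not close to 0
        "absolute": ev[-1] > max_threshold and ev[0] > zero_threshold,
        # (3) variance of the square eigenvalues
        "variance": np.var(ev) > var_threshold,
        # (4) average of the square eigenvalues
        "average": np.mean(ev) > mean_threshold,
    }
```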
In the diagnostic method according to the present embodiment, after determining whether a gradient vanishment occurs for a certain layer, it is determined whether the gradient vanishment has been determined for all layers of the neural network model (S13). If the determination has not been made for all layers (NO in S13), the process returns to S12, and the gradient vanishment is determined for the next layer based on the distribution of its square eigenvalues.
When the gradient vanishment has been determined for all the layers (YES in S13), the determination result is output (S14). When no gradient vanishment occurs in any layer, it is determined that the neural network does not lose the gradient; when even one layer loses the gradient, it is determined that the neural network loses the gradient. The determination result is then output (S14). When outputting the determination result, the distribution state of the square eigenvalues may be displayed in a graph.
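The overall flow S10 to S14 can be sketched as follows for a list of per-layer parameter matrices; the choice of tanh as the nonlinear conversion and the near-zero threshold are assumptions made only for illustration.

```python
import numpy as np

def diagnose_gradient_vanishment(weight_matrices, K=np.tanh, zero_threshold=1e-8):
    """Return (vanished, layer_indices) for a neural network model.

    A layer is judged to lose the gradient when its smallest square eigenvalue
    is too close to 0; the network is judged to lose the gradient when even one
    layer does.
    """
    vanished_layers = []
    for index, W in enumerate(weight_matrices):   # S13: check every layer
        sigma = K(W.T @ W)                        # S10: conversion matrix (2)
        ev = np.linalg.eigvalsh(sigma @ sigma)    # S11: square eigenvalues
        if ev.min() <= zero_threshold:            # S12: distribution check
            vanished_layers.append(index)
    return len(vanished_layers) > 0, vanished_layers   # S14: output the result

rng = np.random.default_rng(0)
layers = [rng.normal(size=(8, 6)), rng.normal(size=(6, 4))]
print(diagnose_gradient_vanishment(layers))
```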
(Learning Device)
(Loss Function)
In the present embodiment, the loss function is a function for preventing the square eigenvalues from becoming too small. If all eigenvalues are greater than 0 (Positive Definite), then all column vectors of the matrix are linearly independent of each other. A loss function is used to ensure linear independence of the matrix so that the eigenvalues are not too small.
As a method of normalization, the determinant of the conversion matrix (denoted as "Σ_{K,W}") is used. Maximizing the determinant of the squared conversion matrix is equivalent to minimizing the logarithm of the determinant of its inverse, as in the following mathematical expression (5).
max det(Σ_{K,W}^2) ⇔ min log det(Σ_{K,W}^{-2})   [Mathematical expression (5)]
Assuming that the eigenvalues λ_i of the matrix Σ_{K,W}^2 are obtained, the following mathematical expression is satisfied.
Σ_{K,W}^2 = Q Λ Q^T,   Q Q^T = I,   Λ_{i,i} = λ_i   [Mathematical expression (6)]
Since the determinant of a matrix is equal to the product of its eigenvalues, the logarithm of the inverse determinant is expressed by the sum of the logarithms of the eigenvalues as in the following mathematical expression (7).

φ(Λ) = log det(Σ_{K,W}^{-2}) = -Σ_i log λ_i   [Mathematical expression (7)]
Here, the property of φ(Λ) = -Σ_i log λ_i in the mathematical expression (7) will be described. In -log λ_i, as λ_i approaches 0, the function φ(Λ) approaches +∞ (logarithmic barrier). By use of this property, a penalty that keeps each square eigenvalue away from 0 can be included in the loss function.
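A minimal sketch of this logarithmic-barrier penalty, computed from the square eigenvalues of one layer, is shown below; the tanh nonlinear conversion and the eps safeguard are illustrative assumptions, and in practice the penalty would be added to the task loss with a weighting coefficient.

```python
import numpy as np

def log_det_penalty(W, K=np.tanh, eps=1e-12):
    """Logarithmic barrier phi(Lambda) = -sum_i log(lambda_i), where lambda_i
    are the square eigenvalues of the layer (expression (7)). The penalty grows
    toward +inf as any square eigenvalue approaches 0."""
    sigma = K(W.T @ W)                           # conversion matrix, expression (2)
    ev = np.linalg.eigvalsh(sigma @ sigma)       # square eigenvalues lambda_i
    return -np.sum(np.log(np.maximum(ev, eps)))  # eps only avoids log(0) numerically
```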
Each of terms 1 to 5 on the right side of the gradient expression can be calculated by the following mathematical expressions. The tr( ) in the first expression below is the trace of the matrix, that is, the sum of the main diagonal components of the matrix. The "∘" in the mathematical expression (9) denotes the Hadamard product.
In the above mathematical expression, the following abbreviations are used.
From the above description, the gradient indicated in the above mathematical expression (8) is obtained as follows.
A negative multiple of the above gradient is added to the update expression of W as the contribution of the loss function when the parameter is updated. As a result, the parameter matrix W can be moved in the direction opposite to the gradient.
Incidentally, the update expression including the loss function obtained in the mathematical expression (11) has a large calculation amount. Therefore, as a modification, low-rank approximation may be performed focusing only on the small eigenvalues, and a predetermined number of square eigenvalues taken in ascending order may be used for the calculation of the penalty.
To further reduce the amount of calculation, only the smallest eigenvalue may be used to generate the following loss function:
In the mathematical expression (13), λmin is the smallest eigenvalue and vmin is the eigenvector corresponding to the smallest eigenvalue.
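Because the mathematical expression (13) is not reproduced above, the following is only one plausible reading of the reduced-cost variants: a logarithmic barrier applied to the k smallest square eigenvalues (the low-rank approximation), with k = 1 corresponding to using only λmin and its eigenvector vmin. It is a hedged sketch, not the exact formula of the disclosure.

```python
import numpy as np

def small_eigenvalue_penalty(W, K=np.tanh, k=1, eps=1e-12):
    """Barrier on the k smallest square eigenvalues; k=1 uses only lambda_min.

    Returns the penalty together with lambda_min and v_min, which appear in the
    gradient of the penalty with respect to the parameters. This is one
    possible interpretation, not the exact formula of expression (13).
    """
    sigma = K(W.T @ W)                                # conversion matrix (2)
    eigvals, eigvecs = np.linalg.eigh(sigma @ sigma)  # ascending order
    penalty = -np.sum(np.log(np.maximum(eigvals[:k], eps)))
    lam_min, v_min = eigvals[0], eigvecs[:, 0]
    return penalty, lam_min, v_min
```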
Although the configuration of the learning device 1 according to the present embodiment has been described above, an example of hardware of the learning device 1 described above is a computer including a CPU, a RAM, a ROM, a hard disk, a display, a keyboard, a mouse, a communication interface, and the like. The learning device 1 is implemented by storing a program having modules for realizing the functions described above in the RAM or the ROM and executing the program by the CPU. The program described above also falls within the scope of the present disclosure.
The learning device 1 obtains a conversion expression (the above mathematical expression (1)) of input and output in each layer of the neural network model being learned, and obtains a conversion matrix (the above mathematical expression (2)) from the conversion expression (S22). Next, the eigenvalues of the matrix obtained by inputting the parameter matrix W to the input X of the conversion expression and squaring the resulting matrix are obtained as the square eigenvalues (S23), and a loss function is generated by adding a penalty that prevents the square eigenvalues from becoming 0 (S24). The calculation of such a penalty is described above.
Next, the learning device 1 updates the parameter of the neural network by the error back propagation method by use of the generated loss function (S25). Next, the learning device 1 determines whether the gradient vanishment occurs in each layer of the neural network whose parameters have been updated, by use of the diagnostic method of the present embodiment described above (S26). In the above flowchart, the reason why the determination of the gradient vanishment is drawn by a dotted line is that the determination of the gradient vanishment does not need to be performed every time the parameter is updated, but may be performed, for example, when the learning of one to several epochs is completed.
As a result of the determination, if the gradient vanishment occurs (YES in S26), the learning device 1 ends the learning process. At this time, the parameters before the update may be stored, and after the learning is aborted, the parameters immediately before the gradient vanishment started to occur may be restored (S28). S28 of returning to the immediately preceding parameters is optional.
When the gradient vanishment does not occur (NO in S26), it is determined whether the learning is continued (S27). Whether to continue the learning can be determined according to whether the update of the parameters has converged. If the learning is to be continued (YES in S27), the process returns to the inference process and the above-described process is repeated. If the learning is not to be continued (NO in S27), the learning process is terminated. The learning device 1 may calculate the square eigenvalues in each layer of the neural network and display the distribution of the square eigenvalues in a timely manner or in response to a request from a user.
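The learning flow S22 to S28 can be sketched end to end for a single-layer toy regression. To keep the penalty gradient in closed form, the sketch assumes the identity nonlinear conversion (so the gradient of -log det((W^T W)^2) is -4 W (W^T W)^{-1}); the learning rate, penalty weight, check interval, and data are all illustrative assumptions rather than values from the disclosure.

```python
import numpy as np

def penalty_gradient(W):
    """Gradient of -log det((W^T W)^2) with respect to W, assuming K = identity
    for brevity: -4 W (W^T W)^{-1}."""
    return -4.0 * W @ np.linalg.inv(W.T @ W)

def vanishment_detected(W, zero_threshold=1e-8):
    """S26 for this single layer: smallest square eigenvalue too close to 0."""
    sigma = W.T @ W                                    # K = identity here
    return np.linalg.eigvalsh(sigma @ sigma).min() <= zero_threshold

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))          # inputs (3 features, 100 samples)
Y = rng.normal(size=(2, 100))          # targets of a toy regression task
W = rng.normal(size=(3, 2))            # single-layer parameter matrix
lr, weight = 1e-2, 1e-3                # learning rate and penalty weight

for step in range(200):
    # S25: gradient of the task loss ||W^T X - Y||^2 plus the penalty term
    task_grad = 2.0 * X @ (W.T @ X - Y).T / X.shape[1]
    grad = task_grad + weight * penalty_gradient(W)
    W_prev = W.copy()                  # keep the pre-update parameters for S28
    W = W - lr * grad
    if step % 50 == 0 and vanishment_detected(W):   # S26, checked periodically
        W = W_prev                     # S28: return to the preceding parameters
        break                          # abort learning (YES in S26)
```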
Since the learning device 1 according to the present embodiment performs learning by use of a loss function including the penalty for preventing the square eigenvalue of each layer of the neural network from becoming 0, the independence of the linear conversion in each layer can be ensured and the occurrence of gradient vanishment can be reduced.
The learning device 1 according to the present embodiment determines whether the gradient vanishment occurs based on the distribution of the square eigenvalues of each layer, and when the gradient vanishment occurs, the learning is terminated, so that the learning can be terminated as soon as the gradient vanishment begins to occur.
In the present embodiment, the method for diagnosing the gradient vanishment and the learning device 1 for reducing the occurrence of the gradient vanishment have been described. Alternatively, the gradient explosion can be diagnosed or learning with reduced gradient explosion can be implemented by finding the square eigenvalues of each layer of the neural network.
If the square eigenvalues are too large, a gradient explosion is likely to occur. Whether the gradient explosion is likely to occur can be determined based on whether the square eigenvalue is equal to or more than a predetermined threshold. In addition, with the inclusion of a penalty for preventing the square eigenvalues from becoming too large in the loss function, learning with reduced occurrence of the gradient explosion can be performed. Further, the loss function can be generated by performing the low-rank approximation in the same manner as in the embodiments described above; when the occurrence of the gradient explosion is to be reduced, a predetermined number (including one) of square eigenvalues taken in descending order are used for the calculation of the penalty.
In the diagnosis, when the square eigenvalues are widely distributed from a large value to a small value, the parameters of the corresponding layer are not degenerate, and gradient explosion is unlikely to occur. Conversely, if the value of the square eigenvalue is too large and the parameter is diverging, a gradient explosion is likely to occur.
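For the gradient explosion side, a sketch of both the diagnosis threshold and a penalty on the largest square eigenvalues is shown below; the threshold value, the number of eigenvalues used, and the tanh nonlinear conversion are illustrative assumptions.

```python
import numpy as np

def explosion_penalty(W, K=np.tanh, top_k=1):
    """Penalty on the top_k largest square eigenvalues (taken in descending
    order) so that they do not grow too large."""
    sigma = K(W.T @ W)
    ev = np.linalg.eigvalsh(sigma @ sigma)   # ascending order
    return np.sum(ev[-top_k:])

def explosion_suspected(W, K=np.tanh, threshold=1e3):
    """Diagnosis: the largest square eigenvalue at or above a threshold suggests
    that a gradient explosion is likely to occur."""
    sigma = K(W.T @ W)
    return np.linalg.eigvalsh(sigma @ sigma).max() >= threshold
```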
In the embodiments described above, an all-coupled neural network has been described as an example, but the present disclosure can also be applied to a convolutional neural network. A convolutional neural network can be considered as a matrix product of multiple pieces of data cropped by a sliding window and multiple filters. Therefore, in the convolutional neural network, as in the case of the all-coupled neural network described above, the conversion in each layer can be expressed in the form of the conversion expression of the mathematical expression (1) described above.
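A sketch of this view of a convolution layer is shown below: the data cropped by a sliding window is arranged into a matrix (an im2col-style rearrangement) so that the layer takes the form of the conversion expression (1); the single-channel input, stride of 1, and tanh nonlinear conversion are illustrative assumptions.

```python
import numpy as np

def conv_as_matrix_product(image, filters, K=np.tanh):
    """Rewrite a convolution as K(W^T X): X collects the sliding-window crops
    column by column, and W collects the flattened filters."""
    kh, kw, out_channels = filters.shape
    h, w = image.shape
    patches = [image[i:i + kh, j:j + kw].ravel()
               for i in range(h - kh + 1)
               for j in range(w - kw + 1)]
    X = np.array(patches).T                      # (kh*kw, number of windows)
    W = filters.reshape(kh * kw, out_channels)   # filter matrix
    return K(W.T @ X)                            # same form as expression (1)

rng = np.random.default_rng(0)
image = rng.normal(size=(5, 5))
filters = rng.normal(size=(3, 3, 2))                   # two 3x3 filters
print(conv_as_matrix_product(image, filters).shape)    # (2, 9)
```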
The present disclosure is useful, for example, as a technique for learning the neural network.
The methods described in the present disclosure may be implemented by a special purpose computer created by configuring a memory and a processor programmed to execute one or more particular functions embodied in computer programs. Alternatively, the methods described in the present disclosure may be implemented by a special purpose computer created by configuring a processor provided by one or more special purpose hardware logic circuits. Alternatively, the methods described in the present disclosure may be implemented by one or more special purpose computers created by configuring a combination of a memory and a processor programmed to execute one or more particular functions and a processor provided by one or more hardware logic circuits. The computer programs may be stored, as instructions being executed by a computer, in a tangible non-transitory computer-readable storage medium.
Here, the flowchart or the process of the flowchart described in this application includes a plurality of sections (or steps), and each section is expressed as, for example, S10. Further, each section may be divided into several subsections, while several sections may be combined into one section. Furthermore, each section thus configured may be referred to as a device, module, or means.
While the present disclosure has been described with reference to embodiments thereof, it is to be understood that the disclosure is not limited to the embodiments and constructions. The present disclosure is intended to cover various modifications and equivalent arrangements. In addition, while various combinations and configurations are described, other combinations and configurations, including more, less, or only a single element, are also within the spirit and scope of the present disclosure.