The present invention relates to a trained model generation system, a trained model generation method, an information processing device, a program, a trained model, and an estimation device.
In the related art, a gradient descent method is known as a technique (an optimizer) for calculating parameters such as the weightings of a neural network in machine learning using a neural network or the like. Representative examples of the gradient descent method include the stochastic gradient descent (SGD) method, the momentum SGD method, the adaptive gradient algorithm (Adagrad), the root mean square propagation (RMSprop) method, and the adaptive moment estimation (Adam) method (for example, see Patent Literature 1).
In the SGD method, the gradient approaches zero in the vicinity of a saddle point and thus learning may not progress. Therefore, the momentum SGD method uses a calculation expression including an inertia term, which makes it possible to escape from the vicinity of a saddle point. In these optimizers, however, the learning rate is fixed, and thus it takes a long time for learning to converge. Therefore, Adagrad, the RMSprop method, the Adam method, and the like, which adaptively change the learning rate for each parameter, have been proposed.
In Adagrad, a sum of squares of the gradients in the directions of the parameters is stored in a cache. The learning rate for a rare feature can be set higher by dividing the learning rate by the square root of the cache. However, there are two problems: the learning rate approaches zero because the cache keeps increasing as the epochs progress, and the learning rate in a certain axis direction decreases in subsequent learning because the cache increases once the gradient in that axis (parameter) direction passes through a region of high gradient.
In the RMSprop method, an exponential moving average of the gradient information is used. With the exponential moving average, past information attenuates exponentially, so older gradient information is discarded and the latest gradient information is reflected more strongly.
In the Adam method, by estimating the first-order (mean) and second-order (variance) moments of the gradient values, it is possible to update rare information more strongly, as in Adagrad, and to discard older gradient information, as in the RMSprop method.
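For reference, the conventional update rules described above can be written as the following minimal Python sketch. The variable names (x for a parameter, dx for its loss gradient, lr for the learning rate) are hypothetical placeholders; the sketch illustrates the related art only and is not the calculation expression of the present disclosure.

```python
import numpy as np

def sgd(x, dx, lr=0.01):
    # Plain SGD: step against the gradient with a fixed learning rate.
    return x - lr * dx

def momentum_sgd(x, dx, v, lr=0.01, mu=0.9):
    # Momentum SGD: an inertia term v helps escape the vicinity of saddle points.
    v = mu * v - lr * dx
    return x + v, v

def adagrad(x, dx, cache, lr=0.01, eps=1e-8):
    # Adagrad: accumulate squared gradients; rare (small-gradient) directions keep a larger rate.
    cache = cache + dx ** 2
    return x - lr * dx / (np.sqrt(cache) + eps), cache

def rmsprop(x, dx, cache, lr=0.01, decay=0.99, eps=1e-8):
    # RMSprop: exponential moving average, so older gradient information decays away.
    cache = decay * cache + (1 - decay) * dx ** 2
    return x - lr * dx / (np.sqrt(cache) + eps), cache

def adam(x, dx, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: exponential moving averages of the first (mean) and second (variance) moments,
    # with bias correction for the early epochs (t starts at 1).
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * dx ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return x - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```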
However, since these methods use an exponential moving average with which the learning rate and the amount of update of the parameters decrease monotonically as learning progresses, there is a problem in that learning may stagnate without rare information being efficiently learned, or a correction coefficient may take an extremely large value and diverge at the beginning of learning.
The present disclosure was made in consideration of the aforementioned circumstances and provides a trained model generation system, a trained model generation method, an information processing device, a program, a trained model, and an estimation device that enable learning to exit from a state in which the learning stagnates.
In order to solve the aforementioned problems, according to an aspect of the present disclosure, there is provided a trained model generation system that generates a trained model, the trained model generation system including: an estimation unit configured to perform estimation on learning data; a loss gradient calculating unit configured to calculate a gradient of loss for a result of estimation from the estimation unit; and an optimizer unit configured to calculate a plurality of parameters constituting the trained model on the basis of the gradient of loss, wherein the optimizer unit uses an expression including a first factor of which an absolute value becomes greater than 1 to achieve an effect of increasing a learning rate when learning stagnates and in which the effect of increasing the learning rate when the learning stagnates increases as the number of epochs increases as an expression for calculating the learning rate used to calculate the plurality of parameters.
Another aspect of the present disclosure provides the trained model generation system, in which the first factor enables an effect of suppressing the learning rate to be achieved more as the absolute value of the gradient increases and increases the effect of suppressing the learning rate as the number of epochs increases.
Another aspect of the present disclosure provides the trained model generation system, in which the expression for calculating the learning rate includes a second factor which suppresses the learning rate and of which a maximum value is 1 according to a cumulative amount of update of each of the plurality of parameters through learning at the beginning of learning and does not include the second factor subsequently to the beginning of learning.
Another aspect of the present disclosure provides the trained model generation system, in which the second factor has an absolute value which is less than 1 when the cumulative amount of update is less than a threshold value and monotonically decreases when the cumulative amount of update is greater than the threshold value.
According to another aspect of the present disclosure, there is provided a trained model generation method of generating a trained model, the trained model generation method including: a first step of performing estimation on learning data; a second step of calculating a gradient of loss for a result of estimation from the first step; and a third step of calculating a plurality of parameters constituting the trained model on the basis of the gradient of loss, wherein an expression including a first factor of which an absolute value becomes greater than 1 to achieve an effect of increasing a learning rate when learning stagnates and in which the effect of increasing the learning rate when the learning stagnates increases as the number of epochs increases is used as an expression for calculating the learning rate used to calculate the plurality of parameters in the third step.
According to another aspect of the present disclosure, there is provided an information processing device including an optimizer unit configured to calculate a plurality of parameters constituting a trained model on the basis of a gradient of loss calculated from a result of estimation of learning data, wherein the optimizer unit uses an expression including a first factor of which an absolute value becomes greater than 1 to achieve an effect of increasing a learning rate when learning stagnates and in which the effect of increasing the learning rate when the learning stagnates increases as the number of epochs increases as an expression for calculating the learning rate used to calculate the plurality of parameters.
According to another aspect of the present disclosure, there is provided a program causing a computer to serve as an optimizer unit configured to calculate a plurality of parameters constituting a trained model on the basis of a gradient of loss calculated from a result of estimation of learning data, wherein the optimizer unit uses an expression including a first factor of which an absolute value becomes greater than 1 to achieve an effect of increasing a learning rate when learning stagnates and in which the effect of increasing the learning rate when the learning stagnates increases as the number of epochs increases as an expression for calculating the learning rate used to calculate the plurality of parameters.
According to another aspect of the present disclosure, there is provided a trained model that is generated by calculating a plurality of parameters constituting the trained model on the basis of a gradient of loss calculated from a result of estimation of learning data, wherein an expression including a first factor of which an absolute value becomes greater than 1 to achieve an effect of increasing a learning rate when learning stagnates and in which the effect of increasing the learning rate when the learning stagnates increases as the number of epochs increases is used as an expression for calculating the learning rate used to calculate the plurality of parameters.
According to another aspect of the present disclosure, there is provided an estimation device that performs estimation on input information using a trained model that is generated by calculating a plurality of parameters constituting the trained model on the basis of a gradient of loss calculated from a result of estimation of learning data, wherein an expression including a first factor of which an absolute value becomes greater than 1 to achieve an effect of increasing a learning rate when learning stagnates and in which the effect of increasing the learning rate when the learning stagnates increases as the number of epochs increases is used as an expression for calculating the learning rate when calculating the plurality of parameters.
According to the present disclosure, it is possible to enable learning to exit from a state in which the learning stagnates.
Hereinafter, an embodiment of the present disclosure will be described with reference to the accompanying drawings.
The trained model generation and registration system includes a learning data DB 100, a trained model generation device 200, and a monitoring camera device 300. The learning data DB 100 stores images serving as learning data for machine learning. As will be described later, the monitoring camera device 300 includes models which have been trained according to time periods and positions, and thus the learning data DB 100 stores images serving as learning data according to time periods and positions. As an example of learning data according to time periods, the learning data DB 100 may separately store daytime images captured in the daytime and nighttime images captured at nighttime. As an example of learning data according to positions, the learning data DB 100 may separately store entrance images and parking lot images.
The trained model generation device 200 generates a trained model by performing machine learning using the images stored in the learning data DB 100. For example, the trained model generation device 200 learns the daytime images to generate a trained model for the daytime and learns the nighttime images to generate a trained model for the nighttime. The trained model generation device 200 learns the entrance images to generate a trained model for an entrance and learns the parking lot images to generate a trained model for a parking lot.
The monitoring camera device 300 stores the trained models according to time periods and positions generated by the trained model generation device 200. These trained models may be stored in memory or the like built into the monitoring camera device 300 at the time of production of the monitoring camera device 300. Alternatively, the monitoring camera device 300 may acquire the trained models from a server storing trained models via a network such as the Internet. The monitoring camera device 300 detects an object in a captured image using the stored trained models according to time periods and positions. The monitoring camera device 300 notifies a user of the monitoring camera device 300 of the detected object.
Here, a loss is a difference between a result of estimation and an ideal value such as a correct answer. For example, when the result of estimation for a certain image is (0.3, 0.2, 0.1, 0.9) and the correct answer for the image is (0.0, 0.0, 0.0, 1.0), the loss may be the average of the squared differences of the terms: ((0.3−0.0)² + (0.2−0.0)² + (0.1−0.0)² + (0.9−1.0)²)/4 = 0.0375. Alternatively, the loss may be a root mean square error (RMSE), a mean absolute error (MAE), or a root mean square logarithmic error (RMSLE). The loss gradient calculating unit 220 calculates the gradient of the loss in the direction of each of the plurality of parameters in the neural network of the estimation unit 210. For example, the gradient of the loss in the direction of a parameter x_i is approximately calculated as (Loss(x_i + h) − Loss(x_i − h))/(2h), that is, the difference between the loss Loss(x_i + h) when the value of the parameter is x_i + h and the loss Loss(x_i − h) when the value of the parameter is x_i − h, divided by 2h. The optimizer unit 230 calculates the plurality of parameters in the neural network of the estimation unit 210 on the basis of the gradient of loss calculated by the loss gradient calculating unit 220. The trained model includes the plurality of parameters.
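As a concrete illustration of the loss and of the approximate gradient described above, the following Python sketch reproduces the numerical example; the function names are hypothetical and are not part of the embodiment.

```python
import numpy as np

def mean_squared_loss(estimate, target):
    # Average of the squares of the per-term differences.
    return float(np.mean((np.asarray(estimate) - np.asarray(target)) ** 2))

print(mean_squared_loss([0.3, 0.2, 0.1, 0.9], [0.0, 0.0, 0.0, 1.0]))  # 0.0375

def central_difference_gradient(loss_fn, params, i, h=1e-4):
    # Gradient of the loss in the direction of parameter x_i:
    # (Loss(x_i + h) - Loss(x_i - h)) / (2h)
    plus, minus = params.copy(), params.copy()
    plus[i] += h
    minus[i] -= h
    return (loss_fn(plus) - loss_fn(minus)) / (2 * h)

# Small usage example with a toy loss function sum(x^2); the gradient at x_0 = 1.0 is about 2.0.
print(central_difference_gradient(lambda p: float(np.sum(p ** 2)), np.array([1.0, 2.0]), i=0))
```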
Then, the optimizer unit 230 performs optimization of calculating the parameters in the neural network on the basis of the gradient of loss calculated by the loss gradient calculating unit 220 (Step S5). This optimization is performed by solving an optimization problem of determining a combination of the values of the parameters such that the loss is minimized. The gradient method according to the embodiment is used to solve the optimization problem. Details of the gradient method according to the embodiment will be described later.
Then, the optimizer unit 230 determines whether ending conditions for ending generation of a trained model have been satisfied (Step S6). Any conditions, such as a predetermined number of repetitions or convergence of the loss, may be used as the ending conditions. When it is determined that the ending conditions have been satisfied (Step S6: YES), the optimizer unit 230 ends the trained model generating process and sets a model including the parameters at that time as the trained model.
When it is determined in Step S6 that the ending conditions have not been satisfied (Step S6: NO), the optimizer unit 230 updates the parameters in the neural network of the estimation unit 210 with the parameters optimized in Step S5 (Step S7). Then, the estimation unit 210 determines whether learning data is to be changed (Step S8). For example, when the processes of Steps S3 to S7 are performed on the same learning data a predetermined number of times or more, the estimation unit 210 may determine that the learning data is to be changed and may otherwise determine that the learning data is not to be changed. Alternatively, when the loss does not decrease even if the processes of Steps S3 to S7 are repeated, the estimation unit 210 may determine that the learning data is to be changed, and may otherwise determine that the learning data is not to be changed.
The routine proceeds to Step S2 when it is determined in Step S8 that the learning data is to be changed (Step S8: YES), and the routine proceeds to Step S3 when it is determined that the learning data is not to be changed (Step S8: NO).
In this embodiment, determination of the ending conditions in Step S6 is performed after Step S5, but the present disclosure is not limited thereto. For example, Step S6 may be performed after Step S2 or may be performed after Step S3.
The trained model generation device 200 generates a trained model by repeatedly performing estimation on learning data, calculation of the gradient of loss, and optimization of the parameters. The number of repetitions is also referred to as the number of epochs. The number of pieces of learning data acquired in Step S2 may be one or more.
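As an illustration of the repetition described above (Steps S2 to S8), the following self-contained Python sketch trains a single-parameter toy model. The data, the plain SGD update used here in place of the gradient method of the embodiment, and all names are hypothetical placeholders introduced only to show the flow of estimation, loss gradient calculation, optimization, the ending conditions, and the change of learning data.

```python
import numpy as np

# Toy stand-ins for the learning data DB and the neural network of the estimation unit 210:
# a single-parameter linear model y = w * x. These are illustrative placeholders only;
# the embodiment uses a multi-layer neural network and the gradient method of Expression (2).
learning_data = [(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])),
                 (np.array([0.5, 1.5]), np.array([1.0, 3.0]))]

def estimate(w, x):                          # Step S3: estimation on the learning data
    return w * x

def loss(w, x, y):                           # mean squared loss of the estimation result
    return float(np.mean((estimate(w, x) - y) ** 2))

def loss_gradient(w, x, y, h=1e-4):          # Step S4: central-difference gradient of the loss
    return (loss(w + h, x, y) - loss(w - h, x, y)) / (2 * h)

w, lr = 0.0, 0.05                            # initial parameter and learning rate
epoch, data_index, repeats = 0, 0, 0
x, y = learning_data[data_index]             # Step S2: acquire learning data
while True:
    dw = loss_gradient(w, x, y)
    w_new = w - lr * dw                      # Step S5: optimization (plain SGD for brevity)
    epoch += 1
    if epoch >= 100 or loss(w_new, x, y) < 1e-8:   # Step S6: ending conditions
        w = w_new
        break
    w = w_new                                # Step S7: update the parameter
    repeats += 1
    if repeats >= 10:                        # Step S8: change the learning data
        data_index = (data_index + 1) % len(learning_data)
        x, y = learning_data[data_index]     # back to Step S2
        repeats = 0

print(w)  # converges toward 2.0; the final parameter plays the role of the trained model here
```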
Out of the L intermediate layers L_1, L_2, . . . , and L_L, the m-th intermediate layer L_m includes n nodes u^(m)_0, u^(m)_1, . . . , and u^(m)_{n−1}. In the example illustrated in
A value input to a node of a certain layer is determined on the basis of the values of the nodes in the previous layer, as expressed by Expression (1). In Expression (1), w^(l)_{i,j} is a weighting and b^(l)_i is a bias. The values of the weighting w^(l)_{i,j} and the bias b^(l)_i are parameters of the neural network and are determined by the optimizer unit 230. f( ) is an activation function. The activation function in the embodiment may be any function that is used as an activation function of a neural network, such as a sigmoid function or a rectified linear unit (ReLU) function.
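Expression (1) itself is not reproduced in this text, so the following Python sketch assumes the standard form u^(l)_i = f(Σ_j w^(l)_{i,j} · u^(l−1)_j + b^(l)_i); it is only an illustration of that assumed form, with a sigmoid chosen as one example of the activation function f.

```python
import numpy as np

def sigmoid(z):
    # One example of the activation function f( ).
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(u_prev, W, b, f=sigmoid):
    # Assumed form of Expression (1): node i of the current layer receives
    # f(sum_j W[i, j] * u_prev[j] + b[i]), where W and b are the parameters
    # determined by the optimizer unit 230.
    return f(W @ u_prev + b)

# Example: a layer with 3 input nodes and 2 output nodes.
u_prev = np.array([0.5, -0.2, 0.1])
W = np.array([[0.1, 0.4, -0.3],
              [0.2, -0.1, 0.5]])
b = np.array([0.05, -0.05])
print(layer_forward(u_prev, W, b))
```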
The numbers of nodes included in the layers in
The gradient method used by the optimizer unit 230 will be described below with reference to the following expressions. The optimizer unit 230 optimizes the parameters in the neural network of the estimation unit 210 using Expression (2) at the time of performing Step S5 in
In Expression (2), x is a parameter to be optimized, that is, one of the weightings w^(l)_{i,j} and the biases b^(l)_i which are the parameters of the neural network. That is, the optimizer unit 230 performs the process of Expression (2) on each of the weightings w^(l)_{i,j} and the biases b^(l)_i. β_1, β_2, and β_3 are predetermined numbers which are greater than 0 and less than 1. These values may be set by an operator of the trained model generation device 200.
dx is the gradient of the loss in the x direction, which is calculated by the loss gradient calculating unit 220. That is, when x is a weighting w^(l)_{i,j}, dx is the gradient in the direction of the weighting w^(l)_{i,j}, and when x is a bias b^(l)_i, dx is the gradient in the direction of the bias b^(l)_i. Here, t is the number of epochs, a_t is the learning rate, and a_0 is an initial value of the learning rate. The value of a_0 may be set by an operator of the trained model generation device 200. eps is a very small predetermined constant for preventing the denominator from becoming zero.
The operator "^" denotes exponentiation. log denotes the common logarithm. sqrt( ) denotes the positive square root. := denotes that the value of the left side is approximately equal to the value of the right side, that is, the absolute value of the difference between the value of the left side and the value of the right side is equal to or less than a predetermined number. += denotes that the value of the right side is added to the value of the variable on the left side and the resultant value is stored in that variable. Accordingly, cache holds a value accumulated for the parameter x over the epochs. Here, cache is referred to as the cumulative amount of update.
In this way, the absolute value of the first factor becomes greater than 1 when learning stagnates, and an effect of increasing the learning rate is obtained. Since the reciprocal of m is raised to the t-th power, the effect of the first factor of increasing the learning rate when learning stagnates increases as the number of epochs increases. Accordingly, it is possible to escape from the state in which learning stagnates.
However, when the gradient dx continuously has a large value, the absolute value of m, which is an exponential moving average of the gradient dx, increases and the reciprocal 1/m decreases. Accordingly, the absolute value of the first factor decreases as the absolute value of the gradient dx increases, and an effect of suppressing the learning rate is obtained. When the absolute value of m is greater than 1, the effect of the first factor of suppressing the learning rate increases as the number of epochs increases. Accordingly, when the number of epochs has increased, that is, when learning is progressing, and the gradient dx continuously has a large value, the learning rate is suppressed. As a result, it is possible to avoid a large departure from the result of learning up to that point, that is, destruction of the learning information obtained so far.
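Expression (2) is not reproduced here, but the description above suggests a first factor of the form (1/|m|)^t, where m is an exponential moving average of the gradient dx. The following Python sketch only illustrates that assumed qualitative behavior, namely growth above 1 when the gradients stay small (stagnation) and suppression below 1 when |m| exceeds 1; it should not be read as the actual calculation expression of the embodiment.

```python
import numpy as np

def first_factor(grad_history, beta=0.9):
    # Assumed illustrative form: (1 / |m|) ** t, with m an exponential moving
    # average of the gradients. NOT the actual Expression (2) of the embodiment;
    # the values only show the direction of the effect, not its magnitude.
    m = 0.0
    factors = []
    for t, dx in enumerate(grad_history, start=1):
        m = beta * m + (1 - beta) * dx
        factors.append((1.0 / (abs(m) + 1e-12)) ** t)
    return factors

# Stagnating learning: small gradients, so |m| < 1 and the factor grows above 1 with t.
print(first_factor([0.5, 0.4, 0.5, 0.4, 0.5])[-1])
# Persistently large gradients: |m| > 1, so the factor falls below 1 with t.
print(first_factor([3.0, 3.5, 3.0, 3.5, 3.0])[-1])
```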
In the example illustrated in
At the beginning of learning, the gradient may be large, for example, when an initial value of a parameter is far from the optimal value. Due to the large gradient, the amount of update of the parameter may become large and learning may diverge. However, when the learning rate is set to a smaller value in the section T1 in which the cumulative amount of update is small, the learning rate is decreased at the beginning of learning because the cumulative amount of update is small at that time. Since the learning rate is low even when the gradient is large, the amount of update of the parameter does not become excessive, and thus it is possible to prevent divergence of learning.
However, since learning does not progress with a low learning rate, the value of the second factor is less than or equal to 1 until the cumulative amount of update reaches the threshold value of 1, and becomes 1 when the cumulative amount of update reaches 1. In the section T2 in which the cumulative amount of update is greater than 1, the value of the second factor decreases monotonically, and thus the update value also decreases, except for the protruding portions marked by circles. Accordingly, it is possible to prevent divergence of learning and continued fluctuation caused by learning departing from the result obtained so far.
When the cumulative amount of update has increased such that 1/cache := 0 is satisfied, the calculation expression of the learning rate a_t includes the first factor but does not include the second factor. Accordingly, it is possible to prevent a protrusion of the update value due to the first factor from being suppressed by the second factor. On the other hand, when learning is progressing and the gradient dx continuously has a large value, the learning rate is suppressed by the first factor. Accordingly, it is possible to avoid a large departure from the learning result obtained so far, that is, destruction of the learning information obtained so far.
The initial values of the parameters w_i and w_j are indicated by a circle in
The image input unit 301 includes an imaging device and an optical system that forms an image of a subject on an imaging plane of the imaging device. The image input unit 301 converts an image of a subject formed on the imaging plane to an electrical signal. The feature extracting unit 302 estimates an object included in the image of a subject converted to an electrical signal using the neural network. The trained model primary storage unit 303 stores a trained model which is a parameter in a neural network of the feature extracting unit 302.
The object recognizing unit 304 recognizes an object included in the image from the result of estimation from the feature extracting unit 302. The abnormality detecting unit 305 determines whether the object recognized by the object recognizing unit 304 is an abnormal object for which an alarm should be issued. The recognition result display unit 306 displays a name or the like of the object recognized by the object recognizing unit 304 on a screen to notify an operator. When the abnormality detecting unit 305 determines that the object is an abnormal object for which an alarm should be issued, the recognition result notification unit 307 issues a voice alarm to notify the operator. At this time, the recognition result notification unit 307 may change the voice that is issued according to the details of the abnormality.
The high-accuracy locator 308 detects the position at which the monitoring camera device 300 is installed using a global positioning system (GPS) or the like. The timepiece 309 provides the current time. The learning information exchange unit 310 acquires a trained model corresponding to the position detected by the high-accuracy locator 308 or the current time provided by the timepiece 309 from the positional learning DB 311 or the temporal learning DB 312. The learning information exchange unit 310 stores the acquired trained model in the trained model primary storage unit 303 via the feature extracting unit 302. The positional learning DB 311 stores the trained models generated by the trained model generation device 200 according to positions. The temporal learning DB 312 stores the trained models generated by the trained model generation device 200 according to time periods.
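As a simple illustration of how the learning information exchange unit 310 might switch models by time period and position, the following Python sketch selects a model from the current hour and a position label. The dictionary contents, the daytime/nighttime split, and the function names are hypothetical examples and not limitations of the embodiment.

```python
from datetime import datetime

# Hypothetical stand-ins for the temporal learning DB 312 and the positional learning DB 311:
# each maps a label to a stored trained model (represented here by a placeholder string).
temporal_learning_db = {"daytime": "model_daytime.bin", "nighttime": "model_nighttime.bin"}
positional_learning_db = {"entrance": "model_entrance.bin", "parking_lot": "model_parking_lot.bin"}

def select_temporal_model(now: datetime):
    # Example rule only: treat 6:00-17:59 as daytime; the actual time periods are not specified here.
    label = "daytime" if 6 <= now.hour < 18 else "nighttime"
    return temporal_learning_db[label]

def select_positional_model(position_label: str):
    return positional_learning_db[position_label]

print(select_temporal_model(datetime.now()))
print(select_positional_model("parking_lot"))
```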
The positional learning DB 311 and the temporal learning DB 312 may be used by a plurality of monitoring camera devices 300. For example, a plurality of monitoring camera devices 300 may access the positional learning DB 311 and the temporal learning DB 312 via a network.
The monitoring camera device 300 may include the learning data DB 100 and the trained model generation device 200. In this case, learning data stored in the learning data DB 100 may be data of an image output from the image input unit 301.
The optimizer unit 230 may include extreme value regression using a Hessian matrix. That is, when the parameters are calculated using a calculation method other than Expression (2), such as when the parameters are calculated using a higher-order derivative such as a second derivative of the loss in addition to the gradient of loss dx, the optimizer unit 230 may calculate the learning rate a_t in the same way as expressed by Expression (2). When calculating the learning rate a_t, the optimizer unit 230 may perform multiplication by another factor or addition of another term in addition to the first factor and the second factor.
The trained model generation device 200 or the monitoring camera device 300 may be realized by recording a program for realizing the functions of the trained model generation device 200 or the monitoring camera device 300 in
The "computer-readable recording medium" may be a portable medium such as a flexible disk, a magneto-optical disc, a ROM, a CD-ROM, or a DVD, or a storage device such as a hard disk or an SSD incorporated in a computer system. The "computer-readable recording medium" may also include a medium that dynamically holds a program for a short time, such as a communication line in a case in which the program is transmitted via a network such as the Internet or a communication circuit such as a telephone line, or a medium that holds a program for a predetermined time, such as volatile memory in a computer system serving as a server or a client in that case. The program may be a program for realizing some of the aforementioned functions, or may be a program which can realize the aforementioned functions in combination with another program stored in advance in the computer system.
The functional blocks of the trained model generation device 200 illustrated in
If integration technology that can replace LSI emerges with advances in semiconductor technology, an integrated circuit based on that integration technology may be used.
While an embodiment of the present invention has been described above in detail with reference to the drawings, the specific configuration thereof is not limited to the embodiment and includes various modifications in design without departing from the gist of the present invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2021/044400 | 12/3/2021 | WO |