This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2021-147507, filed on Sep. 10, 2021, the entire contents of which are incorporated herein by reference.
The embodiment discussed herein is related to a non-transitory computer-readable storage medium storing a speed-up program, a speed-up method, and an information processing device.
For speed-up of machine learning of a machine learning model, use of a graphics processing unit (GPU) is effective, and in addition, distribution of processing to a plurality of GPUs is also effective. Up to now, speed-up has been achieved by mounting a plurality of GPUs in nodes, which are information processing devices, and executing machine learning processing in parallel in the nodes. However, the processing for aggregating gradient information among the GPUs and the reflection processing take time.
In such machine learning of the machine learning model, a new learning portion needs to be updated frequently each time of learning. Therefore, a learning rate (LR) that determines the update frequency needs to be set higher. Meanwhile, an existing learning portion closer to the input side, on which machine learning has already been completed, has a lower learning rate, and in an extreme case the learning rate is often set to zero. Although it is not necessary to perform machine learning processing for the portion with the learning rate of zero, the processing for aggregating the gradient information, the reflection processing, and the weight calculation processing are executed at the same frequency as those of the new learning portion, so that many unnecessary processes are performed.
For this reason, in recent years, a gradient skip technology is used that identifies a layer that does not need machine learning and skips the calculation and aggregation processing (Allreduce processing) of gradient information (Δw).
Japanese National Publication of International Patent Application No. 2018-520404, Japanese Laid-open Patent Publication No. 10-198645, and U.S. Pat. No. 6,119,112 are disclosed as related art.
According to an aspect of the embodiments, there is provided a computer-implemented method of speed-up processing. In an example, the method includes: calculating variance of weight information regarding a weight updated by machine learning, for each layer included in a machine learning model at a predetermined interval at the time of the machine learning of the machine learning model; and determining a suppression target layer that suppresses the machine learning on the basis of a peak value of the variance calculated at the predetermined interval and the variance of the weight information calculated at the predetermined interval.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
However, with the technology described above, although speed-up can be achieved by skipping machine learning, the accuracy of machine learning deteriorates depending on which layer skips machine learning or on the timing of skipping, and there is a case where machine learning ends without attaining the target accuracy.
In one aspect, an object is to provide a speed-up program, a speed-up method, and an information processing device that can implement both reduction in time to learning convergence and improvement in accuracy.
Hereinafter, examples of a speed-up program, a speed-up method, and an information processing device disclosed in the present application will be described in detail with reference to the drawings. Note that the present embodiment is not limited by the examples. Furthermore, each of the examples may be appropriately combined within a range without inconsistency.
[Overall Configuration]
An information processing device 10 according to an example 1 executes generation of a machine learning model by distributed processing using a plurality of GPUs in order to implement speed-up of machine learning processing.
In recent years, the gradient skip technology is used that identifies a layer that does not need machine learning using a learning rate of each layer and suppresses (skips) learning without performing the calculation and aggregation processing (Allreduce processing) of gradient information.
Here, a reference technique for learning skip (hereinafter may be simply referred to as “skip”) will be described.
For example, the reference technique detects a layer in which the learning rate indicating a progress status of learning has decreased and omits learning for the layer so as to shorten the learning time. For example, each layer in which a difference between the error gradient at the latest iteration and the error gradient at the immediately previous iteration is equal to or larger than a threshold executes learning as usual at the next iteration, whereas each layer in which the difference is less than the threshold executes learning skip at the next iteration. For example, in the layer in which the learning rate has decreased, subsequent machine learning processing of calculating the error gradient and the like is suppressed.
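As a minimal illustration of this reference determination, the following Python sketch checks the change in the error gradient of one layer between consecutive iterations against a threshold; the threshold value and the use of the mean absolute difference to summarize the change are assumptions made only for the sketch, not part of the reference technique itself.

```python
import numpy as np

def skip_at_next_iteration(latest_grad, previous_grad, threshold=1e-3):
    """Reference technique sketch: a layer undergoes learning skip at the next
    iteration when the change in its error gradient between the latest iteration
    and the immediately previous iteration is less than the threshold (the change
    is summarized here as the mean absolute difference, which is an assumption)."""
    difference = float(np.abs(latest_grad - previous_grad).mean())
    return difference < threshold

# A layer whose error gradient barely changed is skipped at the next iteration.
print(skip_at_next_iteration(np.array([0.0100, 0.0200]), np.array([0.0105, 0.0198])))  # True
```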
However, in the reference technique, an influence of accuracy deterioration in a case of completely skipping the machine learning is partially unknown. For example, in the machine learning model such as a deep neural network (DNN), the accuracy tends to be deteriorated in a case where backward propagation processing for a plurality of layers is determined with reference to the learning rate (LR) or the like and stopped at the same time. Furthermore, if the timing to perform learning skip (the number of epochs and the number of iterations) is not proper, there is a case where the accuracy is deteriorated and the final accuracy does not attain the target.
Therefore, in recent years, for a layer determined to be a learning skip target, in which each processing such as the calculation of the error gradient and the backward propagation is suppressed, processing is known that does not suddenly perform learning skip but gradually reduces the learning rate of the layer whose learning is to be stopped, performs learning processing to some extent, and then performs learning skip.
Here, an example of learning skip will be described.
As for machine learning using the above-described learning skip, a decrease in the final achieved accuracy tends to be smaller when the layer whose learning is to be stopped is stopped after the learning rate has become small to some extent. For this reason, introduction of a braking distance (BRAKING_DISTANCE: BD) of the learning rate is proceeding, in which, when a command to stop (skip) learning is given, learning of the target layer is stopped after lowering the learning rate instead of being stopped suddenly. For example, machine learning is used in which the layers whose learning is stopped, in order from the layer closest to the bottom, are each settled into a local solution to which they have already adapted.
Here, an example of introducing the braking distance (BRAKING_DISTANCE: BD) of the learning rate as a hyperparameter, and suppressing learning in a stepwise manner for each layer belonging to a skip candidate block will be described. Note that a block is a collection of a plurality of layers.
Then, when the first block is determined as a learning skip target, the information processing device executes machine learning with a learning rate significantly decreased as compared with that of normal learning for each iteration. Thereafter, when the second block is determined as the learning skip target, the information processing device executes machine learning with a learning rate significantly decreased as compared with that of normal learning for each iteration. In this way, the information processing device executes learning skip in the order from the block closest to the input layer.
In determination of whether to perform such learning skip, the last update amount "W_{I+1} − W_I" of a weight W is often used for each layer. Note that W_I is the weight at the time of the I-th learning.
However, there are the following problems. First, the threshold for each layer needs to be set. Secondly, the threshold is set from the weight (W) or the weight gradient (ΔW) itself, but the tendency of the determination index substantially varies depending on the setting of the learning rate (LR), weight attenuation (Weight-Decay), and the like. Therefore, after optimizing the hyperparameters LR and Weight-Decay, the start threshold of learning skip is optimized. That is, it is very time consuming and difficult to automate.
Therefore, it is desired to implement a method of setting the start threshold of learning skip that does not depend on the weight (W) or the weight gradient (ΔW) itself.
Therefore, the information processing device 10 according to the example 1 calculates variance of weights or weight gradients updated by machine learning for each layer of the machine learning model at a predetermined interval at the time of machine learning of the machine learning model. Then, the information processing device 10 determines a suppression target layer for suppressing machine learning on the basis of a peak value of the variance calculated at the predetermined interval and the variance calculated at the predetermined interval.
For example, the information processing device 10 obtains the variance of the weight (W) or the weight gradient (ΔW) for each layer, determines that a layer whose variance has decreased from its peak in the time direction (epoch or iteration) by a threshold or more can undergo learning skip, and executes learning skip or learning skip using the BD.
For example, when a histogram of the elements of the tensor of the weight is displayed for a certain layer, the distribution of the weight (W) concentrates near the median as the iteration progresses, so that the variance becomes small. Therefore, if the variance is small to some extent, it can be determined that the learning is completed. Meanwhile, if learning skip is performed during the warm-up processing period (for example, 0 to 3850 iterations) performed in an initial stage of machine learning, convergence of accuracy slows down, and the learning skip during warm-up may adversely affect the accuracy more than the benefit of the skipped length.
In consideration of the above, the information processing device 10 sets, for each layer of the machine learning model, the threshold as a decrease rate from the peak of the variance of the elements of the tensor of the weight after the warm-up processing has progressed, and determines the learning skip timing. By doing so, the information processing device 10 can implement a method of setting the start threshold of learning skip that does not depend on the weight (W) or the weight gradient (ΔW) itself, and can implement both the reduction in time to learning convergence and the improvement in accuracy.
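For illustration only, this determination can be sketched as follows in Python with NumPy; the warm-up length and the decrease rate T are assumptions, and the class name and variable names are hypothetical rather than part of the example 1. The sketch tracks, for each layer, the peak of the variance of the weight-tensor elements observed after warm-up and flags the layer as a learning-skip candidate once the current variance falls to T % of the peak or less.

```python
import numpy as np

class VarianceSkipMonitor:
    """Tracks per-layer weight variance and flags learning-skip candidates.

    A layer becomes a skip candidate when the variance of its weight-tensor
    elements falls to `decrease_rate` (T %) of the observed peak or less,
    evaluated only after the warm-up period.
    """

    def __init__(self, warmup_iters=3850, decrease_rate=0.2):
        self.warmup_iters = warmup_iters      # no determination during warm-up
        self.decrease_rate = decrease_rate    # assumed threshold T (ratio to peak)
        self.peak = {}                        # layer name -> peak variance Vmax(n)

    def update(self, iteration, layer_name, weight_tensor):
        """Returns True when the layer is determined to be a skip candidate."""
        if iteration < self.warmup_iters:
            return False
        v = float(np.var(weight_tensor))      # variance of the tensor elements
        vmax = max(self.peak.get(layer_name, 0.0), v)
        self.peak[layer_name] = vmax
        # Skip when the variance has dropped to T % of the peak or less.
        return vmax > 0.0 and (v / vmax) <= self.decrease_rate


# Usage sketch with random weights standing in for a real layer.
rng = np.random.default_rng(0)
monitor = VarianceSkipMonitor()
w = rng.normal(size=(64, 64))
for it in range(3850, 3860):
    w *= 0.9                                  # toy update shrinking the spread
    if monitor.update(it, "L0", w):
        print(f"L0 becomes a learning-skip candidate at iteration {it}")
        break
```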
[Functional Configuration]
The communication unit 11 is a processing unit that controls communication with another device, and is implemented by, for example, a communication interface or the like. For example, the communication unit 11 transmits and receives various types of data, various instructions, or the like to and from an administrator's terminal.
The storage unit 12 is a processing unit that stores various types of data, various programs, or the like and is implemented by, for example, a memory, a hard disk, or the like. The storage unit 12 stores a training data DB 13 and a machine learning model 14.
The training data DB 13 is an example of a dataset that stores training data used for machine learning of the machine learning model 14. For example, each piece of the training data stored in the training data DB 13 includes image data and a supervised label. Note that the dataset of the training data can be divided into subsets (batch size) in arbitrary units.
The machine learning model 14 is a model generated by machine learning such as DL and is an example of a model using a multi-layer neural network including a plurality of layers. For example, in a case where image data is input, the machine learning model 14 executes classifying an animal in the image. Note that a DNN, a convolutional neural network (CNN), or the like can be adopted as the machine learning model 14.
The integration processing unit 20 is a processing unit that controls the entire information processing device 10 and is implemented by, for example, a CPU. For example, the integration processing unit 20 instructs each distributed processing unit 30 to, for example, start or terminate the distributed processing of machine learning, and executes overall control regarding machine learning.
Each distributed processing unit 30 is a processing unit that executes the distributed processing of machine learning and is implemented by, for example, a GPU. For example, each distributed processing unit 30 executes generation of the machine learning model 14 by machine learning using each piece of the training data stored in the training data DB 13 in response to the instruction from the integration processing unit 20.
[Details of Distributed Processing Unit]
Next, details of each distributed processing unit 30 will be described. Note that each distributed processing unit 30 has a similar configuration.
The forward propagation processing unit 31 is a processing unit that executes forward propagation processing on each layer of the machine learning model 14. Specifically, since the forward propagation processing unit 31 executes so-called forward propagation, detailed description is omitted. Briefly, for example, the forward propagation processing unit 31 inputs image data that is training data to the top layer (input layer) of the machine learning model 14 and acquires, from the output layer, a prediction result (classification result) obtained by continuously calculating numerical values from the input layer toward the last layer (output layer) of the machine learning model 14. Then, the forward propagation processing unit 31 calculates an error between the prediction result and the supervised label using a squared error or the like, calculates an error function, and outputs the calculated values to the backward propagation processing unit 32.
The backward propagation processing unit 32 is a processing unit that includes an error gradient calculation unit 33 and a communication control unit 34, calculates an error of each parameter of the machine learning model 14 by a backward propagation method using the error function input from the forward propagation processing unit 31, and executes update of the parameter. For example, the backward propagation processing unit 32 executes so-called backward propagation.
For example, the backward propagation processing unit 32 calculates an error gradient of edge weight between nodes of a neural network in an order (reverse order) from the output layer toward the input layer of the machine learning model 14. The error gradient corresponds to a value obtained by partially differentiating an error by a weight in a case where the error is assumed as a function of the weight and represents a change amount of the error when the edge weight is slightly changed. Then, the backward propagation processing unit 32 executes update of each parameter such as each edge weight so as to reduce the error, using the error gradient.
The error gradient calculation unit 33 is a processing unit that calculates an error gradient indicating a gradient of an error with respect to each parameter of the machine learning model 14 and the like for each of the plurality of layers included in the machine learning model 14. For example, the error gradient calculation unit 33 calculates the error gradient (g), weight (W), and momentum (m) for each layer of the machine learning model 14 for each iteration, and outputs these pieces of information to the information management unit 35.
Here, at the time of calculating the error gradient, the error gradient calculation unit 33 suppresses the calculation of the error gradient for a layer determined to suppress learning (learning skip layer). Furthermore, the error gradient calculation unit 33 can set only a specific layer as an error gradient calculation target, a convolutional layer as the error gradient calculation target, or arbitrarily set the error gradient calculation target. Note that various known methods can be adopted as the method for calculating the error gradient and the like.
The communication control unit 34 is a processing unit that executes AllReduce communication between the GPUs. For example, the communication control unit 34 sums the error gradients calculated by the plurality of GPUs for each edge weight by transmitting and receiving the error gradients between the GPUs, and adds up the error gradients of the plurality of GPUs. Update of various parameters of the machine learning model 14 is executed by the backward propagation processing unit 32, using information regarding the error gradient added up in this way.
Furthermore, the communication control unit 34 stops the communication to the skip target layer according to a control instruction by the variance calculation unit 36 to be described below. Furthermore, the communication control unit 34 specifies a layer for which the error gradient calculation and the communication (Allreduce) are continued without stopping learning and a layer for which learning is stopped from among the layers of the machine learning model 14 and controls the communication.
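For reference, the aggregation performed by the Allreduce communication described above can be simulated with the following NumPy sketch; an actual implementation would rely on a collective-communication library between GPUs, so this stand-in only illustrates the element-wise summation of per-GPU error gradients.

```python
import numpy as np

def allreduce_sum(per_gpu_gradients):
    """Element-wise sum of the error gradients held by each GPU (Allreduce result)."""
    return np.sum(np.stack(per_gpu_gradients, axis=0), axis=0)

# Two simulated GPUs hold error gradients for the same edge weights of one layer.
g0 = np.array([[0.1, -0.2], [0.3, 0.0]])
g1 = np.array([[0.05, 0.2], [-0.1, 0.4]])
print(allreduce_sum([g0, g1]))   # aggregated gradient used for the parameter update
```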
The information management unit 35 is a processing unit that acquires and manages the error gradient (g), weight (W), and momentum (m) from the error gradient calculation unit 33. For example, the information management unit 35 outputs the weight (W) of each layer acquired from the error gradient calculation unit 33 to the variance calculation unit 36.
The variance calculation unit 36 is a processing unit that calculates weight variance updated by machine learning, for each layer included in the machine learning model 14, at a predetermined interval at a time of the machine learning of the machine learning model 14, and determines a suppression target layer for suppressing the machine learning on the basis of a peak value of the variance calculated at the predetermined interval and the variance calculated at the predetermined interval.
Specifically, the variance calculation unit 36 sets a layer closest to the input layer among layers for which learning skip has not been executed as a target layer that is a current learning skip determination target, in the order from the input layer to the output layer. Thereafter, the variance calculation unit 36 executes weight variance calculation and learning skip determination for each layer.
For example, the variance calculation unit 36 calculates the variance of the elements of the tensor of the weight (W) notified from the information management unit 35 for each iteration after the warm-up processing is completed. Next, the variance calculation unit 36 specifies the peak value, using the weight variance that continues to be calculated. After specifying the peak value, the variance calculation unit 36 calculates a difference between the peak value and the weight variance that continues to be calculated, and determines to cause the current target layer to undergo learning skip in a case where the difference becomes equal to or greater than the threshold. Then, the variance calculation unit 36 notifies the communication control unit 34 of a determination result, executes learning skip, determines the next target layer, and repeats the determination.
Here, setting of the threshold will be described. Specifically, the variance calculation unit 36 sets the threshold according to the change in the weight variance. For example, there are three threshold setting patterns: the threshold of the first layer (L0), the threshold in a case where the waveform of the change in the variance is mountain-shaped, and the threshold in a case where the waveform of the change in the variance is downhill. The variance calculation unit 36 sets a threshold for each pattern based on the degree of decrease from the peak.
Furthermore, for the layer having a mountain-shaped waveform and for the layer having a downhill waveform, the variance calculation unit 36 likewise sets the threshold on the basis of the degree of decrease from the peak. This is because the degree of decrease from the peak differs among the patterns, and the variance calculation unit 36 sets the degree of decrease to satisfy downhill layer < mountain-shaped layer < first layer, as an example. Furthermore, the variance calculation unit 36 may determine that the threshold setting pattern is the mountain-shaped pattern in a case where the variance that continues to be calculated changes between predetermined iterations by a predetermined value or more, and determine that the threshold setting pattern is the downhill pattern otherwise.
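A possible way to code this pattern determination is sketched below; the number of consecutive negative differences used for the downhill decision and the concrete degrees of decrease are hypothetical values chosen only to illustrate the ordering described above.

```python
def classify_variance_pattern(variances, negative_run=3):
    """Classifies the waveform of a per-layer weight-variance series.

    Returns "mountain" when the difference Vi - Vi-1 changes from positive to
    negative, "downhill" when it stays negative for `negative_run` consecutive
    observations, and None while neither condition has been observed yet.
    """
    run = 0
    seen_positive = False
    for previous, current in zip(variances, variances[1:]):
        difference = current - previous
        if difference > 0:
            seen_positive = True
            run = 0
        elif difference < 0:
            if seen_positive:
                return "mountain"
            run += 1
            if run >= negative_run:
                return "downhill"
    return None


# Hypothetical degrees of decrease from the peak, ordered so that
# downhill layer < mountain-shaped layer < first layer, as described above.
REQUIRED_DECREASE_FROM_PEAK = {"downhill": 0.1, "mountain": 0.3, "first_layer": 0.5}

for series in ([1.0, 1.2, 1.5, 1.3, 1.1], [1.5, 1.4, 1.3, 1.2, 1.1]):
    pattern = classify_variance_pattern(series)
    print(pattern, REQUIRED_DECREASE_FROM_PEAK[pattern])   # mountain 0.3 / downhill 0.1
```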
Furthermore, the variance calculation unit 36 can also gradually stop learning using the braking distance (BD) of the learning rate in addition to simply performing learning skip, for the layer determined to be the learning skip target.
For example, when a command to stop learning is issued, the variance calculation unit 36 suppresses the learning after lowering the learning rate using the BD that depends on the iteration, instead of suddenly stopping the learning of the skip candidate layer. More specifically, in a case where the LR scheduler used for learning of the machine learning model 14 is POW2, the variance calculation unit 36 attenuates the LR over the BD similarly to the POW2, using the equation (1).
[Math. 1]
BD attenuation rate = ((BD − iteration)/BD)^2   Equation (1)
Over the BD, which is the braking distance of the LR and is set to 7700 iterations, the variance calculation unit 36 decreases the learning rate for each iteration by multiplying it by the attenuation rate indicated in the equation (1). Note that the BD in the equation (1) is a predetermined set value, and the iteration is the number of iterations at the time of calculation.
Here, an example of executing learning skip using the above-described attenuation rate for the layer determined as the learning skip candidate will be described. The variance calculation unit 36 executes machine learning with LR = 5 when warm-up (3850 iterations) is completed. Then, when the layer is determined as the learning skip candidate at iteration 7980, the variance calculation unit 36 calculates the LR at that iteration using the equation (2) and executes machine learning using the calculated LR. In this way, the variance calculation unit 36 calculates the LR for each iteration and executes machine learning using the calculated LR.
[Math. 2]
LR = End LR + (LR at the start of BD − End LR) × ((BD − (iter. − iter. at the start of BD))/BD)^2   Equation (2)
Note that "LR" in the equation (2) is the learning rate to be calculated that is used for learning. "End LR" is the LR at which learning skip is finally performed, and the learning rate is repeatedly attenuated (decreased) until the LR reaches the "End LR". "LR at the start of BD" is the LR at the time of initial setting. "Iter." is the number of iterations at the time of calculation; since "LR" is calculated for each iteration after the layer is determined as the skip candidate, the number of iterations at that time is adopted. "Iter. at the start of BD" is the number of iterations at which the attenuation of the learning rate is started. For example, the BD = 7700 iterations, the warm-up = 3850 iterations, the initial value (Base LR) corresponding to "LR at the start of BD" = 5, the final LR (End LR) = 0.0001, and "iter. at the start of BD" = 7980 iterations are obtained.
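Under the numerical settings listed above, the equations (1) and (2) can be written out as the following Python sketch; interpreting the iteration counter in the equation (1) as the number of iterations elapsed within the BD period is an assumption of the sketch rather than a statement of the example.

```python
BD = 7700                 # braking distance in iterations (set value)
LR_AT_BD_START = 5.0      # "LR at the start of BD" (Base LR)
END_LR = 0.0001           # "End LR": the LR at which learning skip is finally performed
ITER_AT_BD_START = 7980   # "iter. at the start of BD"


def bd_attenuation_rate(elapsed_iterations):
    """Equation (1), interpreting "iteration" as the number of iterations elapsed
    since the start of the BD period, clipped to zero after the BD ends."""
    remaining = max(BD - elapsed_iterations, 0)
    return (remaining / BD) ** 2


def lr_during_bd(iteration):
    """Equation (2): the LR decays from LR_AT_BD_START toward END_LR over the BD period."""
    return END_LR + (LR_AT_BD_START - END_LR) * bd_attenuation_rate(iteration - ITER_AT_BD_START)


# The LR is 5 when the attenuation starts and reaches End LR when the BD period ends.
for it in (7980, 9000, 11830, 15680):     # 15680 = 7980 + 7700
    print(it, round(lr_during_bd(it), 4))
```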
As described above, the variance calculation unit 36 gradually lowers the learning rate using the BD that depends on the iteration instead of suddenly stopping learning of the learning skip candidate layer, and performs the learning skip when or after the learning rate becomes the target learning rate. At this time, the variance calculation unit 36 can improve the learning accuracy and reduce the final number of epochs by performing learning skip in the order from the layer closest to the input layer.
[Flow of Processing]
Next, the backward propagation processing unit 32, the information management unit 35, and the variance calculation unit 36 execute the backward propagation processing (S104). Thereafter, in the case where the machine learning is continued (S105: No), the processing of S102 and the subsequent steps is executed, and in the case where the machine learning is terminated (S105: Yes), the information processing device 10 terminates the machine learning.
(Flow of Backward Propagation Processing)
In the case where the target layer n has already been set (S201: No) or when the target layer n is set (S202), the variance calculation unit 36 determines whether it is the learning progress determination timing (S203). For example, the variance calculation unit 36 determines that it is the learning progress determination timing in the case where the warm-up processing has been completed and after the iteration is completed.
Here, in the case where the variance calculation unit 36 determines that it is not the learning progress determination timing (S203: No), the backward propagation processing unit 32 executes normal backward propagation processing (S204).
On the other hand, in the case of determining that it is the learning progress determination timing (S203: Yes), the variance calculation unit 36 calculates the variance V(n) of the weight (W) of the target layer n (S205). Note that, when the expected value (population mean) of a random variable X is μ = E(X), the variance σ^2 of X is calculated as the expected value E((X − μ)^2) of the square of the difference between X and the population mean.
Next, the variance calculation unit 36 obtains the peak of the variance V(n) (S206). For example, the variance calculation unit 36 calculates the weight variance value in or after a certain iteration (here, 5000) and obtains the maximum value Vmax(n), where Vi(n) is the variance value of the layer n at the i-th learning progress timing. Here, the variance calculation unit 36 sets the "downhill threshold" in a case where a state where the value of "Vi(n) − Vi−1(n)" is less than 0 ("Vi(n) − Vi−1(n)" < 0) continues for a predetermined number of times. The variance calculation unit 36 sets the "mountain-shaped threshold" in a case where the value of "Vi(n) − Vi−1(n)" changes from a state of being larger than 0 ("Vi(n) − Vi−1(n)" > 0) to a state of being less than 0 ("Vi(n) − Vi−1(n)" < 0).
Thereafter, the variance calculation unit 36 determines whether the variance V(n) has decreased from the peak to the threshold T % or less (S207). Here, in the case where the variance V(n) has decreased from the peak to the threshold T % or less (S207: Yes), the variance calculation unit 36 determines to perform learning skip because the layer n has attained the threshold (S208). On the other hand, in the case where the variance V(n) has not decreased from the peak to the threshold T % or less (S207: No), the variance calculation unit 36 executes the normal backward propagation processing because the layer n has not attained the threshold (S204). For example, the variance calculation unit 36 determines to perform learning skip in the case where "(Vi(n)/Vmax(n)) < the threshold" is satisfied.
[Effects]
As described above, the information processing device 10 focuses on the fact that the variation of the weight W decreases as the learning progresses, and determines whether to perform learning skip according to the decrease rate of the variance value from its peak, using the variance of the weight (W) or the weight gradient (ΔW) in the determination of the learning progress. As a result, since the only hyperparameter is the variance threshold, the threshold can be specified in a short time and automation can be easily performed. For example, even in a case where the parameter would normally be determined by about 60 executions, experimental results from which the parameter can be determined can be obtained by about 5 executions by using the method of the example 1.
Here, an influence of learning skip on the next layer in a case where learning of a certain layer is stopped by learning skip will be described.
When learning of a certain layer (for example, L6) is stopped by learning skip, the weight variance of the layer (for example, L9) immediately after it spreads once.
As described above, since the variance spreads once in the layer immediately after the layer in which the learning skip is executed, learning skip may occur in a state of peak misidentification or insufficient learning in the method of the example 1. Therefore, so as not to decrease the final learning accuracy after the learning skip, the threshold determination timing is improved.
For example, in a case where the weight variance calculation for L9 is continued while L6 is in its BD period, L9 may attain the threshold and enter its own BD period before the learning of L6 stops. However, when the BD period of L6 ends and the learning of L6 ends during the BD period of L9, the weight variance of L9 spreads even during the BD period of L9, as described above.
Therefore, to address this point, the variance calculation unit 36 suppresses the weight variance calculation for the subsequent layer (here, L9) while the preceding layer (here, L6) is in its BD period.
Then, the variance calculation unit 36 starts the variance calculation of the weight W of L9 at the timing when the BD period of L6 has ended, sets the BD period of L9 when the weight attains the threshold, and gradually stops learning of L9.
In this way, the variance calculation unit 36 suppresses the weight variance calculation for the subsequent layers during the BD period, and can perform learning skip of L9 after detecting a decrease in the weight variance that includes the influence of L6 in the layer (L9) next to L6. As a result, the information processing device 10 can improve the accuracy of the final machine learning model 14.
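The timing control described above, in which the variance calculation for the subsequent layer is withheld until the BD period of the preceding layer has ended, might be organized as in the following sketch; the layer names, the toy predicate, and the BD length are placeholders and not the actual control of the variance calculation unit 36.

```python
def sequential_skip_control(layers, attained_threshold, bd_iters=7700, max_iters=100000):
    """Processes layers in order from the input side: the weight variance of a layer
    is observed only while it is the current target, a BD period starts when the
    layer attains its threshold, and observation of the next layer starts only
    after that BD period has ended."""
    stopped = []
    iteration = 0
    for layer in layers:
        bd_end = None
        while iteration < max_iters:
            iteration += 1
            if bd_end is None:
                if attained_threshold(layer, iteration):
                    bd_end = iteration + bd_iters      # enter the BD period
            elif iteration >= bd_end:
                stopped.append(layer)                  # learning of this layer stops here
                break                                  # the next layer starts being observed
    return stopped, iteration


# Toy predicate: a layer attains its threshold 100 iterations after observation starts.
observation_start = {}
def toy_attained(layer, iteration):
    observation_start.setdefault(layer, iteration)
    return iteration - observation_start[layer] >= 100

print(sequential_skip_control(["L6", "L9"], toy_attained, bd_iters=500))
```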
Next, variations of learning skip and threshold determination timing control executed by an information processing device 10 will be described.
[Variance Calculation According to BD Cycle]
For example, the distributed processing unit 30 starts weight variance calculation for L0 after the warm-up period ends, starts learning skip by BD control for L0 when the variance of the weight (W) attains the threshold, and then the BD period of L0 ends and the learning of L0 is stopped in the iteration "n+1000".
At this time, since the learning of L0 is stopped in the iteration "n+1000", the distributed processing unit 30 starts weight variance calculation for L3, which is the layer next to L0, from the iteration "n+1000" at the same timing. Then, the distributed processing unit 30 executes the weight variance calculation of L3 also in the iteration "n+2000", and starts learning skip by BD control for L3 when the variance of the weight (W) calculated in the iteration "n+3000" attains the threshold.
Thereafter, the BD period of L3 ends in the iteration "n+4000", and the distributed processing unit 30 stops learning of L3. At this time, since the learning of L3 is stopped in the iteration "n+4000", the distributed processing unit 30 starts weight variance calculation for L6, which is the layer next to L3, from the iteration "n+4000" at the same timing.
Then, the distributed processing unit 30 executes the weight variance calculation of L6 also in the iteration "n+5000", and starts learning skip by BD control for L6 when the variance of the weight (W) calculated in the iteration "n+6000" attains the threshold.
Thereafter, the BD period of L6 ends in the iteration "n+7000", and the distributed processing unit 30 stops learning of L6. At this time, since the learning of L6 is stopped in the iteration "n+7000", the distributed processing unit 30 starts weight variance calculation for L9, which is the layer next to L6, from the iteration "n+7000" at the same timing.
In this way, by adjusting the width of the BD to the variance calculation cycle, the distributed processing unit 30 can capture, through the variance calculation of L3, the influence that is actually reflected on L3 when the braking distance (BD) period of L0 ends and the learning of L0 stops. As a result, the distributed processing unit 30 reliably stops learning for each layer and thus can suppress accuracy deterioration.
Next, a variation in which the weight variance calculation and the learning skip are performed in units of blocks each including a plurality of layers will be described. Specifically, the distributed processing unit 30 starts variance calculation of the weight (W) of each layer of the block 1 in the iteration "n" at which the machine learning is started and the warm-up period ends. Then, when detecting that the weight variance of L9, L12, and L15 has attained the threshold in the weight variance of each layer calculated in the iteration "n+20", the distributed processing unit 30 stops subsequent weight variance calculation of L9, L12, and L15.
Thereafter, when detecting that the weight variance of L3 and L6 has attained the threshold in the weight variance of each layer (L0, L3, or L6) calculated in iteration “n+40”, the distributed processing unit 30 stops subsequent variance calculation of L3 and L6.
Thereafter, when detecting that the weight variance of L0 calculated in the iteration "n+60" has attained the threshold, the distributed processing unit 30 stops subsequent variance calculation of L0. As a result, since the weight variances of all the layers in the block 1 have attained the threshold, the distributed processing unit 30 executes learning skip for each layer of the block 1 from the next iteration "n+80" and starts weight variance calculation of each layer in the block 2.
In this way, the distributed processing unit 30 stops the weight variance calculation for a layer whose weight variance has attained the threshold, and can perform learning skip for the target layers in the case where all the plurality of layers as calculation targets have fallen within the threshold range. Furthermore, the distributed processing unit 30 can increase the number of layers to undergo learning skip at one time by varying the number of target layers included in a block. Therefore, the distributed processing unit 30 can simultaneously perform weight variance calculation for a plurality of layers from the input side of the machine learning model 14 (neural network) and increase the number of layers to undergo learning skip, and thus can implement speed-up of machine learning.
Next, a variation in which the block-unit learning skip is combined with the BD control will be described. Specifically, the distributed processing unit 30 starts variance calculation of the weight (W) of each layer of the block 1 in the iteration "n" at which the machine learning is started and the warm-up period ends. Then, when detecting that the weight variance of L9, L12, and L15 has attained the threshold in the weight variance of each layer calculated in the iteration "n+20", the distributed processing unit 30 stops subsequent weight variance calculation of L9, L12, and L15.
Thereafter, when detecting that the weight variance of L3 and L6 has attained the threshold in the weight variance of each layer (L0, L3, or L6) calculated in iteration “n+40”, the distributed processing unit 30 stops subsequent variance calculation of L3 and L6.
Thereafter, when detecting that the weight variance of L0 calculated in the iteration "n+60" has attained the threshold, the distributed processing unit 30 stops subsequent variance calculation of L0. As a result, since the weight variances of all the layers in the block 1 have attained the threshold, the distributed processing unit 30 executes learning skip for each layer of the block 1 from the next iteration "n+80" by BD control.
Then, the distributed processing unit 30 executes learning skip of gradually lowering the LR from the iteration "n+80" to "n+160", which is the BD period, the BD period ends in the iteration "n+180", and the learning skip of each layer in the block 1 is completed. Because the learning skip of each layer in the block 1 is completed in the iteration "n+180", the distributed processing unit 30 starts weight variance calculation of each layer in the block 2 in the iteration "n+180".
In this way, the distributed processing unit 30 stops the weight variance calculation for a layer whose weight variance has attained the threshold, and can set the BD period and perform learning skip for the target layers in the case where all the plurality of layers as calculation targets have fallen within the threshold range. Therefore, the distributed processing unit 30 can achieve both the speed-up of machine learning and suppression of accuracy deterioration.
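The block-unit behavior with BD control described above can be reproduced schematically as follows; the check interval of 20 iterations and the BD length of 100 iterations are read off from the iteration numbers in the description (with "n" treated as 0) and are assumptions of the sketch, as are the block composition and the function names.

```python
def block_skip_schedule(blocks, attained, check_interval=20, bd_iters=100, start_iter=0):
    """For each block (a list of layer names), the weight variance of every layer is
    checked at `check_interval`; a layer that has attained the threshold stops being
    checked, and once all layers of the block have attained it, the block enters a
    BD period, after which its learning stops and the next block starts being checked."""
    schedule = {}
    iteration = start_iter
    for index, block in enumerate(blocks, start=1):
        remaining = set(block)                     # layers still having their variance calculated
        while remaining:
            iteration += check_interval
            remaining = {layer for layer in remaining if not attained(layer, iteration)}
        bd_start = iteration + check_interval      # learning skip by BD control starts here
        schedule[index] = bd_start + bd_iters      # BD period ends; learning of the block stops
        iteration = schedule[index]                # the next block starts observation from here
    return schedule


# Toy: layers closer to the output attain the threshold earlier, as in the description above.
attain_at = {"L0": 60, "L3": 40, "L6": 40, "L9": 20, "L12": 20, "L15": 20}
def attained(layer, iteration):
    return iteration >= attain_at[layer]

print(block_skip_schedule([["L0", "L3", "L6", "L9", "L12", "L15"]]))   # {1: 180}
```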
While the examples of the present embodiment have been described above, the present embodiment may be implemented in various different modes in addition to the above-described examples.
[Numerical Values, Etc.]
The number of blocks, the number of layers, various thresholds, numerical values, the number of GPUs, or the like used in the above-described examples are merely examples, and can be arbitrarily changed. Furthermore, the learning rate can be not only decreased but also increased. Further, the determination of learning skip is not limited to each iteration, but can be performed for each epoch. Note that it is preferable that the same scheduler be used as the LR scheduler and as the scheduler that controls the attenuation of the learning rate during the BD period.
Furthermore, the weight variance calculation timing is not limited to each iteration, but can be arbitrarily set or changed for each predetermined number of training data, for each epoch, or the like. Furthermore, although the examples of using the weight variance as the weight information have been described, variance of the weight gradient can also be used.
[Block Control, Etc.]
For example, in the above-described example, it is possible to determine whether a layer is the skip target according to whether the error gradient of the final layer among the layers belonging to the block or an average value of the weight variance of the layers belonging to the block is less than the threshold.
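One way this block-level criterion might be expressed is sketched below with NumPy; summarizing the final layer's error gradient by its mean absolute value, and sharing a single threshold between both criteria, are assumptions of the sketch.

```python
import numpy as np

def block_is_skip_target(weights_by_layer, threshold, final_layer_error_gradient=None):
    """Determines whether a block is the skip target, using either the error gradient of
    the final layer of the block or the average of the weight variances of its layers."""
    if final_layer_error_gradient is not None:
        # Summarize the final layer's error gradient (assumption: mean absolute value).
        return float(np.abs(final_layer_error_gradient).mean()) < threshold
    average_variance = float(np.mean([np.var(w) for w in weights_by_layer]))
    return average_variance < threshold

# Example: three layers' weight tensors with small spread fall below the threshold.
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.01, size=(8, 8)) for _ in range(3)]
print(block_is_skip_target(weights, threshold=1e-3))   # True (variances are about 1e-4)
```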
In the above-described examples, the same BD period can be set for each layer or each block, or different BD periods can be set. For example, a block close to the output layer, in which machine learning is stopped in a state where the machine learning has progressed, can be given a BD period shorter than that of a block close to the input layer, in which the machine learning is stopped at a relatively early stage.
[System]
Pieces of information including a processing procedure, a control procedure, a specific name, various types of data, and parameters described above or illustrated in the drawings may be optionally changed unless otherwise specified.
Furthermore, each component of each device illustrated in the drawings is functionally conceptual, and is not always physically configured as illustrated in the drawings. For example, specific forms of distribution and integration of individual devices are not limited to those illustrated in the drawings. For example, all or a part of the devices may be configured by being functionally or physically distributed or integrated in optional units depending on various loads, use status, or the like.
Moreover, all or an optional part of individual processing functions performed in each device may be implemented by a central processing unit (CPU) and a program analyzed and executed by the CPU or may be implemented as hardware by wired logic.
[Hardware]
Next, a hardware configuration example of the information processing device 10 described in the above examples will be described.
As hardware, the information processing device 10 includes, for example, a communication device 10a, a hard disk drive (HDD) 10b, a memory 10c, a CPU 10d, and GPUs 10e. The communication device 10a is a network interface card or the like, and communicates with another server. The HDD 10b stores a program that activates the above-described functions.
The CPU 10d controls the entire information processing device 10. Furthermore, each GPU 10e, which operates each process of machine learning, reads a program regarding machine learning from the HDD 10b or the like, expands the program to the memory 10c, and operates a process for executing each of the above-described functions.
As described above, the information processing device 10 operates as an information processing device that executes various processing methods by reading and executing the program. Furthermore, the information processing device 10 may also implement functions similar to the functions of the above-described examples by reading the above-described program from a recording medium by a medium reading device and executing the above-described read program. Note that the program referred to in other examples is not limited to being executed by the information processing device 10. For example, the present embodiment may be similarly applied also to a case where another computer or server executes the program, or a case where these computer and server cooperatively execute the program.
This program may be distributed via a network such as the Internet. Furthermore, this program may be recorded in a computer-readable recording medium such as a hard disk, flexible disk (FD), compact disc read only memory (CD-ROM), magneto-optical disk (MO), or digital versatile disc (DVD), and may be executed by being read from the recording medium by a computer.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.