This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2021-080210, filed on May 11, 2021, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to a storage medium, an information processing method, and an information processing device.
Various training methods for machine learning models are being studied. For example, there is a technique of stopping update of weight information of a layer for which training of a machine learning model has progressed to some extent. In the following description, this technique will be referred to as “existing technique”. Furthermore, stopping the update of target weight information is referred to as “skip”.
In the existing technique, “Forward Propagation” and “Backward Propagation” are executed for all the layers 1-0 to 1-6 in a stage from the start of training for the machine learning model to before the training progresses to some extent, and the weight information for all the layers 1-0 to 1-6 is updated.
In the existing technique, in the stage where the training of the machine learning model has progressed to some extent, the update of the weight information for the layers in which the training has progressed is skipped in order from the input-side layer. If the update is skipped from the output-side layer, the training accuracy does not reach the target accuracy, but if the update is skipped from the input-side layer, the training accuracy can be improved. In the example illustrated in
Assuming that the total processing amount of Forward Propagation is “1”, the processing amount of Backward Propagation is “2”. For example, in a state where Backward Propagation is not performed at all, the processing speed is tripled, which is the limit of the speedup.
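The stated limit can be checked with simple arithmetic; the relative processing amounts (Forward Propagation = 1, Backward Propagation = 2) are those given above.

```python
# Relative processing amounts given in the text.
fp = 1                      # Forward Propagation
bp = 2                      # Backward Propagation

full_iteration = fp + bp    # a normal iteration executes both passes
fp_only = fp                # Backward Propagation skipped for every layer

speedup_limit = full_iteration / fp_only
print(speedup_limit)        # 3.0 -> the processing speed is at most tripled
```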
Graphs G1 and G2 in
The weight difference indicates a difference between the weight information in the case where the (n−1)th iteration has been executed and the weight information in the case where the nth iteration has been executed. A layer with a large weight difference indicates that the layer has been trained. A layer with the weight difference less than a threshold value indicates that the layer has not been trained.
In the example illustrated in the graph G1, the weight difference is equal to or larger than a threshold value Th in all the layers (for example, the 0th to 158th layers), and all the layers have been trained. In the example illustrated in the graph G2, the weight differences of the input-side layers Ls 1-1 are less than the threshold value and have not been trained. On the other hand, the weight differences of the output-side layers Ls 1-2 are equal to or larger than the threshold value and have been trained.
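The trained/not-trained classification by weight difference can be sketched as follows. The Euclidean norm used here is an assumption for illustration; the embodiment's exact scalar conversion is given by equation (1) below.

```python
import math

def weight_diff_norm(w_prev, w_curr):
    """Euclidean norm of the element-wise weight difference between the
    (n-1)th and nth iterations (the norm choice is an assumption)."""
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(w_prev, w_curr)))

def classify_layers(prev_weights, curr_weights, threshold):
    """True = the layer has been trained (difference >= threshold),
    following the classification described in the text."""
    return [weight_diff_norm(p, c) >= threshold
            for p, c in zip(prev_weights, curr_weights)]

# Layer 0 barely moved (not trained); layer 1 moved substantially (trained).
prev = [[0.0, 0.0], [0.0, 0.0]]
curr = [[0.001, 0.001], [1.0, 1.0]]
print(classify_layers(prev, curr, threshold=0.5))  # [False, True]
```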
In the existing technique, a calculation amount and a communication amount for calculating an error gradient are reduced by skipping the processing of updating the weight information of the input-side layers Ls 1-1. For example, as illustrated in graph G2′, a processing amount 2-1 needed for normal one iteration becomes a processing amount 2-2, and a processing amount 2-3 is reduced. In other words, the reduction effect per epoch is also the processing amount 2-3. Note that, as will be described below, to specify the progress of training in each layer of the machine learning model, a norm of the weight of each layer is calculated.
U.S. Patent Application Publication No. 2020/0380365 and U.S. Patent Application Publication No. 2020/0285992 are disclosed as related art.
According to an aspect of the embodiments, a non-transitory computer-readable storage medium storing an information processing program that causes at least one computer to execute a process, the process includes acquiring a value that indicates a progress status of training for an input-side layer among a plurality of layers included in a machine learning model; and when the value is more than or equal to a threshold value, repeating acquiring each value for a plurality of layers that follows the input-side layer.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
The above-described existing technique has a problem that calculation cost is high.
In the existing technique, the norm of the weight of each layer is calculated in order to specify the progress of training of each layer of the machine learning model. Meanwhile, in the case of updating the weight information of all the layers regardless of the progress of training for the machine learning model, the norm calculation of the weight of each layer is not needed.
Here, the technique of updating the weight information of all the layers at all times regardless of the progress of training for the machine learning model will be referred to as “another existing technique” in order to be distinguished from the existing technique (gradient skip technique).
Line 5a illustrates the relationship between the processing amount and epoch of the another existing technique. Line 5b illustrates the relationship between the processing amount and epoch of the existing technique.
In the existing technique, the norm of the weight of the layers 150 is calculated for each iteration in the 50 convolutional layers (hereinafter Conv layers) in the period of about 0 to 40 epochs. Therefore, as illustrated in (1) of
Next, in the existing technique, in the period of about 40 to 50 epochs, the weight differences of 30 layers (10 layers of the Conv layers) gradually reach the threshold value, the number of layers to be skipped gradually increases, and the norm calculation is skipped. As the calculation amount of the norm decreases in this manner, the processing amount gradually increases in the existing technique, as illustrated in (2) of
Next, in the existing technique, it is possible to skip 30 layers during the transition period up to 50 epochs, which reduces the calculation amount, but the norm calculation for the remaining 40 Conv layers for the next skip remains. The processing amount is about 9200 samples/sec. As illustrated in (3) of
As described in
In one aspect, an object of the present embodiment is to provide an information processing program, an information processing method, and an information processing device capable of reducing the calculation cost.
The calculation cost can be reduced.
Embodiments of an information processing program, an information processing method, and an information processing device disclosed in the present application are hereinafter described in detail with reference to the drawings. Note that the present invention is not limited to these embodiments.
[Embodiment]
An information processing device according to the present embodiment calculates a weight difference and specifies whether training of a target layer of a machine learning model has progressed. In the following description, among a plurality of layers included in the machine learning model, the layer for which the weight difference is to be calculated is appropriately referred to as “target layer”.
The weight difference is defined by the equation (1). The subscript “I” in the equation (1) corresponds to the number of iterations. For example, “WI+1−WI” indicates the weight difference between weight information of the (I+1)th iteration and weight information of the Ith iteration. In the equation (1), constants are preset for “LR”, “Decay”, and “mom”.
[Math. 1]

WI+1−WI=LR×ΔWI−(WI×LR×Decay)+mom×VI−1 (1)
ΔWI in the equation (1) represents a difference between a weight of the previous iteration and a weight of the iteration of this time in tensor in the target layer. WI in the equation (1) represents the weight updated by the iteration of this time in tensor in the target layer. VI−1 is a tensor indicating momentum. For example, the equation (2) defines a relationship between VI and VI−1.
[Math. 2]
momentum=mom×VI=mom×(LR×ΔWI−(WI×LR×Decay)+mom×VI−1) (2)
The information processing device calculates a norm (g_weight_norm) of ΔWI, a norm (weight_norm) of WI, and a norm (momentum_norm) of VI, respectively, in order to convert the value of the equation (1) into a scalar value comparable to a threshold value. The norm (g_weight_norm) of ΔWI is calculated by the equation (3). The norm of WI is calculated by the equation (4). The norm of VI is calculated by the equation (5).
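As a sketch, the scalar conversion can be written as follows. The constant values and the use of the Euclidean norm for equations (3) to (5) are assumptions, since only the names of the norms are given above.

```python
import math

# The document states that LR, Decay, and mom are preset constants;
# these particular values are hypothetical.
LR, DECAY, MOM = 0.1, 1e-4, 0.9

def norm(t):
    """Euclidean norm of a flat tensor (stands in for the norms of
    equations (3), (4), and (5), whose exact forms are not reproduced)."""
    return math.sqrt(sum(x * x for x in t))

def weight_difference(delta_w, w, v_prev):
    """Scalar weight difference following equation (1):
    W_{I+1} - W_I = LR x dW_I - (W_I x LR x Decay) + mom x V_{I-1},
    with each tensor term reduced to its norm (g_weight_norm, weight_norm,
    momentum_norm) so that the result is comparable with a threshold."""
    return LR * norm(delta_w) - norm(w) * LR * DECAY + MOM * norm(v_prev)

# Example: |dW| = 5, zero weight and momentum -> difference = 0.1 * 5.
print(weight_difference([3.0, 4.0], [0.0, 0.0], [0.0, 0.0]))  # 0.5
```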
A threshold value is set for each layer, and in a case where the weight difference has reached the threshold value, calculation of the weight difference of the target layer is skipped. For example, in L6, the weight difference reaches a threshold value Th6 in 10 epochs. In L36, the weight difference reaches a threshold value Th36 in 22 epochs. In L75, the weight difference reaches a threshold value Th75 in 35 epochs. In L132, the weight difference reaches a threshold value Th132 in 46 epochs. In other words, the weight difference reaches the threshold value from the input-side layer.
Next, an example of processing of the information processing device according to the present embodiment will be described. In the information processing device according to the present embodiment, the initial target layer is only one layer. Next, after the weight difference of the target layer has reached the threshold value, the information processing device repeatedly executes the processing of calculating the weight difference for a plurality of layers as the target layers, following the layer with the weight difference having reached the threshold value. In the present embodiment, the state where the value of the weight difference has become less than the threshold value is described as the weight difference having reached the threshold value.
The information processing device inputs training data into the machine learning model, executes Forward Propagation and Backward Propagation, and starts training the machine learning model. As illustrated in
The norm calculation (initial norm calculation) in 1 epoch will be described. The information processing device starts the norm calculation for L0 as the target layer.
The norm calculation in 2 epochs to (n−1) epochs will be described. The information processing device continues the norm calculation for L0 as the target layer.
The norm calculation in n epochs will be described. n is a natural number. When specifying that the weight difference of L0 has reached the threshold value, the information processing device starts the norm calculation for the three layers “L3, L6, and L9” on the output side with respect to the layer that has reached the threshold value. At the stage of n epochs, training of each layer has progressed to some extent.
In the example illustrated in
The description returns to the description of
The norm calculation in (n+2) epochs will be described. The information processing device skips the norm calculation of L0, L3, and L6. When specifying that the weight difference of L9 has reached the threshold value, the information processing device starts the norm calculation for “L18” on the output side with respect to the layer that has reached the threshold value. The information processing device continues the norm calculation for L12 and L15.
The norm calculation in (n+3) epochs will be described. The information processing device skips the norm calculation of L0, L3, L6, and L9. When specifying that the weight difference of L18 has reached the threshold value, the information processing device starts the norm calculation for “L21” on the output side with respect to the layer that has reached the threshold value.
The norm calculation in (n+4) epochs will be described. The information processing device skips the norm calculation of L0, L3, L6, and L9. The information processing device waits for stop of the norm calculation of L18 because the norm calculation of the layers L12 and L15 on the input side with respect to L18 with the weight difference having reached the threshold value has not been skipped. When specifying that the weight differences of L12 and L15 have reached the threshold value, the information processing device starts the norm calculation for “L24 and L27” on the output side with respect to the layers that have reached the threshold value. The information processing device continues the norm calculation for L21.
The norm calculation in (n+5) epochs will be described. The information processing device skips the norm calculation of L0, L3, L6, L9, L12, L15, and L18. The information processing device continues the norm calculation for L21, L24, and L27. Description of the norm calculation in (n+6) epochs is omitted.
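The epoch-by-epoch selection described above can be sketched as the following bookkeeping step. Layers are represented by integer indices (input side = 0), and m = 3 and the layer numbers follow the example; the function itself is a simplified illustration, not the embodiment's exact implementation.

```python
def step(active, waiting, skipped, reached, next_new, m=3):
    """One epoch of target-layer bookkeeping (a simplified sketch).

    active:   sorted list of layers under norm calculation.
    waiting:  layers whose weight difference has reached the threshold but
              whose skip waits for input-side layers.
    skipped:  layers for which the norm calculation/update is skipped.
    reached:  set of active layers that reached the threshold this epoch.
    next_new: iterator yielding fresh output-side layers.
    """
    for layer in sorted(reached):
        active.remove(layer)            # its norm calculation stops
        waiting.append(layer)
    if reached:                         # top up toward m norm calculations
        while len(active) < m:
            try:
                active.append(next(next_new))
            except StopIteration:
                break
    waiting.sort()
    # Skip strictly in order from the input side: a waiting layer is
    # skipped only when no active layer remains on its input side.
    while waiting and (not active or waiting[0] < min(active)):
        skipped.append(waiting.pop(0))
    return active, waiting, skipped

# Replay the example of n to (n+4) epochs (layer Lk -> index k).
new_layers = iter([3, 6, 9, 12, 15, 18, 21, 24, 27])
active, waiting, skipped = [0], [], []
for reached in [{0}, {3, 6}, {9}, {18}, {12, 15}]:
    active, waiting, skipped = step(active, waiting, skipped,
                                    reached, new_layers)
print(active)   # [21, 24, 27]
print(skipped)  # [0, 3, 6, 9, 12, 15, 18]
```

Note that at (n+3) epochs L18 stays in the waiting list because L12 and L15 are still active on its input side, matching the wait described above.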
As described above, the information processing device according to the present embodiment can narrow down the target layer for which the norm calculation is to be executed and reduce the calculation cost when training the machine learning model.
Next, one example of a configuration of the information processing device according to the present embodiment will be described.
The communication unit 110 receives various data from an external device via a network. The communication unit 110 is an example of a communication device. For example, the communication unit 110 may also receive training data 141 or the like, which will be described below, from an external device.
The input unit 120 is an input device that inputs various types of information to the control unit 150 of the information processing device 100. The input unit 120 corresponds to a keyboard, a mouse, a touch panel, and the like.
The display unit 130 is a display device that displays information output from the control unit 150.
The storage unit 140 has training data 141 and a machine learning model 142. The storage unit 140 corresponds to a semiconductor memory element such as a random access memory (RAM) or a flash memory, or a storage device such as a hard disk drive (HDD).
The training data 141 is data used when training of the machine learning model 142 is executed. For example, the training data 141 has a plurality of pairs of input data and correct answer data.
The machine learning model 142 is model data corresponding to a neural network having a plurality of layers.
The control unit 150 includes a forward propagation (FP) processing unit 151, a backward propagation (BP) processing unit 152, and a selection unit 153. The control unit 150 is implemented by a central processing unit (CPU), a graphics processing unit (GPU), a hard-wired logic such as an application specific integrated circuit (ASIC) or a field programmable gate array (FPGA), or the like.
The FP processing unit 151 executes Forward Propagation for the machine learning model 142. For example, the FP processing unit 151 inputs input data of the training data 141 to an input layer of the machine learning model 142, and calculates an output value of the machine learning model 142. The FP processing unit 151 outputs output value information to the BP processing unit 152.
The BP processing unit 152 executes Backward Propagation for the machine learning model 142. For example, the BP processing unit 152 calculates an error between an output value output from an output layer of the machine learning model 142 and the correct answer data of the training data 141, and updates the weight information of each layer of the machine learning model 142 so that the error becomes small by error back propagation.
Furthermore, when receiving a notification of the target layer from the selection unit 153 to be described below, the BP processing unit 152 executes the norm calculation for the target layer among all the layers included in the machine learning model 142. For example, the BP processing unit 152 calculates the norm of ΔWI, the norm of WI, and the norm of VI on the basis of the above equations (3), (4), and (5), and outputs a calculation result of each target layer to the selection unit 153.
When receiving selection of a layer for which the norm calculation is to be skipped from the selection unit 153, the BP processing unit 152 skips the norm calculation for the target layer for which the selection is accepted. Furthermore, the BP processing unit 152 stops the error propagation of the target layer for which the norm calculation is to be skipped and layers on the input side with respect to the target layer.
The selection unit 153 selects the target layer for which the norm calculation is to be executed on the basis of the result of the norm calculation output from the BP processing unit 152, and notifies the BP processing unit 152 of the selected target layer. The selection unit 153 sets only one layer as an initial target layer. For example, the selection unit 153 selects L0 as the initial target layer and outputs the selected target layer to the BP processing unit 152.
When acquiring the calculation result of the norm calculation of the target layer from the BP processing unit 152, the selection unit 153 calculates the weight difference of the target layer on the basis of the equation (1) and determines whether the weight difference has reached the threshold value. In the case where the weight difference of the target layer has reached the threshold value, the selection unit 153 notifies the BP processing unit 152 that the norm calculation for the target layer with the weight difference having reached the threshold value is skipped.
In the case where the weight difference of the initially selected L0 has reached the threshold value, the selection unit 153 selects a plurality of layers (L3, L6, and L9) on the output side with respect to L0 as target layers and outputs the target layers to the BP processing unit 152, as described in
By the way, the selection unit 153 waits for skipping L18 in the case where the weight difference of L18 has reached the threshold value and the weight differences of L12 and L15 have not reached the threshold value among the plurality of layers for which the norm calculation is performed, as described in (n+3) epochs of
Next, an example of a processing procedure of the information processing device 100 according to the present embodiment will be described.
The selection unit 153 of the information processing device 100 selects input-side one layer of the machine learning model 142 as the target layer (step S102). The BP processing unit 152 executes the norm calculation of the target layer (step S103).
The selection unit 153 specifies whether the weight difference has reached the threshold value on the basis of the result of the norm calculation (step S104). In the case where the target layer with the weight difference having reached the threshold value is present (step S105, Yes), the selection unit 153 moves onto step S106. On the other hand, in the case where the target layer with the weight difference having reached the threshold value is not present (step S105, No), the selection unit 153 moves onto step S108.
The BP processing unit 152 skips the norm calculation of the target layer with the weight difference having reached the threshold value (step S106). The selection unit 153 selects the target layers so that the number of layers for which the norm calculation is to be executed is M (for example, three) (step S107).
In the case of terminating the training (step S108, Yes), the information processing device 100 terminates the processing. On the other hand, in the case of not terminating the training (step S108, No), the information processing device 100 proceeds to the training of the next epoch (step S109) and proceeds to step S103.
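Steps S102 to S109 can be sketched as the following loop. The per-layer training signal is purely illustrative (a hypothetical stand-in for the norm calculation and threshold check), and the selection is simplified to "replace each finished target by the next output-side layer" without the in-order waiting described earlier.

```python
# Hypothetical training signal: layer k's weight difference reaches its
# threshold at epoch 10 + k (an assumption for illustration only).
def reached_threshold(layer, epoch):
    return epoch >= 10 + layer

def train(layers, m=3, max_epochs=40):
    """Loop mirroring steps S102-S109 of the processing procedure."""
    targets = [layers[0]]                 # S102: one input-side target layer
    fresh = iter(layers[1:])
    skipped = []
    for epoch in range(max_epochs):       # S109: proceed to the next epoch
        # S103-S105: norm calculation and threshold check per target.
        done = [l for l in targets if reached_threshold(l, epoch)]
        for l in done:
            targets.remove(l)             # S106: skip its norm calculation
            skipped.append(l)
        while len(targets) < m and done:  # S107: keep m targets running
            try:
                targets.append(next(fresh))
            except StopIteration:
                break
        if not targets:                   # S108: nothing left to monitor
            break
    return skipped

# Ten layers L0, L3, ..., L27 are all eventually skipped, input side first.
print(train(list(range(0, 30, 3))))
```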
Next, effects of the information processing device 100 according to the present embodiment will be described. The information processing device 100 executes the norm calculation for only one layer as the initial target layer. Next, after the weight difference of the target layer has reached the threshold value, the information processing device 100 repeatedly executes the processing of calculating the weight difference for a plurality of layers as the target layers, following the layer with the weight difference having reached the threshold value. In this way, the information processing device 100 can reduce the calculation cost by narrowing down the target layers for which the norm calculation is to be executed.
The information processing device 100 can further reduce the calculation cost by skipping the norm calculation for the target layer with the weight difference having reached the threshold value.
In the case where values of a first layer and a second layer on the output side with respect to the first layer have reached the threshold value, or in the case where the calculation of the value for the first layer has been skipped (stopped) and the value of the second layer has reached the threshold value, among a plurality of layers, the information processing device 100 skips the calculation of the value for the second layer. Furthermore, in the case where the value of the second layer has reached the threshold value and the value of the first layer has not reached the threshold value, the information processing device 100 continues (waits for skipping) the calculation of the value for the second layer until the value of the first layer reaches the threshold value. As a result, it is possible to skip the layers in order from an input-side layer, and the training accuracy of the machine learning model 142 can be improved.
The vertical axis of graph G30 in
From 0 to about 50 epochs will be discussed. Compared with the existing technique (line 6b), the number of target layers (line 6c) of the information processing device 100 is 1/50. Thereby, the processing amount (line 5c) of the information processing device 100 is comparable with the processing amount (line 5a) of the another existing technique.
50 epochs and subsequent epochs will be discussed. In the information processing device 100, as the weight difference of each layer sequentially reaches the threshold value and skipping is started, the processing amount (line 5c) increases due to the reduction in the calculation amount for the error gradient calculation and the stop of back propagation. Even after skipping, since the total number of norm calculations is kept at three, the speed is not reduced by the processing for the 37 layers, and the processing amount is improved.
In the example illustrated in
By the way, in a case where the layers of the machine learning model 142 are divided into four stages (four types) according to the number of elements, any layer (the last layer L36, L75, or L132 of each stage) may also be selected for each stage and the norm calculation may also be performed.
Next, an example of a hardware configuration of a computer that implements functions similar to the information processing device 100 described in the above embodiment will be described.
As illustrated in
The hard disk device 207 has an FP processing program 207a, a BP processing program 207b, and a selection program 207c. The CPU 201 reads the FP processing program 207a, the BP processing program 207b, and the selection program 207c and expands them in the RAM 206.
The FP processing program 207a functions as an FP processing process 206a. The BP processing program 207b functions as a BP processing process 206b. The selection program 207c functions as a selection process 206c.
Processing of the FP processing process 206a corresponds to the processing of the FP processing unit 151. Processing of the BP processing process 206b corresponds to the processing of the BP processing unit 152. Processing of the selection process 206c corresponds to the processing of the selection unit 153.
Note that the programs 207a to 207c do not need to be stored in the hard disk device 207 beforehand. For example, the programs are stored in a “portable physical medium” such as a flexible disk (FD), a compact disc read only memory (CD-ROM), a digital versatile disc (DVD) disk, a magneto-optical disk, or an integrated circuit (IC) card to be inserted in the computer 200. Then, the computer 200 may also read the programs 207a to 207c and execute the programs.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2021-080210 | May 2021 | JP | national |