This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2020-185986, filed on Nov. 6, 2020, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to a computer-readable recording medium storing therein a machine learning program, a machine learning method, and an information processing apparatus.
In order to speed up machine learning of a machine learning model, it is effective to use a graphics processing unit (GPU), and it is also effective to distribute processing by using a plurality of GPUs. The speed of machine learning processing has been increased by mounting a plurality of GPUs in a node, which is an information processing apparatus, and executing the machine learning processing in parallel within the node. However, the Allreduce processing and the reflection processing of gradient information between the GPUs take a non-negligible amount of time.
In the machine learning of the above-mentioned machine learning model, since a portion that is newly being learned has to be updated frequently every time learning is performed, it is desirable to set a relatively high learning rate (LR) for determining the update frequency. On the other hand, in a portion in which machine learning has already been completed, the closer the portion is to the input side, the lower the learning rate becomes, and it may even become 0 in an extreme case. In a portion where the learning rate is 0, the machine learning processing does not have to be performed; nevertheless, processing such as the Allreduce processing, the reflection processing of gradient information, and the weight calculation processing is performed in vain at the same frequency as in the portion that is newly being learned.
For this reason, in recent years, the Gradient Skip technique has been used, in which a layer that does not require machine learning is identified and the gradient information (Δw) calculation, the Allreduce processing, and the like are skipped for that layer.
Examples of the related art include the following: Japanese Laid-open Patent Publication No. 4-232562; International Publication Pamphlet No. WO 2019/167665; U.S. Pat. Nos. 9,047,566 and 5,243,688.
However, with the above technique, although the speed may be increased by skipping the machine learning, the accuracy of the machine learning may deteriorate depending on the layer to be skipped, the skip timing, or the like, so that the machine learning may end without reaching the target accuracy.
In one aspect, it is an object to provide a computer-readable recording medium storing therein a machine learning program, a machine learning method, and an information processing apparatus that are capable of shortening the processing time of machine learning to reach the target accuracy.
According to an aspect of the embodiments, a computer-implemented method includes: calculating error gradients with respect to a plurality of layers included in a machine learning model at a time of machine learning of the machine learning model, the plurality of layers including an input layer of the machine learning model; specifying, as a layer to be suppressed, a layer located in a range from a position of the input layer to a predetermined position among the layers in which the error gradient is less than a threshold; and suppressing the machine learning for the layer to be suppressed.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
According to an aspect of the embodiments disclosed below, there is provided a solution to shorten the processing time of machine learning to reach the target accuracy.
Hereinafter, embodiments of a computer-readable recording medium storing a machine learning program, a machine learning method, and an information processing apparatus disclosed herein will be described in detail based on the drawings. These embodiments do not limit the present disclosure. The embodiments may be combined with each other as appropriate within the scope without contradiction.
[Overall Configuration]
An information processing apparatus 10 according to Embodiment 1 generates a machine learning model by distributed processing using a plurality of GPUs in order to achieve high speed machine learning processing.
In recent years, the Gradient Skip technique has been used, in which a layer that does not require machine learning is determined by using the learning rate of each layer and learning for that layer is suppressed (skipped) without performing the gradient information (Δw) calculation, the Allreduce processing, and the like.
A reference technique of learning skipping will be described below.
For example, the reference technique detects a layer in which the learning rate indicating the progress of learning has decreased, and omits the learning for that layer, thereby shortening the learning time. For example, learning is executed as usual in the next iteration for each layer in which the difference between the error gradient in the current iteration and the error gradient in the immediately preceding iteration is equal to or larger than a threshold, while learning is skipped in the next iteration for each layer in which the difference is smaller than the threshold. For example, for a layer in which the learning rate has decreased, subsequent machine learning processing such as the calculation of the error gradient is suppressed.
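As an illustrative sketch only, the per-layer skip decision of the reference technique may be expressed as follows in Python; the layer names, the threshold value, and the dictionary-based bookkeeping are assumptions made for explanation and are not part of the reference technique itself.

```python
# Sketch of the reference technique: a layer whose error-gradient change
# between consecutive iterations falls below a threshold is skipped in the
# next iteration. Names and the threshold are illustrative assumptions.
import numpy as np

THRESHOLD = 1e-3  # assumed value for illustration

def update_skip_flags(prev_grads, curr_grads, skip_flags):
    """Mark layers whose gradient barely changed as 'skip in next iteration'."""
    for layer, grad in curr_grads.items():
        if layer not in prev_grads:
            continue  # first iteration: nothing to compare against
        diff = np.abs(grad - prev_grads[layer]).mean()
        skip_flags[layer] = diff < THRESHOLD
    return skip_flags

# usage sketch
prev = {"conv1": np.array([0.10, 0.20]), "conv2": np.array([0.50, 0.40])}
curr = {"conv1": np.array([0.10, 0.20]), "conv2": np.array([0.30, 0.10])}
flags = update_skip_flags(prev, curr, {})
print(flags)  # conv1 skipped (no change), conv2 still trained
```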
However, with the reference technique, the influence on accuracy of completely skipping the machine learning is not known for some portions of the model. For example, the accuracy of the machine learning may deteriorate depending on the layer in which the learning is skipped, the skip timing, or the like, and as a result the target accuracy may not be reached.
In contrast, the information processing apparatus 10 according to Embodiment 1 determines whether the layer selected as a skip candidate based on the difference between the error gradient in the current iteration and the error gradient in the immediately preceding iteration is a layer truly satisfying the condition that allows the skipping, thereby achieving both the suppression of deterioration in accuracy and the shortening of the learning time.
An example of learning skipping used in Embodiment 1 will be described below.
In this way, the information processing apparatus 10 is able to reduce not only the error gradient calculation but also the error backward propagation calculation, thereby making it possible to reduce the amount of calculation and also making it possible to shorten the learning time to reach the target accuracy.
[Functional Configuration]
The communication unit 11 is a processing unit that controls communication with other apparatuses and is achieved by, for example, a communication interface. For example, the communication unit 11 transmits and receives various data, various instructions, and the like to and from an administrator terminal.
The storage unit 12 is a processing unit that stores various types of data, various programs, and the like, and is implemented by, for example, a memory, a hard disk, or the like. The storage unit 12 stores a training data DB 13 and a machine learning model 14.
The training data DB 13 is an example of a data set configured to store training data used for machine learning of the machine learning model 14. For example, each piece of training data stored in the training data DB 13 includes image data and a teacher label. The data set of the training data may be divided into subsets (batch sizes) in arbitrary units.
The machine learning model 14 is a model generated by machine learning such as DL, and is an example of a model using a multilayer neural network constituted by a plurality of layers. For example, when image data is input, the machine learning model 14 performs classification of animals in the image. The machine learning model 14 may employ a deep neural network (DNN), a convolutional neural network (CNN), or the like.
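The disclosure does not fix a particular framework or architecture for the machine learning model 14; purely as an illustration, a small image classifier of the CNN type mentioned above might be sketched in PyTorch as follows, where the framework choice, layer sizes, and number of classes are assumptions.

```python
# Minimal sketch of a multilayer CNN classifier such as machine learning
# model 14 might be; the framework (PyTorch), layer sizes, and class count
# are assumptions for illustration only.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = SmallCNN()
logits = model(torch.randn(2, 3, 32, 32))  # two dummy RGB images
print(logits.shape)  # torch.Size([2, 10])
```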
The integration processing unit 20 is a processing unit that supervises the overall information processing apparatus 10, and is implemented by, for example, a CPU. For example, the integration processing unit 20 instructs each of the distributed processing units 30 to execute the distributed processing of machine learning, start or stop the machine learning, and the like, and executes the overall control related to the machine learning.
Each distributed processing unit 30 is a processing unit configured to execute the distributed processing of the machine learning, and is implemented by, for example, a GPU. For example, each distributed processing unit 30 generates the machine learning model 14 by machine learning using the training data stored in the training data DB 13 in response to an instruction from the integration processing unit 20.
[Details of Distributed Processing Unit]
Next, details of each distributed processing unit 30 will be described. Each of the distributed processing units 30 has the same configuration.
The forward propagation processing unit 31 is a processing unit configured to perform forward propagation processing on each layer of the machine learning model 14. The forward propagation processing unit 31 executes so-called forward propagation, and therefore detailed description thereof is omitted. For example, the forward propagation processing unit 31 inputs image data, which is training data, to the front layer (input layer) of the machine learning model 14, and acquires, from the output layer, a prediction result (classification result) obtained by numerical calculations performed successively from the input layer toward the last layer (output layer) of the machine learning model 14. The forward propagation processing unit 31 then calculates an error function, such as the squared error, between the prediction result and the teacher label, and outputs the calculated error function to the error backward propagation processing unit 32.
The error backward propagation processing unit 32 is a processing unit that includes an error gradient calculator 33 and a communication controller 34, calculates an error of each of parameters of the machine learning model 14 by an error backward propagation method using the error function that is input from the forward propagation processing unit 31, and updates the parameters. The error backward propagation processing unit 32 executes so-called backward propagation, for example.
For example, the error backward propagation processing unit 32 calculates an error gradient of a weight of an edge between nodes in the neural network in order (reverse order) from the output layer toward the input layer of the machine learning model 14. The error gradient corresponds to a value obtained by partially differentiating the error with respect to the weight when the error is regarded as a function of the weight, and represents the change amount of the error when the weight of the edge is changed by a minute amount. The error backward propagation processing unit 32 updates each parameter, such as the weight of each edge, so as to reduce the error by using the error gradient.
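As a worked one-layer example only, in which the layer shape and the learning rate are assumptions, the relationship between the error gradient and the weight update described above can be written out as follows: the gradient is the partial derivative of the squared error with respect to the weight, and the weight is moved a small step in the direction that reduces the error.

```python
# One-layer worked example of the error gradient and the weight update
# described above: the gradient is dE/dw for a squared error E, and the
# weight is moved a small step in the direction that reduces E.
# Layer sizes and the learning rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # batch of 4 inputs, 3 features
w = rng.normal(size=(3, 2))          # weights of a single linear layer
t = rng.normal(size=(4, 2))          # teacher values

y = x @ w                            # forward propagation
error = 0.5 * np.sum((y - t) ** 2)   # squared-error function

grad_w = x.T @ (y - t)               # dE/dw by the chain rule (backprop)

lr = 0.01                            # assumed learning rate
w -= lr * grad_w                     # update so as to reduce the error

print(error, 0.5 * np.sum((x @ w - t) ** 2))  # error decreases
```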
The error gradient calculator 33 is a processing unit configured to calculate an error gradient indicating a gradient of an error with respect to each parameter of the machine learning model 14, for each of the plurality of layers included in the machine learning model 14. For example, the error gradient calculator 33 calculates an error gradient with respect to each layer included in the machine learning model 14 for each iteration, and outputs error gradient information regarding the error gradient to the candidate extraction unit 35.
In the error gradient calculation, the error gradient calculator 33 suppresses the error gradient calculation for the layer determined as the layer in which the learning is suppressed (learning skipping layer). The error gradient calculator 33 may take only the last layer at the position farthest from the input layer in each of blocks to be described later, as an error gradient calculation target. Various known methods may be employed as the method of calculating the error gradient.
The communication controller 34 is a processing unit configured to perform Allreduce communication between the GPUs. For example, the communication controller 34 transmits and receives the error gradients between the GPUs to thereby sum the error gradients calculated by the plurality of GPUs for each weight of the edge, and aggregates the error gradients between the plurality of GPUs. By using the information regarding the error gradients aggregated as described above, the error backward propagation processing unit 32 updates various parameters of the machine learning model 14.
The communication controller 34 stops the communication of the target layer determined as a skip target by the determination unit 36 to be described later. The communication controller 34 specifies, from among the layers of the machine learning model 14, a layer where the error gradient calculation and the communication (Allreduce) are to be continued without stopping the learning and a layer where the learning is to be stopped, and controls the communication.
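The following is a minimal simulation of the Allreduce step and of stopping communication for skip targets; the list-of-dictionaries stand-in for GPUs is an assumption introduced for illustration, whereas an actual implementation would use a collective-communication library such as NCCL or MPI.

```python
# Sketch of the Allreduce step: gradients computed on each GPU are summed
# per layer, while layers flagged as skip targets are excluded from the
# communication. The list-of-dicts "GPU" representation is a stand-in for
# a real collective; it only illustrates the control flow.
import numpy as np

def allreduce_grads(per_gpu_grads, skip_layers):
    """Sum each layer's gradient across GPUs, except skipped layers."""
    reduced = {}
    for layer in per_gpu_grads[0]:
        if layer in skip_layers:
            continue  # communication for this layer is stopped
        reduced[layer] = sum(g[layer] for g in per_gpu_grads)
    return reduced

gpu0 = {"conv1": np.array([0.1, 0.2]), "fc": np.array([0.3])}
gpu1 = {"conv1": np.array([0.3, 0.1]), "fc": np.array([0.1])}
print(allreduce_grads([gpu0, gpu1], skip_layers={"conv1"}))
# only "fc" is aggregated; "conv1" is a skip target
```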
The candidate extraction unit 35 is a processing unit configured to extract, by using the error gradient information calculated by the error gradient calculator 33, a layer to be a candidate for a skip target in which the learning is stopped. For example, the candidate extraction unit 35 extracts, as a skip candidate, a layer in which the displacement of the error gradient between iterations is small among the layers.
For example, the candidate extraction unit 35 calculates and holds an error gradient #1 at the timing when iteration 1 of epoch 1 ends. Thereafter, when iteration 2 of epoch 1 ends, the candidate extraction unit 35 calculates and holds an error gradient #2, calculates a difference #2, which is a difference between the error gradients #1 and #2 (for example, a difference in absolute value), and compares the difference #2 with a threshold.
When the difference #2 is smaller than the threshold, the candidate extraction unit 35 determines that the learning of the current layer has sufficiently progressed and notifies the determination unit 36 of information specifying the current layer as a skip candidate. On the other hand, when the difference #2 is equal to or larger than the threshold, the candidate extraction unit 35 determines that the learning of the current layer is insufficient, does not consider the current layer as a skip candidate, and maintains normal learning. The candidate extraction unit 35 may perform the above skip candidate determination only on the last layer of each block.
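A minimal sketch of this candidate extraction is given below; the block boundaries, the mean-absolute-difference metric, and the threshold value are assumptions not specified by the embodiment, and only the last layer of each block is examined.

```python
# Sketch of candidate extraction: only the last layer of each block is
# examined, and a block becomes a skip candidate when the change in that
# layer's error gradient between iterations is below a threshold.
# Block boundaries, the metric, and the threshold are assumptions.
import numpy as np

THRESHOLD = 1e-3

def extract_candidates(blocks, prev_grads, curr_grads):
    """blocks: list of lists of layer names; returns candidate block indices."""
    candidates = []
    for idx, layers in enumerate(blocks):
        last = layers[-1]  # layer farthest from the input within the block
        if last not in prev_grads:
            continue
        diff = np.abs(curr_grads[last] - prev_grads[last]).mean()
        if diff < THRESHOLD:
            candidates.append(idx)
    return candidates

blocks = [["conv1", "bn1", "conv2"], ["conv3", "bn3", "fc"]]
prev = {"conv2": np.array([0.5]), "fc": np.array([0.4])}
curr = {"conv2": np.array([0.5001]), "fc": np.array([0.1])}
print(extract_candidates(blocks, prev, curr))  # [0]: block 0 is a candidate
```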
The determination unit 36 is a processing unit that determines whether the skip candidate layer extracted by the candidate extraction unit 35 is to undergo learning skipping or normal learning. For example, the determination unit 36 determines whether a predetermined determination condition is satisfied, determines a skip candidate layer satisfying the determination condition as a skip target, and notifies the communication controller 34 of the determination result. On the other hand, the determination unit 36 maintains normal learning for a skip candidate layer that does not satisfy the determination condition by suppressing the notification to the communication controller 34.
One or a plurality of determination criteria may be set as the determination condition used by the determination unit 36. Hereinafter, the determination criteria and the reliability of the determination criteria using experimental data and the like will be described.
(Determination Criterion: Skip Frequency)
First, the frequency of skipping machine learning will be examined.
(Determination Criterion: Warm-Up)
Next, warm-up processing (hereinafter, may be described as “warm-up”) will be examined. Generally, in machine learning, when the machine learning is started, a warm-up is executed in an initial stage of the machine learning. In the following, experimental data in a case where skipping is started during a warm-up and in a case where skipping is started after the warm-up is finished will be examined.
(Determination Criterion: Selection of Skip Layer)
Next, selection of a layer in which machine learning is skipped will be examined.
(Determination Criterion: Selection of Non-Target Layer)
Next, a layer in which machine learning is not skipped will be examined.
(Determination Criterion: Division of Skip Layers)
Next, how to divide layers to be skipped among a plurality of layers included in the machine learning model 14 will be examined.
(Determination Criterion: Order of Layers to be Skipped)
Finally, the order of efficient skipping of the layers from the front input layer to the last output layer will be examined.
For example, the example in the first row for the case where the learning is stopped in order from the preceding layer is an example in which the skipping is performed in the order of the 24th, 28th, 41st, 43rd, and 44th layers. The example in the first row for the case where the order of the layers to be skipped is exchanged is an example in which the skipping is performed in the order of the 24th, 32nd, 37th, 35th, and 41st layers, where the 37th layer is skipped earlier than the 35th layer.
(Determination of Skip Candidate)
The determination unit 36 determines whether to skip a skip candidate in accordance with a determination condition in consideration of the determination criteria described above.
In this state, when the determination unit 36 is notified of a skip candidate layer (for example, a layer number), the determination unit 36 determines whether the above layer is a skip target in accordance with determination procedures 1 to 4. For example, the determination unit 36 determines whether the reported layer number corresponds to the last layer of each block (determination procedure 1), whether the warm-up is completed (determination procedure 2), whether a block preceding the block to which the reported layer belongs has already been skipped (determination procedure 3), and whether a predetermined number of iterations have been executed since the skipping of the preceding block (determination procedure 4).
The block division is a matter to be determined based on the consideration of the selection of the non-target layer described above.
In a case where a skip candidate layer satisfies the determination procedures 1 to 4, the determination unit 36 determines to execute skipping and notifies the communication controller 34 of the determination result. Thereafter, in the next iteration, machine learning (error gradient calculation, parameter update, and the like) for the above-discussed layer is skipped.
In this manner, in the case where the last layer of the block is detected as a skip candidate, the determination unit 36 executes the final determination of skipping under the condition that a block preceding the block to which the skip candidate layer belongs has already been skipped, the warm-up has been completed, and a predetermined number of iterations have been completed since the skipping of the preceding block.
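A schematic rendering of determination procedures 1 to 4 is shown below; the bookkeeping structure and the iteration interval are assumptions introduced only for illustration.

```python
# Sketch of the skip-target determination (procedures 1 to 4): the reported
# layer must be the last layer of its block, the warm-up must be over, the
# preceding block must already be skipped, and a minimum number of
# iterations must have passed since that preceding skip. The bookkeeping
# structure and the interval value are illustrative assumptions.

SKIP_INTERVAL = 100  # assumed minimum iterations between block skips

def is_skip_target(layer, block_idx, blocks, state, iteration):
    """state: {'warmup_done': bool, 'skipped_at': {block_idx: iteration}}"""
    if layer != blocks[block_idx][-1]:               # procedure 1
        return False
    if not state["warmup_done"]:                     # procedure 2
        return False
    if block_idx > 0:
        prev_skip = state["skipped_at"].get(block_idx - 1)
        if prev_skip is None:                        # procedure 3
            return False
        if iteration - prev_skip < SKIP_INTERVAL:    # procedure 4
            return False
    return True

# usage sketch
blocks = [["conv1", "bn1"], ["conv2", "bn2"], ["conv3", "fc"]]
state = {"warmup_done": True, "skipped_at": {0: 300}}
print(is_skip_target("bn2", 1, blocks, state, iteration=450))  # True
```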
Accordingly, the determination unit 36 is able to execute learning skipping at a predetermined interval in order from the front block. Next, a control image of learning skipping by the determination unit 36 will be described.
For example, the determination unit 36 may reduce the total skip amount represented in Expression (1) under the relations of "the number of epochs e_1 for stopping the machine learning of the block 1 > the number of epochs e_END for completely finishing the learning" and "s_n < the fully connected layer". For example, the amount of skipping is equivalent to the areas of the respective learning stop frames depicted in the corresponding drawing.
[Example of Processing]
Subsequently, the error backward propagation processing unit 32 calculates an error gradient for each layer excluding a skip target (S104). Then, the candidate extraction unit 35 selects one layer at the end of each block (S105), and determines whether the selected layer satisfies requirements of a skip candidate (S106).
In a case where the layer satisfies the requirements of the skip candidate (S106: Yes), the determination unit 36 determines whether the block to which the skip candidate layer belongs satisfies a skip condition (S107); in a case where the skip condition is satisfied (S107: Yes), the determination unit 36 determines the block as a block to be skipped and notifies the communication controller 34 of the block (S108). The communication controller 34 thereby recognizes the block as a skip target, and in the next iteration, the machine learning for each layer belonging to the block is skipped.
On the other hand, when the block of the skip candidate does not satisfy the skip condition (S107: No), the determination unit 36 determines the block as a non-skip target, and normal machine learning is executed (S109). Likewise, when the layer does not satisfy the requirements of the skip candidate in S106 (S106: No), normal machine learning is executed (S109).
Thereafter, in a case where there is an unprocessed block (S110: Yes), the candidate extraction unit 35 repeats the processing in S105 and the subsequent processing; in a case where no unprocessed block is present (S110: No), the forward propagation processing unit 31 determines whether to end the machine learning (S111). For example, the forward propagation processing unit 31 determines whether an optional termination criterion is reached, such as whether the accuracy has reached the target accuracy or whether a specified number of epochs has been executed.
In a case where the machine learning is to be continued (S111: No), the forward propagation processing unit 31 repeats the processing in S102 and the subsequent processing; in a case where the machine learning is to be ended (S111: Yes), the machine learning is ended, and a learning result and the like are displayed.
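The flow from S104 to S111 may be condensed into the following control-flow skeleton; the helper callables passed in (compute_error_gradients, is_skip_candidate, and so on) are placeholders standing in for the processing units described above and are assumptions, not part of the embodiment.

```python
# Skeleton of the processing flow S104 to S111 described above, with the
# per-step work reduced to trivial stand-in callables so that the control
# flow itself is runnable. The stand-ins are assumptions, not the disclosure.
from types import SimpleNamespace

def training_loop(blocks, max_epochs, h):
    skip_targets = set()                                   # blocks whose learning is stopped
    for epoch in range(max_epochs):
        grads = h.compute_error_gradients(skip_targets)    # S104: skip targets excluded
        for block_idx, layers in enumerate(blocks):        # S105 / S110
            if block_idx in skip_targets:
                continue                                   # already skipped in an earlier epoch
            if h.is_skip_candidate(layers[-1], grads):     # S106
                if h.block_satisfies_skip_condition(block_idx, skip_targets):  # S107
                    skip_targets.add(block_idx)            # S108: notify the communication controller
                    continue
            h.train_normally(block_idx)                    # S109
        if h.reached_target(epoch):                        # S111
            break
    return skip_targets

# trivial stand-ins so that the skeleton runs end to end
h = SimpleNamespace(
    compute_error_gradients=lambda skip: {},
    is_skip_candidate=lambda layer, grads: True,
    block_satisfies_skip_condition=lambda idx, skipped: idx == 0 or idx - 1 in skipped,
    train_normally=lambda idx: None,
    reached_target=lambda epoch: epoch >= 2,
)
print(training_loop([["conv1"], ["conv2"], ["fc"]], max_epochs=10, h=h))  # {0, 1, 2}
```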
[Effects]
As described above, in each block of the machine learning model 14, the information processing apparatus 10 may suppress the weight update processing, the error backward propagation processing, and the like for a layer whose machine learning has been finished, and may thereby obtain significant reductions in unnecessary calculation and update processing. As a result, the information processing apparatus 10 may increase the overall calculation speed for the model while suppressing deterioration in accuracy.
In a case where an information processing apparatus (node) equipped with a plurality of GPUs is used, where parallel processing is performed across a plurality of nodes, or the like, the proportion of resources consumed by inter-GPU communication, inter-node communication, aggregation processing, and reflection processing increases. In such cases, with the information processing apparatus 10 described above, the effect of increasing the speed by reducing unnecessary calculation and update processing is further enhanced.
The determination condition of the skip target is not limited to that described in Embodiment 1. A method for achieving a higher speed, other determination conditions of the skip target, and the like will therefore be described below.
[Blocking]
For example, efficient learning skipping may be achieved by forming a block for each element size of the layers.
By generating the blocks as described above, it is possible to manage the layers predicted to have the same learning progress as the same block and control the learning skipping. Thus, when skipping a certain block, it is possible to lower the probability that a layer in a learning state is present in the block. As a result, it is possible to achieve efficient learning skipping and further increase the speed.
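As one possible reading of this element-size blocking, and only as a sketch in which the layer names and element counts are made up for illustration, consecutive layers with equal parameter element counts could be grouped as follows.

```python
# Sketch of element-size blocking: consecutive layers whose weight tensors
# have the same number of elements are grouped into one block, on the
# assumption that such layers tend to progress through learning at a
# similar pace. The equality criterion itself is an illustrative assumption.

def make_blocks_by_element_size(layer_sizes):
    """layer_sizes: list of (layer_name, num_elements) in input-to-output order."""
    blocks, current = [], []
    for name, size in layer_sizes:
        if current and size != current[-1][1]:
            blocks.append([n for n, _ in current])
            current = []
        current.append((name, size))
    if current:
        blocks.append([n for n, _ in current])
    return blocks

layers = [("conv1", 1728), ("conv2", 1728), ("conv3", 36864), ("conv4", 36864), ("fc", 5130)]
print(make_blocks_by_element_size(layers))
# [['conv1', 'conv2'], ['conv3', 'conv4'], ['fc']]
```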
[Block Configuration]
For example, the convolution layer and the BatchNormalization layer are formed alternately in many cases, and efficient machine learning may be executed when these layers are disposed as a pair in the same block so as not to extend over different blocks.
For example, the “input layer, convolution layer, BatchNormalization layer, convolution layer, BatchNormalization layer, convolution layer, BatchNormalization layer, convolution layer, BatchNormalization layer, and the like” are not set as a block of the “input layer, convolution layer, BatchNormalization layer, convolution layer, BatchNormalization layer, and convolution layer”, but are set as a block of the “input layer, convolution layer, BatchNormalization layer, convolution layer, BatchNormalization layer, convolution layer, and BatchNormalization layer”.
By generating the block as discussed above, layers in a range having the same influence on machine learning may be disposed as a pair in the same block, whereby efficient learning skipping may be achieved without obstructing efficient machine learning.
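A minimal sketch of this boundary adjustment is shown below; detecting the convolution/BatchNormalization pair by layer-type strings and the example boundary index are assumptions for illustration.

```python
# Sketch of the block-boundary adjustment described above: if a block would
# end on a convolution layer whose BatchNormalization partner falls into the
# next block, the boundary is pushed back by one layer so the pair stays
# together. Detecting the pair by layer-type strings is an assumption.

def adjust_boundaries(layers, boundaries):
    """layers: list of type strings; boundaries: indices where a block ends."""
    adjusted = []
    for b in boundaries:
        if (layers[b] == "conv" and b + 1 < len(layers)
                and layers[b + 1] == "batchnorm"):
            b += 1  # keep the conv/batchnorm pair in the same block
        adjusted.append(b)
    return adjusted

layers = ["input", "conv", "batchnorm", "conv", "batchnorm", "conv", "batchnorm"]
print(adjust_boundaries(layers, boundaries=[5]))  # [6]: boundary moved past the BatchNormalization layer
```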
[Skip Control]
For example, in a case where the average value of the error gradients in the respective layers from the first layer to the s_1-th layer of a block is less than a threshold, the determination unit 36 determines learning skipping of the block. At this time, the determination unit 36 causes the layers from the first layer to the second layer in the block to be skipped, and causes the other layers to execute machine learning as usual.
As described above, the determination unit 36 may improve the accuracy while shortening the overall learning time by skipping the machine learning stepwise within the block. The determination unit 36 may divide the inside of a block into a plurality of child blocks, and may skip learning for each child block stepwise at a predetermined iteration interval for a block in which the average value of the error gradients has once become less than the threshold.
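The stepwise in-block skipping just described might be realized, purely as a sketch with an assumed interval and an assumed child-block split, as follows.

```python
# Sketch of stepwise in-block skipping: after a block's average error
# gradient first drops below the threshold, its child blocks (input side
# first) are frozen one at a time every CHILD_INTERVAL iterations instead of
# all at once. The interval and the child-block split are assumptions.

CHILD_INTERVAL = 50  # assumed iterations between child-block freezes

def layers_to_skip(child_blocks, trigger_iteration, current_iteration):
    """child_blocks: list of lists of layer names, ordered from the input side."""
    if trigger_iteration is None or current_iteration < trigger_iteration:
        return []
    steps = 1 + (current_iteration - trigger_iteration) // CHILD_INTERVAL
    skipped = []
    for child in child_blocks[:steps]:
        skipped.extend(child)
    return skipped

children = [["conv1", "bn1"], ["conv2", "bn2"], ["conv3", "bn3"]]
print(layers_to_skip(children, trigger_iteration=200, current_iteration=260))
# ['conv1', 'bn1', 'conv2', 'bn2']: two child blocks frozen so far
```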
While the embodiments of the present disclosure have been described, the present disclosure may be implemented in various different forms other than the above-described embodiments.
[Numerical Values and the Like]
The number of blocks, the number of layers, the various thresholds, the numerical values, the number of GPUs, and the like used in the above-described embodiments are merely examples, and may be optionally changed. The determination criteria described above are likewise merely examples and may be changed as appropriate.
[Block Control and the Like]
For example, in the above examples, it is described that the error gradient is calculated for the last layer among the layers belonging to a block and that whether this last layer corresponds to a skip candidate is determined. However, the present disclosure is not limited thereto. For example, each block may be determined as a skip target depending on whether the average value of the error gradients of the layers belonging to the block is less than a threshold.
In the above-described embodiments, the example in which the skipping is controlled in units of blocks is described. However, the present disclosure is not limited thereto, and the skipping may be controlled in units of layers. For example, when the information processing apparatus 10 detects a plurality of layers in which a difference between the error gradients is smaller than a threshold, the information processing apparatus 10 may determine a predetermined number of layers as skip targets in order from the layer closest to the input layer.
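A sketch of this layer-unit control is given below; the number of layers skipped at once and the layer-ordering convention are assumptions for illustration.

```python
# Sketch of layer-unit skip control: among the layers whose error-gradient
# change fell below the threshold, only a predetermined number of layers
# closest to the input layer become skip targets. The count and the layer
# ordering convention are illustrative assumptions.

MAX_SKIPS_PER_STEP = 2  # assumed number of layers skipped at once

def select_skip_layers(layer_order, below_threshold):
    """layer_order: layer names from input to output; below_threshold: set."""
    hits = [name for name in layer_order if name in below_threshold]
    return hits[:MAX_SKIPS_PER_STEP]

order = ["conv1", "conv2", "conv3", "conv4", "fc"]
print(select_skip_layers(order, below_threshold={"conv2", "conv4", "conv1"}))
# ['conv1', 'conv2']: the two candidates closest to the input layer
```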
[System]
Unless otherwise specified, the processing procedures, control procedures, specific names, and information including various kinds of data and parameters described in the above description and drawings may be optionally changed.
Each element of each illustrated apparatus is functional and conceptual, and does not necessarily have to be physically configured as illustrated in the drawings. For example, the specific form of distribution or integration of the apparatuses is not limited to that illustrated in the drawings. The entirety or part of each apparatus may be functionally or physically distributed or integrated in any units in accordance with various loads, usage states, or the like.
All or any part of the processing functions performed by each apparatus may be achieved by a CPU and a program analyzed and executed by the CPU or may be achieved by a hardware apparatus using wired logic.
[Hardware]
Next, a hardware configuration example of the information processing apparatus 10 described in the above embodiments will be described.
The communication device 10a is a network interface card or the like and communicates with another server. The HDD 10b stores programs, a DB, and the like for enabling the functions described above.
The CPU 10d controls the overall information processing apparatus 10 and, for example, reads out a program related to machine learning from the HDD 10b or the like and loads it on the memory 10c, so that each of the GPUs 10e configured to perform each processing of the machine learning reads out, from the HDD 10b or the like, a program that executes the same processing as that of each of the processing units described above and executes a process that performs each processing of the machine learning.
As described above, the information processing apparatus 10 operates as an information processing apparatus that executes various information processing methods by reading out and executing programs. The information processing apparatus 10 may also achieve functions similar to those of the above-described embodiments by reading out the above-described programs from a recording medium with a medium reading device and executing the read programs. The programs described in the embodiments are not limited to being executed by the information processing apparatus 10. For example, the present disclosure may be similarly applied to a case where another computer or server executes the programs, or a case where another computer and server execute the programs in cooperation with each other.
The program may be distributed via a network such as the Internet. The program may be executed by being recorded on a computer-readable recording medium such as a hard disk, a flexible disk (FD), a compact disc read-only memory (CD-ROM), a magneto-optical disk (MO), or a Digital Versatile Disc (DVD) and being read out from the recording medium by a computer.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.