This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2021-121539, filed on Jul. 26, 2021, the entire contents of which are incorporated herein by reference.
The embodiment discussed herein is related to a machine learning program, an information processing device, and a machine learning method.
In recent years, in machine learning with neural networks, as the size of machine learning models increases, an increase in the speed of learning is required.
For example, in a simulation using CosmoFlow, which estimates cosmological parameters from dark matter data, the data amount is 5.1 TB, and machine learning on a single V100 graphics processing unit (GPU) takes one week.
Furthermore, data parallelism, which is the mainstream speed-up method in machine learning, has a limit in terms of accuracy. In other words, for example, when the degree of parallelism is increased, the batch size increases, and this may adversely affect learning accuracy.
Therefore, in recent years, a model parallel method has been known in which a machine learning model in a neural network is divided and parallel processing is executed by a plurality of computers. Hereinafter, the machine learning model in the neural network may be simply referred to as a neural network model or a model.
By executing parallel processing, by the plurality of computers, on each of the models created by dividing the neural network model, the speed of machine learning can be increased without affecting the learning accuracy.
In
In the model parallelized neural network indicated by the reference B, all layers, including a convolution layer and a fully connected layer, of the neural network indicated by the reference A are divided and parallelized.
However, in the model parallelized neural network indicated by the reference B, communication (allgather and allreduce) frequently occurs between the process #0 and the process #1 before and after each layer. This increases the communication load and causes delays due to waiting for synchronization and the like.
Therefore, a method is considered in which, of the plurality of layers included in the neural network, only the convolution layer, which has a large calculation amount, is parallelized.
The neural network illustrated in
Generally, although the calculation amount of the convolution layer is large, its communication involves only data exchange between adjacent portions. Therefore, there are few disadvantages to dividing the convolution layer. Furthermore, because the number of neurons in the fully connected layer at the subsequent stage is small, the calculation time does not increase even without parallelization, and processing may even be faster than when model parallelization is performed.
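As an illustration of this point, the following is a minimal sketch (not part of the embodiment; the shapes and the use of PyTorch are assumptions): an input is split along its height between two hypothetical processes, each receives only one halo row from its neighbor, and the concatenated partial outputs match the undivided convolution.

```python
# Minimal sketch: splitting a 3x3 convolution along the height dimension.
# Each "process" needs only one halo row from its neighbor, so communication
# is limited to data exchange between adjacent portions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 8, 8)            # (batch, channels, height, width)
w = torch.randn(1, 1, 3, 3)            # 3x3 kernel, no padding

full = F.conv2d(x, w)                  # reference: undivided convolution (6x6 output)

# Process #0 owns rows 0..3, process #1 owns rows 4..7.
# Each receives one halo row from the adjacent process (kernel size 3 -> halo 1).
x0 = x[:, :, 0:5, :]                   # own rows + 1 halo row from below
x1 = x[:, :, 3:8, :]                   # own rows + 1 halo row from above

y0 = F.conv2d(x0, w)                   # partial output of process #0 (3 rows)
y1 = F.conv2d(x1, w)                   # partial output of process #1 (3 rows)

combined = torch.cat([y0, y1], dim=2)
assert torch.allclose(full, combined)  # identical to the undivided result
```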
Examples of the related art include the following: Japanese National Publication of International Patent Application No. 2017-514251; and U.S. Patent Application Publication No. 2020/0372337.
According to an aspect of the embodiments, there is provided a non-transitory computer-readable recording medium storing a machine learning program of controlling machine learning of a plurality of distributed neural network models generated by dividing a neural network. In an example, the machine learning program includes instructions for causing a processor to execute processing including: adding, for each of the plurality of distributed neural network models, an individual noise for that distributed neural network model to a non-parallel processing block in that distributed neural network model such that the individual noise for that distributed neural network model is different from the individual noise for other distributed neural network models from among the plurality of distributed neural network models; and assigning, to a plurality of processes, the plurality of distributed neural network models added with the individual noise to cause each of the plurality of processes to perform the machine learning on an assigned distributed neural network model from among the plurality of distributed neural network models added with the individual noise.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
However, in the traditional model parallelized neural network illustrated in
Furthermore, in the model parallelized neural network illustrated in
In one aspect, an object of the embodiment is to efficiently use calculation resources in machine learning of a plurality of distributed neural network models that are model parallel processed.
Hereinafter, embodiments of a machine learning program, an information processing device, and a machine learning method will be described with reference to the drawings. Note that the embodiment to be described below is merely an example, and there is no intention to exclude application of various modifications and technologies not explicitly described in the embodiment. In other words, for example, the present embodiment may be variously modified and implemented without departing from the spirit thereof. Furthermore, each drawing is not intended to include only the components illustrated therein and may include other functions and the like.
As illustrated in
In the computer system 1, a machine learning model (neural network model) in a neural network is divided, and the plurality of computing nodes 2 realizes model parallel processing.
The computing node 2 is an information processing device (computer) including a processor and a memory (not illustrated) and executes a process assigned by the management device 10 to be described later. Each computing node 2 performs training of an assigned neural network model (machine learning), inference using the corresponding neural network model, or the like.
The management device 10 is, for example, an information processing device (computer) that has a server function and has a function for managing the neural network model.
As illustrated in
The storage device 13 is a storage device such as a hard disk drive (HDD), a solid state drive (SSD), or a storage class memory (SCM) and stores various kinds of data.
The memory 12 is a storage memory including a read only memory (ROM) and a random access memory (RAM). In the ROM of the memory 12, a software program used to manage the machine learning model and data for this program are written. The software program used to manage the machine learning model includes the machine learning program.
The software program in the memory 12 is appropriately read and executed by the processor 11. Furthermore, the RAM of the memory 12 is used as a primary storage memory or a working memory.
The processor (processing unit) 11 controls the entire management device 10. The processor 11 may also be a multiprocessor. The processor 11 may also be, for example, any one of a central processing unit (CPU), a micro processing unit (MPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a programmable logic device (PLD), and a field programmable gate array (FPGA). Furthermore, the processor 11 may also be a combination of two or more types of elements of the CPU, MPU, DSP, ASIC, PLD, and FPGA.
Then, the processor 11 executes a control program so as to function as a model management unit 100, a training control unit 102, and an inference control unit 103 illustrated in
Note that the program (control program) for implementing the functions as the model management unit 100, the training control unit 102, and the inference control unit 103 is provided, for example, in a form recorded in a computer-readable recording medium such as a flexible disk, a compact disc (CD) (CD-ROM, CD-R, CD-rewritable (RW), or the like), a digital versatile disc (DVD) (DVD-ROM, DVD-RAM, DVD-recordable (R), DVD+R, DVD-RW, DVD+RW, high definition (HD) DVD, or the like), a Blu-ray disc, a magnetic disk, an optical disc, or a magneto-optical disk. Then, the computer reads the program from the recording medium, transfers the program to an internal storage device or an external storage device, and stores the program for use. Furthermore, for example, the program may also be recorded in a storage device (recording medium) such as a magnetic disk, an optical disc, or a magneto-optical disk, and provided from the storage device to the computer via a communication path.
When the functions as the model management unit 100, the training control unit 102, and the inference control unit 103 are implemented, a program stored in an internal storage device (memory 12 in the present embodiment) is executed by a microprocessor (processor 11 in the present embodiment) of a computer. At this time, the computer may also read and execute the program recorded in the recording medium.
The model management unit 100 manages the neural network model.
In the computer system 1, the neural network model is divided, and model parallel processing by the plurality of computing nodes 2 is realized.
In the example illustrated in
Hereinafter, each of the plurality of neural network models created by dividing the single neural network model may be referred to as a distributed neural network model or a distributed model. Furthermore, the single neural network model before being divided may also be referred to as an original neural network model.
The respective created distributed models are processed by individual computing nodes 2. In other words, for example, each distributed model is processed as a different process. In
Each distributed model illustrated in
The dropout layer suppresses overtraining by performing machine learning while inactivating (invalidating) a certain percentage of nodes. Note that, in the example illustrated in
In the computer system 1, the dropout layer performs inactivation (invalidation) different between a plurality of processes (two in example illustrated in
The model management unit 100 generates the parallelized neural network models as illustrated in
The convolution layer in the original neural network is divided into a process #0 and a process #1, which are processed in parallel by different computing nodes 2. In each distributed model, the convolution layer may also be referred to as a model parallelization unit. Furthermore, in each distributed model, the fully connected layer and the dropout layer, for which parallel processing is not executed by the processes #0 and #1, may also be referred to as a non-model parallelization unit. Moreover, the plurality of processes that executes the processing of the convolution layer in parallel may also be referred to as model parallel processes.
The non-model parallelization unit of each distributed model includes processing blocks that execute the same processing; such processing blocks are included in duplicate in the non-model parallelization units of the plurality of distributed models. The processing blocks that are included in duplicate in this way may also be referred to as duplicated blocks. A duplicated block is a group of layers that executes duplicate processing, without model parallelization, between the processes that perform model parallelization. In each distributed model, the dropout layer is included in the duplicated block.
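For concreteness, one hypothetical way to picture a single distributed model held by one process is sketched below; the layer shapes, the order of the blocks, and the use of PyTorch are illustrative assumptions and not the embodiment's actual structure.

```python
# Hypothetical sketch of one distributed model held by a single process.
# The convolution shard stands in for the model parallelization unit; the fully
# connected layers and the dropout layer form the duplicated (non-model-parallelized)
# block, which has the same structure in every process but a per-process dropout rate.
import torch.nn as nn

class DistributedModel(nn.Module):
    def __init__(self, rank: int, dropout_rate: float):
        super().__init__()
        self.rank = rank
        # Model parallelization unit: this process's shard of the convolution layer.
        self.conv_shard = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
        # Non-model parallelization unit (duplicated block).
        self.duplicated_block = nn.Sequential(
            nn.Flatten(),
            nn.Linear(8 * 30 * 30, 64),
            nn.ReLU(),
            nn.Dropout(p=dropout_rate),   # individual noise for this process
            nn.Linear(64, 10),
        )

    def forward(self, x):
        # Assumes a 3x32x32 input so that the flattened size matches 8*30*30.
        return self.duplicated_block(self.conv_shard(x))
```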
As illustrated in
The noise setting unit 101 sets various parameters configuring the dropout layer so as to execute different dropout processing for each dropout layer of the plurality of distributed models.
For example, the noise setting unit 101 may set a different percentage of nodes to be inactivated (hereinafter referred to as a dropout rate) for each distributed model. To make the dropout rate different for each distributed model, for example, an arbitrary dropout rate may be selected from among a plurality of candidate dropout rates using random numbers for each dropout layer of each distributed model, and this selection may be changed as appropriate.
Furthermore, the noise setting method used by the noise setting unit 101 is not limited to making the dropout rate different for each distributed model and may be changed as appropriate. For example, the nodes to be inactivated may be different for each distributed model, or the dropout probability of an input element may be different for each distributed model.
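For example, a minimal sketch of selecting a dropout rate per distributed model using random numbers might look as follows; the candidate rates are assumed values for illustration only.

```python
# Minimal sketch: select a dropout rate for each distributed model using random numbers.
# The candidate rates below are hypothetical; the selection may repeat rates, which the
# record-keeping described later can be used to avoid.
import random

CANDIDATE_RATES = [0.1, 0.2, 0.3, 0.4, 0.5]   # hypothetical candidate dropout rates

rng = random.Random(0)                         # fixed seed only for reproducibility
for process_id in range(3):
    rate = rng.choice(CANDIDATE_RATES)         # arbitrary rate chosen per distributed model
    print(f"process #{process_id}: dropout rate = {rate}")
```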
The noise setting unit 101 reads data configuring the distributed model and determines whether or not the dropout layer is included in the processing block of each layer configuring the corresponding distributed model. Then, in a case where the distributed model includes the dropout layer, parameters of the respective dropout layers are set so as to execute dropout processing different between the plurality of distributed models.
Setting various parameters of the dropout layers so that different dropout processing is executed for each dropout layer of the plurality of distributed models may also be referred to as setting different noise for each model parallel process.
The noise setting unit 101 may also manage (store), as record information, the dropout processing (for example, the dropout rate and the nodes to be inactivated) set for each distributed model, refer to this record information, and determine the dropout processing to be set for each distributed model so that the dropout processing is not duplicated between the plurality of distributed models.
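One hypothetical way to keep such record information and avoid duplicated settings is sketched below; the class name, the stored fields, and the candidate rates are assumptions for illustration.

```python
# Hypothetical sketch of record information: the noise setting unit stores the dropout
# settings already assigned and never assigns the same setting to two distributed models.
import random

class DropoutRecord:
    def __init__(self, candidate_rates, seed: int = 0):
        self.candidate_rates = list(candidate_rates)
        self.assigned = {}                    # model_id -> dropout rate (record information)
        self.rng = random.Random(seed)

    def assign(self, model_id: str) -> float:
        unused = [r for r in self.candidate_rates if r not in self.assigned.values()]
        if not unused:
            raise RuntimeError("no unused dropout rate left for " + model_id)
        rate = self.rng.choice(unused)        # pick from rates not yet used
        self.assigned[model_id] = rate        # record for later reference
        return rate

record = DropoutRecord(candidate_rates=[0.1, 0.2, 0.3, 0.4])
for model_id in ("model_0", "model_1", "model_2"):
    print(model_id, record.assign(model_id))
```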
The training control unit 102 assigns each distributed model set by the model management unit 100 to each computing node 2 so as to make each distributed model perform training (machine learning).
According to an instruction for performing machine learning from the training control unit 102, the plurality of computing nodes 2 performs machine learning of the plurality of distributed neural network models created by dividing the original neural network in parallel.
Each distributed model assigned to each computing node 2 includes the dropout layer in a non-parallel block (duplicated block). Therefore, when each of the plurality of computing nodes 2 executes a process of machine learning of the distributed model, different noise is added in each non-parallel processing block (duplicated block, dropout layer).
The inference control unit 103 makes each computing node 2 perform inference by the distributed model.
Processing by the model management unit 100 of the computer system 1 as an example of the embodiment configured as described above will be described according to the flowchart (steps S1 to S8) illustrated in
In step S1, the model management unit 100 reads information configuring a distributed model created in advance. The model management unit 100 reads, for example, information of the plurality of distributed models created from the original neural network.
In step S2, the model management unit 100 selects one distributed model from among the plurality of read distributed models, checks the processing blocks in order from the beginning of the corresponding distributed model, and searches for a duplicated block, that is, a block duplicated between the plurality of distributed models (model parallel processes).
In step S3, the model management unit 100 confirms whether or not there is a duplicated block (candidate). In a case where there is a duplicated block (refer to YES route in step S3), the procedure proceeds to step S4. In step S4, the noise setting unit 101 confirms whether or not noise can be set for the corresponding duplicated block. In other words, for example, the noise setting unit 101 confirms whether or not the corresponding duplicated block is a dropout layer.
As a result of the confirmation, in a case where noise different for each model parallel process can be set for the corresponding duplicated block, in other words, for example, in a case where the duplicated block is a dropout layer (refer to YES route in step S4), the procedure proceeds to step S5.
In step S5, the noise setting unit 101 confirms with the user whether or not to set noise different between the plurality of distributed models. For example, the noise setting unit 101 may display, on a display (not illustrated) or the like, a message inquiring of the user whether or not noise different between the plurality of distributed models may be set.
The user may input a response to the inquiry using a mouse or a keyboard (both are not illustrated).
In step S6, the noise setting unit 101 confirms whether or not the user agrees to set the noise different between the plurality of distributed models. The noise setting unit 101 confirms, for example, whether or not the user has made an input indicating that the user agrees to set noise different between the plurality of distributed models using the mouse or the keyboard. As a result of the confirmation, in a case where the user does not agree to set the noise different between the plurality of distributed models (refer to NO route in step S6), the procedure returns to step S2.
On the other hand, in a case where the user agrees to set the noise different between the plurality of distributed models (refer to YES route in step S6), the procedure proceeds to step S7.
In step S7, the noise setting unit 101 sets (rewrites) the parameters of the respective corresponding dropout layers in the plurality of distributed models so that dropout processes different from each other are executed. Thereafter, the procedure returns to step S2.
Furthermore, in a case where noise cannot be set for the corresponding duplicated block as the result of the confirmation in step S4, in other words, for example, in a case where the corresponding duplicated block is not a dropout layer (refer to NO route in step S4), the procedure returns to step S2.
Furthermore, in a case where there is no duplicated block as the result of the confirmation in step S3 (refer to NO route in step S3), the procedure proceeds to step S8.
In step S8, information configuring each distributed model is written (stored) in a predetermined storage region such as the storage device 13. Thereafter, the procedure ends.
Note that, in the flowchart described above, the processing in steps S5 and S6 may also be omitted. In other words, for example, without confirming with the user whether or not to set the noise different between the plurality of distributed models, the parameters of the respective corresponding dropout layers may be rewritten in step S7 so that dropout processes different from each other are executed.
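A simplified, hedged sketch of the flow in steps S1 to S8 might look as follows, assuming (purely for illustration) that each distributed model is represented as a list of processing-block dictionaries; when the confirmation in steps S5 and S6 is omitted, the confirm callback simply always returns true.

```python
# Simplified sketch of steps S1 to S8. The distributed-model representation
# (a list of per-process block lists) and the helper names are assumptions.

def is_noise_capable(block) -> bool:
    # S4: in this embodiment a block can be given noise if it is a dropout layer.
    return block.get("type") == "dropout"

def set_different_noise(distributed_models, block_index, rates):
    # S7: rewrite the parameters of the corresponding dropout layers so that
    # each distributed model executes different dropout processing.
    for model, rate in zip(distributed_models, rates):
        model[block_index]["rate"] = rate

def configure_noise(distributed_models, rates, confirm=lambda: True):
    # S1: information of the distributed models has already been read.
    num_blocks = len(distributed_models[0])
    for i in range(num_blocks):                           # S2: scan blocks in order
        blocks = [m[i] for m in distributed_models]
        duplicated = all(b == blocks[0] for b in blocks)  # S3: duplicated block?
        if not duplicated:
            continue
        if not is_noise_capable(blocks[0]):               # S4: can noise be set?
            continue
        if not confirm():                                 # S5, S6: ask the user
            continue
        set_different_noise(distributed_models, i, rates) # S7: rewrite parameters
    return distributed_models                             # S8: write out (omitted here)

models = [
    [{"type": "conv", "shard": p}, {"type": "fc"}, {"type": "dropout", "rate": 0.5}]
    for p in range(2)
]
configure_noise(models, rates=[0.3, 0.5])
print(models)
```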
Next, machine learning processing by the plurality of distributed models created by the computer system 1 will be described with reference to
In
Note that,
In the forward propagation, as illustrated in
In the example illustrated in
Furthermore, different parameters are set for these three dropout layers by the noise setting unit 101 described above, and as a result, the dropout rates of the respective dropout layers are different.
Therefore, in the processing blocks on the downstream side of these dropout layers in the non-model parallelization unit of each distributed model, outputs different from each other are obtained.
Furthermore, outputs of the respective processing blocks at the final stage of the non-model parallelization unit of each distributed model are combined (refer to reference P5).
Each combined output is input to each subsequent model parallelization unit in the distributed model executed by each of the processes #0 to #2. The same data is input to each non-model parallelization unit of each distributed model.
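A minimal forward-propagation sketch, simulating three model parallel processes in a single script, is shown below; the per-process dropout rates, the averaging used as the combination, and the layer shapes are illustrative assumptions rather than the embodiment's actual processing.

```python
# Minimal forward-propagation sketch simulating three model parallel processes.
# Each duplicated block has identical weights but a different dropout rate, so the
# same input produces different outputs per process; the outputs are then combined.
import torch
import torch.nn as nn

torch.manual_seed(0)
rates = [0.2, 0.3, 0.4]                        # different noise per process (assumed values)

# Duplicated (non-model-parallelized) block: same structure in every process.
duplicated_blocks = [
    nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Dropout(p=r)) for r in rates
]
# Copy the shared weights so the duplicated blocks really are duplicates.
for blk in duplicated_blocks[1:]:
    blk.load_state_dict(duplicated_blocks[0].state_dict())

x = torch.randn(4, 16)                         # the same data enters every process

# Each process applies its own dropout, so the outputs differ between processes.
outputs = [blk(x) for blk in duplicated_blocks]

# Combine the per-process outputs (illustrated here as an average); the combined
# result would then be fed to the subsequent model parallelization unit of each process.
combined = torch.stack(outputs).mean(dim=0)
print(combined.shape)
```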
A direction from the bottom to the top in
In the backpropagation, as illustrated in
In the non-model parallelization unit of each distributed model, in each processing block (duplicated block) other than the dropout layer (refer to references P7 to P9), a weight update Δw is calculated, for example, using the gradient descent method, in a direction that reduces a loss function that defines an error between an inference result of the machine learning model with respect to the training data and correct answer data.
Different parameters are set for the respective dropout layers included in the non-model parallelization unit of each distributed model by the noise setting unit 101 described above, and as a result, the dropout rates of the respective dropout layers are different.
Therefore, in the processing blocks on the downstream side of these dropout layers in the non-model parallelization unit of each distributed model, outputs different from each other are obtained.
Outputs of the respective processing blocks at the final stages of the non-model parallelization units of the respective distributed models are combined (refer to reference P10). Each combined output is input to each subsequent model parallelization unit in the distributed model executed by each of the processes #0 to #2. The same data is input to each non-model parallelization unit of each distributed model.
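Similarly, a minimal backpropagation sketch simulating two processes is shown below; the duplicated blocks share identical weights but use different dropout rates, so the gradients computed for the same data differ. The shapes, rates, learning rate, and loss are illustrative assumptions.

```python
# Minimal backpropagation sketch: two processes with identical duplicated-block weights
# but different dropout rates compute different gradients for the same training data.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 16)
target = torch.randn(4, 16)                  # stand-in for correct answer data
loss_fn = nn.MSELoss()                       # loss defining the error vs. the correct data
lr = 0.1                                     # learning rate of gradient descent (assumed)

grads = []
for rate in (0.2, 0.4):                      # different noise per process
    torch.manual_seed(1)                     # identical initial weights in both processes
    block = nn.Sequential(nn.Linear(16, 16), nn.Dropout(p=rate))
    loss = loss_fn(block(x), target)
    loss.backward()                          # backpropagation through the duplicated block
    grads.append(block[0].weight.grad.clone())

# The dropout noise differs between the processes, so the gradients differ as well.
print(torch.allclose(grads[0], grads[1]))    # -> False (with high probability)

# Gradient-descent weight update, w <- w - lr * grad, illustrated on the last block.
with torch.no_grad():
    block[0].weight -= lr * block[0].weight.grad
```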
In the weight update, as illustrated in
In this way, according to the computer system 1 as an example of the embodiment, the noise setting unit 101 sets various parameters configuring the dropout layer so as to execute different dropout processing for each dropout layer of the plurality of distributed models.
As a result, at the time of machine learning, each process for processing the distributed model executes different dropout processing in each dropout layer of the non-model parallelization unit (duplicated block).
Therefore, by generating noise by a method different for each process in the non-model parallelization unit, which executes processing duplicated between the processes, the calculation resources can be used efficiently.
Furthermore, by adding different noise to each distributed model, it is possible to improve robustness of the distributed model that is processed in parallel and to improve learning accuracy.
The processing blocks included in the non-model parallelization unit are originally processed in duplicate, in parallel, by the plurality of processes (distributed models). Therefore, in the computer system 1, executing different dropout processing in each of the plurality of processes that perform model parallelization causes almost no increase in calculation time, and the learning accuracy can be improved.
Each configuration and each processing of the present embodiment may also be selected or omitted as needed or may also be appropriately combined.
Then, the disclosed technology is not limited to the embodiment described above, and various modifications may be made and implemented without departing from the spirit of the present embodiment.
For example, in the embodiment described above, the dropout layer is used as the duplicated block that can set noise. However, the embodiment is not limited to this and can be appropriately changed and executed.
Furthermore, the present embodiment may be implemented and manufactured by those skilled in the art according to the disclosure described above.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind
---|---|---|---
2021-121539 | Jul 2021 | JP | national