This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2021-121539, filed on Jul. 26, 2021, the entire contents of which are incorporated herein by reference.
The embodiment discussed herein is related to a machine learning program, an information processing device, and a machine learning method.
In recent years, in machine learning with neural networks, as the size of machine learning models increases, an increase in the speed of learning is required.
For example, in a simulation using CosmoFlow, which estimates cosmological parameters from dark matter data, the data amount is 5.1 TB, and machine learning on a single V100 graphics processing unit (GPU) takes one week.
Furthermore, data parallelism, which is the mainstream speed-up method in machine learning, has a limit in terms of accuracy. In other words, for example, when the degree of parallelism is increased, the batch size increases, and this may adversely affect learning accuracy.
Therefore, in recent years, a model parallel method has been known in which a machine learning model in a neural network is divided and parallel processing is executed by a plurality of computers. Hereinafter, the machine learning model in the neural network may be simply referred to as a neural network model or a model.
By executing parallel processing, by the plurality of computers, on each of the models created by dividing the neural network model, the speed of machine learning can be increased without affecting the learning accuracy.
In
In the model parallelized neural network indicated by the reference B, all layers, including a convolution layer and a fully connected layer, of the neural network indicated by the reference A are divided and parallelized.
However, in the model parallelized neural network indicated by the reference B, communication (allgather and allreduce) frequently occurs between the process #0 and the process #1 before and after each layer. This increases the communication load and causes delays due to waiting for synchronization and the like.
Therefore, a method is considered in which, of the plurality of layers included in the neural network, only the convolution layer, which has a large calculation amount, is parallelized.
The neural network illustrated in
Generally, although the calculation amount of the convolution layer is large, its communication involves only data exchange between adjacent portions. Therefore, there are few disadvantages to dividing the convolution layer. Furthermore, because the number of neurons in the fully connected layer at the subsequent stage is small, the calculation time does not increase even without parallelization, and processing may even be faster than when model parallelization is performed.
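As an illustration of this point, the following is a minimal sketch (not part of the embodiment; the shapes and the use of PyTorch are assumptions): an input is split along its height between two hypothetical processes, each receives only one halo row from its neighbor, and the concatenated partial outputs match the undivided convolution.

```python
# Minimal sketch: splitting a 3x3 convolution along the height dimension.
# Each "process" needs only one halo row from its neighbor, so communication
# is limited to data exchange between adjacent portions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 8, 8)            # (batch, channels, height, width)
w = torch.randn(1, 1, 3, 3)            # 3x3 kernel, no padding

full = F.conv2d(x, w)                  # reference: undivided convolution (6x6 output)

# Process #0 owns rows 0..3, process #1 owns rows 4..7.
# Each receives one halo row from the adjacent process (kernel size 3 -> halo 1).
x0 = x[:, :, 0:5, :]                   # own rows + 1 halo row from below
x1 = x[:, :, 3:8, :]                   # own rows + 1 halo row from above

y0 = F.conv2d(x0, w)                   # partial output of process #0 (3 rows)
y1 = F.conv2d(x1, w)                   # partial output of process #1 (3 rows)

combined = torch.cat([y0, y1], dim=2)
assert torch.allclose(full, combined)  # identical to the undivided result
```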
Examples of the related art include the following: Japanese National Publication of International Patent Application No. 2017-514251; and U.S. Patent Application Publication No. 2020/0372337.
According to an aspect of the embodiments, there is provided a non-transitory computer-readable recording medium storing a machine learning program of controlling machine learning of a plurality of distributed neural network models generated by dividing a neural network. In an example, the machine learning program includes instructions for causing a processor to execute processing including: adding, for each of the plurality of distributed neural network models, an individual noise for that distributed neural network model to a non-parallel processing block in that distributed neural network model such that the individual noise for that distributed neural network model is different from the individual noise for other distributed neural network models from among the plurality of distributed neural network models; and assigning, to a plurality of processes, the plurality of distributed neural network models added with the individual noise to cause each of the plurality of processes to perform the machine learning on an assigned distributed neural network model from among the plurality of distributed neural network models added with the individual noise.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
However, in the traditional model parallelized neural network illustrated in
Furthermore, in the model parallelized neural network illustrated in
In one aspect, an object of the embodiment is to efficiently use calculation resources in machine learning of a plurality of distributed neural network models that are model parallel processed.
Hereinafter, embodiments of a machine learning program, an information processing device, and a machine learning method will be described with reference to the drawings. Note that the embodiment to be described below is merely an example, and there is no intention to exclude application of various modifications and technologies not explicitly described in the embodiment. In other words, for example, the present embodiment may be variously modified and implemented without departing from the spirit thereof. Furthermore, each drawing is not intended to include only the components illustrated therein and may include other functions and the like.
As illustrated in
In the computer system 1, a machine learning model (neural network model) in a neural network is divided, and the plurality of computing nodes 2 realizes model parallel processing.
The computing node 2 is an information processing device (computer) including a processor and a memory (not illustrated) and executes a process assigned by the management device 10 to be described later. Each computing node 2 performs training of an assigned neural network model (machine learning), inference using the corresponding neural network model, or the like.
The management device 10 is, for example, an information processing device (computer) that has a server function and has a function for managing the neural network model.
As illustrated in
The storage device 13 is a storage device such as a hard disk drive (HDD), a solid state drive (SSD), or a storage class memory (SCM) and stores various kinds of data.
The memory 12 is a storage memory including a read only memory (ROM) and a random access memory (RAM). In the ROM of the memory 12, a software program used to manage the machine learning model and data for this program are written. The software program used to manage the machine learning model includes the machine learning program.
The software program in the memory 12 is appropriately read and executed by the processor 11. Furthermore, the RAM of the memory 12 is used as a primary storage memory or a working memory.
The processor (processing unit) 11 controls the entire management device 10. The processor 11 may also be a multiprocessor. The processor 11 may also be, for example, any one of a central processing unit (CPU), a micro processing unit (MPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a programmable logic device (PLD), and a field programmable gate array (FPGA). Furthermore, the processor 11 may also be a combination of two or more types of elements of the CPU, MPU, DSP, ASIC, PLD, and FPGA.
Then, the processor 11 executes a control program so as to function as a model management unit 100, a training control unit 102, and an inference control unit 103 illustrated in
Note that the program (control program) for implementing the functions as the model management unit 100, the training control unit 102, and the inference control unit 103 is provided, for example, in a form recorded in a computer-readable recording medium such as a flexible disk, a compact disc (CD) (CD-ROM, CD-R, CD-rewritable (RW), or the like), a digital versatile disc (DVD) (DVD-ROM, DVD-RAM, DVD-recordable (R), DVD+R, DVD-RW, DVD+RW, high definition (HD) DVD, or the like), a Blu-ray disc, a magnetic disk, an optical disc, or a magneto-optical disk. Then, the computer reads the program from the recording medium, transfers the program to an internal storage device or an external storage device, and stores the program for use. Furthermore, for example, the program may also be recorded in a storage device (recording medium) such as a magnetic disk, an optical disc, or a magneto-optical disk, and provided from the storage device to the computer via a communication path.
When the functions as the model management unit 100, the training control unit 102, and the inference control unit 103 are implemented, a program stored in an internal storage device (memory 12 in the present embodiment) is executed by a microprocessor (processor 11 in the present embodiment) of a computer. At this time, the computer may also read and execute the program recorded in the recording medium.
The model management unit 100 manages the neural network model.
In the computer system 1, the neural network model is divided, and model parallel processing by the plurality of computing nodes 2 is realized.
In the example illustrated in
Hereinafter, each of the plurality of neural network models created by dividing the single neural network model may be referred to as a distributed neural network model or a distributed model. Furthermore, the single neural network model before being divided may also be referred to as an original neural network model.
The respective created distributed models are processed by individual computing nodes 2. In other words, for example, each distributed model is processed as a different process. In
Each distributed model illustrated in
The dropout layer suppresses overtraining by performing machine learning while inactivating (invalidating) a certain percentage of nodes. Note that, in the example illustrated in
In the computer system 1, the dropout layer performs inactivation (invalidation) different between a plurality of processes (two in example illustrated in
The model management unit 100 generates the parallelized neural network models as illustrated in
The convolution layer in the original neural network is divided into a process #0 and a process #1, which are processed in parallel by different computing nodes 2. In each distributed model, the convolution layer may also be referred to as a model parallelization unit. Furthermore, in each distributed model, the fully connected layer and the dropout layer, for which parallel processing is not executed by the processes #0 and #1, may also be referred to as a non-model parallelization unit. Moreover, the plurality of processes that executes the processing of the convolution layer in parallel may also be referred to as model parallel processes.
The non-model parallelization unit of each distributed model includes processing blocks that execute the same processing; such processing blocks are included in duplicate in the non-model parallelization units of the plurality of distributed models. The processing blocks that are included in duplicate in this way may also be referred to as duplicated blocks. A duplicated block is a group of layers that executes duplicate processing, without model parallelization, between the processes that perform model parallelization. In each distributed model, the dropout layer is included in the duplicated block.
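For concreteness, one hypothetical way to picture a single distributed model held by one process is sketched below; the layer shapes, the order of the blocks, and the use of PyTorch are illustrative assumptions and not the embodiment's actual structure.

```python
# Hypothetical sketch of one distributed model held by a single process.
# The convolution shard stands in for the model parallelization unit; the fully
# connected layers and the dropout layer form the duplicated (non-model-parallelized)
# block, which has the same structure in every process but a per-process dropout rate.
import torch.nn as nn

class DistributedModel(nn.Module):
    def __init__(self, rank: int, dropout_rate: float):
        super().__init__()
        self.rank = rank
        # Model parallelization unit: this process's shard of the convolution layer.
        self.conv_shard = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
        # Non-model parallelization unit (duplicated block).
        self.duplicated_block = nn.Sequential(
            nn.Flatten(),
            nn.Linear(8 * 30 * 30, 64),
            nn.ReLU(),
            nn.Dropout(p=dropout_rate),   # individual noise for this process
            nn.Linear(64, 10),
        )

    def forward(self, x):
        # Assumes a 3x32x32 input so that the flattened size matches 8*30*30.
        return self.duplicated_block(self.conv_shard(x))
```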
As illustrated in
The noise setting unit 101 sets various parameters configuring the dropout layer so as to execute different dropout processing for each dropout layer of the plurality of distributed models.
For example, the noise setting unit 101 may set a different percentage of nodes to be inactivated (hereinafter referred to as a dropout rate) for each distributed model. To make the dropout rate different for each distributed model, for example, an arbitrary dropout rate may be selected from among a plurality of candidate dropout rates using random numbers for each dropout layer of each distributed model, and this selection may be changed as appropriate.
Furthermore, the noise setting method used by the noise setting unit 101 is not limited to making the dropout rate different for each distributed model and may be changed as appropriate. For example, the nodes to be inactivated may be different for each distributed model, or the dropout probability of an input element may be different for each distributed model.
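For example, a minimal sketch of selecting a dropout rate per distributed model using random numbers might look as follows; the candidate rates are assumed values for illustration only.

```python
# Minimal sketch: select a dropout rate for each distributed model using random numbers.
# The candidate rates below are hypothetical; the selection may repeat rates, which the
# record-keeping described later can be used to avoid.
import random

CANDIDATE_RATES = [0.1, 0.2, 0.3, 0.4, 0.5]   # hypothetical candidate dropout rates

rng = random.Random(0)                         # fixed seed only for reproducibility
for process_id in range(3):
    rate = rng.choice(CANDIDATE_RATES)         # arbitrary rate chosen per distributed model
    print(f"process #{process_id}: dropout rate = {rate}")
```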
The noise setting unit 101 reads data configuring the distributed model and determines whether or not the dropout layer is included in the processing block of each layer configuring the corresponding distributed model. Then, in a case where the distributed model includes the dropout layer, parameters of the respective dropout layers are set so as to execute dropout processing different between the plurality of distributed models.
Setting various parameters of the dropout layers so that different dropout processing is executed for each dropout layer of the plurality of distributed models may also be referred to as setting different noise for each model parallel process.
The noise setting unit 101 may also manage (store), as record information, the dropout processing (for example, the dropout rate and the nodes to be inactivated) set for each distributed model, refer to this record information, and determine the dropout processing to be set for each distributed model so that the dropout processing is not duplicated between the plurality of distributed models.
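One hypothetical way to keep such record information and avoid duplicated settings is sketched below; the class name, the stored fields, and the candidate rates are assumptions for illustration.

```python
# Hypothetical sketch of record information: the noise setting unit stores the dropout
# settings already assigned and never assigns the same setting to two distributed models.
import random

class DropoutRecord:
    def __init__(self, candidate_rates, seed: int = 0):
        self.candidate_rates = list(candidate_rates)
        self.assigned = {}                    # model_id -> dropout rate (record information)
        self.rng = random.Random(seed)

    def assign(self, model_id: str) -> float:
        unused = [r for r in self.candidate_rates if r not in self.assigned.values()]
        if not unused:
            raise RuntimeError("no unused dropout rate left for " + model_id)
        rate = self.rng.choice(unused)        # pick from rates not yet used
        self.assigned[model_id] = rate        # record for later reference
        return rate

record = DropoutRecord(candidate_rates=[0.1, 0.2, 0.3, 0.4])
for model_id in ("model_0", "model_1", "model_2"):
    print(model_id, record.assign(model_id))
```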
The training control unit 102 assigns each distributed model set by the model management unit 100 to each computing node 2 so as to make each distributed model perform training (machine learning).
According to an instruction for performing machine learning from the training control unit 102, the plurality of computing nodes 2 performs machine learning of the plurality of distributed neural network models created by dividing the original neural network in parallel.
Each distributed model assigned to each computing node 2 includes the dropout layer in a non-parallel block (duplicated block). Therefore, when each of the plurality of computing nodes 2 executes a process of machine learning of the distributed model, different noise is added in each non-parallel processing block (duplicated block, dropout layer).
The inference control unit 103 makes each computing node 2 perform inference by the distributed model.
Processing by the model management unit 100 of the computer system 1 as an example of the embodiment configured as described above will be described according to the flowchart (steps S1 to S8) illustrated in
In step S1, the model management unit 100 reads information configuring a distributed model created in advance. The model management unit 100 reads, for example, information of the plurality of distributed models created from the original neural network.
In step S2, the model management unit 100 selects one distributed model from among the plurality of read distributed models, checks the processing blocks in order from the beginning of the corresponding distributed model, and searches for a duplicated block, that is, a block duplicated between the plurality of distributed models (model parallel processes).
In step S3, the model management unit 100 confirms whether or not there is a duplicated block (candidate). In a case where there is a duplicated block (refer to YES route in step S3), the procedure proceeds to step S4. In step S4, the noise setting unit 101 confirms whether or not noise can be set for the corresponding duplicated block. In other words, for example, the noise setting unit 101 confirms whether or not the corresponding duplicated block is a dropout layer.
As a result of the confirmation, in a case where noise different for each model parallel process can be set for the corresponding duplicated block, in other words, for example, in a case where the duplicated block is a dropout layer (refer to YES route in step S4), the procedure proceeds to step S5.
In step S5, the noise setting unit 101 confirms with the user whether or not to set noise different between the plurality of distributed models. For example, the noise setting unit 101 may display, on a display (not illustrated) or the like, a message inquiring of the user whether or not noise different between the plurality of distributed models may be set.
The user may input a response to the inquiry using a mouse or a keyboard (both are not illustrated).
In step S6, the noise setting unit 101 confirms whether or not the user agrees to set the noise different between the plurality of distributed models. The noise setting unit 101 confirms, for example, whether or not the user has made an input indicating that the user agrees to set noise different between the plurality of distributed models using the mouse or the keyboard. As a result of the confirmation, in a case where the user does not agree to set the noise different between the plurality of distributed models (refer to NO route in step S6), the procedure returns to step S2.
On the other hand, in a case where the user agrees to set the noise different between the plurality of distributed models (refer to YES route in step S6), the procedure proceeds to step S7.
In step S7, the noise setting unit 101 sets (rewrites) the parameters of the respective corresponding dropout layers in the plurality of distributed models so that dropout processes different from each other are executed. Thereafter, the procedure returns to step S2.
Furthermore, in a case where noise cannot be set for the corresponding duplicated block as the result of the confirmation in step S4, in other words, for example, in a case where the corresponding duplicated block is not a dropout layer (refer to NO route in step S4), the procedure returns to step S2.
Furthermore, in a case where there is no duplicated block as the result of the confirmation in step S3 (refer to NO route in step S3), the procedure proceeds to step S8.
In step S8, information configuring each distributed model is written (stored) in a predetermined storage region such as the storage device 13. Thereafter, the procedure ends.
Note that, in the flowchart described above, the processing in steps S5 and S6 may also be omitted. In other words, for example, without confirming with the user whether or not to set the noise different between the plurality of distributed models, the parameters of the respective corresponding dropout layers may be rewritten in step S7 so that dropout processes different from each other are executed.
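A simplified, hedged sketch of the flow in steps S1 to S8 might look as follows, assuming (purely for illustration) that each distributed model is represented as a list of processing-block dictionaries; when the confirmation in steps S5 and S6 is omitted, the confirm callback simply always returns true.

```python
# Simplified sketch of steps S1 to S8. The distributed-model representation
# (a list of per-process block lists) and the helper names are assumptions.

def is_noise_capable(block) -> bool:
    # S4: in this embodiment a block can be given noise if it is a dropout layer.
    return block.get("type") == "dropout"

def set_different_noise(distributed_models, block_index, rates):
    # S7: rewrite the parameters of the corresponding dropout layers so that
    # each distributed model executes different dropout processing.
    for model, rate in zip(distributed_models, rates):
        model[block_index]["rate"] = rate

def configure_noise(distributed_models, rates, confirm=lambda: True):
    # S1: information of the distributed models has already been read.
    num_blocks = len(distributed_models[0])
    for i in range(num_blocks):                           # S2: scan blocks in order
        blocks = [m[i] for m in distributed_models]
        duplicated = all(b == blocks[0] for b in blocks)  # S3: duplicated block?
        if not duplicated:
            continue
        if not is_noise_capable(blocks[0]):               # S4: can noise be set?
            continue
        if not confirm():                                 # S5, S6: ask the user
            continue
        set_different_noise(distributed_models, i, rates) # S7: rewrite parameters
    return distributed_models                             # S8: write out (omitted here)

models = [
    [{"type": "conv", "shard": p}, {"type": "fc"}, {"type": "dropout", "rate": 0.5}]
    for p in range(2)
]
configure_noise(models, rates=[0.3, 0.5])
print(models)
```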
Next, machine learning processing by the plurality of distributed models created by the computer system 1 will be described with reference to
In
Note that,
In the forward propagation, as illustrated in
In the example illustrated in
Furthermore, different parameters are set for these three dropout layers by the noise setting unit 101 described above, and as a result, the dropout rates of the respective dropout layers are different.
Therefore, in the processing blocks on the downstream side of these dropout layers in the non-model parallelization unit of each distributed model, outputs different from each other are obtained.
Furthermore, outputs of the respective processing blocks at the final stage of the non-model parallelization unit of each distributed model are combined (refer to reference P5).
Each combined output is input to each subsequent model parallelization unit in the distributed model executed by each of the processes #0 to #2. The same data is input to each non-model parallelization unit of each distributed model.
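A minimal forward-propagation sketch, simulating three model parallel processes in a single script, is shown below; the per-process dropout rates, the averaging used as the combination, and the layer shapes are illustrative assumptions rather than the embodiment's actual processing.

```python
# Minimal forward-propagation sketch simulating three model parallel processes.
# Each duplicated block has identical weights but a different dropout rate, so the
# same input produces different outputs per process; the outputs are then combined.
import torch
import torch.nn as nn

torch.manual_seed(0)
rates = [0.2, 0.3, 0.4]                        # different noise per process (assumed values)

# Duplicated (non-model-parallelized) block: same structure in every process.
duplicated_blocks = [
    nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Dropout(p=r)) for r in rates
]
# Copy the shared weights so the duplicated blocks really are duplicates.
for blk in duplicated_blocks[1:]:
    blk.load_state_dict(duplicated_blocks[0].state_dict())

x = torch.randn(4, 16)                         # the same data enters every process

# Each process applies its own dropout, so the outputs differ between processes.
outputs = [blk(x) for blk in duplicated_blocks]

# Combine the per-process outputs (illustrated here as an average); the combined
# result would then be fed to the subsequent model parallelization unit of each process.
combined = torch.stack(outputs).mean(dim=0)
print(combined.shape)
```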
A direction from the bottom to the top in
In the backpropagation, as illustrated in
In the non-model parallelization unit of each distributed model, in each processing block (duplicated block) other than the dropout layer (refer to references P7 to P9), a weight update Δw is calculated, for example, using the gradient descent method, in a direction that reduces a loss function that defines an error between an inference result of the machine learning model with respect to the training data and correct answer data.
Different parameters are set for the respective dropout layers included in the non-model parallelization unit of each distributed model by the noise setting unit 101 described above, and as a result, the dropout rates of the respective dropout layers are different.
Therefore, in the processing blocks on the downstream side of these dropout layers in the non-model parallelization unit of each distributed model, outputs different from each other are obtained.
Outputs of the respective processing blocks at the final stages of the non-model parallelization units of the respective distributed models are combined (refer to reference P10). Each combined output is input to each subsequent model parallelization unit in the distributed model executed by each of the processes #0 to #2. The same data is input to each non-model parallelization unit of each distributed model.
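Similarly, a minimal backpropagation sketch simulating two processes is shown below; the duplicated blocks share identical weights but use different dropout rates, so the gradients computed for the same data differ. The shapes, rates, learning rate, and loss are illustrative assumptions.

```python
# Minimal backpropagation sketch: two processes with identical duplicated-block weights
# but different dropout rates compute different gradients for the same training data.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 16)
target = torch.randn(4, 16)                  # stand-in for correct answer data
loss_fn = nn.MSELoss()                       # loss defining the error vs. the correct data
lr = 0.1                                     # learning rate of gradient descent (assumed)

grads = []
for rate in (0.2, 0.4):                      # different noise per process
    torch.manual_seed(1)                     # identical initial weights in both processes
    block = nn.Sequential(nn.Linear(16, 16), nn.Dropout(p=rate))
    loss = loss_fn(block(x), target)
    loss.backward()                          # backpropagation through the duplicated block
    grads.append(block[0].weight.grad.clone())

# The dropout noise differs between the processes, so the gradients differ as well.
print(torch.allclose(grads[0], grads[1]))    # -> False (with high probability)

# Gradient-descent weight update, w <- w - lr * grad, illustrated on the last block.
with torch.no_grad():
    block[0].weight -= lr * block[0].weight.grad
```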
In the weight update, as illustrated in
In this way, according to the computer system 1 as an example of the embodiment, the noise setting unit 101 sets various parameters configuring the dropout layer so as to execute different dropout processing for each dropout layer of the plurality of distributed models.
As a result, at the time of machine learning, each process for processing the distributed model executes different dropout processing in each dropout layer of the non-model parallelization unit (duplicated block).
Therefore, by generating noise by a method different for each process in the non-model parallelization unit, which executes processing duplicated between the processes, the calculation resources can be used efficiently.
Furthermore, by adding different noise to each distributed model, it is possible to improve robustness of the distributed model that is processed in parallel and to improve learning accuracy.
The processing blocks included in the non-model parallelization unit are originally processed in duplicate, in parallel, by the plurality of processes (distributed models). Therefore, in the computer system 1, executing different dropout processing in each of the plurality of processes that perform model parallelization causes almost no increase in calculation time, and the learning accuracy can be improved.
Each configuration and each processing of the present embodiment may also be selected or omitted as needed or may also be appropriately combined.
Then, the disclosed technology is not limited to the embodiment described above, and various modifications may be made and implemented without departing from the spirit of the present embodiment.
For example, in the embodiment described above, the dropout layer is used as the duplicated block that can set noise. However, the embodiment is not limited to this and can be appropriately changed and executed.
Furthermore, the present embodiment may be implemented and manufactured by those skilled in the art according to the disclosure described above.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind
---|---|---|---
2021-121539 | Jul 2021 | JP | national