The present invention relates to a distributed deep learning system and a distributed deep learning method, and particularly relates to a distributed deep learning technique executed in a distributed, coordinated manner by a plurality of calculation nodes that cooperate with one another over a network.
In recent years, machine learning has been utilized for various types of information and data, and accordingly, the development of services and the provision of added value are actively underway. Such machine learning often requires a large amount of calculation resources. In particular, machine learning that uses a neural network, called deep learning, must process a large amount of learning data during learning, which is a process for optimizing the configuration parameters of the neural network. One solution for increasing the speed of this learning processing is to perform the processing in parallel on a plurality of computation apparatuses.
For example, NPL 1 discloses a distributed deep learning system in which four calculation nodes and an InfiniBand switch are connected via an InfiniBand network. Four GPUs (Graphics Processing Units) are installed in each calculation node. In the distributed deep learning system disclosed in NPL 1, the speed is increased by performing the learning computation in parallel across the four calculation nodes.
Also, NPL 2 discloses a configuration in which a calculation node (GPU server) in which eight GPUs are installed is connected to an Ethernet® switch via an Ethernet network. NPL 2 discloses examples in which 1, 2, 4, 8, 16, 32, and 44 such calculation nodes are used.
In a system disclosed in NPL 2, machine learning is performed using distributed synchronous SGD (Stochastic Gradient Descent). Specifically, machine learning is performed in the following procedure.
(1) Extract a part of learning data. A collection of the extracted learning data pieces is called a minibatch.
(2) The minibatch is divided so that the divided minibatches correspond in number to the GPUs, and the divided minibatches are allocated to respective GPUs.
(3) Each GPU obtains a loss function L(w), which serves as an index indicating the degree to which the values output from the neural network, when the learning data allocated in (2) is input, deviate from the true values (referred to as "supervisory data"). In the process for obtaining this loss function, the output values are calculated in order from the layer on the input side toward the layer on the output side of the neural network; thus, this process is called forward propagation.
(4) Each GPU obtains the partial differential values (gradients) of the loss function value obtained in (3) with respect to the respective configuration parameters of the neural network (e.g., the weights of the neural network). In this process, the gradients with respect to the configuration parameters of each layer are calculated in order from the layer on the output side toward the layer on the input side of the neural network; thus, this process is called backpropagation.
(5) An average of the gradients that were respectively calculated by the GPUs is calculated.
(6) Using the average value of the gradients calculated in (5), each GPU updates each configuration parameter of the neural network by SGD (Stochastic Gradient Descent) so as to further reduce the loss function L(w). SGD is calculation processing for reducing the loss function L(w) by changing the value of each configuration parameter by a small amount in the gradient direction. By repeating this processing, the neural network is updated into a highly accurate neural network that has a small loss function L(w), that is to say, yields outputs that are close to the truth.
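As a concrete illustration of the procedure in (1) to (6), the following is a minimal sketch of distributed synchronous SGD written in Python with NumPy; the single-layer linear model, the in-process worker functions standing in for GPUs, and all numerical values are illustrative assumptions rather than the configurations used in NPL 1 to NPL 3.

```python
import numpy as np

def local_gradient(w, x_batch, y_batch):
    """Forward propagation (loss) and backpropagation (gradient) on one worker."""
    pred = x_batch @ w                      # (3) forward propagation
    err = pred - y_batch                    # deviation from the supervisory data
    grad = x_batch.T @ err / len(x_batch)   # (4) backpropagation: dL/dw for L = 0.5*mean(err**2)
    return grad

rng = np.random.default_rng(0)
n_workers, lr = 4, 0.1
w = rng.normal(size=3)                      # configuration parameters (weights)
x = rng.normal(size=(64, 3))                # learning data
y = x @ np.array([1.0, -2.0, 0.5])          # supervisory data (the "truth")

for step in range(200):
    idx = rng.choice(len(x), size=32, replace=False)        # (1) extract a minibatch
    shards = np.array_split(idx, n_workers)                 # (2) divide it among the "GPUs"
    grads = [local_gradient(w, x[s], y[s]) for s in shards] # (3)-(4) per-worker gradients
    avg_grad = np.mean(grads, axis=0)                       # (5) average of the gradients
    w -= lr * avg_grad                                      # (6) SGD update of the parameters

print("learned weights:", np.round(w, 3))   # approaches [1.0, -2.0, 0.5]
```

In a real system, step (5) is where the gradients must be exchanged over the network, which is the communication cost discussed below.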
Furthermore, NPL 3 discloses a distributed deep learning system configured in such a manner that 128 calculation nodes, each with eight GPUs installed, are connected via an InfiniBand network.
In any of the conventional distributed deep learning systems disclosed in NPL 1 to NPL 3, it is apparent that the speed of learning is increased and the learning period can be reduced as the number of calculation nodes increases. In this case, in order to calculate an average value of the configuration parameters of the neural network, such as the gradients calculated by the respective calculation nodes, it is necessary to exchange these configuration parameters among the calculation nodes.
On the other hand, if the number of nodes is increased in order to increase the degree of parallel processing, the necessary communication processing increases rapidly. When computation processing, such as the calculation of an average value, and data exchange processing are performed on a calculation node with use of software as in the conventional techniques, there arises a problem that it is difficult to sufficiently increase the learning efficiency due to the large overhead associated with communication processing.
For example, NPL 3 discloses the relationship among the period required to perform 100 cycles of learning processing, the portion of that period required for communication, and the number of GPUs. According to this relationship, the period required for communication increases as the number of GPUs increases, and in particular, increases rapidly when the number of GPUs reaches 512 or more.
However, in the conventional distributed deep learning systems, if the number of calculation nodes connected to a communication network increases, there arises a problem that speeding up the coordinated processing among the calculation nodes becomes difficult.
Embodiments of the present invention have been made to solve the aforementioned problem, and it is an object thereof to perform coordinated processing among calculation nodes at high speed even if the number of calculation nodes connected to a communication network increases.
In order to solve the aforementioned problem, a distributed deep learning system according to embodiments of the present invention includes a plurality of calculation nodes that are connected to one another via a communication network, wherein each of the plurality of calculation nodes includes a computation apparatus that calculates a matrix product included in computation processing of a neural network, and outputs a first computation result, a first storage apparatus that stores the first computation result output from the computation apparatus, and a network processing apparatus including a first transmission circuit that transmits the first computation result stored in the first storage apparatus to another calculation node, a first reception circuit that receives a first computation result from another calculation node, an addition circuit that obtains a second computation result, the second computation result being a sum of the first computation result stored in the first storage apparatus and the first computation result from the another calculation node received by the first reception circuit, a second transmission circuit that transmits the second computation result to another calculation node, and a second reception circuit that receives a second computation result from another calculation node.
In order to solve the aforementioned problem, a distributed deep learning system according to embodiments of the present invention includes: a plurality of calculation nodes that are connected to one another via a communication network; and an aggregation node, wherein each of the plurality of calculation nodes includes a computation apparatus that calculates a matrix product included in computation processing of a neural network, and outputs a first computation result, a first network processing apparatus including a first transmission circuit that transmits the first computation result output from the computation apparatus to the aggregation node, and a first reception circuit that receives a second computation result from the aggregation node, the second computation result being a sum of first computation results calculated by the plurality of calculation nodes, and a first storage apparatus that stores the second computation result received by the first reception circuit, the aggregation node includes a second network processing apparatus including a second reception circuit that receives the first computation results from the plurality of calculation nodes, an addition circuit that obtains the second computation result which is the sum of the first computation results received by the second reception circuit, and a second transmission circuit that transmits the second computation result obtained by the addition circuit to the plurality of calculation nodes, and a second storage apparatus that stores the first computation results from the plurality of calculation nodes received by the second reception circuit, and the addition circuit reads out the first computation results from the plurality of calculation nodes stored in the second storage apparatus, and obtains the second computation result.
In order to solve the aforementioned problem, a distributed deep learning method according to embodiments of the present invention is a distributed deep learning method executed by a distributed deep learning system including a plurality of calculation nodes that are connected to one another via a communication network, wherein each of the plurality of calculation nodes performs a computation step of calculating a matrix product included in computation processing of a neural network, and outputting a first computation result, a first storage step of storing the first computation result output in the computation step to a first storage apparatus, and a network processing step including a first transmission step of transmitting the first computation result stored in the first storage apparatus to another calculation node, a first reception step of receiving a first computation result from another calculation node, an addition step of obtaining a second computation result, the second computation result being a sum of the first computation result stored in the first storage apparatus and the first computation result from the another calculation node received in the first reception step, a second transmission step of transmitting the second computation result to another calculation node, and a second reception step of receiving a second computation result from another calculation node.
In order to solve the aforementioned problem, a distributed deep learning method according to embodiments of the present invention is a distributed deep learning method executed by a distributed deep learning system including a plurality of calculation nodes that are connected to one another via a communication network, and an aggregation node, wherein each of the plurality of calculation nodes performs a computation step of calculating a matrix product included in computation processing of a neural network, and outputting a first computation result, a first network processing step including a first transmission step of transmitting the first computation result output in the computation step to the aggregation node, and a first reception step of receiving a second computation result from the aggregation node, the second computation result being a sum of first computation results calculated by the plurality of calculation nodes, and a first storage step of storing the second computation result received in the first reception step to a first storage apparatus, the aggregation node performs a second network processing step including a second reception step of receiving the first computation results from the plurality of calculation nodes, an addition step of obtaining the second computation result which is the sum of the first computation results received in the second reception step, and a second transmission step of transmitting the second computation result obtained in the addition step to the plurality of calculation nodes, and a second storage step of storing, to a second storage apparatus, the first computation results from the plurality of calculation nodes received in the second reception step, and in the addition step, the first computation results from the plurality of calculation nodes stored in the second storage apparatus are read out, and the second computation result is obtained.
According to embodiments of the present invention, each of a plurality of calculation nodes that are connected to one another via a communication network includes a network processing apparatus including an addition circuit that obtains a second computation result, which is a sum of a first computation result that has been output from a computation apparatus and stored in a first storage apparatus, and a first computation result from another calculation node received by a first reception circuit. Therefore, even if the number of calculation nodes connected to the communication network increases, coordinated processing among the calculation nodes can be performed at higher speed.
The following describes preferred embodiments of the present invention in detail with reference to
First, an overview of a distributed deep learning system according to the embodiments of the present invention will be described with reference to
One of the characteristics of the distributed deep learning system according to the present embodiments is that each of the plurality of calculation nodes 1-1 to 1-3 includes, in a network processing apparatus that exchanges data, an addition circuit that obtains a sum of the result of calculation in the self-node and the result of calculation from another calculation node 1.
Note that in the following description, the calculation nodes 1-1 to 1-3 may be collectively referred to as calculation nodes 1. Also, although each of the drawings including
In the distributed deep learning system of embodiments of the present invention, training for learning the values of the weights of the neural network in deep learning is performed with use of learning data throughout the entire distributed deep learning system. Specifically, each calculation node 1, which is a learning node, performs predetermined computation processing of the neural network with use of the learning data and the neural network, and calculates the gradient of the weight data. At the time of completion of this predetermined computation, the plurality of different calculation nodes 1 hold different gradients of the weight data.
A network processing apparatus, which may be realized by, for example, a computing interconnect apparatus connected to the communication network, aggregates the gradients of the weight data, performs processing for averaging the aggregated data, and distributes the result thereof to each calculation node 1 again. Using the average gradient of the weight data, each calculation node 1 performs the predetermined computation processing of the neural network again with use of the learning data and the neural network. By repeating this processing, the distributed deep learning system obtains a learned neural network model.
The calculation nodes 1 have a learning function of calculating the output values of the neural network, which is a mathematical model constructed in the form of software, and further improving the accuracy of the output values by updating configuration parameters of the neural network in accordance with learning data.
The neural network is constructed inside each calculation node 1. As a method of realizing the calculation nodes 1, the calculation nodes 1 may be realized using software on a CPU or a GPU, or may be realized using an LSI (Large Scale Integration) circuit formed as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit). Note that a specific example of a hardware configuration of the calculation nodes 1 will be described later.
As shown in
When the model parallel method is used in which the model of the neural network is divided among the plurality of calculation nodes 1 as stated earlier, the outputs of the hidden layer h2 are calculated by both of the calculation node 1-1 and the calculation node 1-2, specifically, as shown in
In embodiments of the present specification, the result of calculation of a part of matrix products included in the computation processing of the neural network, which was calculated by each calculation node 1, is referred to as a “partial computation result” (first computation result), and a sum of the partial computation results is referred to as a “total computation result” (second computation result).
Similarly, the outputs of the hidden layer h4 are calculated by both of the calculation node 1-2 and the calculation node 1-3. Also, with regard to the outputs of the hidden layers h1, h3, and h5, the computation is completed without being shared among a plurality of calculation nodes 1.
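The fact that a hidden-layer output divided between calculation nodes is recovered simply by summing the partial computation results is the linear-algebra identity behind the total computation result. The following minimal NumPy sketch checks this identity; the even split of the inputs and weight rows across three nodes is an illustrative assumption and does not reproduce the exact partitioning of the embodiments.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=6)           # inputs x1..x6
W = rng.normal(size=(6, 5))      # weights from the inputs to hidden layers h1..h5

# Illustrative model-parallel split: node 1-1 holds x1-x2, node 1-2 holds x3-x4,
# node 1-3 holds x5-x6, together with the corresponding rows of W.
splits = [(0, 2), (2, 4), (4, 6)]
partials = [x[a:b] @ W[a:b, :] for a, b in splits]   # partial computation results (per node)

total = sum(partials)                                # total computation result (sum of partials)
assert np.allclose(total, x @ W)                     # identical to the undivided matrix product
print(np.round(total, 3))
```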
Next, a description is given of a distributed deep learning system according to a first embodiment of the present invention.
As shown in
As shown in
The computation unit 10 calculates a part of matrix products of the neural network, and outputs a partial computation result. As described using
The storage unit 11 includes a region that holds a partial computation result (first storage apparatus) 110 and a region that holds a total computation result (second storage apparatus) 111. Also, the storage unit 11 holds partial weight parameters w included among the weight parameters w of the neural network.
The partial computation result 110 is a region that stores the partial computation result output from the computation unit 10.
The total computation result 111 is a region that stores a total computation result obtained by the self-node and a total computation result received from another calculation node 1.
The network processing unit 12 includes a reception unit (first reception circuit and second reception circuit) 120, an addition unit (addition circuit) 121, and a transmission unit (first transmission circuit and second transmission circuit) 122.
The reception unit 120 receives a partial computation result from another calculation node 1 via the communication network. Also, the reception unit 120 receives a total computation result from another calculation node 1.
The addition unit 121 obtains a total computation result by adding the partial computation result from another calculation node 1, which was received by the reception unit 120, and the partial computation result calculated by the self-node. The addition unit 121 can be configured using, for example, an addition circuit that uses a logic circuit. The total computation result obtained by the addition unit 121 is stored to the storage unit 11.
The transmission unit 122 transmits the partial computation result stored in the storage unit 11, which was calculated by the computation unit 10 of the self-node, to another calculation node 1 via the communication network. Also, the transmission unit 122 distributes the total computation result obtained by the addition unit 121 to another calculation node 1 via the communication network.
Note that each of the plurality of calculation nodes 1-1 to 1-3 has a similar functional configuration.
A description is now given of the configuration of the calculation nodes 1 included in the distributed deep learning system according to the present embodiment and the configuration of a calculation node 100 included in a distributed deep learning system of a conventional example, which is shown in
As shown in
In the calculation node 100 of the conventional example, a partial computation result received from another calculation node 100 is stored to an another-node partial computation result 1112 in the storage unit 1100. In order to obtain a total computation result, the addition unit 1221 included in the computation unit 1000 makes a memory access to the memory that constitutes the storage unit 1100, which creates an additional memory access period. Therefore, the entire processing period also becomes longer than in the configuration of the present embodiment.
In contrast, in the calculation nodes 1 according to the present embodiment, the sum of the partial computation result received from another calculation node 1 and the partial computation result calculated by the self-node is calculated by the addition unit 121 included in the network processing unit 12, and thus the additional memory access period, which is created on the calculation node 100 of the conventional example, is not created.
Next, one example of a hardware configuration that realizes the calculation nodes 1 provided with the aforementioned functions will be described with reference to a block diagram of
As shown in
A program that is intended for the CPU 101 and the GPU 103 to perform various types of control and computation is stored in the main memory 102 in advance. The CPU 101, the GPU 103, and the main memory 102 realize respective functions of the calculation nodes 1, such as the computation unit 10 and the addition unit 121 shown in
The NIC 104 is an interface circuit for network connection among the calculation nodes 1, and with various types of external electronic devices. The NIC 104 realizes the reception unit 120 and the transmission unit 122 of
The storage 105 includes a readable and writable storage medium, and a driving apparatus for reading and writing various types of information, such as programs and data, from and to this storage medium. For the storage 105, a hard disk or a semiconductor memory such as a flash memory can be used as the storage medium. The storage 105 realizes the storage unit 11 described using
The storage 105 includes a program storage region for storing a program that is intended for the calculation node 1 to execute distributed deep learning processing, such as computation of the neural network including matrix products. The storage 105 may include, for example, a backup region and the like for backing up the aforementioned data, programs, and so forth.
The I/O 106 includes a network port to which signals from external devices are input and from which signals are output to external devices. For example, two or more network ports can be provided.
The addition circuit 107 can be configured using, for example, an adder built from basic logic gates and the like. The addition circuit 107 realizes the addition unit 121 described using
For example, a broadband network, such as 100 Gbit Ethernet, is used as the communication network NW according to the present embodiment.
First, the operations of each calculation node 1 configured in the aforementioned manner will be described using a flowchart of
First, the computation unit 10 calculates a part of matrix products in learning of the neural network (step S1).
Next, once a partial computation result obtained by the computation unit 10 has been stored to the storage unit 11 (step S2: YES), the network processing unit 12 starts group communication (step S3). On the other hand, when the partial computation result calculated by the self-node has not been obtained yet (step S2: NO), the computation in step S1 is executed again (step S1).
For example, assume a case where the distributed deep learning system is a synchronous system. In the synchronous system, at the timing of completion of the calculation of parts of matrix products in all of the calculation nodes 1-1 to 1-3, the obtained partial computation results are shared via group communication. Therefore, the calculation nodes 1-1 to 1-3 hold the partial computation result calculated by the self-node in the storage unit 11 until a predetermined timing arrives.
Note that also in the case of the synchronous system, it is not necessarily required to wait for the completion of calculation by the computation units 10 of all calculation nodes 1-1 to 1-3; for example, the timing of completion of calculation by some of the calculation nodes 1 of the distributed deep learning system may be used.
For example, as the hidden layer h2 can be obtained at the time of completion of calculations by the calculation node 1-1 and the calculation node 1-2, group communication may be started without waiting for the completion of calculation by the calculation node 1-3.
On the other hand, when the distributed deep learning system adopts an asynchronous system in which group communication is started without waiting for the completion of computation by another calculation node 1, group communication with a predetermined calculation node 1 is started at the time of completion of the calculation of a partial computation result by each of the calculation nodes 1-1 to 1-3. In this case, the calculation node 1 that has received data of partial computation results temporarily accumulates the received partial computation results in the storage unit 11 until the self-node completes its own partial computation.
Once the network processing unit 12 has started group communication in step S3, the transmission unit 122 transmits the partial computation result calculated by the self-node to another calculation node 1 via the communication network. Also, the reception unit 120 receives a partial computation result calculated by another calculation node 1. At this time, as shown in
Next, the addition unit 121 obtains a total computation result, which is a sum of the partial computation result obtained by the self-node and the partial computation result received from another calculation node 1 (step S4).
Next, the network processing unit 12 distributes the total computation result obtained in step S4 to another calculation node 1 (step S5). Specifically, the transmission unit 122 transmits the total computation result obtained by the addition unit 121 to another calculation node 1 via the communication network. Thereafter, the total computation result, which is the sum of the partial computation results calculated in each of the plurality of calculation nodes 1-1 to 1-3, is stored to the storage unit 11.
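As a compact illustration of the flow of steps S1 to S5, the following sketch simulates calculation nodes in a single Python process; the class structure, the direct method calls standing in for the communication network, and the use of only two nodes are simplifying assumptions.

```python
import numpy as np

class CalculationNode:
    """Toy calculation node: computation unit, storage unit, and network processing unit."""

    def __init__(self, name, x_part, w_part):
        self.name = name
        self.x_part, self.w_part = x_part, w_part
        self.partial = None   # partial computation result (region 110)
        self.total = None     # total computation result (region 111)

    def compute_partial(self):
        # Step S1: calculate a part of the matrix product.
        self.partial = self.x_part @ self.w_part
        return self.partial

    def add_received_partial(self, other_partial):
        # Steps S3-S4: group communication delivers another node's partial result,
        # and the addition unit in the network processing unit sums the two.
        self.total = self.partial + other_partial
        return self.total

    def receive_total(self, total):
        # Step S5: store the distributed total computation result.
        self.total = total

rng = np.random.default_rng(2)
x, W = rng.normal(size=4), rng.normal(size=(4, 2))       # inputs and weights of one layer
node_1_1 = CalculationNode("1-1", x[:2], W[:2, :])       # holds x1, x2 and their weight rows
node_1_2 = CalculationNode("1-2", x[2:], W[2:, :])       # holds x3, x4 and their weight rows

for node in (node_1_1, node_1_2):
    node.compute_partial()                               # step S1 on every node
total = node_1_1.add_received_partial(node_1_2.partial)  # node 1-2 transmits, node 1-1 adds
node_1_2.receive_total(total)                            # node 1-1 distributes the total
assert np.allclose(total, x @ W)                         # equals the undivided matrix product
```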
Next, the operations of the distributed deep learning system will be described with reference to a sequence diagram of
As described using
Similarly, as described using
As shown in
Next, in the calculation node 1-1, the addition unit 121 of the network processing unit 12 obtains a total computation result by adding the partial computation result obtained by the self-node and the partial computation result transmitted from the calculation node 1-2 (step S102). As a result, the total computation result indicating the outputs of the hidden layer h2 is obtained.
Thereafter, the transmission unit 122 of the calculation node 1-1 distributes the outputs of the hidden layer h2 to other calculation nodes 1-2 and 1-3 (step S103).
On the other hand, the computation unit 10 of the calculation node 1-3 obtains a partial computation result by calculating [x3*w34+x4*w44+x5*w54+x6*w64], and transmits the partial computation result to the calculation node 1-2 (step S104). Next, the addition unit 121 of the calculation node 1-2 obtains a total computation result by adding the partial computation result representing the calculation [x1*w14+x2*w24] associated with h4, which was obtained in step S101, and the partial computation result received from the calculation node 1-3 (step S105). The total computation result obtained in step S105 indicates the outputs of the hidden layer h4.
Thereafter, the calculation node 1-2 distributes the total computation result obtained in step S105 to other calculation nodes 1-1 and 1-3 (step S106).
Through the aforementioned steps, the outputs of the hidden layers h2 and h4 are obtained as the sums of partial computation results, and these results are shared among the plurality of calculation nodes 1-1 to 1-3.
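The sequence for the hidden layer h4 can be checked numerically with the sketch below; the input and weight values are arbitrary assumptions, and the check only confirms that the sum of the two partial computation results equals the undivided weighted sum.

```python
import numpy as np

# Arbitrary example values for the inputs x1..x6 and the weights w14..w64 into h4 (assumptions).
x = np.array([0.5, -1.0, 2.0, 0.3, -0.7, 1.2])      # x1..x6
w4 = np.array([0.1, 0.4, -0.2, 0.8, 0.05, -0.6])    # w14, w24, w34, w44, w54, w64

partial_node_1_2 = x[0] * w4[0] + x[1] * w4[1]      # [x1*w14 + x2*w24] (step S101)
partial_node_1_3 = x[2:] @ w4[2:]                   # [x3*w34 + x4*w44 + x5*w54 + x6*w64] (step S104)

total_h4 = partial_node_1_2 + partial_node_1_3      # addition in the network processing unit (step S105)
assert np.isclose(total_h4, x @ w4)                 # equals the undivided weighted sum for h4
print("h4 weighted input sum:", round(float(total_h4), 4))
```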
On the other hand, as shown in
Here, as shown in
For example, assume a case where respective calculation nodes 1-1 to 1-3 are connected via a ring communication network with use of 100 Gbit Ethernet as stated earlier. In this case, the maximum communication speed is 100 Gbps when only one-way communication is used, whereas the maximum communication speed is 100 Gbps*2=200 Gbps when a bidirectional communication band is used.
Also, in the present embodiment, using communication packets, the transmission unit 122 transmits a partial computation result calculated by the self-node to another calculation node 1, and the reception unit 120 can receive a partial computation result from another calculation node 1. In this case, a communication packet includes an identifier for determining whether the partial computation result is addressed to the self-node.
For example, whether data is addressed to the self-node can be distinguished based on whether a flag is set in a bit location that varies with each of the calculation nodes 1-1 to 1-3 in a header of a communication packet. When a flag is set in a bit location for the self-node in a header of a communication packet received by the reception unit 120, it is determined that a partial computation result included in the received communication packet is data addressed to the self-node. Then, a total computation result, which is the sum of the partial computation result calculated by the self-node and the received partial computation result from another calculation node 1, is obtained.
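One possible realization of such per-node flag bits is sketched below; the one-byte header layout, the bit assignment, and the helper functions are hypothetical and are not a packet format defined in the embodiments.

```python
# Hypothetical one-byte header: bit 0 -> node 1-1, bit 1 -> node 1-2, bit 2 -> node 1-3.
NODE_BIT = {"1-1": 0, "1-2": 1, "1-3": 2}

def build_packet(dest_nodes, payload: bytes) -> bytes:
    """Set the flag bit of every destination node in the header, then append the payload."""
    header = 0
    for node in dest_nodes:
        header |= 1 << NODE_BIT[node]
    return bytes([header]) + payload

def addressed_to_self(packet: bytes, self_node: str) -> bool:
    """A packet is addressed to the self-node if the flag at its own bit location is set."""
    return bool(packet[0] & (1 << NODE_BIT[self_node]))

pkt = build_packet(["1-1"], b"partial computation result")
assert addressed_to_self(pkt, "1-1") and not addressed_to_self(pkt, "1-3")
```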
Furthermore, when the execution of processing is shared among the plurality of calculation nodes 1-1 to 1-3, it is also possible to define a master-subordinate relationship among the calculation nodes 1-1 to 1-3. For example, it is possible to adopt a configuration in which the calculation node 1-1, which calculates a partial computation result with use of a weight parameter w1n, is used as a master calculation node, and the other calculation nodes 1-2 and 1-3 transmit their partial computation results to the master calculation node 1-1.
As described above, according to the first embodiment, each of the plurality of calculation nodes 1-1 to 1-3 includes the network processing unit 12 that includes the transmission unit 122, the reception unit 120, and the addition unit 121. Here, this transmission unit 122 transmits a partial computation result obtained by the self-node to another calculation node 1. Also, this reception unit 120 receives a partial computation result from another calculation node 1. Furthermore, this addition unit 121 performs total computation to obtain a sum of the partial computation result from another calculation node 1, which was received by the reception unit 120, and the partial computation result from the self-node.
Therefore, the computation unit 10 no longer needs to perform computation of addition, and reading and writing of a memory associated therewith can be reduced; as a result, even if the number of calculation nodes 1 connected to the communication network increases, coordinated processing among the calculation nodes 1 can be performed at higher speed.
Next, a description is given of a second embodiment of the present invention. Note that in the following description, the same reference signs are given to the constituents that are the same as those of the first embodiment described above, and a description thereof is omitted.
The first embodiment has been described in connection with a case where each of the plurality of calculation nodes 1-1 to 1-3 includes the network processing unit 12 that includes the addition unit 121, and the network processing unit 12 performs processing for adding a partial computation result obtained by the self-node and a partial computation result received from another calculation node 1. In contrast, in the second embodiment, a distributed deep learning system includes an aggregation node 2 that aggregates partial computation results that were respectively obtained by a plurality of calculation nodes 1-1 to 1-3, and performs addition processing. The following description will be provided with a focus on the constituents that differ from the first embodiment.
As shown in
As shown in block diagrams of
The computation unit 10 calculates a part of matrix products for learning of the neural network, and outputs a partial computation result.
The storage unit 11 stores the partial computation result 110 of the self-node, which was obtained by the computation unit 10, and a total computation result 111.
The network processing unit 12A includes a reception unit (first reception circuit) 120 and a transmission unit (first transmission circuit) 122.
The reception unit 120 receives a total computation result, which is a sum of partial computation results calculated by a plurality of calculation nodes 1, from the later-described aggregation node 2.
The transmission unit 122 transmits the partial computation result obtained by the self-node to the aggregation node 2 via the communication network.
As shown in
The storage unit 21 stores the partial computation results 210 that were respectively obtained by the calculation nodes 1-1 to 1-3.
The network processing unit 22 includes a reception unit (second reception circuit) 220, an addition unit (addition circuit) 221, and a transmission unit (second transmission circuit) 222.
The reception unit 220 receives the partial computation results respectively from the plurality of calculation nodes 1-1 to 1-3. The received partial computation results are stored to the storage unit 21.
The addition unit 221 obtains a total computation result, which is a sum of predetermined partial computation results included among the partial computation results from the plurality of calculation nodes 1-1 to 1-3 received by the reception unit 220. The addition unit 221 can be configured using, for example, an addition circuit that uses a logic circuit.
For example, using the specific example that has been described based on
The transmission unit 222 distributes the total computation result obtained by the addition unit 221 to the plurality of calculation nodes 1-1 to 1-3.
Next, one example of a hardware configuration that realizes the aggregation node 2 provided with the aforementioned functions will be described with reference to a block diagram of
As shown in
A program that is intended for the CPU 201 and the GPU 203 to perform various types of control and computation is stored in the main memory 202 in advance. The CPU 201, the GPU 203, and the main memory 202 realize respective functions of the aggregation node 2, such as the addition unit 221 shown in
The NIC 204 is an interface circuit for network connection with the calculation nodes 1-1 to 1-3 and various types of external electronic devices. The NIC 204 realizes the reception unit 220 and the transmission unit 222 of
The storage 205 includes a readable and writable storage medium, and a driving apparatus for reading and writing various types of information, such as programs and data, from and to this storage medium. For the storage 205, a hard disk or a semiconductor memory such as a flash memory can be used as the storage medium. The storage 205 realizes the storage unit 21 described using
The storage 205 includes a program storage region for storing a program that is intended for the aggregation node 2 to execute aggregation processing, total computation processing, and distribution processing with respect to the partial computation results from the calculation nodes 1-1 to 1-3. The storage 205 may include, for example, a backup region and the like for backing up the aforementioned data, programs, and so forth.
The I/O 206 includes a network port to which signals from external devices are input and which outputs signals to external devices. For example, network ports that correspond in number to the calculation nodes 1-1 to 1-3 can be provided. Alternatively, one network port can be provided in a case where the aggregation node 2 and the calculation nodes 1-1 to 1-3 are connected via a network switch.
The addition circuit 207 can be configured using, for example, an adder built from basic logic gates and the like. The addition circuit 207 realizes the addition unit 221 described using
Next, the operations of the calculation nodes 1 configured in the aforementioned manner will be described using a flowchart of
First, the operations of each calculation node 1 configured in the aforementioned manner will be described using the flowchart of
First, the computation unit 10 calculates a part of matrix products in learning of the neural network (step S1).
Next, once a partial computation result obtained by the computation unit 10 has been stored to the storage unit 11 (step S2: YES), the transmission unit 122 of the network processing unit 12A transmits the partial computation result obtained by the self-node to the aggregation node 2 (step S13). On the other hand, when the partial computation result calculated by the self-node has not been obtained yet (step S2: NO), the computation in step S1 is executed again (step S1).
Thereafter, the reception unit 120 of the network processing unit 12A receives a total computation result from the aggregation node 2 (step S14). Thereafter, the received total computation result is stored to the storage unit 11. Note that the plurality of calculation nodes 1-1 to 1-3 operate in a similar manner.
Next, the operations of the aggregation node 2 configured in the aforementioned manner will be described using a flowchart of
First, the reception unit 220 receives partial computation results obtained by the plurality of calculation nodes 1-1 to 1-3 (step S20).
Next, the network processing unit 22 determines whether to hold the received partial computation results in the storage unit 21 (step S21). The determination processing of step S21 is performed when, for example, the distributed deep learning system adopts an asynchronous system in which the transmission of partial computation results to the aggregation node 2 is started as soon as partial computation in each of the plurality of calculation nodes 1-1 to 1-3 is completed.
For example, when only the partial computation result calculated by the calculation node 1-1 has been received (step S21: YES), the network processing unit 22 causes the storage unit 21 to store the partial computation result from the calculation node 1-1 (step S22). In this case, the aggregation node 2 temporarily accumulates, in the storage unit 21, the partial computation results that have already been received, until the reception of all partial computation results that are necessary to perform group communication is complete.
Thereafter, for example, when the partial computation result calculated by the calculation node 1-2 has been received, the network processing unit 22 determines that the partial computation result of the calculation node 1-2 is not to be stored in the storage unit 21 (step S21: NO), and transmits this partial computation result to the addition unit 221 (step S23).
The addition unit 221 reads out the partial computation result of the calculation node 1-1 stored in the storage unit 21, and obtains a total computation result, which is a sum of this partial computation result and the partial computation result from the calculation node 1-2 (step S24). Thereafter, the transmission unit 222 distributes the total computation result obtained by the addition unit 221 to the plurality of calculation nodes 1-1 to 1-3 via the communication network.
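Steps S20 to S24 can be summarized by the following sketch of an asynchronous aggregation node; the dictionary standing in for the storage unit 21, the callback standing in for the transmission unit 222, and the rule that addition starts once all expected partial results are present are simplifying assumptions.

```python
import numpy as np

class AggregationNode:
    """Toy aggregation node: buffers partial results, adds them, and distributes the total."""

    def __init__(self, expected_nodes):
        self.expected = set(expected_nodes)
        self.buffer = {}                                 # stands in for the storage unit 21

    def on_receive(self, sender, partial, distribute):
        # Step S21: if other expected partial results are still missing, hold this one.
        if self.expected - set(self.buffer) - {sender}:
            self.buffer[sender] = partial                # step S22: store in the storage unit
            return
        # Steps S23-S24: all expected partial results are available, so add them.
        total = partial + sum(self.buffer.values())
        distribute(total)                                # the transmission unit then distributes the total

received = {}
distribute = lambda total: received.update({n: total for n in ("1-1", "1-2", "1-3")})

agg = AggregationNode(expected_nodes=["1-1", "1-2"])     # nodes contributing to, e.g., hidden layer h2
agg.on_receive("1-1", np.array([1.0, 2.0]), distribute)  # buffered (step S22)
agg.on_receive("1-2", np.array([0.5, -1.0]), distribute) # triggers addition and distribution
assert np.allclose(received["1-3"], [1.5, 1.0])          # every calculation node gets the total
```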
Next, the operations of the distributed deep learning system, which includes the aggregation node 2 and the calculation nodes 1-1 to 1-3 configured in the aforementioned manner, will be described with reference to a sequence diagram of
As shown in
Next, once the aggregation node 2 has received the partial computation results from the calculation nodes 1-1 and 1-2, the addition unit 221 obtains a total computation result, which is a sum of these partial computation results (step S202).
Thereafter, the aggregation node 2 distributes the total computation result, which indicates the outputs of the hidden layer h2, by transmitting it from the transmission unit 222 to the calculation nodes 1-1 to 1-3 (step S203).
Note that the distributed deep learning system is not limited to adopting the aforementioned asynchronous system, and can also adopt a synchronous system. In the case of the synchronous system, the plurality of calculation nodes 1-1 to 1-3 start transmitting the partial computation results to the aggregation node 2 at the timing of completion of partial computation in all of the plurality of calculation nodes 1-1 to 1-3. In this case, the processing for determining whether to store in the storage unit 21, which is performed in step S21 of
Furthermore, also in the case where the synchronous system is adopted, for example, as the outputs of the hidden layer h2 can be obtained at the time of completion of calculations by the calculation node 1-1 and the calculation node 1-2, group communication can also be started through the aggregation of partial computation results in the aggregation node 2 without waiting for the completion of calculation by the calculation node 1-3.
As described above, according to the second embodiment, the aggregation node 2 receives partial computation results that were respectively obtained by the plurality of calculation nodes 1-1 to 1-3, and obtains a total computation result by adding these partial computation results. Also, the aggregation node 2 distributes the obtained total computation result to the plurality of calculation nodes 1-1 to 1-3 via the communication network. In the aggregation node 2, it is sufficient to perform only addition processing, and thus the computation unit 10 is unnecessary. Therefore, according to the second embodiment, coordinated processing among calculation nodes can be performed at higher speed even if the number of calculation nodes connected to the communication network increases, compared to the conventional example in which the computation unit 10 performs addition processing in the form of software.
Note that the described embodiments have presented an example in which the plurality of calculation nodes 1-1 to 1-3 perform distributed learning of the entire neural network by dividing the neural network model, thereby increasing the speed of group communication. However, the distributed deep learning system according to the present embodiment can increase the speed of processing not only when applied to learning processing, but also when applied to large-scale matrix calculation including multiply-accumulate operations on matrices, such as inference processing.
Although the above has described embodiments of the distributed deep learning system and the distributed deep learning method of the present invention, the present invention is not limited to the described embodiments, and various types of modifications that can be envisioned by a person skilled in the art within the scope of the invention set forth in the claims can be made to the present invention.
This patent application is a national phase filing under section 371 of PCT application no. PCT/JP2019/044672, filed on Nov. 14, 2019, which application is hereby incorporated herein by reference in its entirety.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/JP2019/044672 | 11/14/2019 | WO |