This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2021-081513, filed on May 13, 2021, the entire contents of which are incorporated herein by reference.
The embodiment discussed herein is related to a machine learning program, a machine learning method, and an information processing device.
As a machine learning method in deep learning, distributed training by data parallelism is known. In distributed training, a plurality of processes (workers) having the same neural network (model) is provided, different training data portions are input to the plurality of processes, and machine learning is performed. Hereinafter, machine learning may be referred to as training or simply as learning.
Here, in one process of machine learning, each processing of forward propagation (Fwd), backward propagation (Bwd), and update (Up.) is repeated. In distributed training using the plurality of processes, results of backward propagation in all the processes are aggregated before the update processing so as to acquire an average, and the update processing is executed by each process using this average value.
In backward propagation, weight gradient information is obtained that indicates how much each weight of the neural network should be changed in the next update so as to reduce the error (loss). Furthermore, in the update processing, the values of various parameters are updated on the basis of the average of the weight gradients obtained by the respective processes.
The training results (weight gradient information) of the plurality of processes are aggregated through communication between the processes, and for example, the aggregation is achieved by Allreduce communication.
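Note that one iteration of such data-parallel training may be sketched, for example, as follows, assuming that mpi4py and NumPy are available; the function names and the gradient computation are hypothetical placeholders and do not represent the implementation of the present embodiment.

    # Sketch of one data-parallel iteration: each process computes a weight
    # gradient on its own training data portion, the gradients of all the
    # processes are averaged by Allreduce communication, and every process
    # applies the same update.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    def compute_local_gradient(weights, data_portion):
        # Placeholder for forward propagation (Fwd) and backward propagation (Bwd).
        return np.zeros_like(weights)

    def training_step(weights, data_portion, learning_rate=0.01):
        local_grad = compute_local_gradient(weights, data_portion)
        grad_sum = np.empty_like(local_grad)
        comm.Allreduce(local_grad, grad_sum, op=MPI.SUM)   # aggregate over all processes
        avg_grad = grad_sum / comm.Get_size()              # average of the weight gradients
        return weights - learning_rate * avg_grad          # update (Up.) processing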
In distributed training by data parallelism, communication is performed between the processes when the training results of the respective processes are aggregated. However, if even one process is delayed, the entire processing time is extended because the other processes wait for synchronization.
Therefore, methods for preventing performance deterioration due to such a delayed process are known. One such method removes a process in which a delay occurs (for example, a delayed process P1) from the training result aggregation target.
U.S. Patent Application Publication No. 2016/0092765, Japanese Laid-open Patent Publication No. 2020-177343, and Koichi Shirahata, et al., "Preliminary Performance Analysis of Distributed ANN Training with Relaxed Synchronization" [online] are disclosed as related art.
According to an aspect of the embodiments, a non-transitory computer-readable storage medium storing a machine learning program that causes at least one computer to execute a process, the process including: in a case where performances of one or more first computing nodes of a plurality of computing nodes are deteriorated in distributed training in machine learning by using the plurality of computing nodes, when a ratio of a process of the first computing node to a whole process is equal to or less than a threshold, causing each second computing node other than the first computing node of the plurality of computing nodes to perform machine learning in a first mode in which a learning result of the process of the first computing node is not reflected on the machine learning; and when the ratio is more than the threshold, causing each second computing node to perform machine learning in a second mode in which training data to be processed by the process of the first computing node is distributed to and processed by the second computing nodes.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
In the detach mode, there is a problem in that machine learning becomes insufficient and accuracy deteriorates because a process in which a delay occurs is excluded from the training result aggregation target. For example, in a case where the processing speed of a plurality of processes is reduced and the plurality of processes is excluded from the aggregation target by the detach mode, there is a case where the accuracy does not reach desired accuracy (for example, 73.3%).
In one aspect, an object of the embodiment is to improve accuracy of distributed training in machine learning.
According to one embodiment, accuracy of distributed training in machine learning can be improved.
Hereinafter, an embodiment of a machine learning program, a machine learning method, and an information processing device will be described with reference to the drawings. Note that the embodiment to be described below is merely an example, and there is no intention to exclude application of various modifications and technologies not explicitly described in the embodiment. In other words, for example, the present embodiment may be variously modified and implemented without departing from the scope of the gist thereof. Furthermore, each drawing is not intended to include only the components illustrated therein, and may include other functions and the like.
These computing nodes 10-1 to 10-n are connected to be communicable with each other via a network 2. The computing nodes 10-1 to 10-n have configurations similar to each other. Hereinafter, in a case where the computing nodes 10-1 to 10-n are not particularly distinguished from each other, each of them is referred to as a computing node 10.
This computer system 1 performs machine learning in deep learning, and achieves distributed training by data parallelism using the plurality of computing nodes 10. One or more processes (worker) having the same neural network (model) are provided in each computing node 10, and these computing nodes 10 respectively perform processes of machine learning in parallel. In this computer system 1, different training data portions are input to the plurality of processes in parallel, and distributed machine learning (training) is performed.
In the present embodiment, an example is indicated in which one process is assigned to each computing node 10.
Furthermore, in this computer system 1, the computing node 10-1 of the plurality of computing nodes 10 functions as a master (primary) and achieves a function as a distributed training management unit 100 to be described later.
The computing node 10 includes, for example, a CPU 11, a memory 12, an HDD 13, an accelerator 14, and an NIC 15 as a hardware configuration.
The memory 12 is used as a main storage device of the computing node 10. The memory 12 temporarily stores at least a part of operating system (OS) programs and application programs executed by the CPU 11. Furthermore, the memory 12 stores various types of data needed for processing by the CPU 11. The application programs may include a machine learning program (not illustrated) that is executed by the CPU 11 so that the computing node 10 achieves a function as the distributed training management unit 100 according to the present embodiment.
The HDD 13 magnetically reads and writes data from and into a built-in disk. The HDD 13 is used as an auxiliary storage device of the computing node 10. The HDD 13 stores OS programs, application programs, and various types of data. Note that, as the auxiliary storage device, a storage class memory (SCM) and a semiconductor storage device (solid state drive (SSD)) such as a flash memory can be used.
The accelerator 14 is a processing device that performs specific arithmetic processing and is, for example, a graphics processing unit (GPU). In the present embodiment, an example is indicated in which the CPU 11 executes the application program described above (machine learning program) so as to achieve the function as the distributed training management unit 100. However, the present embodiment is not limited to this. In other words, for example, the accelerator 14 may also execute the machine learning program so as to achieve the function as the distributed training management unit 100.
The NIC 15 is a network interface. The NIC 15 is connected to the network 2. The NIC 15 transmits and receives data to and from another computing node 10 and another communication device (not illustrated) via the network 2. The NIC 15 may also be referred to as a network card 15.
With the computing node 10 having the hardware configuration as described above, the function (distributed training function) as the distributed training management unit 100 according to the present embodiment to be described later can be achieved.
The computing node 10 that functions as a master (for example, the computing node 10-1) executes a program (machine learning program or the like) recorded, for example, in a computer-readable non-transitory recording medium so as to achieve the distributed training management function according to the present embodiment. A program in which processing content to be executed by the computer 10 is described can be recorded in various recording media. For example, the program to be executed by the computer 10 can be stored in the HDD 13. The CPU 11 loads at least a part of the program from the HDD 13 onto the memory 12 and executes the loaded program.
Furthermore, the program executed by the computer 10 (CPU 11) can be recorded in a non-transitory portable recording medium such as an optical disk, a memory device, or a memory card. The program stored in the portable recording medium can be executed after being installed to the HDD 13, for example, under control by the CPU 11. Furthermore, the CPU 11 may also directly read and execute the program from the portable recording medium.
The CPU (processing unit) 11 controls the entire computing node 10. The CPU 11 may also be a multiprocessor. Instead of the CPU 11, any one of a micro processing unit (MPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a programmable logic device (PLD), and a field programmable gate array (FPGA) may also be used. Furthermore, the CPU 11 may also be a combination of two or more types of elements of the CPU, MPU, DSP, ASIC, PLD, and FPGA.
The distributed training management unit 100 has functions as a performance information collection unit 101, a reduction process number determination unit 102, a mode determination unit 103, a processing time calculation unit 104, and an execution process number determination unit 105.
The performance information collection unit 101 collects performance information of each computing node 10 (CPU 11). The performance information collection unit 101 may also collect, for example, a throughput performance of each computing node 10 as the performance information. A throughput indicates a data processing amount per unit time and may also be, for example, the number of processed images per unit time. Furthermore, the collection of the performance information may be achieved by using various known methods, and description thereof will be omitted.
It is desirable that the performance information collection unit 101 collect the performance information of each computing node 10 at a predetermined timing after the start of training by this computer system 1.
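Note that, as one illustrative example, the throughput may be measured as sketched below; the processing function is a hypothetical placeholder and does not represent the implementation of the present embodiment.

    # Sketch: measure a throughput as the number of processed images per unit
    # time for one computing node (process).
    import time

    def measure_throughput(process_images, images):
        start = time.perf_counter()
        process_images(images)                 # placeholder for the training processing
        elapsed = time.perf_counter() - start
        return len(images) / elapsed           # processed images per second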
The reduction process number determination unit 102 counts the number of processes in which a delay occurs, for example, in a case where the processing of one or more computing nodes 10 (processes) is delayed and the other computing nodes 10 wait for synchronization. Hereinafter, a computing node 10 in which a delay occurs may be referred to as a delay computing node 10.
For example, in a case where the processing performance of a computing node 10 is equal to or less than a predetermined threshold or in a case where a response is delayed by a predetermined threshold or more, the reduction process number determination unit 102 may determine that the performance of the corresponding computing node 10 is deteriorated and may recognize the corresponding computing node 10 as the delay computing node 10.
The reduction process number determination unit 102 counts the number of processes of the one or more delay computing nodes 10. In the detach mode, the learning result of a process corresponding to the delay computing node 10 is not reflected in the machine learning. Hereinafter, the total number of processes corresponding to the delay computing nodes 10 may be referred to as the number of reduced processes.
The mode determination unit 103 determines an optimization operation mode of machine learning at the time of the occurrence of the delay computing node 10 on the basis of the number of reduced processes calculated by the reduction process number determination unit 102.
In this computer system 1, as optimization methods of machine learning at the time of the occurrence of the delay computing node 10, a detach mode (first mode) and a shrink mode (second mode) are provided, and one of the two types of operation modes is selectively used. The detach mode (first mode) and the shrink mode (second mode) are countermeasures (performance stabilization methods) for stabilizing the calculation speed of the computer system 1 in a case where the processing performances of the plurality of computing nodes 10 vary.
In the detach mode, a process of a delay computing node (first computing node) 10 is separated from a machine learning group, and a process learning result of the delay computing node 10 is excluded from a training result aggregation target.
In the shrink mode, execution of the process by the delay computing node 10 is prevented, and machine learning is performed by processes other than the corresponding process. Training data to be processed by the process of the delay computing node 10 (first computing node) is distributed to and processed by the computing nodes 10 (processes, second computing nodes) other than the delay computing node 10 among the plurality of computing nodes 10.
For example, in a case where distributed training is started with 32-process execution, a learning hyperparameter for 32 processes is set. However, in a case where two processes are deleted, 30-process execution is performed in the detach mode while the setting at the time of 32 processes is kept. On the other hand, in the shrink mode, 30-process execution is performed while the setting is changed to a setting for 30 processes.
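Note that the difference between the two modes may be sketched, for example, as follows; the helper function and its names are hypothetical and do not limit the implementation of the present embodiment.

    # Sketch: data amount per process when 2 of 32 processes are removed.
    def data_amount_per_process(total_data, initial_processes, reduced_processes, mode):
        remaining = initial_processes - reduced_processes      # e.g. 32 - 2 = 30
        if mode == "detach":
            # The setting decided for the initial 32 processes is kept.
            return total_data // initial_processes
        # In the shrink mode, the training data is re-divided among the
        # remaining 30 processes, so the amount per process increases.
        return total_data // remaining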
In the example described here, distributed training is performed by five processes P0 to P4, and the total number of pieces of training data is 4000.
Here, as indicated by the reference A, the mini batch size per process is set to 100. At this time, because there are five processes in total, 500 pieces of data are used for machine learning for one update. Therefore, processing of one epoch needs eight (= 4000/500) iterations.
In the example indicated by the reference B, a delay occurs in the process P4, and the detach mode separates the process P4 from the machine learning group, so that the remaining four processes P0 to P3 perform machine learning without reflecting the learning result of the process P4.
In the example indicated by the reference C, a delay occurs in the process P4, and machine learning is performed in the shrink mode by the remaining four processes P0 to P3.
In such a shrink mode, 400 (= 100×4) pieces of data are processed for one update. Because the total number of pieces of data is 4000, ten (= 4000/400) updates are needed so as to perform machine learning by the four processes P0 to P3.
In the shrink mode, data to be processed by the process P4 is distributed to and processed by the remaining processes P0 to P3. Therefore, the number of iterations (time) needed to process one epoch increases.
In the example indicated by the reference C, the number of iterations needed to process one epoch increases from eight to ten.
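Note that the iteration counts in this example follow directly from the data amounts, as sketched below with the values described above.

    # Iterations (updates) needed for one epoch: 4000 pieces of training data,
    # mini batch size 100 per process.
    total_data, batch_per_process = 4000, 100
    print(total_data // (batch_per_process * 5))   # reference A: 8 iterations (5 processes)
    print(total_data // (batch_per_process * 4))   # reference C: 10 iterations (4 processes)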
Note that, in the shrink mode, all the training data may be distributed to the computing nodes 10 other than the delay computing node 10, or, after a certain percentage of the training data is deleted, the remaining training data may be distributed to and learned by the computing nodes 10 other than the delay computing node 10.
For example, in a case where there are 100 pieces of training data, all the 100 pieces of training data may be distributed to the computing nodes 10 other than the delay computing node 10. Alternatively, a certain percentage (for example, 6%) may be excluded, and the remaining 94 pieces of data may be distributed to and learned by the computing nodes 10 other than the delay computing node 10.
By deleting a certain percentage of the training data and distributing the remaining data, a repetition time in machine learning can be shortened as compared with a case where the data is not deleted.
The mode determination unit 103 determines whether distributed training is performed in the detach mode or the shrink mode on the basis of the number of reduced processes determined by the reduction process number determination unit 102. For example, the mode determination unit 103 compares the ratio of the number of reduced processes with respect to the total number of processes with a threshold (for example, 6%), and in a case where the ratio of the number of reduced processes is equal to or less than the threshold, the mode determination unit 103 selects the detach mode. Furthermore, in a case where the ratio of the number of reduced processes is more than the threshold, the mode determination unit 103 selects the shrink mode. Hereinafter, the ratio of the number of reduced processes with respect to the total number of processes may be referred to as a reduced process number ratio.
As described above, a data processing amount per process in the detach mode is different from that in the shrink mode.
In a case where the reduced process number ratio is equal to or less than a predetermined threshold, the mode determination unit 103 selects the detach mode, and in a case where the reduced process number ratio is more than the threshold, the mode determination unit 103 selects the shrink mode. Therefore, in a case where the reduced process number ratio is equal to or less than the threshold, the data processing amount per process is fixed, and in a case where the reduced process number ratio is more than the threshold, the data processing amount per process increases.
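Note that this determination may be sketched, for example, as follows; the function name and the default threshold value are illustrative assumptions and do not limit the implementation of the present embodiment.

    # Sketch of the mode determination based on the reduced process number ratio.
    def determine_mode(num_reduced_processes, total_processes, threshold=0.06):
        reduced_ratio = num_reduced_processes / total_processes
        if reduced_ratio <= threshold:
            return "detach"    # first mode: data processing amount per process is fixed
        return "shrink"        # second mode: data processing amount per process increases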
The processing time calculation unit 104 calculates a processing time in distributed training performed using the plurality of computing nodes 10 on the basis of the performance information of each computing node 10 collected by the performance information collection unit 101.
The processing time calculation unit 104 sorts the plurality of computing nodes 10 (processes) according to the performance information (throughput) of each computing node 10 collected by the performance information collection unit 101. Then, the processing time calculation unit 104 selects one or more computing nodes 10 in order from the computing node 10 having the lowest processing performance. This processing is synonymous with selecting the processes of the selected computing nodes 10.
Hereinafter, the one or more selected computing nodes 10 may be referred to as reduction computing nodes 10, and the process of a reduction computing node 10 may be referred to as a reduction process.
Furthermore, the plurality of computing nodes 10 other than the reduction computing nodes 10 among all the computing nodes 10 may be referred to as a remaining computing node group, and the plurality of processes in the remaining computing node group may be referred to as a remaining process group.
The processing time calculation unit 104 calculates (estimates or simulates) a processing time of the processes (remaining process group) executed by the remaining computing node group. The processing time calculation unit 104 estimates each processing time of machine learning by a plurality of types of remaining computing node groups having different numbers of included computing nodes 10, on the basis of the throughput of each computing node 10.
The processing time calculation unit 104 adopts a throughput of the computing node 10 having the lowest processing performance (throughput) among the plurality of computing nodes 10 included in the remaining computing node group as a throughput of the remaining computing node group.
The processing time calculation unit 104 calculates (estimates) a processing time of the corresponding remaining computing node group (remaining process group) by dividing a data amount (processing data amount) processed by the remaining computing node group by the throughput of the remaining computing node group.
The processing time calculation unit 104 sequentially calculates the processing times by the remaining computing node group while incrementing the reduction computing node 10 one by one.
For example, in a case where 32 processes are used as a reference (at the time of execution start), the processing time calculation unit 104 first estimates a processing time in a case where the remaining process group includes 31 processes. Next, the processing time calculation unit 104 estimates a processing time in a case where the remaining process group includes 30 processes. Hereinafter, the processing time calculation unit 104 calculates each processing time of the remaining process group with each number of processes while reducing the number of processes of the remaining process group.
The processing time calculation unit 104 may also simulate processing times of the remaining process group in all patterns while reducing the processes of the remaining process group until the number of processes of the remaining process group becomes one. Furthermore, the processing time calculation unit 104 may also perform simulation of the processing time of the remaining process group, for example, until the number of processes of the remaining process group becomes a half of the reference. In other words, for example, in the example described above (reference 32 processes), the simulation may also be terminated at the time when the number of processes of the remaining process group reaches a half of the reference number of processes (16 processes), and the simulation may be performed while appropriately changing the number of processes.
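Note that this estimation may be sketched, for example, as follows, assuming that a throughput has been collected for each computing node 10 and that data_amount_fn returns the data amount to be processed per process for a given number of remaining processes (depending on the selected mode); the names are hypothetical and do not represent the implementation of the present embodiment.

    # Sketch: estimate processing times while removing the slowest nodes one by one.
    def estimate_processing_times(throughputs, data_amount_fn, min_remaining=1):
        ordered = sorted(throughputs)                   # slowest computing nodes first
        times = {}
        for reduced in range(1, len(ordered) - min_remaining + 1):
            remaining = ordered[reduced:]               # remaining computing node group
            slowest = remaining[0]                      # adopted throughput of the group
            times[len(remaining)] = data_amount_fn(len(remaining)) / slowest
        return times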
In the following example, it is assumed that there are 1200 pieces of training data. Furthermore, it is assumed that the mode determination unit 103 determines that the accuracy reaches target accuracy in the detach mode when the number of processes is equal to or less than four, and determines that the accuracy reaches the target accuracy in the shrink mode when the number of processes is more than five.
As described above, the processing time calculation unit 104 calculates a processing time of a remaining computing node group (remaining process group) by dividing the data amount (processing data amount) processed by the remaining computing node group by the throughput of the remaining computing node group.
In the example described here, in a case where the number of computing nodes 10 (the number of processes) of the remaining computing node group is five, the data amount to be processed is 200 and the throughput of the remaining computing node group is 60.
Therefore, the processing time calculation unit 104 calculates a processing time of the remaining process group (refer to reference a) of which the number of computing nodes 10 (the number of processes) is five according to the following formula.
200/60=3.33
On the other hand, in a case where the number of computing nodes 10 (the number of processes) of the remaining computing node group is four, the data amount to be processed is 300 and the throughput of the remaining computing node group is 80.
Therefore, the processing time calculation unit 104 calculates a processing time of the remaining process group (refer to reference b) of which the number of computing nodes 10 (the number of processes) is four according to the following formula.
300/80=3.75
The execution process number determination unit 105 determines the number of processes to be processed in parallel in distributed training (the number of execution processes) on the basis of the processing time of the remaining process group with each number of processes calculated by the processing time calculation unit 104.
The execution process number determination unit 105 determines the number of processes of the remaining process group of which the processing time is the shortest among the processing times of the remaining process group with each number of processes calculated by the processing time calculation unit 104, as the number of execution processes.
For example, in the example described above, because the processing time (3.33) of the remaining process group of five processes is shorter than the processing time (3.75) of the remaining process group of four processes, the execution process number determination unit 105 determines five as the number of execution processes.
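Note that, applying the above estimation to this example, the number of execution processes may be obtained as sketched below, where the values 200, 60, 300, and 80 are taken from the description above.

    # Processing times of the remaining process groups and the resulting
    # number of execution processes.
    times = {5: 200 / 60, 4: 300 / 80}                   # {5: 3.33..., 4: 3.75}
    num_execution_processes = min(times, key=times.get)  # -> 5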
A method for determining the number of computing nodes at the time of distributed training by the computer system 1 as an example of the embodiment configured as described above will be described with reference to the flowcharts (steps A1 to A4 and steps B1 to B5).
In step A1, the performance information collection unit 101 collects the performance information (throughput) of each computing node 10.
In step A2, it is confirmed whether deterioration in the performance is detected in any one of the computing nodes 10 on the basis of the collected performance information.
In a case where the deterioration in the performance is not detected in any one of the computing nodes 10 (refer to NO route in step A2), the processing ends. On the other hand, in a case where the deterioration is detected in any one of the computing nodes 10 (refer to YES route in step A2), the procedure proceeds to step A3.
In step A3, the processing time calculation unit 104 calculates a processing time in distributed training performed using the plurality of computing nodes 10 on the basis of the performance information of each computing node 10 collected by the performance information collection unit 101.
The details of the processing in step A3 will be described below with reference to steps B1 to B5.
In step B1, the reduction process number determination unit 102 counts the number of reduced processes, in other words, the total number of processes of the one or more delay computing nodes 10.
In step B2, the mode determination unit 103 confirms whether the ratio of the number of reduced processes with respect to the total number of processes is equal to or less than the threshold.
As a result of the confirmation, in a case where the ratio of the number of reduced processes with respect to the total number of processes is equal to or less than the threshold (refer to YES route in step B2), the procedure proceeds to step B3.
In step B3, the mode determination unit 103 selects the detach mode. In the detach mode, the data processing amount per process does not change. Thereafter, the procedure proceeds to step B5.
Furthermore, as a result of the confirmation in step B2, in a case where the ratio of the number of reduced processes with respect to the total number of processes is not equal to or less than the threshold (refer to NO route in step B2), in other words, for example, in a case where the ratio of the number of reduced processes with respect to the total number of processes is more than the threshold, the procedure proceeds to step B4.
In step B4, the mode determination unit 103 selects the shrink mode. In the shrink mode, the data processing amount per process changes. In other words, for example, the training data to be processed by the processes of the delay computing nodes 10 is distributed to the computing nodes 10 other than the delay computing nodes 10 among the plurality of computing nodes 10. Thereafter, the procedure proceeds to step B5.
In step B5, the processing time calculation unit 104 calculates a processing time of the corresponding remaining computing node group (remaining process group) by dividing the data amount (processing data amount) processed by the remaining computing node group by the throughput of the remaining computing node group. Thereafter, the procedure proceeds to step A4.
In step A4, the execution process number determination unit 105 determines, as the number of execution processes, the number of processes of the remaining process group of which the processing time calculated in step A3 is the shortest. Thereafter, the processing ends.
In this way, according to the computer system 1 as an example of the embodiment, the processing time calculation unit 104 sequentially calculates the processing times by the remaining computing node group while incrementing the reduction computing node 10 one by one. Then, the execution process number determination unit 105 determines the number of processes of the remaining process group of which the processing time is the shortest among the processing times of the remaining process group with each number of processes calculated by the processing time calculation unit 104, as the number of execution processes. This can shorten a time required before training in distributed training is completed, and in particular, for example, even in a case where the processing performances of the computing nodes 10 vary, efficient distributed training can be achieved.
Furthermore, the mode determination unit 103 determines an optimization operation mode of machine learning at the time of the occurrence of the delay computing node 10 on the basis of the number of reduced processes calculated by the reduction process number determination unit 102.
Specifically, for example, in a case where the ratio of the reduction processes is equal to or less than the threshold, the mode determination unit 103 adopts the detach mode, in which the processing amount of each computing node 10 of the remaining process group does not increase. As a result, the effect on the accuracy caused by not reflecting the processes on machine learning in the detach mode can be suppressed, the deterioration in the accuracy is prevented, and desired accuracy can be secured. Furthermore, even in a case where some computing nodes 10 are delayed, the time required before machine learning is completed can be shortened.
Furthermore, in a case where the ratio of the reduction processes is more than the threshold, by adopting the shrink mode, in which the processing amount of each computing node 10 of the remaining process group increases, it is possible to prevent the deterioration in the accuracy and to secure the desired accuracy.
Each configuration and each processing of the present embodiment may also be selected or omitted as needed or may be appropriately combined.
Then, the disclosed technology is not limited to the above-described embodiment, and various modifications may be made and implemented without departing from the spirit of the present embodiment.
For example, in the embodiment described above, an example is indicated in which the computing node 10-1 of the plurality of computing nodes 10 included in the computer system 1 functions as a master (primary) and achieves the function as the distributed training management unit 100. However, the embodiment is not limited to this. A management device different from the computing node 10 may be provided in the computer system 1, and the corresponding management device may also achieve the function as the distributed training management unit 100.
Furthermore, in the embodiment described above, an example is indicated in which one of the detach mode and the shrink mode is selected as the performance stabilization method in a case where the processing performances of the plurality of computing nodes 10 vary. However, the embodiment is not limited to this. A method other than the detach mode and the shrink mode may also be used as the stabilization method.
Furthermore, the present embodiment may be implemented and manufactured by those skilled in the art according to the disclosure described above.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.