This disclosure relates generally to machine learning and, more particularly, to a distributed architecture for training machine learning models.
Modern deep learning architectures trained on large-scale datasets can obtain impressive performance across a wide variety of domains, including speech and image recognition, image segmentation, image/video understanding and analysis, natural language processing, and various applications such as fraud detection, medical systems, and recommendation systems. However, training these machine learning models is computationally demanding. The training can take an impractically long time on a single machine.
Therefore, the task of training a machine learning model may be assigned to be performed by a distributed system that includes multiple machines. However, this introduces its own problems. Training involves a large amount of data. The training set typically contains a large number of training samples, each of which can be quite large, such as an image, video, text, or audio. The machine learning model itself can also be quite large, with a large number of layers and a large number of parameters (e.g., weights, biases, and so on) to be trained. Current approaches to training typically assign a single machine (a parameter server) to keep the master version of the parameters of the machine learning model and to synchronize the parameters and update them for the entire training task. As a result, a large volume of data is communicated between the parameter server and the other machines and the required communication bandwidth can be very significant when training large-scale models on a large-scale distributed system.
If it is desired to efficiently and effectively train multiple machine learning models or to train one model on multiple machines in a large-scale distributed system simultaneously, then the required communication bandwidth increases even more and the parameter server quickly becomes a bottleneck to training. As a result, either a significant investment in communication bandwidth is required or, if communication bandwidth is limited, then the overall training capacity will also be limited.
Therefore, there is a need for improved approaches to training machine learning models on a large-scale distributed system.
The present disclosure overcomes the limitations of the prior art by using a large-scale distributed computer system that includes a job server and multiple compute nodes. The job server allocates jobs for training machine learning models to groups of one or more compute nodes. These training groups execute the training jobs. However, updating the values of the parameters of the models and communicating the updated values preferably is performed within the compute nodes of the training group, rather than between the training group and the job server. In this way, the communications requirements on the job server are reduced.
In one implementation, the job server receives a plurality of jobs for training different machine learning models. The job server allocates the training jobs to training groups of one or more compute nodes, based on the current requirements of the training jobs and the current status of the compute nodes. Examples of job requirements include requirements on computing power, data storage, communication bandwidth and/or special capabilities. Node status generally includes node capabilities and node availability. The training groups execute their allocated training jobs. This typically includes updating values of parameters of the models, such as weights and biases, as the training progresses. The training groups preferably include two or more compute nodes. This updating and communicating the updated values is performed among the compute nodes within the training group, thus reducing communications to outside the group.
The architecture within each training group can vary from group to group, and the approach described can be hierarchical. For example, one of the compute nodes might function as a local job server and/or parameter server for the training group, organizing the remaining compute nodes into sub-groups. The allocation of training jobs to training groups and the composition of the training groups may also change dynamically, as training progresses, as training jobs are ordered or are completed and as compute nodes become available or unavailable.
With a reduced workload, the job server (and other servers) may be used to perform additional tasks, such as visualization of the machine learning models and their training or reporting on the status of compute nodes in the system.
Other aspects include components, devices, systems, improvements, methods, processes, applications, computer readable media, and other technologies related to any of the above.
Embodiments of the disclosure have other advantages and features which will be more readily apparent from the following detailed description and the appended claims, when taken in conjunction with the accompanying drawings, in which:
The figures depict various embodiments for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described herein.
The figures and the following description relate to preferred embodiments by way of illustration only. It should be noted that from the following discussion, alternative embodiments of the structures and methods disclosed herein will be readily recognized as viable alternatives that may be employed without departing from the principles of what is claimed.
The computer system 100 is used to train machine learning models. Examples of machine learning models include convolutional neural networks (CNNs), recurrent neural networks (RNNs), neural networks, and support vector machines.
In a typical training job, the machine learning model has an architecture with a certain number of layers and nodes, with weighted connections between nodes. Training the machine learning model typically includes determining the values of the parameters (e.g., weights and biases) of the model, based on a set of training samples. In supervised learning, the training samples are pairs of inputs and known good outputs (aka, ground truth). An input is presented to the machine learning model, which then produces an output, such as whether the input exhibits a target attribute or a confidence level that the input exhibits the target attribute. The difference between the machine learning model's output and the known good output is used to adjust the values in the model. This is repeated for many different training samples until the performance of the machine learning model is satisfactory. The process of determining whether the machine learning model is adequately trained is referred to as validation. Once trained, when a new input is presented, the machine learning model can satisfactorily predict the correct output. Machine learning models can be continuously trained, even while being used in active operation. Other types of machine learning methods include semi-supervised learning, unsupervised learning and reinforcement learning.
In the overall system, the job server 110 plays more of a role of managing and monitoring the allocation of training jobs to the compute nodes 130, and the compute nodes 130 play more of a role of executing the training tasks. These components 110, 130 include some sort of processing power and data storage (possibly shared), although the actual implementations can vary widely. For example, the processing power can be provided by conventional central processing units (CPUs), graphics processing units (GPUs), special purpose processors, custom ASICs, multi-processor configurations, and chips designed for training and inference. These components may also be implemented as actual physical components (e.g., blade servers) or through virtualization. The components 110, 130 also are not required to be all the same. For example, different compute nodes 130 may have different capabilities or may be specialized for certain tasks.
The network 120 provides connectivity between the different components. The term “network” is intended to be interpreted broadly. It can include formal networks with standard defined protocols, such as Ethernet and InfiniBand. However, it also includes other types of connectivity between components, such as backplane connection on a server rack, remote direct memory access (RDMA), and high performance computing fabric frameworks. The “network 120” can also combine different types of connectivity. It may include a combination of local area and/or wide area networks, using both wired and/or wireless links. Data exchanged between the components 110, 130 may be represented using any suitable format. In some embodiments, all or some of the data and communications may be encrypted.
Accordingly, the overall computer system 100 can be implemented in different ways. For example, it can be implemented entirely as a proprietary system. Alternately, it may be built on third party services or cloud offerings.
The dashed arrows in
The job server 110 allocates the training jobs based on the current requirements of the training jobs and the current status of the compute nodes 130. Upon allocating a job to a training group, in one embodiment, the job server 110 also transmits the initial set of parameters of the model (and/or other aspects of the training job) to the training group. Alternately, the job server 110 may not physically transmit the parameters to the training group but may provide pointers to the parameters or otherwise communicate the initial values to the training group. When training is completed, the final values of the parameters may or may not be transmitted to the job server 110. Interim values of the parameters preferably are not transmitted to the job server 110 and the job server 110 preferably does not carry out training calculations. However, the job server 110 typically will monitor each training group's progress and may access interim values of the parameters for display or monitoring purposes.
In this example, each training job is to train a different machine learning model, including adaptation of the parameters for the model. Thus, training group 140A trains machine learning model A, training group 140B trains a different machine learning model B, and so on. The training jobs may be ordered 115 at different times. Accordingly, the allocation 125A-D of the training jobs may occur over time.
The compute nodes 130 in each training group 140 work together to execute 143 their allocated training job. This includes calculating 143 updated values of the parameters for the model and communicating 147 these updated parameters among themselves. Take the training group 140A as an example. The compute nodes 130A1-N in the training group execute a training job to train a machine learning model A. As part of this job, different portions of the training set may be allocated to different compute nodes 130Ax, each of which then trains 143 using its training samples. The compute nodes 130Ax produce 143 updated values of the parameters based on their training, and these values are communicated 147 between the compute nodes in order to aggregate the training from all compute nodes 130Ax. The calculation of interim values and final values of the parameters preferably is performed by the compute nodes 130 in the training group. One or more of the compute nodes 130 can also provide local control and monitoring of execution of the training job by the training group.
The job server 110 allocates 125 training jobs to training groups of one or more compute nodes 130 based on current requirements of the training jobs and current status of the compute nodes 130. Examples of training requirements include requirements on computing power, data storage, communication bandwidth and/or special capabilities. The size of a training job often depends on factors such as the number of training samples and the size of the training samples, the size of the machine learning model and the number of parameters in the model, and the effectiveness of the training algorithm.
The status of the compute nodes can include both the node's capabilities and the node's availability. These can also be measures of computing power, data storage, communication bandwidth and/or special capabilities. Indicators of computing power include the number of processors or processor cores, the type and power of the processors, processing throughput rate (e.g., flops rating), and clock speed. Indicators of data storage include types and amount of data storage, read/write bandwidth, access time, preloading capacity, number of low memory warnings, and elapsed time since the last low memory warning. Factors such as bandwidth for other connections (e.g., PCI express), and motherboard topology such as NUMA and SMP will also impact data transfer. Indicators of communication bandwidth include types and numbers of network connections, rate of data transfer (e.g., an average of recent data transfer rates), network connection reliability (e.g., probability of network connection availability based on recent connectivity), and latency for data transfer.
In one embodiment, the job server 110 classifies the compute nodes 130 into different classes based on their capabilities. For example, some of the compute nodes 130 may have more processing power or a larger memory or special capabilities compared to the rest of the compute nodes 130. These might be classified as “Special” while the rest are classified as “Regular.” Each class may have further specifications. For example, the “Regular” compute nodes might include numbers to indicate processing power and memory capacity.
In some embodiments, the availability of the compute nodes 130 is classified as “Available,” “Partially Available” and “Unavailable.” For example, a compute node not executing any training job is Available, a compute node executing a training job but not at 100% capacity is Partially Available, and a compute node executing a training job using all of its capacity is Unavailable. In another approach, availability is indicated by a number, for example ranging from 0 to 1, or from 0 to 100. The job server 110 can use the different classifications to determine how many and which compute nodes are allocated to each training job.
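The two availability schemes described above can be sketched as follows. This is a minimal illustration only: the names `Availability`, `classify_availability` and `availability_score` are hypothetical, and using a single utilization fraction as the measure of how busy a compute node is represents an assumption, not something the description specifies.

```python
from enum import Enum


class Availability(Enum):
    AVAILABLE = "Available"
    PARTIALLY_AVAILABLE = "Partially Available"
    UNAVAILABLE = "Unavailable"


def classify_availability(utilization: float) -> Availability:
    """Map a utilization fraction in [0, 1] to one of the three classes.

    Assumption: 0.0 means the node is executing no training job and
    1.0 means it is executing a training job at 100% capacity.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    if utilization == 0.0:
        return Availability.AVAILABLE
    if utilization < 1.0:
        return Availability.PARTIALLY_AVAILABLE
    return Availability.UNAVAILABLE


def availability_score(utilization: float) -> float:
    """Numeric alternative on a 0-to-1 scale: 1.0 is idle, 0.0 is fully used."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return 1.0 - utilization
```

An allocation policy could then, for example, prefer nodes classified as Available and fall back to Partially Available nodes only when the former are exhausted.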
In one embodiment, the training job includes a set of training samples and the master 210M partitions the training job into smaller tasks by assigning subsets of training samples to different workers 210W. For example, if the training job includes 300,000 training samples, the master 210M could assign 100,000 training samples to each worker 210W. The master 210M may not assign the same number of training samples to each worker. It could assign the training samples to the workers 210W based on their status. For example, the master might partition the training job into 10 blocks of 30,000 training samples each. It might then assign the first three blocks of 30,000 training samples to the workers 210W1-3 and then assign the remaining blocks as workers 210W become available. The master 210M itself may also perform some training.
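The block-based partitioning in this example can be sketched as a queue of pending blocks from which the master hands out work. The class and function names below are hypothetical, and representing each block as a range of sample indices is an assumption made for brevity.

```python
from collections import deque


def make_blocks(num_samples: int, block_size: int) -> deque:
    """Split sample indices [0, num_samples) into contiguous blocks."""
    return deque(
        range(start, min(start + block_size, num_samples))
        for start in range(0, num_samples, block_size)
    )


class Master:
    """Hypothetical master that assigns blocks of samples to workers."""

    def __init__(self, num_samples: int, block_size: int):
        self.pending = make_blocks(num_samples, block_size)
        self.assigned = {}  # worker id -> block currently being trained

    def assign_next(self, worker_id: str):
        """Give the next pending block to a worker, or None if none remain."""
        if not self.pending:
            return None
        block = self.pending.popleft()
        self.assigned[worker_id] = block
        return block


# 300,000 samples in 10 blocks of 30,000; the first three blocks go out
# immediately and the remaining seven wait until a worker becomes available.
master = Master(300_000, 30_000)
first_round = [master.assign_next(w) for w in ("W1", "W2", "W3")]
```

A worker that finishes its block would call `assign_next` again, so faster workers naturally receive more blocks.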
In an alternate partitioning, the machine learning model can be subdivided into different components and the master 210M partitions the training job by assigning different model components to different workers 210W. For example, if the model is separable, some workers 210W might train earlier layers in the model and others might train later layers in the model. Alternately, some model components may be designed to detect certain features and those might be trained separately.
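As one way to picture the layer-wise partitioning, the sketch below splits an ordered list of layers into contiguous groups, one per worker. The function name and the layer names are hypothetical, and splitting into near-equal contiguous groups is an assumption; the description does not prescribe how model components are divided.

```python
def partition_layers(layer_names, num_workers):
    """Split an ordered list of layers into contiguous, near-equal groups."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    base, extra = divmod(len(layer_names), num_workers)
    groups = []
    start = 0
    for worker in range(num_workers):
        # The first `extra` workers each take one additional layer.
        size = base + (1 if worker < extra else 0)
        groups.append(layer_names[start:start + size])
        start += size
    return groups


# Earlier layers go to the first worker, later layers to the second.
groups = partition_layers(["conv1", "conv2", "conv3", "fc1", "fc2"], 2)
```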
A hybrid approach can also be used. For example, one compute node 220P1 might function as the single point of contact with the job server 110. That compute node 220P1 receives the training job from the job server and makes the initial partition of the training job into smaller tasks. It may also assign initial tasks to the other compute nodes 220P. However, the compute nodes 220P then act as peers with respect to executing the tasks and updating the parameters for the machine learning model. The primary compute node 220P1 may maintain the master set of parameters and also the queue of pending tasks.
As mentioned previously, the job server 110 allocates training jobs to groups of compute nodes. For convenience, these groups are referred to as training groups. The job server 110 preferably determines which compute nodes are included in which training groups. In some embodiments, this can change over time in response to changes in the current requirements of the training jobs and/or the current status of the compute nodes.
The training jobs are ordered at different times. As the job server 110 receives a training job, the job server 110 allocates the training job to compute nodes 130 based on the current requirements of the training jobs and the current status of the compute nodes 130. Table 350 is a time log showing the allocation of training jobs to compute nodes over time. In Table 350, a compute node 130 that is assigned to a job is marked with the job letter, a compute node that is on-line and available is marked with a blank cell, and a compute node that is off-line is marked with a diagonal striped pattern. In this example, we assume the computer system is capable of some level of dynamic reallocation. That is, the compute nodes assigned to a training job can be changed while the training job is executing. However, the use of a job server can also be applied to a static situation where the training group is fixed and must remain the same from the beginning to the end of the job. In that case, the allocation policy will be modified based on this additional constraint.
At time t0, five regular nodes R1-R5 and three special nodes S1-S3 are on-line and available, but no training jobs have been received yet. Nodes R6-R12 are off-line, as indicated by the diagonal striped pattern. At time t1, training job A is ordered and starts. Job A requires one regular node R and one special node S, but the job server 110 allocates the training job to two regular nodes R1-2 and two special nodes S1-2. The remaining compute nodes R3-5 and S3 are available for future jobs, and two more compute nodes R6-7 have come on-line.
Training job A is allocated to more compute nodes 130 than it requires because there are a lot of computing resources available at time t1. Accordingly, it takes less time to complete training job A. At the same time, not all available computing resources are assigned to training job A because other training jobs are expected in the near future. For example, the jobs may be scheduled in advance or the demand for future jobs may be predicted based on past history. In an alternate approach, job A could be allocated to the minimum required compute nodes. This may be appropriate if it is difficult to switch compute nodes in the middle of a job, or if a large number of jobs are expected before the current job completes. In the opposite approach, job A could be allocated to all available compute nodes, with dynamic reallocation as new jobs are ordered.
At time t2, training job B starts while job A is still being executed. The job server 110 assigns training job B to the required minimum of five regular nodes R3-7 and one special node S3. Thus, the computing resources of the training group are the same as the requirements for the job. At the same time, the regular nodes R1-2 and special nodes S1-2 continue to execute training job A. At time t2, there are no idle compute nodes.
At time t3, additional nodes R8-12 come on-line. These nodes are not allocated to either existing job A or B, which continue to execute the same as before. At time t4, training job C is ordered. However, training job C requires six regular nodes 130R and one special node 130S, but there are only five regular nodes R8-12 and no special nodes available. The currently available compute nodes are insufficient to meet the requirements of job C. The job server 110 dynamically reallocates nodes R2 and S2 from job A to job C, as shown by the arrows between the rows for times t3 and t4. This still meets the minimum required by job A, while freeing up resources to meet the required minimum for job C. Training job B is still executed by the same compute nodes, because the training group for training job B does not have excess compute nodes. The available pool now has no compute nodes.
At time t5, training job D is ordered. However, there are no available compute nodes so job D does not start execution. It must wait for one of the other jobs to complete. At time t6, job B completes, freeing up nodes R3-R7 and S3. The job server allocates job D to nodes R3-R5. This is basically a first-come, first-served approach.
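The reallocation at time t4 can be sketched as a two-step policy: take idle nodes first, then borrow nodes that running jobs hold beyond their minimum. The `Job` class and `allocate` function are hypothetical names, and the policy shown is only one plausible reading of the example, not a definitive statement of how the job server decides.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    min_regular: int
    min_special: int
    regular: list = field(default_factory=list)
    special: list = field(default_factory=list)


def _surplus(job, kind):
    """Nodes of the given kind that a job holds beyond its minimum."""
    return len(getattr(job, kind)) - getattr(job, "min_" + kind)


def allocate(job, free, running):
    """Try to give `job` its minimum nodes; return True on success.

    `free` maps "regular"/"special" to lists of idle node names.
    """
    kinds = ("regular", "special")
    # Check feasibility first so that a failed allocation changes nothing.
    for kind in kinds:
        reclaimable = sum(max(0, _surplus(j, kind)) for j in running)
        if len(free[kind]) + reclaimable < getattr(job, "min_" + kind):
            return False
    for kind in kinds:
        need = getattr(job, "min_" + kind)
        mine = getattr(job, kind)
        while len(mine) < need and free[kind]:  # idle nodes first
            mine.append(free[kind].pop(0))
        for other in running:  # then borrow surplus from running jobs
            while len(mine) < need and _surplus(other, kind) > 0:
                mine.append(getattr(other, kind).pop())
    running.append(job)
    return True


# State at time t3, followed by the arrival of job C at time t4.
job_a = Job("A", 1, 1, ["R1", "R2"], ["S1", "S2"])
job_b = Job("B", 5, 1, ["R3", "R4", "R5", "R6", "R7"], ["S3"])
running = [job_a, job_b]
free = {"regular": ["R8", "R9", "R10", "R11", "R12"], "special": []}
job_c = Job("C", 6, 1)
c_started = allocate(job_c, free, running)
```

In this run job C receives R8-12 plus R2 and S2 borrowed from job A, job A is left at its minimum of R1 and S1, and job B is untouched because it has no surplus.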
In alternate embodiments, when the computer system is oversubscribed, the job server 110 may allocate resources to training jobs based on priority. If job D had a higher priority than job C, then at time t5, the job server would dynamically reallocate compute nodes from job C to job D. Priority of training jobs can be determined by various factors including urgency of the training jobs, importance of the training jobs, and the period of time required to execute the training jobs. In an alternate approach, the allocation may be on a prorated basis.
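Priority-based ordering of waiting jobs might be sketched with a heap, as below. The `PendingJobs` class is hypothetical, priorities are assumed to be plain numbers where larger means more important, and the sketch covers only the ordering of jobs that are waiting; it does not model taking nodes away from a job that is already running.

```python
import heapq
import itertools


class PendingJobs:
    """Hypothetical queue of waiting jobs, served highest priority first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps arrival order

    def submit(self, name, priority, nodes_needed):
        # heapq is a min-heap, so negate priority to pop the highest first.
        entry = (-priority, next(self._counter), name, nodes_needed)
        heapq.heappush(self._heap, entry)

    def schedule(self, free_nodes):
        """Start jobs in priority order while the head job still fits."""
        started = []
        while self._heap and self._heap[0][3] <= free_nodes:
            _, _, name, needed = heapq.heappop(self._heap)
            free_nodes -= needed
            started.append(name)
        return started, free_nodes


queue = PendingJobs()
queue.submit("C", priority=1, nodes_needed=7)
queue.submit("D", priority=5, nodes_needed=3)
started, remaining = queue.schedule(free_nodes=8)
```

With equal priorities the arrival counter reduces this to the first-come, first-served behavior of the earlier example.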
At time t7, compute nodes R8-9 go offline unexpectedly. As a result, job C no longer has the required number of compute nodes. However, compute nodes R6-7 are available, so those could be allocated to job C. In this example, job C is reallocated to nodes R3-7 and job D is moved to nodes R10-12. This might be done, for example, if nodes R3-7 are in one data center and nodes R8-12 are in a different data center. This way, all regular nodes assigned to a job are in the same data center.
In the above examples, the job server 110 was primarily responsible for managing execution of the training jobs, while the compute nodes 130 were primarily responsible for the computation required in the training jobs and also updating and communicating parameters for the machine learning models. In some embodiments, the job server 110 also performs other functions. For example, the job server may monitor the training groups' execution of their allocated training jobs and/or the status of the compute nodes 130. The job server 110 may also provide a visual display of the parameters of the training jobs and/or status of the compute nodes 130.
In one implementation, the job server 110 provides a visual display in which available compute nodes are marked with green icons versus red icons for unavailable compute nodes and yellow icons for partially available compute nodes. The visual display can also show the internal architecture of the training groups and/or their level of activity. A user of the computer system 100 can use the visual display to control progress of the training jobs and determine whether to send new training jobs to the job server 110.
The buffer node 450 buffers data to be used in a next training job to be executed by the compute nodes 130. For example, the job server 410 pre-loads data (e.g., training samples, initial values of parameters of the model) to the buffer node 450. The compute nodes 130 then access the data from the buffer node 450. The buffer node 450 provides a sort of caching function for the system as a whole, thus increasing overall system performance.
In
The interface module 510 facilitates communication with other devices and/or users. Training jobs are received via the interface module 510 and instructions for the compute nodes are dispatched via the interface module 510. Data transfer also occurs via the interface module 510. The interface module 510 can include a user interface.
The system monitor 520 monitors the status (capability and/or availability) of the compute nodes. The system monitor 520 may include functionality to auto-discover the capabilities of the compute nodes in terms of computing power, storage and communication. The system monitor 520 also determines which compute nodes are on-line, and whether they are available, partially available or unavailable.
The allocation engine 530 determines requirements of training jobs and allocates the training jobs to compute nodes based on the requirements of the training jobs and status of the compute nodes. In one embodiment, the allocation engine 530 determines how many compute nodes are required by each training job and also looks into how many compute nodes are available or partially available. It allocates the training jobs to compute nodes accordingly. The allocation of training jobs, including reallocation, can be done dynamically.
The compute node manager 540 provides the logic for controlling and instructing the compute nodes. It generates instructions for the compute nodes to execute training jobs. The instructions can include a description of the machine learning model of the training job (e.g., ID, purpose, mathematical algorithm, and initial values of the parameters), location of the training samples for the training job, and information about the other compute nodes in the training group.
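The content of such an instruction could be represented as a simple record. The sketch below is illustrative only: the class names, field names and the example values are all hypothetical, and only the kinds of information carried (model description, sample location, peer compute nodes) come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class ModelDescription:
    model_id: str
    purpose: str
    algorithm: str
    initial_parameters: dict  # parameter name -> initial values


@dataclass
class TrainingInstruction:
    model: ModelDescription
    samples_location: str  # where the compute node finds its training samples
    peer_nodes: list = field(default_factory=list)  # other nodes in the group


instruction = TrainingInstruction(
    model=ModelDescription(
        model_id="model-A",  # hypothetical identifier
        purpose="image classification",
        algorithm="convolutional neural network",
        initial_parameters={"layer1.weights": [0.0, 0.0]},
    ),
    samples_location="/data/job-A/samples",  # hypothetical path
    peer_nodes=["R2", "S1", "S2"],
)
```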
Depending on the amount of control by the job server over the compute nodes, the compute node manager 540 may also manage other aspects. For example, instructions can additionally define the architecture of the training group, such as identifying which compute node in the training group is a master and which ones are workers. Also, the instruction can specify partitioning of the training job between the compute nodes in the training groups. In some embodiments, the instruction specifies the communication of updated values of the parameters between the compute nodes. For example, the instructions might specify that a particular compute node is to receive updated values from the other compute nodes in the training group, that compute node will reconcile the training results and produce an updated set of parameters and then send the updated values back to the other compute nodes for further training.
The job monitor 550 monitors progress of the various training jobs. It may query for progress reports, or training groups may self-report their progress.
The display module 560 provides displays of information related to execution of the training jobs and/or status of the computer system. In one embodiment, the display module 560 displays status of the compute nodes. The user can determine whether to send more training jobs to the computer system or to specific nodes based on the displayed status. In another embodiment, the display module 560 displays values of the parameters of the machine learning models. For example, the display module 560 might display the initial values and final values of the parameters of a machine learning model. The display module 560 might also display updated values of the parameters as the training progresses.
In
The control module 620 provides the logic for controlling the compute node, including the interaction with the job server and with the other compute nodes. It is partially a counterpart to the compute node manager 540 in the job server.
The training module 630 executes training jobs. In this example, the training module 630 includes an adaptation engine 632 and a validation engine 634. The training module 630 uses training samples to train the machine learning model. In one approach, the training module 630 forms a positive training set of training samples that have the target attribute in question and a negative training set of training samples that lack the target attribute in question. The adaptation engine 632 updates values of the parameters of the machine learning model to fit the positive training set and the negative training set. Different machine learning techniques may be used in different embodiments, such as linear support vector machine (linear SVM), boosting for other algorithms (e.g., AdaBoost), neural networks, logistic regression, naïve Bayes, memory-based learning, random forests, bagged trees, decision trees, boosted trees, or boosted stumps.
The validation engine 634 validates the trained machine learning model based on additional samples. The validation engine 634 applies the trained model to the validation samples to quantify the accuracy of the trained model. Common metrics applied in accuracy measurement include Precision=TP/(TP+FP) and Recall=TP/(TP+FN), where TP is the number of true positives, FP is the number of false positives and FN is the number of false negatives. Precision is how many outcomes the trained model correctly predicted had the target attribute (TP) out of the total that it predicted had the target attribute (TP+FP). Recall is how many outcomes the trained model correctly predicted had the attribute (TP) out of the total number of validation samples that actually did have the target attribute (TP+FN). The F score (F-score=2*Precision*Recall/(Precision+Recall)) unifies Precision and Recall into a single measure. Common metrics applied in accuracy measurement also include Top-1 accuracy and Top-5 accuracy. Under Top-1 accuracy, a trained model is accurate when the top-1 prediction (i.e., the prediction with the highest probability) predicted by the trained model is correct. Under Top-5 accuracy, a trained model is accurate when one of the top-5 predictions (i.e., the five predictions with the highest probabilities) is correct. The validation engine 634 may use other types of metrics to quantify the accuracy of the trained model. In one embodiment, the training module 630 iteratively re-trains the machine learning model until the occurrence of a stopping condition, such as the accuracy measurement indicating that the model is sufficiently accurate, or a number of training rounds having taken place.
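The metrics above translate directly into code. In the sketch below the function names are hypothetical, returning 0.0 when a denominator is zero is an assumed convention, and `top_k_accuracy` assumes each validation sample comes with one score per class and a single correct class index.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP); 0.0 if nothing was predicted positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN); 0.0 if there were no actual positives."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_score(p: float, r: float) -> float:
    """F-score = 2 * Precision * Recall / (Precision + Recall)."""
    return 2 * p * r / (p + r) if p + r else 0.0


def top_k_accuracy(scores, labels, k: int) -> float:
    """Fraction of samples whose correct class is among the k top scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Example: 8 true positives, 2 false positives, 8 false negatives.
p = precision(8, 2)
r = recall(8, 8)
f = f_score(p, r)

# Two validation samples over three classes; class 1 is correct for both.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
labels = [1, 1]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

Top-5 accuracy is the same computation with `k=5`, and a stopping condition could compare any of these values against a chosen threshold after each training round.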
The parameter coherency module 640 aggregates the training results from different compute nodes. For example, the training on one compute node may create one set of updated values for the parameters, and the training on another compute node may create a different set of updated values. The parameter coherency module 640 combines these results into a single set of updated values.
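One simple way to combine results is to average the per-node values, optionally weighting each compute node by how many samples it trained on. The description does not say how the parameter coherency module 640 combines results, so averaging here is an assumption, and the function name and dictionary layout are hypothetical.

```python
def average_parameters(updates, weights=None):
    """Combine per-node parameter sets into one set by (weighted) averaging.

    `updates` is a list of dicts mapping a parameter name to a list of
    floats; every node is assumed to report the same names and lengths.
    """
    if not updates:
        raise ValueError("at least one set of updated values is required")
    if weights is None:
        weights = [1.0] * len(updates)
    total = sum(weights)
    merged = {}
    for name in updates[0]:
        length = len(updates[0][name])
        merged[name] = [
            sum(w * u[name][i] for w, u in zip(weights, updates)) / total
            for i in range(length)
        ]
    return merged


# Two compute nodes report different updated values for the same weights.
node_1 = {"w": [1.0, 2.0]}
node_2 = {"w": [3.0, 4.0]}
combined = average_parameters([node_1, node_2])
weighted = average_parameters([node_1, node_2], weights=[1.0, 3.0])
```

Because this step runs on compute nodes inside the training group, the combined values can be sent back to the group's nodes without passing through the job server.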
Although the detailed description contains many specifics, these should not be construed as limiting the scope of the invention but merely as illustrating different examples and aspects of the invention. It should be appreciated that the scope of the invention includes other embodiments not discussed in detail above. For example, more than one job server can be used with a set of compute nodes. Various other modifications, changes and variations which will be apparent to those skilled in the art may be made in the arrangement, operation and details of the method and apparatus of the present invention disclosed herein without departing from the spirit and scope of the invention as defined in the appended claims. Therefore, the scope of the invention should be determined by the appended claims and their legal equivalents.
Alternate embodiments are implemented in computer hardware, firmware, software, and/or combinations thereof. Implementations can be implemented in a computer program product tangibly embodied in a machine-readable storage device for execution by a programmable processor; and method steps can be performed by a programmable processor executing a program of instructions to perform functions by operating on input data and generating output. Embodiments can be implemented advantageously in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. Each computer program can be implemented in a high-level procedural or object-oriented programming language, or in assembly or machine language if desired; and in any case, the language can be a compiled or interpreted language. Suitable processors include, by way of example, both general and special purpose microprocessors. Generally, a processor will receive instructions and data from a read-only memory and/or a random access memory. Generally, a computer will include one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks. Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM disks. Any of the foregoing can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits) and other forms of hardware.