The use of deep neural networks (DNNs) has led to breakthroughs in various fields, including image and speech recognition, natural language processing, autonomous vehicles, and more, by enabling computers to automatically learn and extract intricate patterns and representations from large datasets. A DNN is a type of artificial neural network that consists of multiple hidden layers between the input and output layers. In a DNN, each layer contains neurons (also known as nodes or units), and each neuron is connected to neurons in the previous and subsequent layers. These connections are associated with weights and biases, which are learned during a training process. The use of multiple layers of neurons has enabled DNNs to learn and represent complex patterns and hierarchical features from input data.
DNNs are trained through a process called backpropagation and optimization. This process involves adjusting the weights and biases of the network's neurons to minimize a predefined loss function, which measures the difference between the network's predictions and the actual target values in the training data. Initially, the weights and biases of a DNN are set to small random values. A forward pass is then performed during which input training data is fed into the network and propagated through the layers using activation functions and weighted connections. Each neuron's output becomes the input for the next layer. This process continues until the final output layer produces an output. The difference between the DNN output and the target output is calculated using a loss function. Common loss functions include mean squared error (for regression tasks) and categorical cross-entropy (for classification tasks). A backward pass (i.e., backpropagation) is then performed, which involves calculating the gradient of the loss one layer at a time starting with the output layer. The gradients are then used to update the weights and biases of the network based on an optimization algorithm, such as gradient descent or its variants. The process is repeated until the DNN's performance on test data reaches a predetermined level (i.e., convergence), or until a predetermined number of epochs (i.e., cycles through the full training data set, where each cycle includes a forward and a backward pass) has been completed.
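By way of illustration only, the following is a minimal PyTorch sketch of the training loop described above; the model, data, loss function, and hyperparameters are arbitrary placeholders rather than part of any disclosed implementation.

```python
import torch
import torch.nn as nn

# Placeholder model and synthetic data for illustration only.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
loss_fn = nn.CrossEntropyLoss()                     # classification loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

inputs = torch.randn(256, 32)                       # training examples
targets = torch.randint(0, 10, (256,))              # target class labels

for epoch in range(5):                              # predetermined number of epochs
    optimizer.zero_grad()
    outputs = model(inputs)                         # forward pass through the layers
    loss = loss_fn(outputs, targets)                # difference between output and target
    loss.backward()                                 # backward pass (backpropagation)
    optimizer.step()                                # update weights and biases
```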
As DNNs have become larger, various strategies have been developed to improve the efficiency of DNN training by reducing the computational resources required to train the model while at the same time increasing throughput. One such strategy is distributed training. Distributed training is typically achieved by introducing parallelism into the training of the DNN, such as data parallelism, pipeline parallelism, and tensor model parallelism, to mitigate the large memory requirements of training while also boosting throughput. Another strategy that has been used to increase the efficiency of DNN training is determining the optimal hardware architecture on which to implement a DNN.
However, because the combined exploration of hardware architecture configurations and distribution schemes is a complex task, particularly when distributed execution across multiple devices is considered, these two strategies for increasing DNN training and inference efficiency are typically considered separately. The identification of optimal solutions for co-optimizing accelerator architecture and distributed execution strategy for deep learning training is therefore needed.
In one general aspect, the instant disclosure presents a training optimization system for training a deep learning model. The training optimization system includes a processor and a memory in communication with the processor wherein the memory stores executable instructions that, when executed by the processor alone or in combination with other processors, cause the training optimization system to perform multiple functions. The functions include receiving input data which includes at least one training script for the deep learning model and defines an area constraint for accelerator optimization; extracting operator graphs pertaining to each layer in the deep learning model from the at least one training script using a graph extractor component; estimating a latency for each operator in each of the operator graphs using a runtime estimator component; generating a set of feasible architecture configurations that satisfy the area constraint using an architecture generator component; performing an evaluation process that includes: selecting a current feasible architecture configuration from the set of feasible architecture configurations having a largest area that has not been evaluated; providing the operator graphs, the latency for each operator in each of the operator graphs, and the current feasible architecture configuration to a solver component; using the solver component to solve an integer linear program (ILP) for each layer in the deep learning model based on the operator graphs and the latency of each operator in each of the operator graphs to generate an optimal schedule of operators to execute on a single accelerator architecture; using the solver component to solve a dynamic programming (DP) algorithm with reference to the optimal schedule of operators and the current feasible architecture configuration to determine a distribution strategy that identifies an optimal combination of pipeline parallel depth, data parallel width, and tensor model parallel width for executing the layers across multiple accelerators; providing feedback to the architecture generator component indicating a throughput for the current feasible architecture configuration; and repeating the evaluation process until a convergence is reached, the convergence being indicated by a decrease in the throughput for the current feasible architecture configuration; once the convergence has been reached, selecting an optimal architecture configuration from the feasible architecture configurations that have been evaluated and the distribution strategy determined for the optimal architecture configuration to use as a basis for an accelerator design for training the deep learning model.
In yet another general aspect, the instant disclosure presents a method of training optimization for a deep learning model that includes receiving input data which includes at least one training script for the deep learning model and defines an area constraint for accelerator optimization; extracting operator graphs pertaining to each layer in the deep learning model from the at least one training script using a graph extractor component; estimating a latency for each operator in each of the operator graphs using a runtime estimator component; generating a set of feasible architecture configurations that satisfy the area constraint using an architecture generator component; performing an evaluation process that includes: selecting a current feasible architecture configuration from the set of feasible architecture configurations having a largest area that has not been evaluated; providing the operator graphs, the latency for each operator in each of the operator graphs, and the current feasible architecture configuration to a solver component; using the solver component to solve an integer linear program (ILP) for each layer in the deep learning model based on the operator graphs and the latency of each operator in each of the operator graphs to generate an optimal schedule of operators to execute on a single accelerator architecture; using the solver component to solve a dynamic programming (DP) algorithm with reference to the optimal schedule of operators and the current feasible architecture configuration to determine a distribution strategy that identifies an optimal combination of pipeline parallel depth, data parallel width, and tensor model parallel width for executing the layers across multiple accelerators; providing feedback to the architecture generator component indicating a throughput for the current feasible architecture configuration; and repeating the evaluation process until a convergence is reached, the convergence being indicated by a characteristic change in performance for the current feasible architecture configuration; once the convergence has been reached, selecting an optimal architecture configuration from the feasible architecture configurations that have been evaluated and the distribution strategy determined for the optimal architecture configuration to use as a basis for an accelerator design for training the deep learning model.
In a further general aspect, the instant application describes a non-transitory computer readable medium on which are stored instructions that when executed cause a programmable device to perform functions of receiving input data which includes at least one training script for a deep learning model and defines an area constraint for accelerator optimization; extracting operator graphs pertaining to each layer in the deep learning model from the at least one training script using a graph extractor component; estimating a latency for each operator in each of the operator graphs using a runtime estimator component; generating a set of feasible architecture configurations that satisfy the area constraint using an architecture generator component; performing an evaluation process that includes: selecting a current feasible architecture configuration from the set of feasible architecture configurations having a largest area that has not been evaluated; providing the operator graphs, the latency for each operator in each of the operator graphs, and the current feasible architecture configuration to a solver component; using the solver component to solve an integer linear program (ILP) for each layer in the deep learning model based on the operator graphs and the latency of each operator in each of the operator graphs to generate an optimal schedule of operators to execute on a single accelerator architecture; using the solver component to solve a dynamic programming (DP) algorithm with reference to the optimal schedule of operators and the current feasible architecture configuration to determine a distribution strategy that identifies an optimal combination of pipeline parallel depth, data parallel width, and tensor model parallel width for executing the layers across multiple accelerators; providing feedback to the architecture generator component indicating a throughput for the current feasible architecture configuration; and repeating the evaluation process until a convergence is reached, the convergence being indicated by a characteristic change in performance for the current feasible architecture configuration; once the convergence has been reached, selecting an optimal architecture configuration from the feasible architecture configurations that have been evaluated and the distribution strategy determined for the optimal architecture configuration to use as a basis for an accelerator design for training the deep learning model.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
The drawing figures depict one or more implementations in accord with the present teachings, by way of example only, not by way of limitation. In the figures, like reference numerals refer to the same or similar elements. Furthermore, it should be understood that the drawings are not necessarily to scale.
The emergence of large DNNs has necessitated various degrees of parallelism, also referred to as distributed training, that uphold training fidelity while enhancing execution throughput. To address the memory overheads associated with training, pipeline parallel training has been adopted, which involves distributing layers across devices to minimize per-device storage requirements. In this type of parallelism, a mini-batch of training data is subdivided into micro-batches that devices process in a pipeline. Different pipelining strategies have been implemented which exhibit unique memory footprints due to variations in the order of micro-batches and the timing of pipeline flushes. In typical pipeline parallelism, a stage includes one or more layers. However, as the sizes of model operators (such as matrix multiplication) increase, there is a growing trend towards splitting a single layer across multiple devices. This approach is commonly referred to as intra-layer tensor model parallelism. For instance, one such strategy distributes a transformer layer by partitioning the self-attention and MLP layers over several devices. Finally, since training is a throughput-centric task, it benefits from model replication if more devices are available. This can be achieved through data parallelism, which works in conjunction with pipeline parallelism and tensor model parallelism by replicating the entire flow multiple times.
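As a simple illustration of how these three forms of parallelism compose, the following sketch computes the device count and micro-batch count for a hypothetical configuration; the numbers are arbitrary and not tied to any particular model.

```python
# Hypothetical distribution configuration (all values are illustrative).
data_parallel_width = 4        # d: number of pipeline replicas
pipeline_parallel_depth = 8    # p: number of pipeline stages
tensor_parallel_width = 2      # t: devices a single layer is split across

devices_required = data_parallel_width * pipeline_parallel_depth * tensor_parallel_width
print(devices_required)        # 64 accelerators for one full replica set

mini_batch_size = 512
micro_batch_size = 16
micro_batches_per_pipeline = mini_batch_size // data_parallel_width // micro_batch_size
print(micro_batches_per_pipeline)   # each pipeline processes B/d = 8 micro-batches per step
```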
Besides varying degrees of parallelism, certain runtime techniques are employed to further minimize the memory footprint, albeit at the expense of throughput. One such technique is activation recomputation. Instead of stashing the activations, they are recomputed during the execution of the backward pass. Although this approach significantly reduces memory requirements throughout training, it necessitates the execution of the forward pass twice.
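One common realization of activation recomputation is gradient checkpointing; the sketch below uses PyTorch's torch.utils.checkpoint purely as an example, with a placeholder stage standing in for the layers whose activations would otherwise be stashed.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Placeholder stage whose intermediate activations would normally be stashed.
stage = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

x = torch.randn(8, 1024, requires_grad=True)

# The intermediate activations of `stage` are discarded after the forward pass and
# recomputed during the backward pass, trading a second forward pass for lower memory.
y = checkpoint(stage, x, use_reentrant=False)
y.sum().backward()
```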
Given the unique data flow, memory, and compute requirements of DNNs, their training is often executed on specialized hardware known as domain-specific hardware accelerators. Hardware accelerators are specialized computational resources designed to perform specific tasks with high efficiency. These accelerators are highly suitable for deep learning execution due to their ability to handle predictable computation and memory access patterns. As a result, many studies have proposed accelerator designs for either specific deep learning layers, such as convolutions and transformers, or entire inference or training processes. Accelerator vendors have largely converged on two types of cores to execute the operators relevant to these models: tensor and vector cores. Tensor cores excel at handling high-throughput matrix operations, such as convolutions, General Matrix Multiplications (GEMMs), and batched matrix multiplications. Vector cores effectively perform point-wise and activation functions such as GELU, ReLU, Tanh, and vector-vector addition.
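A hypothetical operator-to-core mapping consistent with this division of labor is sketched below; the operator names are illustrative, and the mapping is not intended as an exhaustive or authoritative categorization.

```python
# Illustrative mapping of operator kinds onto the two core types described above.
TENSOR_CORE_OPS = {"conv2d", "gemm", "batched_matmul"}      # high-throughput matrix operations
VECTOR_CORE_OPS = {"gelu", "relu", "tanh", "vector_add"}    # point-wise and activation functions

def core_type(op_name: str) -> str:
    """Return the core type an operator would be dispatched to in this sketch."""
    if op_name in TENSOR_CORE_OPS:
        return "tensor"
    if op_name in VECTOR_CORE_OPS:
        return "vector"
    raise ValueError(f"unclassified operator: {op_name}")
```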
As noted above, the distribution strategy and hardware architecture utilized for DNN training were typically determined separately due to the complex nature of distributing execution across multiple devices in a computationally vast multi-dimensional space. Previously known strategies for optimizing hardware configurations for DNNs have typically only been concerned with inference tasks, i.e., tasks performed after the DNN has been trained and is presented with new data. However, this approach can be inefficient due to the challenges presented by training, such as the fact that training graphs are much larger than those used for inference, the optimizer and backward pass operators have distinct computational and memory requirements compared to forward pass operators, and training has a larger memory footprint. It is important to note that the design of such accelerators depends not only on the model and its execution graph, but also on the distribution strategy. Previously known strategies for distributing a model across accelerators have typically presupposed a certain fixed domain-specific architecture. This information is then used to establish the optimal number of stages in a pipeline, the layers that constitute each stage in a pipeline, data parallel width, and tensor model parallel width for end-to-end training. However, this approach creates a cyclical dependency between device placement and architecture search optimization.
To address these technical problems, and more, in an example, this description provides technical solutions in the form of a DNN training optimization system that implements algorithmic solutions to solve the conjoined problem of accelerator architecture search and model partitioning for distributed training. The system makes the multi-dimensional optimization space of architecture search and device placement tractable by reducing the number of accelerators explored through area-based heuristics and employing a novel integer linear program (ILP), the complexity of which is dependent only on the number of operators in one layer. The ILP scheduling optimization also explores the partitioning of operators across cores, known as intra-operator parallelism. Despite the vast space, the ILP described herein requires significantly less time (≤1 hour) to perform the optimizations across all explored accelerator configurations than previously known implementations. Based on the optimal backward and forward pass latencies, the system leverages a novel dynamic programming (DP) approach to determine the device placement and model partitioning scheme.
The optimization scheme described herein for determining the optimal hardware architecture includes two main parts: first, the quantity and dimensions of each type of core (i.e., tensor and vector) to use are identified, and second, the execution schedule for these cores is determined. For each hardware configuration, the specific operator schedule for each layer is optimized using an ILP, which additionally decides which operators should be executed on multiple vector or tensor cores (intra-operator parallelism). Generally, ILPs produce optimal solutions but their applicability is hindered by inefficiency. To increase the efficiency of the ILP, the use of time-indexed variables is avoided so that its size depends only on the number of operators in each layer. This operator schedule is tailored for each layer (or layer slice when intra-layer parallelism is implemented) of the DNN. The output of the ILP for all layers and layer slice combinations is then passed to a novel DP algorithm to determine the best distribution strategy (optimal combination of data, pipeline, and tensor model parallelism) across the multiple devices. The evaluation of all possible architectural configurations, even with a predefined hardware template, can be computationally intensive, as each exploration requires estimating the operator latencies within a layer, executing the ILP for every layer/layer slice to find the optimal latency and schedule of operators on an accelerator, and performing the DP optimization to determine the device placement strategy. To increase the efficiency of this process, a heuristic is introduced that establishes an early stopping criterion based on the accelerator area (e.g., size) and the performance metrics realized by the configurations already explored.
The technical solutions described herein address the technical problem of inefficiencies and difficulties in determining the optimal hardware architecture and optimal distribution strategy for efficient DNN training. The solutions can reduce the memory requirement while increasing throughput for a DNN or set of DNNs relative to previously known techniques.
The cloud infrastructure 102 is configured to provide one or more cloud computing services and/or distributed computing services to users over the network 106. The computing services include a DNN training optimization service 108 (explained in more detail below). The cloud infrastructure may provide other services, such as hosting applications, user authentication, file storage, system updates, and the like. Cloud infrastructure 102 includes one or more servers 120 which are configured to provide computational and storage resources for the DNN training optimization service 108. Servers are implemented using any suitable number and type of physical and/or virtual computing resources (e.g., standalone computing devices, blade servers, virtual machines, etc.). Cloud infrastructure 102 may also include one or more data stores 122 for storing data, programs, and the like for implementing and managing the DNN training optimization service 108.
Cloud infrastructure 102 includes a cloud manager 110 for managing various aspects of the cloud infrastructure, such as deploying, configuring, and managing physical and/or virtual machines. Cloud manager 110 includes a load balancer 112 for distributing requests and workloads among server farms and/or among servers of a server farm. The load balancer 112 utilizes parameters such as load, number of connections, and server performance to determine where to distribute the requests and workloads. Cloud manager 110 also includes a health monitoring system 114 configured to monitor the health of physical and virtual resources and identify faulty components so that remedial action can be taken.
Client devices 104 enable users to access the services provided by the cloud infrastructure 102 via the network 106, such as the DNN training optimization service 108. Client devices 104 can be any suitable type of computing device, such as personal computers, desktop computers, laptop computers, smart phones, tablets, gaming consoles, smart televisions and the like. Client devices 104 include one or more client (software) applications 116 that are configured to interact with services made available by cloud infrastructure 102. In some implementations, client applications 116 include dedicated applications installed on the client device and programmed to interact with one or more services provided by cloud infrastructure. In other embodiments, client applications 116 include general purpose applications, such as a web browser, configured to access services over the network 106.
In accordance with the disclosure, cloud infrastructure includes a DNN training optimization system 118 for providing the training optimization service. An example implementation of a DNN training optimization system 200 is shown in
In implementations, the input data includes one or more model training scripts. A training script is a computer program or code written to train a DNN. Training scripts define various characteristics and parameters of the model being trained and the process used to train the model. For example, training scripts may define the architecture of the model being trained. This includes specifying the layers, layer slices, operators, connections, and other components of the model. Training scripts may also define the loss function which is used to determine how well the model's performance compares to target performance and the optimization algorithm and variables used to update model parameters to minimize the loss function. Training scripts may also define global batch sizes for training and degrees of parallelism supported. The input data also defines design constraints, such as architectural template, accelerator size constraints, energy expenditure constraints, and/or model topology.
The graph extractor component 204 receives the one or more training scripts provided as input, extracts layer/layer slice information and operator information, and generates graphs for each layer/layer slice and corresponding operator graphs for each layer/layer slice. In embodiments, the operator graphs are Directed Acyclic Graphs (DAGs). The nodes in the graphs represent operators, and the connections between nodes indicate activation and weight information for the operators. In embodiments, graph extraction may be performed using a suitable program, function, and/or library. As an example, graph extraction may be performed using torch.fx, a developer toolkit that enables graphs to be generated from code, such as training scripts. The graph extractor component provides the layer/layer slice graphs and operator graphs for an optimization operation to the runtime estimator component 206.
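For example, a minimal graph-extraction pass over a placeholder layer might look as follows; torch.fx.symbolic_trace is used for illustration, and the layer itself is an arbitrary stand-in for a layer or layer slice of the model being trained.

```python
import torch.nn as nn
from torch.fx import symbolic_trace

# Placeholder layer standing in for one layer/layer slice extracted from a training script.
layer = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

traced = symbolic_trace(layer)              # build a GraphModule (a DAG) from the code
for node in traced.graph.nodes:             # nodes correspond to operators
    print(node.op, node.name, node.target)  # edges (node.args) carry activation/weight inputs
```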
The architecture generator component 208 receives design constraints (e.g., area constraint, energy constraint, and the like) from the input data and selects a set of feasible architecture configurations for evaluation that satisfy the design constraints, such as the architecture area constraint and/or energy constraint. The architecture generator component 208 then selects an initial architecture configuration to provide to the runtime estimator component. For the purpose of this disclosure, architecture configurations are identified based on a predetermined accelerator template. An example accelerator template is shown in
In embodiments, each architecture configuration is represented as a 5-tuple that includes the following elements: numtc, numvc, PEx, PEy, and PEvc, which respectively denote the number of tensor cores, number of vector cores, x- and y-dimensionality of the MAC units in each tensor core, and width of the vector lane in each vector core. Thus, for an architecture configuration that includes one tensor core having the maximum x-y dimensions of 256×256 and one vector core having the maximum vector lane width of 256, the 5-tuple is represented by {1, 1, 256, 256, 256}. A single model often has similar tensor dimensions throughout, as those are determined by static hyperparameters such as attention heads, sequence length, hidden size, batch size, etc. Thus, both activation operators and matrix multiplication operators share similar tensor sizes. Hence, to further reduce the search space, the vector and tensor core PE widths of architecture configurations are restricted so as to be identical, i.e., PEx is equal to PEvc. The architecture generator component provides the current selected architecture configuration (e.g., 5-tuple) to the runtime estimator.
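The 5-tuple representation and the PEx = PEvc restriction can be illustrated with the following sketch; the area proxy, the discrete dimension grid, and the helper names are assumptions made for illustration and do not reflect the actual area model.

```python
from itertools import product
from typing import NamedTuple

class ArchConfig(NamedTuple):
    num_tc: int   # number of tensor cores
    num_vc: int   # number of vector cores
    pe_x: int     # x-dimension of the MAC array in each tensor core
    pe_y: int     # y-dimension of the MAC array in each tensor core
    pe_vc: int    # vector-lane width of each vector core

def approx_area(cfg: ArchConfig) -> int:
    # Assumed proxy: total MAC units plus total vector lanes (not the disclosed area model).
    return cfg.num_tc * cfg.pe_x * cfg.pe_y + cfg.num_vc * cfg.pe_vc

def feasible_configs(area_budget: int):
    dims = [32, 64, 128, 256]                    # illustrative PE widths
    counts = [1, 2, 4, 8]                        # illustrative core counts
    for num_tc, num_vc, pe_x, pe_y in product(counts, counts, dims, dims):
        cfg = ArchConfig(num_tc, num_vc, pe_x, pe_y, pe_vc=pe_x)   # PEvc tied to PEx
        if approx_area(cfg) <= area_budget:
            yield cfg

# The largest single-core configuration mentioned above:
print(ArchConfig(1, 1, 256, 256, 256))
```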
The runtime estimator component 206 receives the graphs from the graph extractor component 204 and the current selected architecture configuration and is configured to estimate at least one operator performance metric, such as latency per operator, energy expenditure per operator, and the like, for each node/operator in each operator graph. In some implementations, two latency estimates may be provided for each operator, with one latency estimate being for use when intra-operator parallelization is implemented and the other latency estimate being for use when intra-operator parallelization is not implemented. The runtime estimator includes the operator metric estimates with the operator graphs, e.g., by annotation. Latency estimation and energy expenditure estimation may be performed in any suitable manner. As examples, latency estimation may be performed using Timeloop or Sunstone libraries, and energy expenditure estimation may be performed using Cacti libraries.
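A minimal annotation pass of this kind is sketched below; estimate_latency is a placeholder for an external cost model such as Timeloop or Sunstone, and the two-estimate convention follows the description above.

```python
# Attach two latency estimates to every operator node: one assuming intra-operator
# parallelism and one without. `estimate_latency` stands in for an external estimator.
def annotate_latencies(op_graph, arch_cfg, estimate_latency):
    for node in op_graph.nodes:
        node.lat_single = estimate_latency(node, arch_cfg, intra_op_parallel=False)
        node.lat_parallel = estimate_latency(node, arch_cfg, intra_op_parallel=True)
    return op_graph
```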
Once operator metric estimates have been determined for each operator in each layer/layer slice and the graphs have been annotated, the runtime estimator component 206 provides the annotated graphs and the current selected architecture configuration to the solver component 210. The goal of the solver component 210 is to determine the device placement and the operator schedules in each layer or layer slice. The solver component may also determine other specifications for accelerator design, such as on-chip buffer size(s) and/or chip memory size(s). The solver component 210 also provides a feedback metric, such as throughput, end-to-end latency, overall energy expenditure, power budget, or total cost of ownership (TCO, which includes the cost of building the chip, energy consumption, and power budget), for the selected architecture configuration to the architecture generator component 208, which can be used later as the basis for selecting an optimal configuration for the system. The architecture generator component then selects the next feasible architecture configuration to be evaluated and the process is repeated.
Metric estimation, such as latency estimation and energy expenditure estimation, is a resource intensive task. To reduce the overhead of such estimates, the repetitive structure of large language models is leveraged by only running estimations for operators within a single layer or a layer slice at a time. All the non-repeat layers are estimated independently. To further reduce the overhead of estimations, a pruning scheme is employed to reduce the number of accelerator configurations that need to be evaluated by the system. The pruning scheme is based on the heuristic insight that there is typically a minimum and/or maximum value for a design constraint, such as chip size, chip area, energy consumption, and the like, which can be used to define at least one bound for feasible architectures. As one example, the configuration {1, 1, 256, 256, 256} could be considered a maximum chip area size for defining feasible architectures in some cases. In some cases, hardware may define maximum energy/power characteristics which could be used as the basis for defining at least one bound for feasible architectures. Once at least one bound for an architecture search has been determined, evaluation of feasible architectures can start at one bound and proceed from there in order of decreasing or increasing magnitude depending on whether the bound is a minimum or maximum bound. The bounds defining the architecture search based on an area constraint are illustrated in
The architecture generator component is configured to monitor the feedback metric from the solver component pertaining to the evaluated architecture configurations to determine when convergence occurs. In embodiments, convergence is identified when a feedback metric, such as throughput, for a currently selected architecture configuration changes relative to previously evaluated architectures. For example, when a feedback metric indicates that the throughput of the current evaluated architecture has diminished or is no longer following the trend set by previous evaluations, this may be used as an indication that further evaluations may not be necessary. Convergence may be identified using a predefined hysteresis level which defines the number of architecture configurations (e.g., 5, 10, etc.) having an uncharacteristic feedback metric, e.g., diminished throughput, that must occur before convergence is identified. An example algorithm for generating architecture configurations to explore and to identify convergence based on hysteresis level is shown in
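One plausible realization of this hysteresis criterion is sketched below; the function and threshold are illustrative and assume the feedback metric is a throughput value recorded per evaluated configuration.

```python
# Declare convergence once `hysteresis` consecutive configurations fail to improve on
# the best throughput observed so far (e.g., hysteresis = 5 or 10).
def search_converged(throughput_history, hysteresis=5):
    if len(throughput_history) <= hysteresis:
        return False
    best_so_far = max(throughput_history[:-hysteresis])
    recent = throughput_history[-hysteresis:]
    return all(tp <= best_so_far for tp in recent)
```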
Once convergence has been reached, the architecture generator component selects one of the evaluated configurations as the optimal architecture configuration for the current optimization operation. Any suitable selection method and/or criteria may be used in selecting an optimal architecture configuration. In embodiments, parameters, such as minimum throughput, maximum size, minimum energy expenditure, minimum TCO, and the like, may be used as the basis for selecting an optimal architecture configuration. These parameters may be predetermined and programmed into the system and/or may be provided as user input.
An example implementation of a solver component 600 is shown in
The ILP solver 606 executes an ILP for every selected architecture configuration, as described by the 5-tuple {numtc, numvc, PEx, PEy, PEvc}, and for every layer/layer slice across all possible tensor model parallel widths of a model. Tensor model parallel width refers to the number of accelerators used in the tensor model parallelization of a layer. At this stage, memory usage (of the accelerator's High Bandwidth Memory, HBM) is not a concern, as it is independent of the scheduling of the layer's operator graph. The objective is to schedule a layer or layer slice's operator graph on a single accelerator in order to minimize latency.
The ILP explores the search space of schedules under the following conditions:
The input to the ILP is an operator DAG (V, E) of a layer/layer slice, where V is the set of operators (i.e., nodes) and E is the set of edges in the graph. Each node includes architecture-dependent latency estimates (as provided by the runtime estimator component). As noted above, two latency estimates may be provided, with one latency estimate being for cases when intra-operator parallelization is used and the other latency estimate being for cases when intra-operator parallelization is not used. Each operator in the graph is categorized as either tensor-core, vector-core, or fused.
In deep learning, a matrix multiplication operator is often directly followed by an activation function, such as ReLU (rectified linear unit). An optimization technique called operator fusion enables intermediate activations between certain operators to be directly forwarded from tensor to vector core, or vice versa, without passing through the HBM. On the hardware, such an operator is executed on a fused core equipped with both a MAC unit and a vector lane. For the purposes of this disclosure, a tensor core operator followed by a vector core operator is considered a fused operator and can only be executed on a fused core. Tensor cores are numbered from 1 to numTC, and vector cores from numTC+1 to numTC+numVC. The first min(numTC, numVC) tensor cores are paired with the first min(numTC, numVC) vector cores. A fused operator, in the absence of intra-operator parallelism, runs on one such pair, i.e., on tensor core c and on vector core numTC+c, for some c ∈ {1, . . . , min(numTC, numVC)}.
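The core numbering and pairing convention just described can be summarized by the following small sketch; the function name is illustrative.

```python
# Tensor cores are numbered 1..numTC and vector cores numTC+1..numTC+numVC; a fused
# operator without intra-operator parallelism runs on the pair (c, numTC + c).
def fused_core_pairs(num_tc: int, num_vc: int):
    return [(c, num_tc + c) for c in range(1, min(num_tc, num_vc) + 1)]

print(fused_core_pairs(4, 2))   # [(1, 5), (2, 6)]
```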
A strict partial order ≺ is defined as the transitive closure of the DAG (V, E). If i ≺ j, operator i must finish before j can begin, and only incomparable operators can potentially execute simultaneously. The variables x_ij define another strict partial order, which lies between ≺ and the partial order given by the execution time intervals of operators in the found schedule. Specifically:
Due to the use of the x-variables, which encode a strict partial order lying between the input partial order and the partial order induced by the computed solution, the necessity of using time-indexed variables is circumvented. Therefore, the ILP is very tractable. An additional optimization may be employed that involves removing the z-variables and all constraints that utilize them. The proposed ILP has two sets of constraints:
However, it is often the case that this restriction is not necessary. To optimize the ILP based on this information, the z-variables and all constraints involving them are removed, and the ILP is solved. The schedule arising from the ILP solution (start executing every operator i at time ti) is generated. The number of vector cores and tensor cores in use at each point in the schedule is then checked to determine whether the schedule ever uses more of either type of core than is available. If the schedule does not require the use of more cores of either type than are available, the optimal solution has been found. Otherwise, it is necessary to add back the z-variables and constraints corresponding to the type (vector or tensor) of cores for which the resource constraint has been violated. The ILP is then solved again. Note that whenever this happens, it must be the case that there is significant branching in the model (i.e., more nodes executing in parallel than the number of tensor or vector cores). This implies that the number of cores is smaller than the size |V| of the operator graph of the layer (or layer slice). Therefore, the number of variables in the ILP is always at most O(|V|^2). The ILP runtimes are also only a small fraction of the entire end-to-end compute time.
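The relax-then-check procedure described above can be sketched as follows. This is a minimal illustration only: it uses the off-the-shelf PuLP solver, keeps just the precedence and makespan constraints of the relaxed problem (the x- and z-variables of the full formulation are omitted), and then counts core usage at operator start times to decide whether the resource constraints must be added back.

```python
import pulp

def solve_relaxed_schedule(durations, edges):
    """Schedule operators subject only to precedence constraints; return start times and makespan.

    durations: {op: latency}; edges: iterable of (i, j) meaning operator i precedes j.
    """
    prob = pulp.LpProblem("layer_schedule", pulp.LpMinimize)
    start = {i: pulp.LpVariable(f"t_{i}", lowBound=0) for i in durations}
    makespan = pulp.LpVariable("makespan", lowBound=0)
    prob += makespan                                        # objective: minimize latency
    for i, j in edges:                                      # precedence constraints
        prob += start[i] + durations[i] <= start[j]
    for i in durations:                                     # makespan definition
        prob += start[i] + durations[i] <= makespan
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {i: start[i].value() for i in durations}, makespan.value()

def core_usage_ok(starts, durations, core_of, num_tc, num_vc):
    """Check whether the relaxed schedule ever exceeds the available cores of either type."""
    limits = {"tensor": num_tc, "vector": num_vc}
    for i in starts:                                        # usage can only change at start times
        for kind, limit in limits.items():
            active = sum(1 for j in starts
                         if core_of[j] == kind
                         and starts[j] <= starts[i] < starts[j] + durations[j])
            if active > limit:
                return False                                # add back the z-variables and re-solve
    return True
```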
The output of the ILP for all layers and layer slice combinations is passed to the DP solver, which executes a DP algorithm to determine the best distribution strategy (optimal combination of data, pipeline, and tensor model parallelism) across the various devices. For a given accelerator architecture A, number of accelerators K, and a workload W, the objective of the DP solver is to form a high-throughput pipelined schedule for the workload. For this problem, the workload is a DAG with a layer granularity (i.e., V is the set of layers of the DNN). It is assumed that the batch (or mini-batch) size and the microbatch size are fixed. B is the number of microbatches in a batch. It is further assumed that the user specifies whether activation recomputation is used or not. If it is, then it is used throughout the entire workload, with a stage granularity (that is, only the input activations of a stage are stored, not those of each layer, and the forward-plus-backward pass of the entire stage is computed while materializing all the intermediate activations for a single microbatch during this time). In this algorithmic problem, the goal is to determine:
The DP solver builds the pipeline stage by stage, starting from the last stage and ending with the first. To do so, it works on downsets: downward-closed sets of layers. The dynamic programming table that is computed is dp′[D][k][s] := the minimum max-load of any accelerator when optimally partitioning downset D over s stages using k accelerators, with degree t of tensor model parallelism. Note that at this level only a single (data-parallel) pipeline is considered. This table is computed for all downsets D (except the entire set V) and all values of k, s, and t. The dynamic programming recursion is as follows:
where:
The final result is computed by optimizing over d, s, t, and the first stage (which in the dynamic program is the last stage to be formed). Namely, the final time per batch F is computed as:
Regarding the final time per batch:
The DP algorithm makes assumptions to solve the model partitioning problem. One assumption involves pipeline flushes. The pipelining scheme follows a flushing schedule similar to PipeDream-Flush/DeepSpeed 1F1B. This approach differs from previous methods utilizing dynamic programming, where non-flushing schedules like PipeDream-2BW were considered. In a non-flushing schedule, the time taken per microbatch equals the maximum load (single-microbatch latency) of any stage. As such, previous methods have concentrated on minimizing this max-load. However, a major difficulty regarding flushing schedules is that the flush time cannot be entirely disregarded. In the DP algorithm described herein, the flush time is taken into account by computing the time taken per batch as the max-load multiplied by a factor
Note that B/d is the per-pipeline batch size (number of microbatches per pipeline). This approximation is lossless if all stages have the same load.
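Since the recursion and the final-time expression are given in the accompanying figures, the following is only a rough sketch of a dynamic program of this general shape. It assumes the layer graph is linear (so downsets reduce to contiguous runs of layers), treats the load subroutine as a given black box with an assumed signature, and omits the outer optimization over the data-parallel degree, the tensor model parallel degree, and the flush factor discussed above.

```python
from functools import lru_cache

def min_max_load(num_layers, num_accel, load):
    """Minimum max-load over any partition of the layer chain into pipeline stages.

    `load(i, j, a, s)` is assumed to return the per-microbatch latency of running layers
    i..j on `a` accelerators as the s-th stage from the pipeline's end, or +inf if the
    schedule does not fit in accelerator memory.
    """
    INF = float("inf")

    @lru_cache(maxsize=None)
    def dp(i, k, s):
        # Best achievable max-load when layers i..num_layers-1 form the last s stages
        # of the pipeline and k accelerators remain available.
        if i == num_layers:
            return 0.0 if s == 0 else INF
        if s == 0 or k <= 0:
            return INF
        best = INF
        for j in range(i, num_layers):          # layers i..j form the s-th stage from the end
            for a in range(1, k + 1):           # accelerators assigned to that stage
                best = min(best, max(load(i, j, a, s), dp(j + 1, k - a, s - 1)))
        return best

    return min(dp(0, num_accel, s) for s in range(1, num_accel + 1))
```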
Another assumption involves gradient synchronization communication costs. AllReduce operation communication, which synchronizes gradients between data-parallel replicas, takes place during the flush period. Throughout this time, all stages, except for the first, remain idle while the backward pass propagates. It is assumed that this period is sufficient for the communication to successfully complete. However, for the first stage, synchronization does cause a slowdown, which is accounted for by adding
to the execution time of the batch. The bandwidth here is the same as the one used to estimate Allreduce operator cost for tensor model parallelism.
It is important to note that, since the load subroutine will be invoked for all feasible settings of S, t, a, s, it must be highly efficient. It is designed to compute the maximum load of any of the a accelerators, and it also provides information about every layer that needs to be relayed downstream (such as latency, memory usage, and so on). The second function of the load subroutine is to calculate memory usage. If the memory limit (as defined by accelerator HBM size) is surpassed, load should return +∞. Ideally, for every s, the schedule with the least latency is determined such that no accelerator exceeds the memory limit. However, this complex problem does not need to be solved. Instead, the algorithm attempts to identify the overall lowest-latency schedule, and calculate its memory usage.
Because, in all of the test workloads, the layer graph (i.e., the operator graph of the full DNN, where operators belonging to each layer or layer slice have been contracted into a single node) is linear (i.e., contains a Hamiltonian path, which is consistent with the layer structure of large language models), the load only needs to be computed for these layers, as there are no optimization decisions to be made. For simplicity, the forward pass is focused on here; the computation for the backward pass (with or without forward pass recomputation) is analogous.
A topological ordering of S (recall that S is a contiguous subgraph of layers) is fixed. Let S = {L1, . . . , Ll} (in that topological order). Each layer comes with a single, optimized schedule to execute it, which is computed as schedule(Li). An "object-oriented" notation is used to access the quantities related to this schedule, such as schedule(Li).latency_fw. Note that quantities related to Li, such as Li.weights_size, do not depend on the schedule.
The layers are scheduled one by one. Each layer begins at the earliest time at which all of its predecessors have been completed and all incoming activations have been transferred over the edges of the graph from other devices. The transfer costs on those edges that come from outside of the stage are considered here. It is assumed that the network has a flat structure, where transmitting X bytes from any accelerator to any other accelerator takes time X/bandwidth. As a result, the finishing time of the layer is its starting time plus schedule(Li).latency_fw.
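The forward-latency portion of this computation might be sketched as follows; the layer objects, their fields, and the byte-count lookup are illustrative stand-ins for the annotated quantities described above, and the flat-network assumption (X bytes take X/bandwidth) is applied to edges entering the stage.

```python
# Compute the forward latency of one stage: each layer starts once its in-stage
# predecessors finish and its off-stage input activations have been transferred.
def stage_forward_latency(layers_in_topo_order, incoming_activation_bytes, bandwidth):
    finish = {}
    for layer in layers_in_topo_order:
        ready = 0.0
        for pred in layer.predecessors:
            if pred in finish:                                   # predecessor inside the stage
                ready = max(ready, finish[pred])
            else:                                                # transferred from another device
                ready = max(ready, incoming_activation_bytes[(pred, layer)] / bandwidth)
        finish[layer] = ready + layer.schedule.latency_fw        # cf. schedule(Li).latency_fw
    return max(finish.values()) if finish else 0.0
```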
If activation recomputation is employed, it is performed at a stage level, meaning that all intermediate activations within the stage are materialized during the forward pass recomputation. This implies that peak memory usage is reached at the end of that recomputation, specifically when the pipeline is in a steady state and the stage has stored data for the full number of in-flight microbatches. At this point, the accelerator memory (HBM) contains the following:
The following property of the implemented pipelining scheme determines the memory usage: in the steady state of the pipeline, for a stage that is the s-th stage from the pipeline's end, it is necessary to stash data for at most s−1 in-flight microbatches. These are the microbatches for which the forward pass has already been computed on this stage, but the backward pass has not yet been processed. The PipeDream-Flush scheme schedules computations lazily and therefore satisfies this property. Here, the term "data" varies based on whether activation recomputation is being used: if it is not being used, "data" refers to all forward activations, and if it is being used, "data" refers to the input activations of the stage. For GPipe schedules, s−1 should be replaced with the total number of microbatches per pipeline, i.e., B/d.
Therefore, the memory usage can be modeled as:
where, if activation recomputation is used,
and
where δ−(S) denotes the set of incoming edges of S, and otherwise
Note that the computation costs and communication costs do not depend on s (except when s=1, as activation recomputation is not required for the last stage). Moreover, the memory usage depends on s only in an affine way. Namely, when s increases by one, the peak memory usage rises precisely by the amount of stashed_data. This allows the runtime of the load computation to be optimized by reusing the results across all s values. Indeed, rather than explicitly computing the quantity load_t(S, a, s) for all t, S, a, s, a pair load_t(S, a) may be returned that comprises: (1) the usual output of load (maximum latency over the a accelerators), and (2) the maximum s for which the found schedule fits in the memory of every accelerator.
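A rough sketch of this affine memory model and of the "maximum s that fits" computation follows; the field names are illustrative, and contributions such as gradients and optimizer state (part of the HBM contents listed above) are folded into a single static term for brevity.

```python
# Peak HBM usage grows affinely in s: a static term plus (s - 1) times the stashed data
# per in-flight microbatch. With activation recomputation, only the stage's input
# activations are stashed, but all intermediate activations of one microbatch are live
# at the end of the recomputation.
def peak_memory(static_bytes, stage_activation_bytes, stage_input_bytes, s, recompute):
    if recompute:
        stashed = stage_input_bytes
        transient = stage_activation_bytes
    else:
        stashed = stage_activation_bytes
        transient = 0
    return static_bytes + (s - 1) * stashed + transient

def max_stages_that_fit(static_bytes, stage_activation_bytes, stage_input_bytes, recompute, hbm_bytes):
    """Largest s for which the stage fits in HBM, exploiting the affine dependence on s."""
    base = peak_memory(static_bytes, stage_activation_bytes, stage_input_bytes, 1, recompute)
    per_stage = peak_memory(static_bytes, stage_activation_bytes, stage_input_bytes, 2, recompute) - base
    if base > hbm_bytes:
        return 0
    if per_stage <= 0:
        return float("inf")       # memory does not grow with s in this sketch
    return 1 + int((hbm_bytes - base) // per_stage)
```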
In designing the DNN training optimization system, the following assumptions were made:
The example software architecture 1002 may be conceptualized as layers, each providing various functionality. For example, the software architecture 1002 may include layers and components such as an operating system (OS) 1014, libraries 1016, frameworks 1018, applications 1020, and a presentation layer 1044. Operationally, the applications 1020 and/or other components within the layers may invoke API calls 1024 to other layers and receive corresponding results 1026. The layers illustrated are representative in nature and other software architectures may include additional or different layers. For example, some mobile or special purpose operating systems may not provide the frameworks/middleware 1018.
The OS 1014 may manage hardware resources and provide common services. The OS 1014 may include, for example, a kernel 1028, services 1030, and drivers 1032. The kernel 1028 may act as an abstraction layer between the hardware layer 1004 and other software layers. For example, the kernel 1028 may be responsible for memory management, processor management (for example, scheduling), component management, networking, security settings, and so on. The services 1030 may provide other common services for the other software layers. The drivers 1032 may be responsible for controlling or interfacing with the underlying hardware layer 1004. For instance, the drivers 1032 may include display drivers, camera drivers, memory/storage drivers, peripheral device drivers (for example, via Universal Serial Bus (USB)), network and/or wireless communication drivers, audio drivers, and so forth depending on the hardware and/or software configuration.
The libraries 1016 may provide a common infrastructure that may be used by the applications 1020 and/or other components and/or layers. The libraries 1016 typically provide functionality for use by other software modules to perform tasks, rather than interacting directly with the OS 1014. The libraries 1016 may include system libraries 1034 (for example, C standard library) that may provide functions such as memory allocation, string manipulation, and file operations. In addition, the libraries 1016 may include API libraries 1036 such as media libraries (for example, supporting presentation and manipulation of image, sound, and/or video data formats), graphics libraries (for example, an OpenGL library for rendering 2D and 3D graphics on a display), database libraries (for example, SQLite or other relational database functions), and web libraries (for example, WebKit that may provide web browsing functionality). The libraries 1016 may also include a wide variety of other libraries 1038 to provide many functions for applications 1020 and other software modules.
The frameworks 1018 (also sometimes referred to as middleware) provide a higher-level common infrastructure that may be used by the applications 1020 and/or other software modules. For example, the frameworks 1018 may provide various graphic user interface (GUI) functions, high-level resource management, or high-level location services. The frameworks 1018 may provide a broad spectrum of other APIs for applications 1020 and/or other software modules.
The applications 1020 include built-in applications 1040 and/or third-party applications 1042. Examples of built-in applications 1040 may include, but are not limited to, a contacts application, a browser application, a location application, a media application, a messaging application, and/or a game application. Third-party applications 1042 may include any applications developed by an entity other than the vendor of the particular platform. The applications 1020 may use functions available via OS 1014, libraries 1016, frameworks 1018, and presentation layer 1044 to create user interfaces to interact with users.
Some software architectures use virtual machines, as illustrated by a virtual machine 1048. The virtual machine 1048 provides an execution environment where applications/modules can execute as if they were executing on a hardware machine (such as the machine 1100 of
The machine 1100 may include processors 1110, memory 1130, and I/O components 1150, which may be communicatively coupled via, for example, a bus 1102. The bus 1102 may include multiple buses coupling various elements of machine 1100 via various bus technologies and protocols. In an example, the processors 1110 (including, for example, a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an ASIC, or a suitable combination thereof) may include one or more processors 1112a to 1112n that may execute the instructions 1116 and process data. In some examples, one or more processors 1110 may execute instructions provided or identified by one or more other processors 1110. The term "processor" includes a multi-core processor including cores that may execute instructions contemporaneously.
The memory/storage 1130 may include a main memory 1132, a static memory 1134, or other memory, and a storage unit 1136, each accessible to the processors 1110 such as via the bus 1102. The storage unit 1136 and memory 1132, 1134 store instructions 1116 embodying any one or more of the functions described herein. The memory/storage 1130 may also store temporary, intermediate, and/or long-term data for processors 1110. The instructions 1116 may also reside, completely or partially, within the memory 1132, 1134, within the storage unit 1136, within at least one of the processors 1110 (for example, within a command buffer or cache memory), within memory of at least one of the I/O components 1150, or any suitable combination thereof, during execution thereof. Accordingly, the memory 1132, 1134, the storage unit 1136, memory in processors 1110, and memory in I/O components 1150 are examples of machine-readable media.
As used herein, "machine-readable medium" refers to a device able to temporarily or permanently store instructions and data that cause machine 1100 to operate in a specific fashion, and may include, but is not limited to, random-access memory (RAM), read-only memory (ROM), buffer memory, flash memory, optical storage media, magnetic storage media and devices, cache memory, network-accessible or cloud storage, other types of storage and/or any suitable combination thereof. The term "machine-readable medium" applies to a single medium, or combination of multiple media, used to store instructions (for example, instructions 1116) for execution by a machine 1100 such that the instructions, when executed by one or more processors 1110 of the machine 1100, cause the machine 1100 to perform any one or more of the features described herein. Accordingly, a "machine-readable medium" may refer to a single storage device, as well as "cloud-based" storage systems or storage networks that include multiple storage apparatus or devices. The term "machine-readable medium" excludes signals per se.
The I/O components 1150 may include a wide variety of hardware components adapted to receive input, provide output, produce output, transmit information, exchange information, capture measurements, and so on. The specific I/O components 1150 included in a particular machine will depend on the type and/or function of the machine. For example, mobile devices such as mobile phones may include a touch input device, whereas a headless server or IoT device may not include such a touch input device. The particular examples of I/O components illustrated in
In some examples, the I/O components 1150 may include biometric components 1156, motion components 1158, environmental components 1160, and/or position components 1162, among a wide array of other physical sensor components. The biometric components 1156 may include, for example, components to detect body expressions (for example, facial expressions, vocal expressions, hand or body gestures, or eye tracking), measure biosignals (for example, heart rate or brain waves), and identify a person (for example, via voice-, retina-, fingerprint-, and/or facial-based identification). The motion components 1158 may include, for example, acceleration sensors (for example, an accelerometer) and rotation sensors (for example, a gyroscope). The environmental components 1160 may include, for example, illumination sensors, temperature sensors, humidity sensors, pressure sensors (for example, a barometer), acoustic sensors (for example, a microphone used to detect ambient noise), proximity sensors (for example, infrared sensing of nearby objects), and/or other components that may provide indications, measurements, or signals corresponding to a surrounding physical environment. The position components 1162 may include, for example, location sensors (for example, a Global Position System (GPS) receiver), altitude sensors (for example, an air pressure sensor from which altitude may be derived), and/or orientation sensors (for example, magnetometers).
The I/O components 1150 may include communication components 1164, implementing a wide variety of technologies operable to couple the machine 1100 to network(s) 1170 and/or device(s) 1180 via respective communicative couplings 1172 and 1182. The communication components 1164 may include one or more network interface components or other suitable devices to interface with the network(s) 1170. The communication components 1164 may include, for example, components adapted to provide wired communication, wireless communication, cellular communication, Near Field Communication (NFC), Bluetooth communication, Wi-Fi, and/or communication via other modalities. The device(s) 1180 may include other machines or various peripheral devices (for example, coupled via USB).
In some examples, the communication components 1164 may detect identifiers or include components adapted to detect identifiers. For example, the communication components 1164 may include Radio Frequency Identification (RFID) tag readers, NFC detectors, optical sensors (for example, one- or multi-dimensional bar codes, or other optical codes), and/or acoustic detectors (for example, microphones to identify tagged audio signals). In some examples, location information may be determined based on information from the communication components 1164, such as, but not limited to, geo-location via Internet Protocol (IP) address, location via Wi-Fi, cellular, NFC, Bluetooth, or other wireless station identification and/or signal triangulation.
In the following, further features, characteristics and advantages of the invention will be described by means of items:
While various embodiments have been described, the description is intended to be exemplary, rather than limiting, and it is understood that many more embodiments and implementations are possible that are within the scope of the embodiments. Although many possible combinations of features are shown in the accompanying figures and discussed in this detailed description, many other combinations of the disclosed features are possible. Any feature of any embodiment may be used in combination with or substituted for any other feature or element in any other embodiment unless specifically restricted. Therefore, it will be understood that any of the features shown and/or discussed in the present disclosure may be implemented together in any suitable combination. Accordingly, the embodiments are not to be restricted except in light of the attached claims and their equivalents. Also, various modifications and changes may be made within the scope of the attached claims.
While the foregoing has described what are considered to be the best mode and/or other examples, it is understood that various modifications may be made therein and that the subject matter disclosed herein may be implemented in various forms and examples, and that the teachings may be applied in numerous applications, only some of which have been described herein. It is intended by the following claims to claim any and all applications, modifications and variations that fall within the true scope of the present teachings.
Unless otherwise stated, all measurements, values, ratings, positions, magnitudes, sizes, and other specifications that are set forth in this specification, including in the claims that follow, are approximate, not exact. They are intended to have a reasonable range that is consistent with the functions to which they relate and with what is customary in the art to which they pertain.
The scope of protection is limited solely by the claims that now follow. That scope is intended and should be interpreted to be as broad as is consistent with the ordinary meaning of the language that is used in the claims when interpreted in light of this specification and the prosecution history that follows and to encompass all structural and functional equivalents. Notwithstanding, none of the claims are intended to embrace subject matter that fails to satisfy the requirement of Sections 101, 102, or 103 of the Patent Act, nor should they be interpreted in such a way. Any unintended embracement of such subject matter is hereby disclaimed.
Except as stated immediately above, nothing that has been stated or illustrated is intended or should be interpreted to cause a dedication of any component, step, feature, object, benefit, advantage, or equivalent to the public, regardless of whether it is or is not recited in the claims.
It will be understood that the terms and expressions used herein have the ordinary meaning as is accorded to such terms and expressions with respect to their corresponding respective areas of inquiry and study except where specific meanings have otherwise been set forth herein. Relational terms such as first and second and the like may be used solely to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by "a" or "an" does not, without further constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element. Furthermore, subsequent limitations referring back to "said element" or "the element" performing certain functions signifies that "said element" or "the element" alone or in combination with additional identical elements in the process, method, article or apparatus are capable of performing all of the recited functions.
The Abstract of the Disclosure is provided to allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, it can be seen that various features are grouped together in various examples for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claims require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed example. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separately claimed subject matter.