The present disclosure relates generally to systems and methods for computer learning that can provide improved computer performance, features, and uses. More particularly, the present disclosure relates to systems and methods for reducing training times of deep neural networks (DNNs) through efficient hybrid parallelism techniques.
DNNs have achieved great successes in many domains, such as computer vision, natural language processing, recommender systems, etc. Training a DNN requires substantial computational and memory resources. It has become standard practice to parallelize training on multiple devices to reduce training times. There are several possible ways to parallelize different layers in a DNN. Exhaustively searching this space to find an optimal parallelization strategy is prohibitively time consuming and impractical. The standard practice is to use data parallelism because of its simplicity. However, data parallelism is often sub-optimal and suffers from poor performance and high memory requirements. Expert-designed strategies have been proposed on a case-by-case basis using domain-specific knowledge. These expert-designed strategies do not generalize well to DNNs other than those they were designed for.
Accordingly, it is desirable to provide more efficient systems and methods that increase hardware utilization and reduce training times of deep neural networks.
References will be made to embodiments of the disclosure, examples of which may be illustrated in the accompanying figures. These figures are intended to be illustrative, not limiting. Although the disclosure is generally described in the context of these embodiments, it should be understood that it is not intended to limit the scope of the disclosure to these particular embodiments. Items in the figures may not be to scale.
Figure (“FIG.”) 1 depicts an exemplary iteration space of a general matrix multiply (GEMM) computation that is parallelized using the parallelization configuration (1, 4, 2).
In the following description, for purposes of explanation, specific details are set forth in order to provide an understanding of the disclosure. It will be apparent, however, to one skilled in the art that the disclosure can be practiced without these details. Furthermore, one skilled in the art will recognize that embodiments of the present disclosure, described below, may be implemented in a variety of ways, such as a process, an apparatus, a system, a device, or a method on a tangible computer-readable medium.
Components, or modules, shown in diagrams are illustrative of exemplary embodiments of the disclosure and are meant to avoid obscuring the disclosure. It shall also be understood that throughout this discussion components may be described as separate functional units, which may comprise sub-units, but those skilled in the art will recognize that various components, or portions thereof, may be divided into separate components or may be integrated together, including being integrated within a single system or component. It should be noted that functions or operations discussed herein may be implemented as components. Components may be implemented in software, hardware, or a combination thereof.
Furthermore, connections between components or systems within the figures are not intended to be limited to direct connections. Rather, data between these components may be modified, re-formatted, or otherwise changed by intermediary components. Also, additional or fewer connections may be used. It shall also be noted that the terms “coupled,” “connected,” or “communicatively coupled” shall be understood to include direct connections, indirect connections through one or more intermediary devices, and wireless connections.
Reference in the specification to “one embodiment,” “preferred embodiment,” “an embodiment,” or “embodiments” means that a particular feature, structure, characteristic, or function described in connection with the embodiment is included in at least one embodiment of the disclosure and may be in more than one embodiment. Also, the appearances of the above-noted phrases in various places in the specification are not necessarily all referring to the same embodiment or embodiments.
The use of certain terms in various places in the specification is for illustration and should not be construed as limiting. A service, function, or resource is not limited to a single service, function, or resource; usage of these terms may refer to a grouping of related services, functions, or resources, which may be distributed or aggregated.
The terms “include,” “including,” “comprise,” and “comprising” shall be understood to be open terms and any lists that follow are examples and not meant to be limited to the listed items. A “layer” may comprise one or more operations. The words “optimal,” “optimize,” “optimization,” “best,” and the like refer to an improvement of an outcome or a process and do not require that the specified outcome or process has achieved an “optimal” or peak state.
Any headings used herein are for organizational purposes only and shall not be used to limit the scope of the description or the claims. Each reference/document mentioned in this patent document is incorporated by reference herein in its entirety.
Furthermore, one skilled in the art shall recognize that: (1) certain steps may optionally be performed; (2) steps may not be limited to the specific order set forth herein; (3) certain steps may be performed in different orders; and (4) certain steps may be done concurrently.
It shall be noted that any experiments and results provided herein are provided by way of illustration and were performed under specific conditions using a specific embodiment or embodiments; accordingly, neither these experiments nor their results shall be used to limit the scope of the disclosure of the current patent document.
Deep neural networks are becoming increasingly sophisticated and use ever-larger datasets to improve accuracy. This has led to an increase in the computational and memory requirements to train DNNs. It typically takes from several hours to days and multiple GPUs to train a DNN. For instance, Google's neural machine translation (GNMT) model takes around six days to train on the Dataset3 English-to-French (ENFR) dataset when using 96 NVIDIA K80 GPUs.
Training a DNN typically involves three phases: forward propagation, backward propagation (or backprop), and an update phase. First, the input dataset is split into multiple mini-batches. During each step, a mini-batch is passed through the layers of the network in the forward propagation phase. At the end of the forward phase, the output is compared against the ground truth, and a loss is computed using an appropriate loss function. To minimize the loss, the gradients of the model parameters are computed during backward propagation. Finally, the model parameters are updated using the gradients. This process is repeated over several passes through the dataset, called epochs, until the required accuracy is achieved.
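By way of illustration, the three phases may be sketched as a minimal training loop. The sketch below assumes a PyTorch-style model, loss function, and optimizer; it is illustrative only and is not part of the disclosed embodiments.

```python
# Minimal sketch of one training epoch: forward propagation, backward
# propagation, and the update phase, assuming a PyTorch-style API.
import torch

def train_one_epoch(model, data_loader, loss_fn, optimizer):
    for inputs, targets in data_loader:   # one mini-batch per step
        outputs = model(inputs)           # forward propagation
        loss = loss_fn(outputs, targets)  # compare output against ground truth
        optimizer.zero_grad()
        loss.backward()                   # backward propagation: compute gradients
        optimizer.step()                  # update phase: apply gradients to parameters
```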
DNN parallelization strategies can be broadly classified into three categories, namely, data parallelism, model parallelism, and pipeline parallelism. A strategy that combines these approaches to parallelize each layer differently is often referred to as hybrid parallelism. As described below, each parallelization strategy has its own advantages and disadvantages.
In data parallelism, each of p devices keeps a replica of the entire DNN, and each mini-batch is split into p shards and is distributed to different devices. Each device performs forward and backward propagation independently on its shard of data. During the update phase, gradients from all the devices are accumulated, typically through an all-reduce operation, before local copies of model parameters are updated. On a model with a large number of model parameters, this becomes a major bottleneck. Further, as the model parameters are replicated (instead of being split and distributed), it might be impossible to train large models by just using data parallelism, due to memory constraints. In addition, data parallelism is inefficient at small mini-batch sizes. Unfortunately, using a larger mini-batch size may not always be possible due to poor convergence and poor accuracy. Despite these drawbacks, data parallelism remains popular due to its simplicity and the ability to apply data parallelism on an entire network automatically. Data parallelism can also be viewed as dividing the work along the mini-batch dimension.
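As a hedged illustration of the update phase under data parallelism, the sketch below sums the local gradients across devices with an all-reduce before the local parameter update. It assumes that torch.distributed has been initialized (e.g., via init_process_group) and that each process already holds its shard of the mini-batch; it is not the specific implementation of any embodiment.

```python
# Sketch of a data-parallel training step: each of p processes computes
# gradients on its own shard of the mini-batch; gradients are then summed
# across processes with an all-reduce and averaged before the update.
import torch
import torch.distributed as dist

def data_parallel_step(model, inputs, targets, loss_fn, optimizer):
    loss = loss_fn(model(inputs), targets)  # forward on this device's shard
    optimizer.zero_grad()
    loss.backward()                          # local gradients
    world_size = dist.get_world_size()
    for param in model.parameters():
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # accumulate gradients
        param.grad /= world_size                           # average them
    optimizer.step()                         # every replica applies the same update
```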
An alternative strategy divides the work along model dimensions (e.g., channel dimension, filter dimension, etc.). This is the approach taken in model parallelism, in which model parameters are distributed among different devices and each device calculates only a part of a layer's activations (and gradients) during forward (and backward) propagation. This conserves memory, but it incurs additional communication (typically an all-to-all communication) to accumulate the activations (and gradients) during forward (and backward) propagation. Depending on the mini-batch and model parameter sizes, one parallelization strategy may be more efficient than the other.
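For comparison, a hedged sketch of model parallelism for a fully-connected layer is shown below: the weight matrix is split along its output dimension across the devices, and an all-gather accumulates the partial activations. The class name and the placement of the collective are illustrative assumptions, and gradient flow through the collective is omitted for brevity.

```python
# Sketch of model parallelism for a fully-connected layer: each device holds
# a column shard of the weight matrix, computes a slice of the activations,
# and an all-gather assembles the full output on every device.
import torch
import torch.distributed as dist

class ColumnShardedLinear(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        p = dist.get_world_size()
        assert out_features % p == 0
        self.local = torch.nn.Linear(in_features, out_features // p)  # local weight shard

    def forward(self, x):
        local_out = self.local(x)  # partial activations for this device's shard
        shards = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(shards, local_out)  # collect slices (gradients not propagated here)
        return torch.cat(shards, dim=-1)    # full activation available on every device
```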
A third approach, pipeline parallelism, involves placing different layers of a network on different devices without splitting the input data and model parameters along any dimension. This allows the computation of layers that do not have data dependencies to overlap. Each device computes the activations (and gradients) for the layers it owns, and sends the results to the devices that own successive layers. This strategy has the advantage of not needing to collectively communicate the model parameters; however, it requires sufficient interlayer parallelism, and the data needs to arrive at a specific rate through the pipeline for this strategy to be efficient.
A hybrid parallelism combines some or all of the three strategies to parallelize different layers differently by using a combination of strategies (e.g., data+model parallelism). As detailed in Section B below, there are several possibilities to choose how different layers should be parallelized. Hence, it is impractical to exhaustively search for an optimal strategy for hybrid parallelism. Based on domain specific knowledge, expert-designed strategies have been proposed on a case-by-case basis for different DNNs. There also have been efforts to automatically find good strategies. These approaches either (i) apply different heuristics to find a greedy solution, or (ii) find an optimal solution restricted to a certain class of DNNs (e.g., convolution networks), or (iii) reduce the search space by restricting some choices to find an optimal strategy within the reduced search space.
In this patent document, hybrid parallelism strategies are used. In one or more embodiments, interlayer pipeline parallelism may be ignored, and a combination of model and data parallelism may be used to find the best strategy for parallelizing different layers of a DNN. Various embodiments comprise a formulation and a node (or vertex) ordering technique to efficiently compute the parallelization strategies corresponding to minimum training costs of DNNs. An efficient process may use the formulation to compute the best strategies for various DNNs.
Experimental results demonstrate that, in one or more embodiments, ignoring interlayer pipeline parallelism does not extensively prune the optimal strategies from the search space. Strategies suggested by a novel process are evaluated against a baseline data-parallel strategy, expert-designed strategies, and strategies proposed by a state-of-the-art approach, Framework1, discussed in Experimental Results section D below. Results show that, in most cases, the process finds efficient strategies for various DNNs within a few seconds. The presented strategies outperform data parallelism by up to 1.85 times on a multi-node/multi-GPU system consisting of 1080Ti GPUs, and by up to four times on a system consisting of 2080Ti GPUs for various benchmarks. The presented strategies also perform better than the expert-designed strategies and the strategies suggested by Framework1.
A DNN may be represented as a computation graph G=(V, E) that is a weakly connected directed graph, where each node v∈V corresponds to a layer (e.g., a fully-connected layer, a convolution layer, etc.) in the DNN, and each edge (u, v)∈E represents the flow of a tensor that is an output of u and an input of v. Each node v∈V has an associated iteration space that captures the computation of v. Consider, for instance, a fully-connected layer that multiplies a matrix A_{M×K} with a matrix B_{K×N}. Its iteration space is specified by the set {(i, j, k) ∈ ℤ³ | 0≤i<M ∧ 0≤j<N ∧ 0≤k<K}.
A parallelization configuration C_v of a node v is a d-tuple of positive integers that defines how the iteration space of v is split and parallelized across different devices, where d is the dimension of the iteration space of v. The set of valid configurations of v on p devices is 𝒞(v, p) = {(c_1, …, c_d) ∈ ℤ_{>0}^d | Π_{i=1}^{d} c_i ≤ p}. For notational simplicity, when p is clear from the context, 𝒞(v, p) may be written as 𝒞(v). Alternatively, a layer v that is parallelized using a configuration C_v may be viewed as an iteration space tiling of the computation of v.
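As a concrete, hedged illustration of the configuration set 𝒞(v, p), the sketch below enumerates all d-tuples of positive integers whose product does not exceed p, and checks that the configuration (1, 4, 2) of FIG. 1 is valid for a three-dimensional GEMM iteration space on eight devices. The helper name is illustrative.

```python
# Enumerate the valid parallelization configurations of a d-dimensional
# iteration space on p devices: d-tuples of positive integers whose product
# does not exceed p.
from itertools import product

def valid_configs(d, p):
    configs = []
    for tup in product(range(1, p + 1), repeat=d):
        prod = 1
        for c in tup:
            prod *= c
        if prod <= p:
            configs.append(tup)
    return configs

# Example: a GEMM iteration space {(i, j, k) | 0<=i<M, 0<=j<N, 0<=k<K} on
# p = 8 devices. The configuration (1, 4, 2) keeps the i-dimension whole,
# splits the j-dimension four ways, and splits the k-dimension two ways.
configs = valid_configs(d=3, p=8)
assert (1, 4, 2) in configs
print(len(configs), "valid configurations for a 3-D iteration space on 8 devices")
```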
A parallelization strategy ϕ is the set {(v, C_v) | v∈V ∧ C_v∈𝒞(v)} that specifies a valid configuration for each node v∈V. The configuration for a node v in strategy ϕ is given by C_v=ϕ(v). A substrategy ϕ|_U is a strategy ϕ restricted to the subset U, i.e., ϕ|_U = {(u, C_u)∈ϕ | u∈U}. An optimal strategy ϕ̂ is a parallelization strategy that has the minimum cost over all possible strategies for V under a given cost function 𝒯, i.e., ϕ̂ = arg min_{ϕ∈Φ} 𝒯(G, ϕ), where Φ is the set of all valid strategies for V in which each strategy ϕ∈Φ is a unique combination of valid configurations of V.
In one or more embodiments, given a processing environment with p devices with an average peak floating-point performance of F FLOPS per device, and an average communication bandwidth of B bytes per second per link, the cost function may be expressed as:

𝒯(G, ϕ) = Σ_{v∈V} t_l(v, ϕ(v), r) + r · Σ_{(u,v)∈E} t_x(u, v, ϕ(u), ϕ(v))      (1)

where r=F/B is the FLOP-to-bytes ratio; the layer cost t_l is the cost (in FLOP) of computing a layer, e.g., a fully-connected layer, and may comprise both the computation and any communication that may occur internally within a layer, such as an all-reduce operation that is normalized to FLOP (e.g., by multiplying it with r); and the data transfer cost t_x is the communication cost in bytes needed to communicate the tensor that flows along the edge (u, v) or (v, u) during forward and/or backward propagation. It is noted that t_x is edge-direction agnostic, i.e., for an edge (u, v)∈E, t_x(u, v, ϕ(u), ϕ(v)) = t_x(v, u, ϕ(v), ϕ(u)).
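A small helper of the kind below may be used to evaluate such a cost for a given strategy; layer_cost and transfer_cost are assumed user-supplied callables, and the aggregation simply mirrors the sum in Equation (1).

```python
# Sketch of evaluating the cost of a strategy phi on a computation graph:
# the per-layer costs (in FLOP) plus the per-edge data-transfer costs
# (in bytes) normalized to FLOP by the FLOP-to-bytes ratio r = F / B.
def strategy_cost(nodes, edges, phi, layer_cost, transfer_cost, r):
    compute = sum(layer_cost(v, phi[v], r) for v in nodes)
    comm = sum(transfer_cost(u, v, phi[u], phi[v]) for (u, v) in edges)
    return compute + r * comm
```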
In one or more embodiments, the cost function 𝒯 may be an approximation of the actual cost and may ignore overlapping (or pipelining) of different layers by adding the costs t_l(v_x, ⋅, ⋅) and t_l(v_y, ⋅, ⋅) of any two layers, e.g., instead of taking a max where possible. As previously mentioned, this approach accurately captures data and model parallelism and ignores pipeline parallelism. In one or more embodiments, while pipeline parallelism between layers may be ignored, pipeline parallelism opportunities within a layer may be accurately captured, e.g., by accounting for intralayer pipeline parallelism in the layer cost t_l. As discussed in greater detail with reference to Section C, this approximation makes it possible to devise a technique that efficiently and quickly finds the best strategy for DNNs. As experimental results demonstrate, this approach is very effective despite the simplification, as most DNNs do not contain significant inherent pipeline parallelism opportunities. In contrast, some existing approaches use pipeline parallelism to improve parallel training throughput by making semantic modifications to the model, including using older weights. However, such semantic modifications lead to variations in model accuracy when compared to the original model. In addition, they may also require more epochs to converge, thus eliminating any advantages obtained from pipeline parallelism. For comparison, various embodiments herein need not perform any semantic modifications to the model. As a result, the convergence rate and the final accuracy may be exactly the same as for the original model, advantageously increasing hardware utilization through better parallelism. It is noted that even if various embodiments find the optimal solution ϕ̂ = arg min_{ϕ∈Φ} 𝒯(G, ϕ), since the cost function 𝒯 itself represents an approximation, rather than referring to the solution as the optimal strategy, a solution may be referred to herein as an efficient strategy or the best strategy to avoid confusion.
In embodiments, t_l and t_x may be computed analytically by using simple closed-form expressions in experiments. For an edge (u, v), t_x may be computed as max_d (|A(v, d)| + |A(u, d)| − |A(v, d) ∩ A(u, d)|), where A(v, d) and A(u, d) denote the portions of the tensor flowing along the edge that are needed by a device d when computing v and u, respectively. For the few different types of DNN layers, analytically derived expressions may be used to compute the layer costs t_l. In one or more embodiments, many low-level details, such as cache effects, etc., may be ignored. In addition, using r=F/B to normalize the costs implicitly assumes that the computations achieve close to machine peak performance and that the communication bandwidth is fully utilized. It is noted that such assumptions are not necessary for the presented systems and methods to properly operate, but they keep the cost computation simple. Further, some embodiments focus on the relative ordering of the costs of various strategies rather than on absolute costs to ascertain the best strategy. These simplifying assumptions affect the costs of all examined strategies more or less alike, preserving most of the relative ordering. In addition, as most DNN computations are composed of dense matrix operations, experiments show that when standard libraries, such as cuBLAS, cuDNN, NCCL, etc., are used for the computations, these assumptions are not heavily violated.
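One way to realize the per-edge transfer cost expression above is sketched below; the footprint helper, which returns the set of tensor elements a device touches for a given layer and configuration, is a hypothetical function introduced for illustration.

```python
# Sketch of the analytic data-transfer cost of an edge (u, v): for each
# device d, evaluate |A(u, d)| + |A(v, d)| - |A(u, d) ∩ A(v, d)| and take
# the maximum over devices (element counts; scale by element size for bytes).
def transfer_cost(u, v, cfg_u, cfg_v, footprint, num_devices):
    worst = 0
    for d in range(num_devices):
        a_u = footprint(u, cfg_u, d)  # elements of the tensor held by d for u
        a_v = footprint(v, cfg_v, d)  # elements of the tensor needed by d for v
        worst = max(worst, len(a_u) + len(a_v) - len(a_u & a_v))
    return worst
```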
As mentioned in Section B, each network layer may have a set of valid parallelization configurations. In one or more embodiments, finding an efficient strategy for DNNs comprises choosing the best configuration(s) for each layer.
Notations: Let G=(V, E) be the computation graph of a DNN. For a vertex v∈V, N(G, v) denotes its neighbors, i.e., N(G, v) = {u∈V | (u, v)∈E ∨ (v, u)∈E}. The notation N(v) refers to the neighbors of v when G is clear from the context, and N(U) = ∪_{u∈U} N(u). For any vertex-set X, its restricted neighbors set R(X, Y) = N(X) ∩ Y may be defined as the set of neighbors of vertices in X restricted to the set Y. For any vertex-set U and a vertex u∈U, its restricted reachable set P(U, u) may be defined as the set of vertices v that are reachable from u using an undirected path (v_1, …, v_p) of length p>1 s.t. ∀i∈[1, p], v_i∈U, i.e., v∈P(U, u) means that a restricted path exists between u and v that goes through only the vertices in U. Let 𝒮 = (v(1), …, v(|V|)) be a sequence of the vertices in V that is arbitrarily ordered; 𝒮≤i and 𝒮≥i denote the sets {v(1), …, v(i)} and {v(i), …, v(|V|)}, respectively, and 𝒮<i and 𝒮>i denote the sets {v(1), …, v(i−1)} and {v(i+1), …, v(|V|)}, respectively. For reference, Table I summarizes these notations and other notations defined below.
TABLE I
Notation | Definition
𝒞(v) | Set of valid parallelization configurations of node v.
𝒮<i | {v(1), …, v(i−1)}.
𝒮>i | {v(i+1), …, v(|V|)}.
L(𝒮, i) | P(𝒮≤i, v(i)) − {v(i)}.
T(𝒮, i) | L(𝒮, i) − ∪_{j<i} L(𝒮, j).
D(𝒮, i) | {v ∈ 𝒮>i | v ∈ P(𝒮≤i ∪ {v}, v(i))}.
A brute-force method to compute an efficient strategy for G=(V,E) is to enumerate all possible combinations of configurations of the vertices and choose the one with the least cost. The combinatorial nature of this method makes it impractical to use even on small graphs such as DNN1, discussed in Experimental Results section D below. However, the complexity of the problem can be greatly reduced due to the following observation in Equation (1): changing the configuration for a vertex v from C_i to C_j affects only the layer cost t_l(v, ⋅, ⋅) of the vertex itself and its data transfer costs t_x(u, v, ⋅, ⋅) with its neighbors, where u∈N(v). This allows ordering the vertices V into a sequence 𝒮 = (v(1), …, v(|V|)) in the order in which the vertices are visited during a breadth-first traversal of G, and computing the best strategy for G using the recurrence (2) below. Let ϕ_i be a substrategy for the set of vertices in R(𝒮≤i, 𝒮>i), where the restricted neighbor set R(𝒮≤i, 𝒮>i) is the set of vertices in 𝒮>i with at least one neighbor in 𝒮≤i; then, for i∈[1, |V|],
the recurrence (2) computes the best cost 𝒜(𝒮, i, ϕ_i) by minimizing, over the configurations C∈𝒞(v(i)), the sum of 𝒜(𝒮, i−1, ⋅) and a per-vertex cost term, defined in (3), that accounts for the layer cost of v(i) and its data transfer costs with the neighbors whose configurations are fixed by ϕ_i and C.
The cost of an efficient parallelization strategy for G is given by 𝒜(𝒮, |𝒮|, Ø). As shown in Table III in Section D, computing efficient strategies using this recurrence is still quite expensive, and it takes a significant amount of time to find the best strategies for graphs other than simple path graphs such as DNN1. A more efficient approach to find the best strategies according to various embodiments is discussed next.
From the recurrence (2), it can be observed that since 𝒜 is a function of the substrategy ϕ_i that comprises configurations of R(𝒮≤i, 𝒮>i), a process that computes efficient strategies using (2) should compute 𝒜 for all possible combinations of ϕ_i∈Φ′, where Φ′ is the set of all possible substrategies for R(𝒮≤i, 𝒮>i). Hence, the computational complexity for finding an efficient strategy using (2) is at least O(K^{M+1}), where K = max_{v∈V} |𝒞(v)| is the maximum number of configurations for any vertex in G, and M = |R(𝒮≤i, 𝒮>i)|.
Various embodiments order the nodes into a sequence 𝒮 such that the size of the restricted neighbor set R(𝒮≤i, 𝒮>i) for any v(i) in 𝒮 is as small as possible. In one or more embodiments, this may be accomplished by using a process that orders the vertices V in a manner such that the sizes of the restricted neighbor sets are kept to a minimum.
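By way of illustration, one simple greedy heuristic in this spirit is sketched below: it repeatedly appends the unsequenced vertex that keeps the restricted neighbor set smallest. This sketch is provided for intuition only and is not necessarily the ordering procedure used by the embodiments; the adjacency-map input is an assumed representation.

```python
# Illustrative greedy ordering: repeatedly append the unsequenced vertex that
# keeps the restricted neighbor set R(S_<=i, S_>i) as small as possible.
# adj maps each vertex to the set of its neighbors (undirected view of G).
def greedy_order(adj):
    remaining = set(adj)
    seq = []
    frontier = set()  # unsequenced neighbors of the already sequenced vertices
    while remaining:
        def frontier_size_if_picked(v):
            return len((frontier - {v}) | (adj[v] & (remaining - {v})))
        v = min(remaining, key=frontier_size_if_picked)
        seq.append(v)
        remaining.remove(v)
        frontier = (frontier - {v}) | (adj[v] & remaining)
    return seq
```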
Given a graph G=(V,E) and a sequence 𝒮 = (v(1), …, v(|V|)), in one or more embodiments, a left-reachable set L(𝒮, i) = P(𝒮≤i, v(i)) − {v(i)} may be defined as the set of nodes in 𝒮<i that are reachable from v(i) through, e.g., an undirected path (v(a), …, v(p)) of length p>1 s.t. ∀x∈[a, p], v(x)∈𝒮≤i. In one or more embodiments, a terminal set T(𝒮, i) = {v(j)∈L(𝒮, i) | ∄ k<i, v(j)∈L(𝒮, k)} of v(i) is the set of vertices that are left-reachable from v(i) and are not left-reachable from any other vertex v(k) for k<i. Alternatively, a terminal set may be defined as T(𝒮, i) = L(𝒮, i) − ∪_{j<i} L(𝒮, j). In one or more embodiments, a right-dependent set D(𝒮, i) = R(L(𝒮, i) ∪ {v(i)}, 𝒮>i) = {v∈𝒮>i | v∈P(𝒮≤i ∪ {v}, v(i))} may be defined as the subset of 𝒮>i comprising the restricted neighbors of v(i) and of its left-reachable set. In one or more embodiments, the right-dependent set D(𝒮, i) may be the equivalent of R(𝒮≤i, 𝒮>i) in recurrence (2), for the recurrence (4) defined below.
In one or more embodiments, to compute the best strategy using the sequence 𝒮, the recurrence in (2) may be reformulated as the recurrence (4): for any i∈[1, |V|], 𝒜(𝒮, i, ϕ_i) may be computed by minimizing, over the configurations C∈𝒞(v(i)), the sum of the cost term defined in (3) and the costs 𝒜(𝒮, j, ⋅) of the vertices v(j) in the terminal set T(𝒮, i), where ϕ_i is a substrategy for the set of nodes in D(𝒮, i). The recurrence (4) is thus expressed in terms of the terminal set and right-dependent set of v(i), instead of in terms of the previous node in the sequence v(i−1) as in recurrence (2): in the illustrated example, |R(𝒮≤5, 𝒮>5)| = |{v(7), v(8), v(9)}| = 3, while |D(𝒮, 5)| = 1. Thus, a simple recurrence 𝒜(𝒮, i, ϕ_i) that directly recurses to 𝒜(𝒮, i−1, ϕ_i) has |ϕ_i| more than twice the size of |D(𝒮, 5)|, even on this simplified graph. Since |Φ′| = O(K^|ϕ_i|), keeping |ϕ_i| small greatly reduces the number of substrategy combinations that need to be examined.
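For intuition, a heavily simplified dynamic program is sketched below for the special case of a path graph, where each right-dependent set contains at most one vertex; configs, layer_cost, and transfer_cost are assumed callables, and the sketch omits the terminal-set bookkeeping of the full process, so it is illustrative rather than the DP process of the embodiments.

```python
# Simplified DP for a path graph v(1)-v(2)-...-v(n), processed right to left.
# The table built while processing v(i) is keyed by the configuration of
# v(i-1), the only earlier vertex whose choice interacts with the suffix.
def best_path_strategy_cost(path, configs, layer_cost, transfer_cost, r):
    nxt_table = {None: 0.0}  # cost table of the previously processed (next) vertex
    for i in range(len(path) - 1, -1, -1):
        v = path[i]
        prev_options = [None] if i == 0 else list(configs(path[i - 1]))
        table = {}
        for prev_cfg in prev_options:        # configuration assumed for v(i-1)
            best = float("inf")
            for cfg in configs(v):           # candidate configuration for v(i)
                cost = layer_cost(v, cfg, r)
                if prev_cfg is not None:     # edge (v(i-1), v(i)) transfer cost
                    cost += r * transfer_cost(path[i - 1], v, prev_cfg, cfg)
                # best completion of the suffix, given v(i)'s configuration
                cost += nxt_table[cfg if i + 1 < len(path) else None]
                best = min(best, cost)
            table[prev_cfg] = best
        nxt_table = table
    return nxt_table[None]  # minimum total cost over all strategies for the path
```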
Lemma 1:
Given a computation graph G=(V,E) and a sequence 𝒮, for any vertex v(i)∈V and a substrategy ϕ_i comprising configurations for the vertices in D(𝒮, i), 𝒜(𝒮, i, ϕ_i) can be expressed as a minimization, over the set Φ′ of all valid substrategies for {v(i)} ∪ L(𝒮, i), of the accumulated costs of the vertices in {v(i)} ∪ L(𝒮, i).
Proof:
It will be shown that 𝒜(𝒮, i, ⋅) recursively calls 𝒜(𝒮, j, ⋅): (i) at most once, (ii) at least once, and (iii) only if v(j)∈L(𝒮, i).

From the definition of a terminal set, for any x, y s.t. 1≤x<y≤|V|, T(𝒮, x) ∩ T(𝒮, y) = Ø, since, as a contradiction, if v(j)∈T(𝒮, x) ∩ T(𝒮, y), then v(j)∈L(𝒮, x) ∧ v(j)∈L(𝒮, y) ⇒ v(x)∈L(𝒮, y) ⇒ v(j)∉T(𝒮, y), violating the assumption. This proves (i).
In order to prove (ii), as a contradiction, assume that 𝒜(𝒮, j, ⋅) is never called from 𝒜(𝒮, i, ⋅) for a v(j)∈L(𝒮, i). Without loss of generality, let v(j) be such a vertex closest to v(i), i.e., for any other k∈[j+1, i−1], either v(k)∉L(𝒮, i) or 𝒜(𝒮, k, ⋅) is recursively called from 𝒜(𝒮, i, ⋅). Let U = {v(x) | v(j)∈L(𝒮, x)} be the set of vertices from which v(j) is left-reachable. Let v(k)∈U be the vertex in U closest to v(j), i.e., ∄ x<k s.t. v(x)∈U. Then, from the definition of a terminal set, v(j)∈T(𝒮, k), and since 𝒜(𝒮, k, ⋅) is recursively called from 𝒜(𝒮, i, ⋅) according to the assumption made, 𝒜(𝒮, j, ⋅) is also called from 𝒜(𝒮, i, ⋅), contradicting the assumption.
It could be seen that (iii) is always true since, if v(j)∉L(𝒮, i), then v(j)∉L(𝒮, k) for any v(k)∈L(𝒮, i). Hence, v(j)∉T(𝒮, k) for any v(k)∈{v(i)} ∪ L(𝒮, i).
The following Theorem follows from Lemma 1.
Theorem 1: Let G=(V,E) be a computation graph for a DNN that is executed in a processing environment with p devices with an average FLOP-to-bytes ratio r. Let 𝒮 be a sequence for V, and let Φ be the set of all possible strategies for G; then 𝒜(𝒮, |V|, Ø) = min_{ϕ∈Φ} 𝒯(G, ϕ).
Proof:
Since a computation graph is a weakly connected graph, L(𝒮, |V|) ∪ {v(|V|)} = V; following Lemma 1, 𝒜(𝒮, |V|, Ø) therefore accumulates the cost of every vertex in V and of every edge in E exactly once. Equation (5) is due to the fact that ∪_{v(i)∈V} {{v(i), v(j)} | v(j)∈R({v(i)}, 𝒮>i)} = E′, and, for any x, y, x≠y ⇒ {{v(x), v(j)} | v(j)∈R({v(x)}, 𝒮>x)} ∩ {{v(y), v(j)} | v(j)∈R({v(y)}, 𝒮>y)} = Ø, where E′ = {{u, v} | (u, v)∈E} is the undirected equivalent of E.
In one or more embodiments, for each vertex v, the vertex-ordering procedure (procedure 300) maintains its right-dependent set D(𝒮, i) (referred to as v.d in procedure 300) and its terminal set T(𝒮, i) (referred to as v.t). In Line 1, procedure 300 initializes v.d to the neighbor set N(v) and v.t to Ø for every v∈V.
In one or more embodiments, as shown in Line 1 of DP process 400, the nodes in computation graph G may be first sorted into a sequence 𝒮 using the vertex-ordering procedure. For each node v(i), the set Φ′ of all valid substrategies for its right-dependent set D(𝒮, i) may be computed. Then, for each ϕ′∈Φ′, a configuration C∈𝒞(v(i)) that minimizes the cost 𝒜(𝒮, i, ϕ′) may be computed as shown in Lines 11-22: Line 13 computes the cost term defined in (3) for (𝒮, i, ϕ′); Lines 14-16 compute Σ_{v(j)∈T(𝒮,i)} 𝒜(𝒮, j, ϕ″), where ϕ″ denotes the restriction of ϕ′ ∪ {(v(i), C)} to D(𝒮, j), using the costs stored in the DP tables v(j).tbl; and the configuration that leads to the minimum cost may be stored into ϕ in Line 19. In one or more embodiments, the DP table for v(i) may be updated with ϕ and its cost min_cost, as shown in Line 22. From the definition of terminal nodes, it can be seen that v(j)∈T(𝒮, i) ⇒ D(𝒮, j) ⊆ {v(i)} ∪ D(𝒮, i). Hence, in Line 15, the table v(j).tbl comprises one entry for the substrategy ϕ″ corresponding to 𝒜(𝒮, j, ϕ″). Similarly, in Line 26 in DP process 400, the best strategy for G may be assembled from the stored DP tables. The computational complexity of the process is governed by K and M, where K = max_{v∈V} |𝒞(v)| is the maximum number of configurations for a layer, and M = max_{v(i)∈𝒮} |D(𝒮, i)| is the maximum size of the right-dependent set. The proof by induction below demonstrates that the right-dependent sets and terminal sets constructed by using the vertex-ordering procedure correspond to D(𝒮, i) and T(𝒮, i) as defined above.
Theorem 2:
Given a computation graph G=(V,E) and a sequence 𝒮 for V that is computed by using the vertex-ordering procedure, for any v(i)∈𝒮: (i) v(i).d = D(𝒮, i), and (ii) v(i).t = T(𝒮, i).
Proof:
At the end of any iteration i, there is a partial sequence 𝒮≤i and the remaining nodes U = V − 𝒮≤i that are yet to be sequenced. As will be shown, the following invariants hold at the end of any iteration i: for any u∈U, the invariants (6) and (7) on u.d and u.t hold; and, for any v(j)∈𝒮≤i, v(j).d = D(𝒮, j) and v(j).t = T(𝒮, j).
Induction Base:
The invariants trivially hold at the beginning of the first iteration (i=0) as v.d is initialized to N(v), and v.t is initialized to Ø for any v∈V.
Induction Step:
As a hypothesis, the invariants are true at the end of an iteration i. Let v(i+1) be the node picked at Line 7 by the vertex-ordering procedure. For any u∈U that is not in v(i+1).d, P(𝒮≤i ∪ {u}, u) = P(𝒮≤i+1 ∪ {u}, u). Hence, invariants (6) and (7) are trivially satisfied. For any u∈v(i+1).d, invariant (6) is satisfied due to the assignments to u.d at Line 10 of the vertex-ordering procedure, and v(i+1).d = {v∈U | v∈P(𝒮≤i ∪ {v(i+1), v}, v(i+1))} = D(𝒮, i+1). Further, since v(i+1)∉u.d for any u∈U, v(i+1).d is never modified after iteration i+1. Similarly, invariant (7) is satisfied due to the assignments to u.t at Line 11 of the vertex-ordering procedure, and v(i+1).t = P(𝒮≤i ∪ {v(i+1)}, v(i+1)) − {v(i+1)} − ∪_{j≤i} L(𝒮, j) = T(𝒮, i+1). Further, since v(i+1)∉u.d for any u∈U, v(i+1).t is never modified after iteration i+1.
Since the computation graphs of DNNs are generally sparse and have few high-degree nodes, in one or more embodiments, the sequence generated by the vertex-ordering procedure keeps the sizes of the right-dependent sets small in practice. In contrast to a breadth-first ordering, in one or more embodiments, sequencing the nodes using the vertex-ordering procedure keeps the number of substrategy combinations that need to be examined per vertex manageable.
Experimental data shows that the number of configurations per vertex of DNN2 varies between 10 and 30 for p=8 GPUs, and the maximum number of configurations reaches up to 100 (i.e., K=100) for p=64 GPUs. In one or more embodiments, by ordering the vertices using the vertex-ordering procedure, |D(𝒮, i) ∪ {v(i)}| may be maintained at ≤3 for any i, and the maximum number of combinations analyzed per vertex by the process is |Φ′| ≤ 25200 for p=8. For comparison, when breadth-first ordering is used, the sizes of the right-dependent sets reach up to 11, leading to |Φ′| ≥ 11^10, making it prohibitively expensive in practice, in terms of both time and memory.
Hybrid parallelism strategies found by DP process 400 are evaluated on four different benchmarks. The benchmarks are chosen to be representative of the whole DNN space. a) DNN1 is an image classification convolutional network (CNN) whose computation graph is a simple path graph, where each layer is connected only to the next layer; b) DNN2 is a deep CNN that uses inception modules to increase the number of layers while maintaining a reasonable computational budget. The nodes are split and concatenated at the beginning and end of each inception module, respectively, leading to a few high degree nodes in its computation graph; c) DNN3 is a two-layer recurrent neural network consisting of LSTM cells, used for language modeling tasks; and finally, d) DNN4 is a non-recurrent neural machine translation model, whose computation graph is quite different from recurrent networks such as DNN3. Datasets used in experiments are summarized in Table II. A batch size of 128 was used for CNNs, and a batch size of 64 was used for the rest of the benchmarks. Results using embodiments of the present disclosure are compared against data parallelism, expert-designed strategies, and the strategies suggested by Framework1.
Framework1:
Framework1 is a deep learning framework that automatically finds fast parallelization strategies. It uses a general Markov Chain Monte Carlo (MCMC) search process to explore the search space and iteratively proposes candidate strategies based on the simulated performance of previous candidates. When the search procedure is finished, the framework returns the best strategy it has discovered. As this approach is based on meta-heuristics, the framework could get stuck in a local minimum, returning a sub-optimal strategy. An initial candidate from the search space needs to be provided to MCMC to begin the search process, and the efficiency of the strategy found by Framework1 might also vary depending on the initial candidate. The evaluation of Framework1 herein uses expert-designed strategies as the initial candidates, such that Framework1 can improve upon them.
Expert Strategies:
Expert-designed parallelization strategies were developed by domain experts to improve a model's parallel performance on a case-by-case basis. Since not all DNNs have well-defined expert-designed strategies, the following ones, which were the most relevant for the various benchmarks, are chosen: For convolution networks, data parallelism is used for convolution layers, and model parallelism is used for fully-connected layers. This technique is referred to as the “one weird trick” (OWT). Even though this technique was proposed specifically for DNN1, it is applicable to any convolution network in general. Thus, this technique is used to evaluate both DNN1 and DNN2, discussed in Experimental Results section D below. For RNNs, a data+pipeline parallelism strategy has been proposed, where different layers of the RNN are placed on different devices to achieve pipeline parallelism, and each layer is replicated on the remaining devices for data parallelism. This strategy is used for comparison in the DNN3 experiments. For the DNN4 model, also discussed in Experimental Results section D below, a previously suggested model parallelism strategy is used in the experiments as the expert-designed strategy. This strategy was primarily proposed to overcome memory bottlenecks in order to train a large DNN4 model within the memory constraints of current architectures, while also achieving good parallel execution efficiency.
In this subsection, the time taken by different approaches to find the best strategies for the four different benchmarks is measured. The running time of an embodiment that uses the vertex-ordering procedure is compared against the running time of an embodiment that uses a breadth-first (BF) ordering, and against the search time of Framework1.
As the computation graph of DNN1 is a simple path graph, the sizes of both R(𝒮≤i, 𝒮>i) and D(𝒮, i) are just 1 for the different vertices. Hence, both the BF ordering and the vertex-ordering procedure find the best strategy for DNN1 within a few seconds.
For the DNN3 benchmark, since an RNN operator (with LSTM cells) can be efficiently represented in a single iteration space, the complete RNN operator, including the recurrent steps, is represented as a single vertex in the computation graph. The iteration space of an RNN operator is a five-dimensional space consisting of the batch dimension, the sentence sequence dimension (recurrent steps are captured by this dimension), the layer dimension, the hidden dimension, and the output dimension. Note that this is different from the way the RNN operator is handled by Framework1. In Framework1, the recurrent dimension is unrolled (experiments herein use an unroll factor of 40), and each iteration is represented as a vertex in the graph. In one or more embodiments, representing the whole RNN operator as a single vertex, in addition to significantly reducing the graph size, also makes it possible to analyze configurations that take advantage of the inherent pipeline parallelism present within an RNN operator. Configurations that split the layer and the sentence sequence dimensions capture pipeline parallelism opportunities in an RNN, as illustrated in the sketch below. As detailed in Section B, embodiments that do not capture inter-layer pipeline parallelism are still capable of capturing intra-layer pipeline parallelism by splitting the layer dimension of DNN3's iteration space. With this representation, the computation graph of DNN3 reduces to a simple path graph. Hence, both the BF ordering and the vertex-ordering procedure find the best strategy for DNN3 quickly.
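The sketch below illustrates this point by enumerating configurations of the five-dimensional RNN iteration space and keeping only those that split both the layer and sentence sequence dimensions; the dimension names follow the description above, while p = 8 and the helper name are illustrative assumptions.

```python
# Enumerate configurations of the 5-D RNN iteration space and keep only
# those that split both the layer and sequence dimensions, i.e., the
# configurations that expose pipeline parallelism within the RNN operator.
from itertools import product

DIMS = ("batch", "sequence", "layer", "hidden", "output")

def pipeline_capturing_configs(p):
    result = []
    for cfg in product(range(1, p + 1), repeat=len(DIMS)):
        total = 1
        for c in cfg:
            total *= c
        if total <= p and cfg[DIMS.index("layer")] > 1 and cfg[DIMS.index("sequence")] > 1:
            result.append(dict(zip(DIMS, cfg)))
    return result

print(len(pipeline_capturing_configs(p=8)), "configurations split both the layer and sequence dimensions")
```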
Similar to DNN2, the computation graph of a DNN4 model has a large number of low-degree vertices and very few high-degree vertices. One major difference is that these high-degree vertices (such as the final output of the encoder) have longer dependencies (i.e., a longer live range in the actual computation), as the results of these vertices are used later in the computation. The presence of such long dependencies may eliminate possible vertex orderings that could reduce the sizes of the right-dependent sets as effectively as in the DNN2 model. In one or more embodiments, this may lead to a longer running time to find the best strategy. As with DNN2, BF ordering fails to find the best strategy for DNN4 due to memory constraints. Since the DNN4 model was not implemented and analyzed with Framework1, no comparisons of Framework1's running time for DNN4 are presented here.
Performances of different strategies are compared below. Experiments were performed on different numbers of GPUs, ranging from 4 (on a single node) to 64 (spread over 8 nodes), incremented in powers of 2. The nodes are connected to each other using an InfiniBand interconnection network. The results are evaluated on the following two processing environments:
a) a multi-node/multi-GPU system, where each node comprises 8 GeForce GTX 1080 Ti GPUs (with sm_61 compute capabilities) fully-connected using PCIe links; b) a multi-node/multi-GPU system, where each node contains 8 GeForce RTX 2080 Ti GPUs (with sm_75 compute capabilities) fully-connected using PCIe links. The benchmarks and strategies are implemented in the mesh-tensorflow framework for evaluation. Since the DNN4 model was not implemented and evaluated on Framework1, the results from Framework1 for the DNN4 model are not included in the experiments.
DNN1:
DNN1 has five convolution layers, followed by three fully-connected layers. For p=8 GPUs, the DP process suggests a hybrid strategy that parallelizes the convolution layers and the fully-connected layers differently.
DNN2:
An Inception network has a sequence of inception modules (A-E) composed of convolution layers, followed by a single final fully-connected layer. In one or more embodiments, the efficient strategy obtained from the DP process combines data and model parallelism to varying degrees for the different layers of the network.
DNN3:
An RNNLM network is composed of an embedding layer, followed by two layers of LSTM cells, and a final projection layer, whose computations are dominated by matrix-matrix multiplication. An embedding layer has a relatively large vocabulary dimension V, and a much smaller embedding dimension H. In one or more embodiments, the DP process suggests parallelizing the embedding layer along its model dimensions (e.g., the vocabulary dimension V).
For the LSTM cells, the DP process suggests fully splitting the LSTM layer dimension, and partially splitting the other three dimensions—batch, hidden, and output dimensions—to varying degrees depending on the number of GPUs on which it is executed. The final projection layer's output dimension may be the same as V and its hidden dimension may be the same as H. In one or more embodiments, the process suggests splitting the output dimension completely.
DNN4:
The DNN4 model is a non-recurrent attention-based NMT model. A known hybrid parallelism strategy splits the batch dimension of all the layers m-way, and splits the model dimensions of different layers—vocabulary dimension, feed-forward hidden layer dimension, and attention heads—n-way. In one or more embodiments, the process suggests using complete model parallelism on a few layers, especially the embedding layers, while using a hybrid of data parallelism and model parallelism of varying degrees for the remaining layers.
Data parallelism has been widely used as the standard technique to parallelize DNN training. However, when model parameters are large, data parallelism performs poorly due to its high communication requirement. OWT has been suggested to parallelize CNNs, and some have noted that modern CNNs consist of two types of layers with different properties: (i) convolutional layers containing a small number of parameters, and (ii) fully-connected layers with a large number of parameters. Thus, embodiments contemplate a parallelization strategy that parallelizes these two types of layers differently. As data parallelism is typically best suited for convolutional layers containing a small number of model parameters, and model parallelism works best for the fully-connected layers seen in CNNs, the proposal suggests parallelizing the convolutional layers using data parallelism, and switching to model parallelism for the fully-connected layers present at the end of the network. As shown in the experimental results, the strategies found by the presented embodiments perform better than OWT for the various benchmarks.
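A hedged sketch of an OWT-style assignment is shown below; the per-layer kind attribute is an illustrative assumption, and the sketch is a simplification of the expert-designed strategy rather than any embodiment herein.

```python
# Sketch of the "one weird trick" (OWT) assignment: data parallelism for
# convolution layers (few parameters) and model parallelism for
# fully-connected layers (many parameters).
def owt_strategy(layers):
    strategy = {}
    for layer in layers:
        if layer.kind == "conv":
            strategy[layer] = "data_parallel"   # replicate weights, shard the mini-batch
        elif layer.kind == "fc":
            strategy[layer] = "model_parallel"  # shard weights, gather activations
        else:
            strategy[layer] = "data_parallel"   # default for remaining layer types
    return strategy
```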
One dynamic-programming based approach to automatically find optimal schedules for CNNs exploits the following graph property commonly seen in the computation graphs of CNNs: several of the nodes in the computation graph of a CNN have a single in-edge and a single out-edge. Based on this observation, that approach proposes two graph reduction operations—node elimination and edge elimination—to simplify a graph. This approach considers that optimality is preserved by these reduction operations and uses an efficient dynamic programming-based process for computing optimal strategies for various CNNs. However, the computation graphs of other networks, such as RNNs, do not have this special property, and thus the technique fails to reduce the graph efficiently enough to find an optimal strategy within a reasonable time. In contrast, embodiments disclosed herein are not limited to computation graphs of CNNs and can thus find efficient strategies for various types of networks, such as RNN and DNN4 models, within a few minutes. In addition, embodiments herein define a parallelization configuration to split any dimension in the iteration space. This is significantly different from splitting only the output tensor dimensions, which can heavily restrict the search space since some of the dimensions (such as the reduction k-dimension of GEMM) are not considered possible choices for parallelization.
One approach that uses Framework1 to automatically find efficient parallelization strategies for various DNNs relies on an execution simulator together with a general Markov Chain Monte Carlo (MCMC) search process to explore the search space and discover the best strategy. The strategy returned by this framework need not necessarily be optimal. While this method takes the whole search space into consideration, embodiments herein may ignore inter-layer pipeline parallelism to, advantageously, find an efficient strategy for various DNNs much faster than Framework1, without being subject to the limitation of getting stuck at a local minimum.
REINFORCE uses machine learning to find efficient device placement for various layers to achieve efficient pipeline parallelism. However, the technique ignores data and model parallelism strategies in the search process. Further, it requires multiple GPUs and takes several hours to find an efficient strategy, while embodiments herein finish within a few minutes. Some approaches use polyhedral compilation techniques to optimize the kernels of individual DNN operators to efficiently execute on GPUs. However, these approaches do not consider parallelization of these kernels on multiple devices. Embodiments may use such techniques orthogonally to further improve the performance within each GPU.
Some previous efforts apply semantic modifications to the model to expose better pipeline parallelism in order to improve parallel training. However, these semantic modifications lead to variations in model accuracy compared to the original model, and they might also take more epochs to converge, thus eliminating any advantages obtained from the modifications. In contrast, embodiments herein need not perform semantic modifications to the model. As a result, the convergence rate and the final accuracy may be exactly the same as for the original model, providing better hardware utilization through enhanced parallelism.
Several expert-designed strategies have been proposed for different networks based on domain-specific knowledge. One suggests a technique for convolution networks, and another proposes a way to achieve good pipeline parallelism for RNNs. However, each network has to be individually analyzed manually to come up with an efficient strategy. Further, these strategies are not necessarily optimal. In contrast, embodiments presented herein automate this process and can point an expert user in the right direction for parallelization.
Presented systems and methods facilitate automatically finding efficient parallelism strategies for DNN applications. Embodiments use a recurrence formulation to compute the minimum cost of a computation graph. A technique to sequence the vertices in an efficient order that allows to compute the best strategy to parallelize the graph within a few minutes is presented. Results are evaluated against data parallelism, expert designed strategies, and the strategies proposed by a deep learning framework, Framework1. Results show that the strategies proposed by various embodiments outperform the standard data parallelism by up to a factor of 4. In addition, the proposed strategies perform better than expert-designed strategies, and strategies proposed by Framework1.
In one or more embodiments, aspects of the present patent document may be directed to, may include, or may be implemented on one or more information handling systems (or computing systems). An information handling system/computing system may include any instrumentality or aggregate of instrumentalities operable to compute, calculate, determine, classify, process, transmit, receive, retrieve, originate, route, switch, store, display, communicate, manifest, detect, record, reproduce, handle, or utilize any form of information, intelligence, or data. For example, a computing system may be or may include a personal computer (e.g., laptop), tablet computer, mobile device (e.g., personal digital assistant (PDA), smart phone, etc.) smart watch, server (e.g., blade server or rack server), a network storage device, camera, or any other suitable device and may vary in size, shape, performance, functionality, and price. The computing system may include random access memory (RAM), one or more processing resources such as a central processing unit (CPU) or hardware or software control logic, ROM, and/or other types of memory. Additional components of the computing system may include one or more disk drives, one or more network ports for communicating with external devices as well as various input and output (I/O) devices, such as a keyboard, a mouse, touchscreen and/or a video display. The computing system may also include one or more buses operable to transmit communications between the various hardware components.
As illustrated in the figures, the computing system may include one or more central processing units (CPUs) that provide computing resources and control the computer. The CPU may be implemented with a microprocessor or the like, and may also include one or more graphics processing units (GPUs) and/or a floating-point coprocessor for mathematical computations. The computing system may also include system memory, which may be in the form of random-access memory (RAM), read-only memory (ROM), or both.
A number of controllers and peripheral devices may also be provided, as shown in the figures.
In the illustrated system, all major system components may connect to a bus 816, which may represent more than one physical bus. However, various system components may or may not be in physical proximity to one another. For example, input data and/or output data may be remotely transmitted from one physical location to another. In addition, programs that implement various aspects of the disclosure may be accessed from a remote location (e.g., a server) over a network. Such data and/or programs may be conveyed through any of a variety of machine-readable media including, but not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to store and execute program code, such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices, and ROM and RAM devices.
Aspects of the present disclosure may be encoded upon one or more non-transitory computer-readable media with instructions for one or more processors or processing units to cause steps to be performed. It shall be noted that the one or more non-transitory computer-readable media shall include volatile and non-volatile memory. It shall be noted that alternative implementations are possible, including a hardware implementation or a software/hardware implementation. Hardware-implemented functions may be realized using ASIC(s), programmable arrays, digital signal processing circuitry, or the like. Accordingly, the “means” terms in any claims are intended to cover both software and hardware implementations. Similarly, the term “computer-readable medium or media” as used herein includes software and/or hardware having a program of instructions embodied thereon, or a combination thereof. With these implementation alternatives in mind, it is to be understood that the figures and accompanying description provide the functional information one skilled in the art would require to write program code (i.e., software) and/or to fabricate circuits (i.e., hardware) to perform the processing required.
It shall be noted that embodiments of the present disclosure may further relate to computer products with a non-transitory, tangible computer-readable medium that have computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present disclosure, or they may be of the kind known or available to those having skill in the relevant arts. Examples of tangible computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to store and execute program code, such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices, and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher level code that are executed by a computer using an interpreter. Embodiments of the present disclosure may be implemented in whole or in part as machine-executable instructions that may be in program modules that are executed by a processing device. Examples of program modules include libraries, programs, routines, objects, components, and data structures. In distributed computing environments, program modules may be physically located in settings that are local, remote, or both.
One skilled in the art will recognize that no computing system or programming language is critical to the practice of the present disclosure. One skilled in the art will also recognize that a number of the elements described above may be physically and/or functionally separated into sub-modules or combined.
It will be appreciated to those skilled in the art that the preceding examples and embodiments are exemplary and not limiting to the scope of the present disclosure. It is intended that all permutations, enhancements, equivalents, combinations, and improvements thereto that are apparent to those skilled in the art upon a reading of the specification and a study of the drawings are included within the true spirit and scope of the present disclosure. It shall also be noted that elements of any claims may be arranged differently including having multiple dependencies, configurations, and combinations.
This patent application is related to and claims priority benefit under 35 USC § 119(e) to co-pending and commonly-owned U.S. Pat. App. No. 62/930,518, filed on 4 Nov. 2019, entitled “REDUCING TRAINING TIMES OF DEEP NEURAL NETWORKS THROUGH EFFICIENT HYBRID PARALLELISM,” and listing Venmugil Elango as inventor (Docket No. 28888-2363P), which patent document is incorporated by reference herein in its entirety and for all purposes.