Advances in speech processing and media technology have led to widespread use of automated user-machine interaction across different applications and services. Using an automated user-machine interaction approach, businesses may provide customer services and other services at relatively low cost.
Typical user machine interaction is implemented through use of speech recognition systems. Speech recognition systems convert input audio, including speech, to recognized text. During recognition, acoustic waveforms are typically divided into a sequence of discrete time vectors (e.g., 10 ms segments) called “frames,” and one or more of the frames are converted into sub-word (e.g., phoneme or syllable) representations using various approaches. According to one such example approach, input audio is compared to a set of templates, and the sub-word representation for the template in the set that most closely matches the input audio is selected as the sub-word representation for that input. In yet another approach, statistical modeling is used to convert input audio to a sub-word representation (e.g., to perform acoustic-phonetic conversion). When statistical modeling is used, acoustic waveforms are processed to determine feature vectors for one or more of the frames of the input audio, and statistical models are used to assign a particular sub-word representation to each frame based on its feature vector.
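For illustration only, the following is a minimal Python sketch of the framing and feature-extraction step described above, assuming a 16 kHz mono waveform, non-overlapping 10 ms frames, and a toy log-spectral feature vector; the frame length, the absence of overlap, and the feature choice are illustrative assumptions rather than requirements of any embodiment.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=10):
    """Split a mono waveform into consecutive, non-overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)      # samples per frame
    n_frames = len(signal) // frame_len
    return signal[:n_frames * frame_len].reshape(n_frames, frame_len)

def frame_features(frames, n_bins=13):
    """Toy feature vector per frame: log magnitudes of the first n_bins FFT bins."""
    spectrum = np.abs(np.fft.rfft(frames, axis=1))[:, :n_bins]
    return np.log(spectrum + 1e-8)

# Example: one second of synthetic audio -> 100 frames of 13-dimensional features.
audio = np.random.randn(16000)
features = frame_features(frame_signal(audio))
print(features.shape)   # (100, 13)
```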
Hidden Markov Models (HMMs) are statistical models that are often used in speech recognition to characterize the spectral properties of a sequence of acoustic patterns. For example, acoustic features of each frame of input audio may be modeled by one or more states of an HMM to classify the set of features into phonetic-based categories. Gaussian Mixture Models (GMMs) are often used within each state of an HMM to model the probability density of the acoustic patterns associated with that state. Artificial neural networks (ANNs) may alternatively be used for acoustic modeling in a speech recognition system. ANNs may be trained to estimate the posterior probability of each state of an HMM given an acoustic pattern. Some statistical-based speech recognition systems favor the use of ANNs over GMMs due to better accuracy in recognition results and faster computation times of the posterior probabilities of the HMM states.
Embodiments of the present invention provide methods and apparatuses that support training neural networks. According to at least one example embodiment, a method of training a neural network comprises: by each agent of a plurality of agents, performing a pipelined gradient analysis to update respective local models of a neural network using respective subsets of data from a common pool of training data; and updating a common global model of the neural network based upon the local models. In an example embodiment, the pipelined gradient analysis is performed by splitting the respective local models of the neural network into consecutive chunks and assigning each chunk to a stage of the pipeline.
In yet another example embodiment, each stage of the pipeline is associated with a graphics processing unit (GPU). Further still, embodiments may perform the pipelined gradient analysis by selecting the subsets of data from the common pool of training data according to a focused-attention back-propagation (FABP) strategy. An alternative embodiment, performed according to the principles of the present invention, includes an initialization procedure where a single agent of the plurality of agents performs the pipelined gradient analysis to update its respective local model of the neural network using a respective subset of data from the common pool of training data. This initialization procedure further includes updating the common global model of the neural network based upon the respective local model.
According to an embodiment, the common global model is owned by a single agent of the plurality of agents at any one time, and this ownership is regulated by a locking mechanism. In such an embodiment, the common global model is updated by a single agent during a period in which that single agent owns the common global model. In another embodiment of the present invention, the common global model is updated when a critical section is reached. In such an embodiment, the critical section may be defined by a point when multiple agents of the plurality of agents need to update the common model. In yet another embodiment, a critical section is reached when an agent of the plurality is ready to update the global model and the agent of the plurality that is ready to update the global model does not own the global model. In such an embodiment, the agent that is ready to update the global model may request the global model.
An alternative embodiment of the present invention is directed to a computer system for training a neural network. Such a computer system embodiment comprises a processor and a memory with computer code instructions stored thereon. The processor and the memory, according to such an embodiment, with the computer code instructions, are configured to cause the system to: by each agent of a plurality of agents, perform a pipelined gradient analysis to update respective local models of a neural network using respective subsets of data from a common pool of training data; and update a common global model of the neural network based upon the local models.
According to an embodiment of the system, in performing the pipelined gradient analysis, the processor and the memory, with the computer code instructions, are further configured to cause the system to split the respective local models of the neural network into consecutive chunks and assign each chunk to a stage of a pipeline. In embodiments, a "chunk" may be a portion of the neural network, e.g., one or more blocks of weights, where a block of weights comprises the weights connecting two consecutive DNN layers. Further still, according to an embodiment, each stage of the pipeline may be associated with a GPU. In yet another computer system embodiment, in performing the pipelined gradient analysis, the processor and the memory, with the computer code instructions, are further configured to cause the system to select the subsets of data used for the analysis from the common pool of training data according to an FABP strategy.
Another embodiment of the system employs an initialization procedure. According to one such embodiment, the processor and the memory, with the computer code instructions, are further configured to implement the initialization procedure that causes the system to: by a single agent of the plurality of agents, perform the pipelined gradient analysis to update its respective local model of the neural network using a respective subset of data from the common pool of training data and update the common global model of the neural network based upon its local model. According to yet another embodiment, the common global model is owned by a single agent of the plurality of agents at any one time according to a locking mechanism. In such an embodiment, the common global model is updated by the single agent during a period in which the single agent owns the common global model.
In yet another example computer system embodiment, a critical section is reached when an agent of the plurality is ready to update the global model and the agent of the plurality that is ready to update the global model does not own the global model. In such an embodiment, the agent that is ready to update the global model may request the global model.
An embodiment of the present invention is directed to a computer program product for training a neural network. The computer program product, according to such an embodiment, comprises one or more computer readable tangible storage devices and program instructions stored on at least one of the one or more storage devices. The program instructions, when loaded and executed by a processor, cause an apparatus associated with the processor to: cause each agent of a plurality of agents to perform a pipelined gradient analysis to update respective local models of a neural network using respective subsets of data from a common pool of training data and update a common global model of the neural network based upon the local models.
The foregoing will be apparent from the following more particular description of example embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments of the present invention.
A description of example embodiments of the invention follows.
The teachings of all patents, published applications, and references cited herein are incorporated by reference in their entirety.
It should be understood that the terms neural network, artificial neural network (ANN), and deep neural network (DNN) are used interchangeably herein.
As described hereinabove, neural networks may be used in speech recognition applications. One such ANN commonly used for speech recognition is the feed-forward multi-layer perceptron (MLP). This neural network includes a plurality of layers of nodes forming a directed graph. The most basic MLP includes an input layer and an output layer. MLPs with three or more layers are also commonly referred to as DNNs and include one or more “hidden” layers arranged between the input and output layers. Each layer in the MLP includes a plurality of processing elements called nodes, which are connected to other nodes in adjacent layers of the network. The connections between nodes are associated with weights that define the strength of association between the nodes. Each node is associated with non-linear activation functions that define the output of the node given one or more inputs. Typical activation functions used for input and hidden layers in an ANN are sigmoid functions or Rectified Linear Units, whereas a softmax function is often used for the output layer.
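For illustration, the following is a minimal NumPy sketch of such a feed-forward MLP, with sigmoid activations in the hidden layers and a softmax output layer; the layer sizes, weight initialization, and input dimensions are illustrative assumptions and not part of any described embodiment.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MLP:
    """Feed-forward MLP: input -> hidden layers (sigmoid) -> output layer (softmax)."""
    def __init__(self, layer_sizes, seed=0):
        rng = np.random.default_rng(seed)
        # One weight matrix and bias vector per pair of consecutive layers.
        self.weights = [rng.normal(0.0, 0.1, (m, n))
                        for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
        self.biases = [np.zeros(n) for n in layer_sizes[1:]]

    def forward(self, x):
        for i, (w, b) in enumerate(zip(self.weights, self.biases)):
            z = x @ w + b
            x = softmax(z) if i == len(self.weights) - 1 else sigmoid(z)
        return x

# 13-dimensional acoustic features -> 3 hidden layers -> posteriors over 40 HMM states.
net = MLP([13, 256, 256, 256, 40])
posteriors = net.forward(np.random.randn(8, 13))   # batch of 8 frames
print(posteriors.shape, posteriors.sum(axis=1))    # (8, 40), each row sums to ~1
```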
While neural networks may be favored over existing approaches due to increased accuracy in recognition results and faster computation times of the posterior probabilities of the Hidden Markov Model (HMM) states, ANNs are not without their drawbacks. ANNs and their associated functions require a significant amount of training on example speech patterns to achieve an acceptable level of accuracy. This training requires a significant amount of time and can be computationally very expensive. It is not uncommon for ANN training using a single processor to take approximately one to two weeks or even longer, depending upon the size of the pool of training data. Further, while parallelization is a common technique used in computer systems to speed up processing times, ANN training is not easily amenable to parallelization. Training of a DNN is particularly difficult to parallelize due to the use of small mini-batches and the need to update the model after each mini-batch has been processed. Difficulties in training ANNs are further described in Zhang et al., “Asynchronous Stochastic Gradient Descent For DNN Training,” the contents of which are herein incorporated by reference.
Existing methods of data parallelization of stochastic gradient descent (a neural network training methodology) utilize a central coordinator (parameter server) and several computation agents. Each agent has a replica of the whole model and its own shard of the training data. At each training step, each agent gets the latest available model from the parameter server, computes the variations of the model on its own data shard, and sends the variations to the parameter server. This is a poor solution for neural network training because back-propagation uses small mini-batches and the model parameters need to be updated after every mini-batch is processed, which results in a very high model and gradient exchange rate.
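For illustration, the following is a simplified, in-process sketch of the parameter-server scheme described above, using a toy linear model in place of a DNN. The two transfers per mini-batch per agent (pull, then push) are what produce the high model and gradient exchange rate noted above; all names and the gradient computation are illustrative assumptions.

```python
import numpy as np

class ParameterServer:
    """Central coordinator holding the latest model parameters."""
    def __init__(self, model):
        self.model = model

    def pull(self):
        return self.model.copy()

    def push(self, delta):
        self.model += delta          # apply the agent's parameter variations

def agent_step(server, shard, lr=0.01):
    """One training step of one agent on one mini-batch of its data shard."""
    model = server.pull()                        # transfer #1: fetch the latest model
    x, y = shard                                 # toy linear-regression mini-batch
    grad = 2 * x.T @ (x @ model - y) / len(x)    # gradient on the local mini-batch
    server.push(-lr * grad)                      # transfer #2: send the variations back

# Two transfers per mini-batch per agent: with the small mini-batches used in
# back-propagation, this exchange rate is what makes the scheme expensive for DNNs.
server = ParameterServer(np.zeros(5))
shard = (np.random.randn(32, 5), np.random.randn(32))
for _ in range(100):
    agent_step(server, shard)
```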
Thus, embodiments of the present invention introduce a hierarchical approach to neural network training that employs data parallelization and multi-GPU (graphics processing unit) pipelined parallelization to achieve a significant speed up in neural network training time. Described hereinbelow are example method and system embodiments for neural network training that employ pipelined data parallelization techniques according to the principles of the present invention.
Once each agent 101A-N has the training data 103a-n, each respective agent 101A-N performs a respective gradient analysis 104a-n. According to embodiments of the present invention, the gradient analysis 104a-n is a pipelined gradient analysis. Further detail regarding the gradient analysis 104a-n is described hereinbelow in relation to
As described hereinabove,
At this point, the global model 106 only includes the results from the gradient analysis 104a performed by the agent A 101A; thus, embodiments of the system 100 must proceed to update the global model 106 to reflect changes to the other local work models 105b-n.
The system 100 can proceed in the manner described above to continue training the neural network model 106 using the remaining training data 102. The global model 106 can be transferred to any number of agents 101 in the system 100, and once owned by each respective agent, the global model can be updated to include the results of each respective gradient analysis 104 that is reflected in the respective working models 105.
The timing/transfer method 220 is but one example method that may be utilized by embodiments of the present invention. Alternative embodiments are not limited to any fixed timing. The agents instead may employ non-deterministic, parallel, and asynchronous methods for training a neural network. In such an asynchronous method, the agents perform respective gradient analyses and then, when an individual agent is ready to update the global model, that agent requests the global model from the agent that currently owns the model. If, at that moment, the global model is being downloaded or updated by another agent, the requesting agent waits and retries.
These timing/transfer methods allow the global model to be updated without utilizing a central server that can become a bottleneck. Instead, embodiments of the present invention utilize a locking mechanism where ownership of the global neural network model is dynamically allocated and owned by the most recent agent that updated it. This rotational ownership can be employed such that each agent only downloads the global model when the agent needs to apply gradient changes. Such a method helps to prevent model transfers and, thus, reduces the use of bandwidth. The locking mechanism may be implemented using any method that can implement a critical section, such as a mutex locking protocol.
Training a neural network is an iterative process that includes processing operations for determining node activations in the feed-forward direction (i.e., from an input layer to an output layer) and propagation of network errors in the backward direction (i.e., from an output layer to an input layer). Network errors are a measure of difference between actual outputs of the neural network and an expected output given a particular input. One such technique for training a neural network is the supervised learning technique called backpropagation. This technique relies on iterative gradient descent optimizations to minimize the network errors in the neural network by adjusting the connection weights between the nodes.
In ANN training, the feed-forward computation determines, for each layer, the output network activations 443 of the corresponding nodes given a layer input vector. After the forward stage, the outputs for nodes in each layer in the ANN are used to determine network errors 444, which are then updated in a backpropagation stage during which the errors are propagated from the output layer 441e to underlying hidden layers 441b-d in the ANN 440.
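For illustration, the following is a minimal NumPy sketch of one backpropagation iteration for a network with a single hidden layer, showing the feed-forward computation of activations, the error at the output layer, and the backward propagation of that error into gradient-descent weight updates; the network size, learning rate, and loss (softmax with cross-entropy) are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def backprop_step(w1, w2, x, targets, lr=0.1):
    """One iteration: forward pass, output error, backward pass, weight update."""
    # Feed-forward: compute and keep the activations of every layer.
    h = sigmoid(x @ w1)                 # hidden-layer activations
    y = softmax(h @ w2)                 # output-layer activations (state posteriors)
    # Network error at the output layer (softmax + cross-entropy gradient).
    err_out = y - targets
    # Propagate the error back to the hidden layer.
    err_hid = (err_out @ w2.T) * h * (1 - h)
    # Gradient-descent update of the connection weights.
    w2 -= lr * h.T @ err_out / len(x)
    w1 -= lr * x.T @ err_hid / len(x)
    return w1, w2

# Toy example: 13-dim features, 32 hidden nodes, 40 output classes, mini-batch of 8.
rng = np.random.default_rng(0)
w1, w2 = rng.normal(0, 0.1, (13, 32)), rng.normal(0, 0.1, (32, 40))
x = rng.normal(size=(8, 13))
targets = np.eye(40)[rng.integers(0, 40, 8)]
w1, w2 = backprop_step(w1, w2, x, targets)
```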
The processing in the architecture of
According to embodiments of the present invention, all GPUs 442 are configured to work simultaneously on the data they have. In the simplest case, the number of layers (hidden and output) equals the number of GPUs, and each GPU is assigned to process calculations for a single layer. If the number of layers in the ANN exceeds the number of GPUs, multiple layers may be grouped on a GPU. Moreover, while four layers, each with a corresponding GPU, are depicted, embodiments of the present invention may utilize any number of layers and corresponding GPUs, for example between four and eight.
Each input pattern travels twice per GPU, once in the feed-forward direction and once for backpropagation. Because the data needed for an update of the neural network arrives after a delay due to the pipeline round trip, updates to the network weights in the ANN use delayed data, and the deeper the network (i.e., the more layers), the longer the delay. As the activation computations in the feed-forward direction and the error computations in the backpropagation are out of sync, queues of activations from the feed-forward direction may be kept so that the weight variations can be computed with the corresponding activations and errors. The activation queues thus compensate for the lack of synchronization between the weights used in the forward and backward propagation that is caused by the delayed network weight updates introduced by the pipeline.
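For illustration, the following is a schematic Python sketch of one pipeline stage that keeps a queue of feed-forward activations so that a delayed backward error can be paired with the activation it belongs to. In the described embodiments each stage would run on its own GPU and several mini-batches would be in flight at once; this CPU sketch only illustrates the queueing idea, and the stage structure, activation function, and update rule are illustrative assumptions.

```python
from collections import deque
import numpy as np

class PipelineStage:
    """One pipeline stage: one chunk of the network, in practice placed on its own GPU."""
    def __init__(self, in_dim, out_dim, lr=0.01, seed=0):
        self.w = np.random.default_rng(seed).normal(0.0, 0.1, (in_dim, out_dim))
        self.lr = lr
        self.inputs = deque()          # queued activations awaiting their backward error

    def forward(self, x):
        self.inputs.append(x)          # keep this activation until its error comes back
        return np.tanh(x @ self.w)

    def backward(self, err_out):
        # The error arriving now belongs to the OLDEST queued input: newer mini-batches
        # entered the pipeline while this one made its round trip.
        x = self.inputs.popleft()
        h = np.tanh(x @ self.w)
        grad_pre = err_out * (1.0 - h ** 2)        # error at this stage's pre-activation
        err_in = grad_pre @ self.w.T               # error to pass to the previous stage
        self.w -= self.lr * x.T @ grad_pre / len(x)
        return err_in

# Two mini-batches enter the stage before the first error arrives (pipeline delay of one).
stage = PipelineStage(in_dim=8, out_dim=4)
out_a = stage.forward(np.random.randn(16, 8))
out_b = stage.forward(np.random.randn(16, 8))
err_a = stage.backward(np.random.randn(16, 4))     # paired with the first queued activation
```

In an actual pipeline, forward and backward calls for different mini-batches interleave, so the queue holds several activations at any moment; popping the oldest entry pairs each arriving error with the input that produced it.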
Furthermore, while the pipeline method in
The method 550 continues by updating a common global model of the neural network based upon the local models. The common model may be updated (552) according to the principles described herein. For example, the method 550 may utilize a locking procedure and employ the timing described hereinabove in relation to
According to an embodiment of the method 550, performing the pipelined gradient analysis (551) comprises splitting the respective local models of the neural network into consecutive chunks and assigning each chunk to a stage of a pipeline. In such an embodiment, this may be performed by each agent of the plurality of agents so as to implement a method where every agent is performing a pipelined analysis. Further still, according to an embodiment, each stage of the pipeline may be associated with a GPU, where the GPUs perform the gradient analysis. Moreover, in an embodiment where the neural network is split into consecutive chunks and each chunk is assigned a stage in the pipeline, each stage may be assigned a respective GPU. In this way, embodiments may provide DNN multi-GPU parallel training through hierarchical data splitting and pipelining.
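For illustration, the following sketch splits the weight blocks of a DNN (one block per pair of consecutive layers, as defined above) into consecutive chunks and assigns one chunk per pipeline stage; the layer sizes and the even-split policy are illustrative assumptions.

```python
def split_into_chunks(layer_sizes, n_stages):
    """Split the weight blocks of a DNN into consecutive chunks, one chunk per stage."""
    # One weight block per pair of adjacent layers, in network order.
    blocks = list(zip(layer_sizes[:-1], layer_sizes[1:]))
    per_stage, extra = divmod(len(blocks), n_stages)
    chunks, start = [], 0
    for stage in range(n_stages):
        size = per_stage + (1 if stage < extra else 0)   # earlier stages absorb any remainder
        chunks.append(blocks[start:start + size])
        start += size
    return chunks

# Hypothetical 7-layer DNN (6 weight blocks) split across a 4-stage pipeline, one stage
# per GPU: the first two stages receive two blocks each, the last two one block each.
for stage, chunk in enumerate(split_into_chunks([429, 2048, 2048, 2048, 2048, 2048, 9000], 4)):
    print(f"stage {stage} (GPU {stage}): {chunk}")
```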
According to yet another example embodiment of the method 550, performing the pipelined gradient analysis (551) may further include selecting the subsets of data from the common pool of training data according to a focused-attention back-propagation (FABP) or stochastic data sweeping strategy. In such an embodiment, the selected data is used to perform the pipelined gradient analysis (551).
Embodiments of the method 550 may further employ an initialization procedure. An example initialization procedure is implemented by a single agent of the plurality. In such an embodiment, the single agent performs the pipelined gradient analysis using a subset of data from the common pool of training data and, further, updates the common global model of the neural network based upon its local model. In an example embodiment, an agent may perform the initialization procedure using a pre-established amount of the training data, for example, 20%. Yet another initialization procedure includes one agent of the plurality performing the analysis and then at some subsequent time, the remaining agents starting the analysis. In such an embodiment, the agents may all start performing the analysis during a first iteration. In yet another embodiment, an initialization procedure may be performed where each agent is started gradually during the first iteration, for example at regular intervals of time, e.g. a first agent at time t=0, a second agent at time t=T, and a third agent at time t=2T, etc.
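For illustration, the following is a small Python sketch of the gradual start-up described above, launching one agent every T seconds using timers; the agent body and the interval are illustrative placeholders for the actual pipelined gradient analysis.

```python
import threading
import time

def start_agents_staggered(agent_fns, interval_s):
    """Launch one agent every interval_s seconds: agent 0 at t=0, agent 1 at t=T, ..."""
    threads = []
    for i, fn in enumerate(agent_fns):
        t = threading.Timer(i * interval_s, fn)
        t.start()
        threads.append(t)
    return threads

# Hypothetical agent body: each agent would run its pipelined gradient analysis here.
def make_agent(name):
    return lambda: print(f"{name} started at {time.strftime('%X')}")

threads = start_agents_staggered([make_agent(f"agent-{i}") for i in range(3)], interval_s=2)
for t in threads:
    t.join()
```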
According to an embodiment of the method 550, the common global model is owned by a single agent of the plurality of agents at any one time according to a locking mechanism. In such an embodiment, the common global model is updated by the single agent during a period in which the single agent owns the common global model. According to such an embodiment, the plurality of agents work in conjunction to implement said locking mechanism. According to an example locking mechanism, upon beginning the method 550, a single agent owns the common global model. As the method 550 progresses, a critical section is reached, and when this critical section is reached, the agent owning the global model updates the model to reflect all of the changes to its respective local model that were determined by the gradient analysis. The common global model is then transferred to a next agent of the plurality to allow that agent to update the common global model. In such an embodiment, a critical section is a point when multiple agents of the plurality of agents may need to update a same section of the common model but only one agent of the plurality may update the model at a time. In other words, the critical section is entered when an agent of the plurality has reached a point in processing at which it is ready to update the global model but does not own the global model. At this point, the agent that is ready to update the global model needs to download the global model from the agent that currently owns it. However, if the global model is being updated by the agent that owns it, then the critical section prevents the requesting agent from performing the download and queues the download request. When the agent that owns the global model finishes modifying the global model, it then exits the critical section and makes the global model public, i.e., available to be downloaded by another agent. At this point, the agent in the critical section queue can lock, download, and modify the global model. Thus, the critical section also serves to manage a queue of agents that have requested the model and implements exclusive access to the global model.
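For illustration, the following is a minimal in-process sketch of such a locking mechanism, using a mutex to implement the critical section: agents that are ready to update the global model block on the lock (forming the request queue), and each agent in turn obtains exclusive access, folds in its gradient changes, and releases ownership on exit. In the described embodiments the global model would be transferred between agents rather than shared in memory; the thread-based setup and the random stand-in for gradients are illustrative assumptions.

```python
import threading
import numpy as np

class GlobalModel:
    """Shared global model; a mutex implements the critical section around updates."""
    def __init__(self, n_params):
        self.params = np.zeros(n_params)
        self._lock = threading.Lock()        # blocked acquirers form the request queue

    def apply_update(self, delta):
        with self._lock:                     # enter the critical section: exclusive ownership
            current = self.params.copy()     # "download" the current global model
            self.params = current + delta    # fold this agent's gradient changes back in
        # leaving the 'with' block exits the critical section and releases ownership

def agent(global_model, seed, n_updates=5):
    rng = np.random.default_rng(seed)
    for _ in range(n_updates):
        # Stand-in for the gradient changes accumulated in the agent's local work model.
        delta = rng.normal(0.0, 0.01, global_model.params.shape)
        global_model.apply_update(delta)

model = GlobalModel(10)
threads = [threading.Thread(target=agent, args=(model, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(model.params)
```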
The system 660 may implement any method described herein. For example, the CPU 662 and memory 666 and/or storage 665, with computer code instructions stored thereon, may be configured to cause the system 660 to provide a plurality of agents each configured to perform a pipelined gradient analysis to update respective local models of a neural network using respective subsets of data from a common pool of training data and, in turn, update a common global model of the neural network based upon the local models.
Various hardware components may be used in performing the above-described methods and, further, this variety of hardware components may be configured in numerous ways to implement the various methods and systems described hereinabove. One example embodiment of the present invention is implemented using a single server with four agents, each using pipelined parallelization on four GPUs. In such an embodiment, the GPUs connected on the first PCIe bus of the server may contain stages one and two of the four pipelines, and the GPUs connected on the second PCIe bus of the server may contain stages three and four of the four pipelines, thus utilizing the sixteen GPUs provided by the server. Further, the agents may be executed by CPU threads, among other examples.
As described herein, the agents utilize two shared information resources: the official model and the training data. According to embodiments, the agents may be configured to access the official model in a devoted critical region, i.e., section, and similarly pick up new training data in a devoted critical region. This ensures that no two agents can access the same training data or modify the global model at the same time. In order to minimize computational costs, embodiments may further utilize conditional events to make agents "sleep" when the training data is finished and "wake up" when the last agent concludes its processing. Moreover, in another embodiment, each agent may be configured to allocate its respective local model in GPU memory, thus avoiding transfers from CPU to GPU RAM and, in turn, further increasing computational efficiency.
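For illustration, the following sketch shows a shared training-data pool accessed in a critical section, together with a condition variable that puts agents to "sleep" when the data is exhausted and wakes them when the last agent finishes. Here the woken agents simply end the epoch; the chunk contents and agent bodies are illustrative placeholders.

```python
import threading

class TrainingDataPool:
    """Shared pool of training chunks; agents pick up chunks in a critical section."""
    def __init__(self, chunks, n_agents):
        self.chunks = list(chunks)
        self.busy = n_agents                 # agents still processing this epoch
        self.cond = threading.Condition()

    def next_chunk(self):
        with self.cond:                      # critical section around the shared data
            if self.chunks:
                return self.chunks.pop()
            # No data left: this agent is done; the last one to finish wakes the others.
            self.busy -= 1
            if self.busy == 0:
                self.cond.notify_all()
            else:
                self.cond.wait()             # "sleep" until the last agent concludes
            return None

def agent(pool, name):
    while (chunk := pool.next_chunk()) is not None:
        pass                                 # gradient analysis on `chunk` would go here
    print(f"{name} finished the epoch")

pool = TrainingDataPool(chunks=range(12), n_agents=3)
threads = [threading.Thread(target=agent, args=(pool, f"agent-{i}")) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```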
The methods and systems may be tuned and varied so as to employ the most efficient approaches given available hardware and software. One example architecture may employ a server of sixteen GPUs that can implement four agents each using a pipeline length of four.
The described training techniques were tested for validation. Experiments involved tuning the following recipes:
1) Large DNN Scorer for NCS US English
Experimental setup:
In this case the following configurations were tested:
The results are shown below in Table 1.
2) Increasing the Number of GPUs
Using the same case study as above, this experiment increased the number of GPU jobs, moving from 16 (DP=4, PP=4) to 32 (DP=8, PP=4) and 64 (DP=16, PP=4). Such a method requires 32 or 64 GPUs on the same server, but may be implemented using 16 GPUs in time sharing. This is useful for fully loading the GPUs, masking the model transfer time of data parallelization (DP) and the imperfect balancing of pipelined parallelization (PP). Table 2 below shows the GPU loading of this experiment.
The results from increasing the number of GPUs are shown below in Table 3.
As shown above, 32 logical GPUs (mapped on 16 physical GPUs) resulted in a 37% speed-up, and 64 logical GPUs (mapped on 16 physical GPUs) resulted in a 60% speed-up. These experiments show that hierarchical integration of data splitting and pipelining makes the parallelization more efficient and leads to large speed-ups. On a large DNN scorer trained with approximately 1300 hours of data, the total speed-up was 9.6 times, achieved by running 16 jobs on 16 GPUs in parallel. The aforementioned speed-up can be improved to 15.4× by running 64 jobs on 16 GPUs with DP=16 and PP=4. Thus, training a similar-size DNN with 10,000 hours of data would take approximately 6 days instead of 3 months. This could be further reduced if the training is implemented on a single server with 32 physical GPUs.
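The reported figures can be cross-checked with simple arithmetic; the 90-day baseline below is an assumption derived from the "approximately 3 months" figure and is not stated explicitly above.

```python
# Cross-check of the reported figures (assuming "3 months" is roughly 90 days of baseline training).
baseline_speed_up = 9.6                 # 16 jobs on 16 GPUs (DP=4, PP=4)
print(baseline_speed_up * 1.37)         # ~13.2x with 32 logical GPUs (37% faster)
print(baseline_speed_up * 1.60)         # ~15.4x with 64 logical GPUs (60% faster)
print(90 / 15.4)                        # ~5.8 days for the projected 10,000-hour training set
```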
It should be understood that the example embodiments described above may be implemented in many different ways. In some instances, the various methods and machines described herein may each be implemented by a physical, virtual, or hybrid general purpose computer, or a computer network environment.
Embodiments or aspects thereof may be implemented in the form of hardware, firmware, or software. If implemented in software, the software may be stored on any non-transient computer readable medium that is configured to enable a processor to load the software or subsets of instructions thereof. The processor then executes the instructions and is configured to operate or cause an apparatus to operate in a manner as described herein.
Further, firmware, software, routines, or instructions may be described herein as performing certain actions and/or functions of the data processors. However, it should be appreciated that such descriptions contained herein are merely for convenience and that such actions in fact result from computing devices, processors, controllers, or other devices executing the firmware, software, routines, instructions, etc.
It should also be understood that the flow diagrams, block diagrams, and network diagrams may include more or fewer elements, be arranged differently, or be represented differently. But it further should be understood that certain implementations may dictate the block and network diagrams and the number of block and network diagrams illustrating the execution of the embodiments be implemented in a particular way.
Accordingly, further embodiments may also be implemented in a variety of computer architectures, physical, virtual, cloud computers, and/or some combination thereof, and, thus, the data processors described herein are intended for purposes of illustration only and not as a limitation of the embodiments.
While this invention has been particularly shown and described with references to example embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention encompassed by the appended claims.