Various aspects of the present disclosure may pertain to various forms of neural network interconnection for efficient training.
Due to recent optimizations, neural networks may be favored as a solution for adaptive learning-based recognition systems. They may currently be used in many applications, including, for example, intelligent web browsers, drug searching, and identity recognition by face or voice.
Fully-connected neural networks may consist of a plurality of nodes, where each node may process the same plurality of input values and produce an output according to some function of its input values. The functions may be non-linear, and the input values may be either primary inputs or outputs from internal nodes. Many current applications may use partially- or fully-connected neural networks, e.g., as shown in the attached drawings.
Multi-processor systems or array processor systems, such as graphic processing units (GPUs), may perform the neural network computations on one input pattern at a time. Alternatively, special purpose hardware, such as the triangular scalable neural array processor described by Pechanek et al. in U.S. Pat. No. 5,509,106, granted Apr. 16, 1996, may also be used.
These approaches may require large amounts of fast memory to hold the large number of weights necessary to perform the computations. Alternatively, in a “batch” mode, many input patterns may be processed in parallel on the same neural network, thereby allowing the weights to be used across many input patterns. Typically, batch mode may be used during learning, which may require iterative perturbation of the neural network and corresponding iterative application of large sets of input patterns to the perturbed neural network. Furthermore, each perturbation of the neural network may consist of a combination of error back-propagation to generate gradients for the neural network weights and accumulation of those gradients over the sets of input patterns to generate a set of updates for the weights.
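By way of a non-limiting illustration, the following Python/NumPy sketch shows one way the per-pattern gradients might be accumulated over a set of input patterns to produce a single set of weight updates; the backprop helper passed in is a hypothetical placeholder for the forward recognition and error back-propagation described above, not a function of any referenced system.

```python
import numpy as np

def batch_update(weights, patterns, targets, backprop, lr=0.01):
    """Accumulate per-pattern gradients over a batch, then update once.

    `backprop(weights, x, t)` is a hypothetical helper returning the
    gradient of the loss for one input pattern; it stands in for the
    forward-recognition / error back-propagation pair described above.
    """
    grad_sum = np.zeros_like(weights)
    for x, t in zip(patterns, targets):
        grad_sum += backprop(weights, x, t)   # accumulate gradients over the batch
    return weights - lr * grad_sum            # one weight update for the whole batch
```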
As the training and verification sets grow, the computation time for each perturbation grows, significantly lengthening the time to train a neural network. To speed up the neural network computation, Merrill et al. describe spreading the computations across many heterogeneous combinations of processors in U.S. patent application Ser. No. 14/713,529, filed May 15, 2015, and incorporated herein by reference. Unfortunately, as the number of processors grows, the communication of the weight gradients and updates may limit the resulting performance improvement. As such, it may be desirable to create a communication architecture that scales with the number of processors.
Various aspects of the present disclosure may include scalable structures for communicating neural network weight gradients and updates between a root processor and a large plurality of neural network workers (NNWs), each of which may contain one or more processors performing one or more pattern recognitions (or other tasks for which neural networks may be appropriate; the discussion here refers to “pattern recognitions,” but it is contemplated that the invention is not thus limited) and corresponding back-propagations on the same neural network, in a scalable neural network system (SNNS).
In one aspect, the communication structure may consist of a plurality of synchronizing sub-systems (SSS), which may each be connected to one parent and a plurality of children in a multi-level tree structure connecting the NNWs to the root processor of the SNNS.
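A minimal software sketch of such a multi-level tree is shown below; the fan-out, the worker labels, and the representation of an SSS unit as a simple tuple are illustrative assumptions only.

```python
def build_tree(leaves, fan_out=4):
    """Group worker (NNW) leaves under SSS nodes level by level until a
    single node remains below the root processor.  Each SSS node is
    represented here simply as a tuple holding the list of its children."""
    level = list(leaves)
    while len(level) > 1:
        level = [("SSS", level[i:i + fan_out])
                 for i in range(0, len(level), fan_out)]
    return level[0]

# e.g., sixteen atomic workers grouped into a two-level tree of SSS nodes
tree = build_tree([f"AW{i}" for i in range(16)], fan_out=4)
```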
In another aspect, each of the SSS units may broadcast packets from a single source to a plurality of targets, and may combine the contents of a packet from each of the plurality of targets into a single resulting equivalent-sized packet to send to the source.
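The broadcast-down and combine-up behaviour of an SSS unit may be pictured with the following hedged Python sketch, in which summation is used as the combining function (only one of the options contemplated herein) and the children are assumed to expose a receive() method.

```python
import numpy as np

class SSSNode:
    """Software analogue of a synchronizing sub-system (SSS) unit."""

    def __init__(self, children):
        self.children = children            # child SSS units or workers (assumed
                                            # to expose a receive() method)

    def broadcast_down(self, packet):
        # Send the same packet (e.g., weight updates) to every child.
        return [child.receive(packet) for child in self.children]

    def combine_up(self, packets):
        # Combine one equal-sized packet from each child into a single
        # equivalent-sized packet (here, an element-wise sum of gradients).
        return np.sum(np.stack(packets), axis=0)
```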
Other aspects may include sending and receiving data between the parent and children of each SSS unit on either bidirectional buses or pairs of unidirectional buses, compressing and decompressing the packet data in the SSS unit, using buffer memory in the SSS unit to synchronize the flow of data, and/or managing the number of children being used by controlling the flow of data through the SSS units.
The NNWs may be either atomic workers (AWs) performing a single pattern recognition and corresponding back-propagation on a single neural network or may be composite workers (CWs) performing many pattern recognitions on a single neural network in a batch fashion. These composite workers may consist of batch neural network processors (BNNPs) or any combination of SSS units and AWs or BNNPs.
The compression may, like pulse code modulation, reduce the data to as little as strings of single bits, where each bit may correspond to an increment of the gradient or an increment of a weight update, and where each of the gradient increments may be different for each of the NNWs and for each of the weights.
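The single-bit increment idea may be sketched, purely by way of example, as a sign-based quantizer with a per-worker residual; the fixed step size and the residual bookkeeping below are illustrative assumptions rather than a required implementation.

```python
import numpy as np

def encode_increments(gradient, residual, step):
    """Reduce a gradient vector to one bit per weight: each bit selects
    +step or -step, and the quantization error is carried in a per-worker
    residual so that, over iterations, the increments track the gradient."""
    pending = gradient + residual
    bits = pending >= 0                       # one bit per weight
    increments = np.where(bits, step, -step)
    new_residual = pending - increments       # error fed into the next cycle
    return bits, new_residual

def decode_increments(bits, step):
    """Expand the single-bit string back into signed increments."""
    return np.where(bits, step, -step)
```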
Combining the data may consist of summing the data from each of the children below the SSS unit, or may consist of performing other statistical functions, such as means, variances, and/or higher-order statistical moments, which may include time- or data-dependent growth and/or decay functions.
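The following sketch illustrates, without limitation, several of the combining functions named above, together with a simple exponential decay standing in for a time-dependent decay function; the decay constant is arbitrary.

```python
import numpy as np

def combine(child_packets, mode="sum", running=None, decay=0.9):
    """Combine one packet per child by the selected statistic, optionally
    folding the result into a decayed running value."""
    stacked = np.stack(child_packets)          # one row per child
    if mode == "sum":
        result = stacked.sum(axis=0)
    elif mode == "mean":
        result = stacked.mean(axis=0)
    elif mode == "variance":
        result = stacked.var(axis=0)
    else:
        raise ValueError(mode)
    if running is not None:                    # time-dependent decay function
        result = decay * running + (1.0 - decay) * result
    return result
```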
It is also contemplated that the SSS units may be employed to continuously gather and generate observational statistics while continuously distributing control information, and it is further contemplated that observational and control information may be locally adjusted at each SSS unit.
Various aspects of the disclosed subject matter may be implemented in hardware, software, firmware, or combinations thereof. Implementations may include a computer-readable medium that may store executable instructions that may result in the execution of various operations that implement various aspects of this disclosure.
Embodiments of the invention will now be described in connection with the attached drawings.
Various aspects of this disclosure are now described with reference to the attached drawings.
In one aspect of this disclosure, the communication structure within a SNNS may consist of a plurality of synchronizing sub-systems (SSS), which may each be connected to one parent and a plurality of children in a multi-level tree structure connecting the AWs or CWs to the root processor.
In another aspect, at a system level, in a manner similar to Pechanek's adder tree, the SSS units may successively combine the gradient packets from the NNWs as they ascend the tree toward the root processor, while the weight updates may be broadcast back down the tree from the root processor to the NNWs.
In another aspect of the current disclosure, the data may be combined and compressed by normalizing, scaling or reducing the precision of the results. Similarly, the data may be adjusted to reflect the scale or precision of each of the children before the data is distributed to the children.
During the iterative process of forward pattern recognition followed by back-propagation of error signals, as the training reaches either a local or global minimum, the gradients and the resulting updates may become incrementally smaller. As such, the compression may, like pulse code modulation, reduce the word size of the resulting gradients and weights, which may thereby reduce the communication time required for each iteration. The control logic 36 may receive word size adjustments from either the root processor or from each of the plurality of the children. In either case, adjustments to scale and/or word size may be performed prior to combining the data for transmission to the parent or subsequent to distribution for each of the children.
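One software analogue of reducing the word size as training converges is sketched below: gradients are rescaled and rounded to a chosen number of bits before transmission and rescaled again after reception. The negotiation of bit widths through the control logic 36 is not modelled here, and the function names are illustrative.

```python
import numpy as np

def quantize(values, bits, scale):
    """Round values to a signed fixed-point word of the given size."""
    q_max = 2 ** (bits - 1) - 1
    return np.clip(np.round(values / scale), -q_max, q_max).astype(np.int32)

def dequantize(q, scale):
    """Restore quantized words to the original scale after reception."""
    return q.astype(np.float64) * scale

# As gradients shrink near a minimum, a smaller word size (and scale) may be
# requested, e.g., dropping from 16-bit to 8-bit packets for each iteration.
```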
In another aspect of the current disclosure, the control logic 36 may, via commands from the root processor, turn on or turn off one or more of its children, by passing an adjusted command on to the respective children and correspondingly adjusting the computation to combine the resulting data from the children.
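A minimal sketch of how disabling children might be reflected in the combining computation is given below; the active-flag representation is an assumption, and the active count is returned so that any averaging at the parent can be renormalized accordingly.

```python
import numpy as np

def combine_active(child_packets, active_flags):
    """Combine only the packets from children that are currently turned on.

    Returns the combined packet and the number of active children, so that
    any averaging can be renormalized to the active count."""
    active = [p for p, on in zip(child_packets, active_flags) if on]
    if not active:
        raise RuntimeError("no active children below this SSS unit")
    return np.stack(active).sum(axis=0), len(active)
```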
In yet another aspect of the current disclosure, the control logic 36 may synchronize the packets received from the children by storing the early packets of gradients and, if necessary, stalling one or more of the respective children until the corresponding gradients have been received from all the children, which may then be combined and transmitted to the parent.
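The store-and-stall behaviour may be pictured with a simple buffer keyed by child, as in the following sketch; a packet from every child must arrive before the combined result is forwarded to the parent, and the surrounding flow control is omitted.

```python
import numpy as np

class GradientBuffer:
    """Hold early gradient packets until one has arrived from every child."""

    def __init__(self, num_children):
        self.num_children = num_children
        self.pending = {}                       # child index -> packet

    def deposit(self, child_index, packet):
        """Store a child's packet; return True once every child has reported."""
        self.pending[child_index] = packet
        return len(self.pending) == self.num_children

    def combine_and_clear(self):
        """Sum the buffered packets (to be forwarded to the parent) and reset."""
        combined = np.sum(np.stack(list(self.pending.values())), axis=0)
        self.pending.clear()
        return combined
```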
It may be noted here that all the AWs, BNNPs, and CWs may have separate local memories, which may initially contain the same neural network with the same weights. It is further contemplated that the combining of a current cycle's gradients may coincide with a distribution of a next cycle's weight updates, and that if the gradients take too long to collect, updates may be distributed before all of the current cycle's gradients have been combined, thereby beginning the processing of the next cycle and causing the weights to vary between the different NNWs. As such, the root processor may choose to stall all subsequent iterations until all the NNWs have been re-synchronized.
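Purely as an illustration of the re-synchronization decision, the sketch below compares per-worker weight-version counters against the current iteration and, when the lag exceeds a threshold, re-broadcasts the root's weights; the version counters and lag threshold are assumptions not drawn from the disclosure.

```python
def maybe_resynchronize(root_weights, worker_versions, current_iteration, max_lag=1):
    """If any NNW's weights have fallen more than `max_lag` cycles behind,
    stall and re-broadcast the root's weights so that every NNW resumes
    from the same neural network (hypothetical version-counter scheme)."""
    if current_iteration - min(worker_versions) > max_lag:
        return [root_weights.copy() for _ in worker_versions]  # one copy per NNW
    return None  # workers are close enough; continue without stalling
```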
Furthermore, the root processor may choose to reorder the weights into categories, e.g., from largest to smallest changing weights and, thereafter, may drop one or more of the weight categories on each iteration.
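The reordering-and-dropping idea resembles magnitude-based sparsification and may be sketched as follows; splitting the ordered indices into four equal categories and keeping two of them are arbitrary illustrative choices, and the weight updates are assumed to be a one-dimensional array.

```python
import numpy as np

def categorize_by_change(weight_updates, num_categories=4):
    """Order weight indices from largest to smallest change and split them
    into equal-sized categories."""
    order = np.argsort(-np.abs(weight_updates))
    return np.array_split(order, num_categories)

def drop_categories(weight_updates, categories, keep=2):
    """Transmit only the `keep` most-changing categories; the remaining
    categories are dropped for this iteration."""
    kept = np.concatenate(categories[:keep])
    sparse = np.zeros_like(weight_updates)
    sparse[kept] = weight_updates[kept]
    return sparse
```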
When combined, these techniques may maximize the utilization of the AWs and CWs, by minimizing the communication overhead in the neural network system, thereby making it a more scalable neural network system.
Lastly, in yet another aspect of the current disclosure, the SSS units may be employed between a root processor and a plurality of continuous sensor-controller units to continuously gather and generate observational statistics while continuously distributing control information, and it is further contemplated that the observational and control information may be locally adjusted at each SSS unit.
It will be appreciated by persons skilled in the art that the present invention is not limited by what has been particularly shown and described hereinabove. Rather the scope of the present invention includes both combinations and sub-combinations of various features described hereinabove as well as modifications and variations which would occur to persons skilled in the art upon reading the foregoing description and which are not in the prior art.
This application is a non-provisional application claiming priority to U.S. Provisional Patent Application No. 62/164,645, filed on May 21, 2015, and incorporated by reference herein.