Various aspects of the present disclosure may pertain to various forms of neural network batch processing, ranging from custom hardware architectures to multi-processor software implementations, and to parallel control of multiple job streams.
Due to recent optimizations, neural networks may be favored as a solution for adaptive learning-based recognition systems. They may currently be used in many applications, including, for example, intelligent web browsers, drug searching, and voice and face recognition.
Fully-connected neural networks may consist of a plurality of nodes, where each node may process the same plurality of input values and produce an output, according to some function of its input values. The functions may be non-linear, and the input values may be either primary inputs or outputs from internal nodes. Many current applications may use partially- or fully-connected neural networks, e.g., as shown in
Multi-processor systems or array processor systems, such as Graphic Processing Units (GPUs), may perform the neural network computations on one input pattern at a time. This approach may require large amounts of fast memory to hold the large number of weights necessary to perform the computations. Alternatively, in a “batch” mode, many input patterns may be processed in parallel on the same neural network, thereby allowing the weights to be used across many input patterns. Typically, batch mode may be used when learning, which may require iterative perturbation of the neural network and corresponding iterative application of large sets of input patterns to the perturbed neural network. Skeirik, in U.S. Pat. No. 5,826,249, granted Oct. 20, 1998, describes batching groups of input patterns derived from historical time-stamped data.
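As an illustration of the weight reuse that batch mode provides, the following minimal sketch (in Python with NumPy, using illustrative names such as layer_one_at_a_time and layer_batched that are not part of this disclosure) contrasts processing one input pattern at a time, where the weights are conceptually re-fetched for every pattern, with processing a whole batch against a single fetch of the weights:

    # A minimal sketch (not the claimed hardware) contrasting one-at-a-time
    # processing with batch-mode processing of a single fully-connected layer.
    # Names such as `weights` and `patterns` are illustrative only.
    import numpy as np

    def layer_one_at_a_time(weights, patterns):
        # Weights are (conceptually) re-fetched for every input pattern.
        outputs = []
        for p in patterns:                               # one pattern per pass
            outputs.append(np.maximum(weights @ p, 0.0))  # ReLU as an example non-linearity
        return np.stack(outputs)

    def layer_batched(weights, patterns):
        # One fetch of the weights serves the whole batch of patterns.
        return np.maximum(patterns @ weights.T, 0.0)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        W = rng.standard_normal((64, 128))   # 64 nodes, 128 inputs each
        X = rng.standard_normal((32, 128))   # batch of 32 input patterns
        assert np.allclose(layer_one_at_a_time(W, X), layer_batched(W, X))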
Recent systems, such as internet recognition systems, may be applying the same neural network to large numbers of user input patterns. Even in batch mode, this may be a time-consuming process with unacceptable response times. Hence, it may be desirable to have a form of efficient real-time batch mode, not presently available for normal pattern recognition.
Various aspects of the present disclosure may include hardware-assisted iterative partial processing of multiple pattern recognitions, or jobs, in parallel, where the weights associated with the pattern inputs, which are common to all of the jobs, may be streamed into the parallel processors from external memory.
In one aspect, a batch neural network processor (BNNP) may include a plurality of field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs), each containing a large number of inner product units (IPUs), image buffers with interconnecting busses, and control logic, where a plurality of pattern recognition jobs may be loaded, each into one of the plurality of image buffers, and weights for computing each of the nodes may be loaded into the BNNP from external memory. The IPUs may perform, for example, inner product, max pooling, average pooling, and/or local normalization based on opcodes from associated job control logic, and the image buffers may be controlled by data & address control logic.
Other aspects may include a batch-based neural network system comprised of a load scheduler that connects a plurality of job dispatchers to a plurality of job initiators, each controlling a plurality of associated BNNPs, with virtual communication channels to transfer jobs and results between/among them. The BNNPs may be comprised of GPUs, general purpose multi-processors, FPGAs, or ASICs, or combinations thereof. Upon notification to a load scheduler of the completion of a batch of jobs, the job dispatcher may choose to either keep or terminate a communication link, which may be based on the status of other batches of jobs already sent to the BNNP or plurality of BNNPs. Alternatively, upon notification of completion of the batch of jobs, the load scheduler may choose to either keep or terminate the link, based, e.g., on other requests for and the availability of equivalent resources. The job dispatcher may reside in the user's server or in the load scheduler's server. Also, the job dispatcher may choose to request a BNNP for a partial batch of jobs or to send an assigned BNNP a partial batch of jobs over an existing communication link.
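A minimal software sketch of the batch-based system summarized above is given below, using assumed class names (LoadScheduler, JobDispatcher, Bnnp) that are illustrative only: a job dispatcher requests an available BNNP from the load scheduler, submits a batch of jobs to it over a (virtual) link, and either keeps or releases the link when the batch completes:

    # A minimal sketch, under assumed names, of the batch-based system:
    # a load scheduler hands an available BNNP to a job dispatcher, which
    # submits a batch of jobs and collects the results. None of these
    # classes represent the claimed hardware.
    from dataclasses import dataclass, field

    @dataclass
    class Bnnp:
        batch_size: int                            # M jobs processed in parallel
        def run_batch(self, jobs):
            # Placeholder for the hardware computation of one batch.
            return [f"result-for-{j}" for j in jobs]

    @dataclass
    class LoadScheduler:
        pool: list = field(default_factory=list)   # available BNNPs
        def request_bnnp(self):
            return self.pool.pop() if self.pool else None
        def release_bnnp(self, bnnp):
            self.pool.append(bnnp)

    @dataclass
    class JobDispatcher:
        scheduler: LoadScheduler
        def dispatch(self, jobs, keep_link=False):
            bnnp = self.scheduler.request_bnnp()
            if bnnp is None:
                raise RuntimeError("no BNNP available")
            results = bnnp.run_batch(jobs)
            if not keep_link:                      # terminate the link when no
                self.scheduler.release_bnnp(bnnp)  # more batches are pending
            return results

    scheduler = LoadScheduler(pool=[Bnnp(batch_size=16)])
    dispatcher = JobDispatcher(scheduler)
    print(dispatcher.dispatch([f"job-{i}" for i in range(16)]))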
Various aspects of the disclosed subject matter may be implemented in hardware, software, firmware, or combinations thereof. Implementations may include a computer-readable medium that may store executable instructions, the execution of which may result in operations that implement various aspects of this disclosure.
Embodiments of the invention will now be described in connection with the attached drawings, in which:
Various aspects of this disclosure are now described with reference to
In one aspect of this disclosure, a BNNP may include a plurality of FPGAs and/or ASICs, which may each contain a large number of IPUs, image buffers with interconnecting buses, and control logic, where a plurality of pattern recognition jobs may be loaded, each into one of the plurality of image buffers, and weights for computing each of the nodes may be loaded into the BNNP from external memory.
Reference is now made to
A batch of, for example, pattern recognition jobs may initially consist of a plurality of input patterns, one pattern per job, that may be inputted to a common neural network with one set of weights for all the jobs in the batch. To perform such a batch, the patterns may be initially loaded from the I/O bus 31 onto the image bus 30 to be written into the plurality of image buffers 20, one input pattern per image buffer, followed by commands written to the D&A control logic 25 to begin the neural network computations. The D&A control logic 25 may begin the neural network computations by simultaneously issuing burst read commands with addresses through the memory interface 24 to external memory, which may be, for example, double data rate (DDR) memory (not shown), while issuing commands for each job to its respective job control logic 21. There may be M*N IPUs in each FPGA, where each of M jobs may simultaneously use N IPUs to calculate the values of N nodes in each layer (where M and N are positive integers). This may be performed by simultaneously loading M words, one word from each job's image buffer 20, into each job's N IPUs 22, while inputting N words from the external memory, one word for each of the N IPUs 22 in all M jobs. This process may continue until all the image buffer data has been loaded into the IPUs 22, after which the IPUs 22 may output their results to each of their respective job buses 27; this may be performed one row of IPUs 22 at a time, for N cycles, with the results written into the image buffers 20. To compute one layer of the neural network, this process may be repeated until all nodes in the layer have been computed, after which the original inputs may be replaced with the results written into the image buffer, and the next layer may be computed, until all layers have been computed. The neural network results may then be returned through the I/O control logic 23 and the I/O bus 31.
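A behavioural sketch of this dataflow, written in Python with illustrative names and making no claim to represent the actual hardware, may help make the M-jobs-by-N-IPUs tiling concrete: for each group of N nodes, N weight words are streamed from external memory on each cycle and shared across all M jobs, while one word per job is read from that job's image buffer, and each layer's results replace the inputs in the image buffers:

    # A minimal behavioural sketch (not RTL) of the M-jobs-by-N-IPUs dataflow,
    # with illustrative names. Each weight word is read from "external memory"
    # once per cycle and shared by all M jobs.
    import numpy as np

    def bnnp_layer(image_buffers, weights, M, N):
        # image_buffers: (M, num_inputs) -- one input pattern per job
        # weights:       (num_nodes, num_inputs) -- common to all jobs
        num_nodes, num_inputs = weights.shape
        outputs = np.zeros((M, num_nodes))
        for first in range(0, num_nodes, N):        # one group of N nodes at a time
            group = weights[first:first + N]        # up to N nodes computed in parallel
            acc = np.zeros((M, group.shape[0]))     # the M*N IPU accumulators
            for i in range(num_inputs):             # one "cycle" per input word
                w = group[:, i]                     # N weight words from external memory
                x = image_buffers[:, i]             # one word from each job's image buffer
                acc += np.outer(x, w)               # each IPU does one multiply-accumulate
            outputs[:, first:first + group.shape[0]] = acc   # results back to the buffers
        return np.maximum(outputs, 0.0)             # example non-linearity

    M, N = 4, 8
    rng = np.random.default_rng(1)
    buffers = rng.standard_normal((M, 32))
    for layer_weights in (rng.standard_normal((16, 32)), rng.standard_normal((8, 16))):
        buffers = bnnp_layer(buffers, layer_weights, M, N)   # results replace the inputs
    print(buffers.shape)                            # (4, 8): one result vector per job

In the hardware described above, the inner loop over input words corresponds to memory burst cycles rather than software iterations; the sketch only mirrors the order in which data may be consumed.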
Therefore, according to one aspect of the present disclosure, a method for performing batch neural network processing may be as follows:
It is noted that the techniques disclosed here may pertain to training, processing of new data, or both.
According to another aspect of the present disclosure, the IPUs may perform inner product, max pooling, average pooling, and/or local normalization based on opcodes from the job control logic.
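As an illustration only, the following sketch dispatches on assumed opcode names (INNER_PRODUCT, MAX_POOL, AVG_POOL, LOCAL_NORM); the actual opcode encoding used by the job control logic and the exact local-normalization formula are not specified here:

    # A minimal sketch of opcode-driven IPU behaviour under assumed opcode
    # names; `window` stands for the slice of image-buffer data routed to the
    # IPU on a given step.
    import numpy as np

    def ipu_execute(opcode, window, weights=None):
        if opcode == "INNER_PRODUCT":
            return float(np.dot(window, weights))
        if opcode == "MAX_POOL":
            return float(np.max(window))
        if opcode == "AVG_POOL":
            return float(np.mean(window))
        if opcode == "LOCAL_NORM":
            # One common local-normalization form, shown only as an illustration.
            return float(window[len(window) // 2] / np.sqrt(1.0 + np.sum(window ** 2)))
        raise ValueError(f"unknown opcode {opcode!r}")

    data = np.array([0.5, -1.0, 2.0, 0.25])
    print(ipu_execute("INNER_PRODUCT", data, np.ones(4)))
    print(ipu_execute("MAX_POOL", data), ipu_execute("AVG_POOL", data))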
Reference is now made to
Reference is now made to
In this manner a first batch of jobs may be loaded into the BNNP, and a second batch of jobs may be loaded into a different bank of the image buffers 20 prior to completing the computation on the first batch of jobs, such that the second batch of jobs may begin processing immediately after completing the computation on the first batch of jobs. Furthermore, the results from the first batch of jobs may be returned while the processing continues on the second batch of jobs, if the combined size of the results and the final layer's inputs is less than the size of a bank. By loading the final results into the same bank where the final layer's inputs reside, the other bank may be simultaneously used to load the next batch of jobs. The results may be placed in a location that is an even multiple of N and is larger than the number of inputs, such that the final results do not overlap with the final layer's inputs.
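The bank ping-pong and the placement of the final results may be illustrated by the following minimal sketch, with assumed names: result_offset computes the smallest multiple of N at or beyond the final layer's inputs, and bank_schedule shows which bank is loading while the other is computing:

    # A minimal sketch of the two-bank (ping-pong) use of the image buffers and
    # of placing final results at a multiple of N beyond the final layer's
    # inputs so that the two do not overlap. Names are assumed.
    def result_offset(num_final_inputs, N):
        # Smallest multiple of N at or beyond num_final_inputs.
        return ((num_final_inputs + N - 1) // N) * N

    def bank_schedule(num_batches):
        # Step k: load batch k into bank k % 2 while computing batch k-1
        # in bank (k - 1) % 2.
        steps = []
        for k in range(num_batches + 1):
            load = f"load batch {k} into bank {k % 2}" if k < num_batches else "idle"
            compute = f"compute batch {k - 1} in bank {(k - 1) % 2}" if k > 0 else "idle"
            steps.append((load, compute))
        return steps

    print(result_offset(num_final_inputs=300, N=128))   # -> 384
    for load, compute in bank_schedule(3):
        print(f"{load:28s} | {compute}")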
In yet a further aspect of this disclosure, the image buffers 20 may be controlled by the D&A control logic 25. Reference is again made to
A BNNP need not necessarily reside on a single server; by “reside on,” it is meant that the BNNP may be implemented, e.g., in hardware associated with/controlled by a server or may be implemented in software on a server (as noted above, although hardware implementations are primarily discussed, analogous functions may be implemented in software stored in a memory medium and run on one or more processors). Rather, it is contemplated that the IPUs 22 of a BNNP may, in some cases (but not necessarily), reside on multiple servers/computing systems, as may various other components shown in
According to a further aspect of this disclosure, a batch-based neural network system may be composed of a load scheduler, a plurality of job dispatchers, and a plurality of initiators, each controlling a plurality of BNNPs, which may be comprised of GPUs, general purpose multi-processors, FPGAs, and/or ASICs. Reference is now made to
According to another aspect of this disclosure, upon notification to the load scheduler 60 of the completion of a batch of jobs, the job dispatcher 61 may choose to either keep or terminate the communication link, which may be based on the status of other batches of jobs already sent to the BNNP 63 or plurality of BNNPs 63. Alternatively, upon notification of completion of the batch of jobs, the load scheduler 60 may choose to keep or terminate the link, e.g., based on other requests for and the availability of equivalent resources.
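A minimal sketch of these keep-or-terminate decisions, with assumed function names and a deliberately simple policy, might look as follows:

    # A minimal sketch, under assumed names, of the keep-or-terminate decision:
    # the dispatcher keeps the link if it still has batches queued for this
    # BNNP, while the scheduler may reclaim it when other requests are waiting
    # and no equivalent resource is free. The policy is illustrative only.
    def dispatcher_keep_link(pending_batches_for_bnnp):
        return pending_batches_for_bnnp > 0

    def scheduler_keep_link(other_requests_waiting, equivalent_bnnps_free):
        return not (other_requests_waiting and equivalent_bnnps_free == 0)

    print(dispatcher_keep_link(2), scheduler_keep_link(True, 0))   # True False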
It is further contemplated that the job dispatcher 61 may reside either in the user's server or in the load scheduler's 60 server. Also, the job dispatcher 61 may choose to request a BNNP 63 for a partial batch of jobs (less than M jobs) or to send an assigned BNNP 63 a partial batch of jobs over an existing communication link. The decision may be based in part, e.g., on an estimated amount of time to fill the batch of jobs exceeding some threshold that may be derived, e.g., from a rolling average of the requests being submitted by the users 64. It is also contemplated that the threshold may be lower for sending the partial batch of jobs over an existing communication link than for requesting a new BNNP 63. Additionally, this may be repeated for multiple partial batches of jobs.
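The following sketch, with assumed names and thresholds, illustrates one way such a policy might be expressed: the expected time to fill a batch is estimated from a rolling average of user request inter-arrival times and compared against a lower threshold for an existing link and a higher threshold for requesting a new BNNP 63:

    # A minimal sketch, with assumed names, of a partial-batch policy based on
    # a rolling average of request inter-arrival times. Thresholds and window
    # size are illustrative only.
    from collections import deque

    class PartialBatchPolicy:
        def __init__(self, batch_size, t_existing_link, t_new_bnnp, window=32):
            self.batch_size = batch_size
            self.t_existing_link = t_existing_link    # seconds; lower threshold
            self.t_new_bnnp = t_new_bnnp              # seconds; higher threshold
            self.arrival_gaps = deque(maxlen=window)  # rolling window of inter-arrival times

        def record_arrival_gap(self, seconds):
            self.arrival_gaps.append(seconds)

        def time_to_fill(self, jobs_waiting):
            if not self.arrival_gaps:
                return float("inf")
            avg_gap = sum(self.arrival_gaps) / len(self.arrival_gaps)
            return (self.batch_size - jobs_waiting) * avg_gap

        def decide(self, jobs_waiting, have_link):
            threshold = self.t_existing_link if have_link else self.t_new_bnnp
            return "send partial batch" if self.time_to_fill(jobs_waiting) > threshold \
                   else "keep waiting"

    policy = PartialBatchPolicy(batch_size=16, t_existing_link=0.5, t_new_bnnp=2.0)
    for gap in (0.3, 0.4, 0.5):
        policy.record_arrival_gap(gap)
    print(policy.decide(jobs_waiting=10, have_link=True))    # "send partial batch"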
It is further noted that the servers hosting the various system components may also host BNNPs 63 or components thereof (e.g., one or more IPUs 22 and/or other components, as shown in
It will be appreciated by persons skilled in the art that the present invention is not limited by what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of various features described hereinabove, as well as modifications and variations which would occur to persons skilled in the art upon reading the foregoing description and which are not in the prior art.
This application is a non-provisional application claiming priority to U.S. Provisional Patent Application No. 62/160,209, filed on May 12, 2015, and incorporated by reference herein.