The present invention relates to a computing device for supporting operations required in neural networks. In particular, the present invention relates to a hardware architecture that achieves a many-fold speed improvement over the conventional hardware structure.
Today, artificial intelligence has been used in various applications such as perceptive recognition (visual or speech), expert systems, natural language processing, intelligent robots, digital assistants, etc. Artificial intelligence is expected to have various capabilities including creativity, problem solving, recognition, classification, learning, induction, deduction, language processing, planning, and knowledge. A neural network is a computational model inspired by the way biological neural networks in the human brain process information. Neural networks have become a powerful tool for machine learning, in particular deep learning, in recent years. In light of the power of neural networks, various dedicated hardware and software for implementing neural networks have been developed.
For an exemplary neural network with three inputs (X1, X2, X3), the weighted sum Yj at hidden node j is calculated as:

$$Y_j = \sum_{i=1}^{3} W_{ij} X_i, \qquad (1)$$
where Wij is the weight associated with Xi and Yj. The output yj at the hidden layer becomes:
$$y_j = f\Big(\sum_{i=1}^{3} W_{ij} X_i + b\Big), \qquad (2)$$
where b is the bias and ƒ(·) is the activation function.
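For a purely illustrative numeric example (the input values, weights, bias, and the choice of ReLU activation are assumptions of this description, not values from the disclosure), let the inputs be (1, 2, 3), the weights W1j = 0.5, W2j = −1, W3j = 0.25, the bias b = 0.5, and ƒ(x) = max(0, x):

$$Y_j = 0.5(1) + (-1)(2) + 0.25(3) = -0.75, \qquad y_j = f(-0.75 + 0.5) = f(-0.25) = 0.$$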
The output-layer values can be calculated similarly by using the hidden-layer outputs yj as inputs. Again, there is a weight associated with each contribution from yj.
As shown above, in each layer, the weighted sum has to be computed for each node. The vector sizes of the input layer, hidden layer, and output layer could be very large (e.g., 256). Therefore, the computations involved may become very extensive; for M inputs and N outputs, each layer requires M×N multiply-accumulate operations. In order to support the needed heavy computations efficiently, specialized hardware has been developed.
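As a minimal software sketch (illustrative only, not the claimed hardware; names are our own), the per-layer weighted sums can be expressed as nested loops, which makes the M×N multiplication count explicit:

```python
# Software sketch of one layer's weighted sums (illustrative, not the claimed hardware).
# For M inputs and N outputs, the loops perform M*N multiply-accumulate operations,
# which is why dedicated hardware is desirable for large layers.

def layer_weighted_sums(x, w):
    """x: list of M inputs; w: N-by-M weight matrix (w[j][i] pairs input i with output j)."""
    m, n = len(x), len(w)
    y = [0.0] * n
    for j in range(n):          # one weighted sum per output node
        acc = 0.0
        for i in range(m):      # M multiply-accumulate steps per output
            acc += w[j][i] * x[i]
        y[j] = acc
    return y

# Example: M = 3 inputs, N = 2 outputs.
print(layer_weighted_sums([1.0, 2.0, 3.0], [[0.5, -1.0, 0.25], [1.0, 0.0, -0.5]]))
# -> [-0.75, -0.5]
```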
In a layer with M inputs (X1, . . . , XM) and N outputs (Y1, . . . , YN), the weighted sums are computed as:

$$Y_j = \sum_{i=1}^{M} W_{ij} X_i, \quad \text{for } j = 1, \ldots, N. \qquad (3)$$
For the input layer, the activation vector 310 corresponds to the input vector (X1 . . . XM). The inputs are loaded into registers (320-1, 320-2, . . . , 320-M). The M PEs operate in a systolic fashion, where all PEs perform the same operations according to the system clocks. In particular, at one system clock, a multiplication is performed at the multiplier (211) of each PE 200. At the next system clock, the multiplication result from each multiplier (211) is added to the partial sum from the previous PE using the adder (212). The adder is often referred to as an accumulator; in this disclosure, the terms adder and accumulator are used interchangeably. As shown in FIG. 3, the partial sum propagates through the PE chain one stage per system clock.

The device in FIG. 3 therefore requires on the order of M system clocks before one weighted sum Yj becomes available, since the partial sum must traverse all M PEs sequentially.
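To make this serial behavior concrete, the following sketch is a software model of the conventional PE chain under our own simplifying assumptions (the real hardware splits multiply and accumulate across successive system clocks, which is folded into one step per stage here):

```python
# Software model of the conventional systolic PE chain (illustrative).
# Each stage multiplies its stored input by a weight and adds the partial
# sum arriving from the previous PE; the sum ripples one stage per clock.

def systolic_weighted_sum(x, w_col):
    """x: the M inputs held in the PE registers; w_col: the M weights for one output."""
    partial = 0.0
    for clock, (xi, wij) in enumerate(zip(x, w_col), start=1):
        partial += wij * xi                      # this stage multiplies, then accumulates
        print(f"clock {clock}: partial sum = {partial}")
    return partial                               # ready only after all M stages

systolic_weighted_sum([1.0, 2.0, 3.0], [0.5, -1.0, 0.25])   # -> -0.75 after 3 stages
```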
As mentioned above, the conventional PEs will take a long time to generate the weighted sums when the number of inputs is large. It is desirable to develop a device that can reduce the time required to compute the weighted sums.
A computing device for fast weighted sum calculation in neural networks is disclosed, where the neural networks have M inputs and N outputs, and M and N are integers greater than 1. The computing device comprises N processing elements, with each processing element designated for calculating a weighted sum for one target output. Each processing element comprises M multipliers, coupled to the M inputs and M weights respectively to generate M weighted inputs, and a plurality of adders arranged to add the M weighted inputs to generate said one target output.
In one embodiment, M corresponds to a power-of-2 integer and the plurality of adders corresponds to (M−1) adders arranged in a binary-tree fashion to add the M weighted inputs to generate said one target output.
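For example, the binary-tree arrangement can be modeled as follows (an illustrative sketch assuming M is a power of 2; level k performs M/2^k pairwise additions, totaling M−1 adders across log2(M) levels):

```python
# Sketch of (M - 1) adders arranged as a binary tree, for M a power of 2 (illustrative).

def adder_tree(products):
    assert len(products) & (len(products) - 1) == 0, "M must be a power of 2"
    level, adders_used = list(products), 0
    while len(level) > 1:
        level = [level[k] + level[k + 1] for k in range(0, len(level), 2)]
        adders_used += len(level)
    return level[0], adders_used

# M = 8 weighted inputs: 4 + 2 + 1 = 7 adders across log2(8) = 3 levels.
print(adder_tree([1, 2, 3, 4, 5, 6, 7, 8]))   # -> (36, 7)
```

The design trade-off is latency for area: the tree finishes in log2(M) adder delays instead of M−1 serial accumulations, at the cost of instantiating all M−1 adders at once.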
In another embodiment, each processing element further comprises timing and control circuitry to coordinate systolic operations for the M multipliers and the plurality of adders. Each processing element may further comprise a buffer to store the M weights. Alternatively, the M weights are provided to each processing element externally.
A method for fast weighted sum calculation in neural networks is also disclosed, where the neural networks have M inputs and N outputs, and M and N are integers greater than 1. The method comprises utilizing N processing elements to calculate weighted sums for the N outputs by utilizing one processing element designated for calculating a weighted sum for one target output. Furthermore, said utilizing said one processing element designated for calculating a weighted sum for one target output comprises: multiplying M inputs and M weights respectively using M multipliers in said one processing element to generate M weighted inputs for said one target output, wherein the M weights are associated with said one target output; adding the M weighted inputs to generate said one target output using a plurality of adders in said one processing element; and providing said one target output.
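A minimal software stand-in for the claimed method is sketched below (function names are our own, and M is assumed to be a power of 2 for the adder-tree step). Each processing element multiplies all M inputs by its own M weights and then reduces the products; since the N processing elements have no mutual dependencies, in hardware all N weighted sums emerge concurrently:

```python
# Software stand-in for the claimed method (illustrative assumptions noted above).

def processing_element(x, weights_j):
    """One PE: M multipliers generate the weighted inputs, an adder tree sums them."""
    products = [wij * xi for wij, xi in zip(weights_j, x)]   # M multipliers
    while len(products) > 1:                                 # binary-tree reduction
        products = [products[k] + products[k + 1] for k in range(0, len(products), 2)]
    return products[0]

def computing_device(x, w):
    """N PEs, each designated for one target output; independent, hence parallel."""
    return [processing_element(x, weights_j) for weights_j in w]

# M = 4 inputs, N = 2 outputs.
print(computing_device([1.0, 2.0, 3.0, 4.0], [[1, 0, 0, 0], [0.5, 0.5, 0.5, 0.5]]))
# -> [1.0, 5.0]
```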
In one embodiment of the method, M corresponds to a power-of-2 integer and the plurality of adders corresponds to (M−1) adders arranged in a binary-tree fashion to add the M weighted inputs to generate said one target output.
In another embodiment, each processing element further comprises timing and control circuitry to coordinate systolic operations for the M multipliers and the plurality of adders. Furthermore, each processing element further comprises a buffer to store the M weights. Alternatively, the M weights are provided to each processing element externally.
The following description is of the best-contemplated mode of carrying out the invention. This description is made for the purpose of illustrating the general principles of the invention and should not be taken in a limiting sense. The scope of the invention is best determined by reference to the appended claims.
It will be readily understood that the components of the present invention, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following more detailed description of the embodiments of the systems and methods of the present invention, as represented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention.
Reference throughout this specification to “one embodiment,” “an embodiment,” or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present invention. Thus, appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, etc. In other instances, well-known structures, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
The illustrated embodiments of the invention will be best understood by reference to the drawings, wherein like parts are designated by like numerals throughout. The following description is intended only by way of example, and simply illustrates certain selected embodiments of apparatus and methods that are consistent with the invention as claimed herein.
In the description like reference numbers appearing in the drawings and description designate corresponding or like elements among the different views.
As mentioned above, the weighted sum calculation plays an important role in neural networks and deep learning. The conventional devices on the market are usually configured as an array of processing elements (PEs), where the output (i.e., the partial sum) of one PE is fed to the input of the next stage to continue the weighted sum. In particular, a popular configuration designates each PE to one input. For example, for M inputs (X1, X2, . . . , XM) as shown in FIG. 3, M PEs are used, with one PE designated for each input. In FIG. 3, the partial sums propagate from one PE to the next, so the weighted sum for one output is accumulated over M system clocks.
To support the weighted sum calculation associated with (X1, X2, . . . , XM) and (Y1, Y2, . . . , YN), an exemplary architecture based on the present invention is shown in FIG. 4, where N processing elements are used and each processing element is designated for calculating the weighted sum for one target output.
As a comparison, the conventional PE array in FIG. 3 accumulates each weighted sum serially over M system clocks, whereas the N processing elements of the present architecture compute all N weighted sums concurrently, with each weighted sum ready after one multiplier delay plus the delay of the adder tree.
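As a rough illustrative comparison under idealized unit delays (an assumption of this description, not a measurement from the disclosure), the serial accumulation needs about M clocks per weighted sum, while one multiplier stage followed by a binary adder tree needs about 1 + ⌈log2 M⌉:

$$T_{\text{serial}} \approx M, \qquad T_{\text{tree}} \approx 1 + \lceil \log_2 M \rceil; \quad \text{e.g., for } M = 256:\; 256 \text{ vs. } 1 + 8 = 9.$$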
In FIG. 4, each processing element may further comprise timing and control circuitry to coordinate the systolic operations of the M multipliers and the plurality of adders, as well as a buffer to store the M weights; alternatively, the M weights may be provided to each processing element externally.
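The two weight-storage alternatives can be sketched as follows (the class layout is our own illustration, not circuitry from the drawings): a PE may hold its M weights in a local buffer loaded once, or accept them externally on every computation:

```python
class PE:
    """One processing element; optionally buffers its M weights locally."""
    def __init__(self, weights=None):
        self.buffer = list(weights) if weights is not None else None  # local weight buffer

    def compute(self, x, weights=None):
        w = self.buffer if weights is None else weights  # externally supplied weights win
        return sum(wij * xi for wij, xi in zip(w, x))

pe = PE(weights=[0.5, -1.0, 0.25])       # weights stored in the PE's buffer
print(pe.compute([1.0, 2.0, 3.0]))       # -> -0.75
print(pe.compute([1.0, 2.0, 3.0], weights=[1.0, 1.0, 1.0]))  # external weights -> 6.0
```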
The above description is presented to enable a person of ordinary skill in the art to practice the present invention as provided in the context of a particular application and its requirements. The invention may be embodied in other specific forms without departing from its spirit or essential characteristics. Therefore, the present invention is not intended to be limited to the particular embodiments shown and described, but is to be accorded the widest scope consistent with the principles and novel features herein disclosed. In the above detailed description, various specific details are illustrated in order to provide a thorough understanding of the present invention. Nevertheless, it will be understood by those skilled in the art that the present invention may be practiced without these specific details.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), field programmable gate arrays (FPGAs), and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor. The software code or firmware code may be developed in different programming languages and different formats or styles. The software code may also be compiled for different target platforms. However, different code formats, styles, and languages of software code, and other means of configuring code to perform the tasks in accordance with the invention, will not depart from the spirit and scope of the invention.
The present invention claims priority to U.S. Provisional Patent Application, Ser. No. 62/639,451, filed on Mar. 6, 2018. The U.S. Provisional Patent Application is hereby incorporated by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
62639451 | Mar 2018 | US |