The present invention relates to a computing device to support multiple activation functions as required in neural networks. In particular, the present invention relates to a hardware architecture that achieves cost effectiveness as well as higher processing throughput than conventional hardware structures.
Today, artificial intelligence has been used in various applications such as perceptive recognition (visual or speech), expert systems, natural language processing, intelligent robots, digital assistants, etc. Artificial intelligence is expected to have various capabilities including creativity, problem solving, recognition, classification, learning, induction, deduction, language processing, planning, and knowledge. A neural network is a computational model that is inspired by the way biological neural networks in the human brain process information. Neural networks have become a powerful tool for machine learning, in particular deep learning, in recent years. In light of the power of neural networks, various dedicated hardware and software for implementing neural networks have been developed.
Y_j = Σ_{i=1}^{3} W_ij X_i,  (1)
where W_ij is the weight associated with X_i and Y_j. In general, the total number of input signals may be M1, where M1 is an integer greater than 1. There may be N1 neurons in the hidden layer. The output y_j at the hidden layer becomes:
y_j = f(Σ_{i=1}^{3} W_ij X_i + b),  (2)
where b is the bias.
The output values can be calculated similarly by using y_j as input. Again, there is a weight associated with each contribution from y_j.
Accordingly, the function of each neuron can be modelled as weighted sum calculation 180 followed by an activation function 190 as shown in
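For illustration purposes only, a minimal Python sketch of equations (1) and (2) is shown below; the function and variable names are illustrative and are not part of the claimed hardware.

```python
import math

def neuron_output(x, w, b, activation):
    """Weighted sum of the inputs (equation (1)) followed by an activation function (equation (2))."""
    weighted_sum = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return activation(weighted_sum + b)

# Example with three inputs and a Sigmoid activation.
sigmoid = lambda y: 1.0 / (1.0 + math.exp(-y))
print(neuron_output([0.5, -1.0, 2.0], [0.1, 0.4, -0.3], 0.2, sigmoid))
```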
Sigmoid Function
The Sigmoid function curve 210 has an S-shape that looks like a form of the Greek letter Sigma as shown in
The Sigmoid function can be expressed as:
f(Y) = 1/(1 + e^(−Y)). (3)
Hyperbolic Tangent (Tan h) Function
The hyperbolic tangent function (tan h) has a shape 220 as shown in
The tan h function can be expressed as:
f(Y) = (e^Y − e^(−Y))/(e^Y + e^(−Y)). (4)
Rectified Linear Unit (ReLU) Function
The Rectified Linear Unit (ReLU) function is another popular non-linear activation function used in recent years. The Rectified Linear Unit function has a shape 230 as shown in
f(Y) = max(0, Y). (5)
Leaky ReLU Function
For the ReLU function, all the negative values are mapped to 0, which decreases the ability of the model to fit or train from the data properly. In order to overcome this issue, a leaky ReLU function has been used. The leaky ReLU function has a shape 240 as shown in
The leaky ReLU function can be expressed as:
f(Y) = Y for Y > 0, and f(Y) = aY for Y ≤ 0. (6)
In the above equation, the value of a is often selected to be less than 1. For example, the value of a can be 0.01.
The activation functions mentioned above are intended for illustration rather than as an exhaustive list of all activation functions. In practice, other activation functions, such as the Softmax function, are also used.
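For reference, a minimal Python sketch of the activation functions discussed above (scalar form, illustrative only) is shown below.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))                                    # equation (3)

def tanh(y):
    return (math.exp(y) - math.exp(-y)) / (math.exp(y) + math.exp(-y))   # equation (4)

def relu(y):
    return max(0.0, y)                                                   # equation (5)

def leaky_relu(y, a=0.01):
    return y if y > 0 else a * y                                         # equation (6)
```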
A scalar element computing device for computing a selected activation function selected from two or more different activation functions is disclosed. The scalar element computing device comprises N processing elements, N command memories and an operator pool. Each processing element comprises one or more inputs and one or more outputs, and the N processing elements are arranged into a pipeline to cause said one or more outputs of each non-last-stage processing element to be coupled to said one or more inputs of one next-stage processing element, where N is an integer greater than 1. The N command memories are coupled to the N processing elements individually. The operator pool is coupled to the N processing elements, where the operator pool comprises a set of operators for implementing any activation function in an activation function group of two or more different activation functions. The N processing elements are configured according to command information stored in the N command memories to calculate a target activation function selected from said two or more different activation functions by using one or more operators in the set of operators.
In one embodiment, said two or more different activation functions comprise Sigmoid, Hyperbolic Tangent (tan h), Rectified Linear Unit (ReLU) and leaky ReLU activation functions. The set of operators may comprise addition, multiplication, division, maximum and exponential operators. In another embodiment, the set of operators comprises addition, multiplication, division, maximum, minimum, exponential, logarithmic, and square root operators. The set of operators may also comprise one or more pool operators, where each pool operator is applied to a sequence of values. For example, the pool operators correspond to ADD_POOL to add the sequence of values, MIN_POOL to select a minimum value of the sequence of values, MAX_POOL to select a maximum value of the sequence of values, or a combination thereof.
In one embodiment, the set of operators comprises a range operator to indicate a range result of a first operand compared with ranges specified by one other operand (a second operand) or by two other operands. Furthermore, one processing element can be configured to use a target operator conditionally depending on the range result of the first operand in a previous-stage processing element.
In one embodiment, each of the N command memories is partitioned into memory entries and each entry is divided into fields. For example, each entry comprises a command field to identify a selected command and related control information, one or more register fields to indicate values of one or more operands for a selected operator, and one or more constant fields to indicate values of one or more operands for the selected operator.
In one embodiment, the scalar element computing device may comprise a multiplexer to select said one or more inputs of a first-stage processing element from either a feeder interface corresponding to full sum data or said one or more outputs of a last-stage processing element.
A method of using the above computing device is also disclosed. One or more operations required for a target activation function are determined. One or more target operators, corresponding to the operations, are selected from a set of operators supported by the operator pool. The target operators are mapped into the N processing elements arranged into the pipeline. The target activation function is calculated for input data using the N processing elements by applying said one or more operations to the input data, where the N processing elements implement said one or more operations using said one or more target operators from the operator pool according to command information related to said one or more target operators stored in the N command memories respectively.
A scalar computing subsystem for computing a selected activation function selected from two or more different activation functions is also disclosed. The scalar computing subsystem comprises an interface module to receive input data for applying a selected activation function and M scalar elements coupled to the interface module to receive data to be processed. Each scalar element is based on the scalar element computing device mentioned above. The scalar computing subsystem may further comprise a reduced operator pool coupled to all M scalar elements, where when a reduced operator is selected, each of the N processing elements in the M scalar elements provides a value for the reduced operator and uses a result of the reduced operator. The reduced operator pool may comprise an addition operator, a minimum operator and a maximum operator.
The scalar computing subsystem may further comprise an aligner coupled to all M scalar elements to align first data output from all M scalar elements. The scalar computing subsystem may further comprise a padder coupled to the aligner to pad second data output from the aligner. The input data corresponds to full sum data or memory data from a unified memory. The interface module comprises a multiplexer to select the input data from output data of a full sum calculation unit or looped-back outputs from last-stage processing elements in each scalar element.
The following description is of the best-contemplated mode of carrying out the invention. This description is made for the purpose of illustrating the general principles of the invention and should not be taken in a limiting sense. The scope of the invention is best determined by reference to the appended claims.
It will be readily understood that the components of the present invention, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following more detailed description of the embodiments of the systems and methods of the present invention, as represented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention.
Reference throughout this specification to “one embodiment,” “an embodiment,” or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present invention. Thus, appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, etc. In other instances, well-known structures, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
The illustrated embodiments of the invention will be best understood by reference to the drawings, wherein like parts are designated by like numerals throughout. The following description is intended only by way of example, and simply illustrates certain selected embodiments of apparatus and methods that are consistent with the invention as claimed herein.
In the description like reference numbers appearing in the drawings and description designate corresponding or like elements among the different views.
As mentioned above, a neural network implementation may need to support various activation functions. In theory, multiple sets of processors may be used in parallel to support the various activation functions. For example, a system may have four sets of processors, where each set of processors is dedicated to a particular activation function. In this case, four sets of processors will be needed to support the Sigmoid, tan h, ReLU and leaky ReLU activation functions. While such an implementation is straightforward, it may not be cost effective.
Scalar Element (SE) with Operator Pool
In this disclosure, an innovative architecture and related interfaces and operations are disclosed to support multiple activation functions. According to the present invention, the operations required to support the multiple activation functions are identified. The required operations are used as a common pool to support the implementation of various activation functions. Furthermore, in order to support high-speed operation, pipelined processing units are disclosed so that various operations can be performed concurrently in various pipeline stages.
As an example, the operations required to support the Sigmoid, tan h, ReLU and leaky ReLU activation functions will include addition (for Sigmoid and tan h), multiplication (for leaky ReLU), the exponential function (for Sigmoid and tan h), maximum (for ReLU) and comparison (for leaky ReLU). In this example, it is assumed that negation of a value (e.g. "−Y" and "−e^(−Y)") can be performed implicitly.
The set of operations to support a given set of activation functions may not be unique. For example, instead of implicitly implementing the negation of a value (e.g. "−Y" and "−e^(−Y)"), the negation of a value can be implemented by multiplying the value by a constant "−1". Furthermore, some activation functions or some partial activation functions may be supported by a dedicated operation. For example, the ReLU activation function ƒ(Y)=max(0, Y) may be implemented by a special conditional operation corresponding to a ranging operation followed by a branching operation according to the ranging result of the input signal. Such a special conditional operation can efficiently implement any activation function that uses different mapping functions depending on the data range. The ReLU is an example of such an activation function, where the output ƒ(Y) is equal to Y if Y is greater than 0. Otherwise, ƒ(Y) is equal to 0. The ranging operator may include two operands, with the first operand as the input signal and the second operand as a threshold to be compared with the input signal. If the first operand is greater than (or smaller than) the second operand, the ranging result is equal to 0 (or 1). Otherwise, the ranging result is equal to 1 (or 0). In the next stage, different operations can be selected according to the ranging result. This special conditional operation can be used to implement ReLU by setting the second operand to 0. If the ranging result is equal to 0, the conditional operator can be set to result in Y. If the ranging result is equal to 1, the conditional operator can be set to result in 0. In another embodiment, the special conditional operation may have three operands, where the first operand is the input signal, and the second operand and the third operand are thresholds to be compared with the input signal. In this case, three different ranges can be determined to cause three ranging results (e.g. 0, 1 and 2).
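For illustration purposes only, a behavioral Python sketch of the ranging-plus-branching idea described above is shown below; the two-operand form and the function names are assumptions for illustration rather than the claimed operator.

```python
def ranging(operand0, operand1):
    """Two-operand ranging: result is 0 if operand 0 exceeds the threshold, 1 otherwise."""
    return 0 if operand0 > operand1 else 1

def relu_via_ranging(y):
    """ReLU f(Y) = max(0, Y) built from a ranging operation followed by a conditional branch."""
    cmp_result = ranging(y, 0.0)      # second operand (threshold) set to 0
    if cmp_result == 0:               # Y > 0: the conditional operator results in Y
        return y
    return 0.0                        # otherwise: the conditional operator results in 0
```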
SCU Operator Pool
Each scalar computing unit comprises multiple pipeline inputs and multiple pipeline outputs. The example in
In order to support a set of activation functions consisting of Sigmoid, tan h, ReLU and leaky ReLU, a set of operators comprising addition (ADD), multiplication (MULT), maximum (MAX), division (DIV) and exponential function (EXP) may be used as the operator pool to implement the set of activation functions. As mentioned before, the ReLU activation function may be implemented by a dedicated operator referred to as conditional branching (COND_BCH) in this disclosure, which determines the range of an input signal and selects an operator based on the ranging result. Similarly, the leaky ReLU activation function also involves comparison of an input with zero and then uses either the "Y" or "aY" function depending on the comparison result, as shown in equation (6). Therefore, an operator (e.g. pass-through or no operation (NOP)) to cause ƒ(Y)=Y can be used when the input signal is greater than 0. For an input signal smaller than 0, the multiplication (MULT) operator can be used to cause ƒ(Y)=aY. The comparison operation may also be implemented using the MIN or MAX operator with 0 as one operand. The actual operator is selected according to the comparison result. Accordingly, in another embodiment, the set of operators comprising addition (ADD), multiplication (MULT), maximum (MAX), COND_BCH, division (DIV) and exponential function (EXP) may be used as the operator pool for implementing the set of activation functions. The set of operators may include both maximum (MAX) and minimum (MIN).
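To illustrate how activation functions may be composed from such a shared pool, a hedged Python sketch is shown below; the pool contents, stage mapping and names are assumptions for illustration and do not represent the claimed hardware datapath.

```python
import math

# Hypothetical operator pool shared by the pipeline stages.
OPERATOR_POOL = {
    "ADD":  lambda a, b: a + b,
    "MULT": lambda a, b: a * b,
    "MAX":  lambda a, b: max(a, b),
    "MIN":  lambda a, b: min(a, b),
    "DIV":  lambda a, b: a / b,
    "EXP":  lambda a, b=None: math.exp(a),
    "NOP":  lambda a, b=None: a,
}

def sigmoid_from_pool(y):
    """Sigmoid 1/(1 + e^(-Y)) mapped onto pool operators stage by stage."""
    e = OPERATOR_POOL["EXP"](-y)           # stage 1: e^(-Y)
    s = OPERATOR_POOL["ADD"](1.0, e)       # stage 2: 1 + e^(-Y)
    return OPERATOR_POOL["DIV"](1.0, s)    # stage 3: 1 / (1 + e^(-Y))

def leaky_relu_from_pool(y, a=0.01):
    """Leaky ReLU: NOP (pass-through) when Y > 0, MULT by the constant a otherwise."""
    op = "NOP" if y > 0 else "MULT"
    return OPERATOR_POOL[op](y, a)
```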
In some applications, an operation may be applied to a sequence of data. For example, it may be necessary to add up a sequence of input values. In order to implement such an operation efficiently, a new operator (ADD_POOL) that can add up a sequence of values is disclosed. For example, ADD_POOL may accumulate operand 0, where the accumulator is cleared when the first-of-pool signal is active. The accumulation result will be output when the last-of-pool signal is active. When both the first-of-pool and last-of-pool signals are active, ADD_POOL will function as ADD. ADD_POOL will also output the pool size. The ADD_POOL operator provides the accumulated value and the pool size for a sequence of values. The results of ADD_POOL can be used to calculate the average of a sequence of values by dividing the accumulated value by the pool size. Accordingly, a special division operator is disclosed to perform the division of an accumulated value by the pool size. Similarly, the operation on a sequence of data may also be applicable to MIN and MAX. For example, MAX_POOL can be used for the MAX operation on a sequence of data, which outputs the maximum value of a sequence of operand 0 when the last-of-pool signal is active. The start of a pool is indicated by the first-of-pool signal. When both first-of-pool and last-of-pool are active, MAX_POOL will function as MAX. In another example, MIN_POOL can be used for the MIN operation on a sequence of data, which outputs the minimum value of a sequence of operand 0 when the last-of-pool signal is active. The start of a pool is indicated by the first-of-pool signal. When both first-of-pool and last-of-pool are active, MIN_POOL will function as MIN.
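For illustration purposes only, a behavioral Python sketch of the ADD_POOL operator and the average calculation described above is shown below; the signal and class names are assumptions for illustration.

```python
class AddPool:
    """Behavioral model of ADD_POOL: accumulates operand 0 over a sequence of values."""
    def __init__(self):
        self.acc = 0.0
        self.size = 0

    def step(self, operand0, first_of_pool, last_of_pool):
        if first_of_pool:                  # clear the accumulator at the start of a pool
            self.acc, self.size = 0.0, 0
        self.acc += operand0
        self.size += 1
        if last_of_pool:                   # output the accumulated value and the pool size
            return self.acc, self.size
        return None                        # no output until the pool completes

# Average of a sequence: ADD_POOL followed by the special division by the pool size.
pool = AddPool()
values = [1.0, 2.0, 3.0, 4.0]
for i, v in enumerate(values):
    out = pool.step(v, first_of_pool=(i == 0), last_of_pool=(i == len(values) - 1))
accumulated, pool_size = out
average = accumulated / pool_size          # 2.5
```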
The operator pool may comprise multiple copies of one operator. For example, the operator pool may comprise two copies of ADD (e.g. ADDa and ADDb) so that the addition operation can be used by two pipeline stages at the same time. If only one copy of ADD is available in the operator pool, two different pipeline stages would have to take turns sharing the same ADD operator.
While the exemplary sets of operators are adequate to support the major existing activation functions, other activation functions may also be used. In yet another embodiment of the present invention, an extended set of operators may be used to support a larger variety of activation functions. For example, the set of operators may further comprise a logarithmic operator (LN) and a square-root operator (SQRT) to support other activation functions.
Individual SCU Operator Pool and Global Operator Pool
In
Reduced operators are useful for operations that are applied across all SEs. For example, values of pipeline stages from all SEs may be summed using a reduced ADD operator (e.g. REDUCE_OP_ADD). In another example, it may be necessary to find a minimum among values of pipeline stages from all SEs. In this case, a reduced minimum operator (e.g. REDUCE_OP_MIN) may be used. Similarly, it may be necessary to find a maximum among values of pipeline stages from all SEs. In this case, a reduced maximum operator (e.g. REDUCE_OP_MAX) may be used.
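A minimal Python sketch of a reduced operation across the M scalar elements is shown below for illustration only; the operator names follow the examples above and the function form is an assumption.

```python
def reduce_across_ses(stage_values, reduce_op):
    """Apply a reduced operator to the values provided by the same pipeline stage of all SEs;
    every SE then consumes the same reduction result."""
    if reduce_op == "REDUCE_OP_ADD":
        return sum(stage_values)
    if reduce_op == "REDUCE_OP_MIN":
        return min(stage_values)
    if reduce_op == "REDUCE_OP_MAX":
        return max(stage_values)
    raise ValueError("unsupported reduced operator")

# Example: four SEs each provide one value; all four receive the common sum 6.0.
print(reduce_across_ses([1.0, 2.5, -0.5, 3.0], "REDUCE_OP_ADD"))
```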
SCU Pipeline Stages with Loop Back
In another embodiment, the output from the last SCU pipeline stage can be looped back to the input of the first SCU pipeline stage so as to increase the effective length of the pipeline. For example, the outputs (i.e., 350-0, 350-1 and 350-2) from SCU pipeline stage 7 can be looped back to the inputs (i.e., 360-0, 360-1 and 360-2) through multiplexers 340. The multiplexers 340 can be configured to select the looped-back inputs (i.e., 360-0, 360-1 and 360-2) or the inputs (input 0, input 1 and input 2) from the full sum feeder.
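For illustration purposes only, a simplified behavioral sketch of the loop-back path is shown below; the multiplexer and stage functions are placeholders rather than the actual datapath.

```python
def loopback_mux(select_loopback, loopback_outputs, feeder_inputs):
    """Multiplexer: select looped-back outputs (later passes) or inputs from the full sum feeder (first pass)."""
    return loopback_outputs if select_loopback else feeder_inputs

def run_passes(feeder_inputs, stages, num_passes):
    """Send data through the pipeline stages num_passes times using the loop-back path."""
    looped = None
    for p in range(num_passes):
        data = loopback_mux(select_loopback=(p > 0),
                            loopback_outputs=looped,
                            feeder_inputs=feeder_inputs)
        for stage in stages:               # each stage applies its configured operator
            data = stage(data)
        looped = data                      # last-stage outputs feed the multiplexer again
    return looped
```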
SCU Memory Data Structure
The SCU memory includes the information needed to control the operation of the corresponding SCU pipeline stage. For example, each SCU memory may include multiple entries (e.g. S1 entries, S1 being a positive integer greater than 1) and each entry may consist of multiple bits (e.g. S2 bits, S2 being a positive integer greater than 1) partitioned into various fields. For example, the SCU memory may include 128 entries (i.e., S1=128). The 128 entries are organized as 32 sets, each set being a loop of 4 commands. Each SCU pipeline stage will receive a 5-bit command address (cmd_addr) to indicate the set of commands to use. Each entry consists of 192 bits (i.e., S2=192), which may be divided into 6 fields with 32 bits for each field. In the following, an example of a data structure for the SCU memory is illustrated:
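For illustration purposes only, a Python sketch of how one such 192-bit entry might be decoded is shown below; the field names and their ordering are assumptions based on the fields described in the following subsections, not a definitive memory layout.

```python
from dataclasses import dataclass

FIELD_BITS = 32                    # each 192-bit entry is divided into 6 fields of 32 bits

@dataclass
class ScuEntry:
    scu_command:   int             # selected operator and related control information
    scu_register0: int             # register operand value
    scu_register1: int             # register operand value
    scu_constant0: int             # constant operand value
    scu_constant1: int             # constant operand value
    scu_constant2: int             # constant operand value

def decode_entry(raw_entry):
    """Split a 192-bit memory word into six 32-bit fields (ordering is illustrative)."""
    mask = (1 << FIELD_BITS) - 1
    fields = [(raw_entry >> (FIELD_BITS * i)) & mask for i in range(6)]
    return ScuEntry(*fields)
```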
Operator Selection
The scu_command includes a field consisting of multiple bits for selecting an operator. The number of bits for indicating a selected operator depends on the number of operators available for selection. For example, a 5-bit field will be able to identify up to 32 different operators to select from.
As mentioned earlier, an extended set of operators may be used to support a large variety of activation functions. In this case, scu_command is designed to accommodate the inclusion of the set of extended operators.
When the reduced operators are used, selection of a reduced operator should be indicated. Accordingly, scu_command is designed to accommodate the inclusion of reduced operators in this case.
Pipeline Output Selection
In order to provide flexibility, the SCU pipeline stage may be configured to allow selectable output. For example, out1 and out2 may be set to select pipeline input 1 and pipeline input 2 respectively; or out1 and out2 may be set to select respective operator outputs. In this case, the information related to the operation may include one or more bits to indicate the pipeline output selection.
Operand Selection
In order to provide flexibility, the SCU pipeline stage may be configured to allow operand selection. For example, the operand 0 may be selected from a group comprising one of the three inputs (i.e., in0, in1 and in2) of the SCU pipeline stage, from a register, or from the result of a reduced operation. The operand 1 may be selected from a group comprising one of the three inputs (i.e., in0, in1 and in2) of the SCU pipeline stage, a register, a constant, or the result of a reduced operation.
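A hedged Python sketch of the operand selection is shown below; the selector encodings are assumptions for illustration.

```python
def select_operand(selector, stage_inputs, registers, constant, reduce_result):
    """Pick an operand source: a pipeline input (in0/in1/in2), a register,
    a constant, or the result of a reduced operation."""
    if selector in ("in0", "in1", "in2"):
        return stage_inputs[int(selector[-1])]
    if selector in ("reg0", "reg1"):
        return registers[int(selector[-1])]
    if selector == "const":
        return constant
    if selector == "reduce":
        return reduce_result
    raise ValueError("unknown operand selector")
```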
COND_BCH Operator
In order to support the special conditional operation, an entry for such an operation may use the input signal and two scu_registers as three operands, where the first operand corresponds to the input signal to be processed and the second and third operands are used for values of thresholds to be compared with. For example, the operation outputs cmp_result=0 if operand 0 > operand 1; the operation outputs cmp_result=1 if operand 0 <= operand 1 and operand 0 >= operand 2; and the operation outputs cmp_result=2 if operand 0 < operand 2.
In order to support conditional operation, the scu_command may include one or more bits to indicate whether to use different operations depending on the ranging result. For example, one "use compare result" bit can be set or unset to indicate whether to use the ranging result. When the "use compare result" bit of the scu_command is set, the cmp_result of the last pipeline stage will be used to determine the actual command to be used. Accordingly, when the "use compare result" bit is set, scu_constants can be further used to indicate a corresponding operator selected for a ranging result. For example, when cmp_result=1, scu_constant[0] will be used to replace scu_command. The "use compare result" and "operator select" bits of scu_constant[0] will be ignored. When cmp_result=2, scu_constant[1] will be used to replace scu_command. The "use compare result" and "operator select" bits of scu_constant[1] will be ignored.
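For illustration purposes only, a behavioral Python sketch of the compare-result substitution described above is shown below; the handling of cmp_result=0 (keeping the stage's own command) is an assumption.

```python
def effective_command(scu_command, scu_constants, cmp_result, use_compare_result):
    """When the 'use compare result' bit is set, the propagated cmp_result selects
    the word that supplies the command actually executed by this pipeline stage."""
    if not use_compare_result or cmp_result == 0:
        return scu_command            # assumed: keep the stage's own command
    if cmp_result == 1:
        return scu_constants[0]       # scu_constant[0] replaces scu_command
    return scu_constants[1]           # cmp_result == 2: scu_constant[1] replaces scu_command
```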
Cmp_result can be propagated through the SCU pipeline stages including loopback until the cmp_result is replaced by a new cmp_result from the MIN, MAX, or COND_BCH operator.
Loop Count
As mentioned before, the output from the last pipeline stage can be looped back to the input of the first stage. According to one embodiment of the present invention, the system can be configured to allow multiple loops of operation. For example, one or more bits in the scu_cmd can be used to indicate or control the number of loops of operations. For example, the 2 LSBs of the memory address can be used for the loop count, which indicates one of the 4 passes through the 8 pipeline stages. In order to increase the efficiency, the last SCU pipeline stage may use a separate memory.
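A minimal sketch of the command-address decoding with the loop count in the 2 LSBs is shown below; the exact bit layout is an assumption for illustration.

```python
def command_entry_address(cmd_addr_5bit, loop_count):
    """Form the entry address: the 5-bit command address selects one of the 32 command sets,
    and the 2-bit loop count (one of 4 passes through the pipeline) selects within the set."""
    assert 0 <= cmd_addr_5bit < 32 and 0 <= loop_count < 4
    return (cmd_addr_5bit << 2) | loop_count
```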
Table 1 illustrates an exemplary data structure for scu_cmd.
Scalar Computing Unit (SCU) Subsystem
The scalar element (SE) as shown in
The Mux 340 as shown in
The above description is presented to enable a person of ordinary skill in the art to practice the present invention as provided in the context of a particular application and its requirements. The invention may be embodied in other specific forms without departing from its spirit or essential characteristics. Therefore, the present invention is not intended to be limited to the particular embodiments shown and described, but is to be accorded the widest scope consistent with the principles and novel features herein disclosed. In the above detailed description, various specific details are illustrated in order to provide a thorough understanding of the present invention. Nevertheless, it will be understood by those skilled in the art that the present invention may be practiced without such specific details.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), field programmable gate array (FPGA), and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor. The software code or firmware code may be developed in different programming languages and different formats or styles. The software code may also be compiled for different target platforms. However, different code formats, styles and languages of software code and other means of configuring code to perform the tasks in accordance with the invention will not depart from the spirit and scope of the invention.