This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2018-136714, filed Jul. 20, 2018, the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to an information processing apparatus for convolution operations in layers of a convolutional neural network.
In layers of a convolutional neural network (CNN) for use in image recognition processing, etc., convolution operations are performed.
Such convolution operations in layers of CNN involve a great deal of calculations. Accordingly, bit precision is often differentiated on an operation-by-operation basis with the aim of mitigating calculation load and improving efficiency.
Also, a CNN includes multiple layers. It is known that the bit precision required for realizing recognition accuracy necessary in, for example, image recognition processing varies depending on each of the layers.
According to one embodiment, an information processing apparatus for convolution operations in layers of a convolutional neural network, includes a memory and a product-sum operating circuitry. The memory is configured to store items of information indicative of an input, a weight to the input, and a bit width determined for each filter of the weight. The product-sum operating circuitry is configured to perform a product-sum operation based on the items of information indicative of the input, the weight, and the bit width, stored in the memory.
Embodiments will be described with reference to the drawings.
A CNN is formed of multiple layers. Principal processing in each layer is given by the following expression (1).
In the expression, ym,r,c is referred to as an output, xn,r,c is referred to as an input, and wm,n,ky,kx is referred to as a weight. Each weight value is determined in advance through learning processes, so the values are known and fixed when processing such as image recognition is performed. In contrast, in the case of image recognition, the input xn,r,c and the output ym,r,c change as the input image changes.
The input x takes a three-dimensional structure having a height R, a width C, and a channel N, and may be expressed as an N×R×C cuboid as shown in
Note that cutting out a region of the size equal to one filter m of the weight w from the input x cuboid, and performing a product-sum operation, i.e., multiplying the values and summing all the multiplication results within the region, will yield a single value in the output y (see
For performing the foregoing processing, it is common to use the same format, e.g., the same single-precision floating point, for all of the output y, the input x, and the weight w. That is, the same bit precision is generally used for all of the output y, the input x, and the weight w.
This embodiment is based particularly on the nature of CNN processing, where a product-sum operation is performed for each filter m as discussed above.
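The per-filter product-sum described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: it assumes that expression (1) is the usual stride-1 sum over n, ky, and kx with no padding, and the function name `conv_layer` and the demo values are hypothetical.

```python
def conv_layer(x, w):
    """Per-filter product-sum of a convolution layer.

    x: input as nested lists of shape N x R x C.
    w: weights as nested lists of shape M x N x Ky x Kx.
    Assumption: stride 1, no padding, index x[n][r + ky][c + kx].
    """
    N, R, C = len(x), len(x[0]), len(x[0][0])
    M, Ky, Kx = len(w), len(w[0][0]), len(w[0][0][0])
    out_r, out_c = R - Ky + 1, C - Kx + 1
    y = [[[0] * out_c for _ in range(out_r)] for _ in range(M)]
    for m in range(M):  # one filter m yields one output channel
        for r in range(out_r):
            for c in range(out_c):
                # Multiply the filter with the region it covers and sum everything.
                y[m][r][c] = sum(
                    w[m][n][ky][kx] * x[n][r + ky][c + kx]
                    for n in range(N)
                    for ky in range(Ky)
                    for kx in range(Kx)
                )
    return y


# Hypothetical demo: one input channel of 2x2, two 2x2 filters.
x_demo = [[[1, 2], [3, 4]]]
w_demo = [[[[1, 0], [0, 1]]], [[[1, 1], [1, 1]]]]
y_demo = conv_layer(x_demo, w_demo)
```

Each filter m produces exactly one value per output position, which is why a bit width chosen per filter applies to every multiplication inside that product-sum.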
For the sake of simplicity, the description will assume an instance of the weight w being expressed by integers. For example, the weight w of a given layer includes M×N×Ky×Kx values, and it is supposed that the largest value among them is 100, and the smallest value is −100. In this case, 8-bit precision would typically be used as the bit precision for the weight w in order to express the largest value and the smallest value, since 8 bits can express a value from −128 to +127.
In the first embodiment, a bit width of the weight w is determined for each value of the weight w for a filter m. The weight w includes M filters m. The maximum weight value for one of these filters m is 100, and the minimum weight value for one of these filters m is −100. However, it will be supposed that, for the 0th filter m, for example, the weight value may take 50 as the maximum value and −10 as the minimum value. In this case, 7 bits are sufficient and 8 bits are not necessary for the 0th filter m, since 7 bits can express a value from −64 to +63. Similarly, the maximum weight value and the minimum weight value are estimated for each filter m, and the smallest bit width required is used. In this way, the entire calculation amount, and the capacity of a memory necessary for weight storage may be reduced.
Besides, a product-sum operation is performed for each filter m as discussed above. All of the N×Ky×Kx product-sum operations for calculating one given output y, which are repeated for each of the M filters, can use the same bit width within a filter m, so efficient processing is possible.
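As a rough illustration of the per-filter selection, the following sketch picks the smallest two's complement width that covers each filter's minimum and maximum and compares the storage against a single layer-wide width. The helper names and the second and third filters are hypothetical; only the ranges −10 to 50 and −100 to 100 come from the description above.

```python
def min_twos_complement_bits(lo, hi):
    """Smallest width B such that -2**(B-1) <= lo and hi <= 2**(B-1) - 1."""
    bits = 1
    while lo < -(1 << (bits - 1)) or hi > (1 << (bits - 1)) - 1:
        bits += 1
    return bits


def per_filter_bit_widths(filters):
    """filters: one flat list of integer weights per filter m."""
    return [min_twos_complement_bits(min(f), max(f)) for f in filters]


filters = [
    [50, -10, 3, 0],    # filter 0: range -10..50, as in the text
    [100, -100, 7, 1],  # hypothetical filter spanning the full layer range
    [1, -2, 0, 1],      # hypothetical small-valued filter
]
widths = per_filter_bit_widths(filters)

# Storage in bits: one layer-wide width versus one width per filter.
uniform_bits = max(widths) * sum(len(f) for f in filters)
per_filter_bits = sum(b * len(f) for b, f in zip(widths, filters))
```

With these values the per-filter widths are 7, 8, and 2 bits, so the weights occupy 68 bits instead of 96.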
As shown in
These information items for the weight wm,n,ky,kx, the bit width Bwm of the weight wm,n,ky,kx, and the input xn,ky,kx, stored in the memory 201, are input to a product-sum operation unit 202a. Note that the information items for the weight wm,n,ky,kx, the bit width Bwm of the weight wm,n,ky,kx, and the input xn,ky,kx may be directly input to the product-sum operation unit 202a without being stored in the memory 201.
The product-sum operation unit 202a performs processing for product-sum operations based on the information items for the weight wm,n,ky,kx, the bit width Bwm of the weight wm,n,ky,kx, and the input xn,ky,kx, stored in the memory 201.
The product-sum operation unit 202a performs processing for product-sum operations in accordance with, and appropriate for, the information for the bit width Bwm. The processing for product-sum operations by the product-sum operation unit 202a may be software processing for implementation by a processor, or hardware processing for implementation by product-sum operation circuitry. The product-sum operation circuitry may be, for example, logical operation circuitry.
The output from the product-sum operation unit 202a is given as ym,r,c as indicated by the expression (1).
The weight wm,n,ky,kx, and the bit width Bwm of the weight wm,n,ky,kx with respect to each filter m are values which have been calculated through learning processes, and stored in the memory 201.
The bit width Bwm may also be obtained through calculation by a bit-width calculator (processor) 251. As shown in
The following method may be adopted for calculating the bit width Bwm with respect to each filter m.
The bit width Bwm of the weight wm,n,ky,kx is calculated by a processor (not shown). The bit width Bwm adopts the number that is obtained by adding one bit to a bit width which is a binarized expression of the maximum value (maximum absolute value) of the weight wm,n,ky,kx. The addition of one bit is involved since it is necessary to utilize the maximum value in the positive domain or the negative domain with respect to the center 0, for expressing the other domain as well.
For the example shown in
The symbol “┌ ┐” indicates a ceiling function.
Accordingly, the required bit width Bwm is found to be 6 bits.
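The rule described above (the width of the binary expression of the maximum absolute value, plus one bit) can be sketched as below. The function name is hypothetical, and the sample weights are assumptions chosen so that the maximum absolute value is 20, which reproduces the 6-bit result.

```python
def bit_width_from_max_abs(weights):
    """Bits needed to write max |w| in binary, plus one bit for the other sign."""
    max_abs = max(abs(v) for v in weights)
    return max_abs.bit_length() + 1


# Hypothetical filter whose largest magnitude is 20 (binary 10100, 5 bits).
example_width = bit_width_from_max_abs([20, -10, 3, 0])
```

The same rule gives 7 bits for a filter ranging from −10 to 50 and 8 bits for one ranging from −100 to 100.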
As the product-sum operation unit 202a, it is possible to adopt, for example, product-sum operation circuitry configured to receive input of multibit data as shown in
In the second embodiment as shown in
According to the second embodiment, the bit widths Bwm0 to BwmL-1 of the weights wm0 to wmL-1 are different for the respective L filters m. The weights wm0 to wmL-1 for the L filters m, and the bit widths Bwm0 to BwmL-1 of the respective weights wm0 to wmL-1 are input to the product-sum operation unit 202b. Note that the weights wm0 to wmL-1 for the L filters m, the bit widths Bwm0 to BwmL-1 of the weights wm0 to wmL-1, and the input xn,ky,kx may be directly input to the product-sum operation unit 202b without being stored in the memory 201.
The product-sum operation unit 202b performs processing for product-sum operations for a group of multiple filters m, based on the information items for the weights wm0 to wmL-1 for the L filters m, the bit widths Bwm0 to BwmL-1 of the respective weights wm0 to wmL-1, and the input xn,ky,kx, stored in the memory 201.
In the product-sum operation unit 202b, processing for multiple filters m is performed in a parallel manner. The product-sum operation unit 202b performs processing for product-sum operations in accordance with, and appropriate for, the input bit widths Bwm0 to BwmL-1 of the respective weights wm0 to wmL-1 for the filters m. The processing for product-sum operations by the product-sum operation unit 202b may be software processing for implementation by a processor, or hardware processing for implementation by product-sum logical operation circuitry. The output from the product-sum operation unit 202b is given as ym,r,c as indicated by the expression (1).
As the product-sum operation unit 202b, it is possible to adopt, for example, product-sum operation circuitry configured to receive input of multibit data as shown in
It has been supposed in the first embodiment that the weight value for the 0th filter m takes the maximum value of 50 and the minimum value of −10, and 7 bits are necessary in order to express this range in the normal two's complement representation. However, the range of +50 to −10 covers at most 61 kinds of integers, which fall within the range that can be expressed with 6 bits. The third embodiment estimates the range of the filter m and uses the minimum bit width required for that range, instead of using the maximum weight value and the minimum weight value for each filter m directly. This allows for reduction of the entire calculation amount and the capacity of a memory that must be secured for storing the weights.
The processing according to this embodiment may be given as the following expression.
Here, wm,n,ky,kx=w′m,n,ky,kx+bm. Note that bm is a value for correcting w′ so that the range of w can be expressed in the minimum bit precision required, and bm takes a single value for each filter m. For example, bm can be defined as bm=(max w+1+min w)/2. This renders the bit width Bw′m of the weight w′m smaller than the bit width Bwm of the original weight wm, and therefore, the first term in the expression (2) can be calculated with a smaller bit width. The expression (2) additionally includes the second term as compared to the expression (1). Nevertheless, while the first term requires M×N×Ky×Kx×R×C product-sum operations, the second term can be calculated by N×R×C+Ky×Kx×R×C additions. Since the calculation amount of the second term is sufficiently smaller than that of the first term, it can be expected that having the smaller bit width for the first term would provide an effect beyond the overhead introduced by the addition of the processing of the second term.
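The split into two terms follows from substituting the relation between w and w′ into the product-sum. The derivation below is a sketch under the assumption that expression (1) is a sum over n, ky, and kx; the input index is written generically as x_{n,r+k_y,c+k_x}, which is also an assumption.

$$y_{m,r,c} = \sum_{n,k_y,k_x} \left(w'_{m,n,k_y,k_x} + b_m\right) x_{n,r+k_y,c+k_x} = \sum_{n,k_y,k_x} w'_{m,n,k_y,k_x}\, x_{n,r+k_y,c+k_x} + b_m \sum_{n,k_y,k_x} x_{n,r+k_y,c+k_x}$$

The first sum uses only the narrower weights w′, while the second sum involves no weights at all and is scaled once per filter by b_m.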
As shown in
The memory 201 stores information for the weight w′m,n,ky,kx, information for the bit width Bw′m of the weight w′m,n,ky,kx, information for the input xn,ky,kx, and information for the correction value bw′m. The bit width Bw′m of the weight w′ is determined with respect to each filter m.
The information items for the weight w′m,n,ky,kx, the bit width Bw′m of the weight w′m,n,ky,kx, and the input xn,ky,kx, stored in the memory 201, are input to a product-sum operation unit 202c. Note that these information items for the weight w′m,n,ky,kx, the bit width Bw′m of the weight w′m,n,ky,kx, and the input xn,ky,kx may be directly input to the product-sum operation unit 202c without being stored in the memory 201.
The product-sum operation unit 202c performs processing for product-sum operations in accordance with, and appropriate for, the information for the bit width Bw′m.
The output from the product-sum operation unit 202c is expressed as the first term in the expression (2).
The input xn,ky,kx and the correction value bw′m, stored in the memory 201, are input to the correction value calculator 203c. The correction value calculator 203c outputs a correction value expressed as the second term in the expression (2), based on the input xn,ky,kx and the correction value bw′m from the memory 201.
An adder 204 adds together the output from the product-sum operation unit 202c (the first term in the expression (2)) and the output from the correction value calculator 203c (the second term in the expression (2)) to output ym,r,c.
The processing for product-sum operations by the product-sum operation unit 202c, the processing for correction value calculation by the correction value calculator 203c, and the processing for addition by the adder 204 may be software processing for implementation by a processor, or hardware processing for implementation by product-sum logical operation circuitry.
As in the preceding embodiments, the bit width Bw′m of the weight w′ differs for each filter m. The correction value bw′m also differs for each filter m.
The product-sum operation unit 202c performs processing for product-sum operations in accordance with, and appropriate for, the information for the bit width Bw′m.
As the product-sum operation unit 202c, it is possible to adopt, for example, product-sum operation circuitry configured to receive input of multibit data as shown in
The output from the adder 204 is given as ym,r,c as indicated by the expression (1).
The weight w′m,n,ky,kx, the bit width Bw′m of the weight w′m,n,ky,kx with respect to each filter m, and the correction value bw′m are values which have been calculated through learning processes, and stored in the memory 201.
The weight w′, the bit width Bw′m of the weight w′, and the correction value bw′m may also be obtained through calculation by a bit-width corrector (processor) 301. As shown in
According to the third embodiment, the correction value bw′m is used so that the bit width of the weight is optimized into a smaller value. The weight w′m,n,ky,kx, the bit width Bw′m, and the input x are input to the product-sum operation unit 202c, and the correction value bw′m for use in correction is input to the correction value calculator 203c.
The weight w′m,n,ky,kx, the bit width Bw′m, and the correction value bw′m are calculated by the bit-width corrector 301 in the following manner.
In the example shown in
In practice, however, it is sufficient if 31 values (20+10+1) are expressed. Therefore, the required minimum bit width of the weight is given as follows, where it is determined to be 5.
Bit width Bw′m = ┌log2(31)┐ = ┌4.95┐ = 5
In this example, subtracting “5” from every value renders the maximum value 15 and the minimum value −15, and accordingly, 5 bits can express this range. As such, the correction value bw′m is “5”. This value “5” may be calculated as, for example, (max wm+1+min wm)/2.
With the information processing apparatus 501c according to the third embodiment, the product-sum operation unit 202c that involves a great deal of calculations can use the bit width of the weight, which has been reduced from 6 bits to 5 bits, and therefore, the resulting calculation amount can further be reduced.
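The correction can be checked numerically with a short sketch. The weights and inputs below are hypothetical values with the range −10 to 20 used in the example; the use of integer (floor) division for (max w+1+min w)/2 is an assumption made so that the correction value comes out as 5.

```python
def min_twos_complement_bits(lo, hi):
    """Smallest width B such that -2**(B-1) <= lo and hi <= 2**(B-1) - 1."""
    bits = 1
    while lo < -(1 << (bits - 1)) or hi > (1 << (bits - 1)) - 1:
        bits += 1
    return bits


def offset_correct(weights):
    """Return the shifted weights w' and the per-filter correction value b."""
    b = (max(weights) + 1 + min(weights)) // 2  # assumption: floor division
    return [w - b for w in weights], b


weights = [20, -10, 0, 7, -3]  # hypothetical filter, range -10..20
x = [3, 1, 4, 1, 5]            # hypothetical inputs under the filter

shifted, b = offset_correct(weights)
bits_before = min_twos_complement_bits(min(weights), max(weights))
bits_after = min_twos_complement_bits(min(shifted), max(shifted))

first_term = sum(wp * xi for wp, xi in zip(shifted, x))  # narrow product-sum
second_term = b * sum(x)                                 # correction term
direct = sum(w * xi for w, xi in zip(weights, x))        # uncorrected result
```

Here the shifted weights span −15 to 15, the width drops from 6 to 5 bits, and the sum of the two terms equals the direct product-sum.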
In the fourth embodiment as shown in
According to the fourth embodiment, the bit widths Bw′m0 to Bw′mL-1 are different for the respective L filters m. The information items for the weights w′m0 to w′mL-1 for L filters m, the bit widths Bw′m0 to Bw′mL-1 of the respective weights w′m0 to w′mL-1, and the input xn,ky,kx are input to the product-sum operation unit 202d. Note that these information items for the weights w′m0 to w′mL-1 for L filters m, the bit widths Bw′m0 to Bw′mL-1 of the weights w′m0 to w′mL-1, and the input xn,ky,kx may be directly input to the product-sum operation unit 202d without being stored in the memory 201.
The product-sum operation unit 202d performs processing for product-sum operations based on the information items for the weights w′m0 to w′mL-1 for L filters m, the bit widths Bw′m0 to Bw′mL-1 of the respective weights w′m0 to w′mL-1, and the input xn,ky,kx, stored in the memory 201.
In the product-sum operation unit 202d, processing for multiple filters m is performed in a parallel manner. The product-sum operation unit 202d performs processing for product-sum operations in accordance with, and appropriate for, the input bit widths Bw′m0 to Bw′mL-1 of the respective weights w′m0 to w′mL-1 for the filters m. The output from the product-sum operation unit 202d is expressed as the first term in the expression (2).
As the product-sum operation unit 202d, it is possible to adopt, for example, product-sum operation circuitry configured to receive input of multibit data as shown in
A correction value calculator 203d outputs a correction value expressed as the second term in the expression (2), based on the input xn,ky,kx and the correction values bw′m0 to bw′mL-1 input from the memory 201.
The adder 204 adds together the output from the product-sum operation unit 202d (the first term in the expression (2)) and the output from the correction value calculator 203d (the second term in the expression (2)) to output ym,r,c.
The processing for product-sum operations by the product-sum operation unit 202d, the processing for correction value calculation by the correction value calculator 203d, and the processing for addition by the adder 204 may be software processing for implementation by a processor, or hardware processing for implementation by product-sum logical operation circuitry.
The output from the adder 204 is given as ym,r,c as indicated by the expression (1).
As discussed for the first to fourth embodiments, the product-sum operation units 202a to 202d each receive data input of the bit width Bwm or Bw′m, which is different for each filter m. In the description of the fifth embodiment, a series of data processing for the data x and w, input from the memory to the product-sum operation circuitry and differing in bit width Bw for each filter m, will be explained.
As shown in
The data X is expressed in a matrix form with t rows and r columns, and the data W is expressed in a matrix form with m rows and t columns (t, r, and m each being 0 or a positive integer). The embodiment will assume t to be time (read cycle).
The two matrices will be given as:
$$W = \{w_{m,t}\}_{0 \le m \le M-1,\ 0 \le t \le T-1}, \text{ and}$$
$$X = \{x_{t,r}\}_{0 \le t \le T-1,\ 0 \le r \le R-1},$$
in which T−1 is the maximum value of read cycles, R−1 is the maximum column number of the matrix data X, and M−1 is the maximum row number of the matrix data W.
The product-sum operation circuitry 1 performs a matrix operation using the two data items (W, X) input from the memory 2, and outputs the operation result to the post-processing circuitry 3. More specifically, the product-sum operation circuitry 1 includes a plurality of processing elements arranged in an array and each including a multiplier and an accumulator.
Assuming that a matrix to be calculated is Y=WX, the operation for each element of Y={ym,r}0≤m≤M−1, 0≤r≤R−1 takes a product-sum form as follows.
$$y_{m,r} = \sum_{t=0}^{T-1} w_{m,t}\, x_{t,r} \tag{3}$$
The product-sum operation circuitry 1 accordingly outputs the result of the product-sum operation to the post-processing circuitry 3.
The memory 2 may have any configuration as long as it is a semiconductor memory, such as an SRAM, a DRAM, an SDRAM, a NAND flash memory, a three-dimensionally designed flash memory, an MRAM, a register, a latch circuit, or the like.
The post-processing circuitry 3 performs an operation to the output from the product-sum operation circuitry 1, which includes the output of each arithmetic operator at time T−1 corresponding to an m-th row and an r-th column, using a predetermined coefficient settable for each processing element. The post-processing circuitry 3 then puts an output index to the operation result and outputs it to a processor 5. In these actions, the post-processing circuitry 3 acquires the predetermined coefficient and the output index from a lookup table (LUT) 4 as necessary.
If the post-processing is not required, the post-processing circuitry 3 may be omitted, and the output from the product-sum operation circuitry 1 may be supplied to the processor 5.
The LUT 4 stores the predetermined coefficients and the output indexes for the respective processing elements in the product-sum operation circuitry 1. The LUT 4 may be storage circuitry.
The processor 5 receives results of the product-sum operations of the respective processing elements after the processing by the post-processing circuitry 3. The processor 5 is capable of setting the predetermined coefficients and the output indexes to be stored in the LUT 4 and set for the respective processing elements.
[First Exemplary Product-Sum Operation Circuitry (Multibit Case 1: Product-Sum Operation Circuitry When Input Data wm,t and xt,r are 3 Bits)]
For example, assuming that the product-sum operation unit 202a according to the first embodiment is applied, the product-sum operation circuitry 1a of
The multiplier 21 in each of the processing elements ub0,0 to ub2,2 includes a first input terminal and a second input terminal. The first input terminal of the multiplier 21 in a processing element ubm,r is coupled to a data line that is common to the other processing elements arranged on the m-th row, and the second input terminal is coupled to a data line that is common to the other processing elements arranged on the r-th column.
In other words, first inputs which are supplied to the first input terminals of certain multipliers 21 (among all the processing elements ubm,r) share the data line for data wm,t in the row direction, and second inputs which are supplied to the second input terminals of certain multipliers 21 share the data line for data xt,r in the column direction.
As such, at time t, the first inputs to the multipliers 21 in the processing elements ub0,0, ub0,1, and ub0,2 share the value of data w(2)0,t, the first inputs to the multipliers 21 in the processing elements ub1,0, ub1,1, and ub1,2 share the value of data w(1)0,t, and the first inputs to the multipliers 21 in the processing elements ub2,0, ub2,1, and ub2,2 share the value of data w(0)0,t.
Similarly, at the time t, the second inputs to the multipliers 21 in the processing elements ub0,0, ub1,0, and ub2,0 share the value of data x(2)t,0, the second inputs to the multipliers 21 in the processing elements ub0,1, ub1,1, and ub2,1 share the value of data x(1)t,0, and the second inputs to the multipliers 21 in the processing elements ub0,2, ub1,2, and ub2,2 share the value of data x(0)t,0.
The multiplier 21 in each of the processing elements ub0,0 to ub2,2 multiplies data of the first input by data of the second input, and outputs the multiplication result to the adder 12.
Accordingly, the multipliers 21 in the processing elements ub0,0, ub0,1, and ub0,2 at the time t output the respective multiplication results (i.e. the results of multiplying the data w(2)0,t of the first input by the data x(2)t,0, x(1)t,0, and x(0)t,0 of the second input, respectively).
Also, the multipliers 21 in the processing elements ub0,0, ub1,0, and ub2,0 at the time t output the respective multiplication results (i.e. the results of multiplying the data x(2)t,0 of the second input by the data w(2)0,t, w(1)0,t, and w(0)0,t of the first input, respectively).
The adder 12 and the register 13 in each of the processing elements ub0,0 to ub2,2 constitute an accumulator. In each of the processing elements ub0,0 to ub2,2, the adder 12 adds together the multiplication result given from the multiplier 21 and the value at time t−1 (one cycle prior to the time t) that the register 13 is holding (value of the accumulator).
The register 13 holds the time t−1 multiplication result given via the adder 12, and retains the addition result output from the adder 12 at the cycle of time t.
In this manner, 3×3 processing elements are arrayed in parallel, and at time t, data wm,t is input to the r processing elements ub arranged on the m-th row and data xt,r is input to the m processing elements arranged on the r-th column. Accordingly, at the time t, the processing element at the m-th row and the r-th column performs the calculation expressed as:
$$y_{m,r,t} = y_{m,r,t-1} + w_{m,t} \times x_{t,r} \tag{4}$$
in which ym,r,t represents the value newly stored at the time t in the register 13 in the processing element ubm,r. Consequently, the arithmetic operations according to the expression (1) are finished in T cycles. That is, the matrix product Y=W×X can be calculated by the 3×3 processing elements each calculating ym,r over the T cycles.
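The cycle-by-cycle accumulation of expression (4) can be sketched in software as follows. This is only a behavioral model under the assumption that every register starts at zero; the function name `array_matmul` and the demo matrices are hypothetical.

```python
def array_matmul(W, X):
    """Model of an M x R array of multiply-accumulate elements.

    W: M x T nested lists, X: T x R nested lists.
    At cycle t, row m receives W[m][t] and column r receives X[t][r].
    """
    M, T, R = len(W), len(X), len(X[0])
    acc = [[0] * R for _ in range(M)]  # register of each element, assumed cleared
    for t in range(T):                 # one read cycle per step
        for m in range(M):
            for r in range(R):
                # Expression (4): y[m][r] at t = y[m][r] at t-1 + w[m][t] * x[t][r]
                acc[m][r] += W[m][t] * X[t][r]
    return acc


W_demo = [[1, 2], [3, 4]]
X_demo = [[5, 6], [7, 8]]
Y_demo = array_matmul(W_demo, X_demo)
```

In hardware the two inner loops run in parallel across the array, so only the loop over t costs time.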
The time t value in the register 13 in each processing element ubm,r is output to the post-processing circuitry 3. The processing elements ub0,0 to ub2,2 may be configured as follows.
In each processing element ubm,r within the product-sum operation circuitry 1a, the multiplier 21 as an AND logic gate receives two 1-bit inputs, namely, 1-bit data wm,t and 1-bit data xt,r. The multiplier 21 provides a 1-bit output, namely, an AND logic value based on the data wm,t and xt,r.
The adder 12 receives a 1-bit input, which is the 1-bit output data from the multiplier 21. The other input to the adder 12 consists of multiple bits from the register 13. That is, a time t−1 multibit value in the register 13 is input to the adder 12. The adder 12 provides multibit output data that corresponds to a sum of the 1-bit output data from the multiplier 21 and the time t−1 multibit value in the register 13.
The register 13 receives a multibit input. That is, the register 13 retains the multibit output data from the adder 12, which has been obtained at the adder 12 by addition of the 1-bit output data given from the multiplier 21 at time t. The values at time T (cycles) in the respective registers 13 in the processing elements ubm,r of the product-sum operation circuitry 1a are output to the post-processing circuitry 3.
The output from each processing element ubm,r in the product-sum operation circuitry 1a is supplied to the post-processing circuitry 3.
Note that the multipliers 21 have been implemented as AND logic gates on the assumption that the 1-bit data items wm,t and xt,r are expressed as “(1,0)”. If the data items wm,t and xt,r are expressed as “(+1, −1)”, the multipliers 21 are replaced by XNOR logic gates.
Also, each processing element ubm,r may include the AND logic gate, an XNOR logic gate (not shown), and a selection circuit (not shown) that is adapted to select the AND logic gate or the XNOR logic gate according to the setting of the register.
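The choice between the two gates can be verified exhaustively with a small sketch. The mapping of the stored bit 1 to +1 and the stored bit 0 to −1 is an assumption; the text only states that the values are expressed as (+1, −1). All function names are hypothetical.

```python
def and_mul(a, b):
    """1-bit multiply when the bits stand for the values 0 and 1."""
    return a & b


def xnor_mul(a, b):
    """1-bit multiply when the bits stand for +1 and -1 (result is again a bit)."""
    return 1 - (a ^ b)


def decode_pm1(bit_value):
    """Assumed encoding: bit 1 means +1, bit 0 means -1."""
    return 2 * bit_value - 1


and_ok = all(and_mul(a, b) == a * b for a in (0, 1) for b in (0, 1))
xnor_ok = all(
    decode_pm1(xnor_mul(a, b)) == decode_pm1(a) * decode_pm1(b)
    for a in (0, 1)
    for b in (0, 1)
)
```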
Moreover, while the accumulator of a 1-bit input type may be constituted by the adder 12 and the register 13 as shown in
As shown in
Also, the value at the 0th bit (LSB) of the data xt,0 is input to a data line for the data xt,0(0), the value at the 1st bit of the data xt,0 is input to a data line for the data xt,0(1), and the value at the 2nd bit (MSB) of the data xt,0 is input to a data line for the data xt,0(2).
For example, if the data w0,t is 3-bit data expressed as “011b” at time t, “1” is input to the data line for the data w0,t(0), “1” is input to the data line for the data w0,t(1), and “0” is input to the data line for the data w0,t(2).
Also, if the data xt,0 is 3-bit data expressed as “110b” at the time t, “0” is input to the data line for the data xt,0(0), “1” is input to the data line for the data xt,0(1), and “1” is input to the data line for the data xt,0(2).
That is, when the data wm,t and xt,r are each 3-bit data, they may be expressed as below. Here, however, the description will focus only on one element of the output, and will omit the indices m and r as used in the foregoing descriptions. The values of wt(2), etc., are all 1-bit values (0 or 1).
$$w_t = w_t^{(2)} \times 2^2 + w_t^{(1)} \times 2^1 + w_t^{(0)} \times 2^0 \tag{5}$$
$$x_t = x_t^{(2)} \times 2^2 + x_t^{(1)} \times 2^1 + x_t^{(0)} \times 2^0 \tag{6}$$
In this instance, the expression (3) becomes the following.
Looking at the expression (7), the first horizontally-given three sigmas use w(t)(2), the second horizontally-given three sigmas use w(t)(1), and the third horizontally-given three sigmas use w(t)(0). Also, the first vertically-given three sigmas use x(t)(2), the second vertically-given three sigmas use x(t)(1), and the third vertically-given three sigmas use x(t)(0). As such, the configurations of the processing elements ub0,0 to ub2,2 shown in
The output of each of the processing elements ub0,0 to ub2,2 is supplied to the post-processing circuitry 3. In the post-processing circuitry 3, a final result of the multibit product-sum operation is obtained by multiplying the sigmas by their respective corresponding power-of-two coefficients and summing them. The processing of the power-of-two coefficient multiplications in the post-processing circuitry 3 may be easily performed through shift operations.
In many instances, including instances with deep neural networks, T is a relatively large value that exceeds 100. Accordingly, the processing of multiplying the results of the 1-bit product-sum operations (the sigma terms) by their respective power-of-two coefficients and summing them in the end (that is, the post-processing) is performed relatively infrequently. The way in which the post-processing is performed may be discretionarily selected. For example, it may be performed in a sequential manner.
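The decomposition into nine 1-bit product-sums followed by power-of-two post-processing can be sketched as below for unsigned 3-bit data. The function names and the sample vectors are hypothetical; the first pair (011b and 110b) matches the example values given above.

```python
def bit(value, i):
    """Bit i of a non-negative integer."""
    return (value >> i) & 1


def bitplane_dot(w, x, bits=3):
    """Dot product of two unsigned vectors via 1-bit product-sums.

    partial[i][j] plays the role of one processing element: it accumulates
    AND(w_t bit i, x_t bit j) over all cycles t.
    """
    partial = [
        [sum(bit(wt, i) & bit(xt, j) for wt, xt in zip(w, x)) for j in range(bits)]
        for i in range(bits)
    ]
    # Post-processing: weight each accumulator by 2**(i + j) using a shift.
    return sum(partial[i][j] << (i + j) for i in range(bits) for j in range(bits))


w_demo = [0b011, 5, 7, 0]
x_demo = [0b110, 1, 2, 7]
result = bitplane_dot(w_demo, x_demo)
direct = sum(wt * xt for wt, xt in zip(w_demo, x_demo))
```

The shifts run once per output rather than once per cycle, which is why the post-processing cost stays small when T is large.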
Dealing with Negatives
Assuming that the data values are handled in two's complement representation, the expressions (5) and (6) are given as the following expressions (5′) and (6′).
$$w_t = -w_t^{(2)} \times 2^2 + w_t^{(1)} \times 2^1 + w_t^{(0)} \times 2^0 \tag{5$'$}$$
$$x_t = -x_t^{(2)} \times 2^2 + x_t^{(1)} \times 2^1 + x_t^{(0)} \times 2^0 \tag{6$'$}$$
In this instance, the expression (7) becomes the following.
That is, it is sufficient to change the coefficient to negative at the post-processing in the post-processing circuitry 3, and therefore, the configurations similar to
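A sketch of the signed case follows. It keeps the same 1-bit accumulations and only changes the sign of the post-processing coefficient when exactly one of the two bits is a sign bit, which is what expressions (5′) and (6′) imply when multiplied out. The function names and sample values are hypothetical, and the bit extraction relies on Python treating negative integers as two's complement under shifts and masks.

```python
def bit(value, i):
    """Bit i of the two's complement form; valid for negative Python ints."""
    return (value >> i) & 1


def signed_bitplane_dot(w, x, bits=3):
    """Dot product of two's complement vectors via 1-bit product-sums."""
    msb = bits - 1
    total = 0
    for i in range(bits):
        for j in range(bits):
            partial = sum(bit(wt, i) & bit(xt, j) for wt, xt in zip(w, x))
            # The MSB carries weight -2**msb, so the coefficient is negative
            # when exactly one of the two bit positions is the MSB.
            sign = -1 if (i == msb) != (j == msb) else 1
            total += sign * (partial << (i + j))
    return total


w_demo = [-4, 3, -1, 2]   # all values fit in 3-bit two's complement (-4..3)
x_demo = [3, -4, -1, -2]
result = signed_bitplane_dot(w_demo, x_demo)
direct = sum(wt * xt for wt, xt in zip(w_demo, x_demo))
```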
[Second Exemplary Product-Sum Operation Circuitry (Multibit Case 2: Product-Sum Operation Circuitry When Input Data wm,t Involves Different Bits and xt,r is 4 Bits)]
Next, second exemplary product-sum operation circuitry will be described.
The second exemplary product-sum operation circuitry adopts a configuration of a 16×16-operator array.
The description will assume that input data X is a matrix of 32 rows and 4 columns, in which every element is expressed by 4 bits. Input data W is assumed to be a matrix of 15 rows and 32 columns, in which the bit widths of the respective rows are {1, 2, 4, 2, 2, 1, 2, 3, 2, 2, 3, 2, 1, 3, 2}; that is, in this example, the 32 elements on the 0th row are each 1 bit, the 32 elements on the 1st row are each 2 bits, the 32 elements on the 2nd row are each 4 bits, the 32 elements on the 3rd row are each 2 bits, and so on.
For example, referring to processing elements shown in
The matrix product Y=WX will be a matrix of 15 rows and 4 columns.
As shown in
The value of t is initially 0, and incremented by one for each cycle until it reaches 31. For example, assuming that y(um,r) is the accumulator's output from a processing element um,r, the values of y(u0,0) to y(u0,3) included in y0,0 after 32 cycles are given by the following expressions (8).
$$y(u_{0,0}) = \sum_{t=0}^{31} w_{0,t}^{(0)}\, x_{t,0}^{(3)}$$
$$y(u_{0,1}) = \sum_{t=0}^{31} w_{0,t}^{(0)}\, x_{t,0}^{(2)}$$
$$y(u_{0,2}) = \sum_{t=0}^{31} w_{0,t}^{(0)}\, x_{t,0}^{(1)}$$
$$y(u_{0,3}) = \sum_{t=0}^{31} w_{0,t}^{(0)}\, x_{t,0}^{(0)}$$
By performing the following arithmetic operation on them in the post-processing circuitry 3, y0,0 can be obtained.
$$y_{0,0} = 2^3 \times y(u_{0,0}) + 2^2 \times y(u_{0,1}) + 2^1 \times y(u_{0,2}) + 2^0 \times y(u_{0,3})$$
Similarly, the values of y(u1,0) to y(u2,3) included in y1,0 after 32 cycles are given by the following expressions (9).
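$$y(u_{1,0}) = \sum_{t=0}^{31} w_{1,t}^{(1)}\, x_{t,0}^{(3)}$$
$$y(u_{1,1}) = \sum_{t=0}^{31} w_{1,t}^{(1)}\, x_{t,0}^{(2)}$$
$$y(u_{1,2}) = \sum_{t=0}^{31} w_{1,t}^{(1)}\, x_{t,0}^{(1)}$$
$$y(u_{1,3}) = \sum_{t=0}^{31} w_{1,t}^{(1)}\, x_{t,0}^{(0)}$$
$$y(u_{2,0}) = \sum_{t=0}^{31} w_{1,t}^{(0)}\, x_{t,0}^{(3)}$$
$$y(u_{2,1}) = \sum_{t=0}^{31} w_{1,t}^{(0)}\, x_{t,0}^{(2)}$$
$$y(u_{2,2}) = \sum_{t=0}^{31} w_{1,t}^{(0)}\, x_{t,0}^{(1)}$$
$$y(u_{2,3}) = \sum_{t=0}^{31} w_{1,t}^{(0)}\, x_{t,0}^{(0)}$$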
Using these, y1,0 can be calculated as follows.
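$$y_{1,0} = 2^4 \times y(u_{1,0}) + 2^3 \times y(u_{1,1}) + 2^2 \times y(u_{1,2}) + 2^1 \times y(u_{1,3}) + 2^3 \times y(u_{2,0}) + 2^2 \times y(u_{2,1}) + 2^1 \times y(u_{2,2}) + 2^0 \times y(u_{2,3}) \tag{10}$$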
As such, applicable values of the coefficients (powers of two), as well as correspondences (indexes) to the output elements are different for the respective results from the processing elements um,r. For example, the coefficient values and the output indexes may be set as follows.
y(u0,0): coefficient=$2^3$, output index=(0,0)
y(u0,1): coefficient=$2^2$, output index=(0,0)
y(u0,2): coefficient=$2^1$, output index=(0,0)
y(u0,3): coefficient=$2^0$, output index=(0,0)
y(u1,0): coefficient=$2^4$, output index=(1,0)
y(u1,1): coefficient=$2^3$, output index=(1,0)
y(u1,2): coefficient=$2^2$, output index=(1,0)
y(u1,3): coefficient=$2^1$, output index=(1,0)
y(u2,0): coefficient=$2^3$, output index=(1,0)
y(u2,1): coefficient=$2^2$, output index=(1,0)
y(u2,2): coefficient=$2^1$, output index=(1,0)
y(u2,3): coefficient=$2^0$, output index=(1,0) (11)
Thus, the embodiment adopts the LUT 4 that stores coefficients and output indexes addressed to “m,r”.
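One way such a table could be populated from the bit widths is sketched below. This is an illustration under stated assumptions, not the embodiment's actual LUT layout: weight bits are assumed to be packed onto array rows most significant bit first, r=0 is assumed to hold the most significant input bit, only column 0 of X is covered, and `build_luts` is a hypothetical helper.

```python
W_BITS = [1, 2, 4, 2, 2, 1, 2, 3, 2, 2, 3, 2, 1, 3, 2]  # bit width of each row of W
X_BITS = 4        # bit width of every element of X
ARRAY_ROWS = 16   # assumed number of processing-element rows per pass


def build_luts(w_bits, x_bits, array_rows, column=0):
    """Return one table per pass mapping (m, r) -> (exponent, output index).

    The stored coefficient is 2 ** exponent. Weight bits are packed onto
    array rows MSB first, and r = 0 holds the most significant input bit.
    """
    luts = []
    slot = 0  # running array-row counter across passes
    for out_row, bits in enumerate(w_bits):
        for w_bit in range(bits - 1, -1, -1):
            pass_index, m = divmod(slot, array_rows)
            if pass_index == len(luts):
                luts.append({})
            for r in range(x_bits):
                x_bit = x_bits - 1 - r
                luts[pass_index][(m, r)] = (w_bit + x_bit, (out_row, column))
            slot += 1
    return luts


luts = build_luts(W_BITS, X_BITS, ARRAY_ROWS)
for m in range(3):
    for r in range(X_BITS):
        exponent, index = luts[0][(m, r)]
        print(f"y(u{m},{r}): coefficient=2^{exponent}, output index={index}")
```

The exponent is simply the weight-bit position plus the input-bit position, which is why the entries for m=1 and m=2 differ by one power of two while sharing the same output index.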
As shown in
Turning back to
y(u14,0): coefficient=$2^{5}$, output index=(7,0)
y(u14,1): coefficient=$2^{4}$, output index=(7,0)
y(u14,2): coefficient=$2^{3}$, output index=(7,0)
y(u14,3): coefficient=$2^{2}$, output index=(7,0)
y(u15,0): coefficient=$2^{4}$, output index=(7,0)
y(u15,1): coefficient=$2^{3}$, output index=(7,0)
y(u15,2): coefficient=$2^{2}$, output index=(7,0)
y(u15,3): coefficient=$2^{1}$, output index=(7,0) (12)
Therefore, y7,0 has a value given by the following.
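$$
\begin{aligned}
y_{7,0} ={}& 2^{5} \times y(u_{14,0}) + 2^{4} \times y(u_{14,1}) + 2^{3} \times y(u_{14,2}) + 2^{2} \times y(u_{14,3}) \\
&+ 2^{4} \times y(u_{15,0}) + 2^{3} \times y(u_{15,1}) + 2^{2} \times y(u_{15,2}) + 2^{1} \times y(u_{15,3})
\end{aligned}
\tag{13}
$$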
The remaining 1 bit is handled after the completion of the operation shown in
y(u0,0): coefficient=$2^{3}$, output index=(7,0)
y(u0,1): coefficient=$2^{2}$, output index=(7,0)
y(u0,2): coefficient=$2^{1}$, output index=(7,0)
y(u0,3): coefficient=$2^{0}$, output index=(7,0)
The post-processing with these values, according to the algorithm based on the coefficients and the output indexes, will give the following expression (14) incorporating the expression (13).
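$$
\begin{aligned}
y_{7,0} ={}& 2^{5} \times y(u_{14,0}) + 2^{4} \times y(u_{14,1}) + 2^{3} \times y(u_{14,2}) + 2^{2} \times y(u_{14,3}) \\
&+ 2^{4} \times y(u_{15,0}) + 2^{3} \times y(u_{15,1}) + 2^{2} \times y(u_{15,2}) + 2^{1} \times y(u_{15,3}) \\
&+ 2^{3} \times y(u_{0,0}) + 2^{2} \times y(u_{0,1}) + 2^{1} \times y(u_{0,2}) + 2^{0} \times y(u_{0,3})
\end{aligned}
\tag{14}
$$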
This completes the calculation for y7,0, which was incomplete at the processing shown in
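The two-stage accumulation of y7,0 can be reproduced in a few lines. This is a sketch under assumptions: values are unsigned, the test data are random, the coefficient exponents are copied from the lists in the text, and the mapping `weight_bit` (which weight bit each array row holds) is inferred from those coefficients rather than stated in the source.

```python
import random

random.seed(1)
T = 32
w7 = [random.randint(0, 7) for _ in range(T)]    # row 7 of W: 3-bit weights
x0 = [random.randint(0, 15) for _ in range(T)]   # column 0 of X: 4-bit inputs


def accumulate(w_bit, x_bit):
    """1 bit x 1 bit product-sum over t for one weight bit and one input bit."""
    return sum(((w7[t] >> w_bit) & 1) & ((x0[t] >> x_bit) & 1)
               for t in range(T))


# (m, r) -> exponent of the coefficient, as listed in the text.
first_pass = {(14, r): 5 - r for r in range(4)}
first_pass.update({(15, r): 4 - r for r in range(4)})
second_pass = {(0, r): 3 - r for r in range(4)}

# Weight bit held by each array row (assumed MSB-first packing).
weight_bit = {14: 2, 15: 1, 0: 0}


def post_process(lut):
    """Scale every accumulator by its coefficient and sum the results."""
    return sum((1 << exponent) * accumulate(weight_bit[m], 3 - r)
               for (m, r), exponent in lut.items())


partial = post_process(first_pass)          # corresponds to expression (13)
y70 = partial + post_process(second_pass)   # corresponds to expression (14)
direct = sum(w * x for w, x in zip(w7, x0))
print(partial, y70, y70 == direct)
```

Because both stages write to the same output index (7,0), the second stage only has to add its contribution to the partial result already stored there.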
As shown in
It is then determined whether or not all the post-processing operations for the accumulator outputs from the processing elements u0,0 to u15,15, up to time t=31, have been finished (step S3). If it is determined that not all of the post-processing operations have been finished (NO in step S3), the post-processing circuitry 3 returns to step S1 and performs the remaining post-processing operations for the accumulator outputs from the processing elements u0,0 to u15,15, for the time t=1 and onward.
On the other hand, if it is determined in step S3 that all the post-processing operations for the accumulator outputs from the processing elements u0,0 to u15,15 up to time t=31 have been finished (YES in step S3), the post-processing circuitry 3 sends the result of the post-processing operations to the processor 5 (step S4), and terminates the processing.
With the configuration of the product-sum operation circuitry 1 for the information processing apparatus 100 according to the embodiments, it is possible to reduce the data transfers from the memory, such as an SRAM, to the operator array of the product-sum operation circuitry 1. Consequently, the data processing by the information processing apparatus 100 can be realized with an improved efficiency.
When M×R processing elements are arrayed in parallel and each performs T product-sum operations, the total number of product-sum operations is M×R×T. Supposing that the apparatus has only one processing element, 2×M×R×T data transfers are required in total, since two data items need to be transferred from the memory to the processing element each time a product-sum operation is performed. In the configuration according to the embodiment shown in
With the information processing apparatus 100 according to the embodiments in the first and second exemplary multibit cases, suitable coefficients and output indexes are set in the LUT 4 in accordance with the bit widths of the input data X and W, and the post-processing algorithms are applied as discussed above. Thus, the data X and W can be processed even when they are of various bit numbers differing from each other.
Also, the embodiments can duly deal with instances where one value must be segmented, such as the value y7 in the second exemplary case. The embodiments can thus make full use of the operator array without idle resources, which contributes to improved efficiency and a higher processing speed of the processing elements.
For example, a semiconductor device that adopts parallel operations of multiple 1-bit processing elements is not capable of coping with the demand for an accuracy level of 2 or more bits. In contrast, the 1 bit×1 bit product-sum operations in the first and second exemplary cases of the embodiments enable comparably high-speed processing while being capable of coping with multibit inputs.
The embodiments further contrast with multibit×multibit-dedicated circuitry (e.g., a GPU). Note that when processing elements are each adapted for multibit×multibit operations, the circuit size of one processing element is larger than that of a processing element for 1 bit×1 bit operations.
Provided that the same parallel number and the same processing time for one operation of processing elements are set, the product-sum operation circuitry in the first and second exemplary cases of the embodiments has a smaller circuit size for performing 1 bit×1 bit product-sum operations while having the same processing speed.
In other words, using multibit×multibit-dedicated processing elements for performing 1 bit×1 bit operations involves idle circuits. This means that resources are largely wasted and efficiency is sacrificed.
For example, when there are 16×16 processing elements, 16×16=256 parallel operations can be performed as 1 bit×1 bit product-sum operations. Using the same configuration, (16/4)×(16/4)=16 parallel operations can be performed as 4 bits×4 bits product-sum operations. Also, the two matrices do not need to have the same bit widths, and it is possible to perform, for example, (16/2)×(16/8)=16 parallel operations as 2 bits×8 bits product-sum operations.
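The parallelism figures quoted above follow from a single division per array dimension. The helper `parallel_ops` below is an illustrative name, and which operand's bit width is assigned to rows versus columns is an assumption made for the sketch.

```python
def parallel_ops(rows, cols, a_bits, b_bits):
    """Number of a_bits x b_bits product-sums a rows x cols 1-bit array runs at once."""
    return (rows // a_bits) * (cols // b_bits)


one_by_one = parallel_ops(16, 16, 1, 1)
four_by_four = parallel_ops(16, 16, 4, 4)
two_by_eight = parallel_ops(16, 16, 2, 8)
print(one_by_one, four_by_four, two_by_eight)
```

Integer division is used so that bit widths that do not divide the array dimension evenly are rounded down to whole operations.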
The first and second exemplary cases of the embodiments eliminate the idle resources noted above by allowing all the processing elements to be used irrespective of the bit widths of input data. For multibit×multibit product-sum operations, however, the embodiments require multiple processing elements to handle a calculation that one multibit×multibit-dedicated processing element performs alone. As such, on the condition that the same parallel number is set, the product-sum operation circuitry in the first and second exemplary cases of the embodiments, which effectively has a smaller parallel number for such operations, operates at a lower processing speed than circuitry built from multibit×multibit-dedicated processing elements.
However, the embodiments can have a smaller circuit size for one processing element as compared to a multibit×multibit-dedicated processing element. Accordingly, the embodiments can have a larger parallel number for processing elements when the size of the entire circuitry is the same.
Ultimately, the embodiments provide a higher processing speed when the bit widths of input data are small, while providing a lower processing speed when the bit widths of input data are large. Despite this, in most instances (for example, in the processing for deep learning where the desired bit widths of input data can vary depending on layer), small bit widths are sufficient and large bit widths are only required for a limited part. Therefore, assuming that operations using input data with small bit widths account for the larger part, the information processing apparatus 100 according to the embodiments provides a higher processing speed as a whole.
While certain embodiments have been described, they have been presented by way of example only, and they are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be worked in a variety of other forms. Furthermore, various omissions, substitutions, and changes in such forms of the embodiments may be made without departing from the gist of the inventions. The embodiments and their modifications are covered by the accompanying claims and their equivalents, as would fall within the scope and gist of the inventions.
Number | Date | Country | Kind |
---|---|---|---|
2018-136714 | Jul 2018 | JP | national |