This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2018-054284, filed on Mar. 22, 2018, the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to a semiconductor device.
A semiconductor device that performs matrix-matrix multiplications (hereinafter, “matrix product calculations”) is already known. A matrix product calculation is a basic arithmetic operation that is essential to digital signal processing, e.g., image processing. Matrix product calculations also account for a major part of the calculating operations in deep neural networks. The accuracy required of the matrix product calculations in deep neural networks may be low when compared to matrix product calculations in other instances. The accuracy required of the matrix product calculations also varies within a deep neural network, depending on the situation in which they are applied.
For example, it is already known that the accuracy required of the matrix product calculation for inferring is lower than that for learning (or training). Besides, even in the same type of inferring behaviors, the required accuracy may differ depending on the layers of a deep neural network. The required accuracy within the same layer could often differ as well, depending on arithmetic operations.
On the other hand, a matrix product calculation proceeds through the combination of multiplications and additions which are independent of one another, and as such, a matrix product calculation can be easily parallelized. Therefore, parallel processing by multiple arithmetic operators would be one effective option for the improvement of efficiency including reduction of power, acceleration of calculation speeds, and so on.
According to an embodiment, there is provided a semiconductor device comprising a plurality of operation circuits each comprising a multiplier including a first input terminal and a second input terminal and configured to calculate a product of a value input via the first input terminal and a value input via the second input terminal, and an accumulator configured to integrate an output of the multiplier and output an integrated value that is obtained by integrating output values of the multiplier. The plurality of operation circuits are divided into groups by two manners, where by the first manner multiple operation circuits in each of the groups are configured to receive a common first value via the respective first input terminals, and by the second manner multiple operation circuits in each of the groups are configured to receive a common second value via the respective second input terminals.
Embodiments will be described with reference to the drawings.
1 Configuration of Semiconductor Device
As shown in
The data X is expressed in a matrix form with t rows and r columns, and the data W is expressed in a matrix form with m rows and t columns (t, r, and m each being 0 or a positive integer). The embodiment will assume t to be time (read cycle).
The two matrices will be given as:
$W = \{w_{m,t}\}_{0 \le m \le M-1,\; 0 \le t \le T-1}$, and
$X = \{x_{t,r}\}_{0 \le t \le T-1,\; 0 \le r \le R-1}$,
where T-1 is the maximum value of read cycles, R-1 is the maximum column number of the matrix of the data X, and M-1 is the maximum row number of the matrix of the data W.
The product-sum operation circuitry 1 performs a matrix operation using the two data items (W, X) input from the memory 2, and outputs the operation result to the post-processing circuitry 3. More specifically, the product-sum operation circuitry 1 includes a plurality of arithmetic operators arranged in an array and each including a multiplier and an accumulator.
Assuming that a matrix to be calculated is Y=WX, the operation for each element of Y={ym,r}0≤m≤M-1, 0≤r≤R-1 takes a product-sum form as an expression (1) shown in
The product-sum operation circuitry 1 accordingly outputs the result of the product-sum operation to the post-processing circuitry 3.
The memory 2 may have any configuration as long as it is a semiconductor memory, such as an SRAM, a DRAM, an SDRAM, a NAND flash memory, a three-dimensionally designed flash memory, an MRAM, a register, a latch circuit, or the like.
The post-processing circuitry 3 performs an operation to the output from the product-sum operation circuitry 1, which includes the output of each arithmetic operator at time T-1 corresponding to an m-th row and an r-th column, using a particular coefficient settable to each arithmetic operator. The post-processing circuitry 3 then puts an output index to the operation result and outputs it to a processor 5. In these actions, the post-processing circuitry 3 acquires the particular coefficient and the output index from a lookup table (LUT) 4 as necessary.
If the post-processing is not required, the post-processing circuitry 3 may be omitted, and the output from the product-sum operation circuitry 1 may be supplied to the processor 5.
The LUT 4 stores the particular coefficients and the output indexes for the respective arithmetic operators in the product-sum operation circuitry 1. The LUT 4 may be storage circuitry.
The processor 5 receives results of the product-sum operations of the respective arithmetic operators after the processing by the post-processing circuitry 3. The processor 5 is capable of setting the particular coefficients and the output indexes to be stored into the LUT 4 and set to the respective arithmetic operators.
2 Configuration of Product-Sum Operation Circuitry 1 and Operations of Semiconductor Device
2-1 First Exemplary Product-Sum Operation Circuitry (Product-Sum Operation Circuitry without Bit Limitations)
As shown in
The multiplier 11 in each of the arithmetic operators u0,0 to uM-1,R-1 includes a first input terminal and a second input terminal. The first input terminal of the multiplier 11 in an arithmetic operator um,r is coupled to a data line that is common to the other arithmetic operators arranged on the m-th row, and the second input terminal is coupled to a data line that is common to the other arithmetic operators arranged on the r-th column.
In other words, first inputs which are supplied to the first input terminals of certain multipliers 11 (among all the arithmetic operators um,r) share a data line for data wm,t in the row direction, and second inputs which are supplied to the second input terminals of certain multipliers 11 share a data line for data xt,r in the column direction.
As such, at time t, the first inputs to the multipliers 11 in the arithmetic operators u0,0, u0,1, . . . , u0,R-1 share a value of data w0,t, the first inputs to the multipliers 11 in the arithmetic operators u1,0, u1,1, . . . , u1,R-1 share a value of data w1,t, and so forth, so that the first inputs to the multipliers 11 in the arithmetic operators uM-1,0, uM-1,1, . . . , uM-1,R-1 share a value of data wM-1,t.
Similarly, at the time t, the second inputs to the multipliers 11 in the arithmetic operators u0,0, u1,0, . . . , uM-1,0 share a value of data xt,0, the second inputs to the multipliers 11 in the arithmetic operators u0,1, u1,1, . . . , uM-1,1 share a value of data xt,1, and so forth, so that the second inputs to the multipliers 11 in the arithmetic operators u0,R-1, u1,R-1, . . . , uM-1,R-1 share a value of data xt,R-1.
The multiplier 11 in each of the arithmetic operators u0,0 to uM-1,R-1 multiplies data of the first input by data of the second input, and outputs the multiplication result to the adder 12.
Accordingly, the multipliers 11 in the arithmetic operators u0,0, u0,1, . . . , u0,R-1 at the time t output the respective multiplication results (i.e. the results of multiplying the data w0,t of the first input by the data xt,0, xt,1, . . . , xt,R-1 of the second input, respectively).
Also, the multipliers 11 in the arithmetic operators u0,0, u1,0, . . . , uM-1,0 at the time t output the respective multiplication results (i.e. the results of multiplying the data xt,0 of the second input by the data w0,t, w1,t, . . . , wM-1,t of the first input, respectively).
The adder 12 and the register 13 in each of the arithmetic operators u0,0 to uM-1,R-1 constitute an accumulator. In each of the arithmetic operators u0,0 to uM-1,R-1, the adder 12 adds together the multiplication result given from the multiplier 11 and the value at time t-1 (one cycle prior to the time t) that the register 13 is holding (value of the accumulator).
The register 13 holds the accumulated value as of time t-1 that was given via the adder 12, and at the cycle of time t it retains the addition result output from the adder 12.
Namely, the accumulator in each of the arithmetic operators u0,0 to uM-1,R-1 is configured to integrate an output of the multiplier 11 and output an integrated value that is obtained by integrating output values of the multiplier 11.
In this manner, M×R arithmetic operators are arrayed in parallel, and at time t, data wm,t is input to the R arithmetic operators arranged on the m-th row and data xt,r is input to the M arithmetic operators arranged on the r-th column. Accordingly, at the time t, the arithmetic operator at the m-th row and the r-th column performs the calculation expressed as:
$y_{m,r,t} = y_{m,r,t-1} + w_{m,t} \times x_{t,r}$ (2)
where ym,r,t represents the value newly stored at the time t in the register 13 in the arithmetic operator um,r. Consequently, the arithmetic operations according to the expression (1) are finished in T cycles. That is, the matrix product Y=W×X can be calculated by the M×R arithmetic operators, each calculating ym,r over the T cycles.
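The cycle-by-cycle behavior described by the expression (2) can be illustrated with a minimal software sketch. The function name `mac_array_product` and the list-of-lists data layout are assumptions introduced here for illustration, as is the assumption that every register 13 is cleared before the first cycle. The sketch models only the arithmetic, not the circuitry.

```python
def mac_array_product(W, X):
    """Model an M x R array of multiply-accumulate operators over T cycles.

    W is an M x T matrix and X is a T x R matrix, both given as lists of rows.
    Returns the M x R register contents after the last cycle, i.e. Y = W X.
    """
    M, T, R = len(W), len(X), len(X[0])
    # One register per operator u[m][r]; assumed to be cleared before cycle 0.
    reg = [[0] * R for _ in range(M)]
    for t in range(T):
        for m in range(M):
            for r in range(R):
                # Row m shares w[m][t]; column r shares x[t][r] (expression (2)).
                reg[m][r] = reg[m][r] + W[m][t] * X[t][r]
    return reg


W = [[1, 2, 3], [4, 5, 6]]      # M = 2, T = 3
X = [[1, 0], [0, 1], [1, 1]]    # T = 3, R = 2
Y = mac_array_product(W, X)
```

In the circuitry, the two inner loops correspond to operators working in parallel, so only the loop over t corresponds to elapsed cycles.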
The time t value in the register 13 in each arithmetic operator um,r is output to the post-processing circuitry 3. Note that the output from each arithmetic operator um,r in the first exemplary product-sum operation circuitry 1 is supplied to the processor 5 without post-processing.
In other words, the plurality of arithmetic operators u0,0 to uM-1,R-1 (operation circuits) are divided into groups in two manners. By the first manner, multiple operation circuits in each of the groups (u0,0 to u0,R-1, u1,0 to u1,R-1, . . . , uM-1,0 to uM-1,R-1) are configured to receive a common first value via the respective first input terminals. By the second manner, multiple operation circuits in each of the groups (u0,0 to uM-1,0, u0,1 to uM-1,1, . . . , u0,R-1 to uM-1,R-1) are configured to receive a common second value via the respective second input terminals.
2-2 Second Exemplary Product-Sum Operation Circuitry (Product-Sum Operation Circuitry 1a Adopting 1-Bit Multipliers 11)
In the example shown by
In
The adder 12 receives a 1-bit input, which is the 1-bit output data from the AND logic gate 21. The other input to the adder 12 consists of multiple bits from the register 13. That is, a time t-1 multibit value in the register 13 is input to the adder 12. The adder 12 provides multibit output data that corresponds to a sum of the 1-bit output data from the AND logic gate 21 and the time t-1 multibit value in the register 13.
The register 13 receives a multibit input. That is, the register 13 retains the multibit output data from the adder 12, which has been obtained at the adder 12 by addition of the 1-bit output data given from the AND logic gate 21 at time t. The values at time T (cycles) in the respective registers 13 in the arithmetic operators uam,r of the product-sum operation circuitry 1a are output to the post-processing circuitry 3.
It should be noted that the output from each arithmetic operator uam,r in the second exemplary product-sum operation circuitry 1a is supplied to the processor 5 without post-processing.
Also, the AND logic gates 21 have been adopted on the assumption that the 1-bit data items wm,t and xt,r are expressed as “(1,0)”. If the data items wm,t and xt,r are expressed as “(+1, −1)”, the AND logic gates 21 are replaced by XNOR logic gates.
Each arithmetic operator uam,r may include the AND logic gate 21, an XNOR logic gate (not shown), and a selection circuit (not shown) that is adapted to select the AND logic gate 21 or the XNOR logic gate according to the setting value of the register.
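The choice between the AND logic gate and the XNOR logic gate can be checked with a short sketch. The function names below are hypothetical, and the encoding used for the “(+1, −1)” case (bit 1 standing for +1, bit 0 standing for −1) is an assumption made for this illustration.

```python
def and_product(w_bit, x_bit):
    """1-bit product when each bit directly encodes the value 1 or 0."""
    return w_bit & x_bit


def xnor_product(w_bit, x_bit):
    """1-bit product when bit 1 encodes +1 and bit 0 encodes -1 (assumed encoding)."""
    return 1 - (w_bit ^ x_bit)


def one_bit_multiply(w_bit, x_bit, use_xnor):
    """Model of a selection circuit: pick the gate according to a setting value."""
    return xnor_product(w_bit, x_bit) if use_xnor else and_product(w_bit, x_bit)


def decode_pm1(bit):
    """Map an encoded bit back to the value it stands for in the (+1, -1) case."""
    return 1 if bit else -1
```

Under the assumed encoding, decoding the XNOR output gives the same value as multiplying the two decoded inputs, which is why the gate can stand in for a multiplier.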
Moreover, while the accumulator of a 1-bit input type may be constituted by the adder 12 and the register 13 as shown in
2-3 Third Exemplary Product-Sum Operation Circuitry (Multibit Case 1: Product-Sum Operation Circuitry when Input Data wm,t and xt,r are 3 Bits)
As shown in
Also, the value at the 0th bit (LSB) of the data xt,0 is input to a data line for the data xt,0(0), the value at the 1st bit of the data xt,0 is input to a data line for the data xt,0(1), and the value at the 2nd bit (MSB) of the data xt,0 is input to a data line for the data xt,0(2).
For example, if the data w0,t is 3-bit data expressed as “011b” at time t, “1” is input to the data line for the data w0,t(0), “1” is input to the data line for the data w0,t(1), and “0” is input to the data line for the data w0,t(2).
Also, if the data xt,0 is 3-bit data expressed as “110b” at the time t, “0” is input to the data line for the data xt,0(0), “1” is input to the data line for the data xt,0(1), and “1” is input to the data line for the data xt,0(2).
That is, when the data wm,t and xt,r are each 3-bit data, they may be expressed as below. Here, however, the description will focus only on one element of the output, and will omit the indices m and r as used in the foregoing descriptions. The values of wt(2), etc., are all 1-bit values (0 or 1).
$w_t = w_t(2) \times 2^2 + w_t(1) \times 2^1 + w_t(0) \times 2^0$ (3)
$x_t = x_t(2) \times 2^2 + x_t(1) \times 2^1 + x_t(0) \times 2^0$ (4)
In this instance, the expression (1) becomes an expression (5) as shown in
Looking at the expression (5), the first horizontally-given three sigmas use w(t)(2), the second horizontally-given three sigmas use w(t)(1), and the third horizontally-given three sigmas use w(t)(0). Also, the first vertically-given three sigmas use x(t)(2), the second vertically-given three sigmas use x(t)(1), and the third vertically-given three sigmas use x(t)(0). As such, the configurations of the arithmetic operators ub0,0 to ub2,2 shown in
The output of each of the arithmetic operators ub0,0 to ub2,2 is supplied to the post-processing circuitry 3. In the post-processing circuitry 3, a final result of the multibit product-sum operation is obtained by multiplying the sigmas by their respective corresponding power-of-two coefficients and summing them. The processing of the power-of-two coefficient multiplications in the post-processing circuitry 3 may be easily performed through shift operations.
In many instances, including instances with deep neural networks, T is a relatively large value that exceeds 100. Accordingly, the processing of multiplying the 1-bit results of the product-sum operations of sigma terms by respective power-of-two coefficients and summing the sigmas in the end (that is, the post-processing) is not so frequently performed. The way in which the post-processing is performed may be discretionarily selected. For example, it may be performed in a sequential manner.
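The decomposition into 1-bit product-sums followed by shift-based post-processing can be sketched as follows. This is a minimal illustration that assumes the nine partial sums are the ones obtained by multiplying out the expressions (3) and (4); the function names are hypothetical.

```python
def bit(value, k):
    """Return the k-th bit (0 = LSB) of a non-negative integer."""
    return (value >> k) & 1


def bitplane_product_sum(w, x, bits=3):
    """Compute sum_t w[t] * x[t] for unsigned `bits`-bit values using 1-bit operators."""
    # partial[i][j] models one 1-bit operator accumulating w_t(i) AND x_t(j) over t.
    partial = [
        [sum(bit(wt, i) & bit(xt, j) for wt, xt in zip(w, x)) for j in range(bits)]
        for i in range(bits)
    ]
    # Post-processing: weight each partial sum by 2^(i+j) with a left shift, then add.
    return sum(partial[i][j] << (i + j) for i in range(bits) for j in range(bits))


w = [0b011, 0b101, 0b111, 0b000]
x = [0b110, 0b001, 0b010, 0b111]
result = bitplane_product_sum(w, x)
```

The shifts are applied once per output after all T cycles, which is consistent with the observation that the post-processing runs far less often than the accumulation.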
Dealing with Negatives
Assuming that the data values are handled in two's complement representation, the expressions (3) and (4) are given as follows.
$w_t = -w_t(2) \times 2^2 + w_t(1) \times 2^1 + w_t(0) \times 2^0$ (3′)
$x_t = -x_t(2) \times 2^2 + x_t(1) \times 2^1 + x_t(0) \times 2^0$ (4′)
In this instance, the expression (5) becomes an expression (5′) as shown in
That is, it is sufficient to change the coefficient to negative at the post-processing in the post-processing circuitry 3, and therefore, the configurations similar to
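A minimal sketch of the signed case is given below, assuming 3-bit two's complement inputs as in the expressions (3′) and (4′). The 1-bit accumulation is unchanged; only the coefficients applied at post-processing carry a sign. The function name is hypothetical.

```python
def signed_product_sum(w, x, bits=3):
    """Product-sum of two's-complement values using only 1-bit AND partial sums."""
    mask = (1 << bits) - 1

    def bit(value, k):
        # Masking first yields the two's-complement bit pattern of a negative int.
        return ((value & mask) >> k) & 1

    def weight(k):
        # The MSB carries a negative weight, as in expressions (3') and (4').
        return -(1 << k) if k == bits - 1 else (1 << k)

    total = 0
    for i in range(bits):
        for j in range(bits):
            # Same unsigned 1-bit accumulation as in the non-negative case.
            partial = sum(bit(wt, i) & bit(xt, j) for wt, xt in zip(w, x))
            # Only the post-processing coefficient changes sign.
            total += weight(i) * weight(j) * partial
    return total


w = [-4, 3, -1, 2]
x = [3, -4, -1, -2]
result = signed_product_sum(w, x)
```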
2-4 Fourth Exemplary Product-Sum Operation Circuitry (Multibit Case 2: Product-Sum Operation Circuitry when Input Data wm,t Involves Different Bits and xt,r is 4 Bits)
Next, fourth exemplary product-sum operation circuitry will be described.
The fourth exemplary product-sum operation circuitry adopts a configuration of a 16×16-operator array.
The description will assume that input data X is a matrix of 32 rows and 4 columns, in which every element is expressed by 4 bits. Input data W is assumed to be a matrix of 15 rows and 32 columns, in which the bit widths of the respective rows are {1, 2, 4, 2, 2, 1, 2, 3, 2, 2, 3, 2, 1, 3, 2}; that is, in this example, the 32 elements on the 0th row are each 1 bit, the 32 elements on the 1st row are each 2 bits, the 32 elements on the 2nd row are each 4 bits, the 32 elements on the 3rd row are each 2 bits, and so on.
The matrix product Y=WX will be a matrix of 15 rows and 4 columns.
As shown in
The value of t is initially 0, and incremented by one for each cycle until it reaches 31. For example, assuming that y(um,r) is the accumulator's output from an arithmetic operator um,r, the values of y(u0,0) to y(u0,3) included in y0,0 after 32 cycles are given by the following expressions (6).
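$y(u_{0,0}) = \sum_{t=0}^{31} w_{0,t}(0)\, x_{t,0}(3)$
$y(u_{0,1}) = \sum_{t=0}^{31} w_{0,t}(0)\, x_{t,0}(2)$
$y(u_{0,2}) = \sum_{t=0}^{31} w_{0,t}(0)\, x_{t,0}(1)$
$y(u_{0,3}) = \sum_{t=0}^{31} w_{0,t}(0)\, x_{t,0}(0)$ (6)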
By performing the following operation on them in the post-processing circuitry 3, y0,0 can be obtained.
$y_{0,0} = 2^3 \times y(u_{0,0}) + 2^2 \times y(u_{0,1}) + 2^1 \times y(u_{0,2}) + 2^0 \times y(u_{0,3})$
Similarly, the values of y(u1,0) to y(u2,3) included in y1,0 after 32 cycles are given by the following expressions (7).
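$y(u_{1,0}) = \sum_{t=0}^{31} w_{1,t}(1)\, x_{t,0}(3)$
$y(u_{1,1}) = \sum_{t=0}^{31} w_{1,t}(1)\, x_{t,0}(2)$
$y(u_{1,2}) = \sum_{t=0}^{31} w_{1,t}(1)\, x_{t,0}(1)$
$y(u_{1,3}) = \sum_{t=0}^{31} w_{1,t}(1)\, x_{t,0}(0)$
$y(u_{2,0}) = \sum_{t=0}^{31} w_{1,t}(0)\, x_{t,0}(3)$
$y(u_{2,1}) = \sum_{t=0}^{31} w_{1,t}(0)\, x_{t,0}(2)$
$y(u_{2,2}) = \sum_{t=0}^{31} w_{1,t}(0)\, x_{t,0}(1)$
$y(u_{2,3}) = \sum_{t=0}^{31} w_{1,t}(0)\, x_{t,0}(0)$ (7)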
Using these, y1,0 can be calculated as follows.
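$y_{1,0} = 2^4 \times y(u_{1,0}) + 2^3 \times y(u_{1,1}) + 2^2 \times y(u_{1,2}) + 2^1 \times y(u_{1,3}) + 2^3 \times y(u_{2,0}) + 2^2 \times y(u_{2,1}) + 2^1 \times y(u_{2,2}) + 2^0 \times y(u_{2,3})$ (8)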
As such, applicable values of the coefficients (powers of two), as well as correspondences (indexes) to the output elements are different for the respective results from the arithmetic operators um,r. For example, the coefficient values and the output indexes may be set as follows.
y(u0,0): coefficient = 2^3, output index = (0,0)
y(u0,1): coefficient = 2^2, output index = (0,0)
y(u0,2): coefficient = 2^1, output index = (0,0)
y(u0,3): coefficient = 2^0, output index = (0,0)
y(u1,0): coefficient = 2^4, output index = (1,0)
y(u1,1): coefficient = 2^3, output index = (1,0)
y(u1,2): coefficient = 2^2, output index = (1,0)
y(u1,3): coefficient = 2^1, output index = (1,0)
y(u2,0): coefficient = 2^3, output index = (1,0)
y(u2,1): coefficient = 2^2, output index = (1,0)
y(u2,2): coefficient = 2^1, output index = (1,0)
y(u2,3): coefficient = 2^0, output index = (1,0) (9)
Thus, the embodiment adopts the LUT 4 that stores coefficients and output indexes addressed to “m,r”.
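The role of the LUT 4 in post-processing can be illustrated with a minimal sketch. It assumes that each coefficient is stored as a shift amount, that array column c carries bit 3−c of the data xt,0, and that the rows for the lower bit of w1,t are u2,0 to u2,3, in line with the expression (8). The names `LUT`, `accumulate`, and `post_process` and the sample data are hypothetical.

```python
def bit(value, k):
    """Return the k-th bit (0 = LSB) of a non-negative integer."""
    return (value >> k) & 1


# (array row, array column) -> (shift amount, output index)
LUT = {}
for col in range(4):
    x_bit = 3 - col                      # column 0 carries the MSB of x
    LUT[(0, col)] = (0 + x_bit, (0, 0))  # row 0: bit 0 of w0 (1-bit row)
    LUT[(1, col)] = (1 + x_bit, (1, 0))  # row 1: bit 1 of w1
    LUT[(2, col)] = (0 + x_bit, (1, 0))  # row 2: bit 0 of w1


def accumulate(row_bits, x_values):
    """Model the 1-bit accumulators; row_bits[m][t] is the bit fed to array row m."""
    acc = {}
    for m, w_bits in enumerate(row_bits):
        for col in range(4):
            acc[(m, col)] = sum(wb & bit(x, 3 - col) for wb, x in zip(w_bits, x_values))
    return acc


def post_process(acc, lut):
    """Shift each accumulator output by its coefficient and add it to its output element."""
    y = {}
    for key, value in acc.items():
        shift, out_index = lut[key]
        y[out_index] = y.get(out_index, 0) + (value << shift)
    return y


w0 = [1, 0, 1]        # 1-bit weights, T = 3 for brevity
w1 = [2, 3, 1]        # 2-bit weights
x0 = [9, 15, 4]       # 4-bit inputs
row_bits = [[bit(w, 0) for w in w0], [bit(w, 1) for w in w1], [bit(w, 0) for w in w1]]
y = post_process(accumulate(row_bits, x0), LUT)
```

Storing an output index alongside each coefficient lets several accumulator outputs be folded into the same output element without any fixed wiring between array rows and output rows.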
As shown in
Turning back to
y(u14,0): coefficient = 2^5, output index = (7,0)
y(u14,1): coefficient = 2^4, output index = (7,0)
y(u14,2): coefficient = 2^3, output index = (7,0)
y(u14,3): coefficient = 2^2, output index = (7,0)
y(u15,0): coefficient = 2^4, output index = (7,0)
y(u15,1): coefficient = 2^3, output index = (7,0)
y(u15,2): coefficient = 2^2, output index = (7,0)
y(u15,3): coefficient = 2^1, output index = (7,0) (10)
Therefore, y7,0 has a value given by the following.
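$y_{7,0} = 2^5 \times y(u_{14,0}) + 2^4 \times y(u_{14,1}) + 2^3 \times y(u_{14,2}) + 2^2 \times y(u_{14,3}) + 2^4 \times y(u_{15,0}) + 2^3 \times y(u_{15,1}) + 2^2 \times y(u_{15,2}) + 2^1 \times y(u_{15,3})$ (11)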
The remaining 1 bit is handled after the completion of the operation shown in
y(u0,0): coefficient = 2^3, output index = (7,0)
y(u0,1): coefficient = 2^2, output index = (7,0)
y(u0,2): coefficient = 2^1, output index = (7,0)
y(u0,3): coefficient = 2^0, output index = (7,0)
The post-processing with these values, according to the algorithm based on the coefficients and the output indexes, will give the following expression (12) incorporating the expression (11).
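$y_{7,0} = 2^5 \times y(u_{14,0}) + 2^4 \times y(u_{14,1}) + 2^3 \times y(u_{14,2}) + 2^2 \times y(u_{14,3}) + 2^4 \times y(u_{15,0}) + 2^3 \times y(u_{15,1}) + 2^2 \times y(u_{15,2}) + 2^1 \times y(u_{15,3}) + 2^3 \times y(u_{0,0}) + 2^2 \times y(u_{0,1}) + 2^1 \times y(u_{0,2}) + 2^0 \times y(u_{0,3})$ (12)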
This completes the calculation for y7,0, which was incomplete at the processing shown in
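The way a value such as y7,0 ends up split across two passes of the 16-row array can be reproduced with a short sketch. It assumes that the bit planes of the data W are packed onto array rows sequentially, MSB first within each row of W, and that array column c carries bit 3−(c mod 4) of column c÷4 of the data X. These packing rules and the function names are assumptions chosen to match the listings (9) and (10).

```python
W_BIT_WIDTHS = [1, 2, 4, 2, 2, 1, 2, 3, 2, 2, 3, 2, 1, 3, 2]


def allocate_bit_planes(bit_widths, array_rows=16):
    """Split the (W row, bit) planes into passes of at most `array_rows` array rows."""
    planes = [
        (row, b)
        for row, width in enumerate(bit_widths)
        for b in range(width - 1, -1, -1)   # MSB first within each row of W
    ]
    return [planes[i:i + array_rows] for i in range(0, len(planes), array_rows)]


def lut_entry(w_row, w_bit, array_col, x_bits=4):
    """Return (shift amount, output index) for one operator, under the assumed layout."""
    x_col = array_col // x_bits
    x_bit = x_bits - 1 - (array_col % x_bits)
    return (w_bit + x_bit, (w_row, x_col))


passes = allocate_bit_planes(W_BIT_WIDTHS)
```

With the bit widths given for the fourth example, the 32 bit planes fill exactly two passes, and the three planes of row 7 straddle the boundary between them.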
As shown in
It is then determined whether or not all the post-processing operations for the accumulator outputs from the arithmetic operators u0,0 to u15,15, up to time t=31, have been finished (step S3). If it is determined that all the post-processing operations have not yet been finished (NO in step S3), the post-processing circuitry 3 returns to step S1 and performs the remaining post-processing operations for the accumulator outputs from the arithmetic operators u0,0 to u15,15, for the time t=1 and onward.
On the other hand, if it is determined in step S3 that all the post-processing operations for the accumulator outputs from the arithmetic operators u0,0 to u15,15 up to time t=31 have been finished (YES in step S3), the post-processing circuitry 3 sends the result of the post-processing operations to the processor 5 (step S4), and terminates the processing.
3 Effects of Semiconductor Device
With the configuration of the product-sum operation circuitry 1 for the semiconductor device 100 according to the embodiments, it is possible to reduce the data transfers from the memory, such as an SRAM, to the operator array of the product-sum operation circuitry 1. Consequently, the data processing by the semiconductor device 100 can be realized with an improved efficiency.
In the example shown in
With the semiconductor device 100 according to the embodiments in the third and fourth exemplary multibit cases, suitable coefficients and output indexes are set in the LUT 4 in accordance with the bit widths of the input data X and W, and the post-processing algorithms are applied as discussed above. Thus, the data X and W can be processed even when they are of various bit numbers differing from each other.
Also, the embodiments can duly deal with instances where one value must be segmented, as with the value y7 in the fourth exemplary case. The embodiments as such can make full use of the operator array without idle resources, and this contributes to the improved efficiency and the accelerated processing speed of the arithmetic operators.
For example, a semiconductor device that adopts parallel operations of multiple 1-bit arithmetic operators is not capable of coping with the demand for an accuracy level of 2 or more bits. In contrast, the 1 bit×1 bit product-sum operations in the third and fourth exemplary cases of the embodiments enable comparably high-speed processing while being capable of coping with multibit inputs.
The embodiments further contrast with multibit×multibit-dedicated circuitry (e.g., GPU). Note that when arithmetic operators are each adapted for multibit×multibit operations, the circuit size of one arithmetic operator is larger than an arithmetic operator for 1 bit×1 bit operations.
Provided that the same parallel number and the same processing time for one operation of arithmetic operators are set, the product-sum operation circuitry in the third and fourth exemplary cases of the embodiments has a smaller circuit size for performing 1 bit×1 bit product-sum operations while having the same processing speed.
In other words, using multibit×multibit-dedicated arithmetic operators for performing 1 bit×1 bit operations involves idle circuits. This means that resources are largely wasted and efficiency is sacrificed.
For example, when there are 16×16 arithmetic operators, 16×16=256 parallel operations can be performed as 1 bit×1 bit product-sum operations. Using the same configuration, (16/4)×(16/4)=16 parallel operations can be performed as 4 bits×4 bits product-sum operations. Also, the two matrices do not need to have the same bit widths, and it is possible to perform, for example, (16/2)×(16/8)=16 parallel operations as 2 bits×8 bits product-sum operations.
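The parallel numbers quoted above follow from a simple count, sketched below. The function name and parameter names are hypothetical, and the sketch assumes that the bit widths divide the array dimensions evenly.

```python
def parallel_operations(array_rows, array_cols, w_bits, x_bits):
    """Number of multibit product-sums a 1-bit operator array can run in parallel."""
    # Each multibit product-sum occupies w_bits array rows and x_bits array columns.
    return (array_rows // w_bits) * (array_cols // x_bits)


one_by_one = parallel_operations(16, 16, 1, 1)
four_by_four = parallel_operations(16, 16, 4, 4)
two_by_eight = parallel_operations(16, 16, 2, 8)
```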
The third and fourth exemplary cases of the embodiments eliminate the idle resources as noted above by efficiently allowing all the arithmetic operators to be used irrespective of the bit widths of input data. In the instances of multibit×multibit product-sum operations, still, the embodiments require multiple arithmetic operators to deal with a calculation that is performed by one multibit×multibit-dedicated arithmetic operator. Thus, on the condition that the same parallel number is set, the product-sum operation circuitry in the third and fourth exemplary cases of the embodiments—which may be hypothesized to have a smaller parallel number on an equivalent basis—operates at a relatively low processing speed as compared to the circuitry of multibit×multibit-dedicated arithmetic operators.
However, the embodiments can have a smaller circuit size for one arithmetic operator as compared to a multibit×multibit-dedicated arithmetic operator. Accordingly, the embodiments can have a larger parallel number for arithmetic operators when the size of the entire circuitry is the same.
Ultimately, the embodiments provide a higher processing speed when the bit widths of input data are small, while providing a lower processing speed when the bit widths of input data are large. Despite this, in most instances (for example, in the processing for deep learning, where the desired bit widths of input data can vary depending on the layer), small bit widths are sufficient and large bit widths are required only for a limited part. Therefore, assuming instances where the operations using input data with small bit widths account for a larger part, the semiconductor device 100 according to the embodiments provides a higher processing speed as a whole.
While certain embodiments have been described, they have been presented by way of example only, and they are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be worked in a variety of other forms. Furthermore, various omissions, substitutions, and changes in such forms of the embodiments may be made without departing from the gist of the inventions. The embodiments and their modifications are covered by the accompanying claims and their equivalents, as would fall within the scope and gist of the inventions.
Number | Date | Country | Kind |
---|---|---|---|
2018-054284 | Mar 2018 | JP | national |
Number | Date | Country | |
---|---|---|---|
20190294414 A1 | Sep 2019 | US |