The present application claims priority from Japanese Patent Application No. 2021-189169 filed on Nov. 22, 2021, the content of which is hereby incorporated by reference to this application.
The present invention relates to a semiconductor device, for example, a semiconductor device executing a neural network processing.
Patent Document 1 (Japanese Patent Application Laid-Open No. 2019-40403) discloses an image recognition device having a convolutional operation processing circuit that performs calculation by using an integrated coefficient table in order to reduce an amount of calculation of convolutional operations in a CNN (Convolutional Neural Network). The integrated coefficient table holds N × N pieces of data, and each of the N × N pieces of data is configured by a coefficient and a channel number. The convolutional operation processing circuit includes a product operation circuit that executes N × N product operations of an input image and a coefficient in parallel, and a channel selection circuit for performing an accumulation addition operation to its product operation result for each channel number and storing its addition operation result in an output register for each channel number.
In a neural network such as a CNN, parameters such as weight parameters and bias parameters are obtained by learning as floating-point numbers of, for example, 32 bits. However, when the floating-point parameters are used to perform a product-sum operation during inference, a circuit area, a processing load, power consumption, and execution time of a product-sum operation unit (called a MAC (Multiply ACcumulate operation) circuit) can be increased. Further, the required memory capacity and memory bandwidth increase with reads and writes of the parameters and the operation results from and to temporary buffers, and the power consumption can also increase.
Therefore, in recent years, attention has been focused on a method of making an inference after quantizing the floating-point parameters of, for example, 32 bits into integers of 8 bits or less. In this case, since the MAC circuit may perform integer operations with a small number of bits, the circuit area, processing load, power consumption, and execution time of the MAC circuit can be reduced. However, when the quantization is used, the quantization error varies depending on the granularity of quantization, and the accuracy of the inference may vary accordingly. Consequently, an efficient mechanism for reducing the quantization error is demanded. Also, reducing the memory bandwidth is required to allow the inference to be made with fewer hardware resources and less time.
Other problems and novel features will become apparent from the description of the present specification and the accompanying drawings.
Therefore, a semiconductor device according to one embodiment executes a neural network processing, and includes a first buffer, a first shift register, a product-sum operator, and a second shift register. The first buffer holds output data. The first shift register sequentially generates a plurality of pieces of quantized input data by quantizing a plurality of pieces of output data sequentially inputted from the first buffer by bit-shifting. The product-sum operator generates operation data by performing a product-sum operation to a plurality of parameters and the plurality of pieces of quantized input data from the first shift register. The second shift register generates the output data by inversely quantizing the operation data from the product-sum operator by bit-shifting, and stores the output data in the first buffer.
Using a semiconductor device according to one embodiment makes it possible to provide a mechanism for efficiently reducing the quantization error in the neural network.
In the embodiments described below, the invention will be described in a plurality of sections or embodiments when required as a matter of convenience. However, these sections or embodiments are not irrelevant to each other unless otherwise stated, and the one relates to the entire or a part of the other as a modification example, details, or a supplementary explanation thereof. Also, in the embodiments described below, when referring to the number of elements (including number of pieces, values, amount, range, and the like), the number of the elements is not limited to a specific number unless otherwise stated or except the case where the number is apparently limited to a specific number in principle, and the number larger or smaller than the specified number is also applicable. Further, in the embodiments described below, it goes without saying that the components (including element steps) are not always indispensable unless otherwise stated or except the case where the components are apparently indispensable in principle. Similarly, in the embodiments described below, when the shape of the components, positional relation thereof, and the like are mentioned, the substantially approximate and similar shapes and the like are included therein unless otherwise stated or except the case where it is conceivable that they are apparently excluded in principle. The same goes for the numerical value and the range described above.
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. Note that components having the same function are denoted by the same reference characters throughout the drawings for describing the embodiments, and the repetitive description thereof will be omitted. In addition, the description of the same or similar portions is not repeated in principle unless particularly required in the following embodiments.
The semiconductor device 10 shown in
The memory MEM1 holds, for example, a plurality of pieces of data DT composed of pixel values, and a plurality of parameters PR. The parameter PR includes a weight parameter WP and a bias parameter BP. The memory MEM2 is used as a high-speed cache memory for the neural network engine 15. For example, the plurality of pieces of data DT in the memory MEM1 are used in the neural network engine 15 after being copied in the memory MEM2 in advance.
The neural network engine 15 includes a plurality of DMA (Direct Memory Access) controllers DMAC1, DMAC2, a MAC unit 20, and a buffer BUFi. The MAC unit 20 includes a plurality of MAC circuits 21, that is, a plurality of product-sum operators. The DMA controller DMAC1 controls data transfer via the system bus 16 between the memory MEM1 and the plurality of MAC circuits 21 in the MAC unit 20, for example. The DMA controller DMAC2 controls data transfer between the memory MEM2 and the plurality of MAC circuits 21 in the MAC unit 20.
For example, the DMA controller DMAC1 sequentially reads the plurality of weight parameters WP from the memory MEM1. Meanwhile, the DMA controller DMAC2 sequentially reads the plurality of pieces of data DT copied in advance from the memory MEM2. Each of the plurality of MAC circuits 21 in the MAC unit 20 performs a product-sum operation to the plurality of weight parameters WP from the DMA controller DMAC1 and the plurality of pieces of data DT from the DMA controller DMAC2. Further, although the details will be described later, each of the plurality of MAC circuits 21 appropriately stores a product-sum operation result in the buffer BUFi.
The buffer BUFi is composed of, for example, 32-bit width × N flip-flops (N is an integer equal to or greater than 2). A demultiplexer DMUX2 is provided on an input side of the buffer BUFi, and the multiplexer MUX2 is provided on an output side of the buffer BUFi. The buffer BUFi holds output data DTo outputted from the subsequent-stage shift register SREG2 via the two demultiplexers DMUX1, DMUX2. A bit width of the output data DTo is, for example, 32 bits.
The demultiplexer DMUX1 makes a selection of whether to store the output data DTo from the subsequent-stage shift register SREG2 in the memory MEM2 via the DMA controller DMAC2 or in the buffer BUFi via the demultiplexer DMUX2. When the buffer BUFi is selected, the demultiplexer DMUX1 outputs the output data DTo of 32-bit width, and when the memory MEM2 is selected, the demultiplexer DMUX1 outputs the output data DTo of, for example, lower 8 bits or the like in 32 bits. At this time, the remaining 24 bits in the output data DTo are controlled to be zero by quantization / inverse quantization using the preceding-stage shift register SREG1 and the subsequent-stage shift register SREG2, which will be described later.
The demultiplexer DMUX2 makes a selection of which location in the 32-bit width × N buffers BUFi the 32-bit width output data DTo from the demultiplexer DMUX1 is stored at. More specifically, the buffer BUFi is provided in common to the plurality of MAC circuits 21, as shown in
The preceding-stage shift register SREG1 sequentially generates a plurality of pieces of quantized input data DTi by quantizing, by bit-shifting, the plurality of pieces of output data DTo sequentially inputted from the buffer BUFi via the two multiplexers MUX2, MUX1. Specifically, first, the multiplexer MUX2 selects the output data DTo held at a location of any one of the 32-bit width × N buffers BUFi and, for example, outputs as intermediate data DTm the lower 8 bits of the output data DTo to the multiplexer MUX1.
Also, the multiplexer MUX2 sequentially performs such a processing in time series while changing a position in the buffer BUFi, thereby sequentially outputting a plurality of pieces of intermediate data DTm equivalent to the plurality of pieces of output data DTo. The multiplexer MUX1 selects either the 8-bit width data DT read from the memory MEM2 via the DMA controller DMAC2 or the 8-bit width intermediate data DTm read from the buffer BUFi via the multiplexer MUX2, and outputs the selected data to the preceding-stage shift register SREG1.
The preceding-stage shift register SREG1 is, for example, an 8-bit width register. The preceding-stage shift register SREG1 quantizes the data from the multiplexer MUX1 by using a quantization coefficient Qi of 2^m (m is an integer equal to or greater than zero), thereby generating the quantized input data DTi that is in an 8-bit integer (INT8) format. That is, the preceding-stage shift register SREG1 multiplies the inputted data by the quantization coefficient Qi by left-shifting the inputted data by m bits. Assuming that 8 bits can represent 0 to 255 in decimal, the quantization coefficient Qi, that is, the shift amount “m” is determined so that the quantized input data DTi has a value close to 255, for example.
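The shift-based quantization performed by the preceding-stage shift register SREG1 can be modeled in software as the following illustrative sketch; the function names are assumptions for illustration and are not part of the described circuit:

```python
def quantize(value: int, m: int) -> int:
    # Multiply by the quantization coefficient Qi = 2**m via a left shift,
    # as the preceding-stage shift register SREG1 does.
    return value << m

def choose_shift(max_value: int, width: int = 8) -> int:
    # Pick the largest shift amount m such that the quantized value
    # still fits in the unsigned `width`-bit range (0 to 2**width - 1).
    m = 0
    while (max_value << (m + 1)) <= (1 << width) - 1:
        m += 1
    return m
```

For example, for data whose maximum value is 5, choose_shift returns m = 5, so the quantized value 5 << 5 = 160 stays within the 0 to 255 range while using most of it.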
The MAC circuit 21 performs the product-sum operation to the plurality of weight parameters WP sequentially read out from the memory MEM1 via the DMA controller DMAC1 and the plurality of pieces of quantized input data DTi from the preceding-stage shift register SREG1, thereby generating operation data DTc. The weight parameter WP obtained by learning is usually a value smaller than 1 represented by a 32-bit floating-point number (FP32). Such a weight parameter WP in FP32 format is quantized in advance into the INT8 format by using a quantization coefficient Qw, which is 2^n (n is an integer equal to or greater than zero), and is then stored in the memory MEM1.
The MAC circuit 21 includes a multiplier that multiplies two pieces of input data in INT8 format, and an accumulative adder that accumulatively adds multiplication results of the multiplier. The operation data DTc generated by the MAC circuit 21 is, for example, an integer of 16 bits or more, here, in a 32-bit integer (INT32) format.
Incidentally, more specifically, the MAC circuit 21 includes an adder that adds a bias parameter BP to the accumulative addition result of the accumulative adder, and an arithmetic unit that computes an activation function for the addition result. Then, the MAC circuit 21 outputs, as the operation data DTc, a result obtained by performing the addition of the bias parameter BP and the calculation of the activation function. In the following, for the sake of simplification of the description, the addition of the bias parameter BP and the calculation of the activation function will be ignored.
The subsequent-stage shift register SREG2 is, for example, a 32-bit width register. The subsequent-stage shift register SREG2 generates the output data DTo by inversely quantizing the operation data DTc from the MAC circuit 21 by bit-shifting. Then, the subsequent-stage shift register SREG2 stores the output data DTo in the buffer BUFi via the two demultiplexers DMUX1, DMUX2.
In particular, the subsequent-stage shift register SREG2 generates the output data DTo in the INT32 format by multiplying the operation data DTc by an inverse quantization coefficient QR. The inverse quantization coefficient QR is, for example, 1 / (Qi × Qw), that is, 2^-(m + n) by using the quantization coefficients Qi (= 2^m) and Qw (= 2^n) described above. In this case, the subsequent-stage shift register SREG2 inversely quantizes the operation data DTc by right-shifting the operation data DTc by k (= m + n) bits.
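The overall quantize / product-sum / inverse-quantize flow through the preceding-stage shift register SREG1, the MAC circuit 21, and the subsequent-stage shift register SREG2 can be sketched as follows. This is an illustrative software model under the assumption that the weights are already quantized with Qw = 2^n; the function name is hypothetical:

```python
def mac_with_shift_quantization(data, quantized_weights, m, n):
    # Quantize each input by left-shifting m bits (Qi = 2**m), accumulate
    # the products in a wide integer (modeling the INT32 accumulator),
    # then inverse-quantize by right-shifting k = m + n bits (QR = 2**-(m+n)).
    acc = 0
    for d, w in zip(data, quantized_weights):
        acc += (d << m) * w
    return acc >> (m + n)
```

With data = [1, 2], weights already quantized as [4, 8] (original weights 1 and 2 quantized with n = 2), and m = 3, the accumulated value 160 is right-shifted by 5 bits to recover the true dot product 5.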
Incidentally, the shift amount “k” does not necessarily have to be “m + n”. In this case, the output data DTo can be a value that differs from the original value by 2^i times (i is a positive or negative integer). However, at some stage before the final result in the neural network is obtained, the 2^i-fold deviation can be corrected by the right-shifting or left-shifting in the subsequent-stage shift register SREG2.
Also, the demultiplexers DMUX1, DMUX2 can be configured by a plurality of switches each connecting one input to a plurality of outputs. Similarly, the multiplexers MUX1, MUX2 can be configured by a plurality of switches each connecting a plurality of inputs to one output. On / off of each of the plurality of switches forming the demultiplexers DMUX1, DMUX2 is controlled by selection signals SDX1, SDX2. On / off of each of the plurality of switches forming the multiplexers MUX1, MUX2 is controlled by selection signals SMX1, SMX2.
The selection signals SDX1, SDX2, SMX1, and SMX2 are generated by firmware or the like that controls the neural network engine 15, for example. The firmware appropriately generates the selection signals SDX1, SDX2, SMX1, and SMX2 through a not-shown control circuit of the neural network engine 15 based on a structure of the neural network preset or programmed by the user.
The shift amount “m” of the preceding-stage shift register SREG1 is controlled by a shift signal SF1, and the shift amount “k” of the subsequent-stage shift register SREG2 is controlled by a shift signal SF2. The shift signals SF1, SF2 are also generated by the firmware and the control circuit. At this time, the user can arbitrarily set the shift amounts “m” and “k”.
The convolution layer 25[2] generates data of a feature map FM[2] by performing a convolution operation with the data of the feature map FM[1] obtained by the convolution layer 25[1] as an input. Similarly, the convolution layer 25[3] generates data of a feature map FM[3] by performing a convolution operation with the data of the feature map FM[2] obtained by the convolution layer 25[2] as an input. The pooling layer 26 performs a pooling processing with the data of the feature map FM[3] obtained by the convolution layer 25[3] as an input.
By targeting such a neural network, the neural network engine 15 in
In the convolution layer 25[1], the MAC circuit 21 inputs the plurality of INT8-format weight parameters WP[1] sequentially read out from the memory MEM1. Also, the MAC circuit 21 inputs the plurality of pieces of INT8-format data DT sequentially read out from the memory MEM2 via the multiplexer MUX1 and the preceding-stage shift register SREG1. At this time, the preceding-stage shift register SREG1 performs the quantization using the quantization coefficient Qi[1] (= 2^m1) (m1 is an integer equal to or greater than 0) for each of the plurality of pieces of data DT, that is, performs the left-shifting, thereby generating a plurality of pieces of quantized input data DTi[1]. Incidentally, the plurality of pieces of data DT from the memory MEM2 are data constituting the input map IM.
The MAC circuit 21 sequentially performs a product-sum operation or the like to the plurality of weight parameters WP[1] from the memory MEM1 and the plurality of pieces of quantized input data DTi[1] from the preceding-stage shift register SREG1, thereby outputting the INT32-format operation data DTc[1]. The subsequent-stage shift register SREG2 generates the output data DTo[1] by multiplying the operation data DTc[1] by the inverse quantization coefficient QR[1]. The inverse quantization coefficient QR[1] is, for example, 1 / (Qw • Qi[1]). In this case, the subsequent-stage shift register SREG2 performs the right-shifting.
The output data DTo[1] obtained in this manner is one piece of data out of the plurality of pieces of data constituting the feature map FM[1]. The subsequent-stage shift register SREG2 stores the output data DTo[1] at a predetermined location in the buffer BUFi via the demultiplexers DMUX1, DMUX2. Thereafter, the MAC circuit 21 generates another piece of data out of the plurality of pieces of data constituting the feature map FM[1] by performing the same processing to another plurality of pieces of data DT. This another piece of data is also stored at a predetermined location in the buffer BUFi. In addition, all the pieces of data constituting the feature map FM[1] are stored in the buffer BUFi by the plurality of MAC circuits 21 performing the same processing in parallel.
In the convolution layer 25[2], the MAC circuit 21 inputs a plurality of INT8-format weight parameters WP[2] read out from the memory MEM1. Also, the MAC circuit 21 inputs a plurality of pieces of intermediate data DTm via the multiplexer MUX1 and the preceding-stage shift register SREG1, the plurality of pieces of intermediate data DTm being sequentially read out from the buffer BUFi via the multiplexer MUX2. At this time, the preceding-stage shift register SREG1 performs, for each of the plurality of pieces of intermediate data DTm, the quantization using a quantization coefficient Qi[2] (= 2^m2) (m2 is an integer equal to or greater than 0), that is, performs the left-shifting, thereby generating a plurality of pieces of quantized input data DTi[2]. The plurality of pieces of intermediate data DTm from the buffer BUFi are data constituting the feature map FM[1].
In this manner, in the configuration example of
The MAC circuit 21 generates the INT32-format operation data DTc[2] by sequentially performing the product-sum operation to the plurality of weight parameters WP[2] from the memory MEM1 and the plurality of pieces of quantized input data DTi[2] from the preceding-stage shift register SREG1. The subsequent-stage shift register SREG2 generates the output data DTo[2] by multiplying the operation data DTc[2] by the inverse quantization coefficient QR[2]. The inverse quantization coefficient QR[2] is, for example, 1 / (Qw • Qi[2]). In this case, the subsequent-stage shift register SREG2 performs the right-shifting.
The output data DTo[2] obtained in this manner is one piece of data out of the plurality of pieces of data constituting the feature map FM[2]. The subsequent-stage shift register SREG2 stores the output data DTo[2] in the buffer BUFi via the demultiplexers DMUX1, DMUX2. Then, similarly to the case of the convolutional layer 25[1], all the pieces of data constituting the feature map FM[2] are stored in the buffer BUFi.
Also in the convolutional layer 25[3], the same processing as that of the convolutional layer 25[2] is performed. At this time, a quantization coefficient Qi[3] (= 2^m3) is used in the preceding-stage shift register SREG1, and an inverse quantization coefficient QR[3], for example, 1 / (Qw • Qi[3]), is used in the subsequent-stage shift register SREG2. However, in the convolutional layer 25[3], unlike the respective cases of the convolutional layers 25[1] and 25[2], the output data DTo[3] forming the feature map FM[3] is stored in the memory MEM2 via the demultiplexer DMUX1 and the DMA controller DMAC2. Thereafter, for example, the processor 17 shown in
In such an operation, the value of the output data DTo usually decreases as it passes through the convolutional layers 25[1], 25[2], 25[3]. In this case, the quantization coefficient Qi of the preceding-stage shift register SREG1 can be increased by an amount corresponding to the decrease in the value of the output data DTo. Here, in order to reduce the quantization error, it is desirable to set the quantization coefficient Qi at a value as large as possible so that the quantized input data DTi falls within the integer range of the INT8 format. Therefore, for example, the quantization error can be reduced by setting the quantization coefficient Qi[2] (= 2^m2) and the quantization coefficient Qi[3] (= 2^m3) so as to meet m2 < m3.
However, the method of reducing the quantization error is not necessarily limited to the method of determining m2 < m3, and another method may be used. Whichever method is used, it can be handled by appropriately determining the shift amount “m” of the preceding-stage shift register SREG1 and the shift amount “k” of the subsequent-stage shift register SREG2 according to the setting or programming by the user. Further, the inverse quantization coefficient QR is not limited to 1 / (Qw • Qi), and can also be changed as appropriate. In this case, as described above, 2^i-fold deviations may occur, but the 2^i-fold deviations may be corrected by the subsequent-stage shift register SREG2 so as to target the final result, that is, the output data DTo[3] forming the feature map FM[3].
As described above, in the semiconductor device according to the first embodiment, providing the preceding-stage shift register SREG1 and the subsequent-stage shift register SREG2 typically makes it possible to provide a mechanism for efficiently reducing the quantization error in the neural network. As a result, it becomes possible to sufficiently maintain the accuracy of the inference using the neural network. Further, providing the buffer BUFi makes it possible to reduce the memory bandwidth. Then, the reduction in the processing load due to the quantization, the reduction in the required memory bandwidth, and the like make it possible to shorten the time required for the inference.
Incidentally, it is assumed as a comparative example that the preceding-stage shift register SREG1, the subsequent-stage shift register SREG2, and the buffer BUFi are not provided. In this case, for example, the data of the feature maps FM[1], FM[2] obtained from the convolutional layers 25[1], 25[2] needs to be stored in the memory MEM2. Further, a quantization / inverse quantization processing or the like using the processor 17 is required separately. As a result, the memory bandwidth is increased, and the time required for the inference can also be increased due to the necessity of a processing by the processor 17.
Each of the buffer controllers 30a, 30b variably controls a bit width of the output data DTo outputted from the subsequent-stage shift register SREG2 via the demultiplexer DMUX1. Specifically, as shown in
When the bit width of the output data DTo is controlled to 32 bits, each of the buffer controllers 30a, 30b controls the write / read to the buffer BUFi by using the buffer BUFi, which is physically formed in a 32-bit width, as a 32-bit width buffer. Meanwhile, when the bit width of the output data DTo is controlled to 16 bits, each of the buffer controllers 30a, 30b regards the buffer BUFi configured with a 32-bit width as 16-bit width × 2 buffers, and controls the write / read. Similarly, when the bit width of the output data DTo is controlled to 8 bits or 4 bits, each of the buffer controllers 30a, 30b regards the buffer BUFi as 8-bit width × 4 buffers or 4-bit width × 8 buffers.
For example, when the bit width of the output data DTo is controlled to 8 bits, each of the buffer controllers 30a, 30b can store, in the buffer BUFi configured with a 32-bit width, four pieces of output data DTo1 to DTo4 inputted from the MAC circuit 21 via the subsequent-stage shift register SREG2 and the like. This makes it possible to efficiently use the buffer BUFi and reduce the power consumption associated with the write / read to the buffer BUFi.
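The packing of several narrow outputs into one 32-bit buffer word, as performed by the buffer controllers 30a, 30b when the effective bit width is reduced, can be illustrated by the following software sketch; the function names are assumptions for illustration:

```python
def pack_outputs(values, width, word_bits=32):
    # Pack word_bits // width narrow values into one buffer word,
    # e.g. four 8-bit outputs DTo1 to DTo4 into one 32-bit word.
    word = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << width)
        word |= v << (i * width)
    return word

def unpack_outputs(word, width, word_bits=32):
    # Recover the individual narrow values from a packed word.
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(word_bits // width)]
```

Packing [0x11, 0x22, 0x33, 0x44] with an 8-bit width yields the single 32-bit word 0x44332211, and unpacking restores the four original values.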
Particularly, in a case of the neural network as shown in
As described above, using the semiconductor device according to the second embodiment makes it possible to obtain various effects similar to those described in the first embodiment. In addition to this, providing the buffer controllers 30a, 30b makes it possible to efficiently use the buffers BUFi.
The second difference is that the buffer BUFi is configured so as to have a bit width smaller than the bit width of the subsequent-stage shift register SREG2 and, for example, is configured so as to have a 16-bit width. The third difference is that the MAC unit 20b includes a demultiplexer DMUX1b and a multiplexer MUX1b different from those in
The multiplexer MUX1b selects, based on the selection signal SMX1b, any one of the data DT held in the memory MEM2, the output data DTo held in the buffer BUFi, or the output data DTo held in the buffer BUFc, and outputs it to the preceding-stage shift register SREG1. The output data DTo held in the buffer BUFi becomes intermediate data DTm1 similarly to the case of
In the above configuration, the buffer BUFc is larger in capacity for the same area than the buffer BUFi. Meanwhile, the buffer BUFi is faster in access speed than the buffer BUFc. Here, when the bit width of the output data DTo is large, the required buffer capacity also becomes large. However, if all the buffers are configured by flip-flops, the speed can be increased, but there is concern about an increase in area. Therefore, the two buffers BUFi, BUFc are provided here, and the two buffers BUFi, BUFc are switched according to the bit width of the output data DTo, in other words, the effective bit width.
When the bit width of the output data DTo is greater than 16 bits, the buffer BUFc is selected as the storage destination of the output data DTo. Meanwhile, when the bit width of the output data DTo is 16 bits or less, the buffer BUFi is selected as the storage destination of the output data DTo. As described in the second embodiment, the bit width of the output data DTo may become smaller each time it passes through a convolutional layer. In this case, the buffer BUFc can be used on the initial-stage side of the convolutional layers, and the buffer BUFi can be used on the final-stage side of the convolutional layers.
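The storage-destination selection between the two buffers can be expressed as a simple rule keyed to the effective bit width of the output data DTo. This is an illustrative sketch only; the 16-bit threshold follows the example above:

```python
def select_buffer(bit_width: int) -> str:
    # Wide intermediate data goes to the large SRAM-type buffer BUFc;
    # data of 16 bits or less goes to the fast flip-flop buffer BUFi.
    return "BUFc" if bit_width > 16 else "BUFi"
```

Under this rule, 32-bit data on the initial-stage side is routed to BUFc, while 16-bit, 8-bit, or 4-bit data on the final-stage side is routed to BUFi.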
As described above, using the semiconductor device according to the third embodiment makes it possible to obtain various effects similar to those described in the first embodiment. In addition to this, providing the two buffers BUFi, BUFc makes it possible to improve a balance between the area and the speed.
The second difference is that the MAC unit 20c further includes a multiplexer MUX3 with the addition of the buffer BUFi2. The multiplexer MUX3 selects, based on a selection signal SMX3, either the weight parameter WP held in the memory MEM1 or the weight parameter WPx held in the buffer BUFi2, and outputs it to the MAC circuit 21.
The plurality of weight parameters WP are repeatedly used in a processing of the neural network engine 15c for one convolutional layer. For example, in obtaining one piece of data out of the feature map FM [1] shown in
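The effect of the buffer BUFi2 can be modeled as a read-once cache for the weight parameters; this is an illustrative software analogy (the class and attribute names are hypothetical), not the circuit itself:

```python
class WeightCache:
    # Model of BUFi2: the first access fetches the weight parameters
    # from external memory (MEM1) and keeps a local copy; repeated
    # accesses within the same convolutional layer are served locally.
    def __init__(self, read_from_memory):
        self._read = read_from_memory
        self._cache = None
        self.memory_reads = 0

    def get(self):
        if self._cache is None:
            self._cache = self._read()
            self.memory_reads += 1
        return self._cache
```

However many times the same weights are requested within one layer, the external memory is read only once, which models the reduced access frequency to the memory MEM1.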
As described above, using the semiconductor device according to the fourth embodiment makes it possible to obtain various effects similar to those described in the first embodiment. In addition to this, providing the buffer BUFi2 makes it possible to decrease the access frequency to the memory MEM1 and cut down the required memory bandwidth.
In the foregoing, the invention made by the inventor of the present invention has been concretely described based on the embodiments. However, it is needless to say that the present invention is not limited to the foregoing embodiments and various modifications and alterations can be made within a range not departing from the scope of the present invention.
Number | Date | Country | Kind
---|---|---|---
2021-189169 | Nov 2021 | JP | national