The disclosure of Japanese Patent Application No. 2023-006452 filed on Jan. 19, 2023 including the specification, drawings and abstract is incorporated herein by reference in its entirety.
The present invention relates to a semiconductor device, for example, a semiconductor device having a product-sum operation circuit.
There is disclosed a technique listed below.
In recent years, the use of artificial intelligence utilizing machine learning has become popular, and techniques to realize the processing required for the machine learning with hardware have been actively developed. Patent Document 1 discloses an example of the technique to realize the machine learning with hardware.
Patent Document 1 discloses a semiconductor device comprising: a memory outputting a plurality of pieces of first data in parallel; a plurality of product-sum operation circuits corresponding to the plurality of pieces of first data; and a plurality of selectors: corresponding to the plurality of product-sum operation circuits; supplied with a plurality of pieces of second data in parallel; selecting one piece of second data from the supplied plurality of pieces of second data according to additional information indicating a position of one piece of second data to be calculated with one piece of first data by the corresponding product-sum operation circuits among the plurality of pieces of second data; and outputting the selected second data, wherein each of the plurality of product-sum operation circuits performs a product-sum operation between the first data different from each other in the plurality of first data and the second data outputted from the corresponding selectors.
However, the technique disclosed in Patent Document 1 has a problem in that the power consumption required for the product-sum operation increases.
Other problems and novel features will be apparent from the description of this specification and the accompanying drawings.
A semiconductor device according to one embodiment includes: an initial value setting unit configured to provide an initial value of a register that holds a cumulative value to be a result of a product-sum operation in a product-sum operation circuit; and an initial value canceling circuit configured to cancel the initial value contained in the cumulative value held by the register and output a final output value, and the initial value setting unit sets a positive or negative value other than zero as the initial value.
In the semiconductor device according to one embodiment, it is possible to reduce the power consumption required for the product-sum operation.
For clarifying the description, the following descriptions and drawings are omitted and simplified as appropriate. Further, in each drawing, the same elements are denoted by the same reference characters and redundant descriptions are omitted as necessary.
The array processor 2 includes a management unit 9, a transfer path unit 11, and an array unit 12. Also, the array unit 12 includes parallel operation circuits 3, memories 4, direct memory access controllers (DMAC) 5 and 7, processor elements 6, and programmable switches PSW1 to PSW3. In the array unit 12, the memories 4, the DMACs 5 and 7, and the processor elements 6 are each formed as a circuit unit to implement an individual function, and a plurality of these circuit units are arranged.
The array processor 2 performs inference processing by a circuit configuration having a neural network structure. A plurality of descriptor strings are stored in the memory 10. Each descriptor string includes information specifying the functions of the processor elements 6 in the array unit 12, information determining the states of the programmable switches PSW1 to PSW3 in the array unit 12, and the like. When a descriptor string is supplied to the management unit 9, the management unit 9 decodes the descriptor string, generates a corresponding control signal, and supplies it to the array unit 12 and the like. In this way, processing according to the descriptor string stored in the memory 10 by the user is realized in the array unit 12. In other words, the internal connections and states of the array unit 12 are determined by the management unit 9. Note that the descriptor string may also include information regarding the parallel operation circuits 3 in the array unit 12.
The transfer path unit 11 is connected between a bus B_W that transfers weight data and the like and the array unit 12, and transfers the weight data and the like to the array unit 12. The array unit 12 processes input data from a bus B_DI and outputs processed results to a bus B_DO as output data. When performing this processing, the array unit 12 uses the weight data and the like. Here, the weight data used in the semiconductor device 1 according to the first embodiment is a value to be a coupling coefficient between neurons, and is hereinafter referred to as a weight parameter in some cases.
Further, the transfer path unit 11 includes a direct memory access controller (DMAC) 8. The DMAC 8 transfers the weight data on the bus B_W to the memory 4. At least some of the memories 4 are used as local memories of the parallel operation circuits 3. Note that the memories 4 may also include memories used when performing the processing in the array unit 12.
Here, the functions of the circuit blocks in the array unit 12 will be described. The DMAC 5 stores input data from the bus B_DI in the memory 4 according to the programmable switch PSW1. The programmable switch PSW1 switches, according to instructions from the management unit 9, to which parallel operation circuit 3 the input data input from the DMAC 5 is provided.
The parallel operation circuit 3 performs a product-sum operation in parallel between the weight data stored in the memory 4 and the input data transferred by the DMAC 5. The parallel operation circuit 3 will be described later in detail with reference to the drawings. Note that, by arranging the parallel operation circuits 3 at the center of the array unit 12, it is possible to shorten the distance between the parallel operation circuits 3. Therefore, the arrangement shown in
The memory 4 functions as a local memory of the parallel operation circuit 3 as described above. Further, the memory 4 may store the results of activation processing performed by the processor element 6 to be described later in addition to the above-mentioned weight data, and may hold the stored activation processing results so as to use it as input data to be provided to the parallel operation circuit 3 in the subsequent product-sum operations. In addition, in the product-sum operation processing in which the results of the activation processing performed in the processor element 6 are processed as input data, it is also possible to provide input data from an external memory (not shown in
The processor element 6 is a circuit block capable of implementing various functions. The functions implemented by the processor element 6 are determined according to control signals from the management unit 9. Further, the programmable switches PSW2 and PSW3 also electrically connect designated circuit blocks according to the control signals from the management unit 9. For example, according to the control signal from the management unit 9, a predetermined processor element is set to have a function of performing a predetermined operation. Furthermore, the predetermined programmable switches PSW2 and PSW3 are set to connect, for example, the predetermined parallel operation circuit 3 and the predetermined processor element 6 according to the control signal from the management unit 9. Although an example in which the array unit 12 includes the plurality of processor elements 6 has been described here, the array unit 12 is not limited to this.
For example, a specific processor element of the processor elements 6 may be changed to a dedicated circuit block whose function is determined in advance. Namely, it is possible to set a batch normalization function in one processor element 6, and set a dedicated circuit block that performs activation processing in another processor element 6. By replacing general-purpose processor elements with dedicated circuit blocks in this way, it is possible to improve processing efficiency and speed. It is assumed that the activation processing described below is performed by any of the processor elements 6.
The DMAC 7 writes the operation result generated by applying the function set in the processor element to the operation result calculated by the parallel operation circuit 3 to, for example, an external memory (not shown in
Next, the neural network structure realized in the semiconductor device 1 according to the first embodiment will be described.
As shown in
The array processor 2 shown in
An outline of operation processing in the neural network will be described with reference to
Then, the array unit 12 first reads input data (feature) necessary for the operation from the external memory 13 (step S1). Subsequently, the configuration and state of the array unit 12 are set by the management unit 9 (step S2). Thereafter, the array unit 12 sequentially provides the input data read from the external memory 13 to the parallel operation circuit 3 (step S3). The parallel operation circuit 3 performs product-sum operation processing by multiplying the sequentially supplied input data by the weight data stored in the local memory 14 in the order of reception (step S4). The operation results of the product-sum operation by the parallel operation circuit 3 are sequentially output to the processor element 6 that realizes processing such as activation performed in the layer in which neurons are arranged shown in
In the array unit 12, operations such as addition and activation are performed as necessary, using the processor element 6, on the data obtained by the parallel operation circuit 3 (step S6). Thereafter, the array unit 12 writes the output data obtained as a result of the operation into the external memory 13 as another feature (step S7). The processing in the neural network is realized in this manner, and the semiconductor device 1 performs the operation processing necessary for inference processing by repeating these steps.
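As a rough illustration of the flow from the product-sum operation (step S4) to the write-back (step S7), the following Python sketch models one layer as a set of product-sum operations followed by activation. The function names, the list-based data layout, and the choice of ReLU as the activation are assumptions made only for illustration; the specification does not fix a particular activation function.

```python
def relu(value):
    """Hypothetical activation; the text only refers to 'activation'."""
    return max(0, value)


def run_layer(features, weight_rows, activation=relu):
    """One product-sum operation per output neuron, followed by activation."""
    outputs = []
    for weights in weight_rows:
        cumulative = 0
        for weight, feature in zip(weights, features):
            # Products are accumulated in the order the input data arrive.
            cumulative += weight * feature
        outputs.append(activation(cumulative))
    # The result would be written back as the feature for the next layer.
    return outputs


features = [1, -2, 3]
weight_rows = [[2, 0, 1], [-1, 4, 0]]
next_features = run_layer(features, weight_rows)
```

Repeating `run_layer` with the output of one call as the input of the next corresponds to repeating the processing for each layer.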
In the array processor 2 according to the first embodiment, the parallel operation circuit 3 performs regular product-sum operation processing among the necessary operation processings, so that the high speed processing can be realized. Further, operation processings other than the product-sum operation processing are performed by the processor element 6 whose circuit can be dynamically reconfigured by the management unit 9 or the like. In this way, it is possible to flexibly set the processing such as activation in each layer described as the first layer (first layer processing), the second layer (second layer processing), and the output layer (third layer processing) in
Here, the semiconductor device 1 according to the first embodiment has one feature in the configuration of the parallel operation circuit 3, and this feature realizes the reduction in power consumption. Therefore, the parallel operation circuit 3 will be described in detail below.
As shown in
When the input data DI is high bit-width parallel data, the latch 17 may divide the supplied high bit-width parallel data into data DL0 to DLm and output them in parallel.
When the control signal SCC changes next, the latch 17 outputs and holds the (m+2)-th supplied input data DI as data DL0, outputs and holds the (m+3)-th supplied input data DI as data DL1, outputs and holds the (m+4)-th supplied input data DI as data DL2, and so on, until it outputs and holds the 2(m+1)-th supplied input data DI as data DLm. Thereafter, each time the control signal SCC changes, the latch 17 outputs in parallel and holds the next m+1 sequentially supplied input data DI in the same manner. In other words, the latch 17 is a conversion circuit configured to convert a serial data string into parallel data.
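The serial-to-parallel behavior of the latch 17 can be sketched in software as follows. This is a model under the assumption that each change of the control signal SCC corresponds to one completed group of m+1 input data; the function name `latch_groups` is hypothetical.

```python
def latch_groups(serial_inputs, m):
    """Yield successive groups (DL0, ..., DLm) from a serial data string."""
    group_size = m + 1
    # Only complete groups are output; a partial trailing group is not latched.
    for start in range(0, len(serial_inputs) - group_size + 1, group_size):
        yield tuple(serial_inputs[start:start + group_size])


# With m = 3, eight serial inputs become two parallel groups of four.
groups = list(latch_groups([10, 11, 12, 13, 14, 15, 16, 17], m=3))
```

In this model the second group starts with the (m+2)-th input and ends with the 2(m+1)-th input, matching the description above.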
The selector group 15 includes a plurality of selectors (selectors 151 to 15m in
The local memory 14 is, for example, the memory 4. The local memory 14 includes a plurality of weight holding regions (for example, weight holding regions 141 to 14m), each of which can output independent weight data. The weight holding regions 141 to 14m hold weight data to be an object of the operation in the corresponding product-sum operation circuits. It is assumed that the weight data includes the above-mentioned additional information. Further, the weight holding regions 141 to 14m output weight data WD1 to WDm to the product-sum operation circuits 161 to 16m, respectively.
The product-sum operation circuit group 16 includes a plurality of product-sum operation circuits (for example, product-sum operation circuits 161 to 16m). The number m of product-sum operation circuits is determined by the number of parallel product-sum operations performed in the product-sum operation circuit group 16. The product-sum operation circuits 161 to 16m each perform a product operation of the input data provided from the corresponding selector and the weight data output from the weight holding region and a product-sum operation of adding the products between the input data and the weight data calculated for each processing cycle. In the semiconductor device 1 according to the first embodiment, the reduction in power consumption is achieved by the configuration of the product-sum operation circuit. Therefore, the configuration of the product-sum operation circuit will be described below in more detail.
In the semiconductor device 1 according to the first embodiment, a plurality of product-sum operation circuits 16 that can operate in parallel are provided in the parallel operation circuit 3, so that a large number of product-sum operation processings related to one processing layer can be performed in parallel. Further, in the semiconductor device 1 according to the first embodiment, a plurality of parallel operation circuits 3 are provided, so that the product-sum operation processings related to a plurality of processing layers can be performed in parallel. Accordingly, the semiconductor device 1 according to the first embodiment can perform inference processing at high speed.
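A minimal sketch of one parallel processing cycle is given below. It assumes that the additional information is encoded as an index k stored alongside each weight; the tuple encoding and the names used are illustrative assumptions, not the data format of the device.

```python
def parallel_products(latched_data, weights_with_position):
    """Model one cycle: each selector picks DLk, each multiplier forms a product."""
    products = []
    for weight, position in weights_with_position:
        selected = latched_data[position]   # selector output for this circuit
        products.append(weight * selected)  # product in the corresponding circuit
    return products


latched = [5, -3, 7, 2]                  # DL0 .. DL3 held by the latch
weights = [(2, 0), (-1, 2), (4, 3)]      # (weight, position k) per circuit
products = parallel_products(latched, weights)
```

Each entry of `products` would then be accumulated by its own product-sum operation circuit, independently of the others.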
As shown in
The multiplier 20 calculates the product of first data (for example, weight data WD1) and second data (for example, input data DLk) whose values change for each processing cycle, and outputs a product operation value mul. Here, the input data DLk is input data selected by the selector 151, where k is an integer between 0 and m. Also, the weight data is a value provided from the weight holding region 141.
The register 22 holds a value obtained by updating the cumulative value of the output values of the multiplier 20 (for example, the product operation value mul) for each processing cycle, and outputs it as a register value regout. The adder 21 adds the output value of the multiplier 20 (for example, the product operation value mul) and the register value regout to update the cumulative value. Here, the register 22 takes in the added value add output by the adder 21 in synchronization with a clock signal (not shown), and thereby updates the register value regout as the new cumulative value. The initial value canceling circuit 24 cancels the initial value contained in the register value regout and outputs the final output value SR1.
Here, in the first embodiment, the register 22 includes the initial value setting unit 23. The initial value setting unit 23 holds a preset initial value which is a fixed value, and the initial value setting unit 23 sets the register value regout to the initial value when resetting the register value regout of the register 22. Namely, the initial value setting unit 23 may be configured to set the reset value of the register to the initial value, and may be built in the register in terms of hardware. This initial value is a positive or negative value other than zero. More preferably, the initial value is set to such a size that the sign of the cumulative value is not inverted in the processing cycles until one final output value SR1 is determined.
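The relationship between the multiplier 20, the adder 21, the register 22 with its non-zero reset value, and the initial value canceling circuit 24 can be sketched as a small behavioral model. The class and method names are hypothetical, and Python integers stand in for fixed-width register contents, so overflow is not modeled.

```python
class ProductSumCircuit:
    """Behavioral model of a product-sum circuit with a non-zero initial value."""

    def __init__(self, initial_value):
        if initial_value == 0:
            raise ValueError("the initial value must be non-zero")
        self.initial_value = initial_value
        self.regout = initial_value  # the register resets to the initial value

    def reset(self):
        self.regout = self.initial_value

    def cycle(self, weight, data):
        mul = weight * data        # multiplier output
        add = mul + self.regout    # adder output
        self.regout = add          # register update on the clock edge

    def final_output(self):
        # Initial value canceling: remove the offset from the cumulative value.
        return self.regout - self.initial_value


mac = ProductSumCircuit(initial_value=-256)
for weight, data in [(2, 3), (-4, 5), (1, -7)]:
    mac.cycle(weight, data)
result = mac.final_output()
```

The final output equals the plain product-sum of the inputs, because the offset introduced at reset is removed again at the output.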
In the semiconductor device 1 according to the first embodiment, input data and weight data may take both positive and negative values, and when an operation is performed with the initial value set to zero, the register value regout repeatedly switches between positive and negative values. In the register 22, which holds signed binary data, many of the held bit values are inverted whenever the sign of the held value switches. Therefore, in the product-sum operation circuit according to the first embodiment, the frequency of sign inversion of the register value regout is reduced by providing a positive or negative value as the initial value of the register value regout, whereby the toggle rate of circuits such as the flip-flops in the register 22 is reduced and the power consumption of the circuits is reduced.
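The effect on the toggle rate can be illustrated by counting inverted bits in a software model. The sketch below assumes a 16-bit two's complement register and an arbitrary toy sequence of products; neither the width nor the data come from the specification.

```python
def to_twos_complement(value, width=16):
    """Bit pattern of a signed value in a register of the given width."""
    return value & ((1 << width) - 1)


def count_toggles(products, initial_value, width=16):
    """Return (number of inverted register bits, final value after canceling)."""
    regout = initial_value
    toggles = 0
    for mul in products:
        add = regout + mul
        changed = to_twos_complement(regout, width) ^ to_twos_complement(add, width)
        toggles += bin(changed).count("1")
        regout = add
    return toggles, regout - initial_value


products = [3, -5, 4, -6, 2, -1]  # toy data whose running sum changes sign often
toggles_zero, sum_zero = count_toggles(products, initial_value=0)
toggles_offset, sum_offset = count_toggles(products, initial_value=-256)
```

For this toy sequence the run starting from -256 inverts fewer bits than the run starting from zero, while both produce the same final sum after the offset is canceled. The actual saving depends on the data and on how the initial value is chosen.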
Then, the operation of the product-sum operation circuit 161 when −256 is provided as the initial value of the register value regout will be described.
In
In the example shown in
Subsequently, when the processing cycle becomes cycle 1 from cycle 0, the added value add of cycle 0 is taken into the register 22, so that the register value regout is updated by the added value add of cycle 0. At this time, eight bit values are inverted in the register value regout. Furthermore, in cycle 1, the product operation value mul is updated by the product of the input data and the weight data. Further, in cycle 1, in response to the determination of the product operation value mul, the sum of the register value regout updated in cycle 1 and the product operation value mul becomes the added value add.
As described above, in the product-sum operation circuit 161, by taking the added value add calculated in the previous processing cycle into the register 22 as a cumulative value, the register value regout, which is a cumulative value of the products of input data and weight data, is updated for each processing cycle, whereby the final output value SR1, which is a product-sum operation value of a plurality of combinations of input data and weight data, is determined.
Also, referring to
Next, the change of the register value regout shown in
As shown in
Here,
As shown in
From the foregoing description, in the semiconductor device according to the first embodiment, by setting the initial value held by the register 22 of the product-sum operation circuit 16 to a positive or negative value other than zero, the number of occurrences of sign inversion of the register value regout can be reduced, and the power consumption of the semiconductor device 1 can be reduced.
Here, as a modification of the product-sum operation circuit shown in
A product-sum operation circuit 161a shown in
In the second embodiment, an initial value determination method in which the initial value is determined based on the number of times of operation and the size of weight data will be described. Note that, in the description of the second embodiment, the same components as those in the first embodiment are denoted by the same reference characters as those in the first embodiment, and the description thereof will be omitted.
First, the operation of a machine learning system from the determination of weight data to the inference processing applied to the semiconductor device 1 will be described.
Here, the machine learning does not need to use the semiconductor device 1, and can be performed on a computer or cloud system separate from the semiconductor device 1. Also, the weight data downloaded through the process shown in
Therefore, a product-sum operation circuit 161b according to the second embodiment to which the initial value whose value can be changed by processing in this way is applied will be described.
In addition, the product-sum operation circuit 161b includes an initial value storage unit 31 that rewritably holds the initial value. The initial value written to the initial value storage unit 31 may be an arbitrary value stored in a built-in storage device formed on the same semiconductor chip as the register 32 or an arbitrary value provided from outside of the semiconductor chip on which the register 32 is formed.
Here, the initial value determination method applied in the second embodiment will be described in detail. First, a first example of the initial value determination method will be described. In the first example, assuming that all weight data are the maximum values set in advance, the initial value size is determined based on the number of processing cycles required to determine the one final output value (hereinafter referred to as the number of times of product-sum operation).
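One way to read the first example is as a worst-case bound on the magnitude of the cumulative value. The sketch below is only an interpretation: the use of a maximum input magnitude, the choice of a negative offset, and the extra margin of one are assumptions introduced here so that the register value stays strictly negative.

```python
def worst_case_bound(num_operations, max_abs_weight, max_abs_input):
    """Largest possible magnitude of the sum if every weight were maximal."""
    return num_operations * max_abs_weight * max_abs_input


def choose_initial_value(num_operations, max_abs_weight, max_abs_input):
    """Pick a negative offset large enough that the sign cannot invert."""
    bound = worst_case_bound(num_operations, max_abs_weight, max_abs_input)
    return -(bound + 1)


# Hypothetical figures: 4 product-sum operations, |weight| <= 7, |input| <= 15.
bound = worst_case_bound(4, 7, 15)
initial_value = choose_initial_value(4, 7, 15)
```

With this choice, even the largest positive sum leaves the register value below zero, so no sign inversion can occur before the final output value is determined.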
As shown in
Next,
As shown in
Next, the second example of the initial value determination method will be described. In the second example, the maximum value of the register value regout is calculated for each number of times of operation in consideration of the size of the actual weight data, and the initial value size is determined based on the maximum value of the register value regout and the number of times of product-sum operation.
As shown in
Here, in the second example, the shift amount of the initial value is smaller than that in the first example. Since the number of bits of the register to be used can be reduced by reducing the shift amount of the initial value from zero, the amount of reduction in power consumption can be increased. In other words, by adopting the second example, it is possible to obtain the higher power consumption reduction effect than that in the first example.
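The difference between the two examples can be sketched by computing a bound per number of operations from the actual weights and comparing it with the bound that assumes maximal weights. The maximum input magnitude and all names below are assumptions for illustration.

```python
from itertools import accumulate


def per_cycle_bounds(weights, max_abs_input):
    """Maximum magnitude of the cumulative value after each operation."""
    return list(accumulate(abs(weight) * max_abs_input for weight in weights))


weights = [3, -1, 2, -4]      # hypothetical trained weights
max_abs_weight = 7            # preset maximum used by the first example
max_abs_input = 15

actual_bounds = per_cycle_bounds(weights, max_abs_input)
first_example_bound = len(weights) * max_abs_weight * max_abs_input
second_example_bound = actual_bounds[-1]
```

Because the actual weights are usually smaller than the preset maximum, `second_example_bound` does not exceed `first_example_bound`, which is why the shift of the initial value from zero can be made smaller.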
Next,
As shown in
From the above description, by adopting the initial value determination method according to the second embodiment, it is possible not only to suppress the inversion of the register value regout but also to suppress the number of register circuits to be used, and it is thus possible to obtain a higher power consumption reduction effect than that of the product-sum operation circuit 161 according to the first embodiment.
In the third embodiment, a product-sum operation circuit 161c, which is another form of the product-sum operation circuit 161b according to the second embodiment, will be described. In the description of the third embodiment, the same components as those in the first and second embodiments are denoted by the same reference characters as those in the first and second embodiments, and the description thereof will be omitted.
Since such a lookup table can be prepared in advance if the weight data and the structure of the neural network are known, the operations required to calculate the initial value can be omitted by using the initial value lookup table 41.
In the fourth embodiment, a product-sum operation circuit 161d, which is another form of the product-sum operation circuit 161c according to the third embodiment, will be described. In the description of the fourth embodiment, the same components as those in the first to third embodiments are denoted by the same reference characters as those in the first to third embodiments, and the description thereof will be omitted.
As described in the second and third embodiments, the number of bits in which the toggle does not occur increases or decreases depending on the initial value size. Therefore, by stopping the clock signals CLK supplied to the D flip-flops corresponding to the bits in which the toggle does not occur by using the clock gating enable generation circuit 51 and the clock gating circuit 52, the semiconductor device according to the fourth embodiment can further reduce the power consumption as compared with the first to third embodiments.
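Which bits never toggle can be derived from the range of values the register can take. The sketch below assumes a two's complement register and a known range [lowest, highest] for the register value; how the clock gating enable generation circuit 51 actually derives its enables is not modeled, and the function name is hypothetical.

```python
def constant_bit_mask(lowest, highest, width):
    """Mask of register bits that are identical for every value in the range."""
    if lowest > highest:
        raise ValueError("lowest must not exceed highest")
    if (lowest < 0) != (highest < 0):
        # The range contains both -1 (all ones) and 0 (all zeros),
        # so every bit can change.
        return 0
    full = (1 << width) - 1
    differing = (lowest & full) ^ (highest & full)
    # Every bit at or below the highest differing bit can change in the range.
    varying = (1 << differing.bit_length()) - 1
    return full & ~varying


# Register values confined to [-300, -1] in a 16-bit register.
gated = constant_bit_mask(-300, -1, 16)
```

Bits set in `gated` correspond to D flip-flops whose clock could be stopped. A smaller range of register values leaves more upper bits constant.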
In the fifth embodiment, a product-sum operation circuit 161e, which is another form of the product-sum operation circuit 161b according to the second embodiment, will be described. In the description of the fifth embodiment, the same components as those in the first and second embodiments are denoted by the same reference characters as those in the first and second embodiments, and the description thereof will be omitted.
The subtraction circuit 61 subtracts the initial value from a preset bias value. The bias addition circuit 64 is configured by adding the function of adding the bias value to the initial value canceling circuit 34. Namely, the bias addition circuit 64 cancels the initial value from the cumulative value by the output value of the subtraction circuit, and outputs a value obtained by adding the bias value to the cumulative value after canceling the initial value as the final output value.
The bias addition is a function that is implemented in accordance with the product specifications. By calculating the value obtained by subtracting the initial value from the bias value by providing the subtraction circuit 61 and providing the calculated value to the bias addition circuit 64 as a new bias value, the addition of the bias value to the register value regout and the cancellation of the initial value can be performed. The increase in circuit scale due to the addition of the subtraction circuit 61 and the specification change of the bias addition circuit 64 is very small and can be ignored.
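The arithmetic behind this folding is simple and can be checked with a short sketch. The function names are hypothetical, and the numbers are arbitrary example values.

```python
def folded_bias(bias, initial_value):
    """Output of the subtraction circuit: the bias with the initial value removed."""
    return bias - initial_value


def final_output(regout, bias, initial_value):
    """A single addition both cancels the initial value and applies the bias."""
    return regout + folded_bias(bias, initial_value)


initial_value = -256
true_sum = -21                        # product-sum result without any offset
regout = initial_value + true_sum     # what the register actually holds
output = final_output(regout, bias=10, initial_value=initial_value)
```

The result equals the true sum plus the bias, so no separate canceling step is needed once the folded bias is supplied to the bias addition circuit.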
In the foregoing, the invention made by the inventors of this application has been specifically described based on the embodiments, but it goes without saying that the present invention is not limited to the embodiments described above and various modifications can be made within the range not departing from the gist thereof.
Number | Date | Country | Kind |
---|---|---|---|
2023-006452 | Jan 2023 | JP | national |