The present invention relates to a semiconductor device, for example, to a semiconductor device for performing the processing of a neural network.
There are disclosed techniques listed below. [Patent Document 1] Japanese Unexamined Patent Application Publication No. 2019-40403
Patent Document 1 discloses an image recognizing device having a convolution arithmetic processing circuit that performs an operation using an integration coefficient table in order to reduce the calculation amount of the convolution operation in a CNN (Convolutional Neural Network). The integration coefficient table holds N×N data, and each of the N×N data is composed of a coefficient and a channel number. The convolution arithmetic processing circuit includes a product calculation circuit that executes N×N product operations of the input image and the coefficients in parallel, performs a cumulative addition operation on the product operation results for each channel number, and includes a channel selection circuit that stores the addition operation results in an output register for each channel number.
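For reference, the per-channel accumulation that this circuit performs in hardware can be modeled in software as follows. This is a minimal sketch: N=3, the table contents, and the function name are illustrative assumptions, not data from Patent Document 1.

# Software model of the integration-coefficient-table scheme described above.
# Each table entry is a (coefficient, channel number) pair (assumed layout).
N = 3
coef_table = [[(0.5, 0), (1.0, 1), (0.25, 0)],
              [(0.75, 2), (0.5, 1), (1.0, 2)],
              [(0.25, 0), (1.0, 2), (0.5, 1)]]

def convolve_window(window):
    """Multiply an N x N input window by the coefficients (done in parallel
    in hardware, as a loop here) and cumulatively add each product into the
    output register of its channel number."""
    out_regs = {}  # channel number -> accumulated sum (the output registers)
    for y in range(N):
        for x in range(N):
            coef, ch = coef_table[y][x]
            out_regs[ch] = out_regs.get(ch, 0.0) + window[y][x] * coef
    return out_regs

window = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(convolve_window(window))  # {0: 3.0, 1: 9.0, 2: 17.0}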
For example, in the processing of a neural network such as a CNN, when transferring the image data and the weighting factor data stored in the memory to a plurality of accumulators, it is desirable to use a DMA (Direct Memory Access) controller for high speed. On the other hand, the data amount of the weighting factor data in particular may be very large. Therefore, a method is conceivable in which weighting factor data compressed in advance is stored in the memory, restored to uncompressed weighting factor data by a decompressor, and then transferred to the plurality of accumulators.
In this case, as the placement of the decompressor, a method of placing it between the memory and the DMA controller, or a method of placing it between the DMA controller and the plurality of accumulators, is conceivable. In the former method, there is a fear that the plurality of accumulators cannot be utilized sufficiently effectively. In the latter method, since it is necessary to provide a decompressor for each of the plurality of accumulators, an increase in circuit area and power consumption may occur.
Other objects and novel features will become apparent from the description of this specification and the accompanying drawings.
A semiconductor device according to an embodiment performs processing of a neural network, and has one or more memories, a decompressor, a first DMA controller, an accumulator unit, and a first switch circuit. The one or more memories hold a plurality of pixel values and j compressed weighting factors. The decompressor restores the j compressed weighting factors to k (k≥j) uncompressed weighting factors. The first DMA controller reads the j compressed weighting factors from the memories and transfers them to the decompressor. The accumulator unit has n (n>k) accumulators that multiply the plurality of pixel values by the k uncompressed weighting factors and cumulatively add the multiplication results in time series. The first switch circuit, provided between the decompressor and the accumulator unit, transfers the k uncompressed weighting factors restored by the decompressor to the n accumulators based on a correspondence represented by a first identifier.
By using the semiconductor device of the embodiment, a reduction in circuit area can be realized.
In the following embodiments, when required for convenience, the description will be divided into a plurality of sections or embodiments; however, unless otherwise specified, they are not independent of each other, and one is a modified example, a detail, a supplementary description, or the like of part or all of the other. In the following embodiments, when the number of elements or the like (including the number of pieces, numerical values, quantities, ranges, and the like) is mentioned, the number is not limited to the specific number and may be equal to or more than, or equal to or less than, the specific number, except for cases where the number is specifically indicated or is clearly limited to the specific number in principle. Furthermore, in the following embodiments, it is needless to say that the constituent elements (including element steps and the like) are not necessarily essential except in the case where they are specifically indicated and the case where they are considered to be obviously essential in principle. Similarly, in the following embodiments, when the shapes, positional relationships, and the like of the constituent elements and the like are mentioned, they include shapes and the like that are substantially approximate or similar thereto, except in the case where they are specifically indicated and the case where they are considered to be obviously otherwise in principle. The same applies to the above numerical values and ranges.
In all the drawings for explaining the embodiments, members having the same functions are denoted by the same reference numerals, and repetitive descriptions thereof are omitted. In the following embodiments, descriptions of the same or similar parts will not be repeated in principle except when particularly necessary.
The semiconductor device DEV shown in the figure includes a neural network engine NNE, a processor PRC such as a CPU (Central Processing Unit), one or more memories MEM1 and MEM2, and a system bus SBUS. The neural network engine NNE executes the processing of a neural network, typified by a CNN. The memory MEM1 is a DRAM (Dynamic Random Access Memory) or the like, and the memory MEM2 is an SRAM (Static Random Access Memory) for caching or the like. The system bus SBUS connects the neural network engine NNE, the memories MEM1 and MEM2, and the processor PRC to each other.
The memory MEM1 holds image data IMD including a plurality of pixel values and compressed weighting factor data WFDC. Here, the data amount of the weighting factor data may be very large. Therefore, the uncompressed weighting factor data WFD is converted in advance into the compressed weighting factor data WFDC using compression software or the like, and is then stored in the memory MEM1. The memory MEM2 is used as a high-speed cache memory for the neural network engine NNE. For example, the image data IMD in the memory MEM1 is copied to the memory MEM2 in advance.
The neural network engine NNE comprises a plurality of DMA controllers DMAC1-DMAC3, a register REG, a decompressor DCMP, switch circuits SW1 and SW2, a switch control circuit SWCT, and an accumulator unit ACCU. The DMA controller DMAC1 reads the compressed weighting factor data WFDC from the memory MEM1 and transfers it to the decompressor DCMP. The decompressor DCMP restores the compressed weighting factor data WFDC to the uncompressed weighting factor data WFD.
A switch circuit SW1 is provided between the decompressor DCMP and the accumulator unit ACCU. Although described later in detail, the switch circuit SW1 transfers, based on a predetermined correspondence, a plurality of weighting factors included in the uncompressed weighting factor data WFD restored by the decompressor DCMP to a plurality of accumulators in the accumulator unit ACCU. The DMA controller DMAC3 reads the image data IMD from the memory MEM2 and transfers it to the accumulator unit ACCU.
The accumulator unit ACCU includes a plurality of accumulators that execute a product-sum operation on the image data IMD from the DMA controller DMAC3 and the uncompressed weighting factor data WFD from the switch circuit SW1. A switch circuit SW2 is provided between the accumulator unit ACCU and the DMA controller DMAC2. Although described later in detail, the switch circuit SW2 transfers, based on a predetermined correspondence, the outputs of the plurality of accumulators in the accumulator unit ACCU to a plurality of channels in the DMA controller DMAC2.
The switch control circuit SWCT controls the switch circuits SW1 and SW2 based on the setting data stored in the register REG. Specifically, the switch control circuit SWCT controls the above-described correspondences of the respective switch circuits SW1 and SW2. The register REG also stores setting data of the address ranges for the DMA controllers DMAC1 to DMAC3, setting data for the accumulator unit ACCU, and the like.
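To summarize the dataflow described so far, the following is a coarse behavioral sketch, in Python, of one pass through the neural network engine NNE. Only the block names and the order of the transfers come from the description above; the data values, the zero-skipping compression format, and the routing table id1 are illustrative assumptions.

# Behavioral model: MEM1 -> DMAC1 -> DCMP -> SW1 -> ACCU <- DMAC3 <- MEM2,
# then ACCU -> SW2 -> DMAC2 -> memory. All numeric values are assumptions.
MEM1 = {"WFDC": ([1, 0, 1], [0.5, 0.25])}   # map data MPD + packed factors
MEM2 = {"IMD": [1.0, 2.0, 3.0, 4.0]}        # cached image data (pixel values)

def dcmp(wfdc):
    """DCMP: restore j packed factors to k factors (zero-skipping assumed)."""
    mpd, packed = wfdc
    it = iter(packed)
    return [next(it) if bit else 0.0 for bit in mpd]

def sw1(wfd, id1):
    """SW1: fan the k restored factors out to n accumulators per ID1."""
    return [wfd[src] for src in id1]

wfd = dcmp(MEM1["WFDC"])               # k = 3 factors: [0.5, 0.0, 0.25]
weights = sw1(wfd, id1=[0, 0, 1, 2])   # n = 4; factor W(1) feeds two ACCs
pixels = MEM2["IMD"]                   # DMAC3 -> ACCU
acc = [w * x for w, x in zip(weights, pixels)]  # ACCU: one control cycle
MEM2["FMP"] = acc                      # SW2 / DMAC2: write results back
print(acc)                             # [0.5, 1.0, 0.0, 1.0]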
The weighting factor dataset WFDS shown in the figure contains the compressed weighting factor data WFDC, which is composed of map data MPD and j compressed weighting factors P(1)-P(j).
The DMA controller DMAC1 transfers the compressed weighting factor data WFDC contained in the weighting factor dataset WFDS, i.e., the map data MPD and the j compressed weighting factors P(1)-P(j), to the decompressor DCMP, as shown in the figure.
The decompressor DCMP restores the compressed weighting factor data WFDC to the uncompressed weighting factor data WFD, as shown in the figure.
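The exact format of the map data MPD is not specified here. The sketch below assumes, purely for illustration, a common zero-skipping scheme: MPD is a k-bit map marking which of the k factors are nonzero, so that only the j nonzero factors P(1)-P(j) need to be stored in the memory MEM1.

def restore(mpd, packed):
    """Restore j compressed factors to k (k >= j) uncompressed factors."""
    it = iter(packed)
    return [next(it) if bit else 0.0 for bit in mpd]

# k = 8 factors, of which only j = 3 are nonzero and actually stored.
mpd = [1, 0, 0, 1, 0, 0, 0, 1]
packed = [0.5, -0.25, 1.0]   # P(1), P(2), P(3)
print(restore(mpd, packed))  # [0.5, 0.0, 0.0, -0.25, 0.0, 0.0, 0.0, 1.0]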
A switch circuit SW1 is provided between the decompressor DCMP and the n (n>k) accumulators ACC(1)-ACC(n) included in the accumulator unit ACCU, as shown in the figure.
The switch circuit SW1 includes, for example, "k×n" switches S(1, 1)-S(k, n) arranged between the k inputs from the decompressor DCMP and the n outputs to the accumulators ACC(1)-ACC(n), as shown in the figure.
In the switch control circuit SWCT, a combination of on/off states of the switches S(1, 1)-S(k, n) is set in advance for each value of the identifier ID1. The switch control circuit SWCT receives the identifier ID1 and controls on/off of the switches S(1, 1)-S(k, n) by generating the "k×n" switch control signals SS(1, 1)-SS(k, n) corresponding thereto. Although not shown, the switch circuit SW2 is configured in the same manner.
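The switch circuit SW1 can thus be modeled as a k×n crossbar whose on/off pattern is selected by the identifier ID1. In the sketch below, the stored patterns and the values k=2, n=4 are illustrative assumptions; in the description above, the patterns are preset in the switch control circuit SWCT.

K, N = 2, 4

# One k x n on/off matrix per ID1 value: ss[row][col] == 1 closes switch
# S(row+1, col+1) and connects weight input row to accumulator column col.
SWITCH_PATTERNS = {
    0: [[1, 1, 0, 0],    # W(1) fans out to ACC(1), ACC(2)
        [0, 0, 1, 1]],   # W(2) fans out to ACC(3), ACC(4)
    1: [[1, 0, 1, 0],
        [0, 1, 0, 1]],
}

def sw1_route(weights, id1):
    """Generate control signals SS(1,1)-SS(k,n) from ID1, then route."""
    ss = SWITCH_PATTERNS[id1]
    outputs = [0.0] * N
    for row in range(K):
        for col in range(N):
            if ss[row][col]:
                outputs[col] = weights[row]
    return outputs

print(sw1_route([0.5, 0.25], id1=0))  # [0.5, 0.5, 0.25, 0.25]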
Each of the n accumulators ACC(1)-ACC(n) has, for example, one multiplier and one cumulative adder. In addition, each of the n accumulators ACC(1)-ACC(n) may have, for example, a bias adder or an activation function calculator required in the processing of the neural network. In each control cycle, the n accumulators ACC(1)-ACC(n) multiply the n pixel values from the DMA controller DMAC3 by the k uncompressed weighting factors W(1)-W(k) transferred via the switch circuit SW1.
Here, the correspondence between the n accumulators ACC(1)-ACC(n) and the k (k<n) weighting factors W(1)-W(k) is determined by the switch circuit SW1. At this time, the switch circuit SW1 transfers at least one of the k weighting factors W(1)-W(k) in parallel to two or more of the n accumulators ACC(1)-ACC(n). Then, each of the n accumulators ACC(1)-ACC(n) cumulatively adds, in time series over a plurality of control cycles, the multiplication results of the pixel values and the weighting factors thus obtained. As an example, when k=28, n may be on the order of several hundred to 1,000.
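The broadcast-and-accumulate behavior described above can be sketched as follows: in each control cycle, the same weighting factor is delivered to a group of accumulators in parallel, each accumulator multiplies it by its own pixel value, and the products are cumulatively added in time series. The grouping and all numeric values are illustrative assumptions.

K, N, CYCLES = 2, 4, 3

# Per control cycle: k weights from SW1, n pixel values from DMAC3 (assumed).
weights_per_cycle = [[0.5, 0.25], [1.0, 0.5], [0.25, 1.0]]
pixels_per_cycle = [[1.0, 2.0, 3.0, 4.0],
                    [5.0, 6.0, 7.0, 8.0],
                    [9.0, 1.0, 2.0, 3.0]]
route = [0, 0, 1, 1]  # ACC(1), ACC(2) share W(1); ACC(3), ACC(4) share W(2)

acc = [0.0] * N
for t in range(CYCLES):
    w, x = weights_per_cycle[t], pixels_per_cycle[t]
    for a in range(N):       # each accumulator: one multiply, one add
        acc[a] += x[a] * w[route[a]]
print(acc)                   # [7.75, 7.25, 6.25, 8.0]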
The DMA controller DMAC2 includes m channels CH(1)-CH(m). Each of the m channels CH(1)-CH(m) transfers the outputs of the n accumulators ACC(1)-ACC(n) to write addresses in a memory, for example, the memory MEM2 described above.
A switch circuit SW2 is provided between the n accumulators ACC(1)-ACC(n) and the DMA controller DMAC2. The switch circuit SW2 transfers the outputs of the n accumulators ACC(1)-ACC(n) to the m channels CH(1)-CH(m) in the DMA controller DMAC2 in response to the switch control signal SS2 from the switch control circuit SWCT, that is, based on the correspondence represented by the identifier ID2.
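The switch circuit SW2 can be sketched the same way: a routing, selected by the identifier ID2, from the n accumulator outputs to the m channels CH(1)-CH(m) of the DMA controller DMAC2. The mapping below is an illustrative assumption.

def sw2_route(acc_outputs, id2_map):
    """For each channel, pick the accumulator output routed to it by ID2."""
    return [acc_outputs[src] for src in id2_map]

acc_outputs = [3.25, 8.5, 4.25, 7.0]      # outputs of ACC(1)-ACC(4)
print(sw2_route(acc_outputs, [0, 2, 3]))  # m = 3 channels: [3.25, 4.25, 7.0]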
On the other hand, in a CNN, k pieces of weighting factor data WFD(1)-WFD(k), also called kernels, are used corresponding to k output channels. The weighting factor data WFD(1) of the output channel (1) is composed of i weighting factors W(1, 1)-W(1, i). Similarly, the weighting factor data WFD(k) of the output channel (k) is composed of i weighting factors W(k, 1)-W(k, i).
In the convolution layer, k feature maps FMP(1)-FMP(k) are generated according to the k output channels. In the feature map FMP(1) of the output channel (1), the feature amount Va(1) of the pixel corresponding to the two-dimensional region A in the image data IMD is calculated by the product-sum operation of the pixel value data XDa and the weighting factor data WFD(1) of the output channel (1). Similarly, in the feature map FMP(1), the feature amount Vb(1) of the pixel corresponding to the two-dimensional region B in the image data IMD is calculated by the product-sum operation of the pixel value data XDb and the weighting factor data WFD(1) of the output channel (1).
Further, in the feature map FMP(k) of the output channel (k), the feature amount Va(k) of the pixel corresponding to the two-dimensional region A in the image data IMD is calculated by the product-sum operation of the pixel value data XDa and the weighting factor data WFD(k) of the output channel (k). Similarly, in the feature map FMP(k), the feature amount Vb(k) of the pixel corresponding to the two-dimensional region B in the image data IMD is calculated by the product-sum operation of the pixel value data XDb and the weighting factor data WFD(k) of the output channel (k). Incidentally, each feature amount may be calculated by adding a bias value for each output channel to such a product-sum operation result and further applying an activation function.
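As a worked example of this product-sum operation, the feature amount Va(c) for the region A and the output channel (c) is the dot product of the pixel value data XDa and the weighting factor data WFD(c), optionally followed by a bias addition and an activation function. The numeric values, the bias, and the choice of ReLU as the activation function are illustrative assumptions.

def feature(xd, wfd_c, bias=0.0, act=lambda v: max(v, 0.0)):
    """Va(c) = act( sum_i xd[i] * wfd_c[i] + bias ); ReLU assumed for act."""
    return act(sum(x * w for x, w in zip(xd, wfd_c)) + bias)

XDa = [1.0, 2.0, 3.0, 4.0]     # pixel values of two-dimensional region A
WFD1 = [0.5, 0.0, -0.25, 1.0]  # kernel of output channel (1), i = 4
print(feature(XDa, WFD1))      # Va(1) = 3.75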
In this case, the switch circuit SW1 sequentially transfers, over i control cycles, the i weighting factors W(1, 1) to W(1, i) of the output channel (1) in parallel to a plurality of accumulators ACC(1), . . . , ACC(r). Similarly, the switch circuit SW1 sequentially transfers, over i control cycles, the i weighting factors W(k, 1) to W(k, i) of the output channel (k) in parallel to a plurality of accumulators ACC(q), . . . .
Prior to such processing, the decompressor DCMP receives, for example, the j compressed weighting factors P(1, 1) to P(j, 1) in the first control cycle, and outputs the weighting factors W(1, 1) to W(k, 1) for the k output channels by decompressing them. A header HD is added to these compressed weighting factors P(1, 1) to P(j, 1), as shown in the figure.
The switch circuit SW1 receives the weighting factors W(1, 1)-W(k, 1) for the k output channels from the decompressor DCMP, and transfers each of the weighting factors W(1, 1)-W(k, 1) in parallel to a plurality of accumulators based on the switch control signal SS1 from the switch control circuit SWCT.
On the other hand, in the DMA controller DMAC3, the channels CH(1) and CH(q) sequentially read the i pixel values Xa(1) to Xa(i) from the memory MEM2 and transfer them in this order to the accumulators ACC(1) and ACC(q), respectively, in i control cycles. In addition, the channel CH(r) sequentially reads the i pixel values Xb(1)-Xb(i) from the memory MEM2 and transfers them in this order to the accumulator ACC(r) in i control cycles. Thus, the accumulators ACC(1), . . . , ACC(r), . . . , ACC(q), . . . execute the product-sum operations described above, thereby calculating the feature amounts Va(1), Vb(1), Va(k), and so on, respectively.
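The i-control-cycle schedule above can be summarized in a short sketch: in cycle t, the switch circuit SW1 broadcasts W(1, t) to ACC(1) through ACC(r) and W(k, t) to ACC(q) and onward, while the DMA controller DMAC3 feeds Xa(t) to ACC(1) and ACC(q) and Xb(t) to ACC(r). With k=2 and i=3 assumed here, the three accumulators end up holding Va(1), Vb(1), and Va(k); all numeric values are illustrative.

i = 3
W1 = [0.5, 1.0, 0.25]   # W(1,1)-W(1,i), output channel (1)
Wk = [0.25, 0.5, 1.0]   # W(k,1)-W(k,i), output channel (k)
Xa = [1.0, 2.0, 3.0]    # Xa(1)-Xa(i), region A
Xb = [4.0, 5.0, 6.0]    # Xb(1)-Xb(i), region B

acc1 = accr = accq = 0.0
for t in range(i):          # one multiplication and one addition per cycle
    acc1 += Xa[t] * W1[t]   # ACC(1): region A with channel (1) weights
    accr += Xb[t] * W1[t]   # ACC(r): region B with the same broadcast weights
    accq += Xa[t] * Wk[t]   # ACC(q): region A with channel (k) weights
print(acc1, accr, accq)     # Va(1) = 3.25, Vb(1) = 8.5, Va(k) = 4.25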
In each channel in the DMA controller DMAC2, a correspondence with the feature maps FMP(1)-FMP(k) of the output channels described above is established, and the outputs of the accumulators are written to the write addresses corresponding to the respective feature maps.
On the other hand, in a first comparative example, the decompressor is placed between the memory and the DMA controller, as described above. In this configuration, there is a fear that the plurality of accumulators cannot be utilized sufficiently effectively. In a second comparative example, the decompressor is placed between the DMA controller and the plurality of accumulators, with one decompressor provided for each of the plurality of accumulators. In this configuration, an increase in circuit area and power consumption may occur.
In contrast, in the configuration example of the embodiment, the single decompressor DCMP is placed between the DMA controller DMAC1 and the accumulator unit ACCU, and the switch circuit SW1 distributes the restored weighting factors to the n accumulators ACC(1)-ACC(n). Consequently, compared with the first comparative example, the plurality of accumulators can be utilized effectively, and compared with the second comparative example, the circuit area and the power consumption can be reduced.
(Configuration around Neural Network Engine)
In the configuration example described above, a single decompressor DCMP is shared by the n accumulators ACC(1)-ACC(n). Therefore, a plurality of sets each including the decompressor DCMP, the switch circuit SW1, the switch control circuit SWCT, and the register REG, that is, a plurality of decompressing units such as the decompressing unit DU1, may be provided.
(Configuration around Neural Network Engine)
That is, when the DMA controller DMAC1 transfers the compressed weighting factor data WFDC, the processor PRC outputs an identifier ID1 to the register REG, and thus to the switch control circuit SWCT, via the system bus SBUS, as shown in the figure.
The decompressing unit DU3 has the same configuration as the decompressing unit DU1, which comprises the switch control circuit SWCT, the decompressor DCMP, the switch circuit SW1, and the register REG associated with the processing of the weighting factors described above. That is, the memory MEM1 holds image data compressed in advance. Then, the decompressing unit DU3 transfers the compressed image data to the accumulator unit ACCU while decompressing it.
Normally, since the amount of data of the image data IMD is smaller than that of the weighting factor data WFD, the image data IMD is used while stored as uncompressed data in the memory MEM2 for caching. However, if, for example, the number of input channels of the image data IMD increases, it may be difficult to sufficiently secure the storage capacity for the image data IMD in the memory MEM2. Therefore, by using the configuration example described above, in which compressed image data is held in the memory and restored by the decompressing unit DU3, such a shortage of storage capacity can be addressed.
Although the invention made by the present inventor has been specifically described based on the embodiment, the present invention is not limited to the embodiment described above, and it is needless to say that various modifications can be made without departing from the gist thereof.