This disclosure relates to an arithmetic device, and relates, for example, to an arithmetic device that performs a computation in a convolutional layer of a neural network.
It is known that, in neural network algorithms, making the connections between neurons of adjacent layers sparse allows faster processing owing to a reduced amount of computation and allows a smaller memory capacity for storing weight coefficients.
Even with sparse connections between neurons, however, the performance and power consumption of existing arithmetic devices dedicated to neural networks are not necessarily improved. NPL 1 proposes an inference arithmetic device directed to the convolutional layers of a sparse neural network.
Specifically, according to this literature, an arithmetic device includes a buffer controller (BC) and a plurality of processing elements (PE) connected in parallel to one another. The value of each output neuron is calculated by one of the processing elements, and the plurality of processing elements operate asynchronously in parallel.
The buffer controller controls an input data buffer so as to selectively supply each processing element with the input data that processing element requires. More specifically, the buffer controller includes: an internal storage area that stores part of the data copied from the input data buffer; and a plurality of index modules that individually correspond to the processing elements and supply data to their respective processing elements.
NPL 1: Shijin Zhang, et al., “Cambricon-X: An Accelerator for Sparse Neural Networks,” Proceedings of 49th IEEE/ACM International Symposium on Microarchitecture, 2016
In the configuration in NPL 1, n processing elements are provided. In this case, the buffer controller must include n ports so that the n processing elements can access it simultaneously, or, when each processing element includes p computation units operating in parallel, n×p ports allowing simultaneous access.
In order to improve the processing performance of an arithmetic device, it is effective to increase the number n of processing elements or the number p of computation units in each processing element. According to the configuration of the arithmetic device in NPL 1, however, the number of ports for accessing the internal storage area in the buffer controller increases in proportion to n×p. Such an increase in the number of ports increases the circuit area and lowers the operating frequency. Therefore, scalable improvement in performance of the arithmetic device with increasing number n of processing elements and number p of internal computation units cannot be achieved.
This disclosure takes into consideration the problems above, and one of objects thereof is to provide an arithmetic device that allows scalable improvement in performance with increase in number of processing elements or increase in number of computation units within the processing element, in a convolutional layer of a neural network where connection between layers is made sparse.
An arithmetic device in one embodiment is an arithmetic device for a computation in a convolutional layer of a convolutional neural network. The convolutional layer includes a plurality of input neurons and a plurality of output neurons each connected to at least one of the plurality of input neurons. The arithmetic device includes a first register that stores input data as values of the plurality of input neurons, a plurality of ports, and a plurality of processing element groups that correspond to the plurality of ports, respectively, and can access the first register through respective corresponding ports. Each of the processing element groups includes a plurality of processing elements. Each of the processing elements is associated with at least one of the plurality of output neurons and performs a multiply-and-accumulate computation in which a value of at least one input neuron connected to a corresponding output neuron is multiplied by a weight coefficient and a result of multiplication is accumulated.
According to the embodiment, since a processing element group including a plurality of processing elements accesses the first register through a corresponding port, performance thereof can scalably be improved with increase in number of processing elements or increase in number of computation units within the processing element.
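As a purely illustrative example (the numbers below are assumed and do not come from NPL 1 or from the embodiments): with n = 16 processing elements each containing p = 4 computation units, a register providing one port per computation unit would need 16 × 4 = 64 simultaneously accessible ports (or 16 ports if one port were provided per processing element), whereas grouping the same processing elements into, for example, 4 processing element groups requires only 4 ports on the first register.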
Each embodiment will be described below in detail with reference to the drawings. The same or corresponding elements have the same reference characters allotted and description thereof will not be repeated.
[Exemplary Configuration of Neural Network]
Each neuron buffer 31 is used for temporarily storing data output from a preceding layer and/or temporarily storing data to be input to a next layer. For example, neuron buffer 31A corresponds to a plurality of input neurons of convolutional layer 32A. Neuron buffer 31B corresponds to a plurality of output neurons of convolutional layer 32A and to a plurality of input neurons of pooling layer 33A.
For each output neuron, each convolutional layer 32 multiplies the value of each input neuron connected to that output neuron by a weight coefficient and adds up the results of multiplication. This computation is referred to as a multiply-and-accumulate computation. Each convolutional layer 32 then adds a bias value to the result of the multiply-and-accumulate computation and applies an activation function (for example, a non-linear computation) to the result of the addition.
Referring to
After the computation within the range of one kernel is completed, convolution kernel 90 is slid and the multiply-and-accumulate computation is similarly performed. When all computations with the kernel being slid in a horizontal direction and a vertical direction are completed, processing in the convolutional layer is completed.
Specifically, in the example in
Referring again to
As shown in
[Exemplary Configuration of Convolutional Layer]
Referring to
y0←w0,0·x0+w0,2·x2+w0,5·x5+w0,6·x6+w0,9·x9+w0,11·x11+w0,12·x12 (1A)
y1←w1,0·x0+w1,5·x5+w1,6·x6+w1,7·x7+w1,9·x9+w1,10·x10+w1,12·x12 (1B)
y2←w2,0·x0+w2,2·x2+w2,5·x5+w2,7·x7+w2,9·x9+w2,12·x12+w2,15·x15 (1C)
y0←f0(y0) (2A)
y1←f1(y1) (2B)
y2←f2(y2) (2C)
In the expressions (1A) to (1C), wij (0≤i≤2, 0≤j≤15) represents a weight coefficient. A bias value may be added to each of y0, y1, and y2. In the expressions (2A) to (2C), f0, f1, and f2 each represent an activation function.
The multiply-and-accumulate computation in the expressions (1A) to (1C) can be divided into a first stage of taking out an element necessary for the multiply-and-accumulate computation from input neurons x0, x1, . . . , and x15, a second stage of multiplying the element taken out in the first stage by a corresponding weight coefficient, and a third stage of adding up results of multiplication.
In the first stage, values aligned in an order of index numbers of input neurons x0, x1, . . . , and x15 are referred to as a “usage bit string,” with a value corresponding to an element necessary for the multiply-and-accumulate computation being defined as “1” and a value corresponding to an element unnecessary for the multiply-and-accumulate computation being defined as “0”. The usage bit string in the example in the expression (1A) is expressed as (1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0). The usage bit string in the example in the expression (1B) is expressed as (1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0). The usage bit string in the example in the expression (1C) is expressed as (1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1).
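For illustration only, the three stages above can be sketched in Python as follows, using the usage bit strings of the expressions (1A) to (1C); the weight values, the bias, and the activation function in this sketch are placeholders and are not taken from the disclosure.

```python
# Minimal sketch of the three-stage sparse multiply-and-accumulate computation,
# driven by the usage bit strings of expressions (1A) to (1C).
usage_bits = [
    [1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0],  # output neuron y0, expression (1A)
    [1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0],  # output neuron y1, expression (1B)
    [1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1],  # output neuron y2, expression (1C)
]

def sparse_convolution(x, weights, bits_per_neuron, bias=0.0, act=lambda v: max(v, 0.0)):
    """x: values of input neurons x0..x15; weights[i][j]: weight coefficient w_{i,j}."""
    y = []
    for i, bits in enumerate(bits_per_neuron):
        # First stage: take out only the input elements whose usage bit is 1.
        selected = [(j, x[j]) for j, bit in enumerate(bits) if bit == 1]
        # Second and third stages: multiply by the corresponding weight and accumulate.
        acc = sum(weights[i][j] * xj for j, xj in selected) + bias
        # Expressions (2A) to (2C): apply the activation function.
        y.append(act(acc))
    return y

# Example usage with dummy input values and dummy weights.
x = [float(j) for j in range(16)]
w = [[0.1 * (i + 1)] * 16 for i in range(3)]
print(sparse_convolution(x, w, usage_bits))
```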
[Overall Configuration of Arithmetic Device]
Referring to
Each processing element group (GPE) 42 includes a plurality of processing elements 43 (PE). Each processing element (PE) 43 performs a multiply-and-accumulate computation for each output neuron. Specifically, in the example in
Input data buffer (NBin) 40 stores a result of computation in the preceding layer as a value of each input neuron in a subject layer.
Data scatter unit (DSU) 41 holds part of the data in input data buffer 40 and accepts access requests from the processing element groups (GPE) 42. Data to which access has been requested (that is, a value of an input neuron) is transferred to the processing element group (GPE) 42 that issued the request.
Data gather unit (DGU) 44 gathers results of computations by processing elements (PE) 43, applies an activation function to the gathered results of computation, and further aligns and synchronizes the results.
Output data buffer (NBout) 45 stores the results of computation in the subject layer as values of the output neurons.
[Mapping of Calculation Algorithm within One Kernel Range to Hardware]
As described with reference to
As shown in
A result of computation in each processing element (PE) 43 is temporarily stored in an embedded register (corresponding to a data gather register (DGR) 71 in
As described with reference to
Referring to
[Procedure of Processing in Arithmetic Device]
Referring to
As described previously, the convolution kernel refers to a weight coefficient matrix and it is also referred to as a filter. In a computation in the convolutional layer of the neural network, the convolution kernel is applied to input data while it is slid.
In next step S106, arithmetic device 39 indicates a turn number to each processing element group (GPE) 42 and each processing element (PE) 43. When the number of output neurons 94 within one kernel range is equal to or smaller than the total number of processing elements (PE) 43, a notification about the turn number is given only once (for example, a notification of number 0 is given as the turn number). When the number of output neurons 94 within one kernel range is larger than the total number of processing elements (PE) 43, arithmetic device 39 gives a notification about the turn number (0, 1, 2, . . . ) in performing the multiply-and-accumulate computation corresponding to output neurons in each turn.
In next step S110, processing element groups (GPE) 42 read necessary data from data scatter unit (DSU) 41 by accessing data scatter unit (DSU) 41 in parallel. A value of an index of an accessed input neuron is different for each turn.
In next step S120, processing element group (GPE) 42 scatters read data to processing elements (PE) 43 included in the processing element group itself.
In next step S130, each processing element (PE) 43 multiplies scattered data by a corresponding weight coefficient and accumulates a result of multiplication in an internal register. A value of the weight coefficient is different for each turn.
Steps S110, S120, and S130 above are performed in parallel in processing element groups (GPE) 42. In each processing element group (GPE) 42, steps S110 to S130 are repeated until the multiply-and-accumulate computation in each processing element (PE) 43 is completed (YES in step S140). A bias value may be added to the result of the multiply-and-accumulate computation. The final result of the multiply-and-accumulate computation by each processing element (PE) 43 is transferred to data gather unit (DGU) 44.
In next step S150, data gather unit (DGU) 44 applies an activation function to the result of the multiply-and-accumulate computation by each processing element (PE) 43. Furthermore, in next step S160, data gather unit (DGU) 44 has an embedded register (that is, data gather register (DGR) 71 in
When output neurons 94 within one kernel range are divided into a plurality of turns, until the convolution computation in all turns is completed (that is, determination as YES is made in step S164), steps from step S106 are repeated. In this case, data gather unit (DGU) 44 has data gather register (DGR) 71 store a result of computation for each turn and waits for completion of the computations by processing elements (PE) 43 in all turns.
In next step S170, data gather unit (DGU) 44 outputs data from the embedded register to output data buffer (NBout) 45. The procedure from step S102 to step S170 is repeated until the convolution computation for all kernels is completed (until determination as YES is made in step S180) while convolution kernel 90 is slid.
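For illustration only, the outer control flow of steps S102 to S180 can be sketched in Python as follows. The function compute_output_neuron, the window arithmetic, and all numerical values are hypothetical stand-ins for the hardware operations described above; they are not part of the disclosure.

```python
# Hedged sketch of the outer control flow: sliding the kernel, splitting the output
# neurons of one kernel range into turns when they outnumber the processing elements,
# and writing the gathered results back per kernel position.
import math

def compute_output_neuron(window, neuron_id):
    # Placeholder for the sparse multiply-and-accumulate performed by one PE (S110-S130)
    # followed by the activation function (S150).
    return float(sum(window)) + neuron_id

def run_layer(input_data, kernel_positions, neurons_per_kernel, total_pes):
    nbout = []
    turns = math.ceil(neurons_per_kernel / total_pes)          # number of turn numbers
    for pos in kernel_positions:                               # repeated until S180 is YES
        window = input_data[pos:pos + 4]                       # S102: NBin -> DSU
        dgr = []                                               # data gather register
        for turn in range(turns):                              # S106: indicate turn number
            first = turn * total_pes
            ids = range(first, min(first + total_pes, neurons_per_kernel))
            dgr += [compute_output_neuron(window, n) for n in ids]  # S110-S160
        nbout.extend(dgr)                                      # S170: DGR -> NBout
    return nbout

print(run_layer([float(j) for j in range(8)], kernel_positions=[0, 2, 4],
                neurons_per_kernel=3, total_pes=2))
```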
[Exemplary Configuration of Data Scatter Unit]
Referring to
Data scatter register (DSR) 50 holds some data in input data buffer (NBin) 40. Each port 51 accepts an access request from corresponding processing element group (GPE) 42. Each port 51 takes out data to which access has been requested (a value of an input neuron) from data scatter register (DSR) 50 and transfers the data to corresponding processing element group (GPE) 42.
Thus, data scatter unit (DSU) 41 does not provide an access port for each processing element (PE). Since data scatter register (DSR) 50 is configured to be accessed per processing element group (GPE), the increase in the number of ports 51 that accompanies an increase in the number of processing elements (PE) 43 can be suppressed.
[Exemplary Configuration of Processing Element Group]
Referring to
Each processing element (PE) 43 includes a second memory (collectively referred to below as weight coefficient string memory 61) that holds the weight coefficients used in the multiply-and-accumulate computation for its output neuron, and a multiply-and-accumulate computation unit 62. In processing element group GPE0 in
Specifically, in the configuration shown in
The weight coefficient stored in weight coefficient string memory 61 is associated with an index number stored in index string memory 60. A pointer 63 is used for indicating a specific index number stored in index string memory 60 and a pointer 64 is used for indicating a specific weight coefficient stored in weight coefficient string memory 61.
Each processing element group (GPE) 42 takes out an index number indicated by pointer 63 from index string memory 60, and requests data scatter unit (DSU) 41 to provide data (that is, a value of an input neuron) corresponding to the index number. Then, each processing element group (GPE) 42 scatters data (data) obtained from data scatter unit (DSU) 41 to embedded processing elements (PE) 43. Each processing element (PE) 43 takes out a value of the weight coefficient indicated by pointer 64 from weight coefficient string memory 61, multiplies the scattered data by the weight coefficient, and accumulates a result of multiplication.
As the element indicated by pointer 63 is sequentially switched, a similar operation is performed on each element in index string memory 60. Consequently, a value output from each processing element (PE) 43 (that is, a value of each output neuron) before an activation computation is calculated.
Each multiply-and-accumulate computation unit 62 includes a multiplier 65, an adder 66, and a flip flop (FF) 67. Multiplier 65 multiplies data (data) obtained from data scatter unit (DSU) 41 by a weight coefficient taken out of memory 61. Adder 66 adds data held in flip flop 67 to a result of multiplication by multiplier 65 and has a result of addition held in flip flop 67. Results of multiplication by multiplier 65 are thus accumulated.
In processing element (PE) 43 in
Each processing element group (GPE) 42 simultaneously requests data scatter unit (DSU) 41 to provide data (DATA) corresponding to index numbers as many as the number of multipliers 65. Data (DATA) corresponding to the plurality of index numbers is scattered to processing elements (PE) 43. In each processing element (PE) 43, the plurality of multipliers 65 multiply taken-out data (DATA) (that is, values of a plurality of input neurons) by corresponding respective weight coefficients.
According to the configuration as above, the number of computation units in parallel within each processing element (PE) 43 can be increased.
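For illustration only, the behavior of one processing element group (GPE) 42 with the shared index string memory 60, the per-PE weight coefficient string memories 61, pointers 63 and 64, and p multipliers per multiply-and-accumulate computation unit 62 can be sketched in Python as follows. The element-for-element alignment of each weight string with the shared index string, the value of p, and all numerical values are assumptions of this sketch rather than details taken from the figures.

```python
# Minimal behavioral sketch of one processing element group (GPE).
def gpe_compute(dsr, index_string, weight_strings, p=2):
    """dsr: data scatter register contents; weight_strings: one list per PE, assumed to
    be aligned element-for-element with index_string. Returns one value per PE."""
    acc = [0.0] * len(weight_strings)                 # flip flop 67 of each PE
    pointer = 0                                       # pointers 63 and 64 advance together
    while pointer < len(index_string):
        # The GPE requests up to p values at a time from the DSU (one access request).
        chunk = index_string[pointer:pointer + p]
        data = [dsr[idx] for idx in chunk]            # values of the requested input neurons
        for pe, weights in enumerate(weight_strings):
            for k, value in enumerate(data):          # p multipliers operating in parallel
                acc[pe] += weights[pointer + k] * value   # multiplier 65 and adder 66
        pointer += p
    return acc

# Example: two PEs sharing a four-entry index string, two multipliers per PE.
dsr = [0.5 * j for j in range(16)]
index_string = [0, 2, 5, 6]
weight_strings = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.0, 0.1, 0.2]]
print(gpe_compute(dsr, index_string, weight_strings, p=2))
```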
Index string memory 60 and weight coefficient string memory 61 described with reference to
[Exemplary Configuration of Data Gather Unit]
Referring to
Activator (ACT) 70 applies an activation function to a result of a multiply-and-accumulate computation output from corresponding processing element (PE) 43. Data gather register (DGR) 71 temporarily holds the result of the computation until the multiply-and-accumulate computation in all processing element groups (GPE) 42 and application of the activation function to the result of the multiply-and-accumulate computation are completed. At a time point of completion of all computations, data gather unit (DGU) 44 simultaneously writes the result of computation into output data buffer (NBout) 45.
[Method of Creating Index String and Weight Coefficient String]
A method of creating an index string and a weight coefficient string that are data stored in index string memory 60 and weight coefficient string memory 61, respectively, described with reference to
Referring to
Referring to
Referring to
Then, referring to
Then, referring to
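Because the figures referred to above are not reproduced here, the following Python sketch shows only one plausible construction of the index string and the weight coefficient strings for one processing element group, under the assumptions that the shared index string is the union of the input indices required by the output neurons allocated to that group and that each weight coefficient string stores a zero for any index its output neuron does not use; these assumptions are illustrative and may differ from the method described with reference to the figures.

```python
# Plausible construction of an index string (memory 60) and per-PE weight coefficient
# strings (memories 61) from the sparse connections of the output neurons in one GPE.
def build_strings(sparse_weights):
    """sparse_weights: one dict {input index: weight coefficient} per output neuron."""
    # Shared index string: every input index used by at least one output neuron in the GPE.
    index_string = sorted(set().union(*[w.keys() for w in sparse_weights]))
    # One weight string per PE, aligned with the shared index string (zero where unused).
    weight_strings = [[w.get(idx, 0.0) for idx in index_string] for w in sparse_weights]
    return index_string, weight_strings

# Example using the connections of expressions (1A) and (1B); weight values are dummies.
y0 = {0: 0.1, 2: 0.2, 5: 0.3, 6: 0.4, 9: 0.5, 11: 0.6, 12: 0.7}
y1 = {0: 0.1, 5: 0.2, 6: 0.3, 7: 0.4, 9: 0.5, 10: 0.6, 12: 0.7}
idx, ws = build_strings([y0, y1])
print(idx)   # [0, 2, 5, 6, 7, 9, 10, 11, 12]
print(ws)
```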
[Effect of First Embodiment]
As set forth above, according to the arithmetic device for the neural network in the first embodiment, a processing element group (GPE) 42 is made up of a plurality of processing elements (PE) 43, and these processing elements (PE) 43 are treated as one unit. The processing element groups (GPE) 42 can access data scatter register (DSR) 50, provided in data scatter unit (DSU) 41 to store input data, in parallel. In data scatter unit (DSU) 41, therefore, an access port is provided for each processing element group (GPE) 42 rather than for each processing element (PE) 43. Consequently, the increase in the number of access ports when the number of processing elements (PE) 43 and the number of computation units provided in each processing element (PE) 43 increase can be suppressed. The increase in circuit area in proportion to the number of access ports can therefore be suppressed, and the processing performance of the arithmetic device can be enhanced without lowering the operating frequency.
With the hardware configuration in the first embodiment, the higher the ratio of input data shared among the plurality of output neurons allocated to the same processing element group (GPE) 42, the fewer access requests are issued to data scatter unit (DSU) 41 and the shorter the processing time.
For example, as a bit length of index string Idxn shown in
According to an arithmetic device in the second embodiment, output neurons within the same layer can be allocated to processing elements (PE) 43 at arbitrary positions, so that the processing speed can be increased in accordance with the characteristics of the neural network algorithm (that is, how input neurons and output neurons are connected). A specific description will be given below with reference to the drawings.
[Exemplary Configuration of Data Gather Unit]
As described previously, according to the arithmetic device in the second embodiment, output neurons are allocated to processing elements (PE) 43 in an order different from the order of index numbers of the output neurons. In other words, the output neurons are allocated to processing elements (PE) 43 in the order different from the order of computation assumed in an algorithm of a neural network. Switch 72 then rearranges results of computation output from a plurality of activators (ACT) 70 into the original order of the index. The rearranged results of computation are stored in data gather register (DGR) 71. A switch capable of switching connection into any order such as a crossbar switch is employed as switch 72.
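For illustration only, the reordering performed by switch 72 can be modeled as a permutation write-back into the data gather register; the representation of the allocation order used below is an assumption of this sketch and is not the disclosed crossbar control.

```python
# Minimal sketch of switch 72: results computed in allocation order are written back
# to the data gather register (DGR) 71 in the original index order of the output neurons.
def rearrange(results, allocation_order):
    """allocation_order[k] = index number of the output neuron computed at position k."""
    dgr = [0.0] * len(results)
    for k, neuron_index in enumerate(allocation_order):
        dgr[neuron_index] = results[k]      # crossbar-like routing to the original slot
    return dgr

# Example: the PEs computed y2, y0, y3, y1 in that order.
print(rearrange([2.0, 0.5, 3.5, 1.25], [2, 0, 3, 1]))  # [0.5, 1.25, 2.0, 3.5]
```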
Since
Switch 72 provided in data gather unit (DGU) 44A in
Since
[Operations by Arithmetic Device]
In step S155, switch 72 in
[Specific Example of Data Processing]
An effect of the second embodiment will be described below with reference to specific examples.
Specifically, in the example in
y0←w0,6·x6+w0,9·x9+w0,11·x11+w0,12·x12+w0,13·x13+w0,14·x14 (3A)
y1←w1,1·x1+w1,2·x2+w1,4·x4+w1,8·x8+w1,9·x9+w1,11·x11 (3B)
y2←w2,4·x4+w2,5·x5+w2,6·x6+w2,9·x9+w2,11·x11+w2,12·x12+w2,14·x14 (3C)
y3←w3,2·x2+w3,3·x3+w3,4·x4+w3,7·x7+w3,8·x8+w3,9·x9 (3D)
In the expressions (3A) to (3D), wij (0≤i≤3, 0≤j≤15) represents a weight coefficient. A bias value may be added to each of y0, y1, y2, and y3.
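For illustration only, the effect of the allocation on the number of accesses to data scatter unit (DSU) 41 can be estimated from the input indices appearing in the expressions (3A) to (3D). The cost model below (one access per distinct input index needed within a processing element group) is a simplifying assumption of this sketch, and the two allocations compared are examples rather than the allocations shown in the figures.

```python
# Counting DSU accesses for two allocations of output neurons y0 to y3 to two GPEs.
indices = {
    "y0": {6, 9, 11, 12, 13, 14},       # expression (3A)
    "y1": {1, 2, 4, 8, 9, 11},          # expression (3B)
    "y2": {4, 5, 6, 9, 11, 12, 14},     # expression (3C)
    "y3": {2, 3, 4, 7, 8, 9},           # expression (3D)
}

def dsu_accesses(allocation):
    """allocation: list of GPEs, each given as a list of output neuron names."""
    return sum(len(set().union(*[indices[n] for n in gpe])) for gpe in allocation)

# Allocation in index-number order versus an allocation that maximizes sharing per GPE.
print(dsu_accesses([["y0", "y1"], ["y2", "y3"]]))   # 10 + 11 = 21 accesses
print(dsu_accesses([["y0", "y2"], ["y1", "y3"]]))   # 8 + 8 = 16 accesses
```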
An example in which four output neurons shown in
As set forth above, in allocation of processing elements in
As set forth above, in allocation of processing elements in
[Effect of Second Embodiment]
As set forth above, according to the arithmetic device in the second embodiment, output neurons are allocated to an appropriate processing element (PE) 43 regardless of the order of index numbers of the output neurons defined in the algorithm of the neural network. Thus, output neurons high in ratio of sharing of input data can be allocated to a plurality of processing elements (PE) 43 belonging to identical processing element group (GPE) 42, so that the number of times of access to data scatter unit (DSU) 41 from processing element group (GPE) 42 can be reduced. Consequently, processing performance of the arithmetic device can be enhanced.
According to the arithmetic device in the second embodiment, output neurons 93 within the same layer can be allocated to any processing element (PE) 43, so that the ratio of input data shared by the processing elements (PE) 43 can be raised and the processing time can thus be reduced. In the arithmetic device in the second embodiment, however, only the values of input neurons within the range of convolution kernel 90 at a certain slide position are taken into data scatter unit (DSU) 41 as input data, and the output neurons corresponding to these input neurons are allocated to the plurality of processing elements (PE) 43. Since the range of input neurons is thus restricted, there is an upper limit on the ratio of sharing of input data.
According to an arithmetic device in a third embodiment, in a convolution computation with convolution kernel 90 being slid with respect to input neurons 91, values of input neurons corresponding to a plurality of convolution kernels 90 different in slide position are simultaneously stored in data scatter unit (DSU) 41 as input data. Then, output neurons corresponding to these input neurons are allocated to a plurality of processing elements (PE) 43. The ratio of sharing of the input data can thus be higher and the processing time period can be shorter than in the second embodiment. Specific description will be given below with reference to the drawings.
[Configuration of Arithmetic Device in Third Embodiment]
Referring to
In the third embodiment, data scatter register (DSR) 50 in
Furthermore, data gather unit (DGU) 44 in
Switch 72 rearranges results of the computation output from the plurality of activators (ACT) 70 into the original order of indices and transfers the results to data gather register (DGR) 71 as in
As described with reference to
Data gather register (DGR) 71 is divided into a plurality of subregisters (DGR1 to DGRα) corresponding to the respective kernels. Data gather register (DGR) 71 can thus simultaneously hold output data corresponding to the α kernels from the first kernel to the αth kernel. Data gather register (DGR) 71 outputs the stored results of computation to output data buffer (NBout) 45 at the time point of completion of the convolution computation of the input neurons within the range corresponding to the α kernels.
[Operations by Arithmetic Device]
Specifically, in step S102A, data scatter unit (DSU) 41 takes in the data necessary for the convolution computation within the α kernel ranges, from the first kernel range to the αth kernel range, from input data buffer (NBin) 40. That is, the data necessary for the convolution computation for α kernels is copied from input data buffer (NBin) 40 to data scatter unit (DSU) 41.
The convolution computation thereafter is repeated until the convolution computation for α kernels is completed. In other words, the procedure from steps S106 to S160 is repeated until determination as YES is made in step S164A.
In step S170A, data gather unit (DGU) 44 outputs output data representing results of convolution computation within α kernel ranges from embedded data gather register (DGR) 71 to output data buffer (NBout) 45.
[Specific Example of Data Processing]
An effect of the third embodiment will be described below with reference to specific examples.
As shown in
Specifically, in order to calculate values of output neurons y0,0 to y1,1, a multiply-and-accumulate computation is carried out in accordance with expressions (4A) to (4D) below and thereafter an activation function is applied.
y0,0←w0,4·x4+w0,5·x5+w0,6·x6+w0,13·x13+w0,14·x14 (4A)
y0,1←w1,0·x0+w1,2·x2+w1,3·x3+w1,9·x9+w1,10·x10+w1,11·x11+w1,12·x12 (4B)
y1,0←w0,4·x8+w0,5·x9+w0,6·x10+w0,13·x17+w0,14·x18 (4C)
y1,1←w1,0·x4+w1,2·x6+w1,3·x7+w1,9·x13+w1,10·x14+w1,11·x15+w1,12·x16 (4D)
In the expressions (4A) to (4D), wij (0≤i≤1, 0≤j≤19) represents a weight coefficient. A bias value may be added to each of y0,0, y0,1, y1,0, and y1,1.
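For illustration only, the reuse visible in the expressions (4A) to (4D) can be sketched in Python as follows: the second slide position applies the same weight coefficients and the same relative index pattern as the first, with the input indices offset by the slide amount. The weight values, the offset of 4, and the omission of the activation function are assumptions tied to this example.

```python
# Sketch of the index-offset reuse between the first and second kernel positions.
w0 = {4: 0.1, 5: 0.2, 6: 0.3, 13: 0.4, 14: 0.5}                   # weights shared by y0,0 and y1,0
w1 = {0: 0.1, 2: 0.2, 3: 0.3, 9: 0.4, 10: 0.5, 11: 0.6, 12: 0.7}  # weights shared by y0,1 and y1,1

def output_neuron(x, weights, offset):
    # Same index pattern and weights, shifted by the slide offset of the convolution kernel.
    return sum(w * x[idx + offset] for idx, w in weights.items())

x = [0.1 * j for j in range(20)]        # input neurons x0 to x19
y00 = output_neuron(x, w0, offset=0)    # expression (4A), first kernel position
y10 = output_neuron(x, w0, offset=4)    # expression (4C), second kernel position
y01 = output_neuron(x, w1, offset=0)    # expression (4B)
y11 = output_neuron(x, w1, offset=4)    # expression (4D)
print(y00, y10, y01, y11)
```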
An example in which four output neurons shown in
In the example in
Initially, processing element group (GPE0) 42 simultaneously calculates values of output neurons y0,0 and y0,1. Then, processing element group (GPE0) 42 simultaneously calculates values of output neurons y1,0 and y1,1.
In this case, since the simultaneously calculated values of the output neurons correspond to common convolution kernel 90, they can be output as they are without change in order of output. Therefore, a storage area for two output neurons should only be allocated to data gather register (DGR) 71 in data gather unit (DGU) 44. Specifically, when data gather register (DGR) 71 receives values of output neurons y0,0 and y0,1 corresponding to the first kernel, it outputs the values as they are to output data buffer (NBout) 45. Then, when data gather register (DGR) 71 receives values of output neurons y1,0 and y1,1 corresponding to the second kernel, it outputs the values as they are to output data buffer (NBout) 45.
In the example in
Initially, processing element group (GPE0) 42 simultaneously calculates values of output neurons y1,0 and y0,1. Then, processing element group (GPE0) 42 simultaneously calculates values of output neurons y0,0 and y1,1.
In this case, since the simultaneously calculated values of the output neurons correspond to different convolution kernels 90, switch 72 or connector 73 rearranges the order of output of the output data. Furthermore, a storage area for output neurons y0,0 and y0,1 corresponding to the first kernel and a storage area for output neurons y1,0 and y1,1 corresponding to the second kernel, that is, a storage area for four output neurons, are required in data gather register (DGR) 71 within data gather unit (DGU) 44. The output data is transferred from data gather register (DGR) 71 to output data buffer (NBout) 45 at the time point of completion of calculation of values of four output neurons.
Therefore, as shown in
Therefore, as shown in
As set forth above, it has been shown that the number of times of access to data scatter unit (DSU) 41 can be reduced and the processing speed can be increased by adopting the hardware configuration in the third embodiment and by optimally allocating output neurons to processing elements (PE) 43 as described in the second embodiment.
[Effect of Third Embodiment]
As set forth above, according to the arithmetic device in the third embodiment, in performing a convolution computation with convolution kernel 90 being slid with respect to input neurons 91 in the algorithm of the neural network, a plurality of output neurons corresponding to a plurality of convolution kernels 90 different in position of sliding are simultaneously allocated to processing elements (PE). Since computations for convolution kernels high in ratio of sharing of input data can thus be allocated to a plurality of processing elements (PE) 43 belonging to the same processing element group (GPE) 42, the number of times of access from processing element group (GPE) 42 to data scatter unit (DSU) 41 can be reduced. Consequently, processing performance of the arithmetic device can be enhanced.
A fourth embodiment is different from the first to third embodiments in a method of passing data between layers of a neural network. Processing in a layer in a subsequent stage can thus be started before completion of processing in a layer in a preceding stage. Description will be given below with reference to the drawings.
The neural network in
Furthermore, a data scatter unit (DSU) 41A in the arithmetic device in
According to the configuration, each convolutional layer in the neural network can start computation processing at the time when input data of one kernel necessary for a computation in processing element group (GPE) 42 is ready even before completion of processing in the preceding stage. Since layers in the neural network are thus executed in parallel, processing can be faster.
Referring to
In next step S110, corresponding processing element group (GPE) 42 reads data from data scatter register (DSR) 50 by accessing data scatter register (DSR) 50. Since processing from next step S120 to step S164 (that is, the step of determining whether or not the convolution computation for one kernel range has been completed) is the same as in
Data stored in data gather register (DGR) 71 is not output to output queue (FIFOout) 81 until a computation of output neurons within one kernel range is completed (that is, determination as YES is made in step S164). A procedure from step S104 to step S172 (that is, the step of outputting data from data gather unit (DGU) 44 to output queue (FIFOout) 81) is repeated until the convolution computation is completed for all kernels (determination as YES is made in step S180).
As set forth above, according to the arithmetic device for the neural network in the fourth embodiment, a queue is used for passing data between layers of the neural network, and the data input from the preceding stage through the queue is transferred to data scatter register (DSR) 50 through line buffer 82. By interposing line buffer 82, data scatter unit (DSU) 41A can have data scatter register (DSR) 50 store the input data for one kernel in an order different from the order in which the input data is given from the queue. Since processing in a layer in a subsequent stage can thus be started before completion of processing in a layer in a preceding stage, processing can be faster.
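For illustration only, the data passing described above can be sketched in Python as follows: input values arrive in queue order, line buffer 82 retains the most recent values, and one kernel's worth of input is handed to the data scatter register as soon as all of its indices have arrived. The class name, buffer depth, availability check, and index arithmetic are illustrative assumptions and do not reproduce the disclosed circuit.

```python
# Hedged sketch of FIFO-fed data passing with a line buffer.
from collections import deque

class LineBuffer:
    def __init__(self, depth):
        self.depth = depth
        self.buf = deque(maxlen=depth)      # retains the last `depth` inputs from the queue
        self.consumed = 0                   # absolute index of the oldest value still held

    def push(self, value):
        if len(self.buf) == self.depth:
            self.consumed += 1              # oldest value is overwritten
        self.buf.append(value)

    def kernel_window(self, indices):
        """Return DSR contents for one kernel once all of its input indices are buffered."""
        if any(i < self.consumed or i >= self.consumed + len(self.buf) for i in indices):
            return None                     # not yet available, or already overwritten
        return [self.buf[i - self.consumed] for i in indices]

# The layer can start computing as soon as input x5 has arrived, before the preceding
# layer has produced the rest of its output.
lb = LineBuffer(depth=8)
for j, value in enumerate(float(v) for v in range(12)):
    lb.push(value)
    window = lb.kernel_window([0, 2, 5])
    if window is not None:
        print("kernel ready after", j + 1, "inputs:", window)
        break
```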
It should be understood that the embodiments disclosed herein are illustrative and non-restrictive in every respect. The scope of the present invention is defined by the terms of the claims rather than the description above and is intended to include any modifications within the scope and meaning equivalent to the terms of the claims.
30 neural network; 31 neuron buffer; 32 convolutional layer; 33 pooling layer; 39 arithmetic device; 40 input data buffer (NBin); 41 data scatter unit (DSU); 42 processing element group (GPE); 43 processing element (PE); 44 data gather unit (DGU); 45 output data buffer (NBout); 50 data scatter register (DSR); 51 port; 60 index string memory; 61 weight coefficient string memory; 62 multiply-and-accumulate computation unit; 63, 64 pointer; 65 multiplier; 66 adder; 67 flip flop; 70 activator (ACT); 71 data gather register (DGR); 72 switch; 73 connector; 80 input queue (FIFOin); 81 output queue (FIFOout); 82 line buffer; x0 to x19 input neuron; y0 to y3, y0,0 to y1,1 output neuron
Priority: Japanese Patent Application No. 2018-093750, filed May 2018 (JP).
International Application: PCT/JP2019/002780, filed Jan. 28, 2019 (WO).
International Publication: WO 2019/220692 A, published Nov. 21, 2019 (WO).
Other Publications:
Yizhou Shan, "Near Memory Processing," Feb. 15, 2018, http://lastweek.io/lego/paper/nmp/.
Hyoukjun Kwon et al., "Rethinking NoCs for Spatial Neural Network Accelerators," NOCS '17, Seoul, Republic of Korea, Oct. 19-20, 2017.
Shaoli Liu et al., "Cambricon: An Instruction Set Architecture for Neural Networks," 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture, Seoul, South Korea, Jun. 18-22, 2016.
Kevin Kiningham et al., "Design and Analysis of a Hardware CNN Accelerator," 2017, https://cs231n.stanford.edu/reports/2017/pdfs/116.pdf.
Bryon Moyer, "How Does Scatter/Gather Work?," Feb. 9, 2017, https://www.eejournal.com/article/20170209-scatter-gather.
David Whelihan et al., "P-sync: A Photonically Enabled Architecture for Efficient Non-local Data Access," 2013 IEEE 27th International Symposium on Parallel & Distributed Processing, May 20-24, 2013.
Kaiyuan Guo et al., "A Survey of FPGA-Based Neural Network Inference Accelerator," ACM Transactions on Reconfigurable Technology and Systems, vol. 9, no. 4, article 11, Dec. 2017.
Edouard Oyallon et al., "Scaling the Scattering Transform: Deep Hybrid Networks," 2017 IEEE International Conference on Computer Vision, Venice, Italy, Oct. 22-29, 2017.
International Search Report (PCT/ISA/210), with translation, and Written Opinion (PCT/ISA/237), mailed Apr. 16, 2019, by the Japanese Patent Office as the International Searching Authority for International Application No. PCT/JP2019/002780.
Musha et al., "Deep Learning Acceleration in Large-Scale Multi-FPGA System," The Institute of Electronics, Information and Communication Engineers, vol. 117, no. 278, Oct. 2017, pp. 1-6.
Ohba et al., "Broadcasting of Data Transfer in Multicore Neural Network Accelerator," IPSJ SIG Technical Report, Mar. 2017, pp. 1-6.
Zhang et al., "Cambricon-X: An Accelerator for Sparse Neural Networks," Proceedings of 49th IEEE/ACM International Symposium on Microarchitecture, 2016, 12 pages.
Office Action dated Nov. 16, 2021, issued in corresponding Japanese Patent Application No. 2020-518966, 8 pages including 4 pages of English translation.
Publication: US 2021/0241083 A1, published Aug. 2021 (US).