This application claims priority to foreign French patent application No. FR 2008236, filed on Aug. 3, 2020, the disclosure of which is incorporated by reference in its entirety.
The invention relates in general to digital neuromorphic networks, and more particularly to a reconfigurable computer architecture for the computing of artificial neural networks based on convolutional or fully connected layers.
Artificial neural networks are computational models imitating the operation of biological neural networks. Artificial neural networks comprise neurons that are interconnected by synapses, and each synapse is attached to a weight, implemented for example by digital memories. Artificial neural networks are used in various fields in which (visual, audio, inter alia) signals are processed, such as for example in the field of image classification or of image recognition.
Convolutional neural networks correspond to a particular model of artificial neural networks. Convolutional neural networks were first described in the article by K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position", Biological Cybernetics, 36(4):193-202, 1980, ISSN 0340-1200, doi: 10.1007/BF00344251.
Convolutional neural networks (also known as "deep convolutional neural networks" or "ConvNets") are neural networks inspired by biological visual systems.
Convolutional neural networks (CNN) are used notably in image classification systems to improve classification. When applied to image recognition, these networks make it possible to learn intermediate representations of objects in images. These intermediate representations, which capture elementary features (in terms of shapes or contours, for example), are smaller and able to be generalized to similar objects, thereby making those objects easier to recognize. However, the intrinsically parallel operation and the complexity of convolutional-neural-network classifiers make them difficult to implement in embedded systems with limited resources. Specifically, embedded systems impose strict constraints in terms of the footprint of the circuit and in terms of electricity consumption.
The convolutional neural network is based on a sequence of layers of neurons, which may be convolutional layers, fully connected layers or layers carrying out other processing operations on data of an image. In the case of fully connected layers, a synapse connects each neuron of a layer to a neuron of the preceding layer. In the case of convolutional layers, only a subset of the neurons of a layer is connected to a subset of the neurons of another layer. Moreover, convolutional neural networks are able to process multiple input channels so as to generate multiple output channels. Each input channel corresponds for example to a different data matrix.
The input channels contain input images in matrix form, thus forming input matrices; output images in matrix form are obtained on the output channels.
The matrices of synaptic coefficients for a convolutional layer are also called “convolution kernels”.
In particular, convolutional neural networks comprise one or more convolutional layers, which are particularly expensive in terms of number of operations. The operations that are performed are mainly multiplication and accumulation (MAC) operations. Moreover, in order to comply with the latency and processing time constraints specific to the targeted applications, it is necessary to parallelize the computations as much as possible.
More particularly, when convolutional neural networks are embedded in a mobile system for telephony for example (as opposed to an implementation in data centre infrastructures), reducing electricity consumption becomes an essential criterion for implementing the neural network. In this type of implementation, the solutions from the prior art contain memories external to the computing units. This increases the number of read and write operations between separate electronic chips of the system. These data exchange operations between various chips are highly energy-consuming for a system dedicated to a mobile application (telephony, autonomous vehicle, robotics, etc.).
There is therefore a need for computers that are able to implement a convolutional layer of a neural network with limited complexity in order to satisfy the constraints of embedded systems and of the targeted applications. More particularly, there is a need to adapt the architectures of neural network computers so as to integrate memory blocks into the same chip containing the computing units (MAC). This solution limits the distances covered by the computing data and thus makes it possible to reduce the consumption of the entire neural network by limiting the number of read and write operations from and to said memories.
A neural network may propagate data from the input layer to the output layer, but also back-propagate error signals computed during a learning cycle from the output layer to the input layer. If the weights are put into a weight matrix so as to produce an inference (propagation), the order of the weights in this matrix is not suited to the computations carried out for a back-propagation phase.
More particularly, in neural network computing circuits according to the prior art, the synaptic coefficients (or weights) are stored in an external memory. During the execution of a computing step, buffer memories temporarily receive a certain number of the synaptic coefficients. These buffer memories are then refilled in each computing step with the weights to be used during a computing phase (inference or back-propagation) and in the order specific to the carrying out of this computing phase. These recurrent data exchanges considerably increase the consumption of the circuit. In addition, it is not feasible to double the number of memories (each suited to a computing phase) since this considerably increases the footprint of the circuit. The idea is to use internal memories containing the weights in a certain order while at the same time adapting the computer circuit in accordance with two configurations each suited to carrying out a computing phase (propagation or back-propagation).
The invention proposes a computer architecture that makes it possible to reduce the electricity consumption of a neural network implemented on a chip, and to limit the number of read and write access operations between the computing units of the computer and the external memories. The invention proposes an artificial neural network accelerator computer architecture in which all of the memories containing the synaptic coefficients are implemented on the chip containing the computing units of the layers of neurons of the network. The architecture according to the invention exhibits configuration flexibility, implemented via an arrangement of multiplexers, for configuring the computer in accordance with two separate configurations. Combining this configuration flexibility with an appropriate distribution of the synaptic coefficients in the internal weight memories makes it possible to execute the many computing operations of an inference phase or of a learning phase. The architecture proposed by the invention thus minimizes data exchanges between the computing units and the external memories, or memories situated a relatively great distance away in the system-on-chip. This leads to an improvement in the energy efficiency of the neural network computer embedded in a mobile system. The accelerator computer architecture according to the invention is compatible with developing memory technologies such as NVM (non-volatile memory), which require a limited number of write operations. The accelerator computer according to the invention is also capable of executing the operations of updating the weights. It is compatible with inference and back-propagation computations (depending on the chosen configuration) for computing convolutional layers and fully connected layers, in accordance with the specific distribution of the synaptic coefficients or of the convolution kernels in the weight memories.
The invention relates to a computer for computing a layer of an artificial neural network. The neural network is formed of a sequence of layers each consisting of a set of neurons. Each layer is associated with a set of synaptic coefficients forming at least one weight matrix.
The computer is able to be configured in accordance with two separate configurations and comprises:
a transmission line for distributing input data;
a set of computing units of ranks n=0 to N, where N is an integer greater than or equal to 1, for computing an input data sum weighted by synaptic coefficients; a set of weight memories each associated with a computing unit, each weight memory containing a subset of synaptic coefficients required and sufficient for the associated computing unit to carry out the computations necessary for either one of the two configurations;
control means for configuring the computing units of the computer in accordance with either one of the two configurations; in the first configuration, the computing units are configured such that a weighted sum is computed in full by one and the same computing unit; in the second configuration, the computing units are configured such that a weighted sum is computed by a chain of multiple computing units arranged in series.
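By way of illustration, the functional difference between the two configurations can be sketched as follows. This is a minimal software model only; the function names are illustrative and do not appear in the application, and the sketch models the arithmetic, not the hardware.

```python
def config1_full_sum(x, w):
    """First configuration: a single computing unit carries out every
    multiply-accumulate (MAC) of the weighted sum in its own accumulator."""
    acc = 0
    for xj, wj in zip(x, w):
        acc += xj * wj  # every MAC stays inside one unit
    return acc

def config2_chained_sum(x, w):
    """Second configuration: the units are chained in series; each unit
    of rank n multiplies its operand and adds the partial result
    received from the unit of rank n-1."""
    partial = 0  # stands in for the input of the first unit of the chain
    for xj, wj in zip(x, w):  # one iteration per unit in the chain
        partial = partial + xj * wj
    return partial
```

Both configurations produce the same weighted sum; they differ only in which computing unit performs each MAC, which is what allows the same weight memories to serve both phases.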
According to one particular aspect of the invention, the first configuration and the second configuration correspond, respectively, to operation of the computer in either one of the phases from among a data propagation phase and an error back-propagation phase.
According to one particular aspect of the invention, the input data are data propagated in the data propagation phase or errors back-propagated in the error back-propagation phase.
According to one particular aspect of the invention, the number of computing units is lower than the number of neurons in a layer.
According to one particular aspect of the invention, each computing unit comprises:
i. an input register for storing an input datum;
ii. a multiplier circuit for computing the product of an input datum and a synaptic coefficient;
iii. an adder circuit having a first input connected to the output of the multiplier circuit and being configured so as to carry out operations of summing partial computing results of a weighted sum;
iv. at least one accumulator for storing partial or final computing results of the weighted sum.
According to one particular aspect of the invention, the computer furthermore comprises: a data distribution element having N+1 outputs, each output being connected to the register of a computing unit of rank n. The distribution element is commanded by the control means so as to simultaneously distribute an input datum to all of the computing units when the first configuration is activated.
According to one particular aspect of the invention, the computer furthermore comprises a memory stage operating in accordance with a “first in first out” principle so as to propagate a partial result from the last computing unit of rank n=N to the first computing unit of rank n=0, the memory stage being activated by the control means when the second configuration is activated.
According to one particular aspect of the invention, each computing unit comprises at least a number of accumulators equal to the number of neurons per layer divided by the number of computing units rounded up to the nearest integer.
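This sizing rule is a ceiling division, which may be sketched as follows (illustrative names only, assuming integer neuron and unit counts):

```python
import math

def accumulators_per_unit(num_neurons, num_units):
    """Minimum number of accumulators per computing unit: the number of
    neurons in the layer divided by the number of computing units,
    rounded up to the nearest integer."""
    return math.ceil(num_neurons / num_units)

# 16 neurons over 4 units -> 4 accumulators per unit;
# 10 neurons over 4 units -> 3 accumulators per unit (some slots unused).
```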
According to one particular aspect of the invention, each set of accumulators comprises a write input able to be selected from among the inputs of each accumulator of the set and a read output able to be selected from among the outputs of each accumulator of the set.
Each computing unit of rank n=1 to N comprises: a multiplexer having a first input connected to the output of the set of accumulators of the computing unit of rank n, a second input connected to the output of the set of accumulators of a computing unit of rank n−1 and an output connected to a second input of the adder circuit of the computing unit of rank n.
The computing unit of rank n=0 comprises: a multiplexer having a first input connected to the output of the set of accumulators of the computing unit of rank n=0, a second input connected to the output of the set of accumulators of the computing unit of rank n=0 and an output connected to a second input of the adder circuit of the computing unit of rank n=0.
The control means are configured so as to select the first input of each multiplexer when the first configuration is chosen and to select the second input of each multiplexer when the second configuration is activated.
According to one particular aspect of the invention, all of the sets of accumulators are interconnected so as to form a memory stage for propagating a partial result from the last computing unit of rank n=N to the first computing unit of rank n=0, the memory stage operating in accordance with a “first in first out” principle when the second configuration is activated.
According to one particular aspect of the invention, the computer comprises a set of error memories, such that each one is associated with a computing unit, for storing a subset of computed errors.
According to one particular aspect of the invention, for each computing unit, the multiplier is connected to the error memory associated with the same computing unit so as to compute the product of an input datum and a stored error signal during a phase of updating the weights.
According to one particular aspect of the invention, the computer comprises a read circuit connected to each weight memory for commanding the reading of the synaptic coefficients.
According to one particular aspect of the invention, in the computer, a computed layer is fully connected to the preceding layer, and the associated synaptic coefficients form a weight matrix of size M×M′, where M and M′ are the respective numbers of neurons in the two layers.
According to one particular aspect of the invention, the distribution element is commanded by the control means so as to distribute an input datum associated with a neuron of rank i to a computing unit of rank n, such that i modulo N+1 is equal to n when the second configuration is activated.
According to one particular aspect of the invention, when the first configuration is activated, all of the multiplication and addition operations for computing the weighted sum associated with the neuron of rank i are carried out exclusively by the computing unit of rank n, such that i modulo N+1 is equal to n.
According to one particular aspect of the invention, when the second configuration is activated, each computing unit of rank n=1 to N carries out the operation of multiplying each input datum associated with the neuron of rank j by a synaptic coefficient, such that j modulo N+1 is equal to n, followed by addition of the output from the computing unit of rank n-1, so as to obtain a partial or total result of a weighted sum.
According to one particular aspect of the invention, the subset of synaptic coefficients stored in the weight memory of rank n corresponds to the synaptic coefficients of all of the rows of rank i of the weight matrix, such that i modulo N+1 is equal to n, when the first configuration is a computing configuration for the data propagation phase and the second configuration is a computing configuration for the error back-propagation phase.
According to one particular aspect of the invention, the subset of synaptic coefficients stored in the weight memory of rank n corresponds to the synaptic coefficients of all of the columns of rank j of the weight matrix, such that j modulo N+1 is equal to n, when the first configuration is a computing configuration for the error back-propagation phase and the second configuration is a computing configuration for the data propagation phase.
According to one particular aspect of the invention, the neural network comprises at least one convolutional layer of neurons, the layer having a plurality of output matrices of rank q=0 to Q, where Q is a positive integer, each output matrix being obtained from at least one input matrix of rank p=0 to P, where P is a positive integer, for each input matrix of rank p and output matrix of rank q pair, the associated synaptic coefficients forming a weight matrix.
According to one particular aspect of the invention, when the first configuration is activated, all of the multiplication and addition operations for computing an output matrix of rank q are carried out exclusively by the computing unit of rank n, such that q modulo N+1 is equal to n.
According to one particular aspect of the invention, when the second configuration is activated, each computing unit of rank n=1 to N carries out the operations of computing the partial results obtained from each input matrix of rank p, such that p modulo N+1 is equal to n, followed by addition of the partial result from the computing unit of rank n-1.
According to one particular aspect of the invention, the subset of synaptic coefficients stored in the weight memory of rank n corresponds to the synaptic coefficients belonging to all of the weight matrices associated with the output matrix of rank q, such that q modulo N+1 is equal to n, when the first configuration is a computing configuration for the data propagation phase and the second configuration is a computing configuration for the error back-propagation phase.
According to one particular aspect of the invention, the subset of synaptic coefficients stored in the weight memory of rank n corresponds to the synaptic coefficients belonging to all of the weight matrices associated with the input matrix of rank p, such that p modulo N+1 is equal to n, when the first configuration is a computing configuration for the error back-propagation phase and the second configuration is a computing configuration for the data propagation phase.
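The two convolutional storage schemes above may be sketched as follows. This is an illustrative model only (no names are taken from the application), in which kernels[q][p] denotes the convolution kernel linking the input matrix of rank p to the output matrix of rank q.

```python
def distribute_kernels_by_output(kernels, num_units):
    """Propagation-oriented storage: weight memory n receives every
    kernel of the output matrices of rank q such that q modulo the
    number of units equals n."""
    memories = [[] for _ in range(num_units)]
    for q, kernels_for_q in enumerate(kernels):
        memories[q % num_units].append((q, kernels_for_q))
    return memories

def distribute_kernels_by_input(kernels, num_units):
    """Back-propagation-oriented storage: weight memory n receives every
    kernel associated with the input matrices of rank p such that
    p modulo the number of units equals n."""
    memories = [[] for _ in range(num_units)]
    for q, kernels_for_q in enumerate(kernels):
        for p, kernel in enumerate(kernels_for_q):
            memories[p % num_units].append((p, q, kernel))
    return memories
```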
Other features and advantages of the present invention will become more clearly apparent upon reading the following description with reference to the following appended drawings.
By way of indication, we will begin by describing one example of the overall structure of a convolutional neural network containing convolutional layers and fully connected layers.
Each layer consists of a set of neurons, which are connected to one or more preceding layers. Each neuron of a layer may be connected to one or more neurons of one or more preceding layers. The last layer of the network is called the “output layer”. The neurons are connected to one another by synapses associated with synaptic weights, which weight the efficiency of the connection between the neurons, form the adjustable parameters of a network and which store the information contained in the network. The synaptic weights may be positive or negative.
The input data of the neural network correspond to the input data of the first layer of the network. Running through the sequence of layers of neurons, the output data computed by an intermediate layer correspond to the input data of the following layer. The output data from the last layer of neurons correspond to the output data from the neural network.
The neural networks referred to as "convolutional" networks (or "deep convolutional" networks or "convnets") furthermore comprise particular types of layers, such as convolutional layers, pooling layers and fully connected layers. By definition, a convolutional neural network comprises at least one convolutional layer or "pooling" layer.
The architecture of the accelerator computer circuit according to the invention is suitable for executing the computations of convolutional layers or of fully connected layers. We will first describe the embodiment appropriate to the computation of a fully connected layer.
The layer of neurons Ck of rank k comprises M+1 neurons of rank j=0 to M, where M is a positive integer greater than or equal to 1. The neuron Njk of rank j belonging to the layer of rank k produces a value denoted Xjk at output.
The layer of neurons Ck+1 of rank k+1 comprises M′+1 neurons of rank i=0 to M′, where M′ is a positive integer greater than or equal to 1. The neuron Nik+1 of rank i belonging to the layer of rank k+1 produces a value denoted Xik+1 at output. In the example of
Since the layer Ck+1 is fully connected, each neuron Nik+1 belonging to this layer is connected to each of the neurons Njk by an artificial synapse. The synaptic coefficient that connects the neuron Nik+1 of rank i of the layer Ck+1 to the neuron Njk of rank j of the layer Ck is the scalar wijk+1. The set of synaptic coefficients linking the layer Ck+1 to the layer Ck thus forms a weight matrix of size (M′+1)×(M+1), denoted [MP]k+1. In
Let [Li]k+1 be the row vector of index i of the weight matrix [MP]k+1. [Li]k+1 consists of the following synaptic coefficients:
[Li]k+1=(wi0k+1, wi1k+1, wi2k+1, wi3k+1, . . . , wi(M-2)k+1, wi(M-1)k+1, wiMk+1).
The set of synaptic coefficients that form the row vector [Li]k+1 of the weight matrix [MP]k+1 correspond to all of the synapses connected to the neuron Nik+1 of rank i of the layer Ck+1, as shown in
Following the propagation direction "PROP" indicated in, the output value Xi(k+1) of the neuron Nik+1 is obtained as Xi(k+1)=S(Σj(Xjk·wijk+1)+bi), where S denotes the activation function and bi a bias term.
Developing the formula of the weighted sum used in the computation of Xi(k+1) during propagation of the data from the layer Ck to the layer Ck+1 gives the following sum:
Xi(k+1)=S(X0k·wi0k+1+X1k·wi1k+1+X2k·wi2k+1+ . . . +X(M-1)k·wi(M-1)k+1+XMk·wiMk+1+bi)
This then demonstrates that the subset, denoted Fi, of the synaptic coefficients used to compute the weighted sum Σj(Xjk·wijk+1) in order to obtain the output datum Xi(k+1) from the neuron Nik+1 is [Li]k+1, the row vector of index i of the weight matrix [MP]k+1.
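The propagation computation just developed may be sketched as follows. This is a minimal illustrative model (the function and variable names are not from the application), in which W[i] plays the role of the row vector [Li]k+1:

```python
def propagate_neuron(i, X_prev, W, b, S):
    """Output X_i^(k+1): the activation S applied to the dot product of
    row i of the weight matrix with the previous layer's outputs,
    plus the bias term b_i."""
    acc = b[i]
    for j, xj in enumerate(X_prev):
        acc += xj * W[i][j]  # only the row vector [Li]k+1 is read
    return S(acc)

def relu(v):
    """Example activation function (ReLU)."""
    return max(0.0, v)
```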
In preparation for the description of
A first propagation step for learning consists in processing a set of input images in exactly the same way as in inference mode (but in floating point mode). Unlike inference, it is necessary to store all of the values of Xi(k) (therefore of all of the layers) for all of the images.
When the last output layer has been computed, the second step, computing a cost function, is triggered. The result of the preceding step in the last layer of the network is compared, by way of a cost function, with labelled references. The derivative of the cost function is computed so as to obtain an error δiK for each neuron NiK of the final output layer CK. The computing operations in this step (cost function and differentiation) are carried out by an embedded microcontroller different from the computer that is the subject of the invention.
The following step consists in back-propagating the errors computed in the preceding step through the layers of the neural network starting from the output layer of rank K. More detail about this back-propagation phase will be given in the description of
The final step corresponds to updating the synaptic coefficients wijk of the entire neural network based on the results of the preceding computations for each neuron of each layer.
The direction of the back-propagation is illustrated in
Starting from the back-propagation direction "RETRO_PROP", in a learning phase, the error δjk associated with the neuron Njk of the layer Ck is computed using the following formula: δjk=Σi(δik+1·wijk+1)·∂S(x)/∂x, where ∂S(x)/∂x is the derivative of the activation function, which is equal to 0 or 1 when using a ReLU function. More generally, the multiplication by the derivative of the activation function is carried out by a dedicated operator circuit different from the accelerator computer that is the subject of the invention, whose main role is that of computing the weighted sum Σi(δik+1·wijk+1).
Developing the formula of the weighted sum used in the computation of δjk during back-propagation of the errors from the layer Ck+1 to the layer Ck gives the following sum:
δjk=δ0k+1·w0jk+1+δ1k+1·w1jk+1+δ2k+1·w2jk+1+ . . . +δ(M′-1)k+1·w(M′-1)jk+1+δM′k+1·wM′jk+1
This then demonstrates that the subset of the synaptic coefficients used to compute the weighted sum Σi(δik+1·wijk+1) of the neuron Njk corresponds to [Cj]k+1, the column vector of index j of the weight matrix [MP]k+1, where [Cj]k+1=(w0jk+1, w1jk+1, w2jk+1, w3jk+1, . . . , w(M′-2)jk+1, w(M′-1)jk+1, wM′jk+1).
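Symmetrically, the back-propagation computation may be sketched as follows (an illustrative model only: delta_next holds the errors δik+1 of the layer Ck+1 and W[i][j] the coefficient wijk+1, so that the loop reads only the column vector [Cj]k+1):

```python
def backprop_error(j, delta_next, W, dS):
    """Error delta_j^k: weighted sum over column j of the weight matrix,
    multiplied by the derivative dS of the activation function
    (computed elsewhere, by a dedicated operator circuit)."""
    acc = 0.0
    for i, d in enumerate(delta_next):
        acc += d * W[i][j]  # only the column vector [Cj]k+1 is read
    return acc * dS
```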
In
One objective of the neural layer computer CALC according to the invention consists in using the same memories to store the synaptic coefficients in accordance with a distribution appropriately chosen to execute both the data propagation phase and the error back-propagation phase. The computer is able to be configured in accordance with two separate configurations, respectively denoted CONF1 and CONF2, implemented via a specific arrangement of multiplexers that is described below. The computer thus makes it possible to compute weighted sums during a data propagation phase or an error back-propagation phase depending on the chosen configuration.
The computer CALC according to the invention comprises a transmission line denoted L_data for distributing input data Xjk or error data δik+1 in accordance with the execution of a propagation phase or back-propagation phase; a set of computing units denoted PEn of ranks n=0 to N, where N is a positive integer greater than or equal to 1, for computing a sum of input data weighted by synaptic coefficients; a set of weight memories denoted MEM_POIDSn, such that each weight memory is connected to a computing unit; control means for configuring the operation and the internal or external connections of the computing units in accordance with the first configuration CONF1 or the second configuration CONF2.
The computer CALC furthermore comprises a read stage denoted LECT connected to each weight memory MEM_POIDSn for commanding the reading of the synaptic coefficients wi,jk during the execution of the operations of computing the weighted sums.
The computer CALC furthermore comprises a set of error memories denoted MEM_errn of ranks n=0 to N, where N+1 is the number of computing units PEn in the computer CALC. Each error memory is associated with a computing unit for storing a subset of computed errors δjk that are used during the phase of updating the weights.
To understand the operation of the accelerator computer CALC according to the invention for each computing phase, specifically the propagation or the back-propagation,
Each computing unit PEn of rank n=0 to 3 comprises: an input register denoted Reg_inn for storing an input datum used in the computing of the weighted sum, be this a propagated datum Xi(k) or a back-propagated error δik+1 depending on the executed phase; a multiplier circuit denoted MULTn having two inputs and one output; an adder circuit denoted ADDn having a first input connected to the output of the multiplier circuit MULTn and being configured so as to carry out operations of summing partial computing results of a weighted sum; and at least one accumulator denoted ACCin for storing partial or final computing results of the weighted sum computed by the computing unit PEn of rank n or by another computing unit of a different rank, depending on the selected configuration.
The input data from the transmission line L_data are distributed to the various computing units PEn by controlling the activation of the loading of the input registers Reg_inn. Activation of the loading of an input register Reg_inn is commanded by the control means of the system. If the loading of a register Reg_inn is not activated, the register keeps the stored datum from the preceding computing cycle. If the loading of a register Reg_inn is activated, it stores the datum transmitted by the transmission line L_data during the current computing cycle.
As an alternative, the computer CALC furthermore comprises a distribution element denoted D1 commanded by the control means so as to organize the distribution of the input data from the transmission line L_data to the computing units PEn in accordance with the chosen computing configuration.
In the described embodiment, when the number of neurons per layer is greater than the number of computing units PEn in the computer CALC, each computing unit PEn comprises a plurality of accumulators ACCin. The set of accumulators belonging to the same computing unit comprises a write input denoted E1n able to be selected from among the inputs of each accumulator of the set and a read output denoted S1n able to be selected from among the outputs of each accumulator of the set. It is possible to implement this write input and read output selection functionality for a stack of accumulator registers through commands to activate the loading of the registers in write mode and multiplexers for the outputs, not shown in
Each computing unit PEn of rank n=0 to 3 furthermore comprises a multiplexer MUXn having two inputs denoted I1 and I2 and one output connected to the second input of the adder ADDn belonging to the computing unit PEn.
For the computing units PEn of rank n=1 to 3, the first input I1 of a multiplexer MUXn is connected to the output S1n of the set of accumulators {ACC0n ACC1n ACC2n . . . } belonging to the computing unit of rank n, and the second input I2 is connected to the output S1n-1 of the set of accumulators {ACC0n-1 ACC1n-1 ACC2n-1 . . . } of the computing unit of rank n−1. The output of the multiplexer MUXn is connected to the second input of the adder circuit ADDn belonging to the same computing unit PEn of rank n.
For the initial computing unit PE0 of rank 0, the two inputs of the multiplexer MUX0 are connected to the output S10 of the set of accumulators {ACC00 ACC10 ACC20} of the initial computing unit of rank 0. It is possible to dispense with this multiplexer, but it has been retained in this embodiment so as to obtain symmetrical computing units.
Each computing unit PEn of rank n=0 to 3 furthermore comprises a second multiplexer MUX′n having two inputs and one output connected to the second input of the multiplier circuit MULTn belonging to the same computing unit PEn. The first input of the multiplexer MUX′n is connected to the error memory MEM_errn of rank n and the second input is connected to the weight memory MEM_POIDSn of rank n. The multiplexer MUX′n thus makes it possible to select whether the multiplier MULTn computes the product of the input datum stored in the register Reg_inn and a synaptic coefficient wijk from the weight memory MEM_POIDSn (during a propagation or back-propagation) or an error value δjk stored in the error memory MEM_errn (during the updating of the weights).
As demonstrated above, the subset of the synaptic coefficients necessary and sufficient to compute the weighted sum Σj(Xjk·wijk+1) in order to obtain the output datum Xi(k+1) from the neuron Nik+1 during a propagation phase corresponds to [Li]k+1, the row vector of index i of the weight matrix [MP]k+1.
In order to solve the problem linked to minimizing the energy consumption of the neural network, the synaptic coefficients should be expediently distributed among the set of weight memories MEM_POIDSn so as to comply with the following criteria: the possibility of integrating the weight memories into the same chip of the computer; minimizing the number of write operations to the weight memories and minimizing the distances covered by the data during an exchange between a computing unit and a weight memory.
During a data propagation phase, the computing unit of rank n PEn carries out all of the multiplication and addition operations so as to compute the weighted sum Σj(Xjk·wijk+1) in order to obtain the output datum Xi(k+1) from the neuron Nik+1; the weight memory MEM_POIDSn of rank n associated with the computing unit PEn of rank n should contain the synaptic coefficients that form the row vector [Li]k+1 of the matrix [MP]k+1.
If the layer of neurons contains a number of neurons greater than the number of computing units, the computations are organized as follows: The computing unit of rank n PEn carries out all of the multiplication and addition operations so as to compute the weighted sum of each of the neurons of rank i Nik+1, such that i modulo (N+1) is equal to n.
By way of example, if the layer Ck+1 contains sixteen neurons and the computer CALC comprises four computing units {PE0, PE1, PE2, PE3} (N+1=4):
The computing unit PE0 computes the output data Xi(k+1) from the neurons N0k+1, N4k+1, N8k+1, N12k+1.
In parallel, the computing unit PE1 computes the output data Xi(k+1) from the neurons N1k+1, N5k+1, N9k+1, N13k+1.
In parallel, the computing unit PE2 computes the output data Xi(k+1) from the neurons N2k+1, N6k+1, N10k+1, N14k+1.
In parallel, the computing unit PE3 computes the output data Xi(k+1) from the neurons N3k+1, N7k+1, N11k+1, N15k+1.
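This round-robin assignment (output neuron i handled by the computing unit of rank i modulo the number of units) can be sketched as follows; the function and variable names are illustrative and not part of the invention:

```python
def assign_neurons(num_neurons, num_units):
    """Return, for each computing unit, the list of output-neuron indices it
    computes: neuron i goes to the unit of rank i % num_units."""
    mapping = {n: [] for n in range(num_units)}
    for i in range(num_neurons):
        mapping[i % num_units].append(i)
    return mapping

# Sixteen output neurons distributed over four computing units:
# unit 0 gets [0, 4, 8, 12], unit 1 gets [1, 5, 9, 13], and so on.
print(assign_neurons(16, 4))
```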
To achieve the computing parallelism described above during a propagation phase (computing performance criterion), while at the same time complying with the abovementioned criteria linked to the memories (consumption criterion and implementation criterion), the synaptic coefficients wijk+1 are distributed among the weight memories such that each weight memory of rank n MEM_POIDSn contains exclusively the row vectors [Li]k+1 of the matrices [MP]k+1 for all of the fully connected layers, such that i modulo (N+1)=n.
This distribution will be retained in order to explain, with the following figures, the sequence of the computations executed by the computer according to the invention.
In a data propagation phase, each multiplexer MUX′n of rank n is configured, by the control means, so as to select the input connected to the associated weight memory.
When the first configuration CONF1 is chosen, the control means configure each multiplexer MUXn belonging to the computing unit PEn so as to select the input 11 connected to the set of accumulators {ACC0n ACC1n ACC2n . . . } of the same computing unit. The computing units PEn are thus disconnected from one another when the configuration CONF1 is chosen.
It will be recalled that each weight memory of rank n contains the subset of synaptic coefficients corresponding to the row vector [Li]k+1 of rank i of the matrix [MP]k+1 associated with the layer of neurons Ck+1, such that i modulo (N+1)=n.
When the computer is configured in accordance with the first configuration CONF1, the control means command the loading of the registers Reg_inn (or the distribution element D1 in an alternative embodiment) so as to simultaneously supply the same input datum Xik from the preceding layer Ck to all of the computing units PEn.
At a time t1, the computing unit PE0 computes the product w00k+1·X0k, the first term of the weighted sum Σj(Xjk·w0jk+1) giving the output datum from the neuron N0k+1; the computing unit PE1 computes the product w10k+1·X0k, the first term of the weighted sum Σj(Xjk·w1jk+1) giving the output datum from the neuron N1k+1; the computing unit PE2 computes the product w20k+1·X0k, the first term of the weighted sum Σj(Xjk·w2jk+1) giving the output datum from the neuron N2k+1; and the computing unit PE3 computes the product w30k+1·X0k, the first term of the weighted sum Σj(Xjk·w3jk+1) giving the output datum from the neuron N3k+1. Each computing unit PEn stores the obtained first term of its weighted sum in an accumulator ACC0n of the set of accumulators associated with the same computing unit.

At t2, the computing unit PE0 computes the product w01k+1·X1k, the second term of the weighted sum Σj(Xjk·w0jk+1), and the adder ADD0 sums the first term w00k+1·X0k stored in the accumulator ACC00 and the second term w01k+1·X1k via the loopback internal to the computing unit in accordance with the configuration CONF1; the computing unit PE1 computes the product w11k+1·X1k, the second term of the weighted sum Σj(Xjk·w1jk+1), and the adder ADD1 sums the first term w10k+1·X0k stored in the accumulator ACC01 and the second term w11k+1·X1k via the same internal loopback. The same computing process is executed by the computing units PE2 and PE3 in order to compute and store the partial results of the neurons N2k+1 and N3k+1.
If the weighted sum contains M terms (computed from M neurons of the layer Ck), the operation described above is reiterated M times until the final results Xik+1 of the first four neurons of the output layer Ck+1, specifically {N0k+1, N1k+1, N2k+1, N3k+1}, are obtained. In the cycle tM+1, the computing unit PE0 begins a new series of iterations in order to compute the terms of the weighted sum X4k+1 of the neuron N4k+1; the computing unit PE1 begins a new series of iterations in order to compute the terms of the weighted sum X5k+1 of the neuron N5k+1; the computing unit PE2 begins a new series of iterations in order to compute the terms of the weighted sum X6k+1 of the neuron N6k+1; and the computing unit PE3 begins a new series of iterations in order to compute the terms of the weighted sum X7k+1 of the neuron N7k+1. Thus, after M further cycles, the computer CALC has computed the neurons {N4k+1, N5k+1, N6k+1, N7k+1}.
The operation is reiterated until obtaining all of the Xik+1 from the output layer Ck+1. This computing method carried out by the computer does not require any write operation to the weight memories MEM_POIDSn since the distribution of the synaptic coefficients wi,jk+1 allows each computing unit to carry out all of the multiplication operations necessary and sufficient to compute the subset of the output neurons associated therewith.
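The propagation computation under the first configuration CONF1 can be modeled by the following minimal sketch; it is a sequential Python stand-in for units that operate in parallel in hardware, and the names are illustrative:

```python
import numpy as np

def propagate_conf1(X, W, num_units):
    """Illustrative sequential model of the propagation phase in CONF1.
    X: input data Xjk of layer Ck (length M).
    W: weight matrix [MP]k+1 of shape (num_out, M); row i is [Li]k+1 and is
    held by the weight memory of the unit of rank i % num_units."""
    num_out = W.shape[0]
    Y = np.zeros(num_out)
    for n in range(num_units):                  # each computing unit PEn...
        for i in range(n, num_out, num_units):  # ...handles its assigned neurons
            acc = 0.0
            for j in range(len(X)):             # M multiply-accumulate cycles
                acc += X[j] * W[i, j]
            Y[i] = acc                          # output datum Xi(k+1)
    return Y
```

Because every unit reads only its own rows of [MP]k+1, no write to the weight memories is needed during this phase, in line with the text above.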
As an alternative, a second computing method compatible with the computer CALC may be executed so as to minimize the number of write operations to the input registers Reg_inn, by avoiding loading an input datum Xik to the input registers Reg_inn multiple times.
To carry out the alternative computing method, the computer CALC operates as follows: At t1, the same computations are carried out by each computing unit PEn so as to obtain the first terms of the weighted sum of each of the neurons {N0k+1, N1k+1, N2k+1, N3k+1} that are stored in one of the associated accumulators. At t2, in contrast to the preceding computing method, the computing unit PE0 of rank n=0 does not compute the second term of the weighted sum of the output neurons N0k+1, but computes the first term of the weighted sum of the output neuron N4k+1 and stores the result in another accumulator ACC10 of the same computing unit. Next, at t3, the computing unit PE0 computes the first term of the output neuron N8k+1 and records the result in the following accumulator ACC20. The operation is reiterated until the computing unit PE0 obtains all of the first terms of each weighted sum of all of the output neurons Nik+1, such that i modulo (N+1)=0.
In parallel, each computing unit PEn of rank n computes and records the first partial results of all of the output neurons Nik+1, such that i modulo (N+1)=n.
Once the first partial results of each output neuron have been computed and recorded in the corresponding accumulator, the following input datum X1k is propagated to all of the input registers Reg_inn in order to compute and add the second term of each weighted sum in accordance with the same computing principle.
The same operation is repeated until having computed and added all of the partial results of all of the weighted sums of each output neuron.
This makes it possible to avoid writing the same input datum Xik to the input registers Reg_inn multiple times.
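The alternative schedule can be sketched in the same illustrative style; the outer loop over the inputs reflects the fact that each datum Xjk is written to the input registers only once (names are assumptions for illustration):

```python
def propagate_input_stationary(X, W, num_units):
    """Alternative schedule sketch: each input datum is broadcast to the units
    once; every unit then advances the partial sums of all of the neurons
    assigned to it before the next input datum is loaded.
    W[i][j] is the coefficient wijk+1; row i is held by the unit of rank
    i % num_units."""
    num_out = len(W)
    acc = [0.0] * num_out            # one accumulator slot per output neuron
    for j, xj in enumerate(X):       # Xj written to the registers only once
        for n in range(num_units):   # units operate in parallel in hardware
            for i in range(n, num_out, num_units):
                acc[i] += xj * W[i][j]
    return acc
```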
It will be recalled that, if the number of output neurons Nik+1 is greater than the number of computing units, it is necessary to have a plurality of accumulators in each computing unit. The minimum number of accumulators in a computing unit is equal to the number of output neurons Nik+1 (denoted M+1) divided by the number of computing units (N+1), rounded up to the nearest integer.
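The minimum accumulator count per unit is therefore a ceiling division, as in this illustrative helper (the function name is not from the source):

```python
import math

def accumulators_per_unit(num_output_neurons, num_units):
    """Minimum accumulator depth per computing unit: ceil((M+1) / (N+1))."""
    return math.ceil(num_output_neurons / num_units)

# Sixteen output neurons on four units need four accumulators per unit;
# ten output neurons on four units still need three.
print(accumulators_per_unit(16, 4), accumulators_per_unit(10, 4))
```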
The computer CALC associated with the operation described above, configured in accordance with the first configuration CONF1 and with an appropriately determined distribution of the synaptic coefficients wijk+1 between the weight memories MEM_POIDSn, executes all of the operations of computing a fully connected layer of neurons during propagation of the data or inference.
In an error back-propagation phase, each multiplexer MUX′n of rank n is configured, by the control means, so as to select the input connected to the associated weight memory.
When the second configuration CONF2 is chosen, the control means configure each multiplexer MUXn belonging to the computing unit PEn, where n=1 to N, so as to select the second input I2 connected to the output S1n-1 of the set of accumulators {ACC0n-1 ACC1n-1 ACC2n-1 . . . } of the preceding computing unit PEn-1 of rank n-1. The adder ADDn of each computing unit PEn (except for the initial computing unit) thus receives the partial computing results from the preceding computing unit and adds them to the output from the multiplier circuit MULTn. With regard to the initial computing unit PE0, the adder ADD0 remains connected to the set of accumulators {ACC00 ACC10 ACC20 . . . } of the same computing unit.
It will be recalled firstly that each weight memory MEM_POIDSn of rank n comprises each row vector [Li]k+1=(wi0k+1, wi1k+1, wi2k+1, wi3k+1 . . . , wi(M-2)k+1, wi(M-1)k+1, wiMk+1) of the matrix [MP]k+1 such that i modulo (N+1)=n.
Secondly, the subset of the synaptic coefficients used to compute the weighted sum Σi(δik+1·wijk+1) in order to obtain the output error δjk of the neuron Njk corresponds to the column vector [Cj]k+1 of index j of the weight matrix [MP]k+1, where [Cj]k+1=(w0jk+1, w1jk+1, w2jk+1, w3jk+1 . . . , w(M-2)jk+1, w(M-1)jk+1, wMjk+1).
A computing unit PEn of rank n thus cannot carry out, on its own, all of the multiplication operations for computing the weighted sum Σi(δik+1·wijk+1). In this case, the operations of computing the output neurons Njk during a back-propagation phase should be shared among all of the computing units, hence the establishment of a series connection between the computing units in order to be able to transfer the partial results through the chain consisting of the computing units PEn.
When the second configuration CONF2 is selected, the various sets of accumulators ACCij form a matrix of interconnected registers operating in accordance with a "first in first out" (FIFO) principle. Without loss of generality, this type of implementation is one example for propagating the flow of partial results between the last computing unit and the first computing unit of the chain. A simplified example explaining the operating principle of the "FIFO" memory in the computer according to the invention will be described below.
In one alternative embodiment, it is possible to implement the operation in accordance with the “first in first out” (FIFO) principle using a FIFO memory stage whose input is connected to the accumulator ACC0N of the last computing unit PEN and whose output is connected to the input I2 of the multiplexer MUX0 of the initial computing unit PE0. In this embodiment, each computing unit PEn of rank n comprises only one accumulator ACC0n comprising the partial results of the computing of the weighted sum carried out by the same computing unit PEn.
In the first computing cycle t1, the computing unit PE0 multiplies the first error datum δ0(k+1) by the weight w00(k+1) and transmits the result to the following computing unit PE1 which, in the second computing cycle t2, adds to it the product of the second datum δ1(k+1) and the weight w10(k+1) and transmits the result to the computing unit PE2, and so on, until obtaining the partial sum consisting of the first four terms of the weighted sum of the output δ0(k), equal to:
δ0(k+1)·w00k+1+δ1(k+1)·w10k+1+δ2(k+1)·w20k+1+δ3(k+1)·w30k+1
During this same second cycle t2, the computing unit PE0 multiplies the first datum δ0(k+1), still stored in its input register Reg_in0, by the weight w01(k+1) and transmits the result to the following computing unit PE1, which adds δ0(k+1)·w01(k+1) to δ1(k+1)·w11(k+1) at t3 in order to compute the output δ1(k). The same principle is repeated along the chain of computing units, as illustrated in
At the end of the fourth cycle t4, the last computing unit of the chain PE3 therefore obtains a partial result of δ0(k) over the first four data. This partial result enters the FIFO structure formed by the accumulators of all of the computing units.
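The chained accumulation along the computing units can be modeled sequentially as follows; this is an illustrative sketch for the simple case in which the number of errors equals the number of computing units, and the names are not from the source:

```python
import numpy as np

def backprop_chain(delta, W):
    """Sketch of the chained back-propagation described above.
    delta: errors δi(k+1), one held in the input register of each unit PEi.
    W: matrix [MP]k+1; row i is held in the weight memory of PEi.
    Each output error δj(k) is accumulated while the partial sum travels
    PE0 -> PE1 -> ... -> PEN along the chain."""
    num_units, num_in = W.shape
    delta_out = np.zeros(num_in)
    for j in range(num_in):              # one traversal of the chain per δj(k)
        partial = 0.0
        for n in range(num_units):       # PEn adds its term and passes the sum on
            partial += delta[n] * W[n, j]
        delta_out[j] = partial
    return delta_out
```

The result is the column-wise weighted sum Σi(δik+1·wijk+1), obtained without any single unit holding a full column of [MP]k+1.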
The depth of the memory stage operating in FIFO mode should be dimensioned so as to achieve the following operation. By way of example, the first partial result of δ0(k), equal to δ0(k+1)·w00k+1+δ1(k+1)·w10k+1+δ2(k+1)·w20k+1+δ3(k+1)·w30k+1, should be present in an accumulator of the set of accumulators of the initial computing unit in the corresponding cycle, when the initial computing unit PE0 resumes the computation of δ0(k).
This depends on the sequence of the computing operations carried out by the computer CALC during the back-propagation phase. Without loss of generality, we will describe one possible operation of the set of accumulators that avoids carrying out multiple successive read operations on the input data in the input registers Reg_inn.
In the computing cycle t5, the initial computing unit PE0 computes the first term of the weighted sum of the error δ4(k). After M computing cycles, the initial computing unit PE0 resumes computing the error δ0(k), after having computed the partial result consisting of the first four terms for all of the output neurons Nik. In this case, the depth of the memory stage operating in FIFO mode should be equal to the number of neurons of the layer Ck. Each computing unit thus comprises a set of accumulators consisting of S accumulators, such that S is equal to the number of neurons of the output layer Ck divided by the number of computing units PEn, rounded up to the nearest integer.
To explain the routing of the computed partial results through the set of accumulators in accordance with the "first in first out" principle, consider the following simplified example:
The number of neurons in the input layer Ck+1 is 8.
The number of neurons in the output layer Ck is 8.
The computer CALC contains four computing units PEn, where n is from 0 to 3.
Each computing unit PEn of rank n contains two accumulators ACC0n and ACC1n.
Let RPj(δi(k)) be the partial result consisting of the j first terms of the weighted sum corresponding to the output result δi(k).
The sequence of the computations during the first four cycles t1 to t4 has been described above. At t4, the accumulator ACC03 of the last computing unit PE3 contains the partial result of δ0(k) containing the first four terms, denoted RP4(δ0(k)); the accumulator ACC02 of the computing unit PE2 contains the partial result of δ1(k) consisting of the first three terms, denoted RP3(δ1(k)); the accumulator ACC01 of the computing unit PE1 contains the partial result of δ2(k) consisting of the first two terms, denoted RP2(δ2(k)); and the accumulator ACC00 of the computing unit PE0 contains the partial result of δ3(k) consisting of the first term, denoted RP1(δ3(k)). The rest of the accumulators {ACC10 ACC11 ACC12 ACC13} used to implement the FIFO function are empty in this computing step.
At t5, the partial result RP4(δ0(k)) is transferred to the second accumulator of the computing unit PE3, denoted ACC13. The partial result RP4(δ0(k)) thus enters the row of accumulators {ACC10 ACC11 ACC12 ACC13} that form the FIFO. At the same time, the initial computing unit PE0 computes the first product of the error δ4(k) so as to store, in ACC00, the partial result of δ4(k) consisting of the first term, denoted RP1(δ4(k)); the computing unit PE1 computes the second product of the error δ3(k) so as to store, in ACC01, the partial result of δ3(k) consisting of the first two terms, denoted RP2(δ3(k)). In the same way, ACC02 contains the partial result RP3(δ2(k)) and ACC03 contains the partial result RP4(δ1(k)).
At t6, the partial result RP4(δ0(k)) is transferred to the second accumulator ACC12 of the preceding computing unit. The partial result RP4(δ1(k)) is transferred to the accumulator ACC13 and thus enters the group of accumulators that forms the FIFO. The computations through the computing unit chain continue in the same way as described above.
Thus, in each computing cycle, each partial result computed by the last computing unit enters the chain of accumulators {ACC10 ACC11 ACC12 ACC13} that form the FIFO, and the initial computing unit initiates the computations of the first term of a new output result δi(k).
The partial result RP4(δ0(k)) runs through the FIFO chain, being transferred to one of the accumulators of the preceding computing unit in each computing cycle.
At t8, the partial result RP4(δ0(k)) is stored in the last accumulator of the FIFO chain corresponding to ACC10, while the initial computing unit PE0 computes the first term of the partial result RP1(δ7(k)) stored in the accumulator ACC00 and corresponding to the last neuron of the computed layer.
At t9, the initial computing unit PE0 resumes computing the error δ0(k). The computing unit PE0 adds RP4(δ0(k)), stored beforehand in the accumulator ACC10, to the multiplication result at the output of MULT and stores the obtained partial result RP5(δ0(k)) in ACC00. A second cycle of multiplication and summing operations through the computing unit chain PEn is started.
The same principle applies to the other partial results of the other errors δi(k), thereby creating a mode of operation in which the partial results run in succession, in a defined order, through the FIFO memory stage from the last computing unit PE3 to the initial computing unit PE0.
This mode of operation may be generalized with a chain of FIFO accumulators comprising multiple rows of accumulators if the ratio between the number of neurons in the computed layer and the number of computing units is greater than 2.
Thus, when the second configuration CONF2 is chosen, each computing unit PEn comprises a set of accumulators ACC such that at least one accumulator is intended to store the partial results from the same computing unit PEn, and the rest of the accumulators are intended to form the FIFO chain with the adjacent accumulators belonging to the same computing unit or to an adjacent computing unit.
The accumulators used to form the FIFO chain serve to transmit a partial result computed by the last computing unit PE3 to the first computing unit PE0 in order to continue computing the weighted sum when the number of neurons is greater than the number of computing units.
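The role of the spare accumulators as a delay line can be illustrated with a software FIFO; the depth and the stored labels follow the eight-neuron example above, and the model is purely illustrative:

```python
from collections import deque

# Illustrative model: the spare accumulators {ACC10 ACC11 ACC12 ACC13} act as
# a four-stage FIFO that delays a completed partial result until PE0 reuses it.
fifo = deque(maxlen=4)

fifo.append("RP4(d0)")   # t5: PE3 pushes the partial result RP4(δ0(k))
fifo.append("RP4(d1)")   # t6
fifo.append("RP4(d2)")   # t7
fifo.append("RP4(d3)")   # t8: RP4(δ0(k)) has reached the head of the chain

head = fifo.popleft()    # t9: PE0 pops it to resume computing δ0(k)
assert head == "RP4(d0)"
```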
The FIFO chain consisting of a plurality of accumulators may be implemented by connecting the accumulators to a tri-state bus connecting the outputs of the associated sets of accumulators to the various computing units.
As an alternative, the FIFO chain may also be implemented by converting the accumulator registers to shift registers.
In conclusion, the computer CALC according to the invention makes it possible to compute a fully connected layer of neurons in a propagation phase when the first configuration CONF1 is chosen. The computer additionally computes a fully connected layer of neurons in a back-propagation phase when the second configuration CONF2 is chosen. This mode of operation is compatible with the following distribution of the synaptic coefficients: the subset of synaptic coefficients stored in the weight memory MEM_POIDSn of rank n corresponds to the synaptic coefficients wi,jk of all of the rows [Li] of rank i of the weight matrix [MP]k, such that i modulo (N+1) is equal to n.
As an alternative, by symmetry, the computer CALC may furthermore compute a fully connected layer of neurons in a propagation phase when the second configuration CONF2 is chosen. The computer additionally computes a fully connected layer of neurons in a back-propagation phase when the first configuration CONF1 is chosen. This mode of operation is compatible with the following distribution of the synaptic coefficients: the subset of synaptic coefficients stored in the weight memory MEM_POIDSn of rank n corresponds to the synaptic coefficients wi,jk of all of the columns [Ci] of rank i of the weight matrix [MP]k, such that i modulo (N+1) is equal to n.
To carry out a learning phase for a neural network, the synaptic coefficients are updated based on the data propagated during a propagation phase and the errors computed for each layer of neurons following back-propagation of errors for a set of image samples used for learning.
The multiplexers MUXn are configured in accordance with the first configuration CONF1; what changes is the selection of the input of the multiplier circuits MULTn. Specifically, the phase of updating the weights comprises the following computation: ΔWij(k)=(1/Nbatch)·ΣNbatch Xi(k)·δj(k), where Nbatch is the number of image samples used for the learning and the ΔWij(k) are the weight increments used for the updating.
During the computing of the errors δj(k) of a layer of neurons Ck, the output results δj(k) are stored as they are generated in the error memories MEM_errn belonging to the various computing units PEn. The errors are distributed among the various memories as follows: the error δj(k) of rank j is stored in the error memory MEM_errn of rank n, such that j modulo (N+1) is equal to n.
The multiplexers MUX′n are then configured by the control means so as to select the errors δj(k) recorded beforehand in the error memories MEM_errn, as they were obtained during the back-propagation phase. The stored errors δj(k) are multiplied by the distributed data Xi(k) in a sequence of computing operations chosen by the designer.
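The update rule ΔWij(k)=(1/Nbatch)·ΣNbatch Xi(k)·δj(k) amounts to an outer product averaged over the batch, as in this sketch (array shapes and names are assumptions for illustration):

```python
import numpy as np

def weight_increments(X_batch, delta_batch):
    """Sketch of the weight-update computation described above.
    X_batch: activations Xi(k) saved during propagation, shape (Nbatch, I).
    delta_batch: errors δj(k) saved during back-propagation, shape (Nbatch, J).
    Returns the increment matrix ΔW(k) of shape (I, J), averaged over the batch."""
    Nbatch = X_batch.shape[0]
    return X_batch.T @ delta_batch / Nbatch
```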
The computing architecture proposed by the invention thus makes it possible to carry out all of the computing phases executed by a neural network with one and the same partially reconfigurable architecture.
In the following section, we will explain the application of the accelerator computer CALC for computing a convolutional layer. The operating principle in accordance with the two configurations CONF1 and CONF2 of the computer remains unchanged. However, the distribution of the weights among the various weight memories MEM_POIDSn should be adapted so as to carry out the computations that are performed for a convolutional layer.
A value Oi,j of the output matrix [O] (corresponding to the output value of an output neuron) is obtained by applying the filter [W] to the corresponding sub-matrix of the input matrix [I].
Generally speaking, the output matrix [O] is connected to the input matrix [I] by a convolution operation, via a convolution kernel or filter denoted [W]. Each neuron of the output matrix [O] is connected to a portion of the input matrix [I], this portion being called “input sub-matrix” or else “receptive field of the neuron” and having the same dimensions as the filter [W]. The filter [W] is shared by all of the neurons of an output matrix [O].
The values of the output neurons Oi,j put into the output matrix [O] are given by the following relationship:

Oi,j=g(ΣtΣl(I(i·si+t),(j·sj+l)·wt,l))
In the above formula, g( ) denotes the activation function of the neuron, while si and sj respectively denote the vertical and horizontal stride parameters. Such a stride corresponds to the offset between two successive applications of the convolution kernel to the input matrix. For example, if the stride is greater than or equal to the size of the kernel, then there is no overlap between successive applications of the kernel. It will be recalled that this formula is applicable if the input matrix has been processed so as to add additional rows and columns (padding). The filter matrix [W] is formed by the synaptic coefficients wt,l of ranks t=0 to Kx−1 and l=0 to Ky−1.
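A minimal sequential model of this relationship, with the identity taken as activation function and padding assumed already applied to [I] (names are illustrative):

```python
import numpy as np

def conv_output(I, W, si=1, sj=1):
    """Single-channel convolutional layer:
    O[i, j] = Σt Σl I[i*si + t, j*sj + l] · W[t, l]
    (activation g taken as identity; I is assumed already padded)."""
    Kx, Ky = W.shape
    rows = (I.shape[0] - Kx) // si + 1
    cols = (I.shape[1] - Ky) // sj + 1
    O = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            # apply the kernel to the receptive field of neuron (i, j)
            O[i, j] = np.sum(I[i*si:i*si+Kx, j*sj:j*sj+Ky] * W)
    return O
```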
More generally, each convolutional layer of neurons, denoted Ck, may receive a plurality of input matrices on a plurality of input channels of rank p=0 to P, where P is a positive integer, and/or compute multiple output matrices on a plurality of output channels of rank q=0 to Q, where Q is a positive integer. [W]p,q,k+1 denotes the filter corresponding to the convolution kernel that connects the output matrix [O]q of the layer of neurons Ck+1 to an input matrix [I]p in the layer of neurons Ck. Various filters may be associated with various input matrices for the same output matrix.
For simplicity, the activation function g( ) is not shown in
Moreover, when an output matrix is connected to multiple input matrices, the convolutional layer, in addition to each convolution operation described above, sums the output values of the neurons obtained for each input matrix. In other words, the output value of an output neuron (the output matrices also being called output channels) is in this case equal to the sum of the output values obtained for each convolution operation applied to each input matrix (the input matrices also being called input channels).
The values of the output neurons Oi,j of the output matrix [O]q are given in this case by the following relationship:

Oi,j=g(ΣpΣtΣl(Ip(i·si+t),(j·sj+l)·wp,q,t,l))
where p=0 to P is the rank of an input matrix [I]p connected, via the filter [W]p,q,k formed of the synaptic coefficients wp,q,t,l of ranks t=0 to Kx−1 and l=0 to Ky−1, to the output matrix [O]q of rank q=0 to Q of the layer Ck.
Thus, to compute the output result of an output matrix [O]q of rank q of the layer Ck, it is necessary to have the set of synaptic coefficients of the weight matrices [W]p,q connecting all of the input matrices [I]p to the output matrix [O]q of rank q.
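The multi-channel case can be sketched by summing one convolution per input channel; stride 1 and identity activation are assumed for brevity, and the names are illustrative:

```python
import numpy as np

def conv_multi_channel(inputs, filters):
    """inputs: P input matrices [I]p (same shape); filters: the P kernels
    [W]p,q connecting them to one output channel q. The output value
    O[i, j] sums, over every input channel p, the convolution
    Σt Σl I_p[i+t, j+l] · W_p[t, l] (stride 1, identity activation)."""
    Kx, Ky = filters[0].shape
    rows = inputs[0].shape[0] - Kx + 1
    cols = inputs[0].shape[1] - Ky + 1
    O = np.zeros((rows, cols))
    for Ip, Wp in zip(inputs, filters):   # one convolution per input channel
        for i in range(rows):
            for j in range(cols):
                O[i, j] += np.sum(Ip[i:i+Kx, j:j+Ky] * Wp)
    return O
```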
The computer CALC is thus able to compute a convolutional layer with the same mechanisms and configurations as described for the example of the fully connected layer if the synaptic coefficients are expediently distributed among the weight memories MEM_POIDSn.
When the subset of synaptic coefficients stored in the weight memory MEM_POIDSn of rank n corresponds to the synaptic coefficients belonging to all of the weight matrices Wp,q associated with the output matrix of rank q, such that q modulo (N+1) is equal to n, the computing unit PEn carries out all of the multiplication and addition operations for computing the output matrix Oq of rank q of the layer Ck during propagation of the data or inference. The computer is configured in this case in accordance with the first configuration CONF1 described above.
When the computer is configured in accordance with the second configuration, distributing the synaptic coefficients in accordance with the rank of the associated output channel allows the computer CALC to perform the computations of a back-propagation phase.
Reciprocally, when the subset of synaptic coefficients stored in the weight memory MEM_POIDSn of rank n corresponds to the synaptic coefficients belonging to all of the weight matrices Wp,q,k associated with the input matrix of rank p (or input channel), such that p modulo (N+1) is equal to n, the computer carries out propagation with the second configuration CONF2 and back-propagation with the first configuration CONF1.
The principle of executing the computations remains the same as that described for a fully connected layer.
The computer CALC according to the embodiments of the invention may be used in many fields of application, notably in applications in which a classification of data is used. The fields of application of the computer CALC according to the embodiments of the invention comprise, for example, video-surveillance applications with real-time recognition of people, interactive classification applications implemented in smartphones, data fusion applications in home surveillance systems, etc.
The computer CALC according to the invention may be implemented using hardware and/or software components. The software elements may be present in the form of a computer program product on a computer-readable medium, which medium may be electronic, magnetic, optical or electromagnetic. The hardware elements may be present, in full or in part, notably in the form of dedicated integrated circuits (ASICs) and/or configurable integrated circuits (FPGAs) and/or in the form of neural circuits according to the invention or in the form of a digital signal processor DSP and/or in the form of a graphics processor GPU, and/or in the form of a microcontroller and/or in the form of a general-purpose processor, for example. The computer CALC also comprises one or more memories, which may be registers, shift registers, a RAM memory, a ROM memory or any other type of memory suitable for implementing the invention.