BAYESIAN NEURAL NETWORK WITH RESISTIVE MEMORY HARDWARE ACCELERATOR AND METHOD FOR PROGRAMMING THE SAME

Information

  • Patent Application
  • Publication Number
    20210350218
  • Date Filed
    April 07, 2021
  • Date Published
    November 11, 2021
Abstract
A Bayesian neural network including an input layer, an output layer and, possibly, one or more hidden layer(s). Each neuron of a layer is connected at its input with a plurality of synapses, the synapses of the plurality being implemented as a RRAM array of cells, each column of the array being associated with a synapse and each row of the array being associated with an instance of the set of synaptic coefficients. The cells of a row of the RRAM are programmed during a SET operation with respective programming current intensities, the programming intensity of a cell being derived from the median value of a Gaussian component obtained by GMM decomposition into Gaussian components of the marginal posterior probability of the corresponding synaptic coefficient, once the BNN model has been trained on a training dataset.
Description
FIELD OF THE INVENTION

The present invention concerns the field of machine learning (ML) and more specifically Bayesian neural networks (BNNs). It also relates to the field of resistive Random Access Memories (RRAMs).


Background of the Invention

With the development and applications of deep learning algorithms, artificial neural networks (ANNs), hereinafter simply referred to as neural networks, have achieved tremendous success in various fields such as image classification, object recognition, natural language processing, autonomous driving, cancer detection, etc.


However, conventional neural networks are prone to overfitting, that is, to situations where the network model fails to generalize well from the training data to new data. The underlying reason for this shortcoming is that the ANN model parameters, i.e. the synaptic coefficients (or weights), do not capture the uncertainty of modelling, namely the uncertainty about which values of these parameters will be good at predicting or classifying new data. This is particularly true when using scarce or noisy training data. Conversely, reliable ANN model parameters require a huge amount of training data.


To capture modelling uncertainty a probabilistic programming approach can be adopted. That is, in contrast to traditional ANNs whose weights are fixed values, each weight is a random variable following a posterior probability distribution, conditioned on a prior probability distribution and the observed data, according to Bayes rule. Such neural networks are called Bayesian neural networks (BNNs). BNNs can learn from small datasets, achieve higher accuracy and are robust to overfitting.


From a practical point of view, a BNN can be regarded as an ensemble of neural networks (NNs) whose respective weights have been sampled from the posterior probability distribution p(w|D), where w is the flattened vector conventionally representing all weights of the network (i.e. all weights of the network are arranged consecutively in vector w) and D is the training dataset. The training dataset comprises data x(i) and their corresponding labels z(i), i=1, . . . , S. For classification, z(i) is a set of classes and, for regression, a value of a continuous variable.


After the BNN has been trained, it can be used (for prediction or classification) on new data. More specifically, if we denote x0 the vector input representing new data, the output of the BNN is obtained by averaging the output of NNs sampled according to the posterior probability distribution:






z = Ep(w|D)[ƒ(x0, w)]  (1)


where ƒ denotes the network function and Ep(w|D) denotes the expectation according to the posterior probability distribution of the weights, p(w|D).
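By way of illustration only, the averaging of expression (1) can be sketched as follows; the network function f, the weight samples and the input x0 below are hypothetical placeholders, not part of the claimed implementation.

```python
import numpy as np

def bnn_predict(f, weight_samples, x0):
    """Monte Carlo estimate of z = E_p(w|D)[f(x0, w)] (expression (1)):
    evaluate the network function for each weight sample drawn from the
    posterior p(w|D) and average the outputs."""
    outputs = np.array([f(x0, w) for w in weight_samples])
    return outputs.mean(axis=0)

# Toy usage with a one-layer network function (illustrative only).
def f(x0, w):
    return np.tanh(np.dot(w, x0))

rng = np.random.default_rng(0)
weight_samples = rng.normal(0.0, 0.5, size=(1000, 3))  # stand-in for posterior samples
x0 = np.array([0.2, -0.1, 0.7])
print(bnn_predict(f, weight_samples, x0))
```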


The posterior probability distribution can be calculated according to Bayes rule as:

p(w|D) = p(D|w) p(w) / p(D)  (2)

where p(D) = ∫ p(D|w) p(w) dw.


The calculation of the latter integral is generally intractable and therefore different techniques have been proposed in the prior art in order to approximate the posterior probability distribution, p(w|D), like Markov Chain Monte Carlo (MCMC) sampling or variational inference.


The training of a BNN requires substantial processing and storage resources which are unavailable in most implementation platforms such as IoT devices or even edge computing nodes. A known solution is to shift the computation burden to the Cloud and train the BNN ex situ. The resulting parameters of the weight probability distribution are then transferred to the inference engine located in the implementation platform.


The inference engine nevertheless needs to sample the weights of the BNN according to the posterior probability distribution p(w|D) and perform the calculation of network function ƒ for each sample, before output averaging. However, calculation of the network function entails intensive dot product computation.


Various hardware accelerators have been proposed for generating samples according to a Gaussian distribution law and for performing dot product calculation.


For example, the article of R. Cai et al. entitled “VIBNN: hardware acceleration of Bayesian neural networks” published in ACM SIGPLAN Notices, Vol. 53, Issue 2, March 2018 discloses an FPGA based accelerator for variational inference on BNNs.


However, the proposed accelerator only allows for the generation of a Gaussian posterior probability distribution for the synaptic coefficients (or weights). In other terms, BNNs whose weights follow an arbitrary probability distribution after training cannot benefit from such a hardware accelerator. Furthermore, the proposed VIBNN architecture requires volatile memories and high-rate data transfers between these memories and the random number generator units.


On the other hand, non-volatile resistive memories (RRAMs) have been used for dot product computation in the training and the predicting phases. In such a memory, the dot product between an input voltage vector and an array of conductance values stored in the RRAM is obtained as a sum of currents according to Kirchhoff's circuit law. An example of use of RRAMs for dot product computation in a neural network can be found in the article of Y. Liao et al. entitled “Weighted synapses without carry operations for RRAM-based neuromorphic systems” published in Front. Neuroscience, 16 Mar. 2018.
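As a simple numerical illustration of this principle (not taken from the cited article), the current collected on a data line equals the dot product between the input voltages and the stored conductances; the numerical values below are arbitrary assumptions.

```python
import numpy as np

# Conductances stored along one data line of the RRAM (Siemens) and the
# input voltages applied to the corresponding cells (Volts).
g = np.array([50e-6, 120e-6, 80e-6])   # programmed conductances
V = np.array([0.1, 0.2, 0.05])         # read voltages

# Kirchhoff's current law: the line current is the sum of the per-cell
# currents g_i * V_i, i.e. the dot product g.V.
I_line = np.sum(g * V)
assert np.isclose(I_line, np.dot(g, V))
print(f"line current = {I_line:.3e} A")
```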


However, RRAMs generally suffer from device-to-device dispersion and cycle-to-cycle conductance dispersion, which must be mitigated in order to ensure reliable training and prediction. These mitigation techniques imply area, time and energy consumption overheads.


An object of the present invention is therefore to propose a Bayesian neural network equipped with an efficient hardware accelerator capable of modeling arbitrary posterior probability distributions of the synaptic coefficients. A further object of the present invention is to propose a method for programming such a Bayesian neural network.


BRIEF DESCRIPTION OF THE INVENTION

The present invention is defined by the appended independent claims. Various preferred embodiments are defined in the dependent claims.





BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be better understood from the description of the following embodiments, by way of illustration and in no way limitative thereto:



FIG. 1 schematically illustrates an example of probability distribution of a synaptic coefficient in a Bayesian neural network;



FIG. 2 shows a decomposition of the probability distribution of FIG. 1 into a mixture of Gaussian probability distributions;



FIG. 3 shows an example of probability distribution of the conductance of a resistor in a programmed RRAM cell;



FIG. 4 illustrates the evolution of the median and the standard deviation of the conductance vs. the set programming current in a RRAM cell;



FIGS. 5A and 5B schematically illustrate a first variant of a neuron fed with synapses, which can be used in a BNN according the present invention;



FIGS. 6A and 6B schematically illustrate a second variant of a neuron fed with synapses, which can be used in a BNN according the present invention;



FIGS. 7A and 7B schematically illustrate an example of a BNN according to an embodiment of the present invention;



FIG. 8 schematically represents the flow chart of a method for programming a BNN according to a first embodiment of the present invention;



FIG. 9 schematically represents the flow chart of a method for programming a BNN according to a second embodiment of the present invention;



FIG. 10 illustrates the transfer of a programming matrix for programming the RRAM memory used in FIG. 9;



FIG. 11 schematically represents the flow chart of a method for programming a BNN according to a third embodiment of the present invention;



FIG. 12 schematically represents the flow chart of a method for programming a BNN according to a fourth embodiment of the present invention.





DETAILED DISCLOSURE OF PARTICULAR EMBODIMENTS

We are considering in the following a Bayesian neural network (BNN) as introduced in the description of the prior art.


After it has been trained, a BNN is associated with a joint posterior probability distribution p(w|D), where D is the training dataset. In the following, we will use the vector notation w when referring to the whole set of synapses and to the scalar notation w when referring to a single synapse. The training dataset is denoted

D = {x(i), z(i)}

where x(i) is the input data on which the neural network is to be trained and z(i) the corresponding labels (z(i) if the output is scalar).


The training is preferably performed ex situ on a model of the neural network, that is outside the hardware BNN itself, for example in the Cloud on a distant server.


Given the training dataset D={x(i),z(i)}, the likelihood function p(D|w) can be calculated by:

p(D|w) = ∏_i p(z(i)|x(i), w)  (3)

The prior distribution of the synaptic coefficients p(w) can be chosen (multivariate) Gaussian and the numerator of expression (2) can be derived therefrom as h(w; D, Σ) = p(D|w)p(w), where Σ is a covariance matrix. Preferably, the covariance matrix will be chosen as Σ = σ2I where σ2 is a predetermined common variance of the prior distributions of the synaptic coefficients.


The posterior distribution p(w|D) being proportional to h(w; D, Σ), it can be obtained by a Markov Chain Monte Carlo (MCMC) method, such as, for example, the well-known Metropolis-Hastings algorithm, or, preferably, the No-U-Turn MCMC Sampling (NUTS) algorithm as described in the paper of M. D. Hoffman et al. entitled “The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo” published in Journal of Machine Learning Research 15 (2014), 1593-1623.
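A minimal random-walk Metropolis-Hastings sketch of such MCMC sampling is given below for illustration; it is a simple stand-in for the NUTS sampler mentioned above, and the toy target density is an assumption, not the posterior of an actual BNN.

```python
import numpy as np

def metropolis_hastings(log_h, w0, n_samples=5000, step=0.1, seed=0):
    """Random-walk Metropolis sampler for a target known up to a constant,
    here log h(w; D, Sigma) = log p(D|w) + log p(w)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    log_p = log_h(w)
    samples = np.empty((n_samples, w.size))
    for s in range(n_samples):
        w_prop = w + step * rng.normal(size=w.size)   # symmetric proposal
        log_p_prop = log_h(w_prop)
        # accept with probability min(1, h(w_prop)/h(w))
        if np.log(rng.uniform()) < log_p_prop - log_p:
            w, log_p = w_prop, log_p_prop
        samples[s] = w
    return samples

# Toy target: standard Gaussian posterior over 2 synaptic coefficients.
samples = metropolis_hastings(lambda w: -0.5 * np.sum(w**2), w0=np.zeros(2))
print(samples.mean(axis=0), samples.std(axis=0))
```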


Alternatively, the posterior distribution can be obtained by a variational inference as described further below.


A first idea at the basis of the invention is to represent the posterior marginal probability distribution function (PDF) of each synaptic coefficient p(w|D) as a Gaussian mixture model (GMM). In other words, the marginal posterior probability distribution of each synaptic coefficient can be approximated by a linear combination of a plurality K of Gaussian distributions, also referred to as Gaussian components:

p(w|D) ≈ Σ_{k=1}^{K} λk g(w; μk, σk2)  (4)

where g(·; μk, σk2) denotes the kth Gaussian component, of mean μk and standard deviation σk, and where λk is a weighting factor. The weighting factors satisfy the constraint

Σ_{k=1}^{K} λk = 1.

As the synaptic coefficients are not independent from each other, the mean and standard deviation values μk, σk; k=1, . . . , K alone fail to capture the posterior probability distribution of the whole set of synaptic coefficients p(w|D). However, the covariance matrix of the synaptic coefficients will be taken into account by joint sampling as explained further below.


Several techniques are available for estimating the set of parameters Ω={λk, μk, σk|k=1, . . . , K} of the GMM. Preferably, a maximum likelihood (ML) estimation will be used.


More specifically, we are looking for the model parameters which maximize the likelihood of the GMM given a sampling of the posterior probability distribution p(w|D), previously obtained by MCMC, denoted w1, w2, . . . , wJ where J is the number of samples of said distribution.


The ML estimates of the parameters can be obtained iteratively by using the expectation-maximization (EM) algorithm.


First, the conventional EM algorithm will be set out below and next an adapted version thereof for the purpose of the present invention will be presented.


Basically, the EM algorithm starts with an initial guess of the parameters ω(0)={μk(0), σk(0)|k=1, . . . , K} and estimates the membership probabilities of each sample, λk,j(0), that is the probability that sample wj originates from the kth Gaussian distribution. Then the expectation of the log likelihood

log L(Ω(0); w) = log( Σ_{k=1}^{K} λk(0) g(w; μk(0), σk(0)2) )

is calculated over the samples w1, w2, . . . , wJ on the basis of their respective membership probabilities and the set of parameters Ω(1) maximizing this expectation is derived. The parameters ω(1)={μk(1), σk(1)|k=1, . . . , K} belonging to Ω(1) are then used as new estimates for a next step. The expectation calculation and maximization steps alternate until the expectation of log L(Ω; w) saturates, i.e. does not increase by more than a predetermined threshold in comparison with the preceding step. Alternatively, the stopping criterion can be based on the Kullback-Leibler divergence between the combination of Gaussian components and the target posterior probability distribution.


The number K of Gaussian components can be determined heuristically. Alternatively, a predetermined range of values of K can be swept and the value achieving the highest expectation of log likelihood at the EM iterative process can be selected.


In fact, as earlier mentioned, the EM algorithm is not applied as such but a constrained version thereof has been developed for the present invention.


The constraint is a relationship between the mean value and the standard deviation that cannot be chosen independently in the hardware implementation of the BNN. More specifically, the standard deviation is assumed to follow a linear law, h(·), with respect to the mean:





σ=h(μ)=σ0−λ·μ  (5)


where λ is a positive coefficient given by the hardware implementation. Hence, the Gaussian components in (4) depend on only one parameter: g(·; μk, h2(μk)). It follows that the set of parameters is restrained to the subset {tilde over (Ω)}={λk, μk, h(μk)|k=1, . . . , K} and that the log-likelihood maximization in the EM algorithm is performed on this subset instead of the entire set Ω. In the following, the subset of parameters will merely be defined as {tilde over (Ω)}={λk, μk|k=1, . . . , K} since the standard deviation values of the components can be derived from their respective mean values.
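A minimal sketch of such a constrained EM fit is given below. The initialization, the clipping of the standard deviation and the re-imposition of the constraint (5) after the usual mean update are simplifying assumptions, not the exact constrained maximization described above; the synthetic samples are illustrative only.

```python
import numpy as np

def gaussian_pdf(w, mu, sigma):
    return np.exp(-0.5 * ((w - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def constrained_gmm_em(w, K, sigma0, lam, n_iter=200, seed=0):
    """EM fit of a K-component GMM to samples w of one synaptic coefficient,
    with the hardware constraint sigma_k = h(mu_k) = sigma0 - lam * mu_k (5)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(w, size=K, replace=False)           # initial means
    weights = np.full(K, 1.0 / K)                       # weighting factors lambda_k
    for _ in range(n_iter):
        sigma = np.clip(sigma0 - lam * mu, 1e-9, None)  # constraint (5)
        # E-step: membership probabilities of each sample for each component
        resp = weights * gaussian_pdf(w[:, None], mu, sigma)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weighting factors and means; std devs follow from (5)
        weights = resp.mean(axis=0)
        mu = (resp * w[:, None]).sum(axis=0) / resp.sum(axis=0)
    return weights, mu, np.clip(sigma0 - lam * mu, 1e-9, None)

# Toy usage on synthetic posterior samples of a single coefficient.
rng = np.random.default_rng(1)
w = np.concatenate([rng.normal(20e-6, 4e-6, 700), rng.normal(60e-6, 2e-6, 300)])
print(constrained_gmm_em(w, K=2, sigma0=5e-6, lam=0.05))
```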



FIG. 1 schematically illustrates an example of probability distribution of a synaptic coefficient in a Bayesian network and FIG. 2 shows its decomposition into Gaussian components.


The probability distribution of a synaptic coefficient, w, in FIG. 1 has been obtained by sampling the posterior p(w|D).


It can be seen from FIG. 2 that this probability distribution can be considered as a combination (or mixture) of five Gaussian components, each component being characterized by a mean and associated standard deviation, and by a weighting factor. It should be noted that the standard deviations of the components (linearly) decrease with their respective mean values. The weighting factor of a component reflects how much this component contributes to the target distribution, here the probability distribution of the synaptic coefficient in question.


After GMM decomposition of the posterior probability distribution of the synaptic coefficients into Gaussian components, the Bayesian neural network is defined by a set of 2K×Q parameters,

∪_{q=1}^{Q} {tilde over (Ω)}q

where {tilde over (Ω)}q={λq,k, μq,k|k=1, . . . , K}, Q stands for the number of synapses in the BNN and q is an index of the synapses.


Once the parameters have been calculated ex situ on a model of the BNN, e.g. by a server in the Cloud, they can be transferred to the BNN itself, that is to its hardware implementation. The BNN is then programmed with these parameters before it can be used for inference, as explained further below.


A second idea at the basis of the present invention is to use a resistive switching RAM also simply called resistive memory (RRAM) as hardware accelerator for implementing the dot product performed by the synapses of the BNN.


More specifically, the invention makes use of the cycle-to-cycle (C2C) variability of the low-resistance state obtained after a SET operation in order to generate the synaptic coefficients according to their respective probability distributions parametrized by {tilde over (Ω)}q, q=1, . . . , Q.


We recall that a resistive switching RAM consists of non-volatile random access memory cells, each cell comprising a resistor made of a dielectric which can be programmed either in a low resistance state (LRS), also called high conductance state (HCS), with a so-called SET operation, or in a high resistance state (HRS), also called low conductance state (LCS), with a so-called RESET operation.


During a SET operation, a strong electric field is applied to the cell, while limiting the current to a programming current value. This operation forms a conductive filament through the dielectric and confers to the resistor a high conductance value, which depends upon the programming current value.


Conversely, during a RESET operation, a programming voltage is applied to the cell with the same or opposite polarity to the one used for electroforming. This voltage breaks the filament and the resistor therefore returns to a low conductance value, which depends upon the programming voltage value.


A detailed description of RRAM can be found in the article by R. Carboni et al., entitled “Stochastic memory devices for security and computing”, published in Adv. Electron. Mat., 2019, 5, 1900198, pages 1-27.


Once a cell has been programmed by a SET operation, the obtained high conductance value is stable in time until the next operation is performed. However, the high conductance value varies from one SET operation to the next, even if the programming current is kept constant.


More specifically, for a given programming current, the high conductance in the HCS state can be considered as a random variable which exhibits a normal distribution over the SET operations. The mean value and the standard deviation of this normal distribution are determined by the programming current used during the SET operations.



FIG. 3 shows the probability density distribution of the conductance of a RRAM cell programmed with a SET programming current.


Once the RRAM cell has been programmed by a SET operation, its conductance is a sample from a normal distribution whose median value (or, equivalently, mean value since the distribution is normal) depends upon the programming current as illustrated in FIG. 4.


In FIG. 4, the x-axis indicates (in logarithmic scale) the programming current (in A) during the SET operation. The y-axis on the left indicates (in logarithmic scale) the median value (in Siemens) of the conductance of the resistor once the RRAM has been programmed, whereas the y-axis on the right indicates (in logarithmic scale) the cycle-to-cycle standard deviation (as a percentage of the median) of the conductance of the RRAM thus programmed.


The variation in log-log scale being linear, it means that the median value of the conductance, gmedian, in high conductance state, follows a power law of the type gmedian = α(ISET)^γ where α and γ are positive coefficients and ISET is the intensity of the current injected in the resistor during the SET operation. Hence, in order to program a distribution having a target median value, the programming current during the SET operation needs to be:

ISET = (gmedian/α)^(1/γ)  (6)

The evolution of the standard deviation (in Siemens) vs. the median value of the conductance can be approximated by a linear function of the type σ = σ0 − λ·gmedian where σ0 is a constant and λ is a positive coefficient.
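For illustration, the power law (6) and the linear law of the standard deviation can be put together as sketched below; the numerical device parameters are assumptions chosen for the example and are not measured characteristics of a particular RRAM technology.

```python
import numpy as np

# Assumed device parameters: g_median = ALPHA * I_SET**GAMMA and
# sigma = SIGMA0 - LAM * g_median (illustrative values only).
ALPHA, GAMMA = 0.08, 0.8
SIGMA0, LAM = 6.0e-6, 0.04

def set_current_for_median(g_median):
    """Programming current of expression (6): I_SET = (g_median / alpha)**(1/gamma)."""
    return (g_median / ALPHA) ** (1.0 / GAMMA)

def cycle_to_cycle_sigma(g_median):
    """Standard deviation of the programmed conductance vs. its median value."""
    return SIGMA0 - LAM * g_median

g_target = 50e-6                                   # target median conductance (S)
i_set = set_current_for_median(g_target)
print(f"I_SET = {i_set:.3e} A, sigma = {cycle_to_cycle_sigma(g_target):.3e} S")

# One SET operation: the obtained conductance is a sample of a normal law.
g_programmed = np.random.default_rng(2).normal(g_target, cycle_to_cycle_sigma(g_target))
print(f"programmed conductance = {g_programmed:.3e} S")
```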



FIG. 5A symbolically represents a first variant of a neuron fed with three synapses and FIG. 5B their corresponding implementation in a RRAM memory.



FIG. 5A shows three synapses 511, 512, 513 of respective coefficients g1,g2,g3 whose outputs are summed in adder 540 for feeding neuron 550.


In FIG. 5B, the three synapses are embodied by RRAM cells connected to a common data line, 540, feeding neuron 550.


Each cell comprises a programmable resistor (511, 512, 513) in series with a FET transistor (521, 522, 523), the gates of the FET transistors being connected to a control line 530 driven by the voltage Vgate and the drains of the FET transistors being connected to the data line. If we denote g1, g2, g3 the conductance values of the respective resistors (the values stored in the cells) and V1, V2, V3 the voltages applied at the resistors, the current flowing through the data line is simply the dot product gTV where VT=(V1 V2 V3) is the input vector and gT=(g1 g2 g3) is the vector of the synaptic coefficients related to neuron 550. Neuron 550 applies an activation function, Fa, for example a logistic function, to the dot product gTV, to output the value:






z = Fa(gTV)  (7)


In practice, the synaptic coefficients can be positive or negative and therefore a differential configuration as illustrated in FIGS. 6A and 6B may be preferred.



FIG. 6A symbolically represents a second variant of a neuron fed with three synapses and FIG. 6B their corresponding implementation in a RRAM memory.


In this variant, each synapse is split into two branches according to a differential configuration. The first synapse is implemented by branches 611-1 and 611-2, branch 611-1 bearing a first branch coefficient, g1+, and branch 611-2 bearing a second branch coefficient, g1−. The second and third synapses have the same differential configuration. The outputs of the first branches are summed together in adder 640-1 and the outputs of the second branches are summed together in adder 640-2. The output of adder 640-1 is connected to a non-inverted input of neuron 650, whereas the output of adder 640-2 is connected to an inverted input thereof.



FIG. 6B shows the corresponding implementation in a RRAM. This RRAM differs from the one used in FIG. 5B by the fact that the synaptic coefficients are stored as pairs. More precisely, the vector of synaptic coefficients g, hereinafter referred to as synaptic vector, is represented by a pair (g+, g−) with g = g+ − g−, where a first branch part, g+, and a second branch part, g−, of synaptic vector g are generated and stored separately in the RRAM. Expressed otherwise, a synapse is implemented by a cell comprising a first resistor storing the first branch coefficient, g+, of the synapse and a second resistor storing the second branch coefficient, g−, of the synapse.


As shown in FIG. 6B, each of the three cells embodying a synapse comprises a first resistor in series with a first FET transistor, and a second resistor in series with a second FET transistor. For example, the cell on the left comprises a first resistor 611-1 and a second resistor 611-2, a first FET transistor 621-1 in series with the first resistor, and a second FET transistor 621-2 in series with the second resistor.


The sources of the first FET transistors are connected to a first data line 640-1 and the drains of the second FET transistors are connected to a second data line 640-2. The first and second lines are respectively connected to a non-inverted input and an inverted input of the neuron 650. The neuron subtracts its inverted input from its non-inverted input and applies the activation function to the result thus obtained.


The gates of the first transistors and of the second transistors are all connected to the same control line 630.


Voltage V1 is applied to both first resistor, 611-1, and second resistor, 611-2. Voltage V2 is applied to both first resistor, 612-1, and second resistor, 612-2. Voltage V3 is applied to both first resistor, 613-1, and second resistor, 613-2.


The conductance values of the first resistors 611-1, 612-1, 613-1 are respectively equal to the first branch coefficients g1+, g2+, g3+ whereas the conductance values of the second resistors 611-2, 612-2, 613-2 are respectively equal to the second branch coefficients g1−, g2−, g3−.


The current flowing through first data line 640-1 is equal to the dot product (g+)TV and the current flowing through second data line 640-2 is equal to the dot product (g−)TV. Hence, the output of neuron 650 can be expressed as:






z = Fa((g+ − g−)TV)  (8)


It may be necessary to add a bias to the dot product before applying the activation function, in which case expressions (7) and (8) respectively become:






z = Fa(gTV + gbVb)  (9)






z = Fa((g+ − g−)TV + (gb+ − gb−)Vb)  (10)


where gb, gb+, gb− are positive coefficients and Vb is a predetermined bias voltage. However, the man skilled in the art will understand that expressions (9) and (10) can be regarded as particular cases of (7) and (8) by merely adding an additional synapse supplied with a bias voltage. Hence, without loss of generality, we may omit the constant and restrain our presentation to the purely linear configuration.
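Purely as a software illustration of expressions (8) and (10), and not of the circuit itself, the differential neuron output can be sketched as follows; the logistic activation and all numerical values are assumptions.

```python
import numpy as np

def neuron_output(g_plus, g_minus, V, g_b_plus=0.0, g_b_minus=0.0, V_b=0.0):
    """Differential neuron of expressions (8)/(10): the non-inverted and
    inverted line currents are (g+)^T V and (g-)^T V; the neuron applies the
    activation F_a (here a logistic function) to their difference, plus an
    optional bias term (g_b+ - g_b-) * V_b."""
    dot = np.dot(g_plus - g_minus, V) + (g_b_plus - g_b_minus) * V_b
    return 1.0 / (1.0 + np.exp(-dot))  # logistic activation F_a

# Toy usage with three synapses (conductances in Siemens, voltages in Volts).
g_plus  = np.array([60e-6, 20e-6, 45e-6])
g_minus = np.array([30e-6, 55e-6, 10e-6])
V = np.array([0.3, 0.1, 0.2])
print(neuron_output(g_plus, g_minus, V, g_b_plus=40e-6, g_b_minus=20e-6, V_b=0.2))
```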



FIGS. 7A and 7B schematically illustrate the structure of an example of a BNN according to an embodiment of the present invention.


The BNN represented in FIG. 7A comprises three layers of neurons, each neuron of a layer being symbolically represented by a circle. Although the example shown comprises only one hidden layer, it will be understood by the man skilled in the art that several hidden layers can be provided between the input layer and the output layer. In other instances, e.g. in case the neural network is a perceptron, the hidden layer may be missing.


More specifically, in the present instance, two input neurons of the first layer, 721 feed forward into 8 neurons of a hidden layer, 722, which in turn feed forward into two neurons of the output layer, 723. Additional neurons labelled “0” in the first and the hidden layer provide constant values (bias voltage multiplied by a synaptic coefficient) as explained above in relation with expressions (9) and (10).



FIG. 7B schematically illustrates the details of the BNN of FIG. 7A. More specifically, the boxes 701, 702, 708 respectively correspond to neurons 1, 2, 8 of the hidden layer and box 711 corresponds to neuron 1 of the output layer. Each box shows a circuit implementing a neuron (represented by a circle) and a circuit implementing the synapses connected thereto (represented by an array of RRAM cells). In the present example, a differential configuration has been adopted, namely the second variant described in relation with FIGS. 6A-6B.


The structure of the RRAM array is shown in detail for output neuron 711, it being understood that the same structure would be used for the other neurons.


In general, the RRAM array comprises a plurality N of rows, each row corresponding to an instance of the synapses connected to the neuron and having the structure already described in relation with the first variant of FIG. 5B or the second variant of FIG. 6B. Each row comprises P cells, each cell comprising a programmable resistor in the first variant (monopolar configuration) or a pair of programmable resistors in the second variant (differential configuration). In the latter configuration, the first programmable resistor of a cell represents the first branch and the second programmable resistor represents the second branch of the synapse.


The RRAM array comprises Pi columns, each column representing a synapse, Pi being the number of neurons of the preceding layer the neuron in question is connected to.


The different rows correspond to different samples of the joint posterior probability distribution of the synaptic coefficients. Assuming that the marginal posterior probability distribution of a synaptic coefficient q is decomposed by GMM into K Gaussian components of parameters {tilde over (Ω)}q={λq,k, μq,k|k=1, . . . , K}, the number of cells in the column (corresponding to this synapse), programmed with the value μq,k, will be proportional to λq,k·N. By programming cells with the value μq,k, we mean, in the first variant, that the resistors of the cells are programmed so that their respective conductance values gq,k follow a normal distribution of mean μq,k.


In case of the second variant, we have two sets of parameters per synapse {tilde over (Ω)}q+={λq,k+, μq,k+|k=1, . . . , K} and {tilde over (Ω)}q={λq,k, μq,k|k=1, . . . , K}. By programming the cells, we mean, in the second variant, that the first resistors of the first branches of the cells are programmed so that the respective conductance values of the first branches, gk,q+, follow a normal distribution of mean μk,q+ and that the respective conductance values of the second branches of these cells, gk,q, follow a normal distribution of mean μk,q.


If N is large enough, the set of conductance values of the resistors within the RRAM provides a fair representation of the posterior probability distribution p (w|D).


During inference, after the cells of the RRAM have been programmed, the N control lines of the array are selected in turn, e.g. by applying a sequence of pulses occurring at times t0, . . . , tN-1 as shown. In other words, each neuron is switched sequentially between the N rows connected thereto.


When control line n is selected, while the voltage vector V is applied to the columns, the neuron outputs the value:

zn = Fa(gnTV)  (11)

in the first variant, and

zn = Fa((gn+ − gn−)TV)  (12)

in the second variant.


If the neuron belongs to the input layer or a hidden layer, the voltage output by the neuron, which is given by (11) or (12), propagates to the columns of the RRAM arrays of the next layer, according to the connection topology of the network.


If the neuron belongs to the output layer, the output of the neuron is stored in a memory. Hence, by sequentially switching the control lines of the RRAM arrays in the BNN and propagating the neuron outputs from a layer to the next throughout the BNN, we obtain a probability distribution, denoted p(z), if the output is scalar (e.g. in case of regression) or p(z) if the output is a vector (e.g. in case of classification). When a vector of data is applied to the input layer, a probability distribution of the inference from said vector of data is output by the output layer. This probability distribution can be used to make a prediction, for example by calculating the expectation of z on the probability distribution.
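The row-by-row inference described above can be sketched in software as follows; this is an illustrative model of the behaviour, not the claimed circuit, and the tanh activation, the layer shapes and the random conductances are assumptions.

```python
import numpy as np

def bnn_inference(layers, V_in):
    """'layers' is a list of per-layer conductance tensors of shape
    (N, P_out, P_in), the n-th row of every array holding one instance of the
    synaptic coefficients. For each of the N switching instants the input
    propagates through all layers, yielding one sample of the output z; the
    N samples approximate the probability distribution p(z)."""
    N = layers[0].shape[0]
    z_samples = []
    for n in range(N):                      # control line n selected in turn
        v = V_in
        for G in layers:                    # propagate from layer to layer
            v = np.tanh(G[n] @ v)           # activation F_a (tanh here)
        z_samples.append(v)
    return np.array(z_samples)

# Toy network: 3 inputs -> 4 hidden -> 1 output, N = 64 rows per array.
rng = np.random.default_rng(3)
layers = [rng.normal(0, 50e-6, (64, 4, 3)), rng.normal(0, 50e-6, (64, 1, 4))]
z = bnn_inference(layers, np.array([0.2, 0.5, -0.1]))
print("prediction:", z.mean(axis=0), "uncertainty:", z.std(axis=0))
```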



FIG. 8 schematically represents the flow chart of a method for programming a BNN according to a first embodiment of the present invention.


In a preliminary step, 810, the BNN model is trained on a training data set, D, and the posterior probability distribution of the synaptic coefficients p(w|D) is obtained by a MCMC method, for example a No-U-Turn MCMC Sampling method as set out above. The prior probability distribution, p(w), is preferably chosen multivariate Gaussian of covariance matrix σinit2IQ. In other words, the Q components are independent Gaussian random variables of identical mean value μinit and identical standard deviation σinit. Preferably, given the conductance range of the RRAM cells after a SET operation, the mean value μinit is chosen equal to the middle value of the conductance range and the standard deviation is a fraction of the range width, for example half of this range.


The result of step 810 is a matrix WS of size S×Q, where S≥N is the number of samples of the probability distribution p(w|D), where Q is the number of synapses in the network. Each row of the matrix corresponds to an instance of the Q synaptic coefficients, or more precisely to a sample of the multivariate posterior probability distribution p(w|D).


At step 820, the marginal posterior probability distribution of each of the synaptic coefficients, p(Wq|D), q=1, . . . , Q is decomposed by GMM into K Gaussian components by the EM (expectation-maximization) algorithm, the mean values and the standard deviations of the Gaussian components being constrained by the linear relationship (5) given by the RRAM cell characteristics. In practice, the marginal posterior probability distribution is provided by the samples ws,q, s=1, . . . , S, of the qth column of WS.


This step provides a set of parameters {tilde over (Ω)}q={λq,k, μq,k|k=1, . . . , K} for each synaptic coefficient wq, q=1, . . . , Q of the BNN, where μq,k, λq,k are respectively the median value and the weighting factor of the kth Gaussian component for synaptic coefficient wq.


At step 830, the elements of the matrix WS are quantized by associating each element with the closest mean (or median) value among the K mean values of the respective K Gaussian components of the distribution it stems from. More specifically, each element ws,q is associated with the quantized value:

{tilde over (w)}s,q = argminμq,k(|ws,q − μq,k|)  (13)

In other words, the element ws,q is associated with the Gaussian component it most probably originates from. One can therefore form a programming matrix, {tilde over (W)}S, the elements of which belong to the set {μq,k|q=1, . . . , Q, k=1, . . . , K} or, more simply, an index programming matrix {tilde over (Γ)}S of size S×Q whose elements are the indices

γs,q = argmink(|ws,q − μq,k|)  (14)

Optionally, if S>N the rows of programming matrix {tilde over (Γ)}S (or {tilde over (W)}S) are subsampled so as to downsize the matrix to N×Q.


According to a variant, the subsampling step and the quantization step may be swapped.
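A software sketch of the quantization of expressions (13)-(14) and of the optional subsampling is given below for illustration; the uniform row subsampling and all numerical values are assumptions.

```python
import numpy as np

def quantize_and_subsample(W, mu, N, seed=0):
    """W is the S x Q matrix of posterior samples and mu is a Q x K array of
    component means (mu[q, k] = mu_{q,k}). Each element w_{s,q} is replaced by
    the closest mean (expression (13)); the index matrix of expression (14) is
    returned as well. If S > N, the rows are subsampled down to N."""
    S, Q = W.shape
    # distance of every sample to every component mean of its own column
    dist = np.abs(W[:, :, None] - mu[None, :, :])     # shape (S, Q, K)
    Gamma = dist.argmin(axis=2)                       # index matrix (14)
    W_tilde = mu[np.arange(Q), Gamma]                 # programming matrix (13)
    if S > N:
        rows = np.random.default_rng(seed).choice(S, size=N, replace=False)
        W_tilde, Gamma = W_tilde[rows], Gamma[rows]
    return W_tilde, Gamma

# Toy usage: S=500 samples, Q=2 synapses, K=3 components, N=128 rows.
rng = np.random.default_rng(4)
W = rng.normal(40e-6, 10e-6, (500, 2))
mu = np.array([[20e-6, 40e-6, 60e-6], [25e-6, 45e-6, 65e-6]])
W_tilde, Gamma = quantize_and_subsample(W, mu, N=128)
print(W_tilde.shape, Gamma.shape)
```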


At step 840, the programming matrix {tilde over (W)}S is transferred to the BNN network. Alternatively, the index programming matrix {tilde over (Γ)}S may be transferred along with the set of mean values {μq,k|q=1, . . . , Q, k=1, . . . , K} so as to reconstruct {tilde over (W)}S at the BNN side.


Steps 810 to 830 are typically performed by a distant server in the Cloud, the programming matrix being then transferred at step 840 to the terminal (e.g. an IoT device) hosting the RRAM. However, in some instances, e.g. when the prediction/classification is provided as a service in the Cloud, the RRAM can be hosted by the server.


At step 850, for each row n of the RRAM, the cells embodying the synapses of the network are respectively programmed with the corresponding elements of matrix {tilde over (W)}S, by applying the programming currents

ISETn,q = ({tilde over (w)}n,q/α)^(1/γ)

during a SET operation.


The programming is carried out in turn for each row of the RRAM. It is important to note that the programmed conductance values of a row of cells correspond to the mean values in the corresponding row of matrix {tilde over (W)}S, thereby capturing the correlation between the synaptic random variables.


Once the BNN has been programmed, it is ready to be used for inference by inputting a new vector of data to the network. The layers are processed sequentially, the voltages output by the neurons of a layer being simultaneously propagated to the next layer, or output if the layer is the output layer. For each neuron of a given layer, the N control lines controlling the different rows are switched in turn at switching instants, the voltage output by the neuron being propagated to the next layer at each switching instant.


After propagation of the voltages throughout the network, the neuron(s) of the output layer sequentially output(s) samples of z with the probability distribution p(z). These samples can be stored for later use, e.g. for calculation of an expectation value of z and, possibly, its standard deviation which can provide a measure of uncertainty of the prediction.



FIG. 9 schematically represents the flow chart of a method for programming a BNN according to a second embodiment of the present invention.


This embodiment differs from the first in that it aims at programming a BNN implemented according to a differential configuration, such as represented in FIGS. 6A and 6B. The BNN exhibits 2Q branch coefficients, each synapse being split into a first branch connected to the non-inverted input of a neuron and a second branch connected to the inverted input of the same.


In a preliminary step, 910, the BNN model is trained on a training data set, D, and the posterior probability distribution of the synaptic coefficients p(w|D) is obtained by a MCMC method as explained for the first embodiment, where a synaptic coefficient, wq, is to be understood here as entailing a first branch coefficient, wq+, and a second branch coefficient, wq−.


The result of step 910 is a matrix WS of size S×2Q, where S≥N is the number of samples of the probability distribution p(w|D), where Q is the number of synapses in the network. Each row of the matrix corresponds to an instance of the Q synaptic coefficients, or more precisely to a sample of the multivariate posterior probability distribution p(w|D).


At step 920, the marginal posterior probability distribution of each of the first branch coefficients, p(wq+|D), q=1, . . . , Q and each of the second branch coefficients, p(wq−|D), q=1, . . . , Q is decomposed by GMM into K Gaussian components by the EM (expectation-maximization) algorithm, the mean values and the standard deviations of the Gaussian components being constrained by the linear relationship (5) given by the RRAM cell characteristics.


This step provides a set of first parameters {tilde over (Ω)}q+={λq,k+, μq,k+|k=1, . . . , K} for each first branch coefficient wq+, q=1, . . . , Q of the BNN, where μq,k+, λq,k+ are respectively the median value and the weighting factor of the kth Gaussian component for first branch coefficient wq+. Similarly, it provides a set of second parameters {tilde over (Ω)}q−={λq,k−, μq,k−|k=1, . . . , K} for each second branch coefficient wq−, q=1, . . . , Q of the BNN, where μq,k−, λq,k− are respectively the median value and the weighting factor of the kth Gaussian component for second branch coefficient wq−.


At step 930, the elements of matrix WS are quantized. More specifically, each first, resp. second, branch element ws,q+, resp. ws,q−, is associated with the quantized value:

{tilde over (w)}s,q+ = argminμq,k+(|ws,q+ − μq,k+|)  (15-1)

and

{tilde over (w)}s,q− = argminμq,k−(|ws,q− − μq,k−|)  (15-2)

As in the first embodiment, if S>N, the rows of the resulting matrix {tilde over (W)}S are subsampled in order to downsize the matrix to N×2Q.


At step 940, the matrix {tilde over (W)}S is transferred to the BNN network. Alternatively, an index programming matrix {tilde over (Γ)}S of size N×2Q may be transferred along with both sets of mean values {μq,k+|q=1, . . . , Q, k=1, . . . , K}, {μq,k|q=1, . . . , Q, k=1, . . . , K} so as to reconstruct {tilde over (W)}S at the BNN side.


The variants envisaged for the first embodiment apply mutatis mutandis to the second embodiment.


At step 950, for each row n of the RRAM, the first, resp. second resistor, in a cell of this row, respectively embodying the first, resp. the second branch of the synapse associated with said cell, are programmed with the corresponding elements of matrix {tilde over (W)}S, by applying the respective programming currents

ISET+n,q = ({tilde over (w)}n,q+/α)^(1/γ),  ISET−n,q = ({tilde over (w)}n,q−/α)^(1/γ)

during a SET operation.


The programming is carried out in turn for each row of the RRAM. It is important to note that the programmed conductance values of a row of cells correspond to the mean values in the corresponding row of matrix {tilde over (W)}S, thereby capturing the correlation between the synaptic random variables.


Once the BNN has been programmed, it is ready to be used for inference by inputting a new vector of data to the network as in the first embodiment.



FIG. 10 illustrates the transfer of the programming matrix {tilde over (W)}S to the BNN for programming the RRAM memory.


The same part of the BNN as in FIG. 7B has been reproduced.


The original {tilde over (W)}S matrix is represented in 1010 with its original size, S×2Q, and in 1020, after down-sampling, with size N×2Q. The matrix is then sliced into slices Pi, i=1, . . . , I where I is the number of neurons, a slice Pi corresponding to (the first and second branches of) the synapses connected to the ith neuron.


More specifically, the first resistors of the nth row cells of the RRAM array, embodying the first branches of the synapses connected to the ith neuron, are respectively programmed with the {tilde over (w)}n,q+ values present in the nth row of slice Pi, whereas the second resistors of the nth row cells of the RRAM array, embodying the second branches of these synapses, are programmed with the {tilde over (w)}n,q− values present in the same row of the same slice.


By using a row of the {tilde over (W)}S matrix to program an instance of the first and second branch coefficients of all the synapses, the correlation between the branch random variables is preserved.
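A software sketch of this slicing and of the conversion of the programming matrix into SET currents is given below for illustration; the interleaved column layout (w+ and w− columns alternating per synapse) and the device coefficients are assumptions, not features disclosed in FIG. 10.

```python
import numpy as np

ALPHA, GAMMA = 0.08, 0.8  # assumed power-law coefficients of expression (6)

def program_currents_per_neuron(W_tilde, slice_sizes):
    """W_tilde has size N x 2Q, the columns being ordered as
    (w+_1, w-_1, w+_2, w-_2, ...). The matrix is cut into one slice per neuron
    and each target conductance is converted into the SET current
    (w_tilde / alpha)**(1/gamma) for the first (+) and second (-) resistor."""
    currents = []
    col = 0
    for P_i in slice_sizes:                     # P_i synapses feed neuron i
        slice_i = W_tilde[:, col:col + 2 * P_i]
        col += 2 * P_i
        I_plus = (slice_i[:, 0::2] / ALPHA) ** (1.0 / GAMMA)
        I_minus = (slice_i[:, 1::2] / ALPHA) ** (1.0 / GAMMA)
        currents.append((I_plus, I_minus))
    return currents

# Toy usage: N = 8 rows, two neurons fed by 3 and 2 synapses respectively.
rng = np.random.default_rng(5)
W_tilde = rng.uniform(20e-6, 80e-6, (8, 2 * (3 + 2)))
for i, (Ip, Im) in enumerate(program_currents_per_neuron(W_tilde, [3, 2]), 1):
    print(f"neuron {i}: I+ shape {Ip.shape}, I- shape {Im.shape}")
```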



FIG. 11 schematically represents the flow chart of a method for programming a BNN according to a third embodiment of the present invention.


The third embodiment differs from the first embodiment in that the posterior probability distribution of the synaptic coefficients p(w|D) is not obtained by a MCMC method but is approximated by an analytic distribution, known as variational density, q(w; ϕ), depending upon a set of parameters denoted ϕ. This method, known as variational inference or VI, has been described in the article of A. Kucukelbir et al. entitled “Automatic Differentiation Variational Inference” published in Journal of Machine Learning Research, No. 18, 2017, pp. 1-45 and also in the article of C. Blundell et al. entitled “Weight uncertainty in neural network” published in Proc. of the 32nd Int'l Conference on Machine Learning, Lille, France, 2015, vol. 17. Examples of VI methods are mean-field VI or automatic differentiation VI, known as ADVI. For the sake of simplification, but without loss of generality, the description will be limited to ADVI.


The optimal set of parameters, ϕopt, can be determined by minimizing the Kullback-Leibler divergence between the posterior probability distribution p(w|D) and the variational density, q(w, ϕ). As this problem is generally intractable, ADVI proposes to equivalently maximize the so-called evidence lower bound of the parameters ϕ, ELBO(ϕ), where:

ELBO(ϕ) = Eq(w;ϕ)[log p(D, w) − log q(w; ϕ)]

and

ϕopt = argmaxϕ[ELBO(ϕ)]  (16)

The variational density can be chosen as a multivariate Gaussian distribution whereby each synapse is defined by a mean and a standard deviation. The synapses are assumed to be independent and therefore their parameters do not co-vary. The ELBO value can be calculated through Monte Carlo integration by randomly sampling the variational density with its current parameters.


The optimal set of parameters, ϕopt, can be obtained by stochastic gradient descent, also known as automatic differentiation.
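A minimal sketch of the Monte Carlo ELBO estimator of expression (16) for a mean-field Gaussian variational density is given below; the toy log joint density is an assumption, and the stochastic gradient ascent on ϕ performed by ADVI is not reproduced here.

```python
import numpy as np

def elbo_estimate(log_joint, m, s, n_mc=256, seed=0):
    """Monte Carlo estimate of ELBO(phi) = E_q(w;phi)[log p(D, w) - log q(w; phi)]
    for a mean-field Gaussian variational density q with parameters
    phi = (m, s): one mean and one standard deviation per synaptic coefficient."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=(n_mc, m.size))
    w = m + s * eps                                   # reparameterized samples of q
    # log q(w; phi) for independent Gaussians
    log_q = np.sum(-0.5 * eps**2 - np.log(s) - 0.5 * np.log(2 * np.pi), axis=1)
    log_p = np.array([log_joint(wi) for wi in w])     # log p(D, w)
    return np.mean(log_p - log_q)

# Toy log joint of a 2-coefficient model: standard Gaussian prior plus a
# single pseudo-observation (illustrative assumption).
def log_joint(w):
    prior = -0.5 * np.sum(w**2)
    likelihood = -0.5 * (1.3 - np.sum(w))**2
    return prior + likelihood

print(elbo_estimate(log_joint, m=np.array([0.5, 0.5]), s=np.array([0.3, 0.3])))
```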


At step 1110, once the BNN has been trained by variational inference (e.g. ADVI), the variational density q(w, ϕopt) is known and hence the respective means and standard deviations of the synaptic coefficients.


At step 1115, samples of the synaptic coefficients are generated with Monte Carlo sampling, by using their respective means and standard deviations.


The posterior distribution of samples of each synaptic coefficient is then decomposed in 1120 by GMM into a plurality of Gaussian components, by using the expectation-maximization algorithm, the mean values and the standard deviations of the Gaussian components being constrained by the linear relationship (5) given by the RRAM cell characteristics. As in step 820, this step provides a set of parameters {tilde over (Ω)}q={λq,k, μq,k|k=1, . . . , K} for each synaptic coefficient wq, q=1, . . . , Q of the BNN, where μq,k, λq,k are respectively the median value and the weighting factor of the kth Gaussian component for synaptic coefficient wq.


At step 1130, for each synaptic coefficient wq, {tilde over (λ)}q,k=└N λq,k┘, k=1, . . . , K resistors of a column of the RRAM cells are programmed by injecting a current

ISETq = (μq,k/α)^(1/γ)

corresponding to median value μq,k during a SET operation, where N is the number of RRAM cells used for implementing the synapse. More generally, {tilde over (λ)}q,k is the integer value closest to N λq,k. By contrast with the first embodiment, it should be noted that there is no need here to program the RRAM row by row in order to retain the dependence between the synaptic coefficients. It suffices to program the RRAM by column while complying with the number {tilde over (λ)}q,k of RRAM cells programmed with a current corresponding to mean value, μq,k.
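The column-wise allocation of step 1130 can be sketched as follows for one synaptic coefficient; the rounding of the cell counts, the shuffling of the cell order and the device coefficients are assumptions made for the example.

```python
import numpy as np

ALPHA, GAMMA = 0.08, 0.8  # assumed power-law coefficients of expression (6)

def program_column(mu_q, lambda_q, N, seed=0):
    """Out of the N cells of the column associated with synaptic coefficient
    w_q, roughly lambda_{q,k} * N cells are programmed with the SET current
    corresponding to the mean mu_{q,k} of the k-th Gaussian component."""
    counts = np.floor(N * np.asarray(lambda_q)).astype(int)
    counts[0] += N - counts.sum()                 # give the rounding remainder to one component
    targets = np.repeat(mu_q, counts)             # target median conductance of each cell
    currents = (targets / ALPHA) ** (1.0 / GAMMA) # expression (6)
    rng = np.random.default_rng(seed)
    rng.shuffle(currents)                         # the cell order within a column is irrelevant
    return currents

# Toy usage: K = 3 components for one synapse, N = 64 cells in the column.
currents = program_column(mu_q=[20e-6, 45e-6, 70e-6],
                          lambda_q=[0.5, 0.3, 0.2], N=64)
print(currents.shape, currents.min(), currents.max())
```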



FIG. 12 schematically represents the flow chart of a method for programming a BNN according to a fourth embodiment of the present invention.


The fourth embodiment differs from the second embodiment in that the BNN is trained by using variational inference while it differs from the third embodiment in that it aims at programming a BNN according to a differential configuration.


More specifically, in step 1210, as in step 1110, the BNN is trained by VI (e.g. ADVI) for approximating the posterior probability distribution p(w|D) by a variational density, q(w, ϕ). However, each synaptic coefficient, wq, is to be understood here as entailing a first branch coefficient, wq+, and a second branch coefficient, wq−, as explained in relation with FIGS. 9 and 10. This step provides a couple of optimal means and standard deviations for the first and second branches of each synaptic coefficient.


At step 1215, samples of the synaptic coefficients, i.e. 2S samples of the first branch and second branch coefficients are generated by Monte Carlo sampling by using the couples of means and standard deviations obtained at the preceding step.


At step 1220, the posterior distribution of samples of each synaptic coefficient is then decomposed by GMM into a plurality of Gaussian components, by using the expectation-maximization algorithm, thereby providing couples of branch synaptic coefficients wq,k+, wq,k′−, each couple of Gaussian components k,k′ being weighted by a weighting factor λq,k,k′.


At step 1230, for each couple of branch synaptic coefficients, wq,k+, wq,k′−, the first branch resistor of the RRAM cell is programmed with a current

ISET+q = ({tilde over (w)}q,k+/α)^(1/γ)

corresponding to median value μq,k+ and the second branch resistor of the RRAM cell is programmed with a current

ISET−q = ({tilde over (w)}q,k′−/α)^(1/γ)

corresponding to median value μq,k′−.


By contrast with the second embodiment, it should be noted that there is no need here to program the RRAM row by row in order to retain the dependence between the branch synaptic coefficients. It suffices to program the RRAM cells by column while complying with the number {tilde over (λ)}q,k,k′=└λq,k,k′N┘ of couples of synaptic branches respectively programmed with the couples of currents ISET+q, ISET−q corresponding to the couple of mean values, μq,k+, μq,k′−. More generally, {tilde over (λ)}q,k,k′ is the integer value closest to λq,k,k′N.


The man skilled in the art will understand that the first and second embodiments have been described in the context of modelling a posterior probability distribution of synaptic coefficients by MCMC, whereas the third and fourth embodiments have been described in the context of modelling a posterior probability distribution of synaptic coefficients by VI. It should be noted, though, that alternative algorithms can be used for modelling the posterior probability distribution of the synaptic coefficients during the training of the BNN without departing from the scope of the present invention.

Claims
  • 1. A Bayesian neural network comprising an input layer, an output layer and possibly one or more hidden layer(s), each neuron of a layer being fed with a plurality of synapses, wherein said plurality of synapses are implemented as a RRAM array of cells, each column of the RRAM array being associated with a synapse, each row of the array corresponding to a sample of the posterior probability of the synaptic coefficients after the neural network has been trained, each cell comprising at least a first resistor in series with a first transistor the drains of the first transistors of the cells of a row being connected together with an input of a neuron, the gates of the first transistors of a row being connected together with a control line, the control lines of the N rows being connected to a control unit configured to switch in turn said control lines, the voltages output by all the neurons of a layer being simultaneously propagated to the next, or being output if the layer is the output layer, the successive voltages output by output layer providing a probability distribution (p(z)) of an inference from a vector of data when said vector of data is input to the input layer.
  • 2. The Bayesian neural network according to claim 1, wherein each cell of the RRAM array comprises a second resistor in series with a second transistor, the gates of the second transistors of a row being connected together with the control line connected to the gates of the first transistors of said row, the drains of the first transistors of the cells of a row being connected together with a non-inverted input of said neuron and the drains of the second transistors of the cells of a row being connected together with an inverted input of said neuron.
  • 3. The method for programming a Bayesian neural network according to claim 1, wherein: (a) a model of the Bayesian neural network is trained on a training dataset, D, the posterior probability distribution of the synaptic coefficients p (w|D) of the network being sampled by a Markov Chain Monte Carlo (MCMC) algorithm, thereby providing a matrix WS of size S×Q, each row of said matrix corresponding to a sample of the synaptic coefficients and each column of said matrix corresponding to a synapse of said network;(b) the marginal posterior probability distribution of each synaptic coefficient, wq is decomposed (820) into a plurality K of Gaussian components, the kth Gaussian component of synaptic component wq having a median value μq,k;(c) each sample of a synaptic coefficient, wq, in matrix WS is transformed into a quantized value to obtain a programming matrix {tilde over (W)}S, said quantized value being selected as the mean value of a Gaussian component obtained at step (b) which is closest to said sample of synaptic coefficient;(d) programming the first resistors of each row, n, of the RRAM array, by injecting therein a current of intensity
  • 4. The method for programming a Bayesian neural network according to claim 3 wherein, in step (a), the posterior probability distribution of the synaptic coefficients is obtained by a No-U-Turn MCMC sampling method.
  • 5. The method for programming a Bayesian neural network according to claim 3, wherein the number S of rows in matrix WS is higher than the number N of rows in the RRAM array, and that the rows of matrix WS are subsampled, prior or after step (c) so as to only retain N rows thereof.
  • 6. The method for programming a Bayesian neural network according to claim 3, wherein the K Gaussian components of a synaptic coefficient, wq, are obtained by the expectation-maximisation (EM) algorithm, the respective median values, μq,k, and standard deviations, σq,k, k=1, . . . , K of said Gaussian components being considered as latent variables bound by a linear constraint σq,k=σ0−λ·μq,k where σ0 and λ are constant values characteristic of the RRAM array.
  • 7. The method for programming a Bayesian neural network according to claim 3, wherein steps (a), (b) and (c) are carried out by a distant server in the Cloud and that programming matrix {tilde over (W)}S is transferred to the RRAM array to program the first resistors in step (d), said RRAM array being hosted by a mobile terminal or an IoT device.
  • 8. The method for programming a Bayesian neural network according to claim 2, wherein: (a) a model of the Bayesian neural network is trained on a training dataset, D, each synapse of the neural network having a first branch and a second branch, said first branch comprising the first resistor and the first transistor of a cell, said second branch comprising the second resistor and the second transistor of said cell, the posterior probability distribution of the synaptic coefficients p(w|D) of the network being obtained by a Markov Chain Monte Carlo (MCMC) algorithm, each synaptic coefficient entailing a first branch coefficient and a second branch coefficient, thereby providing a matrix WS of size S×2Q, each row of said matrix corresponding to a sample of the synaptic coefficients and each column of said matrix corresponding to a synapse;(b) the marginal posterior probability distribution of each first, resp. second branch coefficient is decomposed (920) into a plurality K of Gaussian components, the kth Gaussian component of said first, resp. second branch coefficient wq+, resp. wq− having a median value μq,k+, resp. μq,k−;(c) each sample of a first, resp. second branch coefficient, wq+, resp. wq− in matrix WS is transformed into a quantized value to obtain a programming matrix {tilde over (W)}S, said quantized value being selected as the mean value of a Gaussian component obtained at step (b) which is closest to said sample of first, resp. second branch coefficient;(d) programming the first, resp. second resistors of each row, n, of the RRAM array, by injecting therein a first current of intensity
  • 9. The method for programming a Bayesian neural network according to claim 8 wherein, in step (a), the posterior probability distribution of the first and second branch of the synaptic coefficients is obtained by a No-U-Turn MCMC sampling method.
  • 10. The method for programming a Bayesian neural network according to claim 8, wherein the number S of rows in matrix WS is higher than the number N of rows in the RRAM array, and that the rows of matrix WS are subsampled, prior or after step (c) so as to only retain N rows thereof.
  • 11. The method for programming a Bayesian neural network according to claim 8, wherein the K Gaussian components of a first, resp. second branch coefficient, wq+, resp. wq−, are obtained by the expectation-maximisation (EM) algorithm, the respective median values, μq,k+, μq,k−, and standard deviations, σq,k+, σq,k−, k=1, . . . , K of said Gaussian components being considered as latent variables bound by linear constraints σq,k+=σ0−λ·μq,k+ and σq,k−=σ0−λ·μq,k− where σ0 and λ are constant values characteristic of the RRAM array.
  • 12. The method for programming a Bayesian neural network according to claim 8, wherein steps (a), (b) and (c) are carried out by a distant server in the Cloud and that programming matrix WS is transferred to the RRAM array to program the first resistors in step (d), said RRAM array being hosted by a mobile terminal or an IoT device.
  • 13. The method for programming a Bayesian neural network according to claim 1, wherein: (a) a model of the Bayesian neural network is trained on a training dataset, D the posterior probability distribution of the synaptic coefficients p (w|D) of the network being approximated by an analytic density distribution by variational inference on the dataset, thereby providing a mean and a standard deviation for each synaptic coefficient;(b) samples of the synaptic coefficients are generated by Monte Carlo sampling of the analytic density distribution by using the mean and standard deviation obtained for each synaptic coefficient at step (a);(c) the posterior probability distribution represented by the samples of each synaptic coefficient, wq is decomposed into a plurality K of Gaussian components, the kth Gaussian component of synaptic component wq having a mean value μq,k and a weighting factor λq,k;(d) programming, for each synaptic coefficient, wq, and each kth Gaussian component, {tilde over (λ)}q,k first resistors of a column of the RRAM array, by injecting therein a current of intensity
  • 14. The method for programming a Bayesian neural network according to claim 2, wherein: (a) a model of the Bayesian neural network is trained on a training dataset, D, the posterior probability distribution of the synaptic coefficients p (w|D) of the network being approximated by an analytic density distribution by variational inference on the dataset, each synaptic coefficient entailing a first branch coefficient and a second branch coefficient, said training providing a mean and a standard deviation for each synaptic branch coefficient;(b) samples of the synaptic branch coefficients are generated by Monte Carlo sampling of the analytic density distribution by using the mean and standard deviation obtained for each synaptic branch coefficient at step (a);(c) the posterior probability distribution of each first, resp. second branch coefficient is decomposed into a plurality K of Gaussian components, the kth Gaussian component of said first, resp. second branch coefficient wq+, resp. wq− having a mean value μq,k+, resp. μq,k′− and each couple of Gaussian components, k, k′ having a weighting factor λq,k,k′;(d) programming, for each couple of synaptic branch coefficients, (wq+, wq−), and each couple of Gaussian components, k,k′, {tilde over (λ)}q,k,k′ first and second resistors of a column of the RRAM array, by injecting therein a first current of intensity
Priority Claims (1)
Number Date Country Kind
20305450.7 May 2020 EP regional