BNN TRAINING WITH MINI-BATCH PARTICLE FLOW

TECHNICAL FIELD

Embodiments discussed herein regard devices, systems, and methods for training a Bayesian neural network (NN) using mini-batch particle flow.

BACKGROUND

Most NNs provide point estimates without a direct uncertainty metric or confidence. Standard NNs can also have relatively poor performance on open sets. Bayesian NNs (BNNs) learn statistical distributions of weights, providing a statistical environment in which decision uncertainty and confidence can be determined. Pre-existing training techniques for BNNs include Hamiltonian Monte Carlo, variational inference (both Monte Carlo and deterministic), probabilistic back propagation, and standard particle filters. These training methods are computationally expensive and require a relatively large amount of data for training.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates, by way of example, a block diagram contrasting a standard DL architecture and a BNN.

FIG. 2 illustrates, by way of example, a diagram of an embodiment of an NN.

FIG. 3 illustrates, by way of example, a flow diagram of an embodiment of an NN training procedure using training particle flow.

FIG. 4 illustrates, by way of example, a plot of accuracy versus measurement update for a BNN being trained based on MNIST {0, 1} using training particle flow.

FIG. 5 illustrates, by way of example, a plot of accuracy versus measurement update for a BNN being trained on MNIST {0, 1, 2, 3}.

FIG. 6 illustrates, by way of example, a flow diagram of an embodiment of a mini-batch training particle flow technique.

FIG. 7 shows a plot of the network accuracy with increasing batch update for respective batch sizes of 1, 2, and 16.

FIG. 8 is a block diagram of an example of an environment including a system for NN training, according to an embodiment.

FIG. 9 illustrates, by way of example, a block diagram of an embodiment of a machine in the example form of a computer system within which instructions, for causing the machine to perform any one or more of the methodologies discussed herein, may be executed.

DETAILED DESCRIPTION

Particle flow has recently been adapted for use in a Deep Learning (DL) context. Such a particle flow technique for DL training is called “training particle flow”. Training particle flow can train a Bayesian Neural Network (BNN). A BNN trained using training particle flow is sometimes called a “particle flow BNN”. The particle flow BNN architecture demonstrates a high predictive accuracy with few training samples and a strong capability for measuring predictive uncertainty using variance in the predictions made by the BNN, as shown in a case study using MNIST classes {0,1}. However, the current implementation of the particle flow BNN, as described below, tends to lack robustness and can be difficult to train for more than two classes. See FIGS. 4 and 5, for example. BNNs and training particle flow are described first, followed by mini-batch particle flow training. Training a BNN using mini-batch particle flow provides a more robust BNN that is easier to train for more than two classes than a BNN trained using the single-input particle flow training technique.

FIG. 1 illustrates, by way of example, a block diagram contrasting a standard DL architecture 102 and a BNN 104. Standard DL architectures provide point estimates for model predictions (output 108) and network parameters (weights of nodes 112). Such DL architectures 102 do not provide a direct route towards quantifying uncertainty. Instead, standard DL architectures 102 rely on indirect methods for estimating uncertainty. Common methods include using entropy and confidence scores and functions, as well as application specific methods. Bayesian DL architectures 104 and statistical methods tend to offer a more natural landscape for quantifying uncertainty.

BNNs have been researched at least since 1992 and continue to be a growing field. BNN techniques use Bayes' theorem as a guide to solve for a posterior probability distribution of weights (distribution of nodes 114) in an NN. The computational intractability of solving Bayes' Theorem for DL tasks has led to development of numerous approaches for estimating the posterior distribution of weights in the NN. Well known approaches for BNN optimization include Hamiltonian Monte Carlo, Monte Carlo Variational Inference, Deterministic Variational Inference, and Probabilistic Back Propagation (PBP).

An output 110 of the BNN 104 is a distribution per class, in contrast to the output 108 of the NN 102, which provides a score per class. The distribution of the output 110 can be a natural consequence of using a distribution (“dist.”), instead of scalar weights as in the nodes 112, to represent the activation function of the node 114.

Deep ensembles and stochastic regularization techniques provide an alternate approach towards estimating uncertainty in DL. While these techniques do not optimize Bayes' Theorem, they do provide a statistical landscape to compute predictive uncertainty at a fraction of the computational cost of BNNs. These statistical methods have been used in a variety of applications for uncertainty quantification.

A common theme among the existing Bayesian NNs and statistical approaches is a reliance on extremely large training sets and training over thousands of epochs. However, real world datasets are commonly sparse and may not be sufficient to train a typical NN with thousands of parameters. Embodiments provide an NN architecture that can quantify uncertainty and can also perform robustly in the limit of sparse datasets and on open-set problems.

Embodiments use a modified form of a particle flow technique that is commonly used in particle filters but is altered and repurposed to train a BNN. The modified form of the particle flow method is called “training particle flow” herein. Particle flow is a method for optimizing Bayes' Theorem that, until now, has been used exclusively in the particle filtering context. Numerical experiments for particle flow, in the context of particle filtering, show that particle flow can reduce the computational complexity by many orders of magnitude relative to standard particle filters or other state-of-the-art algorithms for the same filter accuracy. Moreover, particle flow can reduce the filter errors by many orders of magnitude relative to an extended Kalman filter or other state-of-the-art algorithms for difficult nonlinear non-Gaussian problems.

While particle methods have recently emerged for optimizing NNs, using particle flow to optimize a BNN has not been done before to the best of the inventors' knowledge. Results of a BNN trained to perform a classification task with MNIST {0,1} demonstrate a high predictive accuracy with very few training samples. Further, the BNN trained to perform the classification task had a strong capability for measuring predictive uncertainty using variance in the network's predictions.

Particle Flow

Consider a system with an internal state, s, with a measurement, m. Bayes' theorem relates the posterior probability of the state given the measurement, p(s|m), to the prior distribution of the state, p(s), and the likelihood of the measurement given the state, p(m|s), according to

$\begin{matrix} p (s ❘ m) = \frac{p (m ❘ s) p (s)}{p (m)} . & (Eqn . 1) \end{matrix}$

Here p(m)=∫p(m|s)p(s)ds is the evidence, which behaves as a normalization constant. The measurement, m, is a quantity that helps characterize what the internal state is or will be. Given a general Markov process with a set of noisy measurements, {m}, particle filters provide a method to estimate the internal state(s) {s} of the system using Bayes' Theorem as a guide.

Particle flow is a method used in a particle filtering context to estimate an optimal posterior distribution of internal states per each measurement. Particle flow optimizes Bayes' theorem by evolving the prior distribution to the posterior distribution along a log-homotopy,

log p(s,λ|m)=λ log p(m|s)+log p(s)−log K(λ,m). (Eqn. 2)

Two continuous functions between the same pair of spaces are called homotopic if one continuous function can be “continuously deformed” into the other continuous function. A homotopy exists between functions that can be so deformed.

Here K(λ, m)=∫p(m|s)^λp(s)ds normalizes the posterior distribution p(s, λ|m) for each λ. The scalar homotopy parameter λ∈[0,1] evolves the distribution from the prior (λ=0) to the posterior (λ=1) for a given measurement, m. Each particle represents a single realization of the internal state, s, of the system. The flow of particles along the log-homotopy is described by a stochastic differential equation (SDE),

ds=f(s,λ)dλ+BdW (Eqn. 3)

where f is a drift velocity vector, B is a diffusion matrix, and dW is a differential of a Wiener process.

The evolution of a posterior distribution of particles can be characterized by a Fokker-Planck equation (where the diffusion-squared matrix is defined as Q_ij=Σ_kB_ikB_jk),

$\begin{matrix} \frac{\partial p (s, λ ❘ m)}{\partial λ} = - \vec{\nabla} \cdot (p (s, λ ❘ m) \vec{f}) + \frac{1}{2} \sum_{i, j} \frac{\partial^{2} (p (s, λ ❘ m) Q_{ij})}{\partial x_{i} \partial x_{j}} & (Eqn . 4) \end{matrix}$

Here the gradients and derivatives as written in Eqn. (4) are with respect to Cartesian coordinates; however, Eqn. (4) can be used with any orthogonal coordinate system of choice by properly transforming the partial derivatives.

Various solutions for the drift velocity $\vec{f}$ and diffusion matrix Q have been found for specific choices of the prior and likelihood functional forms and for deterministic or stochastic evolution. The Gromov solution for the drift velocity and diffusion matrix is,

$\begin{matrix} \vec{f} = - {[λ (\vec{\nabla} {\vec{\nabla}}^{T} \log p (m ❘ s)) + (\vec{\nabla} {\vec{\nabla}}^{T} \log p (s))]}^{- 1} \vec{\nabla} \log p (m ❘ s) & (Eqn . 5) \end{matrix}$

$\begin{matrix} Q = {[λ \vec{\nabla} {\vec{\nabla}}^{T} \log p (m ❘ s) + \vec{\nabla} {\vec{\nabla}}^{T} \log p (s)]}^{- 1} (- \vec{\nabla} {\vec{\nabla}}^{T} \log p (m ❘ s)) {[λ \vec{\nabla} {\vec{\nabla}}^{T} \log p (m ❘ s) + \vec{\nabla} {\vec{\nabla}}^{T} \log p (s)]}^{- 1} & (Eqn . 6) \end{matrix}$

where $\vec{\nabla} {\vec{\nabla}}^{T}$ is the Hessian operator. The Hessian matrix is a square matrix of second-order partial derivatives of a scalar-valued function that describes the local curvature of the function. The Gromov solution for the diffusion matrix requires the prior and likelihood to have a Gaussian functional form and a linear relationship between the measurement and state with Gaussian white noise.

The Geodesic solution assumes the Gromov solution for the drift velocity and no diffusion (i.e., Q=0); however, it does not simultaneously satisfy Eqn. 2 and Eqn. 4. The zero-curvature solution assumes the particles do not accelerate with varying λ and there is no diffusion term (i.e., Q=0). This solution has a drift velocity proportional to the Gromov drift velocity.
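As an illustration of Eqn. 5, the following is a minimal sketch of the Gromov drift for the linear-Gaussian case. It assumes a measurement model m = H s + noise with noise covariance R and a Gaussian prior with covariance P; the names `gromov_drift`, `H`, `R`, and `P` are hypothetical and not taken from the description above.

```python
import numpy as np

def gromov_drift(s, lam, m, H, R, P):
    """Drift of Eqn. 5 for an assumed linear-Gaussian measurement model."""
    R_inv = np.linalg.inv(R)
    # Gradient of log p(m|s) = -0.5 (m - H s)^T R^-1 (m - H s) + const.
    grad_loglik = H.T @ R_inv @ (m - H @ s)
    # Hessians of the Gaussian log-likelihood and log-prior (both negative semi-definite).
    hess_loglik = -H.T @ R_inv @ H
    hess_logprior = -np.linalg.inv(P)
    # f = -[lam * Hess(log-lik) + Hess(log-prior)]^-1 grad(log-lik)
    return -np.linalg.solve(lam * hess_loglik + hess_logprior, grad_loglik)

s = np.array([0.0, 0.0])        # one particle (state realization)
m = np.array([1.0])             # one measurement
H = np.array([[1.0, 0.0]])      # measurement observes the first state component
R = np.array([[0.5]])
P = np.eye(2)

drift_start = gromov_drift(s, 0.0, m, H, R, P)   # lam = 0 (prior end of the homotopy)
drift_end = gromov_drift(s, 1.0, m, H, R, P)     # lam = 1 (posterior end)
```

In this sketch the drift points toward states that better explain the measurement, and its magnitude shrinks as λ grows because the likelihood curvature is added to the prior curvature.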

DL and Supervised Learning Tasks

DL is a branch of ML that uses a series of layers of nodes to learn higher order representations of data for a supervised, semi-supervised, or unsupervised learning task. While embodiments described focus on supervised learning tasks, embodiments can be applied to any learning task for which a likelihood function can be defined.

Supervised learning tasks can use a deep NN (DNN) to learn a relationship between input and output data for either regression or classification. For a regression task, the NN can predict the dependent variable 𝓅 that is causally related to the input data; for a classification task, the network predicts the probability 𝓅_j of a particular class. The word “probability” is a slight misnomer here. Classification tasks often use a SoftMax activation function in an output layer to produce a vector whose elements sum to one. While this output represents a set of class probabilities, these “probabilities” are not necessarily well calibrated to the actual accuracy of the network. In this sense, the output probabilities can be more accurately understood as a normalized score for each class. During training, the NN predictions can be evaluated against the corresponding truth class or truth values, y_T, of the input data, and the network weights can be adjusted using a chosen optimization scheme.

A “likelihood” function is used in many gradient-based optimization methods in DL. For regression tasks, the likelihood, ℒ, of a truth variable, y_T, given NN weights θ={θ_i} and network prediction, 𝓅, is commonly modeled by a Normal distribution,

$\begin{matrix} ℒ (y_{T} ❘ θ, 𝓅) = \frac{1}{\sqrt{{(2 π)}^{k} ❘ \Sigma ❘}} \exp [- \frac{1}{2} {(y_{T} - 𝓅)}^{T} \Sigma^{- 1} (y_{T} - 𝓅)] & (Eqn . 7) \end{matrix}$

Here Σ is a covariance matrix that scales the error between the truth, y_T, and the prediction, 𝓅, and k is the dimensionality of y_T. This likelihood function assumes Gaussian white noise discrepancies between the prediction, 𝓅, and the truth, y_T. The corresponding log-likelihood is given by,

$\begin{matrix} \log ℒ = - \frac{1}{2} [k \log 2 π + \log ❘ \Sigma ❘ + {(y_{T} - 𝓅)}^{T} \Sigma^{- 1} (y_{T} - 𝓅)], & (Eqn . 8) \end{matrix}$

which is reminiscent of the Mean Squared Error (MSE) when Σ is the identity matrix.
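The relationship between Eqn. 8 and the squared error can be checked numerically. The sketch below is illustrative only; the function name `gaussian_log_likelihood` and the sample values are assumptions made for the example.

```python
import numpy as np

def gaussian_log_likelihood(y_true, y_pred, cov):
    """Log-likelihood of Eqn. 8 for a k-dimensional truth vector."""
    k = y_true.shape[0]
    resid = y_true - y_pred
    _, logdet = np.linalg.slogdet(cov)             # log|Sigma|
    maha = resid @ np.linalg.solve(cov, resid)     # resid^T Sigma^-1 resid
    return -0.5 * (k * np.log(2.0 * np.pi) + logdet + maha)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

# With an identity covariance, the data-dependent term is the sum of squared errors.
log_lik = gaussian_log_likelihood(y_true, y_pred, np.eye(3))
sum_sq_err = np.sum((y_true - y_pred) ** 2)
```

With Σ equal to the identity, the log-likelihood equals a constant minus one half of the sum of squared errors, so maximizing it is equivalent to minimizing the MSE.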

For classification tasks, a categorical distribution describes the likelihood, ℒ, of the true class, y_T, of an input given the predicted class probabilities, 𝓅_j, j∈[1, n_class], and NN weights θ,

$\begin{matrix} ℒ (y_{T} ❘ θ, 𝓅) = \prod_{j} 𝓅_{j}^{y_{T, j}}, & (Eqn . 9) \end{matrix}$

where y_T={y_T,j} is a one-hot encoded vector (or, if soft labels are used, a non-binary vector that sums to one) of the truth class of the input. The corresponding log-likelihood is the negative of the cross-entropy function,

$\begin{matrix} \log ℒ = \log [\prod_{j} 𝓅_{j}^{y_{T, j}}] = \sum_{j} y_{T, j} \log 𝓅_{j} . & (Eqn . 10) \end{matrix}$
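A short sketch of Eqn. 10 follows, assuming a SoftMax output layer and a one-hot truth vector. The helper names and the small `eps` guard against log(0) are assumptions added for the example.

```python
import numpy as np

def softmax(logits):
    """Normalized class scores that sum to one."""
    shifted = logits - np.max(logits)      # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def categorical_log_likelihood(y_true, probs, eps=1e-12):
    """Log-likelihood of Eqn. 10: sum_j y_Tj * log(p_j)."""
    return float(np.sum(y_true * np.log(probs + eps)))

logits = np.array([2.0, 0.5, -1.0])        # hypothetical output-layer pre-activations
probs = softmax(logits)
y_true = np.array([1.0, 0.0, 0.0])         # one-hot truth: class 0

log_lik = categorical_log_likelihood(y_true, probs)
cross_entropy = -log_lik                   # cross-entropy is the negative log-likelihood
```

For a one-hot truth vector only the log-score of the true class contributes, which is why the log-likelihood reduces to log 𝓅_j for the true class j.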

Mapping Particle Flow for Training BNNs

FIG. 2 illustrates, by way of example, a diagram of an embodiment of an NN 200. The NN 200 includes L layers of nodes. The L-layer NN, Λ_θ, has a set of network parameters θ={θ_j}={W¹, b¹, W², b², . . . , W^L, b^L}, which includes all the weights and biases. The NN 200 has a total of N_param parameters, such that the network parameter θ is an N_param-dimensional vector, θ∈ℝ^(N_param).

In a typical supervised learning task, a set of training data 𝒟={𝒳, 𝒴_T} is used to train the NN 200. Here 𝒳={x_j} is the set of all inputs and 𝒴_T={y_T,j} is the corresponding set of all truth values given 𝒳. Each NN prediction 𝓅_j results from a series of 2L compositions on the data, Λ_θ=(σ^L∘g^L∘σ^(L−1)∘ . . . ∘σ¹∘g¹),

$\begin{matrix} 𝓅_{j} = Λ_{θ} (x_{j}) & (Eqn . 11) \end{matrix}$

where σ describes an activation function for each layer of nodes and g describes an affine transformation at each layer of nodes.
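The composition in Eqn. 11 can be sketched as a loop over layers. The two-layer shape, the ReLU and identity activations, and all names below are assumptions chosen only to make the example concrete.

```python
import numpy as np

def forward(x, params, activations):
    """Apply sigma^L o g^L o ... o sigma^1 o g^1 to the input x."""
    out = x
    for (W, b), sigma in zip(params, activations):
        out = sigma(W @ out + b)       # affine map g, then activation sigma
    return out

def relu(z):
    return np.maximum(z, 0.0)

def identity(z):
    return z

rng = np.random.default_rng(0)
# One particle corresponds to one realization of all weights and biases.
params = [
    (rng.normal(size=(4, 3)), np.zeros(4)),   # W^1, b^1
    (rng.normal(size=(2, 4)), np.zeros(2)),   # W^2, b^2
]
prediction = forward(np.array([0.1, -0.2, 0.3]), params, [relu, identity])

# N_param counts every weight and bias: 12 + 4 + 8 + 2.
n_param = sum(W.size + b.size for W, b in params)
```

Flattening `params` into a single vector of length `n_param` gives the vector θ that a particle represents.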

The goal of a BNN is to learn an optimal posterior distribution of network parameters p(θ|𝒟) given the data using Bayes' Theorem,

$\begin{matrix} p (θ ❘ D) = p (θ ❘ {𝒳, 𝒴_{T}}) = \frac{ℒ (𝒴_{T} ❘ θ, 𝒳) p (θ)}{p (𝒴_{T} ❘ 𝒳)} . & (Eqn . 12) \end{matrix}$

Here p(θ) describes the prior distribution on the network parameters, {θ_j}, and ℒ(𝒴_T|θ, 𝒳)=ℒ(𝒴_T|Λ_θ(𝒳)) describes the likelihood of the truth values 𝒴_T given the NN predictions Λ_θ(𝒳) (see Eqn. (7) and Eqn. (9)). The normalization factor in Eqn. (12) is defined as p(𝒴_T|𝒳)=∫ℒ(𝒴_T|θ, 𝒳)p(θ)dθ.

Note that the right-hand side of Eqn. (12) follows from a reduction of the full expression of the posterior probability,

$p (θ ❘ {𝒳, 𝒴_{T}}) = \frac{ℒ (𝒴_{T} ❘ θ, 𝒳) p (𝒳 ❘ θ) p (θ)}{\int ℒ (𝒴_{T} ❘ θ, 𝒳) p (𝒳 ❘ θ) p (θ) d θ},$

where the denominator is the evidence p({𝒳, 𝒴_T})=∫ℒ(𝒴_T|θ, 𝒳)p(𝒳|θ)p(θ)dθ. Because the inputs are independent of the network parameters θ, the conditional probability reduces to p(𝒳|θ)=p(𝒳). Substituting this into the full expression for the posterior probability, and cancelling the common factor p(𝒳), reproduces the right-hand side of Eqn. (12),

$p (θ ❘ {𝒳, 𝒴_{T}}) = \frac{[ℒ (𝒴_{T} ❘ θ, 𝒳) p (θ)] p (𝒳)}{[\int ℒ (𝒴_{T} ❘ θ, 𝒳) p (θ) d θ] p (𝒳)} = \frac{ℒ (𝒴_{T} ❘ θ, 𝒳) p (θ)}{\int ℒ (𝒴_{T} ❘ θ, 𝒳) p (θ) d θ} = \frac{ℒ (𝒴_{T} ❘ θ, 𝒳) p (θ)}{p (𝒴_{T} ❘ 𝒳)} .$

Embodiments map particle flow, used in the particle filtering context, to the DL context, resulting in training particle flow, by equating the internal states {s} and the measurements {m} in Eqn. (1) to the network parameters θ and truth values 𝒴_T, respectively, in Eqn. (12). Each particle, under these equalities, now represents a single realization of network parameters {θ_j}. Training particle flow evolves the values of the network parameters {θ_j} in the NN with homotopy scalar λ. The likelihood of each measurement, m, given the internal state, s, is replaced by a likelihood of the truth value, 𝒴_T, given the prediction, 𝓅, of the network 200 based on the input data, 𝒳.

A mapping of particle flow as used in the particle filtering context to training particle flow is mathematically described in Eqn. (13)-Eqn. (15).

$\begin{matrix} p (s ❘ m) = \frac{p (m ❘ s) p (s)}{p (m)} \to p (θ ❘ {𝒳, 𝒴_{T}}) = \frac{ℒ (𝒴_{T} ❘ θ, 𝒳) p (θ)}{p (𝒴_{T} ❘ 𝒳)} & (Eqn . 13) \end{matrix}$

$\begin{matrix} m \to 𝒴_{T} & (Eqn . 14) \end{matrix}$

$\begin{matrix} s \to θ, d θ = \vec{f} d λ + BdW & (Eqn . 15) \end{matrix}$

The log-homotopy constraint given in Eqn. (2) becomes,

log p(θ,λ|{𝒳,𝒴_T})=λ log ℒ(𝒴_T|θ,𝒳)+log p(θ)−log p(λ,𝒴_T|𝒳) (Eqn. 16)

where the scalar homotopy parameter, λ, was added to the notation of the posterior distribution and the normalization factor to designate the dependence of these terms on λ. The corresponding gradients and derivatives as written in Eqn. (4) are now with respect to the network parameters for training particle flow,

$\begin{matrix} \frac{\partial p (θ, λ ❘ {𝒳, 𝒴_{T}})}{\partial λ} = - {\vec{\nabla}}_{θ} \cdot (p (θ, λ ❘ {𝒳, 𝒴_{T}}) \vec{f}) + \frac{1}{2} \sum_{i, j} \frac{\partial^{2} (p (θ, λ ❘ {𝒳, 𝒴_{T}}) Q_{ij})}{\partial θ_{i} \partial θ_{j}} & (Eqn . 17) \end{matrix}$

Eqn. 18 shows a mathematical representation of drift velocity in training particle flow using the Gromov expression,

$\begin{matrix} \vec{f} = - {[λ ({\vec{\nabla}}_{θ} {\vec{\nabla}}_{θ}^{T} \log ℒ) + ({\vec{\nabla}}_{θ} {\vec{\nabla}}_{θ}^{T} \log p (θ))]}^{- 1} {\vec{\nabla}}_{θ} \log ℒ . & (Eqn . 18) \end{matrix}$

The Gromov expression generalizes well to architectures with L≥1 layers, varying activation functions, and arbitrary prior and likelihood functional forms. However, Eqn. (18) only satisfies Eqn. (16)-(17) when the NN 200 has a single layer with a linear activation function, Λ_θ(𝒳)=σ¹∘g¹(𝒳)=g¹(𝒳), the prior and likelihood have a Gaussian functional form, and Q is given by Eqn. (6).

Eqn. 19 provides a mathematical representation of a constant diffusion matrix used in the particle flow training,

Q=αId, (Eqn. 19)

where α∈ℝ_(>0) is a positive real number and Id is the identity matrix. A constant diffusion matrix can help provide numerical stability. Additionally, adding a small amount of noise can prevent the network training from getting stuck in local minima. A small exponential damping factor can also be added to the diffusion matrix to reduce the impact of noise with an increasing number of weight updates,

Q=α exp[−β(update #)]Id, (Eqn. 20)

where β>0 scales the rate of damping.
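A minimal sketch of the damped diffusion matrix follows. The function name and the particular values of α and β are assumptions for illustration, not recommended settings.

```python
import numpy as np

def diffusion_matrix(n_param, alpha, beta, update_index):
    """Q = alpha * exp(-beta * update_index) * Id, the damped constant diffusion."""
    return alpha * np.exp(-beta * update_index) * np.eye(n_param)

# Noise level at the first update and after ten updates (illustrative values).
Q0 = diffusion_matrix(3, alpha=1e-3, beta=0.1, update_index=0)
Q10 = diffusion_matrix(3, alpha=1e-3, beta=0.1, update_index=10)
```

Setting β to zero in this sketch recovers the constant diffusion matrix of Eqn. 19.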

NN Training Procedure Using Training Particle Flow

FIG. 3 illustrates, by way of example, a flow diagram of an embodiment of an NN training procedure using training particle flow. The procedure includes initialization 320, training particle flow optimization 322, and prediction 324.

The initialization 320 includes a user choosing, or a computer automatically instantiating, a functional form of a likelihood and prior at operation 326. The initialization 320 includes sampling the chosen prior distribution p(θ) at operation 328. Sampling initializes each realization of the NN that is optimized at optimization 322. If a multivariate normal distribution is chosen as the prior at operation 326, the corresponding Hessian has a simple analytical form,

p(θ)~𝒩(θ;μ,Γ), (Eqn. 20)

$\begin{matrix} {\vec{\nabla}}_{θ} {\vec{\nabla}}_{θ}^{T} \log p (θ) = - Γ^{- 1}, & (Eqn . 21) \end{matrix}$

where μ is a mean vector and Γ is a covariance matrix. The mean adds an initial offset to the values of the network parameters. The mean can be set to zero for simplicity and to avoid adding an incorrect offset. However, an offset can be known and used. The values of Γ characterize the initial spread of each network parameter and potential correlations. The values can be chosen to be large enough to encourage fast learning, yet small enough to discourage divergences in the network.

For the likelihood functions, Eqn. (7) or (9) can be used depending on the type of supervised learning task. The residual covariance in Eqn. (7) can be chosen similarly to the prior covariance (e.g., to promote learning, yet to prevent divergences). The number of particles, N, can be chosen to be large enough to provide sufficient statistics on the data and to avoid divergences in the covariance matrix.

The particle flow optimization 322 can be described as follows. For each data point in the training set, (x_j, y_T,j)∈𝒟 (select data from the training set at operation 330):

- Equate the current distribution of particles to the prior distribution of the particles.
- Calculate the covariance of the prior distribution of particles, Γ.
- Loop iteratively through the scalar homotopy parameter λ∈[0,1] (operation 332)
  - For each λ_k, k=1, 2, . . . , N_λ−1
    - Calculate integration step size, δλ=λ_(k+1)−λ_k
    - For each particle θ_i, i=1, 2, 3, . . . , N: (operation 334)
      - Pass data input x_j through the network (operation 334) using the particle's values to get a prediction, 𝓅_j=Λ_(θ_i)(x_j)
      - Calculate gradients and Hessians of the log-likelihood with respect to network parameters (operation 336)
      - Calculate the drift f and diffusion matrix Q (operation 338)
      - Update state of particle using numerical integration of the stochastic differential equation (SDE) (Eqn. 15) (operation 340)

$θ_{i} = θ_{i} + \vec{f} δλ + \sqrt{Q δλ} n, n \sim 𝒩 (0, Id)$

At operation 330, a pair of input and output values is selected from the data set. At operation 332, the operation 322 iterates through discretized steps of the homotopy parameter λ. At operation 334, a particle state of the N particle states from operation 328 is selected and data can be passed through the selected NN. A result of operation 334 can be a prediction. At operation 336 the gradients (derivatives) and Hessians are calculated. Drifts and diffusions are determined based on the gradients and Hessians at operation 338. At operation 340, particle states are updated.

Particle flow uses numerical integration to integrate Eqn. (15). An interpolation of the scalar homotopy parameter (λ={λ_k: k∈[1, N_λ]}, at operation 332) for the numerical integration can be based on a linear, log, or adaptive scale. The number of divisions N_λ of the scalar homotopy parameter can be chosen to balance integration accuracy against algorithm efficiency. While there are several different methods of numerical integration for SDEs, the Euler-Maruyama method can be used for computational efficiency.
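The optimization loop described above can be sketched end to end for the simplest case. The example below is a sketch under stated assumptions, not the implementation of the embodiments: it assumes a single-layer linear model with a scalar output, a Gaussian likelihood with an assumed variance `r`, the constant diffusion Q=αId of Eqn. 19, a linear λ grid, Euler-Maruyama integration, and synthetic data. All variable names are hypothetical. Because the Hessians of this linear model do not depend on the particle, the per-particle loop is vectorized.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0])               # hypothetical ground-truth weights
X = rng.normal(size=(200, 2))                    # synthetic inputs
Y = X @ theta_true + 0.05 * rng.normal(size=200) # synthetic truth values

n_particles, r, alpha = 100, 0.25, 1e-4          # assumed settings
lambdas = np.linspace(0.0, 1.0, 21)              # linear homotopy grid
particles = rng.normal(size=(n_particles, 2))    # sample the prior N(0, Id)

for x, y in zip(X, Y):                           # one measurement update per data point
    # Current particle distribution becomes the prior; estimate its covariance.
    gamma_inv = np.linalg.inv(np.cov(particles, rowvar=False))
    hess_logprior = -gamma_inv                   # Hessian of a Gaussian log-density
    hess_loglik = -np.outer(x, x) / r            # exact for the linear model
    for k in range(len(lambdas) - 1):
        lam, dlam = lambdas[k], lambdas[k + 1] - lambdas[k]
        resid = y - particles @ x                # truth minus each particle's prediction
        grad = (resid / r)[:, None] * x[None, :] # gradient of log-likelihood per particle
        # Gromov drift (Eqn. 18), solved for all particles at once.
        drift = -np.linalg.solve(lam * hess_loglik + hess_logprior, grad.T).T
        # Euler-Maruyama step with Q = alpha * Id.
        noise = np.sqrt(alpha * dlam) * rng.normal(size=particles.shape)
        particles = particles + drift * dlam + noise

theta_mean = particles.mean(axis=0)              # posterior mean estimate of the weights
theta_std = particles.std(axis=0)                # spread across particles
```

In this setting the particle mean moves toward the weights that generated the data while the particle spread contracts. For a multi-layer network the gradients and Hessians depend on each particle and would be computed per particle, for example by automatic differentiation.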

In prediction 324, an output prediction can be provided at operation 342. A marginalized probability distribution of the output prediction given a new input is determined at operation 344. The unmarginalized predictive distribution for an output prediction 𝓅′ given a new input x′, the training data 𝒟, and the particles θ is,

p(𝓅′|x′,𝒟,θ)=Λ_θ(x′)p(θ|𝒟) (Eqn. 21)

Marginalizing Eqn. (21) over all network realizations θ gives the predictive distribution of an output prediction 𝓅′ given a new input x′ and the training data 𝒟,

p(𝓅′|x′,𝒟)=∫p(𝓅′|x′,𝒟,θ)dθ (Eqn. 22)

Embodiments can evaluate Eqn. (22) using a Monte Carlo sampling of the posterior distribution p(θ|𝒟), where the sampling is over all particles {θ_i},

$\begin{matrix} p (𝓅^{'} ❘ x^{'}, D) \approx \frac{1}{N} \sum_{i = 1}^{N} p (𝓅^{'} ❘ x^{'}, D, θ_{i}) = \frac{1}{N} \sum_{i = 1}^{N} Λ_{θ_{i}} (x^{'}) & (Eqn . 23) \end{matrix}$

As is seen, the predictive distribution is a marginalization of the posterior with the network prediction, which is a sum over network parameters.
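The Monte Carlo average of Eqn. 23 can be sketched as below, together with the variance across particles, which the description uses as a measure of predictive uncertainty. The linear stand-in network and the three hand-picked particles are assumptions for illustration.

```python
import numpy as np

def predict_with_particles(x_new, particles, network):
    """Average the network output over particles (Eqn. 23) and report the spread."""
    preds = np.stack([network(x_new, theta) for theta in particles])
    return preds.mean(axis=0), preds.var(axis=0)

def linear_network(x, theta):
    """Hypothetical stand-in for the network: a single linear node."""
    return np.atleast_1d(theta @ x)

# Three particles, each a realization of the two network parameters.
particles = np.array([[1.0, -2.0], [1.2, -1.8], [0.8, -2.2]])
mean, var = predict_with_particles(np.array([1.0, 1.0]), particles, linear_network)
```

A large variance across particles for a given input indicates that the realizations of the network disagree, which can flag inputs on which the prediction is less reliable.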

Mini-Batch Particle Flow

One issue with training particle flow is that it tends to be sensitive to each measurement (particle update). A data input outlier can drive the particle distribution in the wrong direction during a measurement update, leading to a large jump in the predictive accuracy. In some cases, the BNN being trained does not recover from such a detour. Another issue with training particle flow is that training a particle flow BNN takes longer than training a standard NN (non-Bayesian NN), which can process “mini-batches” of data for each weight update. In contrast, the particle flow optimization procedure is formulated to only process one data point at a time, which prevents any sort of batch processing. This is because particle filters typically provide a state-transition update at a specific time followed by a measurement update conditioned on this specific time. The coupling between the state-transition and measurement in time necessitates processing measurements one at a time. Thus, there is no reason for particle flow, used exclusively in the context of particle filters up until recently, to process more than one measurement at a time.

FIG. 4 illustrates, by way of example, a plot of accuracy versus measurement update for a BNN being trained based on MNIST {0, 1} using the training particle flow described regarding FIGS. 1-3. In this example, the accuracy of the BNN drops at several measurement updates. The BNN does recover from these reductions in the example of FIG. 4.

FIG. 5 illustrates, by way of example, a plot of accuracy versus measurement update for a BNN being trained on MNIST {0, 1, 2, 3}. In the example of FIG. 5, the accuracy of the BNN experiences a large reduction in accuracy between measurements 280 and 300. The BNN in the example of FIG. 5 does not recover from this reduction, and the network parameters maintain values that produce predictions with low accuracy.

A mini-batch training particle flow BNN is now described. This mini-batch particle flow BNN training formulation retains a core training particle flow optimization procedure, but modifies the training particle flow framework to accommodate mini-batch processing of data. The use of mini-batches in stochastic gradient descent type training is a well-established practice for training NNs. However, the use of mini-batches in a training particle flow BNN, or even particle flow itself, has not been done before to the best of the inventors' knowledge. Results on MNIST {0,1} using mini-batch particle flow demonstrate that being able to process more data for a single update greatly increases the training speed and accuracy of the resulting model. Using mini-batches in training particle flow helps avoid the reductions in accuracy experienced when using particle updates that are performed based on singular inputs. However, modifications to training particle flow are required to be able to use mini-batches in training particle flow. These modifications, when implemented, reduce training time of a BNN and increase accuracy of a trained BNN.

Consider a mini-batch of data d={x, y_T}, d∈𝒟, with N_mb samples. The joint posterior probability over all N_mb independently distributed data samples in the mini-batch is,

$\begin{matrix} P_{joint} = \prod_{i = 1}^{N_{mb}} p (θ, λ ❘ {x_{i}, y_{T, i}}) = \prod_{i = 1}^{N_{mb}} \frac{{ℒ (y_{T, i} ❘ θ, x_{i})}^{λ} p (θ)}{p (λ, y_{T, i} ❘ x_{i})} & (Eqn . 24) \end{matrix}$

where p(θ, λ|{x_i, y_T,i}) is the posterior distribution for the i-th sample in the mini-batch and λ is the scalar homotopy parameter. Here the same prior distribution p(θ) of particles is assumed for each of the samples in the mini-batch. Taking the logarithm of the joint posterior probability gives,

$\begin{matrix} \log P_{joint} = \sum_{i = 1}^{N_{mb}} \log p (θ, λ ❘ {x_{i}, y_{T, i}}) = [λ \sum_{i = 1}^{N_{mb}} \log ℒ (y_{T, i} ❘ θ, x_{i})] + N_{mb} \log p (θ) - [\sum_{i = 1}^{N_{mb}} \log p (λ, y_{T, i} ❘ x_{i})] & (Eqn . 25) \end{matrix}$

Eqn. 25 can be re-written as:

log P_joint=λ log ℒ_MB+log p_MB(θ)−log K (Eqn. 26)

where the mini-batch log-likelihood is log ℒ_MB=log[Π_(i=1)^(N_mb) ℒ(y_T,i|θ, x_i)]=Σ_(i=1)^(N_mb) log ℒ(y_T,i|θ, x_i), the mini-batch log prior is log p_MB(θ)=N_mb log p(θ), and the mini-batch log normalization constant, which has a zero gradient and a zero Hessian with respect to the network parameters, is log K=Σ_(i=1)^(N_mb) log p(λ, y_T,i|x_i).

By comparing Eqn. 26 to Eqn. 16, one can deduce the drift vector for the entire mini-batch update to be,

$\begin{matrix} {\vec{f}}_{MB} = - {[λ ({\vec{\nabla}}_{θ} {\vec{\nabla}}_{θ}^{T} \log ℒ_{MB}) + ({\vec{\nabla}}_{θ} {\vec{\nabla}}_{θ}^{T} \log p_{MB} (θ))]}^{- 1} {\vec{\nabla}}_{θ} \log ℒ_{MB} . & (Eqn . 27) \end{matrix}$

One can derive this expression by using the general approach laid out in Appendix A of Ref. D. F. Crouse and C. Lewis, “Consideration of Particle Flow Filter Implementations and Biases,” Naval Research Lab, Washington, D.C. (2019) for the single measurement case.

It is important to point out that the drift vector for the mini-batch (Eqn. 27) does not equal the sum of drift vectors (Eqn. 18) over all samples in the mini-batch,

$\begin{matrix} {\vec{f}}_{MB} \neq \sum_{i = 1}^{N_{mb}} {\vec{f}}_{i} . & (Eqn . 28) \end{matrix}$

The gradient operator ${\vec{\nabla}}_{θ}$ is a linear operator. Consequently, the gradient of the mini-batch log-likelihood, ${\vec{\nabla}}_{θ} \log ℒ_{MB}$, is equal to the sum of the gradients for each sample in the mini-batch (i.e., ${\vec{\nabla}}_{θ} \log ℒ_{MB} = \sum_{i = 1}^{N_{mb}} {\vec{\nabla}}_{θ} \log ℒ (y_{T, i} ❘ θ, x_{i})$). However, the multiplication of the gradient of the mini-batch log-likelihood with the inverse of the sum of Hessians, ${[λ ({\vec{\nabla}}_{θ} {\vec{\nabla}}_{θ}^{T} \log ℒ_{MB}) + ({\vec{\nabla}}_{θ} {\vec{\nabla}}_{θ}^{T} \log p_{MB} (θ))]}^{- 1}$, breaks the equivalence of Eqn. (27) to the sum of the drift vectors in Eqn. (18). This equivalence is further broken by the non-additivity of the matrix inversion in Eqn. (27). In other words, the sum of the inverses of matrices is not, in general, equal to the inverse of the sum of the matrices. That is, A⁻¹+B⁻¹+C⁻¹+ . . . ≠(A+B+C+ . . . )⁻¹.
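The inequality in Eqn. 28 can be verified with a small numerical example. The sketch below assumes a linear model with a Gaussian likelihood of unit variance and a Gaussian prior with identity covariance; the two samples and all names are hypothetical.

```python
import numpy as np

def drift(lam, grad, hess_loglik, hess_logprior):
    """Gromov drift: -[lam * Hess(log-lik) + Hess(log-prior)]^-1 grad(log-lik)."""
    return -np.linalg.solve(lam * hess_loglik + hess_logprior, grad)

theta = np.zeros(2)                        # one particle
r, lam = 1.0, 0.5                          # assumed noise variance and homotopy value
xs = np.array([[1.0, 0.0], [0.0, 1.0]])    # two samples in the mini-batch
ys = np.array([1.0, 1.0])
hess_logprior = -np.eye(2)                 # Gaussian prior with identity covariance

# Per-sample gradients and Hessians of the log-likelihood for the linear model.
grads = [(y - theta @ x) / r * x for x, y in zip(xs, ys)]
hessians = [-np.outer(x, x) / r for x in xs]

# Sum of single-sample drifts (Eqn. 18 applied to each sample).
sum_of_drifts = sum(drift(lam, g, h, hess_logprior) for g, h in zip(grads, hessians))

# Mini-batch drift (Eqn. 27): sum gradients and Hessians first, then invert once.
n_mb = len(xs)
minibatch_drift = drift(lam, sum(grads), sum(hessians), n_mb * hess_logprior)
```

Here the sum of the single-sample drifts is (2/3, 2/3) while the mini-batch drift is (0.4, 0.4), so the two quantities differ even in this simple case.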

NN Training Procedure Using Mini-batch Training Particle Flow

FIG. 6 illustrates, by way of example, a flow diagram of an embodiment of a mini-batch training particle flow technique. The technique is similar to the training particle flow technique illustrated in FIG. 3 with small alterations in the particle flow optimization 322 resulting in a mini-batch particle flow optimization 658. The particle flow optimization 658 includes selecting a mini-batch of data from the training set at operation 330, iterating through the discretized steps of the homotopy at operation 332, and, at operation 334, determining a batch of predictions.

Various Python ML and AI libraries, such as PyTorch® or TensorFlow®, do not store the individual gradients for each sample in the mini-batch; instead, they either sum or average the gradients by default to increase efficiency and reduce memory usage.

To accommodate these libraries, the mini-batch particle flow training optimization 658 can be adjusted to evolve the average of the log of the joint posterior probability,

$\begin{matrix} \log P_{joint}^{'} = \frac{1}{N_{mb}} \log P_{joint} \sim [\frac{λ}{N_{mb}} \sum_{i = 1}^{N_{mb}} \log ℒ (y_{T, i} ❘ θ, x_{i})] + \log p (θ) - \frac{1}{N_{mb}} \log K & (Eqn . 29) \end{matrix}$

where p(θ) is the prior distribution of the particles prior to a batch update, since $\log p_{MB} (θ) = N_{mb} \log p (θ)$. This physically corresponds to taking the geometric mean of the posterior probabilities for each sample within the mini-batch,

$\begin{matrix} \log P_{joint}^{'} = \log {[P_{joint}]}^{1 / N_{mb}} \rightarrow P_{joint}^{'} = \sqrt[N_{mb}]{\prod_{i = 1}^{N_{mb}} p (θ, λ ❘ \{x_{i}, y_{T, i}\})} . & (Eqn . 30) \end{matrix}$
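The relationship between averaging in log space and the geometric mean can be illustrated with a short Python sketch. The per-sample posterior values below are hypothetical numbers chosen only to demonstrate the identity; they are not results from the described embodiments.

```python
import math

# Hypothetical per-sample posterior values for a single particle (illustrative only).
posteriors = [0.2, 0.5, 0.8, 0.4]
n_mb = len(posteriors)

# Average of the per-sample log posteriors (the quantity evolved in Eqn. 29).
mean_log = sum(math.log(p) for p in posteriors) / n_mb

# N_mb-th root of the product of the posteriors (the right side of Eqn. 30).
geometric_mean = math.prod(posteriors) ** (1.0 / n_mb)

print(math.exp(mean_log), geometric_mean)
```

Exponentiating the averaged log posterior gives the same value as the geometric mean, up to floating-point rounding.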

The drift vector, determined at operation 664, then becomes,

$\begin{matrix} {\vec{f}}_{MB} = - {[\frac{λ}{N_{mb}} ({\vec{\nabla}}_{θ} {\vec{\nabla}}_{θ}^{T} \log ℒ_{MB}) + ({\vec{\nabla}}_{θ} {\vec{\nabla}}_{θ}^{T} \log p (θ))]}^{- 1} (\frac{1}{N_{mb}}) {\vec{\nabla}}_{θ} \log ℒ_{MB} . & (Eqn . 31) \end{matrix}$

The mean of the gradient of $\log ℒ_{MB}$ can be computed in most ML libraries by setting $\log ℒ_{MB}$ as the objective function. However, calculating the mean of the Hessian of the mini-batch log-likelihood deserves careful thought to ensure averaging is performed at the correct time.

Training particle flow can use a Gauss-Newton Hessian approximation to calculate the Hessian of the log-likelihood,

$\begin{matrix} {({\vec{\nabla}}_{θ} {\vec{\nabla}}_{θ}^{T} \log ℒ)}_{rs} \approx \sum_{m} \sum_{k} \frac{\partial 𝓅_{k}^{T}}{\partial θ_{r}} \frac{\partial^{2} \log ℒ}{\partial 𝓅_{k} \partial 𝓅_{m}} \frac{\partial 𝓅_{m}}{\partial θ_{s}}, & (Eqn . 32) \end{matrix}$

where $𝓅_{m}$ is the m-th component of the network prediction and r and s index the entries of the Hessian. Averaging the Hessian of the mini-batch log-likelihood includes averaging the Hessian for each sample in the mini-batch,

$\begin{matrix} \frac{1}{N_{mb}} {({\vec{\nabla}}_{θ} {\vec{\nabla}}_{θ}^{T} \log ℒ_{MB})}_{rs} = {\frac{1}{N_{mb}} [\sum_{i = 1}^{N_{mb}} {\vec{\nabla}}_{θ} {\vec{\nabla}}_{θ}^{T} \log ℒ (y_{T, i} ❘ θ, x_{i})]}_{rs} \approx \frac{1}{N_{mb}} \sum_{i = 1}^{N_{mb}} \sum_{m} \sum_{k} \frac{\partial 𝓅_{k}^{T}}{\partial θ_{r}} \frac{\partial^{2} \log ℒ (y_{T, i} ❘ θ, x_{i})}{\partial 𝓅_{k} \partial 𝓅_{m}} \frac{\partial 𝓅_{m}}{\partial θ_{s}} . & (Eqn . 33) \end{matrix}$

This means that both the Jacobian terms,

$\frac{\partial 𝓅_{m}}{\partial θ_{s}},$

and the Hessian terms,

$\frac{\partial^{2} \log ℒ (y_{T, i} ❘ θ, x_{i})}{\partial 𝓅_{k} \partial 𝓅_{m}},$

are calculated and stored for each sample in the mini-batch at operation 662; the product of these terms, as described in Eqn. (33), is then averaged and used in the particle state update at operation 666.
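The importance of forming the per-sample product before averaging can be illustrated with a small Python sketch. It assumes, purely for illustration, a single-output logistic model with prediction 𝓅 = sigmoid(θ·x) and a Bernoulli log-likelihood, so the sums over m and k in Eqn. (33) collapse to a single term. The function names, parameter values, and data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def per_sample_terms(theta, x, y):
    """Return the Jacobian dp/dtheta and the scalar d^2 log L / dp^2 for one sample."""
    p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    jacobian = [p * (1.0 - p) * xi for xi in x]
    hess_pred = -y / p**2 - (1.0 - y) / (1.0 - p) ** 2
    return jacobian, hess_pred


def mean_gauss_newton_hessian(theta, xs, ys):
    """Average J^T h J over the mini-batch, forming the product per sample first."""
    n_mb, dim = len(xs), len(theta)
    hessian = [[0.0] * dim for _ in range(dim)]
    for x, y in zip(xs, ys):
        jac, h = per_sample_terms(theta, x, y)
        for r in range(dim):
            for s in range(dim):
                hessian[r][s] += jac[r] * h * jac[s] / n_mb
    return hessian


# Assumed illustrative parameters and mini-batch of three samples.
theta = [0.5, -0.25]
xs = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
ys = [1.0, 0.0, 1.0]

H = mean_gauss_newton_hessian(theta, xs, ys)

# Averaging the factors first and multiplying afterwards gives a different matrix.
terms = [per_sample_terms(theta, x, y) for x, y in zip(xs, ys)]
mean_jac = [sum(j[r] for j, _ in terms) / len(terms) for r in range(len(theta))]
mean_h = sum(h for _, h in terms) / len(terms)
H_wrong = [[mean_jac[r] * mean_h * mean_jac[s] for s in range(2)] for r in range(2)]

print("per-sample product, then average:", H)
print("average factors, then product:   ", H_wrong)
```

In this toy case the two matrices differ substantially, which is consistent with storing the Jacobian and Hessian terms per sample and averaging only after their product has been formed.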

The mini-batch particle flow optimization 658 can be summarized as follows:

For each mini-batch of data $d = \{x, y_{T}\}$ in the training set (select mini-batch of data from training set at operation 660):

- Equate the current distribution of particles to the prior distribution of the particles.
- Calculate the covariance of the prior distribution of particles, Γ.
- Loop iteratively through the scalar homotopy parameter $λ \in [0, 1]$ (operation 332)
  - For each $λ_{k}$, $k = 1, 2, \ldots, N_{λ} - 1$
    - Calculate integration step size, $δλ = λ_{k + 1} - λ_{k}$
    - For each particle $θ_{i}$, $i = 1, 2, 3, \ldots, N$: (operation 334)
      - Pass mini-batch of input data $x = \{x_{j}\}$ through network (operation 661) using particle's values to get a batch of predictions, $\{𝓅_{j}\} = Λ_{θ_{i}} (\{x_{j}\})$
      - Calculate mean gradients and Hessians of mini-batch log-likelihood with respect to network parameters (operation 662)
      - Calculate the drift f and diffusion matrix Q (operation 664)
      - Update state of particle using numerical integration of stochastic differential equation (SDE) (Eqn. 15) (operation 666), $θ_{i} = θ_{i} + f δλ + \sqrt{Q δλ} \, n$, $n \sim 𝒩 (0, Id)$
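The loop structure summarized above can be sketched in Python on a one-parameter toy problem. This is a minimal sketch under stated assumptions, not the implementation of the embodiments: the network is replaced by a linear model 𝓅 = θx, the likelihood is assumed Gaussian with unit variance, the prior is assumed Gaussian so that its log-Hessian is −1/Γ, the diffusion term Q is a user-supplied scalar `q` (zero by default), and the homotopy grid is an assumed log-spaced schedule starting at 10⁻³. All function and variable names are hypothetical.

```python
import math
import random


def log_lambda_grid(n_lambda):
    """Log-spaced homotopy grid from 0 to 1 (one possible logarithmic scheme)."""
    exps = [-3.0 + 3.0 * k / (n_lambda - 2) for k in range(n_lambda - 1)]
    return [0.0] + [10.0**e for e in exps]


def minibatch_flow_update(particles, xs, ys, n_lambda=10, q=0.0, rng=None):
    """Evolve scalar particles through one mini-batch update (toy model)."""
    rng = rng or random.Random(0)
    n_mb = len(xs)
    # Covariance (here a scalar variance) of the prior particle distribution.
    mean = sum(particles) / len(particles)
    gamma = sum((t - mean) ** 2 for t in particles) / (len(particles) - 1)
    hess_prior = -1.0 / gamma  # Hessian of an assumed Gaussian log-prior
    # Mean Hessian of log L; constant in theta for this linear-Gaussian toy.
    mean_hess_lik = -sum(x * x for x in xs) / n_mb
    lambdas = log_lambda_grid(n_lambda)
    particles = list(particles)
    for k in range(len(lambdas) - 1):
        lam, d_lam = lambdas[k], lambdas[k + 1] - lambdas[k]
        for i, theta in enumerate(particles):
            preds = [theta * x for x in xs]  # forward pass for the mini-batch
            mean_grad = sum((y - p) * x for x, y, p in zip(xs, ys, preds)) / n_mb
            # Scalar form of Eqn. (31): drift = -[lam * mean Hessian + prior Hessian]^-1 * mean gradient
            drift = -mean_grad / (lam * mean_hess_lik + hess_prior)
            noise = math.sqrt(q * d_lam) * rng.gauss(0.0, 1.0)
            particles[i] = theta + drift * d_lam + noise
    return particles


particles = [0.0, 0.5, 1.0]
xs, ys = [1.0, 2.0], [2.0, 4.0]  # toy data generated by y = 2x
updated = minibatch_flow_update(particles, xs, ys)
print(updated)
```

With `q=0.0` the update is deterministic and each particle moves toward the value θ = 2 that generated the toy data; a non-zero `q` adds the stochastic term of the SDE.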

The following results are for classification of a subset of digits {0, 1} from the Modified National Institute of Standards and Technology (MNIST) database. The MNIST database was created by Yann LeCun, Corinna Cortes, and Christopher J.C. Burges using images from two separate NIST databases. The MNIST database can be accessed at http://yann.lecun.com/exdb/mnist/.

A convolutional NN (CNN) architecture consisting of 2 convolutional layers, each with 4 filters, followed by a dense output layer was instantiated. This network has 286 network parameters. 100 normally distributed particles with an initial covariance of Γ=0.04Id were also instantiated. Numerical integration of the flow was performed using a logarithmic step size with N_λ=10.
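The particle initialization described above can be sketched in Python as follows. The sketch assumes a zero-mean initial distribution, since the text specifies only the covariance, and uses a fixed random seed purely for reproducibility. The variable names are hypothetical.

```python
import random

N_PARTICLES = 100      # number of particles, N
N_PARAMS = 286         # number of network parameters
PRIOR_VARIANCE = 0.04  # diagonal entries of the initial covariance 0.04 * Id

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
std = PRIOR_VARIANCE ** 0.5

# Assumption: zero-mean initial distribution; only the covariance is given in the text.
particles = [
    [rng.gauss(0.0, std) for _ in range(N_PARAMS)]
    for _ in range(N_PARTICLES)
]

# Empirical check that the sampled values have roughly the requested variance.
flat = [v for particle in particles for v in particle]
mean = sum(flat) / len(flat)
variance = sum((v - mean) ** 2 for v in flat) / (len(flat) - 1)
print(len(particles), len(particles[0]), variance)
```

Each inner list is one particle, that is, one pointwise setting of all 286 network parameters; the collection of 100 particles represents the initial parameter distribution.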

TABLE 1

List of Parameters used in Results

| Parameter | Value |
| --- | --- |
| Initial Prior Covariance | Γ = 0.04 Id, Id = Identity |
| Diffusion Matrix Constant | α = 0.1 |
| Interpolation Scheme and # Divisions | Logarithmic Scheme with N_λ = 10 |
| Number of Particles | N = 100 |
| Number of Network Layers | L = 3; 2 Convolutional, 1 Output |
| Number of Network Parameters | N_params = 286 |

Mini-batch training particle flow was implemented to train a BNN with batch updates of N_MB=1, N_MB=2, and N_MB=16. For mini-batch sizes greater than 1, batches contain an equal distribution of classes (e.g., a batch size of 16 contains 8 samples of class 0 and 8 samples of class 1). In these training examples, the smoothness of the mean log-likelihood increases with mini-batch size. Additionally, the divergence of each particle's log-likelihood from the mean decreases with increasing batch size. This implies that individual particles are less susceptible to outliers than when no mini-batches (batch size of 1) are used.

FIG. 7 shows a plot of the network accuracy with increasing batch update for batch sizes of N_MB=1, N_MB=2, and N_MB=16. The plot shows that a batch size of 16 achieves and maintains the highest accuracy, while a batch size of 1 tends to experience dips in accuracy. This study indicates that using a mini-batch of data per parameter update increases both the training speed and the robustness of the approach.

AI is a field concerned with developing decision-making systems to perform cognitive tasks that have traditionally required a living actor, such as a person. NNs are computational structures that are loosely modeled on biological neurons. Generally, NNs encode information (e.g., data or decision making) via weighted connections (e.g., synapses) between nodes (e.g., neurons). Modern NNs are foundational to many AI applications, such as speech recognition.

Many NNs are represented as matrices of weights that correspond to the modeled connections. NNs operate by accepting data into a set of input neurons that often have many outgoing connections to other neurons. At each traversal between neurons, the corresponding weight modifies the input and is passed to an activation function. The result of the activation function is then transmitted to another neuron further down the NN graph. The process of weighting and processing, via activation functions, continues until an output neuron is reached; the pattern and values of the output neurons constitute the result of the NN processing.
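The weighting and activation process described above can be sketched in a few lines of Python. The network shape (two inputs, three hidden neurons, one output), the weight values, and the choice of tanh as the activation function are assumptions made only for illustration.

```python
import math


def forward(inputs, layers):
    """Propagate inputs through fully connected layers given as weight matrices."""
    activations = inputs
    for weights in layers:
        # Each row holds the incoming weights of one neuron in the next layer.
        activations = [
            math.tanh(sum(w * a for w, a in zip(row, activations)))
            for row in weights
        ]
    return activations


# Hypothetical 2-3-1 network with arbitrary illustrative weights.
layers = [
    [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]],  # input layer -> hidden layer
    [[1.0, -1.0, 0.5]],                      # hidden layer -> output neuron
]

output = forward([1.0, 2.0], layers)
print(output)
```

The single value returned by `forward` is the output-neuron value that constitutes the result of the processing for this input.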

The correct operation of most NNs relies on accurate weights. However, NN designers do not generally know which weights will work for a given application. NN designers typically choose a number of neuron layers or specific connections between layers including circular connections. A training process may be used to determine appropriate weights by selecting initial weights. In some examples, the initial weights may be randomly selected. Training data is fed into the NN and results are compared to an objective function that provides an indication of error. The error indication is a measure of how wrong the NN's result is compared to an expected result. This error is then used to correct the weights. Over many iterations, the weights will collectively converge to encode the operational data into the NN. This process may be called an optimization of the objective function (e.g., a cost or loss function), whereby the cost or loss is minimized.

Gradient descent is a common technique for optimizing a given objective (or loss) function. The gradient (e.g., a vector of partial derivatives) of a scalar field gives the direction of steepest increase of this objective function. Therefore, adjusting the parameters in the opposite direction by a small amount decreases the objective function, in general. After performing a sufficient number of iterations, the parameters will tend towards a minimum value. In some implementations, the learning rate (e.g., step size) is fixed for all iterations. However, small step sizes tend to take a long time to converge, whereas large step sizes may oscillate around a minimum value or exhibit other undesirable behavior. Variable step sizes are usually introduced to provide faster convergence without the downsides of large step sizes.
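The effect of the step size can be illustrated with a minimal Python sketch that applies fixed-step gradient descent to the one-dimensional objective f(w) = (w − 3)². The objective, the starting point, and the three learning rates are assumptions chosen only for illustration.

```python
def gradient_descent(w0, learning_rate, n_steps):
    """Minimize f(w) = (w - 3)^2 using a fixed step size."""
    w = w0
    for _ in range(n_steps):
        grad = 2.0 * (w - 3.0)     # derivative of the objective at w
        w -= learning_rate * grad  # step in the direction opposite the gradient
    return w


small = gradient_descent(0.0, 0.01, 50)   # small step: still far from the minimum
medium = gradient_descent(0.0, 0.1, 50)   # moderate step: close to the minimum at w = 3
large = gradient_descent(0.0, 1.1, 50)    # large step: overshoots on every iteration

print(small, medium, large)
```

After the same number of iterations, the small step size has not yet converged, the moderate step size is close to the minimum, and the large step size oscillates with growing amplitude.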

After a forward pass of input data through the neural network, backpropagation provides an economical approach to evaluate the gradient of the objective function with respect to the network parameters. The final output of the network is built from compositions of operations from each layer, which necessitates the chain rule to calculate the gradient of the objective function. Backpropagation exploits the recursive relationship between the derivative of the objective with respect to a layer output and the corresponding quantity from the layer in front of it, starting from the final layer backwards towards the input layer. This recursive relationship eliminates the redundancy of evaluating the entire chain rule for the derivative of the objective with respect to each parameter. Any well-known optimization algorithm for backpropagation may be used, such as stochastic gradient descent (SGD), Adam, etc.
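The recursive use of the chain rule can be illustrated with a small Python sketch. It assumes a hypothetical two-parameter network, h = tanh(w1·x) followed by out = w2·h, with a squared-error objective, and compares the backpropagated gradients against central finite differences. The names and numeric values are illustrative only.

```python
import math


def loss(w1, w2, x, y):
    """Squared-error objective of a two-layer scalar network."""
    h = math.tanh(w1 * x)
    return 0.5 * (w2 * h - y) ** 2


def backprop(w1, w2, x, y):
    """Return (dL/dw1, dL/dw2) using one forward and one backward pass."""
    # Forward pass, keeping the intermediate layer outputs.
    h = math.tanh(w1 * x)
    out = w2 * h
    # Backward pass, starting from the final layer.
    d_out = out - y                    # dL/d(out)
    grad_w2 = d_out * h                # dL/dw2
    d_h = d_out * w2                   # dL/dh, reused for the earlier layer
    grad_w1 = d_h * (1.0 - h * h) * x  # dL/dw1 via the tanh derivative
    return grad_w1, grad_w2


w1, w2, x, y = 0.3, -0.7, 1.5, 0.2
g1, g2 = backprop(w1, w2, x, y)

# Central finite differences as an independent check of the gradients.
eps = 1e-6
fd1 = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
fd2 = (loss(w1, w2 + eps, x, y) - loss(w1, w2 - eps, x, y)) / (2 * eps)
print(g1, fd1)
print(g2, fd2)
```

The quantity `d_h` is computed once from the final layer and reused for the earlier layer, which is the redundancy that backpropagation removes compared with evaluating the full chain rule separately for each parameter.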

FIG. 8 is a block diagram of an example of an environment including a system for NN training, according to an embodiment. The system can aid in training of a cyber security solution according to one or more embodiments. The system includes an artificial NN (ANN) 805 that is trained using a processing node 810. The processing node 810 may be a central processing unit (CPU), graphics processing unit (GPU), field programmable gate array (FPGA), digital signal processor (DSP), application specific integrated circuit (ASIC), or other processing circuitry. In an example, multiple processing nodes may be employed to train different layers of the ANN 805, or even different nodes 807 within layers. Thus, a set of processing nodes 810 is arranged to perform the training of the ANN 805.

The set of processing nodes 810 is arranged to receive a training set 815 for the ANN 805. The ANN 805 comprises a set of nodes 807 arranged in layers (illustrated as rows of nodes 807) and a set of inter-node weights 808 (e.g., parameters) between nodes in the set of nodes. In an example, the training set 815 is a subset of a complete training set. Here, the subset may enable processing nodes with limited storage resources to participate in training the ANN 805.

The training data may include multiple numerical values representative of a domain, such as a word, symbol, other part of speech, or the like. Each value of the training set 815, or of the input 817 to be classified once the ANN 805 is trained, is provided to a corresponding node 807 in the first layer or input layer of the ANN 805. The values propagate through the layers and are changed by the objective function.

As noted above, the set of processing nodes is arranged to train the neural network to create a trained neural network. Once trained, data input into the ANN will produce valid classifications 820 (e.g., the input data 817 will be assigned into categories), for example. The training performed by the set of processing nodes 810 is iterative. In an example, each iteration of training the neural network is performed independently between layers of the ANN 805. Thus, two distinct layers may be processed in parallel by different members of the set of processing nodes. In an example, different layers of the ANN 805 are trained on different hardware. Different members of the set of processing nodes may be located in different packages, housings, computers, cloud-based resources, etc. In an example, each iteration of the training is performed independently between nodes in the set of nodes. This example is an additional parallelization whereby individual nodes 807 (e.g., neurons) are trained independently. In an example, the nodes are trained on different hardware.

FIG. 9 illustrates, by way of example, a block diagram of an embodiment of a machine in the example form of a computer system 900 within which instructions, for causing the machine to perform any one or more of the methodologies discussed herein, may be executed. In a networked deployment, the machine may operate in the capacity of a server or a client machine in server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine may be a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.

The example computer system 900 includes a processor 902 (e.g., a central processing unit (CPU), a graphics processing unit (GPU) or both), a main memory 904 and a static memory 906, which communicate with each other via a bus 908. The computer system 900 may further include a video display unit 910 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)). The computer system 900 also includes an alphanumeric input device 912 (e.g., a keyboard), a user interface (UI) navigation device 914 (e.g., a mouse), a mass storage unit 916, a signal generation device 918 (e.g., a speaker), a network interface device 920, and a radio 930 such as Bluetooth, WWAN, WLAN, and NFC, permitting the application of security controls on such protocols.

The mass storage unit 916 includes a machine-readable medium 922 on which is stored one or more sets of instructions and data structures (e.g., software) 924 embodying or utilized by any one or more of the methodologies or functions described herein. The instructions 924 may also reside, completely or at least partially, within the main memory 904 and/or within the processor 902 during execution thereof by the computer system 900, the main memory 904 and the processor 902 also constituting machine-readable media.

While the machine-readable medium 922 is shown in an example embodiment to be a single medium, the term “machine-readable medium” may include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more instructions or data structures. The term “machine-readable medium” shall also be taken to include any tangible medium that is capable of storing, encoding, or carrying instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present invention, or that is capable of storing, encoding, or carrying data structures utilized by or associated with such instructions. The term “machine-readable medium” shall accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media. Specific examples of machine-readable media include non-volatile memory, including by way of example semiconductor memory devices, e.g., Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.

The instructions 924 may further be transmitted or received over a communications network 926 using a transmission medium. The instructions 924 may be transmitted using the network interface device 920 and any one of a number of well-known transfer protocols (e.g., HTTP). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), the Internet, mobile telephone networks, Plain Old Telephone Service (POTS) networks, and wireless data networks (e.g., WiFi and WiMax networks). The term “transmission medium” shall be taken to include any intangible medium that is capable of storing, encoding, or carrying instructions for execution by the machine, and includes digital or analog communications signals or other intangible media to facilitate communication of such software.

ADDITIONAL NOTES AND EXAMPLES

Example 1 includes a method for training a Bayesian neural network (BNN) using batched inputs and operating the trained BNN, the method comprising initializing particles such that each particle individually represents pointwise values of respective NN parameters of NNs and that collectively represent a distribution of parameters of the BNN, optimizing, using mini-batch training particle flow, the particles based on batches of inputs, resulting in optimized distributions for the parameters, determining a prediction distribution using the optimized distributions for the parameters and predictions from each of the NNs, and providing a marginalized distribution representative of the prediction distribution.

In Example 2, Example 1 can further include, wherein mini-batch training particle flow includes iteratively evolving values of the network parameters based on a log-homotopy.

In Example 3, Example 2 can further include, wherein the mini-batch training particle flow includes evolving the average of a log of the joint posterior probability.

In Example 4, at least one of Examples 2-3 can further include, wherein the mini-batch training particle flow includes determining, for each batch within the training set, a geometric mean of posterior probabilities for each input within the batch.

In Example 5, at least one of Examples 3-4 can further include, wherein evolving the average includes averaging, for each batch within the training set, a Hessian matrix for each input within the batches.

In Example 6, Example 5 can further include, wherein averaging the Hessian matrix includes storing, for each input within the batch, a corresponding Hessian matrix term and a Jacobian term.

In Example 7, Example 6 can further include, wherein averaging the Hessian matrix includes determining, for each input within the batch, a product of the Hessian matrix term and the Jacobian term in the Gauss-Newton approximation resulting in product results, and averaging the product results resulting in an average of the Hessian matrix.

Example 8 includes a system including processing circuitry and memory coupled to the processing circuitry, the memory including instructions that, when executed by the processing circuitry, cause the processing circuitry to perform the method of one of Examples 1-7.

Example 9 includes a non-transitory machine-readable medium including instructions stored thereon that, when executed by a machine, cause the machine to perform the method of one of Examples 1-7.

Although an embodiment has been described with reference to specific example embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the invention. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense. The accompanying drawings that form a part hereof, show by way of illustration, and not of limitation, specific embodiments in which the subject matter may be practiced. The embodiments illustrated are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed herein. Other embodiments may be utilized and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. This Detailed Description, therefore, is not to be taken in a limiting sense, and the scope of various embodiments is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.
