PROBABILISTIC NUMERIC CONVOLUTIONAL NEURAL NETWORKS

Information

  • Patent Application
  • Publication Number: 20220108173
  • Date Filed: September 30, 2021
  • Date Published: April 07, 2022
Abstract
Certain aspects of the present disclosure provide techniques for performing operations with a probabilistic numeric convolutional neural network, including: defining a Gaussian Process based on a mean and a covariance of input data; applying a linear operator to the Gaussian Process to generate pre-activation data; applying a nonlinear operation to the pre-activation data to form activation data; and applying a pooling operation to the activation data to generate an inference.
Description
INTRODUCTION

Aspects of the present disclosure relate to probabilistic numeric convolutional neural networks.


Machine learning is generally the process of producing a trained model (e.g., an artificial neural network, a tree, or other structures), which represents a generalized fit to a set of training data. Applying the trained model to new data produces inferences, which may be used to gain insights into the new data.


Machine learning models are seeing increased adoption across myriad domains, including for use in classification, detection, and recognition tasks. For example, machine learning models are being used to perform complex tasks on electronic devices based on sensor data provided by one or more sensors onboard such devices, such as automatically detecting features (e.g., faces) within images.


One particularly powerful type of machine learning model is the convolutional neural network (CNN) model, which is a type of deep neural network model that can be trained to identify various features in input data. CNNs typically rely on kernels or filters that are strided across a grid of input data, such as the grid formed by pixels in an image, through various layers of the CNN. Inherent in the conventional design of a CNN, then, is that the input data will be sampled regularly, such as in rectangular grids of input image data.


Unfortunately, not all input data is regularly sampled. For example, continuous input signals, like time series, that are irregularly sampled or which have missing values are challenging for existing deep learning model architectures, such as CNNs.


Accordingly, methods are needed to improve the performance of CNNs when processing continuous input data.


BRIEF SUMMARY

Certain aspects provide a method for performing operations with a probabilistic numeric convolutional neural network, including: defining a Gaussian Process based on a mean and a covariance of input data; applying a linear operator to the Gaussian Process to generate pre-activation data; applying a nonlinear operation to the pre-activation data to form activation data; and applying a pooling operation to the activation data to generate an inference.


Other aspects provide processing systems configured to perform the aforementioned methods as well as those described herein; non-transitory, computer-readable media comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods as well as those described herein; a computer program product embodied on a computer readable storage medium comprising code for performing the aforementioned methods as well as those further described herein; and a processing system comprising means for performing the aforementioned methods as well as those further described herein.


The following description and the related drawings set forth in detail certain illustrative features of one or more aspects.





BRIEF DESCRIPTION OF THE DRAWINGS

The appended figures depict certain aspects of the one or more aspects and are therefore not to be considered limiting of the scope of this disclosure.



FIG. 1 depicts an example of a process for training a probabilistic numeric convolutional neural network.



FIG. 2 depicts an example method for training a probabilistic numeric neural network.



FIG. 3 depicts an example processing system configured to perform the various methods described herein.





To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the drawings. It is contemplated that elements and features of one aspect may be beneficially incorporated in other aspects without further recitation.


DETAILED DESCRIPTION

Aspects of the present disclosure provide apparatuses, methods, processing systems, and computer-readable mediums for applying probabilistic numerics to convolutional neural networks (CNNs) to improve such models' ability to process continuous input data, including irregularly sampled input data.


Continuous input signals, like time series that are irregularly sampled or have missing values, are challenging for existing deep learning methods. One reason for this is that coherently defined feature representations generally depend on the values in unobserved regions of the input of irregularly sampled data. To overcome this issue, probabilistic numeric convolutional neural networks are described herein, which represent features as Gaussian processes, providing a probabilistic description of discretization error. Such probabilistic numeric convolutional neural networks define a convolutional layer as the evolution of a partial differential equation defined on a Gaussian process, followed by a nonlinear operation. Probabilistic numeric convolutional neural networks yield significant reductions in error from the previous state of the art on well-known datasets, such as SuperPixel-MNIST.


Standard convolutional neural networks are defined on a regular input grid. For continuous signals, the elements of this grid correspond to regular samples of an underlying function ƒ defined on a continuous domain. In such cases, the standard convolutional layer of a neural network is a numerical approximation of a continuous convolution operator A.


Coherently defined networks on continuous functions should only depend on the input function ƒ, and not on spurious shortcut features, such as the sampling locations or sampling density, which enable overfitting and reduce robustness to changes in the sampling procedure. Each application of A in a standard neural network incurs some discretization error which is determined by the sampling resolution. In some sense, this error is unavoidable because the features ƒ^(ℓ) at the layers ℓ depend on the values of the input function ƒ at regions that have not been observed. For input signals which are sampled at a low resolution, or even sampled irregularly (e.g., such as with the sporadic measurements of patient vitals data in ICUs or dispersed sensors for measuring ocean currents), this discretization error cannot be neglected. Simply filling in the missing data with zeros or imputing the values is not sufficient, since many different imputations are possible, each of which can affect the outcomes of the network.


Probabilistic numerics is an emergent field that studies discretization errors in numerical algorithms using probability theory. As described herein, probabilistic numerics may be built upon to quantify the dependence of a model (e.g., a neural network) on the regions in the input which are unknown, and to integrate this uncertainty into the computation of the model. To do so, the discretely evaluated feature maps {ƒ^(ℓ)(xi)} are replaced with Gaussian processes: distributions over the continuous functions ƒ^(ℓ) that track the most likely values as well as the uncertainty. Beneficially, this Gaussian process feature representation need not resort to discretizing the convolution operator A as in a standard convolutional neural network; instead the continuous convolution operator may be applied directly. If a given feature is a Gaussian process, then applying linear operators yields a new Gaussian process with transformed mean and covariance functions. The dependence of Aƒ on regions of ƒ that are not known translates into the uncertainty represented in the transformed covariance function, the analogue of the discretization error in a convolutional neural network, which is now tracked explicitly. The resulting model, as described further herein, may be referred to as a probabilistic numeric convolutional neural network (PNCNN).


Probabilistic Numerics

Probabilistic numeric convolutional neural networks, as described herein, leverage probabilistic numerics, in which the errors in numerical algorithms are modeled probabilistically, typically with a Gaussian process. In this framework, only a finite number of input function calls can be made, and therefore the numerical algorithm can be viewed as an autonomous agent which has epistemic uncertainty over the values of the input. One example is the Bayesian Monte Carlo model, in which a Gaussian process is used to model the error in the numerical estimation of an integral and to optimally select a rule for its computation. Probabilistic numerics has been applied to numerical problems, such as the inversion of a matrix, the solution of an ordinary differential equation, a meshless solution to boundary value partial differential equations, and other numerical problems.


Gaussian Processes

Probabilistic numeric convolutional neural networks, as described herein, operate on a continuous function ƒ(x) underlying the input, based on a collection of the values of that function sampled at a finite number of points x1, . . . , xN. Classical interpolation theory reconstructs ƒ deterministically by assuming a certain structure of the signal in the frequency domain. Gaussian processes (GPs) give a way of modeling beliefs about values that have not been observed. These beliefs are encoded into a prior covariance k of the GP, ƒ~GP(0, k), and updated with Bayesian inference upon seeing data. Explicitly, given a set of sampling locations x={x1, . . . , xN} and noisy observations y={y1, . . . , yN} with yi~N(ƒ(xi), σi²), the posterior distribution ƒ|y, x~GP(μp, kp) may be computed using Bayes' rule, which captures the epistemic uncertainty about the values between observations. The posterior mean μp(x) and covariance kp(x, x′) are given by:





μp(x)=k(x)^T[K+S]^(−1)y,  kp(x,x′)=k(x,x′)−k(x)^T[K+S]^(−1)k(x′),  (1)


where Kij=k(xi, xj), k(x)i=k(x, xi), and S=diag(σi²).
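

For illustration, the following is a minimal NumPy sketch of the posterior computation in Equation 1 for a one-dimensional input, using the RBF prior covariance described below and a heteroscedastic noise model. The function names (rbf_kernel, gp_posterior) and the toy data are illustrative assumptions and are not part of the disclosure.

```python
import numpy as np

def rbf_kernel(x, xp, a=1.0, l=0.2):
    """RBF kernel a*N(x; x', l^2) for 1-D inputs (the kernel given below)."""
    d2 = (x[:, None] - xp[None, :]) ** 2
    return a * (2 * np.pi * l**2) ** -0.5 * np.exp(-0.5 * d2 / l**2)

def gp_posterior(x_obs, y_obs, noise_var, x_query, a=1.0, l=0.2):
    """Posterior mean and covariance of Equation 1 with heteroscedastic noise S."""
    K = rbf_kernel(x_obs, x_obs, a, l)          # K_ij = k(x_i, x_j)
    S = np.diag(noise_var)                      # S = diag(sigma_i^2)
    k_q = rbf_kernel(x_query, x_obs, a, l)      # k(x)_i = k(x, x_i)
    mu_p = k_q @ np.linalg.solve(K + S, y_obs)
    k_p = rbf_kernel(x_query, x_query, a, l) - k_q @ np.linalg.solve(K + S, k_q.T)
    return mu_p, k_p

# Irregularly sampled, noisy observations of an underlying 1-D signal
x_obs = np.array([0.0, 0.07, 0.31, 0.55, 0.9])
y_obs = np.sin(2 * np.pi * x_obs)
mu, cov = gp_posterior(x_obs, y_obs, np.full_like(x_obs, 1e-2), np.linspace(0, 1, 50))
```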


In some aspects, a radial basis function (RBF) kernel (k_RBF) may be used to determine a prior covariance, due to its convenient analytical properties. For example:








k_RBF(x, x′) = a N(x; x′, l²I) = a(2πl²)^(−d/2) exp(−‖x − x′‖²/(2l²)).







In typical applications of GPs to machine learning tasks, such as regression, the function ƒ that is predicted is already the regression model. In contrast, here GPs are used as a way of representing beliefs and epistemic uncertainty about the values of both the input function and the intermediate feature maps of a model (e.g., a deep neural network model).


Probabilistic Numeric Convolutional Neural Networks

Given a continuous input signal ƒ: X→ℝ^c, a network may be defined whose layers act directly on this continuous input signal. In one aspect, a neural network may be defined recursively from the input ƒ^(0)=ƒ, as a series of L continuous convolutions A^(ℓ) with pointwise nonlinearities (e.g., ReLU) and weight matrices M^(ℓ)∈ℝ^(c×c) that mix only channels (known as 1×1 convolutions) according to:






ƒ^(ℓ+1)=M^(ℓ)ReLU[A^(ℓ)ƒ^(ℓ)],  (2)


A final global average pooling layer P may be added that acts channel-wise as a natural generalization of the discrete case: (Pƒ^(L))α=∫ƒα^(L)(x)dx for each α=1, 2, . . . , c. Denoting the space of functions on X with c channels by Hc, the convolution operators A^(ℓ) are linear operators from Hc to Hc. Like in ordinary convolutional neural networks, the layers build up increasingly more expressive spatial features and depend on the parameters in A^(ℓ) and M^(ℓ). Unlike ordinary convolutional networks, these layers are well-defined operations on the underlying continuous signal.


While it is clear that such a network can be defined abstractly, the exact values of the function ƒ^(L) generally cannot be computed, as the operators depend on unknown values of the input. However, by adopting a probabilistic description, it is possible to formulate ignorance of ƒ^(0) with a Gaussian process and see how the uncertainties propagate through the layers of the network, yielding a probabilistic output. The following briefly describes important components of Equation 2 that make this possible, with more detailed descriptions below.


Continuous convolution operators A^(ℓ) in Equation 2 can be applied to an input Gaussian process ƒ~GP(μp, kp) in closed form. The output is another Gaussian process with a transformed mean and covariance: Aƒ~GP(Aμp, A kp A′), where A′ acts to the left on the primed argument of kp(x, x′). Below, these continuous convolutions are parametrized in terms of the flow of a partial differential equation, and it is shown how they can be applied to the radial basis function kernel exactly in closed form.


Applying a ReLU nonlinearity to a Gaussian process in Equation 2 yields a new non-Gaussian stochastic process h^(ℓ)=ReLU[A^(ℓ)ƒ^(ℓ)], and the mean and covariance of this process have a closed-form solution which can be computed. This may generally be referred to as a probabilistic ReLU function.


The activations h^(ℓ) in Equation 2 are not Gaussian; however, for a large number of weakly dependent channels, it can be shown that ƒ^(ℓ+1)=M^(ℓ)h^(ℓ) is approximately distributed as a Gaussian process, as described further below.


While ƒ^(ℓ+1) in Equation 2 is approximately a Gaussian process, the mean and covariance functions have a complicated form. Instead of using these functions directly, it is possible to take measurements of the mean and variance of this process and feed them in as noisy observations to a fresh radial basis function kernel Gaussian process, allowing the process to be repeated and multiple layers to be built up without increasing complexity.


In some aspects, the Gaussian process feature maps in the final layer ƒ^(L) are aggregated spatially by an integral pooling P that can also be applied in closed form to yield a Gaussian output. Assembling these components allows implementation of an end-to-end trainable probabilistic numeric convolutional neural network, which integrates a probabilistic description of missing data and discretization error inherent to continuous signals.
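

For illustration, the following sketch numerically approximates the integral pooling of a single-channel Gaussian process feature map on a one-dimensional grid. The disclosure applies the pooling in closed form; this quadrature version merely illustrates that the pooled output is Gaussian with mean ∫μ(x)dx and variance ∫∫k(x,x′)dx dx′. The example mean and covariance functions are arbitrary.

```python
import numpy as np

# Example mean and covariance of a single-channel GP feature map on a grid
x = np.linspace(0.0, 1.0, 200)
mu = np.sin(2 * np.pi * x)
K = 0.1 * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.05**2)

# (P f)_alpha = integral of f_alpha(x) dx is a linear functional of the GP, so the
# pooled output is Gaussian with mean int mu(x) dx and variance int int k(x, x') dx dx'.
pooled_mean = np.trapz(mu, x)
pooled_var = np.trapz(np.trapz(K, x, axis=1), x)
```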


Continuous Convolutional Layers

On a discrete domain, such as the lattice X=ℤ^d, all translation equivariant linear operators A are convolutions. In general, these convolutions can be written in terms of a linear combination of powers of the generators of the translation group: the shift operators τi, i=1, . . . , d, which shift all elements by one unit along the i-th axis of the grid. For a one-dimensional grid, one can always write A=Σk Wk τ^k, where the weight matrices Wk∈ℝ^(c×c) act only on the channels and the shift operator τ acts on functions on the lattice. In d dimensions, A=Σ W(k1, . . . , kd) τ1^(k1) · · · τd^(kd) for some set of integer coefficients k1, . . . , kd. For example, when d=2, k1, k2∈{−1, 0, 1} can be taken to fill out a 3×3 neighborhood. A concrete single-channel check of this construction is sketched below.
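

As a concrete check of the discrete case, the following sketch builds A=Σk wk τ^k for a single channel on a one-dimensional periodic grid and verifies that it coincides with an ordinary (circular) convolution; the filter taps, the shift direction, and the use of periodic boundaries are illustrative assumptions.

```python
import numpy as np

# A = sum_k w_k tau^k for a single channel on a periodic 1-D grid,
# built from shift operators (np.roll) with a 3-tap filter.
rng = np.random.default_rng(0)
f = rng.normal(size=12)
w = {-1: 0.25, 0: 0.5, 1: 0.25}
Af = sum(w_k * np.roll(f, k) for k, w_k in w.items())

# The same result via the circular convolution theorem, confirming that the
# operator is an ordinary (circular) convolution.
kernel = np.zeros_like(f)
for k, w_k in w.items():
    kernel[k % len(f)] = w_k
assert np.allclose(Af, np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(kernel))))
```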


On the continuous domain X=ℝ^d, convolutions may be similarly parametrized as A=Σk Wk e^(Dk), where Dk is given by powers of the partial derivatives ∂i, i=1, . . . , d, that generate infinitesimal translations along the i-th axes. Setting d=1 for simplicity, it can be verified that the operator exponential τα=e^(α∂x) applied to a function g(x) is a translation:






e^(α∂x)g(x)=g(x)+αg′(x)+½α²g″(x)+ . . . =g(x+α),


which is the Taylor series expansion of g(x+α) around x. Exponentials of operators can be defined similarly in terms of the formal Taylor series e^D=Σ(k=0 to ∞) D^k/k!, or more broadly as the solution to the partial differential equation:





tg(t,x)=(Dg)(t,x),g(0,x)=g(x)  (3)


at time t=1: e^(D)g(x)=g(t=1,x).
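

The translation property of the operator exponential can be checked directly on a polynomial, for which the Taylor series terminates. The following short sketch (with an arbitrary cubic g and shift α, both chosen only for illustration) verifies that e^(α∂x)g(x)=g(x+α).

```python
import numpy as np
from math import factorial

# g(x) = 1 - 2x + 0.5x^2 + 3x^3; the Taylor series of e^{alpha d/dx} terminates
# for a cubic, so the operator exponential reproduces g(x + alpha) exactly.
g = np.polynomial.Polynomial([1.0, -2.0, 0.5, 3.0])
alpha, x = 0.7, 1.3
series = g(x) + sum(alpha**k / factorial(k) * g.deriv(k)(x) for k in range(1, 4))
assert np.isclose(series, g(x + alpha))
```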


Following the discussion in the discrete case, translation invariance of Dk imposes that it is expressed in terms of powers of the generators. Collecting the derivatives into the gradient ∇, the general form of Dk can be written as αk+βk^T∇+½∇^TΣk∇+ . . . for any constants αk, vectors βk, matrices Σk, etc. For simplicity, the series may be truncated at second order to get:






Dk=βk^T∇+½∇^TΣk∇,  (4)


where the constants αk, which can be absorbed into the definition of Wk, are omitted. For this choice of Dk, the partial differential equation in Equation 3 is nothing but the diffusion equation with drift βk and diffusion Σk. When discussing rotational equivariance below, a more general form of D is also considered.


The diffusion layer can also be viewed in another way as the infinitesimal generator of an Ito diffusion (a stochastic process). Given an Ito process with constant drift and diffusion, dXt=βdt+Σ^(1/2)dBt, where Bt is a d-dimensional Brownian motion, the time evolution operator can be written via the Feynman-Kac formula as e^(tD)ƒ(x)=E[ƒ(Xt)], where X0=x. In other words, the operator layer e^(D) is the expectation under a parametrized neural stochastic differential equation that is homogeneous and therefore shift invariant. The flow of this stochastic differential equation depends on the drift and diffusion parameters β and Σ.


To recap, a convolution operator may be defined through the general form A=Σk Wk e^(Dk), where the weight matrices Wk∈ℝ^(c×c) mix only channels and e^(Dk) is the forward evolution by one unit of time of the diffusion equation with drift βk and diffusion Σk, containing learnable parameters (Wk, βk, Σk), k=1, . . . , K. The translation equivariance of A follows directly from the fact that the generators commute, ∀k, i: [Dk, ∇i]=0, and therefore [A, τi]=0 (the bracket [a, b]=ab−ba is the commutator of the two operators).


Application on Radial Basis Function Gaussian Processes

Although the application of the linear operator A=Σk Wk e^(Dk) involves the time evolution of a partial differential equation, owing to properties of the radial basis function kernel, the operator may beneficially be applied to an input Gaussian process in closed form. Gaussian processes are closed under linear transformations. For example, given ƒ~GP(μp, kp), the action of A need only be computed on the mean and covariance: Aƒ~GP(Aμp, A kp A′), where A′ is the adjoint with respect to the L²(X) inner product. The application of the time evolution e^(Dk) is a convolution with a Green's function Gk, so Aƒ=Σk Wk e^(Dk)ƒ=Σk Wk Gk*ƒ. In one aspect, the Green's function for Dk=βk^T∇+½∇^TΣk∇ is nothing but the multivariate Gaussian density Gk(x)=N(x; −βk, Σk), according to:






Aƒ=Σk Wk e^(Dk)ƒ=Σk Wk Gk*ƒ=Σk Wk N(x; −βk, Σk)*ƒ.  (5)


In order to apply A to the posterior Gaussian process, the operator need only be applied to the posterior mean and covariance. The posterior mean and covariance in Equation 1 are expressed in terms of k_RBF=aN(x; x′, l²I), and the computation boils down to a convolution of two Gaussians:






e^(tD)k_RBF(x,x′)=N(x;tβ,tΣ)*aN(x;x′,l²I)=aN(x;x′−tβ,l²I+tΣ),  (6)


e^(tD1)k_RBF(x,x′)e^(tD2)′=aN(x;x′−t(β1−β2),l²I+tΣ1+tΣ2).  (7)


The application of the channel mixing matrices Wk and the summation is also straightforward through matrix multiplication for the mean and covariance. To summarize, because of the closed-form action on the radial basis function kernel, the layer can be implemented efficiently and exactly, with no discretization or approximations.
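

The closed-form action of Equation 6 can be verified numerically. The sketch below compares the closed-form transformed kernel aN(x; x′−tβ, l²+tΣ) against a brute-force quadrature of the convolution with the Green's function, using the convention Gk(x)=N(x; −βk, Σk) given above; the parameter values are arbitrary and the helper name diffused_rbf is illustrative.

```python
import numpy as np
from scipy.stats import norm

def diffused_rbf(x, x_prime, a, l2, beta, sigma, t=1.0):
    """Closed form of Equation 6 in 1-D: e^{tD} k_RBF(x, x') = a N(x; x'-t*beta, l^2+t*sigma)."""
    return a * norm.pdf(x, loc=x_prime - t * beta, scale=np.sqrt(l2 + t * sigma))

# Brute-force check: convolving the Green's function G(v) = N(v; -t*beta, t*sigma)
# with the RBF kernel by quadrature matches the closed form above.
a, l2, beta, sigma, t = 2.0, 0.09, 0.4, 0.05, 1.0
x, x_prime = 0.7, 0.1
u = np.linspace(-10.0, 10.0, 20001)
integrand = (norm.pdf(x - u, loc=-t * beta, scale=np.sqrt(t * sigma))
             * a * norm.pdf(u, loc=x_prime, scale=np.sqrt(l2)))
assert abs(np.trapz(integrand, u) - diffused_rbf(x, x_prime, a, l2, beta, sigma, t)) < 1e-4
```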


Note with respect to the Green's function above that the action of A encompasses the ordinary convolution operator on the 2-D lattice as a special case. For example, given drift βk∈{−1, 0, 1}², k=1, . . . , 9, filling out the 9 elements of a 3×3 grid, and taking the diffusion Σk→0, the Green's function is a Dirac delta, so that:






Aƒ(x)=Σk Wk δ(x−βk)*ƒ(x)=Σ(i,j=−1,0,1) Wij ƒ(x1−i, x2−j)=(W*ƒ)(x)


General Equivariance

The convolutional layers discussed so far are translation equivariant, but it is possible to extend the continuous linear operator layers to more general symmetries, such as rotations. Feature fields in this more general case are described by tensor fields, where the symmetry group acts not only on the input space X but also on the vector space attached to each point x∈X. A linear layer A is equivariant if its action commutes with that of the symmetry. It is possible to derive constraints for general linear operators and symmetries, which generalize those known in the context of steerable convolutional neural networks.


Probabilistic Nonlinearities and Rectified Gaussian Processes

It is possible to derive the mean and variance for a univariate rectified Gaussian distribution for use in a neural network. This can then be generalized to the full covariance function (and higher moments) of a rectified Gaussian process.


For example, for an input GP Aƒ(x)~GP(μ(x), k(x, x′)), the standard deviation may be denoted σ(x)=√(k(x, x)), the matrix Σ has components Σij=k(xi, xj) for i, j=1, 2, and the mean is μ=[μ(x1), μ(x2)]. The notation Φ(z) may be used for the univariate standard normal cumulative distribution function (CDF), and Φ(z; Σ) for the (two-dimensional) multivariate CDF of N(0, Σ) evaluated at z. Σ1 and Σ2 are the column vectors of Σ. The first and second moments of h=ReLU[Aƒ] are:






E[h(x)]=μ(x)Φ(μ(x)/σ(x))+σ(x)Φ′(μ(x)/σ(x)),  (8)






E[h(x1)h(x2)]=(k(x1,x2)+μ(x1)μ(x2))Φ(μ;Σ)+(μ(x1)Σ2^T+μ(x2)Σ1^T)∇Φ(μ;Σ)+Σ1^T∇∇^TΦ(μ;Σ)Σ2,  (9)


where ∇∇^TΦ denotes the Hessian of Φ with respect to the first argument. The first and higher order derivatives of the normal CDF are just the probability density function (PDF) and products of the PDF with Hermite polynomials. Note that the mean and covariance interact through the nonlinearity.
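

For the elementwise case, Equation 8 together with the corresponding second moment gives the mean and variance of the rectified Gaussian in closed form. The following sketch implements these two moments and checks them against Monte Carlo sampling; the full cross-covariance of Equation 9 is not implemented here, and the function name and test values are illustrative.

```python
import numpy as np
from scipy.stats import norm

def rectified_gaussian_moments(mu, var):
    """Mean and variance of h = ReLU[g], g ~ N(mu, var), elementwise (Equation 8
    plus the corresponding second moment); the cross terms of Equation 9 are omitted."""
    sigma = np.sqrt(var)
    z = mu / sigma
    mean = mu * norm.cdf(z) + sigma * norm.pdf(z)
    second = (mu**2 + var) * norm.cdf(z) + mu * sigma * norm.pdf(z)
    return mean, second - mean**2

# Monte Carlo sanity check
rng = np.random.default_rng(0)
mu, var = -0.3, 0.25
samples = np.maximum(rng.normal(mu, np.sqrt(var), 1_000_000), 0.0)
m, v = rectified_gaussian_moments(mu, var)
assert abs(m - samples.mean()) < 1e-2 and abs(v - samples.var()) < 1e-2
```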


Channel Mixing and Central Limit Theorem

After the nonlinearity is applied (e.g., probabilistic ReLU), the process is no longer Gaussian. To overcome this issue, a channel mixing matrix M∈ℝ^(c×c) is introduced, and the feature map in the following layer is defined by ƒ^(ℓ+1)=M h^(ℓ), where h^(ℓ)=ReLU[A^(ℓ)ƒ^(ℓ)]. So long as the channels of h^(ℓ) are only weakly dependent, the central limit theorem (CLT) may be applied to each function ƒα^(ℓ+1)=Σβ Mαβ hβ^(ℓ), so that in the limit of a large number of channels, the statistics of the ƒα^(ℓ+1) converge to a Gaussian process with first and second moments given by:






E[ƒ^(ℓ+1)(x)]=M E[h^(ℓ)(x)],  E[ƒ^(ℓ+1)(x)ƒ^(ℓ+1)(x′)^T]=M E[h^(ℓ)(x)h^(ℓ)(x′)^T]M^T  (10)


The convergence to a Gaussian process here is reminiscent of the well-known infinite width limit of Bayesian neural networks. However, the setting here is fundamentally different. Unlike the Bayesian case, where the distribution of M is given by a prior or posterior, in the case of a probabilistic numeric convolutional neural network (PNCNN), M is a deterministic quantity and instead the uncertainty is about the input. Thus, a PNCNN is not a Bayesian method in the sense of representing uncertainty about the parameters of the model, but instead it is Bayesian in representing and propagating the uncertainty in the value of the inputs.


Measurement and Projection to RBF Gaussian Process

As a last step, the mean and covariance functions of the approximate GP ƒ^(ℓ+1) are simplified. While it is possible to compute the values of these functions, unlike in the RBF kernel case, it is not possible to apply the convolution operator A to them in closed form. In order to circumvent this challenge, the (approximately) Gaussian process ƒ^(ℓ+1) is modeled with an RBF Gaussian process as follows. First, the mean yi=E[ƒ^(ℓ+1)(xi)] and variance σi²=Var[ƒ^(ℓ+1)(xi)] of the approximate Gaussian process ƒ^(ℓ+1) are evaluated at a collection of points x1, . . . , xN using Equations 8, 9 and 10. These values yi are treated as measurements of the underlying signal with a heteroscedastic noise σi² that varies from point to point. Second, the RBF-based posterior GP of this signal, f̂|{(xi, yi, σi)}~GP(μp, kp), with posterior mean and covariance given by Equation 1, is computed for the heteroscedastic noise model. The uncertainty in the input ƒ^(ℓ+1) is propagated through to the RBF posterior f̂ via the measurement noise σi. Notably, the mean and covariance functions of this Gaussian process are written in terms of the RBF kernel, and therefore it is possible to continue applying convolutions in closed form in subsequent layers.


As described further below, the RBF kernel in each layer is trained to maximize the marginal likelihood of the data that it sees, and thereby minimize the discrepancy with the underlying distribution that generated those measurements. This measurement/projection approach is effective in many scenarios; a simplified sketch of one full layer built from the components above follows.
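

Putting the preceding pieces together, the following self-contained sketch implements one simplified single-channel layer in one dimension: an RBF GP posterior from noisy measurements (Equation 1), the closed-form diffusion (Equations 6 and 7), the rectified-Gaussian moments (Equation 8), and the measurement step that produces the (yi, σi²) pairs consumed by the next layer. Channel mixing (Equation 10), the sum over diffusion components k, and all hyperparameter choices shown are simplifications for illustration; pncnn_layer_1d is a hypothetical name and not the disclosed implementation.

```python
import numpy as np
from scipy.stats import norm

def pncnn_layer_1d(x_pts, y, noise_var, a, l2, beta, sigma, t=1.0):
    """Simplified single-channel layer on a 1-D domain: RBF GP posterior from noisy
    measurements (Equation 1), closed-form diffusion e^{tD} (Equations 6-7),
    rectified-Gaussian moments (Equation 8), returning the new (mean, variance)
    measurements for the next layer. Channel mixing (Equation 10) and the sum
    over diffusion components k are omitted for brevity."""
    def k_rbf(xa, xb, var):
        return a * norm.pdf(xa[:, None], loc=xb[None, :], scale=np.sqrt(var))

    K = k_rbf(x_pts, x_pts, l2)                        # prior covariance K_XX
    A_inv = np.linalg.inv(K + np.diag(noise_var))      # [K + S]^{-1}

    # e^{tD} k(x, x_i) = a N(x; x_i - t*beta, l^2 + t*sigma)   (Equation 6)
    kq = k_rbf(x_pts, x_pts - t * beta, l2 + t * sigma)
    mu = kq @ (A_inv @ y)                              # diffused posterior mean
    prior_var = a / np.sqrt(2 * np.pi * (l2 + 2 * t * sigma))  # Equation 7 at x = x'
    var = np.maximum(prior_var - np.einsum('ij,jk,ik->i', kq, A_inv, kq), 1e-12)

    # Probabilistic ReLU: rectified-Gaussian mean and variance (Equation 8)
    s = np.sqrt(var)
    z = mu / s
    mean_h = mu * norm.cdf(z) + s * norm.pdf(z)
    var_h = (mu**2 + var) * norm.cdf(z) + mu * s * norm.pdf(z) - mean_h**2
    return mean_h, var_h                               # next layer's (y_i, sigma_i^2)

# Two stacked layers on irregularly sampled 1-D data
rng = np.random.default_rng(0)
x_pts = np.sort(rng.uniform(0.0, 1.0, 30))
y = np.sin(2 * np.pi * x_pts)
m1, v1 = pncnn_layer_1d(x_pts, y, np.full_like(x_pts, 1e-2), a=1.0, l2=0.02, beta=0.1, sigma=0.01)
m2, v2 = pncnn_layer_1d(x_pts, m1, np.maximum(v1, 1e-4), a=1.0, l2=0.02, beta=-0.1, sigma=0.01)
```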


Training Procedure

An example neural network model, such as depicted in FIG. 1, may have two sets of parameters: the channel mixing and diffusion parameters (M^(ℓ), Wk^(ℓ), βk^(ℓ), Σk^(ℓ)) for each layer ℓ, as well as kernel hyperparameters of the Gaussian Processes (e.g., the RBF amplitude a^(ℓ) and length scale l^(ℓ) in each layer). In some aspects, all parameters are trained jointly on the loss Ltask+Σℓ L_MLL^(ℓ), where Ltask is the cross entropy with logits given by the mean μp of the pooled features P(ƒ^(L))~N(μp, Σp) and L_MLL^(ℓ) are the marginal log likelihoods of the GP feature maps:






L_MLL^(ℓ)(ƒ)=½ Σα [ƒα^T[K_XX+Sα]^(−1)ƒα+log det[K_XX+Sα]+N log 2π]  (11)


where, for each layer ℓ, ƒα=[ƒα(x1), . . . , ƒα(xN)]∈ℝ^N are the observed values for channel α at locations X=[x1, . . . , xN], K_XX is the covariance of the RBF kernel, Sα=diag(σα²) is the measurement noise for each channel α and spatial location, and log det[·] is the log determinant function. Notably, the GP marginal likelihood is independent of the class labels.
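

For a single channel, the per-layer term of Equation 11 may be sketched as follows (returned as a negative log likelihood to be added to the task loss); the function name, arguments, and example data are illustrative assumptions.

```python
import numpy as np

def gp_mll_loss(f_alpha, K_XX, noise_var):
    """The per-channel quantity of Equation 11 (a negative log likelihood that is
    added to the task loss): 0.5*(f^T [K+S]^{-1} f + log det[K+S] + N log 2*pi)."""
    N = len(f_alpha)
    A = K_XX + np.diag(noise_var)
    data_fit = f_alpha @ np.linalg.solve(A, f_alpha)
    _, logdet = np.linalg.slogdet(A)
    return 0.5 * (data_fit + logdet + N * np.log(2 * np.pi))

# Example: one channel on 20 locations; the total training loss would be the
# task cross entropy plus this term summed over layers and channels.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 20))
K_XX = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1**2)
loss = gp_mll_loss(np.sin(2 * np.pi * x), K_XX, np.full(20, 1e-2))
```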


Example Probabilistic Numeric Convolutional Neural Network Architecture


FIG. 1 depicts an example 100 of a probabilistic numeric convolutional neural network architecture. In the depicted example, the input data is superpixel data, which is an example of sparsely sampled continuous data, though in other aspects, other types of input data, including other types of sparsely (or irregularly) sampled input data, may be used. Generally, the mean and elementwise uncertainty of the Gaussian process feature maps are shown as they are transformed through the network by the convolution layers. Observation points are shown as dots in σ(x).


As depicted, in a first convolution layer 104, a Gaussian process 106 is determined based on the input data x, which includes a determined mean function μ1(x) 108 and a determined standard deviation σ1(x) 110. The standard deviation σ1(x) may then be used to determine a covariance. In one aspect, a covariance kernel k(x, x′) can be determined as described above in Equation 1.


This Gaussian process 106 in layer 104 is used to interpolate the data input to linear operator A^(1) 112, which in turn generates pre-activation data and thus substitutes for a conventional convolution layer. In some aspects, linear operator A^(1) is implemented as a diffusion process whose parameters (e.g., drift and diffusion) serve as the learnable parameters of the convolutional layer, as described above.


Then, a pointwise nonlinear operation (probabilistic ReLU in this example) is applied to the pre-activation data to generate activations, and then channel mixing is performed, as described above. Finally, the current Gaussian process is evaluated (measured) at a given set of points, and the uncertainty of the Gaussian process is treated as a heteroscedastic noise model. This is the output of the layer, which can then be passed on to the next layer, yielding a new Gaussian process for the second layer with transformed mean function μ2(x) and standard deviation σ2(x).


This process is repeated through a plurality of layers (four in this example) and ends with an integral pooling operation (as discussed above) at 114. The output 116 of the model and process 100 in this example is a random variable with mean μ and uncertainty Σ.


As depicted, during training, the cross-entropy loss of the model output is minimized along with the sum of the marginal log likelihoods (MLL) of each layer, for example, according to Equation 11 above. However, in other aspects, the cross-entropy loss is first minimized, followed by the sum of the marginal log likelihoods.


Example Method for Performing Operations with a Probabilistic Numeric Convolutional Neural Network Model


FIG. 2 depicts an example method 200 for performing operations with a probabilistic numeric neural network.


Method 200 begins at step 202 with receiving input data (e.g., x). In some aspects, the input data is in the form of a vector-valued function (e.g., ƒ(x)).


Method 200 then proceeds to step 204 with calculating a mean of the input data (e.g., μ(x)).


Method 200 then proceeds to step 206 with calculating a covariance of the input data (e.g., k(x, x′)).


Method 200 then proceeds to step 208 with determining a Gaussian process based on the mean and the covariance of the input data (e.g., GP(μ(x), k(x, x′))), where k(x, x)=σ²(x).


Method 200 then proceeds to step 210 with applying a linear operator (A) to the Gaussian process to generate pre-activation data. In one aspect, this may be performed according to A[ƒ]~GP(Aμ(x), A k(x, x′)A′).


Method 200 then proceeds to step 212 with applying a nonlinear operation to the pre-activation data to form activation data (e.g., σ(A[ƒ]), where σ is a nonlinear operator such as ReLU). In some embodiments, channel mixing may further be performed on the activation data, such as according to Equation 10 above. In some aspects, step 212 may be performed iteratively across two or more layers of a model.


Method 200 then proceeds to step 214 with applying a pooling operation to the activation data to generate an inference. In some aspects, the pooling operation is an integral pooling operation, such as described above. In some aspects, the inference is in the form of a random variable with mean μ and uncertainty Σ (e.g., N(μ, Σ)), such as 116 in FIG. 1.


During a training phase, method 200 may then optionally proceed to step 216 with calculating a loss based on the inference.


Further during a training phase, method 200 may then optionally proceed to step 218 with training parameters of the linear operator (e.g., A) based on the loss.


In some aspects of method 200, applying a linear operator to the Gaussian process comprises applying a diffusion equation (e.g., e^(tD)ƒ(x)).


In some aspects of method 200, the loss comprises a cross-entropy component.


In some aspects of method 200, the loss further comprises a marginal log likelihood component associated with the Gaussian process; and the method further comprises: training parameters of the Gaussian process based on the marginal log likelihood component associated with the Gaussian process.


In some aspects of method 200, training parameters of the linear operator comprises performing gradient descent on the parameters of the linear operator.


In some aspects of method 200, training parameters of the Gaussian process comprises performing gradient descent on the parameters of the Gaussian process.


In some aspects of method 200, the nonlinear operation comprises a probabilistic ReLU operation.


In some aspects of method 200, the input data comprises irregularly sampled data (e.g., {(xi, ƒ(xi))}, i=1, . . . , N).


Example Processing System


FIG. 3 depicts an example processing system 300 configured to perform the various methods described herein, including, for example, with respect to FIGS. 1 and 2.


Processing system 300 includes a central processing unit (CPU) 302, which in some examples may be a multi-core CPU. Instructions executed at the CPU 302 may be loaded, for example, from a program memory associated with the CPU 302 or may be loaded from memory 314.


Processing system 300 also includes additional processing components tailored to specific functions, such as a graphics processing unit (GPU) 304, a digital signal processor (DSP) 306, and a neural processing unit (NPU) 308.


An NPU, such as 308, is generally a specialized circuit configured for implementing all the necessary control and arithmetic logic for executing machine learning algorithms, such as algorithms for processing artificial neural networks (ANNs), deep neural networks (DNNs), random forests (RFs), probabilistic numeric convolutional neural networks (PNCNNs), and the like. An NPU may sometimes alternatively be referred to as a neural signal processor (NSP), tensor processing unit (TPU), neural network processor (NNP), intelligence processing unit (IPU), vision processing unit (VPU), or graph processing unit.


NPUs, such as 308, are configured to accelerate the performance of common machine learning tasks, such as image classification, machine translation, object detection, and various other predictive models. In some examples, a plurality of NPUs may be instantiated on a single chip, such as a system on a chip (SoC), while in other examples they may be part of a dedicated neural-network accelerator.


NPUs may be optimized for training or inference, or in some cases configured to balance performance between both. For NPUs that are capable of performing both training and inference, the two tasks may still generally be performed independently.


NPUs designed to accelerate training are generally configured to accelerate the optimization of new models, which is a highly compute-intensive operation that involves inputting an existing dataset (often labeled or tagged), iterating over the dataset, and then adjusting model parameters, such as weights and biases, in order to improve model performance. Generally, optimizing based on a wrong prediction involves propagating back through the layers of the model and determining gradients to reduce the prediction error.


NPUs designed to accelerate inference are generally configured to operate on complete models. Such NPUs may thus be configured to input a new piece of data and rapidly process it through an already trained model to generate a model output (e.g., an inference).


In one implementation, NPU 308 is a part of one or more of CPU 302, GPU 304, and/or DSP 306.


In some examples, connectivity component 312 may include various subcomponents, for example, for wide area network (WAN), local area network (LAN), Wi-Fi connectivity, Bluetooth connectivity, and other data transmission standards.


Processing system 300 may also include one or more input and/or output devices 310, such as screens, touch-sensitive surfaces (including touch-sensitive displays), physical buttons, speakers, microphones, and the like.


In some examples, one or more of the processors of processing system 300 may be based on an ARM or RISC-V instruction set.


Processing system 300 also includes memory 314, which is representative of one or more static and/or dynamic memories, such as a dynamic random access memory, a flash-based static memory, and the like. In this example, memory 314 includes computer-executable components, which may be executed by one or more of the aforementioned processors of processing system 300.


In this example, memory 314 includes Gaussian process component 314A, linear operator component 314B, nonlinear operation component 314C, pooling component 314D, measuring component 314E, loss calculation component 314F, training component 314G, inferencing component 314H, model parameters 314I, and models 314J. The depicted components, and others not depicted, may be configured to perform various aspects of the methods described herein.


Generally, processing system 300 and/or components thereof may be configured to perform the methods described herein, including methods described with respect to FIGS. 1 and 2.


Notably, in other aspects, processing system 300 may include additional, alternative, or fewer elements. Further, various aspects of methods described above may be performed on one or more processing systems.


Example Clauses

Implementation examples are described in the following numbered clauses:


Clause 1: A method of performing operations with a probabilistic numeric neural network, comprising: defining a Gaussian Process based on a mean and a covariance of input data; applying a linear operator to the Gaussian Process to generate pre-activation data; applying a nonlinear operation to the pre-activation data to form activation data; and applying a pooling operation to the activation data to generate an inference.


Clause 2: The method of Clause 1, wherein the inference comprises a random variable.


Clause 3: The method of any one of Clauses 1-2, wherein applying a linear operator to the Gaussian process comprises applying a diffusion equation to the Gaussian process.


Clause 4: The method of any one of Clauses 1-3, further comprising: calculating a loss based on the inference; and training parameters of the linear operator based on the loss.


Clause 5: The method of Clause 4, wherein: the loss further comprises a cross entropy component.


Clause 6: The method of Clause 5, wherein: the loss further comprises a marginal log likelihood component associated with the Gaussian process, and the method further comprises: training parameters of the Gaussian process based on the marginal log likelihood component associated with the Gaussian process.


Clause 7: The method of Clause 5, wherein training parameters of the linear operator comprises performing gradient descent on the training parameters of the linear operator.


Clause 8: The method of any one of Clauses 6-7, wherein training parameters of the Gaussian process comprises performing gradient descent on the training parameters of the Gaussian process.


Clause 9: The method of any one of Clauses 1-8, wherein the nonlinear operation comprises a probabilistic ReLU operation.


Clause 10: The method of any one of Clauses 1-9, wherein the input data comprises irregularly sampled data.


Clause 11: The method of any one of Clauses 1-10, wherein the linear operator comprises A=Σk Wk e^(Dk) and applying the linear operator to generate the pre-activation data is performed according to Equation 5.


Clause 12: A processing system, comprising: a memory comprising computer-executable instructions; and one or more processors configured to execute the computer-executable instructions and cause the processing system to perform a method in accordance with any one of Clauses 1-11.


Clause 13: A processing system, comprising means for performing a method in accordance with any one of Clauses 1-11.


Clause 14: A non-transitory computer-readable medium comprising computer-executable instructions that, when executed by one or more processors of a processing system, cause the processing system to perform a method in accordance with any one of Clauses 1-11.


Clause 15: A computer program product embodied on a computer-readable storage medium comprising code for performing a method in accordance with any one of Clauses 1-11.


Additional Considerations

The preceding description is provided to enable any person skilled in the art to practice the various embodiments described herein. The examples discussed herein are not limiting of the scope, applicability, or embodiments set forth in the claims. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments. For example, changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as appropriate. For instance, the methods described may be performed in an order different from that described, and various steps may be added, omitted, or combined. Also, features described with respect to some examples may be combined in some other examples. For example, an apparatus may be implemented or a method may be practiced using any number of the aspects set forth herein. In addition, the scope of the disclosure is intended to cover such an apparatus or method that is practiced using other structure, functionality, or structure and functionality in addition to, or other than, the various aspects of the disclosure set forth herein. It should be understood that any aspect of the disclosure disclosed herein may be embodied by one or more elements of a claim.


As used herein, the word “exemplary” means “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.


As used herein, a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members. As an example, “at least one of: a, b, or c” is intended to cover a, b, c, a-b, a-c, b-c, and a-b-c, as well as any combination with multiples of the same element (e.g., a-a, a-a-a, a-a-b, a-a-c, a-b-b, a-c-c, b-b, b-b-b, b-b-c, c-c, and c-c-c or any other ordering of a, b, and c).


As used herein, the term “determining” encompasses a wide variety of actions. For example, “determining” may include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining and the like. Also, “determining” may include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory) and the like. Also, “determining” may include resolving, selecting, choosing, establishing and the like.


The methods disclosed herein comprise one or more steps or actions for achieving the methods. The method steps and/or actions may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of steps or actions is specified, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims. Further, the various operations of methods described above may be performed by any suitable means capable of performing the corresponding functions. The means may include various hardware and/or software component(s) and/or module(s), including, but not limited to a circuit, an application specific integrated circuit (ASIC), or processor. Generally, where there are operations illustrated in figures, those operations may have corresponding counterpart means-plus-function components with similar numbering.


The following claims are not intended to be limited to the embodiments shown herein, but are to be accorded the full scope consistent with the language of the claims. Within a claim, reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more. No claim element is to be construed under the provisions of 35 U.S.C. § 112(f) unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited using the phrase “step for.” All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims.

Claims
  • 1. A method of performing operations with a probabilistic numeric convolutional neural network, comprising: defining a Gaussian Process based on a mean and a covariance of input data;applying a linear operator to the Gaussian Process to generate pre-activation data;applying a nonlinear operation to the pre-activation data to form activation data; andapplying a pooling operation to the activation data to generate an inference.
  • 2. The method of claim 1, wherein the inference comprises a random variable.
  • 3. The method of claim 1, wherein applying a linear operator to the Gaussian process comprises applying a diffusion equation to the Gaussian process.
  • 4. The method of claim 1, further comprising: calculating a loss based on the inference; andtraining parameters of the linear operator based on the loss.
  • 5. The method of claim 4, wherein the loss comprises a cross entropy component.
  • 6. The method of claim 5, wherein: the loss further comprises a marginal log likelihood component associated with the Gaussian process, andthe method further comprises training parameters of the Gaussian process based on the marginal log likelihood component associated with the Gaussian process.
  • 7. The method of claim 6, wherein training parameters of the linear operator comprises performing gradient descent on the parameters of the linear operator.
  • 8. The method of claim 7, wherein training parameters of the Gaussian process comprises performing gradient descent on the parameters of the Gaussian process.
  • 9. The method of claim 1, wherein the nonlinear operation comprises a probabilistic ReLU operation.
  • 10. The method of claim 1, wherein the input data comprises irregularly sampled data.
  • 11. A processing system, comprising: a memory comprising computer-executable instructions; andone or more processors configured to execute the computer-executable instructions and cause the processing system to: define a Gaussian Process based on a mean and a covariance of input data;apply a linear operator to the Gaussian Process to generate pre-activation data;apply a nonlinear operation to the pre-activation data to form activation data; andapply a pooling operation to the activation data to generate an inference.
  • 12. The processing system of claim 11, wherein the inference comprises a random variable.
  • 13. The processing system of claim 11, wherein in order to apply a linear operator to the Gaussian process, the one or more processors are further configured to cause the processing system to apply a diffusion equation to the Gaussian process.
  • 14. The processing system of claim 11, wherein the one or more processors are further configured to cause the processing system to: calculate a loss based on the inference; andtrain parameters of the linear operator based on the loss.
  • 15. The processing system of claim 14, wherein the loss comprises a cross entropy component.
  • 16. The processing system of claim 15, wherein: the loss further comprises a marginal log likelihood component associated with the Gaussian process, andthe one or more processors are further configured to cause the processing system to train parameters of the Gaussian process based on the marginal log likelihood component associated with the Gaussian process.
  • 17. The processing system of claim 16, wherein in order to train parameters of the linear operator, the one or more processors are further configured to cause the processing system to perform gradient descent on the parameters of the linear operator.
  • 18. The processing system of claim 17, wherein in order to train parameters of the Gaussian process, the one or more processors are further configured to cause the processing system to perform gradient descent on the parameters of the Gaussian process.
  • 19. The processing system of claim 11, wherein the nonlinear operation comprises a probabilistic ReLU operation.
  • 20. The processing system of claim 11, wherein the input data comprises irregularly sampled data.
  • 21. A non-transitory computer-readable medium comprising computer-executable instructions that, when executed by a processor of a processing system, cause the processing system to perform a method of training a probabilistic numeric neural network, the method comprising: defining a Gaussian Process based on a mean and a covariance of input data;applying a linear operator to the Gaussian Process to generate pre-activation data;applying a nonlinear operation to the pre-activation data to form activation data; andapplying a pooling operation to the activation data to generate an inference.
  • 22. The non-transitory computer-readable medium of claim 21, wherein the inference comprises a random variable.
  • 23. The non-transitory computer-readable medium of claim 21, wherein applying a linear operator to the Gaussian process comprises applying a diffusion equation to the Gaussian process.
  • 24. The non-transitory computer-readable medium of claim 21, wherein the method further comprises: calculating a loss based on the inference; andtraining parameters of the linear operator based on the loss.
  • 25. The non-transitory computer-readable medium of claim 24, wherein the loss comprises a cross entropy component.
  • 26. The non-transitory computer-readable medium of claim 25, wherein: the loss further comprises a marginal log likelihood component associated with the Gaussian process, andthe method further comprises training parameters of the Gaussian process based on the marginal log likelihood component associated with the Gaussian process.
  • 27. The non-transitory computer-readable medium of claim 26, wherein: training parameters of the linear operator comprises performing gradient descent on the parameters of the linear operator, andtraining parameters of the Gaussian process comprises performing gradient descent on the parameters of the Gaussian process.
CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of and priority to U.S. Provisional Patent Application No. 63/086,339, filed on Oct. 1, 2020, the entire content of which is hereby incorporated by reference.

Provisional Applications (1)
Number Date Country
63086339 Oct 2020 US