The following relates to the information processing arts, computational arts, numerical integration arts including inference arts such as Bayesian inference, and related arts.
Tasks such as machine learning, modeling, and so forth commonly involve evaluation of an intractable sum or integral. For example, in Bayesian inference a posterior predictive distribution of the form:
p(x|,α)=∫Θp(x|θ)p(θ|X,α)dθ
is evaluated, where θ is a parameters vector defined on the parameter space Θ, α is a hyperparameter vector, X is a training data set (also called the “evidence”), and x is a data point to be predicted. Computing the posterior predictive distribution requires evaluating an integral of a product of two functions, namely p(x|θ) and p(θ|X,α), which is often not possible to perform analytically.
Numerical approximations methods have been derived to evaluate such integrals. Sampling techniques such as Gibbs sampling are sometimes used to explore the space and compute sufficient statistics to estimate such an integral. However, sampling techniques are stochastic by nature and typically do not provide guaranteed convergence, but only convergence with a high probability. Other approaches minimize an objective function, such as a loss function, a negative log-likelihood or a energy criterion.
In some approaches, an upper (or lower) bounding function is defined such that the integral is known to be less than (or greater than) the bound everywhere over the region of interest. In such approaches, the “tightness” of the bound determines accuracy of (or at least an error bar for) the estimated integral. A loose bound which has large gaps between the bounding function and the true integral over the region will typically result in a large overestimate of the integral (for an upper bound). A tight bound of the bounding function to the true integral over the region provides accuracy and a small error bar.
Accuracy can be improved by partitioning the region over which the integral is to be computed and defining an upper (or lower) bound for each partition. Bouchard et al., U.S. Pat. No. 8,190,550 titled “Split Variational Inference” discloses some such approaches, including a disclosed embodiment which partitions a region of interest into a plurality of soft bin regions that span the region of interest, estimates an integral over each soft bin region of a function defined over the region of interest, and outputs a value equal to or derived from the sum of the estimated integrals over the soft bin regions spanning the region of interest.
In some embodiments disclosed herein, a non-transitory storage medium storing instructions readable and executable by an electronic data processing device to perform a method of evaluating an integral Z:=Πk=1Kƒk(t)dv(t) where K≧2, v is a measure on a space and ƒk(t), k=1, . . . , K are functions defined in the space , the method comprising operations including optimizing a pivot function q(t)=Πk=1Kqk(t) to minimize a bound defined by the inequality Πk′=1Kƒk(t)dv(t)≧Πk=1K
where ρk≧0 for k=1, . . . , K and Σk=1Kρk=1 to determine an optimized pivot function qopt(t) and optimized values ρ1
with q(t)=qopt(t) and ρk=ρk
In some embodiments disclosed herein, an apparatus comprises a non-transitory storage medium as set forth in the immediately preceding paragraph and an electronic data processing device configured to read and execute the instructions stored on the non-transitory storage medium.
In some embodiments disclosed herein, a method operating on two functions ƒ and g each mapping from a space into + comprises optimizing a pivot function r:+ to minimize a bound defined by the inequality Z≦
where v is a measure on the space and p≧1 and
to determine an optimized pivot function ropt(t). The method may further comprise evaluating the product ƒg as one of ƒg=ƒproptp and ƒg=gqropt−q. The method may additionally or alternatively further comprise evaluating the integral Z:=ƒ(t)g(t)dv(t) as the product
with r(t)=ropt(t). The method is suitably performed by an electronic data processing device.
In existing integral evaluation approaches employing an upper (or lower) bound, a problem exists in that substantial computer resources are required to evaluate the integral. The evaluation of the integral is computationally expensive due to the need to determine a tight bound. In approaches employing partitioning of the region of interest, the number of computations is multiplied by the number of partitions.
Disclosed herein are improved systems and methods providing various benefits, such as: reduced computer resources required for evaluating the integral; more accurate integration; and applicability to a wide range of classes of integrals in which the integrand can be formulated as the product of two or more functions each of which is individually readily integrable. The disclosed systems and methods additionally or alternatively may be employed to estimate the normalized integrand distribution.
The disclosed approaches for computationally efficient integral evaluation are based on Hölder's inequality, which was originally promulgated by Otto Ludwig Hölder (Dec. 22, 1859-Aug. 29, 1937). Hölder's inequality is suitably defined as follows. Denote by ƒ and g two functions mapping from a space into +. For example, in a probabilistic inference problem, the functions ƒ and g typically correspond to the prior and the likelihood term of the probabilistic inference problem. The integral Z of a product ƒg of these functions is to be evaluated, that is, Z:=∫ƒ(t)g(t)dv(t), where v is a measure on the space . It is assumed that this integral Z is intractable (or at least is not readily evaluated analytically). However, the individual integrals are readily integrated, that is, there exist efficient methods to compute the individual integrals ∫ƒdv and ∫gdv. A known upper bound on the integral Z is the Hölder inequality, which can be written as follows:
where p is a scalar and p≧1 and q is defined in terms of p as
Hölder's inequality (e.g. as expressed in Equation (1)) might appear to be a good candidate for evaluation of integrals such as integrals implicated in evaluating a posterior predictive distribution in Bayesian inference. However, in practice Hölder's inequality is often not useful for such purposes because the bound is too loose. Hölder's inequality provides a tight bound when g=ƒp-1 (or equivalently when ƒ=gq-1), but in practical applications such as Bayesian inference this equality cannot be achieved for any value of p, and the bound is loose. Accordingly, Hölder's inequality is not of practical use in evaluating many integrals of the form Z:=ƒ(t)g(t)dv(t).
Disclosed herein are integration approaches that are implementable with reduced computer resources (i.e. are less computationally expensive) as compared with other existing integration approaches. The disclosed integration approaches modify Hölder's inequality to incorporate a pivot function r as follows. For any function r:+ the following inequality holds:
This proposition is suitably demonstrated as follows. Replace ƒ by ƒr and similarly replace g by
in Hölder's inequality as presented in Equation (1). The expression for the integral Z is then written as:
Note that the integral Z of Equation (3) is exactly equal to the original integral Z:=ƒ(t)g(t)dv(t) since the r(t) and
terms of Equation (3) cancel. Applying Hölder's inequality (Equation (1)) to the integral of Equation (3) yields:
which is Equation (2) since the left-hand size is exactly equal to Z.
Equation (2) is referred to herein as a “variational Hölder's inequality”, and serves as the basis for a variational inference algorithm disclosed in the following which comprises minimizing the bound defined by Equation (2) with respect to the pivot function r.
As an initial point, it is observed herein that the upper bound defined by the variational Hölder's inequality (Equation (2)) actually becomes exact if r is chosen to be proportional to r* which is defined as follows:
That is, for r=cr* with c being a positive scalar and r* as defined in Equation (5), the variational Hölder's inequality of Equation (2) becomes an equality. This can be demonstrated as follows. For r=cr*, the first integral of the right-hand side of Equation (2) is
because it integrates
Similarly, the second integral of the right-hand side of Equation (2) is
because
The product of these integrals is
It is also noted that for the case of p=q=2 (which satisfies the conditions p≧1 and
required for Hölder's inequality), the function r* given in Equation (5) corresponds to the square root of the ratio between the two factors ƒ and g, that is,
Equation (2) using a pivot function r=cr* with r* given in Equation (5) thus yields an exact evaluation of Z. In practice however, the integrals in Equation (2) formulated using r* as given in Equation (5) are typically not easier to evaluate than the original integral Z.
As further disclosed herein, computationally efficient evaluation of the integral Z using the variational Hölder's inequality of Equation (2) is achievable by defining a family of variational functions ={r(•,θ),θεΘ} parameterized by θεΘ for which the integrals of the form:
are tractable. Choosing the best function r(•,θ) in the family is disclosed herein to significantly tighten the bound as compared with Hölder inequality (Equation (1)). In one suitable optimization formulation, the upper bound Ī:Θ+ is minimized with respect to θ, p and q (where p and q are not independent but rather are constrained by p≧1 and
where:
For variational functions r(•,θ) that are differentiable with respect to θ, the optimal parameters θ can be obtained using an optimization algorithm such as (by way of illustrative example) a standard gradient descent algorithm.
In some embodiments, the pivot functions r are chosen to be log-linear functions. Defining the feature function φ:Θ and assuming the space Θ is equipped with the dot-product operation, the following class of pivot functions is suitably employed:
r(t,θ)=e<θ,φ(t)>,θεΘ (8)
For log-linear functions r as defined in Equation (8), the bound Ī is log-convex and differentiable in θ. This can be shown as follows. The factors log Iƒ (θ;p) and log Ig(θ; q) are log-sum-exp functions (logarithm of the integral of an exponential function) and therefore are convex in θ. Hence log Ī(θ)=log Iƒ(θ;p)+log Ig(θ; q) is a sum of two functions which are convex in θ, so that Ī(θ) is log-convex. Because of this log-convex property, gradient descent algorithms are less subject to local minima. In practice, this means that complex initialization procedures are not required. In addition, the optimization is not restricted to generic gradient descent algorithms, but rather other optimization methods, tailored to specific problems can alternatively be employed. Furthermore, various available theoretical tools pertaining to convex optimization may be employed to analyze convergence properties. For example, it is expected that the objective function of Equation (7) with a log-convex pivot function as per Equation (8) can be solved in polynomial time using standard gradient descent algorithms.
Optimization of Equation (7) with a pivot of the form r(t,θ):=eθ
which is ph,θ,p[φ(t)]. Similarly, the gradient ∇p log Ih(θ;p) may be written as
log(h(t))h(t)pepθ
With reference to
The upper bound minimization engine 22 employs a suitable optimization algorithm, such as a gradient descent algorithm, to optimize the objective function 24 respective to the pivot function parameters 26 and Hölder's inequality parameter 28. Other optimization algorithms may be used. By way of an additional illustrative example, in some embodiments a modified gradient descent is employed in which only the pivot function parameters 26 are subject to optimization by the gradient descent algorithm while the Hölder's inequality parameter 28 can be optimized using a grid approach (e.g., repeating the gradient descent optimization of pivot function parameters θ for each of a plurality of discrete values of p and choosing the value of p yielding the best gradient descent result).
When employing log-linear functions r as defined in Equation (8), or another pivot function family for which the bound Ī is log-convex, it is ensured that a global upper bound minimum can be identified. Nonetheless, it is to be understood that as used herein the term “minimize”, “optimize” and similar phraseology encompasses iterative optimization techniques that are configured to settle on a local minimum that is not the global minimum, and/or that are configured to stop before reaching the precise global minimum. For example, the upper bound minimization engine 22 may employ an iterative optimization algorithm with stopping criteria that terminate the iterative process when the iteration-to-iteration change in the objective function value becomes less than a stopping threshold.
The output of the upper bound minimization engine 22 is a set of optimized parameters 30 including optimized pivot function parameters θopt and optimized Hölder's inequality parameters popt and qopt. Equation (2) is then applied using these optimized values to compute the integral 32 according to
In addition to computing the integral of the distribution ƒ(t)g (t) over the space (that is, computing Z:=ƒ(t)g(t)dv(t)), the foregoing approach is also optionally applied to compute the normalized distribution ƒ(t)g(t) itself. This is based on the recognition herein that obtaining an accurate upper bound to the log-partition is closely related to the fact that the approach provides a good approximation of the distribution and its moments. Hence, minimizing the variational Hölder inequality (Equation (2)) is also an efficient inference algorithm. To see this, the exact solution of Equation (5) is revisited. Recall that for r=cr* with c being a positive scalar and r* as defined in Equation (5), the variational Hölder's inequality of Equation (2) becomes an equality. In this case, the product ƒg can be approximated by ƒprp or gqr−q as:
ƒg=ƒprp or ƒg=gqr−q (9)
To see this, it is observed that Equation (5) leads to
By raising this equation to power p the expression
is obtained, so that
because
While this holds exactly only for r=cr* for which the inequality of Equation (2) is an equality, the relations of Equation (9) hold at least approximately where the bound is tight but not exact, for example when using the optimized bound provided by the optimized parameters 30. Thus, with reference again to
The illustrative system of
Some illustrative applications of the integral evaluation system of
One illustrative application is evaluation of the product of univariate factors with a Gaussian. Using notations previously set forth, the function ƒ(t)=Πi=1nƒi(ti) is defined, where each function ƒi: is univariate. The function g(t) is also defined, with
where A is a definite symmetric n×n matrix and bεn. The Lebesgue measure is used for the measure v. The integral to be evaluated is:
This type of integral is used in certain machine learning tasks, and typically corresponds to the marginal data probability—that is, the evidence—of a linear regression model with known variance and sparse priors, n being the number of variables. Up to an affine change of variable to obtain orthogonal univariate factors, this integral corresponds also to the data evidence in a generalized linear model with Gaussian prior, where n is the number of independent observations. The functions ƒ and g alone are readily integrated, and remain readily integrated when multiplied by a Gaussian potential with diagonal covariance matrices, so that the following variational family is suitable for constructing the variational inference problem:
A approximation to the integral of Equation (10) is obtained by minimizing the upper bound Ī given in Equation (7) using the system of
Integration of the orthogonal univariate function is first considered. Here the first term Iƒ required to compute the bound can be obtained efficiently in terms of univariate integrals:
where the univariate integrals U[h]:2×[01] are defined as:
Here, h is an arbitrary univariate function . These integrals can be efficiently computed using quadrature integration (e.g. recursive adaptive Simpson quadrature), but in some practical applications the same functions ƒi in Equation (12) are used for many factors. A considerable computational efficiency enhancement can be obtained by designing integrals dedicated to some functions (in practice, using pre-computed functions with linear interpolation is 10 to 100 times faster than running a new quadrature every time). One commonly encountered example is the step function: ƒi(x)={x≧0} for all iε{1, . . . , n}. For this function and using Gaussian pivot functions as specified in Equation (11), a closed form expression is obtained in terms of normal cdf function Φ:
where Φ is the cumulative density function of a centered normalized Gaussian random variable.
Concerning the other factor Ig, its log-quadratic form corresponds to a standard Gaussian integral:
which can be recast as:
I
g(θ;q)=N(q(A−diag(θ1)),q(b−θ2)) (16)
where N(A,b) is involves a matrix inversion and a determinant computation:
Truncated multivariate Gaussian integration is next considered. Most of the truncated multivariate Gaussian integration problems with linear truncations can be put under the canonical form of Equation (10) where truncations are orthogonal, i.e. ƒi(x)={x≧0} for all iε{1, . . . , n}. Integrating truncated correlated Gaussian is a known open problem for which only approximation techniques are presently available. To be used in a learning framework, upper and lower bounds to the integral of Equation (10) are often useful. The disclosed variational Hölder's inequality approach disclosed herein is well suited to provide the upper bound.
With reference to
Sparse probit regression is next considered. Probit regression is a Generalized Linear Model (GLM) for binary observations. In Bayesian statistics, probit regression is often preferred to logistic regression as there is a natural interpretation of the model as a multivariate truncated Gaussian variable. In fact it corresponds exactly to a truncated multivariate Gaussian integration. It is shown here that good performance for this problem can be obtained using the variational Hölder approach disclosed herein. In brief, the upper bounding technique is applied to the evidence integral to obtain tractable approximations.
Let =(X,y), where Xεm×d is the data matrix, i.e. Xij is the value of the j-th covariate for the i-th data point, and yε{−1,1}n the vector of binary output variables. We define M as the signed data matrix, such that Mij=yiXij. The likelihood function is p(|w)=Πi=1mp(yi|Xi1, . . . , w)=Φ(Mw) where Φ:™[0,1] is the product of normal cdfs: Φ(s)=Πi=1m∫−∞s
p()=p(|w)dp(w)=Φ(Mw)dp(w) (18)
which can be written as:
or more succinctly as:
where:
The matrices In and 0n×d denote the identity matrix and zero matrix of size n×n and n×d, respectively. Equation (20) is based on the concatenation t:=[wT, vT]T and corresponds exactly the same form as Equation (10), i.e. a correlated Gaussian pdf multiplied by orthogonal n=d+m univariate functions.
The system of
for any K-dimensional vector ρεΔK-1 where ΔK-1={ρ; ρk≧0, Σk=1
Equation (1) is recovered.
The introduction of a pivot function into the generalized Hölder inequality of Equation (22) is developed as follows. Define a non-negative pivot function q(t)=Πi=1Kqk(t). Then, the following results holds:
This result can be demonstrated as follows. Substituting
for ƒk and Πk′=1Kqk,(t)dv(t) for dv(t) in Equation (22) leaves the left integral unchanged but gives the required right hand side of the equation. This result is also a variational Hölder inequality, since it is a direct generalization of Equation (2) for more than two factors. The function q(t) is actually a factored approximation rather than a formal pivot function, similarly to the difference between binary logistic regression (with odd-ratios) and multinomial logistic regression (with unnormalized scores). Nonetheless, the function q(t) serves the same functionality (e.g. providing a tighter bound) as does the function r(t) in Equation (2), and the function q(t) is defined as a pivot function herein.
The pivot function q(t) provides a tight bound as follows. In the following, K≧2 is assumed. (The special case of K=2 is also covered by Equation (2) and its sequel). Analogous to the Equation (5) and its surrounding discussion, if qk(t)=ckƒk(t) for any tε where ck are non-negative constants then the inequality in Equation (23) is an equality.
In Equation (23), the integral over the product is transformed into a product of integrals, each of them being potentially easier to approximate. Equation (23) can be written in a slightly different for by isolating the terms ƒk and qk, and taking the logarithm:
It is seen that this bound can be viewed as a Reverse Information Inequality by comparison directly with the standard information inequality, further assuming that q is a proper distribution function, i.e. that it sums to one:
where (q)=−Σk×1K log ƒΣk′=1Kqk,(t)log qk(t)dv(t) is the entropy of q.
In analogy to the K=2 embodiment shown in
with q(t)=qopt(t) and ρ=ρopt.
The disclosed variational inference approaches are fundamentally different from other inference techniques commonly used in Bayesian inference. It has the advantage of being scalable, i.e. it leads to a convex problem that can be efficiently optimized, thus reducing the computational resources required to perform the integration. There are diverse applications in data analytics where Bayesian inference is used to advantages, including by way of example: factorization methods, as used in customer care to profile users and agents, or in education analytics, where the performances of the student are measured based on a lot of random exam assignment. Another area of application is transportation analytics to recover the trajectory of a user based on partial location information (such as the location of the ticket validations): the recovery of the user home location cannot be perfect, but a distribution over the possible locations can be inferred.
The disclosed variational integral approximations are based on minimization of an upper bound to a log-partition function. The disclosed variational inference problem is convex if the variational function is log-linear, which has practical and theoretical advantages. Application to handling sparse regression in GLMs, e.g. using probit regression (but the extension to other losses is straightforward) is described as an illustrative application. Advantageously, a convex objective is maintained even if the sparse factors exhibit heavy tails.
It will be appreciated that various of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. Also that various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims.