The following relates to the statistical sampling arts and related arts, and to arts employing statistical sampling.
A diverse range of problems can be formulated in terms of sampling of a space or domain (represented herein without loss of generality as X, a sample of which may be denoted as x) in accordance with a target distribution, which is represented herein without loss of generality as p(x) and which may or may not be normalized. For example, many structured learning problems can be naturally cast in terms of decision sequences that describe a structured object x, which is associated with an unnormalized probability or distribution p(x). In some such learning problems, p(x) is the exponential of a real-valued “energy” function, which is analogous to the “value” function in optimization problems. In such situations, both inference and learning give a central role to procedures that are able to produce samples of x that follow the normalized probability distribution associated with p(x).
A known approach for performing such sampling is a technique called rejection sampling. In this approach, sampling is performed in accordance with a proposal distribution, which is represented herein without loss of generality as
However, in practice it can be difficult to obtain a proposal distribution
In adaptive rejection sampling (ARS), the rejected samples are used to improve the proposal distribution. ARS assumes that the target distribution p(x) is concave, in which case a tangent line at any given point on the target distribution is guaranteed to define an upper bound. This concavity aspect is used in ARS to refine the proposal distribution
ARS is applicable to log-concave distributions, in which the logarithm of the target density function p(x) is concave. Görür et al., “Concave Convex Adaptive Rejection Sampling”, Technical Report, Gatsby Computational Neuroscience Unit (2008) (hereinafter “Görür et al.”) sets forth an improved ARS that is applicable to distributions whose log densities can be expressed as a sum of concave and convex functions, which expands the scope of applicability of ARS. Nonetheless, even with this improvement the ARS technique is generally limited to a target distribution p(x) that is one-dimensional and continuous. This is a consequence of ARS relying upon piecewise linear upper bounds that are refined based on rejected samples and that are assured of being upper bounds on account of the continuous curvature between the end points. ARS techniques are therefore difficult or impossible to adapt to sampling of a complex target distribution p(x), such as a multi-dimensional target distribution, a discrete target distribution, a discrete multi-dimensional target distribution, a highly discontinuous target distribution, and so forth.
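For context, plain (non-adaptive) rejection sampling as referred to above can be sketched as follows. This is a generic illustration only; the target p, the uniform proposal, and the bound beta used in the usage example are hypothetical choices and are not quantities defined in this disclosure.

```python
import math
import random

def rejection_sample(p, sample_q, q, beta, n_samples):
    """Basic (non-adaptive) rejection sampling.

    p        : unnormalized target density p(x)
    sample_q : function drawing one sample from the normalized proposal q
    q        : proposal density q(x)
    beta     : constant satisfying p(x) <= beta * q(x) for all x
    """
    accepted = []
    while len(accepted) < n_samples:
        x = sample_q()
        # Accept x with probability p(x) / (beta * q(x)), which is <= 1 by choice of beta.
        if random.random() <= p(x) / (beta * q(x)):
            accepted.append(x)
    return accepted

# Toy usage: unnormalized target on [0, 1], uniform proposal q(x) = 1,
# and beta = 1 since p(x) <= 1 everywhere on [0, 1].
p = lambda x: math.exp(-8.0 * (x - 0.3) ** 2)
print(rejection_sample(p, random.random, lambda x: 1.0, beta=1.0, n_samples=5))
```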
In some illustrative embodiments disclosed as illustrative examples herein, a method comprises: performing iterative rejection sampling on a domain in accordance with a target distribution wherein the domain is partitioned to define a partition comprising partition elements and wherein each iteration of the rejection sampling includes selecting a partition element from the partition in accordance with partition element selection probabilities, acquiring a sample of the domain in the selected partition element according to a normalized proposal distribution that is associated with and normalized over the selected partition element, and accepting or rejecting the acquired sample based on the target distribution and a bound associated with the selected partition element; and, during the iterative rejection sampling, adapting the partition by replacing a partition element of the partition with two or more split partition elements, associating bounds with the split partition elements, and computing partition element selection probabilities for the split partition elements. The performing of iterative rejection sampling and the adapting of the partition may suitably be performed by an electronic processing device.
In some illustrative embodiments disclosed as illustrative examples herein, a non-transitory storage medium stores instructions executable by an electronic processing device to perform a method including performing iterative rejection sampling of a domain in accordance with a target distribution p wherein the domain is partitioned into a partition comprising partition elements Yi each having an associated selection probability πi and an associated normalized proposal distribution
wherein (i) accepting the sample comprises adding the sample to a set of accepted samples and (ii) rejecting the sample comprises replacing the selected partition element Ysel with split partition elements Aj, j=1, . . . , N, where N≧2 and Ysel=A1∪ . . . ∪AN (∪ denoting set union).
In some illustrative embodiments disclosed as illustrative examples herein, an apparatus comprises an electronic processing device configured to: partition a domain into one or more partition elements Yi each having an associated selection probability πi and an associated normalized proposal distribution
In the following, the terms “optimization”, “minimization”, and similar phraseology are to be broadly construed as one of ordinary skill in the art would understand these terms. For example, these terms are not to be construed as being limited to the absolute global optimum value, absolute global minimum, or so forth. For example, minimization of a function may employ an iterative minimization algorithm that terminates at a stopping criterion before an absolute minimum is reached. It is also contemplated for the optimum or minimum value to be a local optimum or local minimum value.
With reference to
The electronic processing device 10 embodies a sampling module 20 that performs adaptive rejection sampling using techniques disclosed herein. These techniques are disclosed herein with illustrative reference to a natural language parsing application in which a probabilistic context-free grammar (PCFG) is augmented by transversal constraints. All configurations x in the space to be sampled correspond to possible parse trees of the PCFG, and for a given input natural language sentence the PCFG assigns a probability to each configuration x (that is, to each possible parse tree for the sentence). This probability is the product of the conditional probabilities of the rules used for producing the parse tree x. The PCFG is assumed to be normalized in that the sum of the probabilities of all finite parse trees is unity.
Applying the PCFG by itself is straightforward. However, if the PCFG is augmented by some transversal constraint (or constraints), then evaluation is much more difficult. For example, consider an augmentation of the PCFG in which the probability p(x) of a finite parse tree x is defined as the product of (1) the output of the PCFG and (2) a transversal constraint term f(x) which lies in the interval [0,1]. Such a transversal term may represent various constraints upon the parsing. By way of illustrative example, f(x) may provide (probabilistic) enforcement of a certain quality of linguistic agreement between different words in the derivation. As a specific example, the transversal constraint f(x) may bias against (i.e., reduce the probability of) a parse tree in which a singular subject is associated with a plural verb. The transversal constraint f(x) does not completely forbid such an association, but reduces the probability of parse trees including such an association. This is merely an illustrative example, and more generally the transversal constraint f(x) can provide various probabilistic constraints that tune, adapt, or otherwise improve the augmented PCFG for various applications. In particular, since the augmented PCFG is constructed as a product of the output of the PCFG and the transversal constraint, it follows that the transversal constraint f(x) can comprise a product of two or more transversal constraints each representing a different constraint upon the parsing.
Unlike the “pure” PCFG, the augmented PCFG is substantially more difficult to sample from. One difficulty is that, although the PCFG is known to be normalized, there is no guarantee that the augmented PCFG will be normalized (indeed, it most likely will not be normalized). As a result, the computed product of the output of the PCFG and the transversal constraint is not a true probability value. Moreover, computing a normalization factor is typically difficult or impossible due to the very large number of possible parsing trees that can be generated by the augmented PCFG for a given input sentence.
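To make the augmented-PCFG scoring described above concrete, the following toy sketch computes an unnormalized score p(x) as the product of PCFG rule probabilities and a transversal constraint factor f(x) in [0,1]. The miniature grammar, its rule probabilities, and the agreement penalty value are illustrative assumptions only, not part of this disclosure.

```python
# Hypothetical toy example: unnormalized augmented-PCFG score of a parse tree,
# computed as (product of PCFG rule probabilities) * (transversal constraint f(x)).

# PCFG rule probabilities (assumed normalized per left-hand side).
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("dogs",)): 0.6,
    ("NP", ("dog",)): 0.4,
    ("VP", ("bark",)): 0.5,
    ("VP", ("barks",)): 0.5,
}

def pcfg_prob(tree):
    """Product of the conditional probabilities of the rules used in the tree."""
    if isinstance(tree, str):          # terminal leaf
        return 1.0
    head, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = RULE_PROB[(head, rhs)]
    for child in children:
        prob *= pcfg_prob(child)
    return prob

def agreement_constraint(tree):
    """Transversal constraint f(x) in [0, 1]: penalize (but do not forbid) a
    plural subject paired with a singular verb form, and vice versa.
    Assumes the toy tree shape S -> NP VP with single-word constituents."""
    (_, ((_, (subj,)), (_, (verb,)))) = tree
    plural_subj = subj.endswith("s")
    plural_verb = not verb.endswith("s")
    return 1.0 if plural_subj == plural_verb else 0.1

def augmented_score(tree):
    return pcfg_prob(tree) * agreement_constraint(tree)

tree = ("S", (("NP", ("dogs",)), ("VP", ("barks",))))
print(augmented_score(tree))   # 1.0 * 0.6 * 0.5 * 0.1 = 0.03
```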
With continuing reference to
The sampling module 20 performs adaptive iterative rejection sampling on the domain 22 in accordance with a target distribution 24, which is denoted herein without loss of generality as p(x). The target distribution 24 is not, in general, normalized. In the illustrative application the target distribution 24 is the augmented PCFG, which is most likely not a normalized distribution.
The sampling module 20 also receives an initial proposal distribution 26 for use in the rejection sampling, which is denoted herein without loss of generality as
In the illustrative application of sampling an augmented PCFG, a suitable choice for the reference proposal distribution
The sampling module 20 performs adaptive iterative rejection sampling of the domain X 22 in accordance with the unnormalized target distribution p(x) 24 and initially using the normalized (reference) proposal distribution
With continuing reference to
An iteration of the rejection sampling includes an operation 32 in which a partition element is selected from which to draw the sample. This selected partition element is denoted herein as Ysel. In an operation 34, a sample (denoted x) is drawn from the selected partition element Ysel. In an operation 36, this sample x is accepted or rejected based on the target distribution p(x) and the bound QY
With continuing reference to
With reference to
Having provided an overview of embodiments of the adaptive iterative rejection sampling approaches disclosed herein with reference to
The disclosed adaptive rejection sampling approaches are based on the following observation. It is desired to sample from target distribution
The adaptation approaches disclosed herein split the space or domain X 22 into two (or optionally more) disjoint subsets Y1, Y2; for many problems it is straightforward to find nonnegative numbers β1<β and β2<β such that βi is an upper bound for the density ratio ρ(x) over Yi, i=1, 2. Considering β′≡β1
and then sampling x inside Yi with
which implies that β′ is an upper bound over the whole domain X 22 for the density ratios p(x)/
This splitting or partition refinement can be continued recursively as shown in
An advantage of rejection sampling over techniques in the Markov Chain Monte Carlo (MCMC) family of sampling techniques (e.g., Gibbs or Metropolis-Hastings sampling) is that rejection sampling provides exact samples from the target distribution right from the start, rather than converging to exact sampling in the limit. However, rejection sampling can be inefficient in producing these samples if the proposal distribution is a poor approximation of the target distribution. The adaptive rejection sampling approaches disclosed herein retain the property of producing exact samples, but improve efficiency through adaptation of the partition. The disclosed approach advantageously can be applied to arbitrary measurable spaces, whether discrete or continuous, without strong assumptions such as the space being Euclidean, or continuous, or differentiable, or so forth. Because the adaptation is based upon refinement of a partitioning of the domain, the detailed structure and characteristics of the target distribution and its domain do not impact the adaptation.
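The splitting construction described above can be written out explicitly. The following is a sketch in which q denotes the normalized reference proposal and q(Yi) its mass on Yi; this notation, and the specific form taken for β′, are assumptions made for the sketch and are consistent with, but not verbatim from, the description above.

```latex
% Sketch of the bound-improvement argument for a two-way split X = Y_1 \cup Y_2.
% The notation q (normalized reference proposal) and the definition of \beta'
% are assumptions made for this sketch.
\begin{align*}
  &\frac{p(x)}{q(x)} \le \beta_i \le \beta \quad \text{for } x \in Y_i,\ i = 1,2,
   \qquad Y_1 \cap Y_2 = \emptyset,\ \ Y_1 \cup Y_2 = X, \\
  &\beta' \equiv \beta_1\, q(Y_1) + \beta_2\, q(Y_2) \;\le\; \beta, \\
  &q'(x) \equiv \frac{\beta_i\, q(x)}{\beta'} \ \text{ for } x \in Y_i
   \quad\text{(select } Y_i \text{ with probability } \beta_i q(Y_i)/\beta',
   \text{ then draw } x \text{ from } q \text{ restricted to } Y_i\text{)}, \\
  &\frac{p(x)}{q'(x)} \;=\; \frac{\beta'\, p(x)}{\beta_i\, q(x)} \;\le\; \beta'
   \quad \text{for } x \in Y_i ,
\end{align*}
```

so the refined proposal q′ admits the uniformly smaller bound β′≦β over the whole domain X.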
The following notation and observations are utilized and referenced in setting forth some illustrative adaptive rejection sampling embodiments. For any measures μ and μ′ on the same measurable domain or space X, the comparison μ≦μ′ is defined as μ(A)≦μ′(A) for any measurable subset A⊂X. That is to say, μ is uniformly smaller than μ′, or said another way, μ is uniformly dominated by μ′. If μ≦μ′ then μ is absolutely continuous relative to μ′, and therefore the Radon-Nikodym derivative
exists almost everywhere with respect to μ′. It can be shown that μ≦μ′ is then equivalent to the condition
almost everywhere with respect to μ′. Note that if μ and μ′ are discrete, then
is simply the ratio of two numbers
If μ is an arbitrary finite measure on X, then
is a measure and Y is a measurable subset of X, then μ(Y) is used herein to denote the restriction of μ to Y. If μ is a measure on X and γ is a non-negative number, then γμ is used herein to denote the measure μ′ defined by μ′(A)=γμ(A), for A a measurable subset of X.
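Collected for reference, the notational conventions of the preceding paragraph can be summarized as follows. This is a restatement; the discrete-case expression and the superscript form for the restriction are standard conventions assumed here.

```latex
% Measure-theoretic notation used in the text (restated; the superscript form
% for the restriction is an assumed convention).
\begin{align*}
  \mu \le \mu' \;&\Longleftrightarrow\; \mu(A) \le \mu'(A)
      \ \text{for every measurable } A \subseteq X
  \;\Longleftrightarrow\; \frac{d\mu}{d\mu'}(x) \le 1 \ \text{a.e. w.r.t. } \mu', \\
  \frac{d\mu}{d\mu'}(x) \;&=\; \frac{\mu(\{x\})}{\mu'(\{x\})}
      \quad\text{(discrete case)}, \\
  \mu^{(Y)}(A) \;&\equiv\; \mu(A \cap Y)
      \quad\text{(restriction of } \mu \text{ to } Y\text{)}, \\
  (\gamma\mu)(A) \;&\equiv\; \gamma\,\mu(A)
      \quad\text{(scaling by a non-negative number } \gamma\text{)}.
\end{align*}
```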
Given two finite measures p≦Q, and assuming that it is known how to sample from
The probability for a trial to be accepted (that is, the acceptance rate) is then
The acceptance rate is close to unity if Q(X) is not much larger than p(X). The probability that a trial falls in the set A and is accepted is equal to
which implies that the probability that a trial falls in A knowing that it is accepted is equal to
In other words, the samples produced by the algorithm are distributed according to the normalized target distribution p(·)/p(X).
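The acceptance-rate and exactness statements in this passage follow from the standard rejection-sampling computation, which can be sketched as follows for finite measures p≦Q with trials drawn from the normalized proposal Q/Q(X) and accepted with probability dp/dQ(x); this is the textbook derivation, stated here for convenience.

```latex
% Standard rejection-sampling computation for finite measures p <= Q.
\begin{align*}
  \Pr[\text{accept}]
    \;&=\; \int_X \frac{dp}{dQ}(x)\,\frac{dQ(x)}{Q(X)}
    \;=\; \frac{p(X)}{Q(X)} , \\
  \Pr[\,x \in A \text{ and accept}\,]
    \;&=\; \int_A \frac{dp}{dQ}(x)\,\frac{dQ(x)}{Q(X)}
    \;=\; \frac{p(A)}{Q(X)} , \\
  \Pr[\,x \in A \mid \text{accept}\,]
    \;&=\; \frac{p(A)/Q(X)}{p(X)/Q(X)}
    \;=\; \frac{p(A)}{p(X)} ,
\end{align*}
```

so the accepted samples are distributed according to the normalized target p(·)/p(X).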
The adaptive rejection sampling disclosed herein extends the rejection sampling algorithm in accordance with a target distribution
In some adaptive rejection sampling embodiments disclosed herein, all the measures Qj are derived from a single reference measure Q0=q=β
on the whole of X. Given a subset Y of X, β′≦β can be found such that
on Y. A measure q′(Y)≡β′
In one embodiment, the target measure or distribution is denoted by p, and the rejection sampling is in accordance with the distribution
Line 1 of Adaptive Rejection Sampling Algorithm #1 initializes the number of Accepts and of Rejects to 0. Line 2 initializes the partition S to the list (X) (that is, to the entire space or domain 22). Line 3 initializes the partition member selection probability vector π to the list (1) (in other words, sets the partition member selection probability for the single partition member defined in line 2 to unity). On line 4, BX is initialized to a nonnegative number and the probability measure
On line 5, an infinite loop is entered which produces an infinite stream of samples from p. (The infinite loop established by the “while TRUE do” pseudocode of line 5 is merely an illustrative example, and can be replaced by a finite loop having a desired termination criterion; for example, line 5 can be replaced by “while Accepts<100 do” in order to acquire 100 samples in accordance with the normalized target distribution p.) Lines 1-5 of Adaptive Rejection Sampling Algorithm #1 correspond to the initialization operation 30 of the diagrammatic representation shown in
(this number is less than 1 by construction of the bounding pair). Lines 9-10 correspond to the decision operation 36 of
Lines 15-21 of Adaptive Rejection Sampling Algorithm #1 set forth the partition adaptation. Line 15 (corresponding to decision operation 40 of
and (iii) p(A
For a given partition S, the adaptive proposal distribution over the whole of X is
where BS=ΣiBYi
which, for x∈Yi, is equal to
The update of π on lines 20-21 of Adaptive Rejection Sampling Algorithm #1 operates as follows. For clarity, call S′ and π′ the values of the partition and of the probability vector after the split of partition member Yi into split partition members A1 and A2. Before the split, for all Yj in S the following pertains: πj=BY
which is the form shown on line 20. Thus, for j≠i, π′j/πj=BS/BS′=α^−1, and also, for k=1, 2,
which is the form shown on line 21.
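The pseudocode listing of Adaptive Rejection Sampling Algorithm #1 is not reproduced above. The following Python sketch is one possible reading of the line-by-line description, instantiated on a hypothetical discrete domain with a uniform reference proposal, so the target p, the bound construction BY=|Y|·max p over Y, and the halving split are illustrative assumptions rather than the algorithm as claimed.

```python
import random

# Hypothetical unnormalized target on the discrete domain {0, ..., 63}.
def p(x):
    return 0.05 + (x % 7 == 0) * (x / 64.0)

def bounding_pair(lo, hi):
    """For partition element Y = {lo, ..., hi-1} under a uniform reference
    proposal: beta_Y = max_{x in Y} p(x) bounds the density ratio on Y, and
    B_Y = |Y| * beta_Y bounds the target mass of Y (cf. B_Y = beta(Y)q(Y))."""
    beta = max(p(x) for x in range(lo, hi))
    return (hi - lo) * beta, beta

def adaptive_rejection_sampling(n_accepts, min_split_size=2):
    accepts, rejects = 0, 0                              # line 1: counters
    partition = [(0, 64) + bounding_pair(0, 64)]         # lines 2-4: S = (X), pi = (1)
    samples = []
    while accepts < n_accepts:                           # line 5 (finite variant)
        # Line 6: select partition element Y_i with probability pi_i = B_Yi / B_S.
        weights = [elem[2] for elem in partition]
        i = random.choices(range(len(partition)), weights=weights)[0]
        lo, hi, B, beta = partition[i]
        x = random.randrange(lo, hi)                     # line 7: draw x uniformly on Y_sel
        if random.random() <= p(x) / beta:               # lines 8-10: accept w.p. p(x)/beta_Y <= 1
            samples.append(x)
            accepts += 1
        else:                                            # lines 11-21: reject, then refine
            rejects += 1
            if hi - lo >= min_split_size:                # line 15: refinement criterion
                mid = (lo + hi) // 2                     # split Y_sel into A_1 and A_2
                partition[i:i + 1] = [(lo, mid) + bounding_pair(lo, mid),
                                      (mid, hi) + bounding_pair(mid, hi)]
                # Lines 20-21: the selection probabilities pi are not stored
                # explicitly here; they are re-derived each iteration from the
                # stored bounds B_Yi.
    return samples, accepts / (accepts + rejects)

samples, rate = adaptive_rejection_sampling(1000)
print("empirical acceptance rate:", rate)
```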
In Adaptive Rejection Sampling Algorithm #1, the partition S is suitably represented by the leaves of a tree corresponding to the different splits, which guarantees efficient selection (typically logarithmic in the number of leaves) of a given leaf on line 6. As already noted, on line 15 a partition refinement stopping criterion may be incorporated in order to stop all further refinements after a certain empirical acceptance rate has been reached. Such a partition refinement stopping criterion can avoid the storage and computation costs incurred by indefinitely refining the tree, albeit at the cost of having no further improvement thereafter in terms of observed acceptance rates.
In some embodiments of Adaptive Rejection Sampling Algorithm #1, the function BoundingPair(), which obtains the bounding pairs (BY
for x∈X. The notations q≡qX,
is the reference density ratio at x. Further consider a measurable subset Y of X. It is known that β(X) is an upper bound for the density ratios p(x)/
Now define the measure qY≡β(Y)
In other words, the pair (BY,
With reference to
associated with the supremum of the function on this interval, i.e. β(Yi)=supz∈Yi
For a partition S of the domain X that is a refinement obtained at a given stage of the partition adaptation, denote by qS the measure qS=ΣY∈S qY
Lemma 1 is obtained as a consequence of the definition of rejection sampling. Lemma 1 shows that, for a fixed number of trials n, the estimate for the partition function Z=p(X) has a precision which varies with (1−γ)/γ; that is, the precision is poor for small acceptance rates and good for high acceptance rates. Thus, for a fixed level of refinement S, and with qS its associated proposal distribution, the expected acceptance rate γ is equal to p(X)/qS(X). Each time a refinement step (lines 16-20) is performed, qS can only decrease, and therefore γ can only increase. Then, the following conditions on Adaptive Rejection Sampling Algorithm #1 guarantee that γ can reach, in the limit, an arbitrarily high acceptance level γ̂≦1, a property called γ-convergence herein.
The following assumes q is the fixed reference proposal measure 26, and the abbreviation f is used here for the density function dp/d
on Yi, and ε is a nonnegative number, we will say that Yi is ε-tight if βi−f(x)≦ε a.e. wrt q on Yi. Note in particular that q(Yi)=0 implies that Yi is 0-tight, because of the ‘a.e.’ qualification. This leads to Lemma 2: For any depth d, there exists a time t such that, for all Yi∈St, one has: depth(Yi)≧d or Yi is 0-tight. The algorithm respects Condition 1 iff, for any ε1, ε2>0, there exists (with probability 1) a finite time t such that, with St the current refinement at time t, the total q-measure of those Yi∈St which are not ε1-tight is less than or equal to ε2. In other words, this condition says that, with time passing, the q-measure of the union of those elements Yi of St for which βi tightly approximates the density dp/d
The illustrative example of sampling in accordance with an augmented PCFG (that is, a PCFG with transversal constraints) is one in which γ-convergence can be achieved. Consider all the finite derivation (i.e. parse) trees x defined by a certain PCFG G0 (all the leaves of x are terminals of the grammar), and assign to each x its PCFG probability q(x), i.e. the product of the conditional probabilities of the rules used for producing x. Assume further that the PCFG G0 is normalized in that the sum of the probabilities of all finite derivations is 1. Consider now the following augmentation G of G0: the grammar G has the same rules and derivations as G0, but the score p(x) of a finite derivation x is defined as the product of q(x) and of a “transversal constraint” term f(x) which lies in the interval [0,1]. Such a term may be useful for representing different things, such as for instance the quality of the linguistic agreement between different words in the derivation, where perhaps we want to discourage a singular subject from being associated with a plural verb, without completely forbidding it.
Suppose now that it is desired to sample derivations according to the target (unnormalized) distribution p(x). The equality
Further consider the following adaptive rejection sampling algorithm with G: each Y corresponds to a complete or partial derivation tree in the grammar; if a
The adaptive rejection sampling approaches disclosed herein have been applied to the problem of sampling Ising (or spin-glass) models. These models, which are studied in the physics literature, are a special case of undirected probabilistic graphical models. Given a graph G=(V, E) containing n=|V| nodes and a set E of edges encoding pairwise interactions, Ising models are defined on X={−1,+1}^n with a density proportional to exp(f(x)) where
The parameters ui∈ℝ, i=1, . . . , n, and vij∈ℝ, (i,j)∈E, are assumed to be known. To sample an Ising model using the disclosed adaptive rejection sampling approach, we used the cutset conditioning idea: conditioned on the value of a variable, say xi, the sampling of the remaining variables x−i is slightly simpler, since conditioning has the effect of removing all the edges that are connected to i. Hence, the split operations correspond to choosing an index i of a highly connected variable, and to using an upper bound on the log-partition function of the Ising model with the variable i removed. To obtain local upper bounds on the conditional distributions, we use a minimal energy spanning tree on the unknown variables and set the non-covered edges to the maximal possible value of the interaction.
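As a concrete illustration of obtaining a per-element bound for an Ising model when some variables have been clamped by the conditioning splits, the sketch below uses a crude term-wise bound in which each term involving an unfixed variable is replaced by its absolute value. This stands in for, and is weaker than, the spanning-tree bound described above, and the example parameters are hypothetical.

```python
def ising_log_density(x, u, v):
    """f(x) = sum_i u_i x_i + sum_{(i,j) in E} v_ij x_i x_j for x in {-1,+1}^n;
    the unnormalized density is exp(f(x))."""
    f = sum(u[i] * x[i] for i in range(len(u)))
    f += sum(vij * x[i] * x[j] for (i, j), vij in v.items())
    return f

def log_upper_bound(fixed, u, v):
    """Upper bound on f(x) over completions of a partition element in which the
    variables in `fixed` (a dict index -> +1/-1) are clamped.  Each term that
    involves an unfixed variable is bounded by its absolute value; this term-wise
    bound is cruder than the spanning-tree bound described in the text."""
    bound = 0.0
    for i in range(len(u)):
        bound += u[i] * fixed[i] if i in fixed else abs(u[i])
    for (i, j), vij in v.items():
        if i in fixed and j in fixed:
            bound += vij * fixed[i] * fixed[j]
        else:
            bound += abs(vij)
    # exp(bound) upper-bounds exp(f(x)) on this element, hence the density ratio
    # relative to a counting-measure reference proposal.
    return bound

# Hypothetical 3-spin chain.
u = [0.5, -0.2, 0.1]
v = {(0, 1): 0.8, (1, 2): -0.3}
print(ising_log_density([+1, -1, +1], u, v))   # 0.3
print(log_upper_bound({0: +1}, u, v))          # 1.9
```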
With reference to
With reference to
It will be appreciated that various of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. It will also be appreciated that various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art, which are also intended to be encompassed by the following claims.
“A Brief Introduction to Graphical Models and Bayesian Networks,” http://www.cs.berkeley.edu/˜murphyk/Bayes/bayes.html, retrieved from the Internet on Oct. 14, 2011, pp. 1-19.
Gilks, et al., “Adaptive Rejection Sampling for Gibbs Sampling,” Appl. Statist., vol. 41, No. 2, pp. 337-348 (1992).
Görür, et al., “Concave Convex Adaptive Rejection Sampling,” Technical Report, Gatsby Computational Neuroscience Unit, pp. 1-16 (2008).
Hart, et al., “A Formal Basis for the Heuristic Determination of Minimum Cost Paths,” IEEE Transactions on Systems Science and Cybernetics, vol. SSC-4, No. 2, pp. 100-107 (Jul. 1968).
Jordan, et al., “An Introduction to Variational Methods for Graphical Models,” Machine Learning, vol. 37, pp. 183-233 (1999).
Propp, et al., “Exact Sampling with Coupled Markov Chains and Applications to Statistical Mechanics,” Department of Mathematics, Massachusetts Institute of Technology, pp. 1-27 (Jul. 16, 1996).
Wainwright, et al., “Tree-reweighted Belief Propagation Algorithms and Approximate ML Estimation by Pseudo-moment Matching,” pp. 1-8 (2003).
Wetherell, “Probabilistic Languages: A Review and Some Open Questions,” Computing Surveys, vol. 12, No. 4, pp. 1-19 (Dec. 1980).
Yedidia, et al., “Generalized Belief Propagation,” pp. 1-7 (2001).
Mansinghka, et al., “Exact and Approximate Sampling by Systematic Stochastic Search,” pp. 1-8 (2009).
Schofield, “Fitting maximum-entropy models on large sample spaces,” Ph.D. Thesis submitted to the Department of Computing, Imperial College London, pp. 1-137 (2007).
Snyder, “Unsupervised Multilingual Learning,” submitted to the Department of Electrical Engineering and Computer Science, pp. 1-242 (2010).
Andrieu, et al., “A tutorial on adaptive MCMC,” Statistics and Computing, vol. 18, pp. 343-373 (2008).
Chiang, “Hierarchical Phrase-Based Translation,” Computational Linguistics, vol. 33, No. 2, pp. 201-228 (2007).