The following relates to the sampling arts, optimization arts, to applications of sampling such as sampling of hidden Markov models (HMMs), natural language processing (NLP) systems employing probabilistic context free grammars (PCFGs) augmented by constraints, and so forth.
A diverse range of problems can be formulated in terms of sampling of a space or domain (represented herein without loss of generality as X, a sample of which may be denoted as x) in accordance with a target distribution, which is represented herein without loss of generality as p(x), and which may or may not be normalized. A known approach for performing such sampling is a technique called rejection sampling. In this approach, sampling is performed in accordance with a proposal distribution, which is represented herein without loss of generality as q(x) and which upper-bounds the target distribution, i.e. q(x)≥p(x) for all x.
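By way of a non-limiting illustration, basic rejection sampling as just described can be sketched as follows in Python; all names here are illustrative, and the sketch assumes a finite discrete space and a proposal q that dominates p:

```python
import random

def rejection_sample(p, q, draw_from_q, n_accepted, seed=0):
    """Basic rejection sampling (a sketch; all names are illustrative).

    Assumes q(x) >= p(x) everywhere, i.e. the proposal dominates the
    target; neither function need be normalized. `draw_from_q` draws a
    sample x in accordance with q.
    """
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < n_accepted:
        x = draw_from_q(rng)
        # Accept x with probability p(x)/q(x); domination keeps this <= 1.
        if rng.random() < p(x) / q(x):
            accepted.append(x)
    return accepted
```

The acceptance rate of such a sampler degrades as q rises above p, which motivates the refinement techniques discussed below.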
However, in practice it can be difficult to obtain a proposal distribution q(x) that remains an upper bound on the target distribution p(x) while staying close enough to it to keep the rejection rate acceptably low.
In adaptive rejection sampling (ARS), the rejected samples are used to improve the proposal distribution. ARS assumes that the target distribution p(x) is concave, in which case a tangent line at any given point on the target distribution is guaranteed to define an upper bound. This concavity aspect is used in ARS to refine the proposal distribution.
Görür et al., “Concave Convex Adaptive Rejection Sampling”, Technical Report, Gatsby Computational Neuroscience Unit (2008) (hereinafter “Görür et al.”) discloses an improved ARS that is applicable to distributions whose log densities can be expressed as a sum of concave and convex functions, which expands the scope of applicability of ARS. Like conventional ARS, the approach of Görür et al. is generally limited to a target distribution p(x) that is continuous in one dimension. This is a consequence of reliance upon piecewise linear upper bounds that are refined based on rejected samples and that are assured of being upper bounds on account of the continuous curvature between the end points. Such techniques are difficult or impossible to adapt to more difficult problems in which the target distribution p(x) is multi-dimensional, and/or discrete, and/or highly discontinuous, or so forth.
Optimization is generally viewed as a problem that is separate and distinct from the sampling problem. Sampling endeavors to obtain a set of data points that is representative of (or, alternatively, in accordance with) a density function or distribution. In contrast, optimization endeavors to locate the maximum value of a function, which may or may not be a density function or distribution. The goal of optimization may be to find the highest value of the function, i.e. pmax=p(x0), or to find the spatial location x0 in the space X at which that maximum is attained.
Some functions may be optimized analytically, e.g. by finding the point where the derivative of the function goes to zero. More commonly, optimization employs iterative approaches, such as the gradient descent method or the Levenberg-Marquardt algorithm. In principle, sampling can be employed for optimization, for example by a Monte Carlo approach in which samples are acquired and used to estimate the maximum value. Such sampling approaches are approximate, and the error is generally expected to roughly correlate with the sample size.
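A Monte Carlo estimate of a maximum of the kind just mentioned can be sketched as follows (an illustrative sketch with assumed names, over a one-dimensional interval); note that the estimate is always a lower bound on the true maximum, with error shrinking as the sample count grows:

```python
import random

def monte_carlo_max(f, low, high, n_samples, seed=0):
    """Estimate the maximum of f on [low, high] from n uniform samples
    (a sketch; names are illustrative).

    The returned estimate never exceeds the true maximum, and the gap
    is generally expected to shrink as the sample size grows.
    """
    rng = random.Random(seed)
    return max(f(rng.uniform(low, high)) for _ in range(n_samples))
```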
In some illustrative embodiments disclosed as illustrative examples herein, a non-transitory storage medium stores instructions executable by an electronic data processing device to perform rejection sampling to acquire at least one accepted sample of a function or distribution p over a space X in which a proposal distribution q(n) used in the rejection sampling is refined responsive to rejection of a sample x*∈X obtained in a current iteration of the rejection sampling to generate a refined proposal distribution q(n+1) for use in a next iteration of the rejection sampling wherein the refined proposal distribution q(n+1) is selected to satisfy the criteria p(x)≤q(n+1)(x)≤q(n)(x) and q(n+1)(x*)<q(n)(x*). In some embodiments the rejection sampling obtains the sample x* by random sampling of the space X, the rejection sampling accepts or rejects x* based on comparison of a ratio p(x*)/q(x*) with a random draw, and the refined proposal distribution q(n+1) is selected to satisfy the criteria: p(x)≤q(n+1)(x)≤q(n)(x) and q(n+1)(x*)<q(n)(x*) and a norm ∥q(n+1)∥α is minimized where α<∞. In some such embodiments α=1. In some embodiments the rejection sampling obtains the sample x* such that q*=q(n)(x*) maximizes q(n) over the space X, the rejection sampling accepts or rejects x* based on a difference between or ratio of q* and p(x*), and the refined proposal distribution q(n+1) is selected to satisfy the criteria: p(x)≤q(n+1)(x)≤q(n)(x) and q(n+1)(x*)<q(n)(x*) and a norm ∥q(n+1)∥∞=max{q(n+1)(x)} is minimized. In some embodiments the non-transitory storage medium stores instructions executable by an electronic data processing device to perform said rejection sampling in one of two selectable modes: sampling mode and optimization mode.
In some illustrative embodiments disclosed as illustrative examples herein, an apparatus comprises a non-transitory storage medium as set forth in the immediately preceding paragraph and an electronic data processing device configured to execute instructions stored on the non-transitory storage medium.
In some illustrative embodiments disclosed as illustrative examples herein, a method comprises, using an electronic data processing device, performing a current iteration (n) of rejection sampling including: obtaining a sample x* from a space X; accepting or rejecting the sample x* based on comparison of q(n)(x*) and p(x*) where q(n) is a proposal distribution over the space X for the current iteration (n) of the rejection sampling and p is a function or distribution over the space X; and refining the proposal distribution q(n) to generate a refined proposal distribution q(n+1) over the space X satisfying the criteria: p(x)≤q(n+1)(x)≤q(n)(x) for all x∈X; q(n+1)(x*)<q(n)(x*); and a norm ∥q(n+1)∥α is minimized.
With reference to
With reference to
The OS* algorithm 10 performs rejection sampling. To initiate, an initial proposal distribution 20 is selected. The initial proposal distribution is denoted q(0)(x) and is defined for samples x∈X. In an operation 22, a sample x* is obtained. In the case of sampling mode, the sample x* is obtained by random sampling. In the case of optimization mode, the sample x* is chosen to maximize q(0)(x) (for the first iteration), or more generally to maximize q(n)(x) (for iteration (n)). In an operation 24, the sample x* is accepted or rejected. The choice of acceptance criterion depends upon whether the OS* algorithm 10 is operating in sampling mode or optimization mode. In the case of sampling mode, a suitable acceptance criterion is based on comparison of the ratio p(x*)/q(x*) with a random draw. (Here, the shorthand notation q=q(n) denotes the proposal distribution for the current iteration of the rejection sampling.) The random draw can, for example, be a random draw from the normalized uniform probability distribution U[0,1], which has uniform value between zero and one and for which ∫0^1 U[0,1](v)dv=1. In the case of optimization mode, a suitable acceptance criterion is based on a difference between or ratio of q* and p(x*), where the shorthand q*=q(x*) denotes the maximum value of q over the space X.
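The two acceptance criteria of operation 24 can be sketched as follows (an illustrative sketch; function names and the threshold parameter are assumptions, not part of the disclosure):

```python
import random

def accept_sample(p_x, q_x, rng=random):
    """Sampling-mode test: accept with probability p(x*)/q(x*) by
    comparing the ratio with a uniform random draw (a sketch; q must
    dominate p so the ratio is at most 1)."""
    return rng.random() < p_x / q_x

def accept_optimize(p_x, q_star, eps):
    """Optimization-mode test: accept when the gap between q* = max q
    and p(x*) falls below the threshold eps (illustrative)."""
    return (q_star - p_x) < eps
```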
In an operation 26, if the sample x* is accepted (in operation 24) then a history is updated to include the sample x*. In sampling mode, this entails adding the sample x* to a set of accepted samples. In optimization mode, only one sample is ultimately accepted, namely the first sample encountered for which the difference between or ratio of q* and p(x*) satisfies a maximum threshold ε, i.e. (q*−p(x*))<ε. (As will be further explained herein, that maximum threshold ε will define an error metric for the optimized value of p).
On the other hand, in an operation 30, if the sample x* is rejected (in operation 24) then the proposal distribution q(n) is refined to generate a refined proposal distribution q(n+1) for use in the next iteration of the rejection sampling. The refined proposal distribution q(n+1) is selected to satisfy the following criteria: p(x)≤q(n+1)(x)≤q(n)(x) (where the first inequality ensures that q(n+1) remains an upper bound on p(x) and the second inequality ensures that the refined proposal distribution q(n+1) is no worse than q(n) at any point in the space X); q(n+1)(x*)<q(n)(x*) (which ensures that the refined proposal distribution q(n+1) is strictly better than q(n) at the rejected sample x*); and a norm ∥q(n+1)∥α is minimized. The value of α in this third criterion depends on the operational mode. For sampling, α<∞ and more preferably α=1. In this case, choosing q(n+1) to minimize the L1 norm ∥q(n+1)∥1 lowers the overall mass of the refined proposal distribution q(n+1) as much as possible. For optimization, α=∞, which takes advantage of the equivalency ∥q(n+1)∥∞=max{q(n+1)}. Thus, minimizing ∥q(n+1)∥∞ lowers the maximum value of the refined proposal distribution q(n+1) as much as possible.
The operations 22, 24 and operation 26 (for acceptance) or operation 30 (for rejection) form one iteration of the rejection sampling. In a decision 32, it is determined whether the stopping criterion is met. In the case of sampling, a plurality of accepted samples are to be determined in accord with the distribution p(x) (in its normalized form, i.e. p(x)/∫X p(x)dμ(x)), and so a suitable stopping criterion is that a desired number of samples has been accepted. In the case of optimization, the stopping criterion is met upon acceptance of a sample, since acceptance implies that the gap between q* and p(x*) is below the threshold ε.
Further disclosure, including some conceptual bases for the OS* algorithm, is set forth in the following.
Suppose that μ is a base measure on a space X and that p is an L1 nonnegative function on (X,μ), i.e. ∫X p(x)dμ(x)<∞, and let us define p̄(x)≡p(x)/∫X p(x)dμ(x). The function p can then be seen as an unnormalized density over X, and p̄ as its normalized form.
To maximize the acceptance rate, the q curve should be made as low as practicable while keeping it above the p curve. Toward this end, adaptive rejection sampling (ARS) techniques have been developed. See Gilks et al., “Adaptive rejection sampling for Gibbs sampling”, Applied Statistics, pages 337-348 (1992); Görür et al., “Concave convex adaptive rejection sampling”, Technical Report, Gatsby Computational Neuroscience Unit (2008). In ARS, at certain stages of the process the q curve is updated to a lower curve q′ with a better acceptance rate. These techniques have predominantly been applied to the case where X is the one-dimensional real line and where p is a concave, log-concave or piecewise log-concave curve, in which case it is possible to exploit convexity properties to approximate p progressively more closely by an upper bound consisting of a piecewise linear envelope.
In the OS* algorithm described with reference to
With reference to
A better generic way to find a q′ is the following. Suppose that a finite set of “one-step refinement actions” aj are available, depending on q and x2, which are able to move from q to a new qj′=aj(q,x2) such that for any such aj one has p(x)≤qj′(x)≤q(x) everywhere on X and also qj′(x2)<q(x2). Then from among these available refinement actions the “best” refinement aj is chosen. For sampling, this “best” refinement is suitably the one for which the L1 norm of qj′ is minimal among the possible j's, or in other words, such that ∫X qj′(x)dμ(x) is minimal in j. With this selection, the acceptance rate of qj′ (which depends directly on ∥qj′∥1) is improved as much as possible, while (i) not having to explore too large a space of possible refinements (assuming that the set {aj} is reasonably small), and (ii) moving from a representation for q to an only slightly more complex representation for qj′, rather than to a much more complex representation for a q′ that could result from exploring a larger space of possible refinements for q.
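The selection of the minimal-L1 one-step refinement can be sketched as follows over a small finite space (an illustrative sketch; the action set, space, and names are assumptions for the example):

```python
def best_refinement(actions, q, x_rejected, space):
    """Pick, among one-step refinements a_j(q, x*), the refined proposal
    of minimal L1 mass over a finite space X (a sketch; names are
    illustrative). Each action is assumed to return q' satisfying
    p <= q' <= q everywhere and q'(x*) < q(x*)."""
    candidates = [a(q, x_rejected) for a in actions]
    # Minimal L1 norm over the finite space = minimal total mass.
    return min(candidates, key=lambda qf: sum(qf(x) for x in space))
```

For instance, a deep narrow cut at the rejected point may beat a shallow cut spread over several points, because the former removes more total mass.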
Said otherwise, in the OS*S operation of
With reference to
The following observation is made. Suppose that the distance between q(x1) and p(x1) is smaller than ε. Then it follows that the distance between q(x1) and pmax is also smaller than ε. This is because pmax is sandwiched between p(x1) and q(x1): on one hand p(x1)≤pmax by definition of the maximum, and on the other hand pmax=p(xmax)≤q(xmax)≤q(x1), since q dominates p and x1 maximizes q. This can be seen graphically in
In the case of x1 in
To summarize the optimization (OS*O) operation: when xmax is the location in X of the maximum pmax of p, and x1 is the location in X of the maximum qmax of q, then the (unknown) distance |q(x1)−pmax| is smaller than or equal to the “gap” |q(x1)−p(x1)|, which is a known quantity. It can also be stated that the (also unknown) distance |pmax−p(x1)| is smaller than or equal to the known gap |q(x1)−p(x1)|. So, one can say that the maximum of p is qmax with an error metric |q(x1)−p(x1)|. Alternatively, one can say that the maximum of p is p(x1) with the error metric |q(x1)−p(x1)|. By splitting the difference, one can also say that the maximum of p is (qmax+p(x1))/2±|q(x1)−p(x1)|/2. If the error is unacceptably large, then the sample x1 is rejected, a refinement proposal q′ is generated, and its maximum x2 is computed. If the gap |q(x2)−p(x2)| is small enough, the point x2 is accepted; otherwise the process continues to iterate.
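The split-the-difference estimate just described can be sketched numerically as follows (an illustrative sketch; the function name and inputs are assumptions):

```python
def pmax_estimate(q_at_x1, p_at_x1):
    """Bracket the unknown maximum of p via the sandwich
    p(x1) <= pmax <= q(x1), where x1 = argmax of q (a sketch).
    Returns the midpoint estimate and its half-gap error bound."""
    gap = q_at_x1 - p_at_x1
    return (q_at_x1 + p_at_x1) / 2.0, gap / 2.0
```

Any true value of pmax within the bracket then lies within the returned error bound of the returned estimate.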
While sampling and optimization are usually seen as two different and distinct tasks, as disclosed herein they can actually be viewed as two extremities of a continuous range, when considered in the context of Lp spaces. Roughly speaking, if (X,μ) is a measure space, and if ƒ is a real-valued function on this space, one defines the Lp norm ∥ƒ∥p, for 1≤p<∞, as:
∥ƒ∥p ≡ (∫X |ƒ(x)|^p dμ(x))^(1/p)   (1)
with the L∞ norm ∥ƒ∥∞ defined as:
∥ƒ∥∞ ≡ inf{C≥0 : |ƒ(x)|≤C for almost every x}   (2)
where the right term is called the essential supremum of |ƒ|, and can be thought of roughly as the “max” of the function. So, with some abuse of language, one can write: ∥ƒ∥∞ = max_x |ƒ(x)|.
The space Lp, for 1≤p≤∞, is then defined as being the space of all functions ƒ for which ∥ƒ∥p<∞. Under the condition that ∥ƒ∥p<∞ for some p<∞, it follows that: lim(p→∞) ∥ƒ∥p = ∥ƒ∥∞.
In the following, the notation Lα is used for the norm index, rather than the more conventional Lp, in order to avoid confusion between the norm index and the target distribution p on which sampling or optimization is performed.
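For a finite discrete space, the Lα norms underlying the two modes can be sketched as follows (illustrative; the discrete sum replaces the integral of Equation (1), and α=∞ returns the maximum per Equation (2)):

```python
def l_alpha_norm(weights, alpha):
    """Discrete L-alpha norm of a nonnegative weight vector (a sketch).
    alpha = float('inf') returns the maximum, matching the equivalency
    exploited by the optimization mode."""
    if alpha == float("inf"):
        return max(weights)
    return sum(w ** alpha for w in weights) ** (1.0 / alpha)
```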
The standard notion of rejection sampling is obtained by performing the OS* algorithm of
In the case α=∞, we will say that we are sampling relative to L∞(X,μ) if ƒ∈L∞(X,μ) and if we perform optimization relative to ƒ; more precisely, if for any ε>0, we are able to find an x such that |∥ƒ∥∞−ƒ(x)|<ε.
The general design for performing the OS* algorithm of
The following Algorithm 1 presents pseudo-code for performing the OS* algorithm selectably for either sampling or optimization:
Algorithm 1 parallels the OS* algorithm shown in
On entry into Algorithm 1, we assume that we are either in sample mode or in optimize mode, and also that we are starting from a proposal q which (1) dominates p and (2) from which we can sample or optimize directly. We use the terminology OS-Sample to represent either one of these cases, where OS-Sample x: q refers to sampling an x according to the proposal q or optimizing x on q (namely finding an x which is an argmax of q), according to the situation. On Algorithm 1 line (1), h refers to the history of the sampling so far, namely to the set of attempts x1, x2, . . . that have been made so far, each being marked for acceptance or rejection (in the case of sampling, this is the usual notion; in the case of optimization, all but the last attempt will be marked as rejections). (In the OS* algorithm of
On Algorithm 1 line (3), the ratio r is computed, and then on line (4) we decide to accept x or not based on this ratio; in optimization mode, we accept x if the ratio is close enough to 1, as determined by a threshold; in sampling mode, we accept x based on a Bernoulli trial of probability r. On line (5), the history is updated by recording the trial x and whether it was accepted or not (or, alternatively, line (5) can be performed only for accepted samples). If x was rejected (Algorithm 1 line (6)), then on line (7), a refinement of q is performed.
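The flow of Algorithm 1 over a small finite space can be sketched as follows (an illustrative sketch, not the disclosed pseudo-code itself; the finite space, the threshold handling, and all names are assumptions made for the example):

```python
import random

def os_star(p, q, refine, mode, space, eps=1e-3, max_iter=1000, seed=0):
    """A sketch of Algorithm 1 over a small finite space (names are
    illustrative). q maps x -> weight and must dominate p; `refine`
    returns a lowered proposal q' with p <= q' <= q everywhere and
    q'(x*) < q(x*) at the rejected x*."""
    rng = random.Random(seed)
    history = []
    for _ in range(max_iter):
        if mode == "sample":
            # OS-Sample: draw x with probability proportional to q.
            left, x = rng.random() * sum(q(v) for v in space), space[0]
            for cand in space:
                left -= q(cand)
                if left < 0:
                    x = cand
                    break
        else:
            x = max(space, key=q)  # optimize: x* = argmax of q
        r = p(x) / q(x)            # line (3): the acceptance ratio
        if (mode == "sample" and rng.random() < r) or \
           (mode == "optimize" and 1.0 - r < eps):
            history.append(x)      # line (5): record the accepted trial
            if mode == "optimize":
                return x, history
        else:
            q = refine(q, x)       # line (7): refinement on rejection
    return history if mode == "sample" else (None, history)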
In the following, some illustrative applications of the OS* algorithm are set forth. These are merely illustrative examples, and it is to be understood that the OS* algorithm disclosed herein is suitably employed in substantially any application that utilizes sampling, optimization, or both.
In the following, some natural language processing (NLP) applications are described. Numerous NLP problems entail optimizing and/or sampling from a complex objective function p. Some examples include: efficient decoding and sampling with high-order hidden Markov models (HMMs); combination of a probabilistic context-free grammar (PCFG) with a complex finite-state automaton (FSA) for applications such as tagging by joint usage of a PCFG and an HMM tagger or hierarchical translation with a complex target language model; parsing in the presence of non-local features; implementing PCFGs with transversal constraints or probabilistic unification grammars; and so forth. For illustrative purposes, the following concentrates on (i) decoding and sampling with high-order HMMs, and (ii) combining a PCFG with a complex finite-state language model.
Considering first the optimization case, suppose, for simplicity, that we want to decode (i.e. optimize) with a bigram HMM, where the hidden layer consists of a string x of English words, and where each word xi in the string is associated with an acoustic observation oi. Thus, each bigram xi−1xi in the hidden layer contributes a factor w2(xi|xi−1)≡p(xi|xi−1)p(oi|xi). We are then trying to find a string x=x1, . . . , xn that maximizes the quantity:
Let us just write p(x)≡Πi w2(xi|xi−1); we are then trying to maximize p(x) over word strings of length n. We now introduce the following notion. For a given word xi in the vocabulary, let us define w1(xi)≡max w2(xi|xi−1),
where the max is over all possible words xi−1 in the vocabulary, and which we call the “max backoff”, or “optimistic backoff”, of the set of bigrams whose last word is xi. The application of OS*O to this setup is now as follows. We define the initial proposal q(1) as q(1)(x)≡Πi w1(xi).
We see that q(1)(x)≥p(x) for all x, meaning that q(1) dominates p over the space of strings, as required of a proposal. Note that q(1) does not necessarily correspond to a normalized distribution, and typically Σx q(1)(x)>1.
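The max backoffs and the initial proposal weight can be sketched as follows (an illustrative sketch; the dictionary representation of the bigram weights is an assumption made for the example):

```python
def max_backoffs(w2):
    """Unigram 'max backoff' w1(x) = max over contexts c of w2(x|c)
    (a sketch; w2 maps (context, word) pairs to weights)."""
    w1 = {}
    for (ctx, word), wt in w2.items():
        w1[word] = max(w1.get(word, 0.0), wt)
    return w1

def q1_weight(words, w1):
    """Initial proposal q(1): product of unigram max backoffs, which
    by construction dominates the true bigram product p(x)."""
    total = 1.0
    for w in words:
        total *= w1[w]
    return total
```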
With reference to
Now, suppose for the sake of the example that q(1)(x(1))=0.06, with w1(dog)=0.2, but that p(x(1))=0.00005, with in particular the true bigram weight of dog in this context being w2(dog|two)=0.0002≪w1(dog), while the other bigram weights w2 in x(1) are not very different from their optimistic unigram backoffs w1. Based on this observation of a large difference between w2(dog|two)=0.0002 and w1(dog)=0.2, the sample is rejected and the proposal q(1) is refined into q(2), where the only difference is that q(2) now “takes into account” the bigram two dog, with its contextual weight w2(dog|two).
With reference to
Now, the q(2) proposal corresponding to this new deterministic automaton is defined as the function that assigns, to a string of words x1x2x3x4 in the graph, the weight obtained by multiplying the weights associated with the edges traversed by the unique path associated with this string. In particular q(2)(the two dog barked) is equal to w1(the)·w1(two)·w2(dog|two)·w1(barked), which is smaller than w1(the)·w1(two)·w1(dog)·w1(barked).
At this point, we decode again—through standard Viterbi-type dynamic programming decoding—using the automaton q(2), and it is possible that the optimum path x(2) relative to q(2) is now different from x(1); for example, it may be the path highlighted in the figure: the two dogs barked.
We then iterate the process with this new path, comparing its true weight p(x(2)) to its q(2) weight. If the difference is above a threshold, we refine again; this can be done by identifying, among the words making up x(2), which one, if its context were extended by one word, would correspond to the largest decrease in the value of q(2)(x(2)). Thus, we compare the ratios w2(two|the)/w1(two), w2(dogs|two)/w1(dogs), w2(barked|dogs)/w1(barked), and determine the smallest such ratio—which could be said informally to correspond to the “most violated constraint”, or to the “largest disappointment”—and decide to refine the corresponding context by one more word. This can be done similarly to what we have done previously, namely by adding one more state to the automaton, copying some edges, and decreasing the weight of one of the edges.
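The selection of the “largest disappointment” along a path can be sketched as follows (an illustrative sketch; the dictionary inputs and names are assumptions for the example):

```python
def largest_disappointment(path, w1, w2):
    """Find the position on the path whose contextual weight w2 falls
    furthest below its max backoff w1, i.e. the smallest ratio w2/w1
    (a sketch; this 'most violated constraint' is refined next)."""
    best_pos, best_ratio = None, float("inf")
    for i in range(1, len(path)):
        ratio = w2[(path[i - 1], path[i])] / w1[path[i]]
        if ratio < best_ratio:
            best_pos, best_ratio = i, ratio
    return best_pos, best_ratio
```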
The procedure that we have sketched for a bigram HMM can be directly extended to any n-gram HMM. For instance, if p is a 5-gram HMM, then we can introduce recursively the following max backoffs:
In this general case, when refining the current best path in the automaton q(k), we might have the choice between expanding, on this path, the context of an existing unigram, bigram, trigram, or even of an existing quadrigram, and we could then choose to expand the one that results in the largest “disappointment”.
We stop the refinement process when the ratio p(x(k))/q(k)(x(k)) is closer to 1 than an arbitrary threshold, or even, if we prefer, when p(x(k))/q(k)(x(k))=1, which will be reached at a certain point because we cannot introduce in the automaton more states than there are in the true FSA corresponding to p.
It is noteworthy that, typically, when we stop on the automaton q(k) with the result x(k), we will have introduced many fewer states than would have been necessary to do Viterbi decoding with the automaton underlying the true distribution p. One way to get a feeling for that is the following informal observation. Suppose that the p model is a 7-gram model. Suppose that the true optimum string according to p, relative to an acoustic sequence of 11 words, is the sentence x=The best way to predict the future is to invent it (attributed to Alan Kay), but also suppose—a supposition that is not unrealistic—that even limiting oneself to w3 backoffs, for all other 11-word sentences x′ one has w3(x′)<p(x)=w7(x). Let us then consider an approach that, when it decides which n-gram to refine in a sequence, limits its choices to lower-order n-grams before considering refinement of some higher-order n-gram in the string. In such a situation, the algorithm will never produce any m-gram where m>3 and where this m-gram is not a substring of x. This is because, if the algorithm were ever to produce such an m-gram, this would mean that it would have produced at some point q(k) a sequence x(k) different from x, but for which the w3 optimistic estimate would have been larger than q(k)(x), which would imply that this estimate would have been larger than p(x), a contradiction. Thus, in the situation where trigram-level optimistic estimates of competitors of x are already worse than the true value of x, the algorithm disclosed herein does not need to take into account many high-order states that the standard Viterbi decoder has to explicitly represent.
The sampling algorithm for high order HMMs is analogous to the optimization version. In the optimization case, it was assumed all along that we were able to find the maximum path in the q(k) automaton, via the usual dynamic programming procedure. In substance, starting from the final state, this procedure computes, for each state, the maximum weight of a path connecting this state to the final state, and iterates this procedure for states farther and farther from the end state. The main difference in sampling is that, instead of finding the maximum path from a given state to the final state, we need to compute the sum of the weights of all the paths connecting this state to the final state, and iterate the procedure similarly to the previous case. Formally, we are operating now in the sum-product semiring while in the optimization case we were operating in the max-product semiring (of which the log version is called the tropical semiring), but otherwise, the procedures are the same. Once these sums have been computed on all the states of the q(k) automaton, they can be used directly to sample from the automaton, by moving forward in the standard way. The refinements are then performed on rejects from the rejection sampler with q(k), up to the time the acceptance rate becomes acceptable, i.e. above a certain reasonable threshold. If x is the rejected string, one selects, as in the optimization case, one of the n-grams for contextual refinement. While a possibility is to select this n-gram based on how much this choice decreases the value of q(k+1)(x) relative to q(k)(x), another possibility is to select it based on how much the choice decreases the overall mass of q(k+1)(x) relative to that of q(k)(x), which is more in line with the L1 objective that is most relevant for sampling. 
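The contrast between the max-product and sum-product backward computations just described can be sketched as follows on an acyclic weighted automaton (an illustrative sketch; the edge-list representation and names are assumptions for the example):

```python
def backward_values(edges, final_state, semiring):
    """Backward values over an acyclic weighted automaton (a sketch).
    'max' (max-product) yields the Viterbi value used in optimization;
    'sum' (sum-product) yields the path masses needed to sample by
    moving forward. edges maps state -> list of (weight, next_state)."""
    memo = {final_state: 1.0}
    def value(state):
        if state not in memo:
            parts = [w * value(t) for w, t in edges[state]]
            memo[state] = max(parts) if semiring == "max" else sum(parts)
        return memo[state]
    return value
```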
Once we have found a q(k) for which the observed acceptance rate (or more precisely, the cumulative observed acceptance rate for all the attempts done until this point, with all the refined automata up to the current q(k)) is above a threshold, we stop the refinements, and use this final automaton for sampling an arbitrary number of times.
As another example, the application of the OS* algorithm for intersecting PCFG's with high-order LM's is described. This is a component sometimes used in statistical machine translation (SMT), for example.
A standard result of formal language theory is that the intersection of a context-free grammar (CFG) with a finite state automaton (FSA) is a CFG. This construct can be generalized to the intersection of a Weighted CFG (WCFG) with a Weighted FSA (WFSA) (see e.g. Nederhof et al., “Probabilistic parsing as intersection”, in Proc. 8th Int'l. Workshop on Parsing Technologies, 2003), resulting in a WCFG. In our case, this entails optimizing and sampling from the intersection p of a PCFG G with a complex WFSA A representing a high-order language model (LM). For illustration purposes, it is assumed here that A is a trigram language model, but the description can readily be transposed to higher-order cases.
Let us denote by x a derivation in G, and by y=y(x) the string of terminal leaves associated with x (the “yield” of the derivation x). The weighted intersection p of G and A is defined as p(x)≡G(x)·A(x), where A(x) is a shorthand for A(y(x)). Due to the intersection result, p can in principle be represented by a WCFG, but for a trigram model A, this grammar can become very large. Our approach will then be the following: we will start by taking the proposal q(0) equal to G, and then gradually refine this proposal by incorporating more and more accurate approximations to the full automaton A, themselves expressed as weighted automata of small complexity. We will stop refining upon having found a good enough approximation in optimization, or a sampler with a sufficient acceptance rate in sampling.
To be more concrete, let us consider a PCFG G, with Σx G(x)=1, where x varies among all the finite derivations relative to G. Such a PCFG is said to be consistent, that is, it is such that the total mass of infinite derivations is null. It is possible to sample from G by expanding derivations top-down using the conditional probabilities of the rules. It is also possible to find the derivation x of maximum value G(x) by a standard dynamic programming procedure.
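Top-down sampling from such a consistent PCFG can be sketched as follows (an illustrative sketch; the rule representation and names are assumptions for the example):

```python
import random

def sample_pcfg(rules, symbol, rng):
    """Top-down sampling of a yield from a consistent PCFG (a sketch).
    rules maps a nonterminal to (probability, right-hand side) pairs;
    any symbol without rules is treated as a terminal."""
    if symbol not in rules:
        return [symbol]
    r = rng.random()
    for prob, rhs in rules[symbol]:
        r -= prob
        if r < 0:
            return [leaf for s in rhs for leaf in sample_pcfg(rules, s, rng)]
    # numerical fallback: use the last expansion
    _, rhs = rules[symbol][-1]
    return [leaf for s in rhs for leaf in sample_pcfg(rules, s, rng)]
```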
We introduce a sequence of proposals denoted here as q(0)=G, q(1)=q(0)·B(1), . . . , q(i+1)=q(i)·B(i+1), . . . , where each B(i) is a small automaton including some additional knowledge about the language model represented by A. Each q(i) will thus be a WCFG (not normalized), and refining q(i) into q(i+1) will then consist of a local update of the grammar q(i), ensuring a desirable incrementality of the refinements. Analogous to the HMM case, we have the following notations:
The optimization case is first considered. Suppose that, at a certain stage, the grammar q(i) has already incorporated knowledge of w1(dog). Then suppose that the maximum derivation x(i)=argmax q(i)(x) has the yield: the two dog barked, where w1(dog) is much larger than the more accurate w2(dog|two). We then decide to update q(i) into q(i+1)=q(i)·B(i+1), where B(i+1) represents the additional knowledge corresponding to w2(dog|two). More precisely, let us define:
The relation α≤1 holds by the definition of w1 and w2.
With reference to
With reference to
Sampling is suitably done in the same way, the main difference being that we need to use dynamic programming to compute the sum of weights bottom-up in the grammars q(i). This amounts to using the sum-product semiring instead of the max-product semiring.
It will be appreciated that various of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. It will also be appreciated that various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art, which are also intended to be encompassed by the following claims.
“A Brief Introduction to Graphical Models and Bayesian Networks”, retrieved from the Internet on Oct. 14, 2011: http://www.cs.berkeley.edu/˜murphyk/Bayes/bayes.html, pp. 1-19.
Gilks et al., “Adaptive Rejection Sampling for Gibbs Sampling”, Applied Statistics (1992), vol. 41, No. 2, pp. 337-348.
Görür et al., “Concave Convex Adaptive Rejection Sampling”, Technical Report, Gatsby Computational Neuroscience Unit (2008), pp. 1-16.
Hart et al., “A Formal Basis for the Heuristic Determination of Minimum Cost Paths”, IEEE Transactions on Systems Science and Cybernetics, vol. SSC-4, No. 2, Jul. 1968, pp. 100-107.
Jordan et al., “An Introduction to Variational Methods for Graphical Models”, Machine Learning (1999), vol. 37, pp. 183-233.
Propp et al., “Exact Sampling with Coupled Markov Chains and Applications to Statistical Mechanics”, Department of Mathematics, Massachusetts Institute of Technology, Jul. 16, 1996, pp. 1-27.
Wainwright et al., “Tree-Reweighted Belief Propagation Algorithms and Approximate ML Estimation by Pseudo-Moment Matching” (2003), pp. 1-8.
Wetherell, “Probabilistic Languages: A Review and Some Open Questions”, Computing Surveys, vol. 12, No. 4, Dec. 1980, pp. 1-19.
Yedidia et al., “Generalized Belief Propagation” (2001), pp. 1-7.
Mansinghka et al., “Exact and Approximate Sampling by Systematic Stochastic Search” (2009), pp. 1-8.
Publication Number: US 20130338999 A1 | Date: Dec. 2013 | Country: US