The disclosure pertains to training Boltzmann machines.
Deep learning is a relatively new paradigm for machine learning that has substantially impacted the way in which classification, inference, and artificial intelligence (AI) tasks are performed. Deep learning began with the suggestion that, in order to perform sophisticated AI tasks such as vision or language, it may be necessary to work on abstractions of the initial data rather than on the raw data itself. For example, an inference engine trained to detect a car might first decompose a raw image into simple shapes. These shapes could form the first layer of abstraction. These elementary shapes could then be grouped together into higher-level abstract objects such as bumpers or wheels. The problem of determining whether a particular image is or is not a car is then performed on the abstract data rather than on the raw pixel data. In general, this process can involve many levels of abstraction.
Deep learning techniques have demonstrated remarkable improvements such as up to 30% relative reduction in error rate on many typical vision and speech tasks. In some cases, deep learning techniques approach human performance, such as in matching two faces. Conventional classical deep learning methods are currently deployed in language models for speech and search engines. Other applications include machine translation and deep image understanding (i.e., image to text representation).
Existing methods for training deep belief networks use contrastive divergence approximations to train the network layer by layer. This process is expensive for deep networks, relies on the validity of the contrastive divergence approximation, and precludes the use of intra-layer connections. The contrastive divergence approximation is inapplicable in some applications, and in any case, contrastive divergence based methods are incapable of training an entire graph at once and instead rely on training the system one layer at a time, which is costly and reduces the quality of the model. Finally, further crude approximations are needed to train a full Boltzmann machine, which potentially has connections between all hidden and visible units and may limit the quality of the optima found in the learning algorithm. Approaches are needed that overcome these limitations.
Methods of Bayesian inference, training Boltzmann machines, and Gibbs sampling, as well as methods for other applications, use rejection sampling in which a set of N samples is obtained from an initial distribution that is typically chosen so as to approximate a final distribution and to be readily sampled, wherein N is a positive integer. A corresponding set of N samples based on a model distribution is obtained. A likelihood ratio of an approximation to the model distribution over the initial distribution is compared to a random variable, and samples are selected from the set of samples based on the comparison. In a representative application, a definition of a Boltzmann machine that includes a visible layer and at least one hidden layer with associated weights and biases is stored. At least one of the Boltzmann machine weights and biases is updated based on the selected samples and a set of training vectors.
These and other features of the disclosure are set forth below with reference to the accompanying drawings.
As used in this application and in the claims, the singular forms “a,” “an,” and “the” include the plural forms unless the context clearly dictates otherwise. Additionally, the term “includes” means “comprises.” Further, the term “coupled” does not exclude the presence of intermediate elements between the coupled items.
The systems, apparatus, and methods described herein should not be construed as limiting in any way. Instead, the present disclosure is directed toward all novel and non-obvious features and aspects of the various disclosed embodiments, alone and in various combinations and sub-combinations with one another. The disclosed systems, methods, and apparatus are not limited to any specific aspect or feature or combinations thereof, nor do the disclosed systems, methods, and apparatus require that any one or more specific advantages be present or problems be solved. Any theories of operation are to facilitate explanation, but the disclosed systems, methods, and apparatus are not limited to such theories of operation.
Although the operations of some of the disclosed methods are described in a particular, sequential order for convenient presentation, it should be understood that this manner of description encompasses rearrangement, unless a particular ordering is required by specific language set forth below. For example, operations described sequentially may in some cases be rearranged or performed concurrently. Moreover, for the sake of simplicity, the attached figures may not show the various ways in which the disclosed systems, methods, and apparatus can be used in conjunction with other systems, methods, and apparatus. Additionally, the description sometimes uses terms like “produce” and “provide” to describe the disclosed methods. These terms are high-level abstractions of the actual operations that are performed. The actual operations that correspond to these terms will vary depending on the particular implementation and are readily discernible by one of ordinary skill in the art.
In some examples, values, procedures, or apparatus are referred to as “lowest,” “best,” “minimum,” or the like. It will be appreciated that such descriptions are intended to indicate that a selection among many functional alternatives can be made, and such selections need not be better, smaller, or otherwise preferable to other selections.
The methods and apparatus described herein generally use a classical computer to train a Boltzmann machine. In order for the classical computer to update a model of a Boltzmann machine given training data, a classically tractable approximation to the Gibbs state, such as that provided by a mean-field approximation or a related approximation, is used.
The Boltzmann machine is a powerful paradigm for machine learning in which the problem of training a system to classify or generate examples of a set of training vectors is reduced to the problem of energy minimization of a spin system. The Boltzmann machine consists of several binary units that are split into two categories: (a) visible units and (b) hidden units. The visible units are the units in which the inputs and outputs of the machine are given. For example, if a machine is used for classification, then the visible units will often be used to hold training data as well as a label for that training data. The hidden units are used to generate correlations between the visible units that enable the machine either to assign an appropriate label to a given training vector or to generate an example of the type of data that the system is trained to output.
Formally, the Boltzmann machine models the probability of a given configuration (v,h) of hidden and visible units via the Gibbs distribution:
P(v,h) = e^(−E(v,h))/Z,
wherein Z is a normalizing factor known as the partition function, and v,h refer to visible and hidden unit values, respectively. The energy E of a given configuration of hidden and visible units is of the form:

E(v,h) = −Σi vibi − Σj hjdj − Σi,j wi,jvihj,

wherein vectors v and h are visible and hidden unit values, vectors b and d are biases that provide an energy penalty for a bit taking a value of 1, and wi,j is a weight that assigns an energy penalty for the hidden and visible units both taking on a value of 1. Training a Boltzmann machine reduces to estimating these biases and weights by maximizing the log-likelihood of the training data. A Boltzmann machine for which the biases and weights have been determined is referred to as a trained Boltzmann machine. A so-called L2-regularization term can be added in order to prevent overfitting, resulting in the following form of an objective function:

OML = (1/Ntrain) Σv∈xtrain log(Σh e^(−E(v,h))/Z) − (λ/2)wTw.
This objective function is referred to as a maximum likelihood-objective (ML-objective) function and λ represents the regularization term. Gradient descent provides a method to find a locally optimal value of the ML-objective function. Formally, the gradients of this objective function can be written as:

∂OML/∂wi,j = ⟨vihj⟩data − ⟨vihj⟩model − λwi,j,

∂OML/∂bi = ⟨vi⟩data − ⟨vi⟩model,

∂OML/∂dj = ⟨hj⟩data − ⟨hj⟩model.
The expectation values for a quantity x(v,h) are given by:

⟨x(v,h)⟩data = (1/Ntrain) Σv∈xtrain Σh x(v,h)e^(−E(v,h))/Zv, wherein Zv = Σh e^(−E(v,h)), and

⟨x(v,h)⟩model = Σv,h x(v,h)e^(−E(v,h))/Z.
Note that it is non-trivial to compute any of these gradients: the value of the partition function Z is #P-hard to compute and cannot generally be efficiently approximated within a specified multiplicative error. This means that, under reasonable complexity-theoretic assumptions, neither a quantum nor a classical computer should be able to directly compute the probability of a given configuration and in turn compute the log-likelihood of the Boltzmann machine yielding the particular configuration of hidden and visible units.
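By way of illustration only, the Gibbs distribution and the source of the intractability of Z can be made concrete by brute-force enumeration of a toy RBM. The NumPy-based implementation and all parameter values below are illustrative assumptions; the exponential cost of the enumeration over 2^(nv+nh) configurations is precisely what is infeasible at practical scale.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy RBM: 3 visible and 2 hidden units (names b, d, w follow the text).
nv, nh = 3, 2
b = rng.normal(scale=0.1, size=nv)        # visible biases
d = rng.normal(scale=0.1, size=nh)        # hidden biases
w = rng.normal(scale=0.1, size=(nv, nh))  # weights


def energy(v, h):
    # E(v,h) = -v.b - h.d - v.W.h
    return -(v @ b) - (h @ d) - (v @ w @ h)


# Brute-force partition function: a sum over all 2^(nv+nh) configurations,
# which is exactly why Z is intractable for large machines.
configs = [(np.array(v), np.array(h))
           for v in itertools.product([0, 1], repeat=nv)
           for h in itertools.product([0, 1], repeat=nh)]
Z = sum(np.exp(-energy(v, h)) for v, h in configs)


def prob(v, h):
    # Gibbs distribution P(v,h) = exp(-E(v,h)) / Z.
    return np.exp(-energy(v, h)) / Z


# Sanity check: the probabilities of all configurations sum to 1.
total = sum(prob(v, h) for v, h in configs)
```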
In practice, approximations to the likelihood gradient via contrastive divergence or mean-field assumptions have been used. These conventional approaches, while useful, are not fully theoretically satisfying as the directions yielded by the approximations are not the gradients of any objective function, let alone the log-likelihood. Also, contrastive divergence does not succeed when trying to train a full Boltzmann machine which has arbitrary connections between visible and hidden units. The need for such connections can be mitigated by using a deep restricted Boltzmann machine (shown in
Boltzmann machines can be used in a variety of applications. In one application, data associated with a particular image, a series of images such as video, a text string, speech, or other audio is provided to a Boltzmann machine (after training) for processing. In some cases, the Boltzmann machine provides a classification of the data example. For example, a Boltzmann machine can classify an input data example as containing an image of a face or speech in a particular language or from a particular individual, distinguish spam from desired email, or identify other patterns in the input data example, such as identifying shapes in an image. In other examples, the Boltzmann machine identifies other features in the input data example or other classifications associated with the data example. In still other examples, the Boltzmann machine preprocesses a data example so as to extract features that are to be provided to a subsequent Boltzmann machine. In typical examples, a trained Boltzmann machine can process data examples for classification, clustering into groups, or simplification such as by identifying topics in a set of documents. Data input to a Boltzmann machine for processing for these or other purposes is referred to as a data example. In some applications, a trained Boltzmann machine is used to generate output data corresponding to one or more features or groups of features associated with the Boltzmann machine. Such output data is referred to as an output data example. For example, a trained Boltzmann machine associated with facial recognition can produce an output data example corresponding to a model face.
Disclosed herein are efficient classical algorithms for training deep Boltzmann machines using rejection sampling. Error bounds for the resulting approximation are estimated and indicate that choosing an instrumental distribution to minimize an α=2 divergence with the Gibbs state minimizes algorithmic complexity. The disclosed approaches can be parallelized.
A quantum form of rejection sampling can be used for training Boltzmann machines. Quantum states that crudely approximate the Gibbs distribution are refined so as to closely mimic the Gibbs distribution. In particular, copies of quantum analogs of the mean-field distribution are distilled into Gibbs states. The gradients of the average log-likelihood function are then estimated by either sampling from the resulting quantum state or by using techniques such as quantum amplitude amplification and estimation. A quadratic speedup in the scaling of the algorithm with the number of training vectors and the acceptance probability of the rejection sampling step can be achieved. This approach has a number of advantages. Firstly, it is perhaps the most natural method for training a Boltzmann machine using a quantum computer. Secondly, it does not explicitly depend on the interaction graph used. This allows full Boltzmann machines, rather than layered restricted Boltzmann machines (RBMs), to be trained. Thirdly, such methods can provide better gradients than contrastive divergence methods. However, available quantum computers are generally limited to fewer than ten units in the graphical model, and thus are not suitable for many practical machine learning problems. Approaches that do not require quantum computations are needed. Disclosed herein are methods and apparatus based on classical computing that retain the advantages of quantum algorithms, while providing practical advantages for training highly optimized deep Boltzmann machines (albeit at a polynomial increase in algorithmic complexity). Using rejection sampling on samples drawn from the mean-field distribution is not optimal, and using product distributions that minimize the α=2 divergence provides dramatically better results if weak regularization is used.
Rejection sampling (RS) can be used to draw samples from a distribution

P(x)/Z

by sampling instead from an instrumental distribution Q(x) that approximates the Gibbs state and rejecting each sample x with a probability

1 − P(x)/(κZQ(x)),   (2)
wherein κ is a normalizing constant introduced to ensure that the rejection probability is well defined. A major challenge faced when training Boltzmann machines is that Z is seldom known. Rejection sampling can nonetheless be applied if an approximation to Z is provided. If ZQ>0 is such an approximation and

κ ≥ maxx P(x)/(ZQQ(x)),

then samples from

P(x)/Z

can be obtained by repeatedly drawing samples from Q and rejecting each sample x with probability

1 − Praccept(x|Q(x),κ,ZQ), wherein Praccept(x|Q(x),κ,ZQ) = min(1, P(x)/(κZQQ(x))),
until a sample is accepted. This can be implemented by drawing y uniformly from the interval [0,1] and accepting x if y≤Praccept(x|Q(x),κ,ZQ).
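The acceptance test described above can be sketched for a small discrete state space as follows. The particular unnormalized target P, instrumental distribution Q, and approximation ZQ below are arbitrary illustrative choices, not values prescribed by the disclosure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Unnormalized target P(x) over four states; the true Z = 8 is treated as unknown.
P = np.array([4.0, 2.0, 1.0, 1.0])
Q = np.array([0.4, 0.3, 0.2, 0.1])   # instrumental distribution (sums to 1)
Z_Q = 7.0                            # rough approximation to Z
# kappa >= max_x P(x)/(Z_Q Q(x)) ensures the acceptance probability is <= 1.
kappa = float(np.max(P / (Z_Q * Q)))


def rejection_sample(n):
    """Draw n samples from P/Z using only P, Q, Z_Q, and kappa."""
    accepted = []
    while len(accepted) < n:
        x = rng.choice(len(Q), size=n, p=Q)   # draw from the instrumental distribution
        y = rng.uniform(size=n)               # y drawn uniformly from [0, 1]
        keep = y <= P[x] / (kappa * Z_Q * Q[x])
        accepted.extend(x[keep].tolist())
    return np.array(accepted[:n])


samples = rejection_sample(20000)
freq = np.bincount(samples, minlength=4) / len(samples)
# freq approximates the normalized target P/Z = [0.5, 0.25, 0.125, 0.125].
```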
In many applications the constants needed to normalize (2) are not known or may be prohibitively large, necessitating approximate rejection sampling. A form of approximate rejection sampling can be used in which κA<κ is chosen such that

κAZQQ(x) < P(x)

for some configurations, referred to herein as “bad.” The approximate rejection sampling algorithm then proceeds in the same way as precise rejection sampling, except that a sample x is always accepted if x is bad. This means that the samples yielded by approximate rejection sampling are not precisely drawn from P/Z. The acceptance rate depends on the choice of Q. One approach is to choose a distribution that minimizes the distance between P/Z and Q; however, it may not be immediately obvious which distance measure (or, more generally, divergence) is the best choice to minimize the error in the resultant distribution given a maximum value of κA. Even if Q closely approximates P/Z for the most probable outcomes, it may underestimate P/Z by orders of magnitude for the less likely outcomes. This can necessitate taking a very large value of κA if the sum of the probabilities of these underestimated configurations is appreciable. Generally, it can be shown that to minimize the error ε, the sum Σx∈badP(x)/Z should be minimized. It can further be shown that choosing Q to minimize the α=2 divergence D2(P/Z∥Q) minimizes the error in the distribution of samples. Choosing Q to minimize D2 thus reduces κ.
As discussed above, conventional training methods based on contrastive divergence can be computationally difficult, inaccurate, or fail to converge. In one approach, Q is selected as a mean-field approximation, in which Q is a factorized probability distribution over all of the hidden and visible units in the graphical model. More concretely, the mean-field approximation for a restricted Boltzmann machine (RBM) is a distribution such that:

QMF(v,h) = Πi μi^vi(1−μi)^(1−vi) Πj νj^hj(1−νj)^(1−hj),
wherein μi and νj are chosen to minimize KL(Q|P), wherein KL is the Kullback-Leibler (KL) divergence. The parameters μi and νj are called mean-field parameters. In addition,
μi = (1 + e^(−bi − Σj wi,jνj))^(−1),

νj = (1 + e^(−dj − Σi wi,jμi))^(−1).
A mean-field approximation for a generic Boltzmann machine is similar.
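The mean-field fixed-point iteration can be sketched as follows for a small RBM. The undamped update schedule, iteration count, and parameter values are illustrative assumptions rather than part of the disclosed method.

```python
import numpy as np

rng = np.random.default_rng(2)

nv, nh = 4, 3
b = rng.normal(scale=0.5, size=nv)        # visible biases
d = rng.normal(scale=0.5, size=nh)        # hidden biases
w = rng.normal(scale=0.5, size=(nv, nh))  # weights


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Iterate the mean-field equations until the parameters are self-consistent;
# the fixed point (locally) minimizes KL(Q || P).
mu = np.full(nv, 0.5)
nu = np.full(nh, 0.5)
for _ in range(200):
    mu = sigmoid(b + w @ nu)      # mu_i = (1 + exp(-b_i - sum_j w_ij nu_j))^-1
    nu = sigmoid(d + w.T @ mu)    # nu_j = (1 + exp(-d_j - sum_i w_ij mu_i))^-1


def Q_MF(v, h):
    # Factorized mean-field distribution Q_MF(v,h).
    return (np.prod(mu**v * (1 - mu)**(1 - v)) *
            np.prod(nu**h * (1 - nu)**(1 - h)))
```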
Although the mean-field approximation is expedient to compute, it is not theoretically the best product distribution to use to approximate P/Z. This is because the mean-field approximation is directed to minimization of the KL divergence, whereas the error in the resultant post-rejection-sampling distribution depends instead on D2, which is defined for distributions p and q as the α-divergence

Dα(p∥q) = (1/(α(1−α)))(1 − Σx p(x)^α q(x)^(1−α))

evaluated at α=2, i.e., D2(p∥q) = (Σx p(x)²/q(x) − 1)/2.
Finding QMF does not target minimization of D2; because the α=2 divergence does not contain logarithms, more general methods, such as fractional belief propagation, can be used to find Q. Product distributions that target minimization of the α=2 divergence are referred to herein as Qα=2. In this case, Q is selected variationally to minimize an upper bound on the log-partition function that corresponds to the choice α=2. Representative methods are described in Wiegerinck et al., “Fractional belief propagation,” Adv. Neural Inf. Processing Systems, pages 455-462 (2003), which is incorporated herein by reference.
The log-partition function can be efficiently estimated for any product distribution Q(x) via:

log Z ≥ H[Q(x)] − ⟨E(x)⟩Q =: log ZQ,   (3)

wherein H[Q(x)] is the Shannon entropy of Q(x) and ⟨E(x)⟩Q is the expected energy of the state Q(x). Equality holds if and only if Q(x)=e^(−E(x))/Z, and the estimate becomes more accurate as Q(x) approaches the Gibbs distribution. If Eqn. (3) is used to estimate the partition function, the mean-field distribution provides a superior estimate, denoted ZMF. Other estimates of the log-partition function can be used.
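The variational estimate can be checked against the exact log-partition function on a model small enough to enumerate. The uniform product distribution used for Q below is an arbitrary illustrative choice; any product distribution yields a lower bound.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

nv, nh = 3, 2
b = rng.normal(scale=0.2, size=nv)
d = rng.normal(scale=0.2, size=nh)
w = rng.normal(scale=0.2, size=(nv, nh))


def energy(v, h):
    return -(v @ b) - (h @ d) - (v @ w @ h)


# Exact log Z by brute force (tractable only for tiny models).
configs = [(np.array(v), np.array(h))
           for v in itertools.product([0, 1], repeat=nv)
           for h in itertools.product([0, 1], repeat=nh)]
logZ = np.log(sum(np.exp(-energy(v, h)) for v, h in configs))


def log_Z_Q(mu, nu):
    """Variational estimate log Z_Q = H[Q] - <E>_Q for a product distribution
    with Bernoulli parameters mu (visible) and nu (hidden)."""
    p = np.concatenate([mu, nu])
    entropy = -np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    expected_E = -(mu @ b) - (nu @ d) - (mu @ w @ nu)
    return entropy - expected_E


est = log_Z_Q(np.full(nv, 0.5), np.full(nh, 0.5))
gap = logZ - est   # nonnegative by the variational bound
```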
With reference to
with instrumental distribution Q(h|x) and ZQ(h|x)κ
It can be shown that rejection sampling (RS) methods of training such as disclosed herein can be less computationally complex than conventional contrastive divergence (CD) based methods, depending on network depth. In addition, RS-based methods can be parallelized, while CD-based methods generally must be performed serially. For example, as shown in
The accuracy of RS-based methods depends on the number of samples used in rejection sampling from Q and on the value of the normalizing constant κA. Typically, values of κA that are greater than or equal to four are suitable, but smaller values can be used. For sufficiently large κA, the error shrinks as

O(1/√Nsamp),

where Nsamp is the number of samples used in the estimate of the derivatives. As noted above, a more general product distribution or an elementary non-product distribution can be used instead of a mean-field approximation.
As discussed above, rejection sampling can be used to train Boltzmann machines by refining variational approximations to the Gibbs distribution, such as the mean-field approximation, into close approximations to the Gibbs state. Cost can be minimized by reducing the α=2 divergence between the true Gibbs state and the instrumental distribution. Furthermore, the gradient yielded by the disclosed methods approaches that of the training objective function as κA→∞, and the costs incurred by using a large κA can be distributed over multiple processors. In addition, the disclosed methods can lead to substantially better gradients than the state-of-the-art algorithm known as contrastive divergence training achieves for small RBMs.
A maximum likelihood objective function can be used in training with a representative method illustrated in Table 1 below.
Approximate model and data distributions Q(v,h) and Q(v,h;v), respectively, are sampled via rejection sampling, and the accepted samples are used to compute gradients of the weights, visible biases, and hidden biases.
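The gradient update applied to the accepted samples can be sketched as follows. The sample arrays below are random stand-ins for accepted rejection-sampling output, and the learning rate and regularization strength are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# One gradient step for a small RBM, given samples already accepted by
# rejection sampling: (model_v, model_h) approximate the model (Gibbs)
# distribution, and (data_v, data_h) approximate the data distribution
# (hidden units sampled conditioned on training vectors).
nv, nh = 4, 3
w = np.zeros((nv, nh))
b = np.zeros(nv)
d = np.zeros(nh)

data_v = rng.integers(0, 2, size=(100, nv)).astype(float)
data_h = rng.integers(0, 2, size=(100, nh)).astype(float)
model_v = rng.integers(0, 2, size=(100, nv)).astype(float)
model_h = rng.integers(0, 2, size=(100, nh)).astype(float)

lam, eta = 1e-4, 0.1   # L2 regularization strength and learning rate

# Gradients of the ML objective: <x>_data - <x>_model (minus lambda*w for weights).
grad_w = (data_v.T @ data_h / len(data_v)
          - model_v.T @ model_h / len(model_v)
          - lam * w)
grad_b = data_v.mean(axis=0) - model_v.mean(axis=0)
grad_d = data_h.mean(axis=0) - model_h.mean(axis=0)

# Ascend the objective.
w += eta * grad_w
b += eta * grad_b
d += eta * grad_d
```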
Such a method 400 is further illustrated in
With reference to
RS as discussed above can also be used to periodically retrofit a posterior distribution to a distribution that can be efficiently sampled. With reference to
RS as discussed above can also be used to sample from a Gibbs Distribution. Referring to
In quantum computing, determination of eigenphases of a unitary operator U is often needed. Typically, estimation of eigenphases involves repeated application of a circuit such as shown in
If the prior mean is μ and the prior standard deviation is σ, then

M = ⌈1.25/σ⌉ and θ = μ − σ.

The constant factor 1.25 is based on optimizing the median performance of the method. In some cases, the computation of σ depends on the interval that is available for θ (for example, [0, 2π]); it may be desirable to shift the interval to reduce the effects of wraparound.
In some cases, the likelihoods above vary due to decoherence. With a decoherence time T2, the likelihoods are:
A method for selecting M, θ with such decoherence is summarized in Table 2. Inputs: Prior RS sample state mean μ and covariance Σ and sampling kernel F.
M ← 1/√(Tr(Σ))
if M ≥ T2, then
 M ∼ f(x; 1/T2) (draw M from an exponential distribution with mean T2)
θ ∼ F(μ, Σ)
return M, θ
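The Table 2 procedure can be sketched as follows. The use of a Gaussian sampling kernel for F and the specific input values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)


def choose_experiment(mu, Sigma, T2, rng=rng):
    """Select an experiment (M, theta) following Table 2: M scales as
    1/sqrt(Tr(Sigma)), capped stochastically by the decoherence time T2,
    and theta is drawn from the sampling kernel F (here Gaussian)."""
    M = 1.0 / np.sqrt(np.trace(Sigma))
    if M >= T2:
        # Draw M from an exponential distribution with mean T2.
        M = rng.exponential(T2)
    theta = rng.multivariate_normal(mu, Sigma)   # theta ~ F(mu, Sigma)
    return M, theta


# Example: prior with mean 0.5 and variance 0.01, long decoherence time.
M, theta = choose_experiment(np.array([0.5]), np.array([[0.01]]), T2=100.0)
```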
An exponential distribution is used in Table 2 as such a distribution corresponds to exponentially decaying probability. Other distributions such as a Gaussian distribution can be used as well. In some cases, to avoid possible instabilities, multiple events can be batched together in a single step to form an effective likelihood function of the form:
With reference to
With reference to
With reference to
As shown in
The exemplary PC 1100 further includes one or more storage devices 1130 such as a hard disk drive for reading from and writing to a hard disk, a magnetic disk drive for reading from or writing to a removable magnetic disk, and an optical disk drive for reading from or writing to a removable optical disk (such as a CD-ROM or other optical media). Such storage devices can be connected to the system bus 1106 by a hard disk drive interface, a magnetic disk drive interface, and an optical drive interface, respectively. The drives and their associated computer readable media provide nonvolatile storage of computer-readable instructions, data structures, program modules, and other data for the PC 1100. Other types of computer-readable media which can store data that is accessible by a PC, such as magnetic cassettes, flash memory cards, digital video disks, CDs, DVDs, RAMs, ROMs, and the like, may also be used in the exemplary operating environment.
A number of program modules may be stored in the storage devices 1130 including an operating system, one or more application programs, other program modules, and program data. Storage of Boltzmann machine specifications, and computer-executable instructions for training procedures, determining objective functions, and configuring a quantum computer can be stored in the storage devices 1130 as well as or in addition to the memory 1104. A user may enter commands and information into the PC 1100 through one or more input devices 1140 such as a keyboard and a pointing device such as a mouse. Other input devices may include a digital camera, microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the one or more processing units 1102 through a serial port interface that is coupled to the system bus 1106, but may be connected by other interfaces such as a parallel port, game port, or universal serial bus (USB). A monitor 1146 or other type of display device is also connected to the system bus 1106 via an interface, such as a video adapter. Other peripheral output devices 1145, such as speakers and printers (not shown), may be included. In some cases, a user interface is displayed so that a user can input a Boltzmann machine specification for training, and verify successful training.
The PC 1100 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 1160. In some examples, one or more network or communication connections 1150 are included. The remote computer 1160 may be another PC, a server, a router, a network PC, or a peer device or other common network node, and typically includes many or all of the elements described above relative to the PC 1100, although only a memory storage device 1162 has been illustrated in
When used in a LAN networking environment, the PC 1100 is connected to the LAN through a network interface. When used in a WAN networking environment, the PC 1100 typically includes a modem or other means for establishing communications over the WAN, such as the Internet. In a networked environment, program modules depicted relative to the personal computer 1100, or portions thereof, may be stored in the remote memory storage device or other locations on the LAN or WAN. The network connections shown are exemplary, and other means of establishing a communications link between the computers may be used.
In some examples, a logic device such as a field programmable gate array, another programmable logic device (PLD), or an application-specific integrated circuit (ASIC) can be used, and a general-purpose processor is not necessary. As used herein, “processor” generally refers to a logic device that executes instructions that can be coupled to the logic device or fixed in the logic device. In some cases, logic devices include memory portions, but memory can be provided externally, as may be convenient. In addition, multiple logic devices can be arranged for parallel processing.
Having described and illustrated the principles of the disclosed technology with reference to the illustrated embodiments, it will be recognized that the illustrated embodiments can be modified in arrangement and detail without departing from such principles. The technologies from any example can be combined with the technologies described in any one or more of the other examples. Alternatives specifically addressed in these sections are merely exemplary and do not constitute all possible examples.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2016/032942 | 5/18/2016 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62171195 | Jun 2015 | US |