Various algorithms, such as machine learning algorithms, often use statistical probabilities to make decisions or to model systems. Some such learning algorithms may use Bayesian statistics, or may use other statistical models that have a theoretical basis in natural phenomena. In the execution of such algorithms, typically such statistical probabilities are calculated using classical computing devices, wherein the statistical probabilities are then used by other aspects of the algorithm. As an example, statistical probabilities may be used to generate a random number, wherein the random number is then used to evaluate some other aspect of the algorithm.
Generating such statistical probabilities may involve performing complex calculations which may require both time and energy to perform, thus increasing a latency of execution of the algorithm and/or negatively impacting energy efficiency. In some scenarios, calculation of such statistical probabilities using classical computing devices may result in non-trivial increases in execution time of algorithms and/or energy usage to execute such algorithms.
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to. When used in the claims, the term “or” is used as an inclusive or and not as an exclusive or. For example, the phrase “at least one of x, y, or z” means any one of x, y, and z, as well as any combination thereof.
The present disclosure relates to methods, systems, and an apparatus for performing computer operations using a thermodynamic chip. In some embodiments, physical elements of a thermodynamic chip may be used to physically model evolution according to Langevin dynamics. For example, in some embodiments, a thermodynamic chip includes a substrate comprising oscillators implemented using superconducting flux elements. The oscillators may be mapped to neurons (visible or hidden) that “evolve” according to Langevin dynamics. For example, the oscillators of the thermodynamic chip may be initialized in a particular configuration and allowed to thermodynamically evolve. As the oscillators “evolve,” degrees of freedom of the oscillators may be sampled. Values of these sampled degrees of freedom may represent, for example, vector values for neurons that evolve according to Langevin dynamics. For example, algorithms that use stochastic gradient optimization and require sampling during training, such as those proposed by Welling and Teh, and/or other algorithms, such as natural gradient descent, mirror descent, etc. may be implemented using a thermodynamic chip. In some embodiments, a thermodynamic chip may enable such algorithms to be implemented directly by sampling the neurons (e.g., degrees of freedom of the oscillators of the substrate of the thermodynamic chip) directly without having to calculate statistics to determine probabilities. As another example, thermodynamic chips may be used to perform autocomplete tasks, such as those that use Hopfield networks, which may be implemented using the Welling and Teh algorithm. For example, visible neurons may be arranged in a fully connected graph (such as a Hopfield network as shown in
In some embodiments, a thermodynamic chip includes superconducting flux elements arranged in a substrate to form oscillators, wherein the thermodynamic chip is configured to modify magnetic fields that couple respective ones of the oscillators with other ones of the oscillators. In some embodiments, non-linear (e.g., anharmonic) oscillators are used that have dual-well potentials. These dual-well oscillators may be mapped to neurons of a given model that the thermodynamic chip is being used to implement. Also, in some embodiments, at least some of the oscillators may be harmonic oscillators with single-well potentials. The single-well oscillators may be mapped to non-visible (or hidden) neurons that are not mapped to input variables or output variables, but instead represent other relationships in the model, such as those that are not readily visible. In some embodiments, oscillators may be implemented using superconducting flux elements with varying amounts of non-linearity. In some embodiments, an oscillator may have a single-well potential, a dual-well potential, or a potential somewhere in a range between a single-well potential and a dual-well potential. In some embodiments, both visible and non-visible neurons may be mapped to oscillators having a single-well potential, a dual-well potential, or a potential somewhere in a range between a single-well potential and a dual-well potential.
In some embodiments, parameters of an energy-based model or other learning algorithm may be trained by sampling the oscillators of a thermodynamic chip that have been configured with couplings corresponding to a current engineered Hamiltonian being used to approximate aspects of the energy-based model. Based on the sampling, a computing device coupled to the thermodynamic chip, such as a field programmable gate array (FPGA) or application specific integrated circuit (ASIC), which may be co-located in a dilution refrigerator with the thermodynamic chip, or in an external environment, external to the dilution refrigerator that hosts the thermodynamic chip, may determine updated weightings or biases to be used in the engineered Hamiltonian. In some embodiments, such measurements and updates to weightings and biases may be performed until the engineered Hamiltonian has been adjusted such that the samples taken from the thermodynamic chip satisfy one or more training criteria, e.g., such that the thermodynamic chip accurately generates the samples needed to compute the model it is being used to approximate.
For example, in some embodiments, the engineered Hamiltonian (shown below) may be used to model a Monte Carlo sampling method and may be implemented using a thermodynamic chip, wherein the first two terms of the Hamiltonian represent visible and non-visible neurons and the latter two terms of the Hamiltonian represent couplings between the weights and biases and the visible and non-visible neurons. Note that additional details regarding training and implementation of the engineered Hamiltonian to perform Bayesian learning tasks are further described herein.
In the above equation, V represents vertices such as the neurons 254 shown in
In some embodiments, the use of a thermodynamic chip in a computer system may enable a learning algorithm to be implemented in a more efficient and faster manner than if the learning algorithm was implemented purely using classical components. For example, measuring the neurons in a thermodynamic chip to determine Langevin statistics may be quicker and more energy efficient than determining such statistics via calculation (e.g., using a classical computing device). Similar benefits accrue when thermodynamic chips are used in other algorithms that have statistical sub-components such as Monte Carlo sampling methods. For example, the thermodynamic chip may function as a co-processor of a computer system, such as is shown for thermodynamic chip 1380 which is a co-processor with processors 1310 of computer system 1300 (shown in
Broadly speaking, classes of algorithms that may benefit from thermodynamic chips include those algorithms that involve probabilistic inference. Such probabilistic inferences (which otherwise would be performed using a CPU or GPU) may instead be delegated to the thermodynamic chip for a faster and more energy efficient implementation. Thus, in some embodiments, a thermodynamic chip may be used to perform a sub-routine of a larger algorithm that may also involve other calculations performed on a classical computer system. At a physical level, the thermodynamic chip harnesses electron fluctuations in superconductors coupled in flux loops to model Langevin dynamics.
Note that in some embodiments, electro-magnetic or mechanical (or other suitable) oscillators may be used. A thermodynamic chip may implement neuro-thermodynamic computing and therefore may be said to be neuromorphic. For example, the neurons implemented using the oscillators of the thermodynamic chip may function as neurons of a neural network that has been implemented directly in hardware. Also, the thermodynamic chip is “thermodynamic” because the chip may be operated in the thermodynamic regime slightly above 0 kelvin, wherein thermodynamic effects cannot be ignored. For example, some thermodynamic chips may be operated at 2, 3, or 4 kelvin. In some embodiments, temperatures less than 15 kelvin may be used, though other temperature ranges are also contemplated. This also, in some contexts, may be referred to as analog stochastic computing. In some embodiments, the temperature regime and/or oscillation frequencies used to implement the thermodynamic chip may be engineered to achieve certain statistical results. For example, the temperature, friction (e.g., damping), and/or oscillation frequency may be controlled variables that ensure the oscillators evolve according to a given dynamical model, such as Langevin dynamics. In some embodiments, temperature may be adjusted to control a level of noise introduced into the evolution of the neurons. As yet another example, a thermodynamic chip may be used to model energy models that require a Boltzmann distribution. Also, a thermodynamic chip may be used to solve variational algorithms. In some embodiments, sampling methods for sampling the thermodynamic chip are timed assuming thermal equilibrium is reached at very fast time scales, which can be in the nanosecond to picosecond range.
Bayesian Learning with Energy-Based Models
As introduced above, a thermodynamic chip may be used to model energy-based models, according to some embodiments. For example, a stochastic gradient optimization algorithm, such as that of Welling and Teh, may be adapted for use in energy-based models. In such embodiments, a set of N data items X = {x_i}_{i=1}^{N} with a posterior distribution pθ(x) = exp(−εθ(x))/Z(θ) and partition function Z(θ) = ∫exp(−εθ(x)) dx may be constructed, and the Welling and Teh stochastic gradient optimization algorithm may be combined with Langevin dynamics to obtain a parameter update algorithm that provides efficient use of large datasets while also providing for parameter uncertainty to be captured in a Bayesian context. As such, the update rule may be written as
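For reference, the stochastic gradient Langevin dynamics update of Welling and Teh, as published in the literature, takes the following form (a reconstruction from that literature, with ϵt the step size at iteration t and ηt injected Gaussian noise):

```latex
\Delta\theta_t = \frac{\epsilon_t}{2}\left(\nabla \log p(\theta_t)
  + \frac{N}{n}\sum_{i=1}^{n} \nabla \log p\!\left(x_{t_i}\mid\theta_t\right)\right) + \eta_t,
\qquad \eta_t \sim \mathcal{N}(0,\,\epsilon_t)
```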
Furthermore, ϵt may be restricted to satisfy the following properties: Σ_{t=1}^{∞} ϵt = ∞ and Σ_{t=1}^{∞} ϵt² < ∞. With regard to the property Σ_{t=1}^{∞} ϵt = ∞, ϵt may be restricted to satisfy said property in order for parameters to reach high probability regions regardless of when/where said parameters are initialized, according to some embodiments. With regard to the property Σ_{t=1}^{∞} ϵt² < ∞, ϵt may be restricted to satisfy said property in order for parameters to converge to a mode instead of oscillating around said mode, according to some embodiments. A functional form which may satisfy said properties is accomplished by setting ϵt = a(b+t)^{−γ}, wherein, at each iteration t, a subset of data items with size n, e.g., X_t = {x_{t_1}, . . . , x_{t_n}}, may be used.
Continuing with the posterior distribution pθ(x)=exp(−εθ(x))/Z(θ) for an energy-based model, it may be defined that
wherein
may be further defined as
Therefore, applying the above equations, θt+1 may be rewritten as
according to some embodiments.
In some embodiments, in order to efficiently compute the term involving an expectation over x ∼ pθ(x), samples may be drawn from pθ(x) as follows.
The Langevin MCMC algorithm may then be used to sample from pθ(x) by first drawing an initial sample x0 from a given prior distribution, and then by simulating the overdamped Langevin diffusion process for K steps with size δ>0 as
wherein ξk ∼ N(0, I). Furthermore, when δ→0 and K→∞, xk may be guaranteed to be distributed as pθ(x), according to some embodiments. In addition, to further improve accuracy, the Metropolis-Hastings algorithm may be incorporated as follows. Firstly, a quantity α may be computed such that
wherein q(x′|x)∝exp
may be defined as the transition density from x to x′. Secondly, u may be drawn from a uniform distribution on the interval [0,1] such that, if u≤α, the update defined by xk+1 = xk + δ∇x log pθ(xk) + √(2δ)ξk = xk − δ∇xεθ(xk) + √(2δ)ξk may be applied. Otherwise, xk+1 may be set as xk+1 = xk.
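The Langevin proposal plus Metropolis-Hastings accept/reject step described above (often called MALA in the literature) can be sketched classically. The quadratic energy below is an illustrative assumption (not from the disclosure); for it, the target distribution is a standard normal:

```python
import numpy as np

def energy(x):
    """Assumed example energy: a simple quadratic well (illustrative only)."""
    return 0.5 * np.dot(x, x)

def grad_energy(x):
    return x

def log_q(x_to, x_from, delta):
    """Log transition density q(x_to | x_from) of the Langevin proposal."""
    mean = x_from - delta * grad_energy(x_from)
    return -np.sum((x_to - mean) ** 2) / (4.0 * delta)

def mala_step(x, delta, rng):
    """One Langevin proposal followed by a Metropolis-Hastings correction."""
    xi = rng.standard_normal(x.shape)
    x_prop = x - delta * grad_energy(x) + np.sqrt(2.0 * delta) * xi
    log_alpha = (-energy(x_prop) + energy(x)
                 + log_q(x, x_prop, delta) - log_q(x_prop, x, delta))
    if np.log(rng.uniform()) <= min(0.0, log_alpha):
        return x_prop
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal(2)
samples = []
for k in range(5000):
    x = mala_step(x, delta=0.1, rng=rng)
    samples.append(x.copy())
samples = np.array(samples[1000:])  # discard burn-in
# For the quadratic energy above, the per-coordinate sample variance
# should be close to 1.
```

The accept/reject step makes the sampler exact for any step size δ, at the cost of occasionally rejected proposals.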
In some embodiments, and in order to further define Bayesian learning techniques for energy-based models used herein, an adaptive pre-conditioning method based on a diagonal approximation of the second-order moment of the gradient may be applied, which may also be referred to herein as adaptively pre-conditioned SGLD. As such, the generalizability of SGLD and the training speed of adaptive first-order methods may additionally be combined. By initializing μ0=0 and C0=0, (θt) may be defined as
Then, at each time step t, the following updates may be performed. Firstly, a momentum update may be computed as
followed by a Ct update
Secondly, a parameter update may then be computed as
wherein ξt˜N(μt, Ct), and ψ may be defined as a noise parameter.
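The update sequence above (momentum update, Ct update, then a preconditioned parameter update with noise ξt ∼ N(μt, Ct)) can be sketched as follows. The exact combination of terms is an assumption modeled on pSGLD-style methods; the learning rate, β, and the quadratic toy loss are also illustrative:

```python
import numpy as np

def psgld_step(theta, grad, mu, C, rng, lr=1e-2, beta=0.99, psi=1e-5, eps=1e-8):
    """One adaptively pre-conditioned SGLD step (a sketch, not the exact
    update rules of the disclosure). `psi` plays the role of the noise
    parameter."""
    mu = beta * mu + (1.0 - beta) * grad          # momentum update
    C = beta * C + (1.0 - beta) * grad ** 2       # diagonal second-moment update
    precond = 1.0 / (np.sqrt(C) + eps)            # diagonal preconditioner
    xi = mu + np.sqrt(C) * rng.standard_normal(theta.shape)  # xi ~ N(mu, C)
    theta = theta - lr * precond * grad + psi * xi
    return theta, mu, C

rng = np.random.default_rng(1)
theta = np.array([5.0, -3.0])
mu = np.zeros_like(theta)
C = np.zeros_like(theta)
for t in range(2000):
    grad = theta          # gradient of the quadratic loss 0.5 * ||theta||^2
    theta, mu, C = psgld_step(theta, grad, mu, C, rng)
# theta should settle near the mode at 0.
```

The diagonal preconditioner rescales each coordinate by its recent gradient magnitude, which is what gives the method the training speed of adaptive first-order optimizers.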
In some embodiments, and in order to further define Bayesian learning techniques for energy-based models used herein, gradient descent-based techniques may be used to compute an estimate of the maximum of the posterior distribution pθ(x) as defined above (e.g., instead of using stochastic Langevin-like dynamics for parameters). In such embodiments, information-geometric optimizers may be applied for such gradient-based training of energy-based models. The following paragraphs detail how to perform natural gradient descent for energy-based models.
In some embodiments, when applying the natural gradient descent algorithm to energy based models, the parameters may be updated as follows
wherein 1/λj may be defined as the learning rate and ℐ⁺(θ) may be defined as the Moore-Penrose pseudo-inverse of the information matrix ℐ(θ).
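The natural gradient step, preconditioning the gradient by the Moore-Penrose pseudo-inverse of an information matrix, can be sketched as below. The fixed example matrix and quadratic loss are illustrative assumptions (in practice the information matrix depends on θ and is estimated by sampling):

```python
import numpy as np

def natural_gradient_step(theta, grad, info_matrix, lam=10.0):
    """theta_{t+1} = theta_t - (1/lam) * pinv(I(theta)) @ grad."""
    return theta - (1.0 / lam) * np.linalg.pinv(info_matrix) @ grad

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # stand-in information matrix I(theta)
theta = np.array([4.0, -2.0])
for _ in range(200):
    grad = A @ theta                # gradient of the loss 0.5 * theta^T A theta
    theta = natural_gradient_step(theta, grad, A)
# Preconditioning by pinv(A) turns the update into plain gradient descent
# with rate 1/lam, so theta contracts geometrically toward zero.
```

Using the pseudo-inverse (rather than a plain inverse) keeps the step well defined when the information matrix is singular or rank deficient.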
A calculation of the Bogoliubov-Kubo-Mori (BKM) metric, BKM(θ), may additionally be performed as follows, wherein the BKM metric may be defined as a special selection of the metric ℐ(θ) and may produce asymptotic optimality criteria.
Furthermore, the term ∂θ
and the term ∂θ
leading to BKM(θ)j,k being rewritten as
In some embodiments applying the BKM metric, the sampling operations utilized by BKM(θ)j,k may be computed efficiently when implemented using a thermodynamic chip architecture, such as those described herein. Furthermore, the matrix defined in the equation above for
BKM(θ)j,k may be sparsified by applying a block diagonal approximation, a Kronecker-factored approximate curvature (KFAC) approximation, or a diagonal approximation, according to some embodiments. Such techniques may reduce the number of matrix elements to be estimated using the given thermodynamic chip architecture, and may additionally lead to similar performance and gradient descent dynamics.
In some embodiments, and in order to further define Bayesian learning techniques for energy-based models used herein, additional gradient descent-based techniques may be used to compute an estimate of the maximum of the posterior distribution pθ(x) as defined above. The following paragraphs detail how to perform mirror descent for energy-based models.
In some embodiments, when applying the mirror descent algorithm to energy based models, the parameters may be updated as follows for values k=1, 2, . . . , K and for a given j:
wherein ηk and λj may be defined as learning rates. The parameters may then be updated as θj+1←θjK+1. Furthermore, the relative entropy term D(pθ(x)∥pθ
which may then be rewritten as the following when using the expression of the probability density for energy-based models
In addition, in order to compute the gradient of the relative entropy term D(pθ(x)∥pθ
may be computed as
while the gradient of the term ∫pθ(x)(εθ
Therefore, the gradient of the relative entropy term D(pθ(x)∥pθ
As further explained herein with regard to sampling operations of a thermodynamic chip architecture, said architecture may provide a speedup of the implementation of the mirror descent algorithm, according to some embodiments.
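The mirror descent family described above can be illustrated with a self-contained classical sketch. The entropic mirror map on the probability simplex (exponentiated gradient) used here is a generic stand-in assumption; the EBM-specific updates in the text use the relative entropy term D(pθ(x)∥pθj(x)) in place of the simplex setup below:

```python
import numpy as np

def mirror_descent_step(p, grad, eta):
    """Multiplicative-weights update: the mirror step induced by the
    KL divergence (entropic mirror map)."""
    p = p * np.exp(-eta * grad)
    return p / p.sum()              # project back onto the simplex

target = np.array([0.7, 0.2, 0.1])
p = np.ones(3) / 3.0                # uniform initialization
for k in range(500):
    grad = np.log(p) - np.log(target)   # gradient of KL(p || target)
    p = mirror_descent_step(p, grad, eta=0.1)
# p should approach the target distribution.
```

The mirror step keeps iterates strictly inside the simplex by construction, which is the main appeal of mirror descent over projected gradient descent in such geometries.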
As introduced above, an implementation of an engineered Hamiltonian into a thermodynamic chip may include non-visible neurons. As such, the equations provided above with regard to θt+1 may be rewritten to incorporate said non-visible neurons, according to some embodiments. The following paragraphs further detail the incorporation of non-visible neurons into equations regarding θt+1.
Firstly, the parameters may be updated as
wherein pθ(x, z)=exp(−εθ(x, z))/Z(θ), and
such that data may be clamped to xn. Furthermore,
Applying the above definitions, the parameter update definition for θt+1 may then be rewritten to incorporate non-visible neurons as follows:
Such a parameter update definition indicates that z may be sampled from the posterior distribution when clamping the visible nodes to the data and according to the term θt, and further indicates, according to the second term, that both x and z may be sampled from the posterior distribution.
In addition, the Langevin MCMC algorithm, as introduced above, may then indicate that x may be sampled from the distribution pθ(x). Therefore, when implementing non-visible neurons, the Langevin MCMC update rules may be rewritten as follows such that sampling occurs over the non-visible neurons:
wherein a random variable ξk may be defined as ξk ∼ N(0, I). It should be noted that during inference (once the weights and biases of the engineered Hamiltonian have been learned) it is not necessary to sample the non-visible neurons (labeled z in the above equation) in order to generate inferences. However, during training (e.g., during the process of learning the weights and biases), samples of the non-visible neurons may be collected and used to compute the relevant gradients on the ASIC/FPGA.
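One form consistent with the visible-only update rule given earlier (presented as an assumption, since the coupled equations may vary by embodiment) alternates Langevin steps over the visible and non-visible variables:

```latex
x_{k+1} = x_k - \delta\,\nabla_x\,\varepsilon_\theta(x_k, z_k) + \sqrt{2\delta}\,\xi_k,
\qquad
z_{k+1} = z_k - \delta\,\nabla_z\,\varepsilon_\theta(x_k, z_k) + \sqrt{2\delta}\,\xi'_k,
```

wherein ξk and ξ′k are independent draws from N(0, I).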
In some embodiments, and in order to further define Bayesian learning techniques for energy-based models used herein, consideration may also be given for the negative phase term
in the above iterations of definitions for θt+1. For example, said negative phase term may be approximated using a time series average, which may be well suited for a physics-based implementation of definitions for θt+1, according to some embodiments. In such an example, the negative phase term may be rewritten as
wherein xi may be computed from the Langevin MCMC process introduced above, or from a general Langevin dynamical evolution with finite friction, and T may be defined as a total number of time steps used in the approximation of the negative phase term. It may also be noted that, rather than summing over multiple paths sampled from the Langevin MCMC process (e.g., defined as the space average for the negative phase term), the above approximation of the negative phase term defines a summation over a single path evolving through time following the Langevin MCMC update rules. A space average implementation of the negative phase term may instead be written as
wherein there are M independent paths, and the xi(T) terms may be computed via definitions for xk+1 introduced above and after performing a given T number of iterations.
Therefore, for a time average approach, the parameter updates may be rewritten as
As additionally detailed below, consideration as to the initialization of xi in the equation above may be given at each iteration t, as the impacts of such selections are non-trivial. In some embodiments, time averages can also be used for parameter updates when using non-visible neurons. In such a case, the hidden (latent) variables may be sampled through time to compute the relevant gradients.
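The time-average and space-average estimates of the negative phase term can be contrasted numerically. The sketch below uses an assumed toy energy εθ(x) = θx²/2 (not from the disclosure); its θ-gradient is x²/2, whose expectation under pθ is 1/(2θ):

```python
import numpy as np

def langevin_step(x, theta, delta, rng):
    """One overdamped Langevin MCMC step for the toy energy theta*x^2/2."""
    return x - delta * theta * x + np.sqrt(2.0 * delta) * rng.standard_normal(x.shape)

rng = np.random.default_rng(2)
theta, delta, T, M = 2.0, 0.05, 20000, 500

# Time average: a single path evolving through time, summed over T steps.
x = rng.standard_normal(1)
time_avg = 0.0
for _ in range(T):
    x = langevin_step(x, theta, delta, rng)
    time_avg += 0.5 * x[0] ** 2 / T

# Space average: M independent paths, using only the final state of each.
xs = rng.standard_normal(M)
for _ in range(200):
    xs = langevin_step(xs, theta, delta, rng)
space_avg = np.mean(0.5 * xs ** 2)
# Both estimates should approximate 1/(2*theta) = 0.25 (up to
# discretization bias from the finite step size delta).
```

The time average needs only one trajectory but correlated samples; the space average needs M independent trajectories but uses uncorrelated endpoints.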
In some embodiments, a thermodynamic computing system 100 (as shown in
As introduced above, an implementation of an engineered Hamiltonian into a thermodynamic chip, such as thermodynamic chip 120, for performing Bayesian learning tasks regarding energy-based models may be defined via terms representing visible neurons, non-visible neurons, and coupling terms between the weights and biases and the visible and non-visible neurons. For example, such a thermodynamic computing system 100 may be used to train an energy-based model applied to a graph-based architecture g={V, ε}, wherein V represents a set of vertices (e.g., nodes), and ε represents a set of edges. In such implementations, neurons may reside on the nodes of the graph, each accompanied by a bias, while the synapses (weights) may reside on the edges of the graph. As additionally introduced above, an engineered Hamiltonian that may be used to derive the potential energy function used in an energy-based model, such as those applied herein, may therefore be written as
In the Htotal definition above, it may be noted that the neurons, which are defined as qs, are linearly coupled to the weights. The qs may be partitioned into sets of visible neurons, vis, and non-visible neurons, non-vis, wherein visible neurons may have different masses and/or frequencies than those of the non-visible neurons, according to some embodiments.
In some embodiments, as opposed to describing the engineered Hamiltonian via linear couplings between respective weights and neurons, the engineered Hamiltonian may be described using quadratic couplings. For example, the engineered Hamiltonian may be written as
In this respective Htotal definition above, it may be noted that energy terms with regard to non-visible variables of the engineered Hamiltonian may be defined as having dual-well potentials, e.g.,
However, in other embodiments, single-well potentials may be defined, etc. For example, in defining an engineered Hamiltonian with non-visible neurons defined via single-well potentials, the following term replacements may be made to the above Htotal definitions:
A person having ordinary skill in the art should understand that, depending upon a particular application of a given thermodynamic computing system 100, single-well potentials, dual-well potentials, etc. may be preferred over other types of potentials, etc.
In addition, when performing inference and sampling using Langevin dynamics for a thermodynamic computing system 100, the Langevin MCMC update rules introduced above, e.g., xk+1, may be computed using the equation of motion for a system of particles undergoing Langevin dynamics, wherein an associated engineered Hamiltonian is defined as
Furthermore, a person having ordinary skill in the art should understand that, if coupling terms in Htotal may be engineered (e.g., engineered such that particles undergoing Langevin dynamics correspond to visible and non-visible neurons), inference and sampling may be implemented natively by letting said system of coupled particles evolve through time, according to some embodiments.
In order to define such an evolution, a potential energy function Uθ(q) may be considered (e.g., an engineered Hamiltonian such as Htotal, without momentum-related terms), wherein visible neurons may be written as qj∈. Furthermore, in the following definitions, θ may be used to label respective weights and biases. As such, the equation of motion for overdamped Langevin dynamics may be written as
wherein Wt is a Wiener process. To the leading order, therefore, it may be written that
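In the standard form found in the Langevin dynamics literature (a reconstruction, with γ the friction coefficient), the overdamped equation of motion and its leading-order discretization may be written as:

```latex
dq_t = -\frac{1}{\gamma}\,\nabla_q U_\theta(q_t)\,dt + \sqrt{\frac{2 k_B T}{\gamma}}\,dW_t,
\qquad
q_{t+\delta t} \approx q_t - \frac{\delta t}{\gamma}\,\nabla_q U_\theta(q_t)
  + \sqrt{\frac{2 k_B T\,\delta t}{\gamma}}\,\xi_t,
```

wherein ξt ∼ N(0, I).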
Next, a potential energy function may be derived that incorporates both visible and non-visible neurons. In such a derivation, a rate of change of the positions of the non-visible neurons may be regarded as faster than those of the visible neurons. As such, the equation of motion for the non-visible neurons may still be given by
However, in order to treat visible neurons, the equation of motion for overdamped Langevin dynamics may be rewritten as
In addition, since it may be regarded that non-visible neurons may evolve on a faster time scale than visible neurons, the term
may be rewritten as
wherein
corresponds to a time average over a length of time δt of the term
Furthermore, since weights and biases are fixed during inference, and visible neurons change by a small amount during a given time δt,
may additionally be understood as an approximation to the time series average of
during the time interval [t, t+δt], e.g.,
It may therefore be rewritten as
The above description of an evolution through time of particles undergoing Langevin dynamics demonstrates that inference with non-visible neurons may be performed by letting a system engineered with the couplings described via Htotal evolve through time while also ensuring that conditions defined by δt[∇xUθ(x, z)] are satisfied. Furthermore, the definition introduced above for the equation of motion for overdamped Langevin dynamics is valid at least within the large friction limit. If, however, γ is small, the equations of motion for position and momentum may not be able to be decoupled, according to some embodiments. This may be further understood by noting that the general Langevin equations of motion for position and momentum may be written as
wherein σ = √(2kBTγ). In order to solve said generalized Langevin equations, weak second-order numerical integration methods may be applied, such as the GJF method. By applying the GJF method, the equations of motion for position and momentum may be written as
wherein a and b may be written as
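For reference, in the published GJF (Grønbech-Jensen–Farago) integrator, the coefficients a and b take the following form (a reconstruction from that literature, with m the mass, γ the friction coefficient, and Δt the time step):

```latex
a = \frac{1 - \gamma\,\Delta t/(2m)}{1 + \gamma\,\Delta t/(2m)},
\qquad
b = \frac{1}{1 + \gamma\,\Delta t/(2m)}.
```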
It should be understood that the Langevin MCMC algorithm may be implemented using the general Langevin equations of motion for position and momentum above or the re-written versions that include the numerical integrations, as shown above.
In addition, as introduced above, certain steps during the parameter rules updates may require clamping visible neurons to the data. Such clamping operations may be configured by adding a term to
such that the engineered Hamiltonian is energetically favorable for the visible nodes to take on the respective values of the data. For example, the engineered Hamiltonian may be rewritten as
wherein ε(t) may be defined as a hyperparameter that may be turned on or off, and wherein qd⊆
is defined as corresponding to the visible neurons of the given network architecture (see also description regarding
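One illustrative form of such an added clamping term (an assumption; the precise form is implementation specific) is a quadratic pinning potential that penalizes deviations of the visible coordinates from the data values xj:

```latex
H_{\text{clamp}} = H_{\text{total}} + \epsilon(t)\sum_{j \in d}\left(q_j - x_j\right)^2,
```

so that, when ε(t) is turned on, configurations in which the visible nodes take on values near the data are energetically favorable.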
In some embodiments, a substrate 202 may be included in a thermodynamic chip, such as thermodynamic chip 102. Oscillators 204 of substrate 202 may be mapped in a logical representation 252 to neurons 254. In some embodiments, oscillators 204 may include oscillators with potentials ranging from a single-well potential to a dual-well potential and may be mapped to visible neurons and non-visible (e.g., hidden) neurons.
In some embodiments, Josephson junctions and/or superconducting quantum interference devices (SQUIDs) may be used to implement and/or excite/control the oscillators 204. In some embodiments, the oscillators 204 may be implemented using superconducting flux elements. In some embodiments, the superconducting flux elements may physically be instantiated using a superconducting circuit built out of coupled nodes comprising capacitive, inductive, and Josephson junction elements, connected in series or parallel, such as shown in
In some embodiments, non-visible neurons are not sampled. This may allow the thermodynamic chip to be configured with fewer control lines for the oscillators that are mapped to the non-visible neurons than are used for the oscillators that are mapped to the visible neurons. This may allow for scaling a thermodynamic chip to include more oscillators than would be otherwise possible if a same number of control lines were used for all oscillators.
In some embodiments, classical computing device 106 may learn relationships between respective ones of the neurons such as relationship A (352), relationship B (354), and relationship C (356). These relationships may be physically implemented in substrate 202 via couplings between oscillators 204, such as couplings 302, 304, and 306 that physically implement respective relationships 352, 354, and 356.
In some embodiments, a drive 402 may cause pulses 404 to be emitted to implement couplings 302, 304, and 306. Also in some embodiments, drive 402 may control a SQUID that is used to emit flux via flux lines. In some embodiments, DC signals could be used in addition to or instead of pulses 404 to implement couplings 302, 304, and 306. In general, time dependent signals may be used to control the oscillators and couplings between oscillators, wherein the time dependent signals may be implemented using various techniques.
In some embodiments, input neurons and output neurons, such as visible input neurons 502 and visible output neurons 504, may be directly linked via connected edges 506. As shown in
In some embodiments,
In some embodiments,
In some embodiments, Hopfield network configurations may be used for auto completion tasks. As an example, neurons of a Hopfield network may be mapped to pixels in an image, and a thermodynamic chip with oscillators coupled to form a physical instantiation of the logical Hopfield network (as shown in
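The autocomplete behavior described above can be illustrated with a classical Hopfield network sketch: pixels map to ±1 neurons, a pattern is stored via Hebbian weights, and a corrupted version settles back to the stored pattern. The pattern, sizes, and update rule below are standard Hopfield conventions, not chip-specific details:

```python
import numpy as np

def hebbian_weights(patterns):
    """Store patterns (rows of +/-1) with the Hebbian outer-product rule."""
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0.0)        # no self-coupling
    return W

def recall(W, state, steps=10):
    """Iterate synchronous sign updates until (typically) a fixed point."""
    for _ in range(steps):
        state = np.sign(W @ state)
        state[state == 0] = 1
    return state

rng = np.random.default_rng(3)
pattern = np.sign(rng.standard_normal(100))   # stored "image" of +/-1 pixels
W = hebbian_weights(pattern[None, :])
corrupted = pattern.copy()
corrupted[:15] *= -1                          # flip 15 of the 100 pixels
completed = recall(W, corrupted)
# The network settles back to the stored pattern.
```

With a single stored pattern, one update step already recovers it; with multiple stored patterns the same recall loop completes whichever pattern is closest, which is the autocomplete behavior described above.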
In some embodiments, a thermodynamic computing system 100 may be used to train an energy-based model applied to a graph-based architecture such as a deep Boltzmann machine (DBM). As shown in
As shown in
In the following paragraphs, further detail pertaining to training a deep Boltzmann machine and performing inference is provided.
In some embodiments, a deep Boltzmann machine, such as that which is shown in
the first term may indicate the positive phase term, e.g., the clamped phase, and the second term may indicate the negative phase term, e.g., the unclamped phase. Furthermore, sampling operations may be performed using the Langevin MCMC processes described herein, according to some embodiments. In addition, in the explanation of training a deep Boltzmann machine that follows, visible neurons 552 may be used to encode a given energy-based model's prediction, while other visible neurons of visible neurons 550 may be used for input data.
In some embodiments, a deep Boltzmann machine may be trained one RBM at a time. For example, training may start with RBM 560, then proceed to training of RBM 562, and then to training of RBM 564. For clarity of notation in what follows, B1, B2, and B3 refer to RBMs 560, 562, and 564, respectively, and h1, h2, and h3 refer to non-visible neuron layers 554, 556, and 558, respectively.
For each epoch, the positive phase term,
may be computed, in addition to the negative phase term,
For each input data xi, non-visible nodes
may be sampled, while visible nodes are clamped to the input data. It may be noted that samples obtained from non-visible variables constrained to the non-visible layer of the given RBM being trained (e.g., non-visible neurons 554 in the case that RBM 560 is currently being trained) may be labeled herein as
for each element or the given training data. Said obtained samples may then be used to compute the positive phase term
Furthermore, in order to compute the negative phase term,
results obtained from non-visible states
may be used to sample visible nodes
Then, using sampled values for the visible nodes,
may be sampled. Next, xB
multiple times, according to some embodiments. Following a computation of the positive and negative phase terms, weights and biases that are constrained to RBM 560 may be updated according to the parameter update definition for θt+1 provided above.
In some embodiments, training may then proceed to RBM 562, wherein sampled values for the non-visible nodes of RBM 560 that were computed for the positive phase term
may be used as inputs for the visible nodes (e.g., inputs used in the non-visible neurons 554 layer of the deep Boltzmann machine shown in
may now assume the role of input data xi for each vector used to store the training data. Then, a process of computing the positive and negative phase terms, as described above, may be repeated. Furthermore, training may then proceed to RBM 564, and then to any further RBMs of the given deep Boltzmann machine currently being trained.
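The greedy layer-by-layer order described above, in which the sampled non-visible nodes of one trained RBM assume the role of input data x_i for the next, may be sketched as follows. The `TinyRBM` class and `train_greedy` function are hypothetical illustrations, not the specification's implementation, and use a simple CD-1 update in place of the hardware-accelerated sampling described herein.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class TinyRBM:
    """Minimal Bernoulli RBM, used only to illustrate the stacking order."""
    def __init__(self, n_vis, n_hid, lr=0.05):
        self.W = 0.01 * rng.standard_normal((n_vis, n_hid))
        self.b = np.zeros(n_vis)
        self.c = np.zeros(n_hid)
        self.lr = lr

    def sample_hidden(self, X):
        ph = sigmoid(X @ self.W + self.c)
        return (rng.random(ph.shape) < ph).astype(float)

    def epoch_update(self, X):
        # One CD-1 step: clamped phase term minus unclamped phase term.
        ph = sigmoid(X @ self.W + self.c)
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ self.W.T + self.b)
        xk = (rng.random(pv.shape) < pv).astype(float)
        ph_k = sigmoid(xk @ self.W + self.c)
        self.W += self.lr * (X.T @ ph - xk.T @ ph_k) / len(X)

def train_greedy(X, layer_sizes, epochs=5):
    """Train RBMs in order (B1, then B2, then B3, ...); the sampled
    non-visible nodes of each trained RBM become the visible input
    data x_i for the next RBM in the stack."""
    rbms, data = [], X
    for n_vis, n_hid in zip(layer_sizes, layer_sizes[1:]):
        rbm = TinyRBM(n_vis, n_hid)
        for _ in range(epochs):
            rbm.epoch_update(data)
        rbms.append(rbm)
        data = rbm.sample_hidden(data)  # next layer's "input data"
    return rbms
```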
Furthermore, inference may be performed according to the Langevin MCMC update rules introduced above that account for non-visible neurons, e.g.,
In order to sample non-visible variables using the probability pθ(z|xk), the following decomposition may be applied,
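The specification's decomposition equation is not reproduced in the text above. For orientation only, a standard approximate factorization used when a deep model is assembled from k stacked RBMs is shown below; it may differ in detail from the decomposition actually referenced:

$$
p_\theta\!\left(h^{1},\dots,h^{k} \mid x\right) \;\approx\; p_\theta\!\left(h^{1} \mid x\right)\, p_\theta\!\left(h^{2} \mid h^{1}\right)\cdots p_\theta\!\left(h^{k} \mid h^{k-1}\right)
$$

Under such a factorization, the non-visible variables may be sampled layer by layer, each conditioned on the sampled values of the layer below.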
wherein a deep Boltzmann machine may be composed of k RBMs (e.g., in
A person having ordinary skill in the art should understand that implementations described herein with regard to accelerating sampling steps by performing Langevin MCMC steps on a thermodynamic chip of a given thermodynamic computing system 100 may be applied to training a deep Boltzmann machine, according to some embodiments.
In some embodiments, samples may be space averaged. For example, in
In some embodiments, samples may be time averaged, wherein samples are taken at various times during the evolution of the system that has been configured according to the engineered Hamiltonian. In some embodiments, time averaging may involve re-initializing the system and repeating the evolution wherein the re-initialization picks up where a prior evolution left off.
In some embodiments, various initialization schemes may be used for time and/or space averaging, such as: re-initializing neurons of the algorithm mapped to the oscillators of the thermodynamic chip to repeat the evolution between successive instances of performing two or more measurement operations; originally initializing neurons according to a distribution and, for subsequent initializations, re-initializing the neurons to have same values as in the distribution used for the original initialization; originally initializing neurons according to a distribution and, for subsequent initializations, re-initializing the neurons to have same values as ending values of an immediately preceding evolution; originally initializing neurons according to a distribution and, for subsequent initializations, re-initializing the neurons according to the distribution, wherein the neurons are not required to have same values as resulted from the original or a preceding distribution.
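The re-initialization schemes enumerated above may be sketched as follows. The physical evolution of the oscillators is emulated here by a toy relaxation process; the scheme names (`"fixed"`, `"continue"`, `"fresh"`) and all function names are illustrative assumptions, not terms from the specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def evolve(state, steps=100, eta=0.01):
    """Stand-in for the physical evolution of the oscillators; here a
    toy overdamped Langevin relaxation (illustrative only)."""
    for _ in range(steps):
        noise = rng.standard_normal(state.shape)
        state = state - eta * state + np.sqrt(2 * eta) * noise
    return state

def time_averaged_samples(n_neurons, n_runs, scheme="fresh"):
    """Collect one measurement per evolution, re-initializing between
    runs according to the chosen scheme:
      'fixed'    - every run restarts from the same original values
      'continue' - each run picks up where the prior evolution left off
      'fresh'    - each run re-draws from the initial distribution"""
    init = rng.standard_normal(n_neurons)   # original initialization
    state, samples = init, []
    for _ in range(n_runs):
        state = evolve(state)
        samples.append(state.copy())        # measurement operation
        if scheme == "fixed":
            state = init.copy()
        elif scheme == "fresh":
            state = rng.standard_normal(n_neurons)
        # 'continue': keep the ending values of the preceding evolution
    return np.mean(samples, axis=0)         # time average
```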
A person having ordinary skill in the art should understand that replicas 604, 606, 608, and 610 may resemble a graph-based architecture such as that which is shown in
Furthermore, additional hardware designs may be implemented such that sequential sampling, for example, may be performed. In another example, more than one thermodynamic chip may be implemented within a given thermodynamic computing system 100, according to some embodiments. In such embodiments, one or more thermodynamic chips may be dedicated to performing sampling operations, while one or more additional thermodynamic chips may be dedicated to performing inference operations.
As shown in
As introduced above, neurons of a set V in a given engineered Hamiltonian Htotal may be implemented using superconducting flux elements, according to some embodiments. Superconducting flux elements may be fabricated as non-linear oscillators with either single- or dual-well potentials and, as such, are applicable to terms of an engineered Hamiltonian Htotal. Furthermore, superconducting flux elements take on continuous values in the classical limit, and the oscillations between energy levels of such elements, governed by the corresponding energy differences, operate in the GHz regime, thus leading to faster Langevin dynamics and improved sampling and inference as performed on thermodynamic chip 702 relative to that which could be performed using FPGA 706 (or ASIC 806).
In some embodiments, for performing inference and/or sampling operations, the dynamical components of a given thermodynamic computing system 100 include neurons. Furthermore, weights and biases may be trained using an FPGA (or an ASIC, see description pertaining to
In some embodiments, inference may be performed using hardware designs such as those which are shown in
The configuration shown in
As shown in
In some embodiments in which hardware designs such as those shown in
In other embodiments in which natural descent and/or mirror descent algorithms are applied, thermodynamic chip 902 may perform sampling operations, for example, at respective iterations defined by
Furthermore, FPGA 906 may then be used to compute weights and biases, whose results may then be used to fix qs
Furthermore, dilution refrigerators 704 and 904 may refer to any environment that enables at least thermodynamic chips 702 and 902 (and also FPGA 906 and/or ASIC 1006, in some embodiments as shown in
The configuration shown in
At block 1102, an initial version of an engineered Hamiltonian is generated (or received). The Hamiltonian is to be used to configure physical elements (e.g., oscillators) of a thermodynamic chip such that the physical elements evolve in an engineered way that can be sampled to execute, at least in part, a portion of an algorithm, such as a Monte Carlo sampling method embedded in a larger algorithm, or any other stochastic sampling model used in an algorithm, such as those that follow Langevin dynamics.
At block 1104, the oscillators of the substrate of the thermodynamic chip are coupled according to the engineered Hamiltonian. For example, the engineered Hamiltonian may define relationships between visible and non-visible neurons, including weightings (applied at edges between neurons) and biases applied to nodes (e.g., the neurons). For example, relationships 352, 354, and 356 as shown in
At block 1106, samples may be collected at one or more points during the evolution of the oscillators (that represent evolution of neurons) configured according to the engineered Hamiltonian.
At block 1108, updated weightings and biases may be determined based on the samples collected at block 1106. The determination may be performed by the classical computing device (such as an FPGA or ASIC), such as classical computing device 106, as shown in
At block 1110, an updated engineered Hamiltonian that has been updated to include the determined updated weightings and/or biases may be implemented on the thermodynamic chip.
At block 1112, additional samples may be collected from the thermodynamic chip with the updated engineered Hamiltonian implemented. Said updating the weights and/or biases, implementing an updated Hamiltonian including the updated weights and/or biases, and sampling the thermodynamic chip with the updated Hamiltonian implemented may be repeated until it is determined, at block 1114, that the thermodynamic chip has been sufficiently trained.
At block 1116, once the thermodynamic chip is trained, it may be used to perform a delegated portion of the algorithm, such as generating inferences or samples to be used by other components of the algorithm.
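The training flow of blocks 1102 through 1116 may be sketched as a hybrid classical/thermodynamic loop. The `MockChip` class below is a hypothetical stand-in for the thermodynamic chip (its evolution is emulated by Gaussian sampling), and the simple moment-matching update merely illustrates the classical side of the loop; neither is the specification's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class MockChip:
    """Hypothetical stand-in for a thermodynamic chip: implementing an
    engineered Hamiltonian and evolving it is emulated by sampling."""
    def implement(self, weights, biases):        # blocks 1104 / 1110
        self.weights, self.biases = weights, biases
    def sample(self, n):                         # blocks 1106 / 1112
        return self.biases + rng.standard_normal((n, len(self.biases)))

def train(target_mean, n_neurons=3, lr=0.5, tol=0.05, max_iters=200):
    # Block 1102: generate (or receive) an initial engineered Hamiltonian,
    # represented here by its weights and biases.
    weights = np.zeros((n_neurons, n_neurons))
    biases = np.zeros(n_neurons)
    chip = MockChip()
    chip.implement(weights, biases)              # block 1104: couple oscillators
    for _ in range(max_iters):
        samples = chip.sample(64)                # blocks 1106 / 1112: collect samples
        grad = samples.mean(axis=0) - target_mean
        biases = biases - lr * grad              # block 1108: update biases
        chip.implement(weights, biases)          # block 1110: updated Hamiltonian
        if np.abs(grad).max() < tol:             # block 1114: sufficiently trained?
            break
    return chip.sample(64)                       # block 1116: delegated inference
```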
In some embodiments, a process of executing an algorithm that includes stochastic probabilities, such as may be determined via Monte Carlo sampling methods (e.g., block 1202), includes steps such as those shown in blocks 1204 through 1212.
At block 1204, one or more portions of the algorithm are executed using classical computing devices, such as processors 1310 of computer system 1300, as shown in
At block 1206, one or more portions of the algorithm are delegated to be performed on a thermodynamic chip, such as thermodynamic chip 1380 (as shown in
At block 1208, one or more classical computing devices, such as processors 1310, receive from the thermodynamic chip (such as thermodynamic chip 1380) statistics or other sampled values for use in performing other aspects of the algorithm. In some embodiments, statistics are obtained by measuring multiple neurons on a thermodynamic chip at the end of their evolution following Langevin dynamics. For example, the neurons may evolve on the thermodynamic chip following Langevin dynamics, and samples used to perform averages on a classical computer may be obtained by measuring the neurons of the thermodynamic chip at the end of their evolution. The measurement results may then be fed back to the classical computer, where an average is performed (for example, as discussed at block 1210).
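The measure-at-end-of-evolution pattern described above may be sketched with a discretized overdamped Langevin update. The quadratic energy used here is a toy placeholder for the engineered Hamiltonian, and the step size, temperature, and run count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_energy(x):
    """Gradient of a toy quadratic energy U(x) = x^2 / 2; on the chip,
    the relevant energy would be the engineered Hamiltonian."""
    return x

def langevin_evolution(n_neurons, eta=0.01, steps=1000, temperature=1.0):
    """Discretized overdamped Langevin dynamics:
    x <- x - eta * grad U(x) + sqrt(2 * eta * T) * xi."""
    x = rng.standard_normal(n_neurons)
    for _ in range(steps):
        noise = rng.standard_normal(n_neurons)
        x = x - eta * grad_energy(x) + np.sqrt(2 * eta * temperature) * noise
    return x  # measured only at the end of the evolution

# Measure many independent evolutions, then average on the
# "classical" side (as at block 1210).
measurements = np.stack([langevin_evolution(4) for _ in range(50)])
estimate = measurements.mean(axis=0)
```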
At block 1210, a classical computing device, such as an FPGA or ASIC (e.g., classical computing device 106), performs additional post-processing steps (if needed), such as time averaging, space averaging, etc., on the samples returned from the thermodynamic chip.
At block 1212, one or more classical computing devices, such as processor 1310, use the returned statistics or samples in execution of other parts of the algorithm.
In the illustrated embodiment, computer system 1300 includes one or more processors 1310 coupled to a system memory 1320 (which may comprise both non-volatile and volatile memory modules) via an input/output (I/O) interface 1330. Computer system 1300 further includes a network interface 1340 coupled to I/O interface 1330. Classical computing functions may be performed on a classical computer system, such as computer system 1300.
Additionally, computer system 1300 includes computing device 1370 coupled to thermodynamic chip 1380. In some embodiments, computing device 1370 may be a field programmable gate array (FPGA), application specific integrated circuit (ASIC) or other suitable processing unit. In some embodiments, computing device 1370 may be a similar computing device as described in
In various embodiments, computer system 1300 may be a uniprocessor system including one processor 1310, or a multiprocessor system including several processors 1310 (e.g., two, four, eight, or another suitable number). Processors 1310 may be any suitable processors capable of executing instructions. For example, in various embodiments, processors 1310 may be general-purpose or embedded processors implementing any of a variety of instruction set architectures (ISAs), such as the x86, PowerPC, SPARC, or MIPS ISAs, or any other suitable ISA. In multiprocessor systems, each of processors 1310 may commonly, but not necessarily, implement the same ISA. In some implementations, graphics processing units (GPUs) may be used instead of, or in addition to, conventional processors.
System memory 1320 may be configured to store instructions and data accessible by processor(s) 1310. In at least some embodiments, the system memory 1320 may comprise both volatile and non-volatile portions; in other embodiments, only volatile memory may be used. In various embodiments, the volatile portion of system memory 1320 may be implemented using any suitable memory technology, such as static random-access memory (SRAM), synchronous dynamic RAM or any other type of memory. For the non-volatile portion of system memory (which may comprise one or more NVDIMMs, for example), in some embodiments flash-based memory devices, including NAND-flash devices, may be used. In at least some embodiments, the non-volatile portion of the system memory may include a power source, such as a supercapacitor or other power storage device (e.g., a battery). In various embodiments, memristor based resistive random-access memory (ReRAM), three-dimensional NAND technologies, Ferroelectric RAM, magneto resistive RAM (MRAM), or any of various types of phase change memory (PCM) may be used at least for the non-volatile portion of system memory. In the illustrated embodiment, program instructions and data implementing one or more desired functions, such as those methods, techniques, and data described above, are shown stored within system memory 1320 as code 1325 and data 1326.
In some embodiments, I/O interface 1330 may be configured to coordinate I/O traffic between processor 1310, system memory 1320, computing device 1370, and any peripheral devices in the computer system, including network interface 1340 or other peripheral interfaces such as various types of persistent and/or volatile storage devices. In some embodiments, I/O interface 1330 may perform any necessary protocol, timing or other data transformations to convert data signals from one component (e.g., system memory 1320) into a format suitable for use by another component (e.g., processor 1310). In some embodiments, I/O interface 1330 may include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example. In some embodiments, the function of I/O interface 1330 may be split into two or more separate components, such as a north bridge and a south bridge, for example. Also, in some embodiments some or all of the functionality of I/O interface 1330, such as an interface to system memory 1320, may be incorporated directly into processor 1310.
Network interface 1340 may be configured to allow data to be exchanged between computer system 1300 and other devices 1360 attached to a network or networks 1350, such as other computer systems or devices. In various embodiments, network interface 1340 may support communication via any suitable wired or wireless general data networks, such as types of Ethernet network, for example. Additionally, network interface 1340 may support communication via telecommunications/telephony networks such as analog voice networks or digital fiber communications networks, via storage area networks such as Fibre Channel SANs, or via any other suitable type of network and/or protocol.
In some embodiments, system memory 1320 may represent one embodiment of a computer-accessible medium configured to store at least a subset of program instructions and data used for implementing the methods and apparatus discussed in the context of
Various embodiments may further include receiving, sending or storing instructions and/or data implemented in accordance with the foregoing description upon a computer-accessible medium. Generally speaking, a computer-accessible medium may include storage media or memory media such as magnetic or optical media, e.g., disk or
DVD/CD-ROM, volatile or non-volatile media such as RAM (e.g., SDRAM, DDR, RDRAM, SRAM, etc.), ROM, etc., as well as transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as network and/or a wireless link.
The various methods as illustrated in the Figures above and described herein represent exemplary embodiments of methods. The methods may be implemented in software, hardware, or a combination thereof. The order of the methods may be changed, and various elements may be added, reordered, combined, omitted, modified, etc.
It will also be understood that, although the terms first, second, etc., may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first contact could be termed a second contact, and, similarly, a second contact could be termed a first contact, without departing from the scope of the present invention. The first contact and the second contact are both contacts, but they are not the same contact.
Various modifications and changes may be made as would be obvious to a person skilled in the art having the benefit of this disclosure. It is intended to embrace all such modifications and changes and, accordingly, the above description is to be regarded in an illustrative rather than a restrictive sense.
This application claims benefit of priority to U.S. Provisional Application Ser. No. 63/492,171, entitled “Hybrid Thermodynamic Classical Computing System,” filed Mar. 24, 2023, and which is incorporated herein by reference in its entirety.