Various algorithms, such as machine learning algorithms, often use statistical probabilities to make decisions or to model systems. Some such learning algorithms may use Bayesian statistics, or may use other statistical models that have a theoretical basis in natural phenomena.
Generating such statistical probabilities may involve performing complex calculations which may require both time and energy to perform, thus increasing a latency of execution of the algorithm and/or negatively impacting energy efficiency. In some scenarios, calculation of such statistical probabilities using classical computing devices may result in non-trivial increases in execution time of algorithms and/or energy usage to execute such algorithms.
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood, that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to. When used in the claims, the term “or” is used as an inclusive or and not as an exclusive or. For example, the phrase “at least one of x, y, or z” means any one of x, y, and z, as well as any combination thereof.
The present disclosure relates to methods, systems, and an apparatus for performing computer operations using various multi-thermodynamic chip architectures. In some embodiments, neuro-thermodynamic processors may be configured such that learning algorithms for energy-based models may be applied using Langevin dynamics. For example, as described herein, two, three, or more neuro-thermodynamic processors may be arranged in respective three-dimensional architectures such that, given a series of Hamiltonian and coupling terms, weights and biases (e.g., synapses) may naturally evolve (and thus be learned) through Langevin dynamics.
Furthermore, in some embodiments, physical elements of a thermodynamic chip may be used to physically model evolution according to Langevin dynamics. For example, in some embodiments, a thermodynamic chip includes a substrate comprising oscillators implemented using superconducting flux elements. The oscillators may be mapped to neurons (visible or hidden) that “evolve” according to Langevin dynamics. For example, the oscillators of the thermodynamic chip may be initialized in a particular configuration and allowed to thermodynamically evolve. As the oscillators “evolve,” degrees of freedom of the oscillators may be sampled. Values of these sampled degrees of freedom may represent, for example, vector values for neurons that evolve according to Langevin dynamics. For example, algorithms that use stochastic gradient optimization and require sampling during training, such as those proposed by Welling and Teh, and/or other algorithms, such as natural gradient descent, mirror descent, etc., may be implemented using a thermodynamic chip. In some embodiments, a thermodynamic chip may enable such algorithms to be implemented directly by sampling the neurons (e.g., degrees of freedom of the oscillators of the substrate of the thermodynamic chip) without having to calculate statistics to determine probabilities. As another example, thermodynamic chips may be used to perform autocomplete tasks, such as those that use Hopfield networks, which may be implemented using the Welling and Teh algorithm. For example, visible neurons may be arranged in a fully connected graph (such as a Hopfield network, etc.), and the values of the autocomplete task may be learned using the Welling and Teh algorithm. As another example, inferred values (e.g., degrees of freedom of oscillators) of a first energy-based model may be relayed as outputs of the first energy-based model, and may further be used as inputs to an additional energy-based model.
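The Hopfield-style autocomplete task mentioned above can be sketched in software. The following is a purely classical, illustrative simulation (the network size, stored pattern, Hebbian training rule, and update scheme are assumptions for illustration, not taken from the disclosure; on a thermodynamic chip, the analogous relaxation would occur physically via evolution of coupled oscillators):

```python
# Classical toy simulation of a Hopfield-network autocomplete task.
# All sizes and patterns below are illustrative assumptions.

def train_hebbian(patterns):
    """Hebbian weights w[i][j] = sum over patterns of p[i]*p[j], zero diagonal."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w

def recall(w, state, steps=10):
    """Synchronous sign updates until the state stops changing."""
    n = len(state)
    s = list(state)
    for _ in range(steps):
        new = [1 if sum(w[i][j] * s[j] for j in range(n)) >= 0 else -1
               for i in range(n)]
        if new == s:
            break
        s = new
    return s

stored = [1, 1, -1, -1, 1, -1]       # pattern to "autocomplete" toward
w = train_hebbian([stored])
corrupted = [1, -1, -1, -1, 1, -1]   # one bit flipped
print(recall(w, corrupted) == stored)
```

Here the corrupted input relaxes back to the stored pattern, which is the software analogue of sampling the visible neurons after thermodynamic evolution.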
In some embodiments, a thermodynamic chip includes oscillators implemented using superconducting flux elements arranged in a substrate, wherein the thermodynamic chip is configured to modify magnetic fields that couple respective ones of the oscillators with other ones of the oscillators. In some embodiments, non-linear (e.g., anharmonic) oscillators are used that, for example, have dual-well potentials. These dual-well oscillators may be mapped to neurons of a given model that the thermodynamic chip is being used to implement. Also, in some embodiments, at least some of the oscillators may be harmonic oscillators with single-well potentials. In some embodiments, oscillators may be implemented using superconducting flux elements with varying amounts of non-linearity. In some embodiments, an oscillator may have a single-well potential, a dual-well potential, a potential somewhere in a range between a single-well potential and a dual-well potential, or a multi-well potential. In some embodiments, visible neurons may be mapped to oscillators having any of these potentials.
In some embodiments, parameters of an energy-based model or other learning algorithm may be learned through evolution of the oscillators of a set of thermodynamic chips that have been configured in a current configuration with couplings that correspond to a current engineered Hamiltonian being used to approximate aspects of the energy-based model for the respective thermodynamic chips. For example, a server thermodynamic chip may use positive and negative phase terms to drive weights and biases of a first thermodynamic chip to approximately match those of a second thermodynamic chip, according to some embodiments. However, in the second thermodynamic chip, visible neurons may be left un-clamped, whereas in the first thermodynamic chip the visible neurons may be clamped to input data. In this way, the weights and biases of a current evolution step influence the un-clamped visible neurons of the second thermodynamic chip, which in turn influence the weights and biases that are maintained the same across the first and second thermodynamic chips. Thus, even though the visible neurons of the first thermodynamic chip are clamped to the input data, the weights and biases are free to evolve to “learn” relationships within the clamped input data via the back-and-forth evolution with the second thermodynamic chip, via the server thermodynamic chip. In this way, updated weights or biases to be used in the engineered Hamiltonian may be automatically learned via evolution of the multi-chip architecture comprising multiple thermodynamic chips coupled via a server thermodynamic chip. In some embodiments, such updates to weights and biases may be allowed to evolve until the engineered Hamiltonian of the first thermodynamic chip has been adjusted such that samples taken from the first thermodynamic chip satisfy one or more training criteria, such that the first thermodynamic chip accurately performs inference.
For example, in some embodiments, the engineered Hamiltonian (shown in the equation below) may be implemented using a thermodynamic chip (e.g., a clamped chip, un-clamped chip, etc.) wherein the first term of the Hamiltonian represents visible neurons, the second and third terms represent potentials for weights and biases, and the final two terms of the Hamiltonian represent couplings between the weights and biases and the visible neurons.
In some embodiments, the hardware implementation of the neurons used to encode the data may be based on a flux qubit design. For example, the neuron may be represented by a phase/flux degree of freedom, wherein an oscillator design that implements the phase/flux degree of freedom is based on a DC SQUID (direct current superconducting quantum interference device), which, for example, contains two Josephson junctions. In what follows, Ej will be used to denote the Josephson energy. L corresponds to the inductance of the main loop, and results in the inductive energy EL. {tilde over (φ)}L is the external flux coupled to the main loop and {tilde over (φ)}DC is the external flux coupled into the DC SQUID loop. The Hamiltonian used to represent the physical neurons, along with how they couple with each other, is given by:
where the set in the first term corresponds to the visible neurons (which are used to encode the data in a given learning algorithm). In some embodiments, a term may be added to the Hamiltonian above, wherein the added term represents hidden neurons in a system. The term relating to hidden neurons may be of similar form as the term for the visible neurons. The set of synapses ε (e.g., weights and biases) includes the edges representing the couplings between the visible neurons and potentially present hidden neurons. Biases are added to all the visible neurons and potentially present hidden neurons.
In some embodiments, the potentials used in the above Hamiltonian are given by,
where the parameters EL, Ej0, {tilde over (φ)}L and {tilde over (φ)}DC may be tuned to obtain single well, double well, or multiple well potentials. For instance, if, in a given energy-based model, the data is constrained to take on values that are either +1 or −1, a double well potential for the visible neurons would be most appropriate. For quadratic potentials, the condition {tilde over (φ)}DC=π may be chosen since this causes the second term to vanish. For double well potentials, a more careful choice of {tilde over (φ)}DC may be required. Note that the coupling parameters α and β in the Hamiltonian above can take on both positive and negative values. The ability to flip between positive and negative signs may prove useful when developing a self-learning algorithm following Langevin dynamics. In some embodiments, sign flips are used along with squeezing type operations to generate a self-learning algorithm. In other embodiments, residual terms in the approach presented below are of higher order, and thus have much less of an impact on the learning dynamics.
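The tunability between single-well and double-well potentials can be illustrated numerically. The simplified one-dimensional potential below combines a quadratic inductive term with a Josephson cosine term; the functional form and all parameter values are illustrative assumptions chosen only to show how the external flux {tilde over (φ)}DC switches the well structure (consistent with the observation above that {tilde over (φ)}DC=π makes the cosine term vanish):

```python
import math

def potential(phi, E_L=1.0, E_J0=5.0, phi_L=0.0, phi_DC=0.0):
    """Simplified, illustrative flux potential: quadratic inductive term
    plus a Josephson cosine term whose strength is tuned by phi_DC."""
    return (0.5 * E_L * (phi - phi_L) ** 2
            - E_J0 * math.cos(phi_DC / 2.0) * math.cos(phi))

def count_minima(f, lo=-4.0, hi=4.0, n=2001):
    """Count strict local minima of f on a grid."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    ys = [f(x) for x in xs]
    return sum(1 for i in range(1, n - 1)
               if ys[i] < ys[i - 1] and ys[i] < ys[i + 1])

# phi_DC = pi makes the cosine coefficient vanish -> single quadratic well.
single = count_minima(lambda p: potential(p, phi_DC=math.pi))
# phi_DC = 2*pi flips the sign of the cosine coefficient; with a strong
# E_J0 this produces a double well (one hedged choice among many).
double = count_minima(lambda p: potential(p, phi_DC=2 * math.pi))
print(single, double)   # -> 1 2
```

This mirrors the text: for data constrained to ±1 values, a parameter choice yielding the double-well case would be used for the visible neurons.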
In some embodiments, learning algorithms for energy-based models may be based on the stochastic gradient optimization algorithm of Welling and Teh. Consider a set of N data items X={xi}i=1N with posterior distribution pθ(x)=exp(−εθ(x))/Z(θ) with the partition function Z(θ)=∫exp(−εθ(x))dx. In some embodiments, stochastic gradient algorithms may be combined with Langevin dynamics to obtain a parameter update algorithm that allows for efficient use of large datasets while allowing for parameter uncertainty to be captured in a Bayesian manner. For example, the update rule may be given by,
with ηt˜N(0,∈t) if no hidden neurons are used. The step sizes ∈t must satisfy the properties Σt=1∞∈t=∞ and Σt=1∞∈t2<∞. During learning of weights and biases using a stochastic gradient algorithm, the first condition ensures that the parameters (e.g., weights and biases) will reach high probability regions regardless of where they are initialized, and the second condition ensures that the parameters (e.g., weights and biases) converge to the mode instead of oscillating around it. In some embodiments, a functional form which satisfies these requirements is to set ∈t=a(b+t)−γ. Note that at each iteration t, a subset of data items (with size n) Xt={xt1, . . . , xtn} is used.
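The Welling and Teh stochastic gradient Langevin update, with the step-size schedule above, can be sketched on a toy problem. The Gaussian-mean model, data, and all hyperparameter values below are illustrative assumptions, not taken from the disclosure:

```python
import math
import random

random.seed(0)

# Synthetic data: N draws from a unit-variance Gaussian with true mean 2.0.
N, true_mu = 1000, 2.0
data = [true_mu + random.gauss(0.0, 1.0) for _ in range(N)]

def grad_log_prior(theta):   # standard-normal prior on theta (assumption)
    return -theta

def grad_log_lik(x, theta):  # unit-variance Gaussian likelihood (assumption)
    return x - theta

def sgld(T=5000, n=32, a=0.002, b=10.0, gamma=0.55):
    """Stochastic gradient Langevin dynamics with step sizes
    eps_t = a*(b+t)**(-gamma), satisfying the two summability conditions."""
    theta, samples = 0.0, []
    for t in range(1, T + 1):
        eps = a * (b + t) ** (-gamma)
        batch = random.sample(data, n)   # mini-batch of size n
        grad = grad_log_prior(theta) + (N / n) * sum(
            grad_log_lik(x, theta) for x in batch)
        # Gradient step plus injected Gaussian noise of variance eps.
        theta += (eps / 2.0) * grad + random.gauss(0.0, math.sqrt(eps))
        if t > T // 2:                   # keep post-burn-in samples
            samples.append(theta)
    return samples

samples = sgld()
post_mean = sum(samples) / len(samples)
print(post_mean)   # close to the true mean 2.0
```

The retained samples approximate draws from the Bayesian posterior over the parameter, which is the behavior the multi-chip architecture is designed to realize physically.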
In some embodiments using the Hamiltonian above, the energy function may be written as εθ(x)=ε(s)(θ)+ε(v)(x)+ε(c)(θ,x), where the energy function is decomposed into a sum of potential terms for the synapse parameters θ (ε(s)(θ)), the visible neurons (ε(v)(x)), and (potentially) hidden neurons (ε(h)(z)), along with coupling terms given by ε(c)(θ,x) or ε(c)(θ,x,z). Using this structure, the gradient of the log likelihood may be computed as follows,
Accordingly, only the coupling terms of the Hamiltonian above are relevant in computing the gradient of the log likelihood.
In some embodiments, a parameter update rule may be written as,
The first term in the large parentheses above is known as the positive phase term, whereas the second term in the above equation is known as the negative phase term. Note the sign difference between the two terms. Note also that the negative phase term requires averaging over sampled values of the visible nodes x from pθ(x). Monte Carlo methods as well as persistent contrastive divergence methods may be employed to sample multiple paths from pθ(x) without having to take too many Monte Carlo steps. A time average approach may also be used to approximate an expectation value for a self-learning algorithm.
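The positive/negative phase structure of this update rule can be illustrated with a toy two-spin energy-based model, where the negative phase is estimated by Monte Carlo sampling as described above. The model, the target data correlation, and all parameter values are illustrative assumptions:

```python
import math
import random

random.seed(1)

# Toy energy-based model on two +/-1 spins: E_w(x) = -w * x1 * x2, so
# d/dw log p(x) = x1*x2 - <x1*x2>_model  (positive minus negative phase).

def sample_model(w, n_steps=200):
    """Metropolis sampling of p(x) ~ exp(w*x1*x2); returns one sample of x1*x2
    used to estimate the negative phase term."""
    x = [1, 1]
    for _ in range(n_steps):
        i = random.randrange(2)
        dE = 2 * w * x[0] * x[1]   # energy change from flipping either spin
        if dE <= 0 or random.random() < math.exp(-dE):
            x[i] = -x[i]
    return x[0] * x[1]

data_corr = 0.6   # assumed empirical correlation <x1*x2> over the data
w, lr = 0.0, 0.05
for step in range(400):
    positive = data_corr                                     # clamped/data term
    negative = sum(sample_model(w) for _ in range(50)) / 50  # free/model term
    w += lr * (positive - negative)

# For this model <x1*x2>_model = tanh(w), so w should approach atanh(0.6).
print(w)
```

The "positive" term plays the role of the clamped chip (visible neurons fixed to data) and the sampled "negative" term plays the role of the un-clamped chip.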
In some embodiments, a Hamiltonian description for a three-chip architecture may resemble the following:
Note that for ease of notation, φn represents vertices, such as the visible neurons 254 shown in
In some embodiments, a Hamiltonian description for a two-chip architecture (as shown in
In some embodiments, the use of a thermodynamic chip in a computer system may enable a learning algorithm to be implemented in a more efficient and faster manner than if the learning algorithm was implemented purely using classical components. For example, measuring the neurons in a thermodynamic chip to determine Langevin statistics may be quicker and more energy efficient than determining such statistics via calculation.
Broadly speaking, classes of algorithms that may benefit from thermodynamic chips include those algorithms that involve probabilistic inference. Such probabilistic inferences (which otherwise would be performed using a CPU or GPU) may instead be delegated to the thermodynamic chip for a faster and more energy efficient implementation. At a physical level, the thermodynamic chip harnesses electron fluctuations in superconductors coupled in flux loops to model Langevin dynamics. In some embodiments, multi-chip architectures such as those described herein may resemble a full self-learning architecture, wherein classical computing device(s) (e.g., a GPU, FPGA, etc.) may be replaced with such self-learning neuro-thermodynamic computer(s) in order to implement a full learning algorithm (e.g., the Welling and Teh learning algorithm) without use of such classical computing co-processor(s).
Note that in some embodiments, electro-magnetic or mechanical (or other suitable) oscillators may be used. A thermodynamic chip may implement neuro-thermodynamic computing and therefore may be said to be neuromorphic. For example, the neurons implemented using the oscillators of the thermodynamic chip may function as neurons of a neural network that has been implemented directly in hardware. Also, the thermodynamic chip is “thermodynamic” because the chip may be operated in the thermodynamic regime slightly above 0 Kelvin, wherein thermodynamic effects cannot be ignored. For example, some thermodynamic chips may be operated within the milli-Kelvin range, and/or at 2, 3, 4, etc. degrees Kelvin. In some embodiments, temperatures less than 15 Kelvin may be used, though other temperature ranges are also contemplated. This, in some contexts, may also be referred to as analog stochastic computing. In some embodiments, the temperature regime and/or oscillation frequencies used to implement the thermodynamic chip may be engineered to achieve certain statistical results. For example, the temperature, friction (e.g., damping) and/or oscillation frequency may be controlled variables that ensure the oscillators evolve according to a given dynamical model, such as Langevin dynamics. In some embodiments, temperature may be adjusted to control a level of noise introduced into the evolution of the neurons. As yet another example, a thermodynamic chip may be used to model energy models that require a Boltzmann distribution. Also, a thermodynamic chip may be used to solve variational algorithms and perform full self-learning tasks and operations.
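How temperature sets the noise level, and hence the stationary Boltzmann statistics, of a Langevin-evolving degree of freedom can be sketched classically. The overdamped model, the harmonic potential, and all parameter values below are illustrative assumptions (the hardware evolves physically; no simulation is implied):

```python
import math
import random

random.seed(2)

def overdamped_langevin(T, k=1.0, gamma=1.0, dt=0.01, n_steps=200_000):
    """Euler-Maruyama integration of gamma*dq = -k*q*dt + sqrt(2*gamma*T)*dW.
    The Boltzmann stationary distribution gives Var[q] = T/k (k_B = 1)."""
    q, qs = 0.0, []
    noise_scale = math.sqrt(2.0 * gamma * T * dt) / gamma
    for step in range(n_steps):
        q += -(k / gamma) * q * dt + noise_scale * random.gauss(0.0, 1.0)
        if step > n_steps // 2:          # discard burn-in
            qs.append(q)
    return sum(x * x for x in qs) / len(qs)

v_cold = overdamped_langevin(T=0.1)   # variance near 0.1
v_hot = overdamped_langevin(T=1.0)    # variance near 1.0
print(v_cold, v_hot)
```

Raising the temperature raises the sampled variance in proportion, which is the sense in which temperature is a control knob for the noise injected into the evolution of the neurons.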
In some embodiments, a thermodynamic computing system 100 (as shown in
In some embodiments, a self-learning neuro-thermodynamic computing device 102 may include multiple thermodynamic chips, such as clamped chip 104, server chip 106, and un-clamped chip 108. A person having ordinary skill in the art should understand that
In some embodiments, classical computing device 114 may include one or more devices such as a field-programmable gate array (FPGA), an application specific integrated circuit (ASIC), and/or other devices that may be configured to interact and/or interface with thermodynamic chips within the architecture of self-learning neuro-thermodynamic computer 102. For example, such devices may be used to tune hyperparameters of the given thermodynamic system, etc.
In some embodiments, a substrate 202 may be included in a thermodynamic chip, such as any one of the thermodynamic chips implemented in self-learning neuro-thermodynamic computing device 102. Oscillators 204 of substrate 202 may be mapped in a logical representation 252 to neurons 254, as well as weights and biases (shown in
In some embodiments, Josephson junctions and/or superconducting quantum interference devices (SQUIDs) may be used to implement and/or excite/control the oscillators 204. In some embodiments, the oscillators 204 may be implemented using superconducting flux elements (e.g., qubits). In some embodiments, the superconducting flux elements may physically be instantiated using a superconducting circuit built out of coupled nodes comprising capacitive, inductive, and Josephson junction elements, connected in series or parallel, such as shown in
While weights and biases are not shown in
In some embodiments, oscillators associated with weights and biases, such as bias 256 and weights 258 and 260, may be allowed to evolve during a training phase and may be held nearly constant during an inference phase, as further described in
In some embodiments, the self-learning neuro-thermodynamic computing device 102 may learn relationships between respective ones of the neurons such as relationship A (352), relationship B (354), and relationship C (356). These relationships may be physically implemented in substrate 202 via couplings between oscillators 204, such as couplings 302, 304, and 306 that physically implement respective relationships 352, 354, and 356. These learned relationships may comprise learned weights and biases, that are learned via evolution of the self-learning neuro-thermodynamic computing device 102.
In some embodiments, a drive 402 may cause pulses 404 to be emitted to initialize hyperparameters. An FPGA and/or other classical computing device, as represented via classical computing device 114 in
In some embodiments, visible neurons, such as visible neurons 254, may be linked via connected edges 506. Furthermore, as shown in
In some embodiments, a three-chip architecture may resemble that which is shown in
In some embodiments, a multi-chip thermodynamic processor architecture consists of three chips, each representing a thermodynamic processor. The dynamics of the chips labeled clamped 606 and un-clamped 602 are both governed by a Hamiltonian (e.g., such as shown above). However, for the clamped chip, the visible neurons are clamped to data, which can be achieved by setting positions q and momenta p to a value of the data. For the un-clamped chip 602, the visible nodes evolve freely in the sense that the initial condition q=0 is set. Note that for ease of notation, φn
where Hc and Hu are Hamiltonians for the clamped and un-clamped chips given by a Hamiltonian (e.g., such as shown above). Note that for ease of notation, φn
wherein the server Hamiltonian contains potential terms similar to those used for the clamped and un-clamped chips. However, the weight and bias degrees of freedom of the server chip are not coupled to any visible or hidden neurons. The final terms in the equation for Htot above represent couplings between the position degrees of freedom of the server chip and the clamped and un-clamped chips, as well as the momentum couplings between the server chip and the clamped and un-clamped chips. Note that certain sums in the coupling terms are used for the bias terms and should not be confused with sums over the visible neurons.
In some embodiments such as given above, to show that a Hamiltonian results in parameter update rules (e.g. such as shown in θt+1 above), with parameters being represented by the synapses and biases, the Euler-Maruyama method with time steps of size δt may be applied to the Langevin equations of motion to obtain weakly first order solutions. Provided below are solutions to the position and momentum variables of the synapses (identical solutions can be found for the biases) for all three chips. Solutions will be provided for several increments of time δt until the time evolution pattern becomes clear.
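The Euler-Maruyama discretization referred to above can be sketched for a single harmonic degree of freedom, a classical toy stand-in for one oscillator; the potential and all parameter values are illustrative assumptions. Equipartition at the target temperature serves as a sanity check:

```python
import math
import random

random.seed(3)

def euler_maruyama_langevin(T=0.5, m=1.0, gamma=2.0, k=1.0,
                            dt=0.005, n_steps=400_000):
    """Weakly first-order Euler-Maruyama discretization of
       dq = (p/m) dt
       dp = (-dU/dq - gamma*p/m) dt + sqrt(2*gamma*T) dW
    for a harmonic potential U(q) = k*q**2/2 (k_B = 1)."""
    q, p = 0.0, 0.0
    q2 = p2 = 0.0
    count = 0
    for step in range(n_steps):
        q += (p / m) * dt
        p += (-k * q - gamma * p / m) * dt \
             + math.sqrt(2.0 * gamma * T * dt) * random.gauss(0.0, 1.0)
        if step > n_steps // 2:          # discard burn-in
            q2 += q * q
            p2 += p * p
            count += 1
    # Equipartition at temperature T: <k*q^2> = <p^2/m> = T.
    return k * q2 / count, p2 / (m * count)

kq2, p2m = euler_maruyama_langevin()
print(kq2, p2m)   # both close to T = 0.5
```

Being only weakly first order, the scheme carries an O(δt) bias, which motivates the weakly second-order G-JF integrator discussed below.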
In some embodiments, general equations of motion for position variables of the clamped 606 and un-clamped 602 chips are given by,
where Wt is a Wiener process. Note that it is assumed that the respective rates of change of the position degrees of freedom of the visible nodes are faster than those of the synapses and biases (e.g., by using a smaller mass for the visible nodes; note that when using a DC SQUID design, masses are given by m=ϕ02C, where C is a capacitance in parallel to a junction/DC SQUID, ϕ0=ℏ/2e is the reduced magnetic flux quantum, and e is the elementary charge). In such an embodiment, the position degrees of freedom for the un-clamped system to leading order in δt may be written as,
where the expectation above corresponds to a time average over a time interval δt, wherein the dependence on the visible nodes is labeled as x instead of q. The Grønbech-Jensen/Farago (G-JF) method is a weakly second-order numerical integration method for solving stochastic differential equations (SDEs) such as the general equations of motion above (the standard Euler-Maruyama method is weakly first order). In some embodiments, the G-JF method may be used to solve the equations of motion for a three-chip architecture (e.g., such as shown above). The position and momentum equations of motion are given by
In the large friction limit, such equations reduce to,
Using the immediately above two equations, the position and momentum equations of motion for the server, clamped and un-clamped chips are,
Note that under the Born-Oppenheimer approximation, the time averages above may be replaced with space averages. By inserting the momentum terms into the equations for the position degrees of freedom, the above simplifies to (where the noise terms are omitted to keep the equations simple)
A few remarks are in order. First, it is noted that if λ1=λ2, η1=η2 and q(s)(t)=q(u)(t)=q(c)(t), then q(s)(t+δt) simplifies to,
since the remaining terms all cancel. Note that the condition q(s)(t)=q(u)(t)=q(c)(t) can be achieved using a large coupling parameter λ relative to the gradients. It is also noted that q(s) has a form analogous to the Welling and Teh update rule. Hence the above conditions on the λ and η parameters can allow for a full self-learning protocol. Lastly, the term proportional to ∂Hs/∂q(s) in q(s) can be interpreted as a log prior term which arises from the potential terms in the server chip Hamiltonian.
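The role of the large coupling parameter λ can be sketched with a toy overdamped model of three coupled position variables standing in for the server, clamped, and un-clamped weight degrees of freedom. The gradient values and noise scale below are made-up stand-ins, not the chips' actual Hamiltonian terms:

```python
import random

random.seed(4)

def simulate(lam, dt=0.001, n_steps=20_000):
    """Overdamped dynamics for server/clamped/un-clamped positions with
    different local gradients but a shared position coupling lam."""
    qs, qc, qu = 0.0, 1.0, -1.0
    grad_c, grad_u = 0.5, -0.5   # illustrative stand-in gradient terms
    for _ in range(n_steps):
        def noise():
            return 0.1 * random.gauss(0.0, (2.0 * dt) ** 0.5)
        dqs = -lam * ((qs - qc) + (qs - qu)) * dt + noise()
        dqc = (-grad_c - lam * (qc - qs)) * dt + noise()
        dqu = (-grad_u - lam * (qu - qs)) * dt + noise()
        qs, qc, qu = qs + dqs, qc + dqc, qu + dqu
    # Total mismatch between the server position and the other two chips.
    return abs(qs - qc) + abs(qs - qu)

weak, strong = simulate(lam=0.1), simulate(lam=100.0)
print(weak, strong)   # strong coupling locks the three positions together
```

With λ large relative to the gradients the three positions remain nearly identical, which is the condition q(s)(t)=q(u)(t)=q(c)(t) invoked above.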
Lastly, in what follows, it is pointed out that for the Gibbs distribution of the three-chip system introduced in this section to be a steady state of the Fokker-Planck equation, the noise matrix D representing the noise across the three chips is chosen to be non-diagonal.
The Fokker-Planck equation is given by,
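For reference, a generic underdamped (Kramers) form of the equation, written with the friction matrix Γ and noise matrix D used in this section (a sketch of the standard form, with β denoting inverse temperature, rather than a reproduction of any particular display), is:

```latex
\frac{\partial \rho}{\partial t}
  = \sum_i \left[
      -\frac{\partial}{\partial q_i}\!\left(\frac{p_i}{m}\,\rho\right)
      + \frac{\partial}{\partial p_i}\!\left(\frac{\partial H}{\partial q_i}\,\rho\right)
    \right]
  + \sum_{i,j} \left[
      \Gamma_{ij}\,\frac{\partial}{\partial p_i}\!\left(\frac{p_j}{m}\,\rho\right)
      + D_{ij}\,\frac{\partial^2 \rho}{\partial p_i\,\partial p_j}
    \right]
```

For diagonal matrices with Dii=Γii/β, the Gibbs distribution ρ∝e−βH is a steady state (the fluctuation-dissipation relation); the analysis in this section examines when the coupled three-chip system instead requires a non-diagonal D.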
For example, consider the server chip coupled to clamped and un-clamped chips. Considering this, the Hamiltonian for the three chip system can be written as,
For brevity, the un-clamped chip is labeled with the superscript u instead of uc. Note that the server chip does not contain hidden or visible neurons; it has only weights and biases, and its purpose is to generate the desired gradient updates on the clamped and un-clamped chips. Given a Hamiltonian of a server, Q is given by,
where the position and momentum degrees of freedom are explicitly labeled.
The relevant derivatives are now given by,
Next, the derivatives of Q as defined herein are computed. This gives,
In addition, the relevant second order derivatives of the momentum are written as follows,
Before moving forward, the Fokker-Planck operator may be re-written as follows,
Now, due to the symmetry in the derivatives, the first two terms vanish. As such, the Fokker-Planck operator simplifies to,
Next, the above derivatives are inserted into the Fokker-Planck operator. For a given index i, the contribution arising from the server, clamped, and un-clamped chips is computed, and all three of the results are summed due to the corresponding couplings between the three chips. In what follows it is assumed that Γ is diagonal. First, start by writing the i′th term in the sum of Q for the server chip,
The i′th term in the sum of Q for the clamped chip is given by,
The i′th term in the sum of Q for the un-clamped chip is given by,
Now, as in the standard Langevin equation, Γii(u)=Γii(c)=Γii(s)=Γ, and Dii(c)=Dii(s)=Dii(u)=D are set. Lastly, if η is large, this gives pi(c)≈−pi(s) and pi(u)≈pi(s). In this case, the Fokker-Planck operators simplify to,
Now looking at the first equation, it can be re-written as,
For this to be zero, the term proportional to
requires that
But then,
Furthermore, it is not possible to get this to be zero even if non-diagonal matrix elements are added to D. As such, the position degrees of freedom for qi(s) under a Hamiltonian are not sampled from the Gibbs distribution. This is desired since they are to be sampled from the posterior.
The calculations leading to the above equation assumed that the noise and friction terms were diagonal. Since the result is non-zero, general noise and friction matrices are now considered. Taking into account all of the cross-terms, the following expression should be zero,
Assuming a large η value, this gives,
To simplify the notation, a term like
is re-written as Dsu and so on. Next, require that the term proportional to
be,
If it is assumed that Γ is diagonal, it can be required that,
When mη>>1, the above simplifies to,
Keeping in mind the above mathematical proofs of the self-learning nature of the neuro-thermodynamic computers described herein, the next section further describes how such neuro-thermodynamic computers are used to perform machine learning training and inference.
For example,
As an example, values of training dataset 700 may be clamped to neurons 702, 704, 706, and 708 of clamped chip 104. For example, neuron 702 may be clamped to a value of 0, neuron 704 may be clamped to a value of 1, neuron 706 may be clamped to a value of 1 and neuron 708 may be clamped to a value of 0. The self-learning neuro-thermodynamic computing device 102 may then be allowed to evolve according to Langevin dynamics such that weights and biases associated with neurons 702, 704, 706, and 708 are learned, such as the weights 504 and biases 502 shown for neurons 254 in
Once the weights and biases are learned, they may be held constant, or allowed to evolve, and new data (e.g. test dataset 720) may be clamped to at least some of the neurons, while others remained un-clamped and are used to generate inference values. For example, neuron 704 may be clamped to have a value of 0 and neuron 706 may be clamped to have a value of 0. Also, neurons 702 and 708 may be un-clamped and may be used to generate inferences.
With the weights and biases learned and the test data clamped to some of the neurons, the self-learning neuro-thermodynamic computing device 102 may then be allowed to evolve according to Langevin dynamics such that values of the un-clamped neurons are updated. The un-clamped neurons may be measured (e.g., measured outputs for interface 740) to determine inference values, such as that neuron 702 has an inference value of 1 and neuron 708 has an inference value of 1.
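The clamped-inference scenario above can be sketched classically. In the toy code below, Hebbian weights stand in for the learned weights and biases, bits are encoded as spins (0 maps to −1, 1 maps to +1), and a deterministic relaxation stands in for Langevin evolution; the encoding, training rule, and update scheme are all illustrative assumptions:

```python
# Toy classical sketch of clamped inference: weights "trained" on the
# patterns 0110 and 1001; neurons 704/706 are clamped to 0 while neurons
# 702/708 relax to their inferred values.

def to_spins(bits):
    return [1 if b else -1 for b in bits]

patterns = [to_spins([0, 1, 1, 0]), to_spins([1, 0, 0, 1])]
n = 4
w = [[sum(p[i] * p[j] for p in patterns) if i != j else 0.0
      for j in range(n)] for i in range(n)]

state = [0, -1, -1, 0]   # indices 1, 2 (neurons 704, 706) clamped to 0 -> -1
clamped = {1, 2}
for _ in range(5):       # deterministic relaxation of the free neurons
    for i in range(n):
        if i not in clamped:
            field = sum(w[i][j] * state[j] for j in range(n))
            state[i] = 1 if field >= 0 else -1

inferred = [(s + 1) // 2 for s in state]   # spins back to bits
print(inferred)   # -> [1, 0, 0, 1]
```

The free neurons settle to 1, matching the inference values described for neurons 702 and 708 (the system completes the clamped partial input toward the stored pattern 1001).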
After training an energy-based model using the dynamics and architecture described, if the position coupling strengths (mediated by the λ parameters) are large enough such that the weights and biases are nearly identical on the server, clamped, and un-clamped chips, inference can be performed using the clamped chip as follows. First, the weights and biases on the clamped chip are clamped such that they remain stationary through time. Inference is then performed by clamping the visible nodes of the clamped chip to a new data point given in the test set, and letting the visible neurons of the clamped chip evolve following Langevin dynamics such as described herein. After some time T, the visible neurons of the clamped chip are then measured to read out their values. Alternatively, values (either sample values or expectation values) may be transferred, for example as inputs into another energy-based model, without necessarily requiring measurement. The required evolution time may be determined from simulations given the hyperparameter configuration of the system, and the particular energy-based model used to obtain the trained weights and biases. Alternatively, online learning can be performed if clamping the weights during the training phase is omitted, as the system will continue to learn during inference.
Several techniques can be used to clamp the weights and biases of the clamped chip prior to performing inference. One method would be to simply increase the masses of the weights and biases of the clamped chip such that they remain nearly stationary during inference. Alternatively, if the server or un-clamped chip is coupled to an auxiliary system which allows its weights and biases to be measured, a time average measurement of the weights and biases could be performed. Such a measurement would allow one to obtain the statistics (mean and covariance) of the weights and biases. The parameters {tilde over (φ)}L(w) and {tilde over (φ)}L(b) in a Hamiltonian could be tuned to the mean obtained from the time average measurements, along with increasing EL(w) and EL(b) to ensure strong clamping. In particular, the architecture described herein generates weights and biases which are approximate samples from the Bayesian posterior of the parameters. In other words, through stochastic gradient Langevin dynamics, weights and biases θ are obtained which are samples from p(θ|D) for some data set D. Suppose now there is a data point D={x,y} where x is known and y is unknown (where D can be encoded using the visible neurons of the three-chip architecture). The unknown visible neurons y can be sampled from,
where in computing p(y|θ,x), the visible neurons are partially clamped, such that some of the visible neurons are clamped to the known data x for a given set of parameters θ while other visible neurons corresponding to values to be inferred are left un-clamped. Furthermore, consider the case where there is no data clamping, so that the data is simply D=y for some unknown y. In this case, sampling can be performed from the posterior predictive distribution for generative modeling using the trained weights as,
A person having ordinary skill in the art should understand that
Next, consider a Hamiltonian with the position and momentum coupling terms. At thermal equilibrium, the Boltzmann distribution is given by,
where Z=∫e−βHdqdp is the corresponding partition function.
Using the Boltzmann distribution, the expectation values for the position degrees of freedom are given by,
where the following is defined,
Note also that Zs=Zu=Zc. Similarly, note that ⟨qk(u)⟩=Is and ⟨qk(s)⟩=Iu=Ic. It is straightforward to evaluate the integrals to show that Ic=Is=Iu=0. As such, it can be seen that a strong position coupling term forces the position degrees of freedom of the server, clamped, and un-clamped chips to be strongly correlated. Alternatively, the system could also operate in the low-temperature regime, which would force the expectation values of the position operators to be close to zero. In such a setting, the temperature could then quickly be increased right before starting the time evolution of the self-learning algorithm described herein. Lastly, analogous calculations can be performed to show that the expectation values of the momentum operators are also zero in this regime.
As shown, a training algorithm such as shown herein can be combined with the inference protocols in order to implement a meta-learning scheme for the hyperparameters of a three-chip architecture. After training the system, the accuracy can be computed by performing inference on a validation set. Such a protocol can be repeated with a different hyperparameter configuration until the desired validation accuracy is achieved. An FPGA or other classical device, such as classical computing device 114 shown in
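The meta-learning protocol (train, validate, repeat with a new hyperparameter configuration until the target accuracy is reached) can be sketched as a simple loop. The helper names `train_fn` and `validate_fn`, and the toy grid over the coupling parameters η1 and η2, are hypothetical stand-ins for the actual chip training and validation-set inference.

```python
import itertools

def meta_learn(train_fn, validate_fn, grid, target_acc):
    """Hypothetical hyperparameter loop: train with each configuration,
    validate, and stop once the desired validation accuracy is reached."""
    best = None
    for cfg in grid:
        params = train_fn(cfg)      # e.g. run the three-chip training
        acc = validate_fn(params)   # perform inference on a validation set
        if best is None or acc > best[1]:
            best = (cfg, acc)
        if acc >= target_acc:
            break
    return best

# Toy stand-ins: "training" just returns the config, "validation" scores it.
grid = [{"eta1": e1, "eta2": e2}
        for e1, e2 in itertools.product([0.1, 0.5], [0.1, 0.5])]
best_cfg, best_acc = meta_learn(
    lambda c: c,
    lambda p: 1.0 - abs(p["eta1"] - 0.5) - abs(p["eta2"] - 0.1),
    grid, target_acc=0.99)
```

In practice the loop would run on the FPGA or other classical device coordinating the chips, with the chips performing the physical training and inference.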
In some embodiments, multiple clamped and un-clamped chips may be coupled, respectively, to a single server chip, such as un-clamped chips 902, 904, and 906 and clamped chips 910, 912, and 914, which are coupled to server chip 908. As shown via the dashed vs dotted lines in
A person having ordinary skill in the art should understand that
In a Welling and Teh update rule such as shown above, it can be seen that one can choose a mini-batch of size n≤N in order to compute the gradients over a subset of the full data set. In
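The Welling and Teh update rule with a mini-batch of size n ≤ N can be sketched directly: the mini-batch gradient is rescaled by N/n and Gaussian noise of variance ε is injected each step. The toy Gaussian model below (data ~ N(θ, 1) with a flat prior) is an illustrative assumption, not the architecture's actual model.

```python
import numpy as np

rng = np.random.default_rng(2)

def sgld_step(theta, data, grad_log_prior, grad_log_lik, eps, n_batch):
    """One Welling-Teh stochastic gradient Langevin dynamics update:
    Δθ = (ε/2)(∇log p(θ) + (N/n) Σ_batch ∇log p(x_i|θ)) + N(0, ε)."""
    N = len(data)
    batch = data[rng.choice(N, size=n_batch, replace=False)]
    grad = grad_log_prior(theta) + (N / n_batch) * sum(
        grad_log_lik(theta, x) for x in batch)
    return theta + 0.5 * eps * grad + np.sqrt(eps) * rng.standard_normal()

# Toy model: x ~ N(θ, 1), flat prior, so ∇log p(x|θ) = x − θ.
data = rng.normal(1.5, 1.0, size=200)
theta = 0.0
for _ in range(2000):
    theta = sgld_step(theta, data,
                      grad_log_prior=lambda t: 0.0,
                      grad_log_lik=lambda t, x: (x - t),
                      eps=1e-3, n_batch=20)
```

After burn-in, the iterates θ are approximate samples from the posterior, which for this toy model concentrates around the data mean.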
In addition to having multiple clamped chips coupled to the server chip, it is also possible to have multiple un-clamped chips coupled to the server chip using the same coupling terms as described in the above-discussed Hamiltonians, such as Htot. Having multiple un-clamped chips coupled to the server chip in parallel could allow for a space average to be used instead of the time average
By using a space average, there would instead be an approximation to,
where x is the set of all visible neurons, and θ represents the set of all weights and biases. The sum in the space average is over visible neurons x′ sampled from each of the un-clamped chips. In such a setting, the space average may be used in the un-clamped and server position degrees of freedom instead of
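The space average replaces one chip's long time average with a single-shot average over samples x′ drawn from many parallel un-clamped chips. A minimal sketch, where each "chip" contributes one visible-neuron sample and `grad_fn` is a hypothetical stand-in for the gradient term being averaged:

```python
import numpy as np

rng = np.random.default_rng(3)

def space_average_gradient(chip_samples, grad_fn, theta):
    """Average a gradient term over visible-neuron samples x' drawn from
    parallel un-clamped chips (a space average), rather than averaging
    one chip's samples over time."""
    return np.mean([grad_fn(theta, x) for x in chip_samples], axis=0)

# Hypothetical: 64 parallel un-clamped chips each yield one sample x'.
samples = rng.normal(0.0, 1.0, size=64)
g = space_average_gradient(samples, lambda th, x: x - th, theta=0.25)
```

For this stand-in gradient the space average reduces to the sample mean shifted by θ, illustrating that more parallel chips tighten the estimate without lengthening the evolution time.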
In some embodiments, architectures may be generalized to include hidden neurons. Hidden neurons may be added to the network, and coupled to other nodes of the network using weights, in the same way coupling terms were used for the fully visible network. In particular, for a given chip (either clamped or un-clamped), the Hamiltonian (e.g., H) may be generalized to include hidden neurons as follows:
where the set of all neurons is partitioned into the set of visible neurons and the set of hidden neurons. The hidden neurons may have different potential wells and masses relative to the visible neurons. The coupling terms between the clamped, un-clamped, and server chips are identical to those described in previous sections.
For the clamped chip, the hidden neurons evolve freely, with the visible neurons being clamped to some data set. The goal is to obtain gradients which are averaged over samples from the distribution pθ(z|xc) for a given set of parameters θ (the visible nodes clamped to the data are denoted as xc). Similar to the approximate time average described above, the gradient steps taken on the clamped chip can be approximated by a time average as follows,
where the superscript (z) is added to indicate that the hidden neurons are un-clamped during the time evolution of size δt.
For the un-clamped chip, both the visible and hidden neurons are un-clamped and thus evolve freely. The goal is to obtain gradients which are averaged over samples from the joint distribution pθ(x,z) for a given set of parameters θ. Since both the visible and hidden neurons evolve over some time δt, the gradient updates may be approximated as
where the superscript (x,z) is added to indicate that both the visible and hidden neurons are un-clamped during the evolution.
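Combining the two phases gives a contrastive-style gradient: correlations time-averaged over the clamped chip's samples of pθ(z|xc) minus correlations time-averaged over the un-clamped chip's samples of the joint pθ(x,z). The sketch below uses toy stand-in samples, and the outer-product correlation is an illustrative choice rather than the hardware's exact gradient form.

```python
import numpy as np

def weight_gradient(clamped_samples, free_samples):
    """Contrastive gradient sketch with hidden neurons: the clamped chip
    supplies (x_c, z) pairs with z ~ p(z|x_c); the un-clamped chip
    supplies (x, z) pairs from the joint p(x, z). The update is the
    difference of the two averaged visible-hidden correlations."""
    pos = np.mean([np.outer(x, z) for x, z in clamped_samples], axis=0)
    neg = np.mean([np.outer(x, z) for x, z in free_samples], axis=0)
    return pos - neg

# Toy stand-in samples: 2 visible neurons, 1 hidden neuron.
clamped = [(np.array([1.0, 0.0]), np.array([1.0])) for _ in range(4)]
free = [(np.array([0.5, 0.5]), np.array([1.0])) for _ in range(4)]
g = weight_gradient(clamped, free)
```

On the chips, both averages would be realized physically by the time evolution of size δt described above rather than by explicit sample lists.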
In some embodiments, an Euler-Maruyama method for a self-learning architecture described above using the server chip may be used. Consider the following initial conditions: qk(c)(0)=qk(u)(0)=qk(s)(0)=q0, and pk(c)(0)=pk(u)(0)=pk(s)(0)=0. Note that the labels "(u)", "(c)" and "(s)" are used to represent the un-clamped, clamped and server degrees of freedom. In what follows, the noise terms are defined as,
where ξt˜N(0,δt). Using the structure of Htot, the update equations for the position and momentum of the three chips can be written as follows,
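A single Euler-Maruyama step of underdamped Langevin dynamics, with noise ξt ~ N(0, δt) scaled by σ = √(2kBTγ), can be sketched for one degree of freedom. The harmonic potential and parameter values below are illustrative assumptions standing in for the structure of Htot.

```python
import numpy as np

rng = np.random.default_rng(4)

def euler_maruyama_step(q, p, grad_U, m, gamma, kT, dt):
    """One Euler-Maruyama step of underdamped Langevin dynamics:
    dq = (p/m) dt,  dp = (-∂U/∂q - γ p) dt + sqrt(2 kB T γ) ξ_t,
    with ξ_t ~ N(0, dt)."""
    xi = np.sqrt(dt) * rng.standard_normal()
    q_new = q + (p / m) * dt
    p_new = p + (-grad_U(q) - gamma * p) * dt + np.sqrt(2.0 * kT * gamma) * xi
    return q_new, p_new

# Harmonic well U(q) = q²/2 (so ∂U/∂q = q); start at q0 with zero momentum,
# mirroring the initial conditions above.
q, p = 1.0, 0.0
for _ in range(10_000):
    q, p = euler_maruyama_step(q, p, grad_U=lambda x: x,
                               m=1.0, gamma=1.0, kT=0.1, dt=1e-2)
```

Iterating this step for each of the clamped, un-clamped, and server degrees of freedom reproduces the coupled update equations referenced above.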
Next, solutions are given to the time evolution of the synapse degrees of freedom for several increments of time δt, starting at time t=0.
At this stage, note that qk(s)(3δt) has the correct form if it is assumed that γδt≈1. Since δt must be small, the large-γ limit (also known as the overdamped limit) is used. In what follows, it is assumed that the system is in a regime where such a condition is satisfied, and thus all terms proportional to (1−γδt) are removed.
Note that (ignoring the noise term) pk(s)(3δt)∝(δt)3, whereas the other non-zero terms above are proportional to (δt)2. Additionally, only the leading-order noise term is included in the expression for pk(s)(3δt). In the following analysis, only leading-order terms are kept, which are proportional to (δt)2 for the position degrees of freedom and δt for the momentum degrees of freedom.
As can be shown by recursively applying the Euler-Maruyama update rules, the momentum degrees of freedom for the server chip will continue to be proportional to (δt)3 plus some noise term. As such, the position and momentum variables for the clamped and un-clamped chip will continue to have the same structure as shown above.
As can be seen from the equations above, qk(s) has a form analogous to a parameter update rule described herein, albeit with a modified noise term (which includes the sum of Gaussian noise terms). The coupling parameters η1 and η2 can be tuned numerically for a particular problem of interest to yield the best results. It is also noted that the terms proportional to λ1 and λ2 (e.g., the position coupling terms) only appear at higher orders, and thus appear to have no impact on the lower-order equations of motion. However, this is a result of the chosen initial conditions qk(c)(0)=qk(u)(0)=qk(s)(0)=q0. In order to obtain such initial conditions, the position coupling terms can be used to force all three chips to take on the same initial values. This can be achieved by using a large value of λ relative to the temperature of the system.
In this section, a gate-based approach is provided to implement a self-learning algorithm. Before moving forward, consider briefly the second-order equations of motion for a physical system undergoing Langevin dynamics. Such equations will be used to describe the system.
The equation of motion for a system of particles governed by some Hamiltonian H undergoing Langevin dynamics is given by,
where σ=√{square root over (2kBTγ)}, γ is a friction term, and Wt is a Wiener process. The k-th position and momentum are written as qk and pk, with 1≤k≤N for a system of N particles. Using the G-JF method, the equations are given herein.
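As a rough numerical illustration (not the hardware implementation), the G-JF (Grønbech-Jensen-Farago) update can be sketched in Python. The coefficients a and b and the noise variance 2mγkBTδt follow the standard G-JF formulation; the harmonic force and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def gjf_step(x, v, force, m, gamma, kT, dt):
    """One step of the G-JF Langevin integrator (standard formulation):
    b = 1/(1 + γδt/2), a = (1 - γδt/2)/(1 + γδt/2),
    β ~ N(0, 2 m γ kB T δt)."""
    b = 1.0 / (1.0 + gamma * dt / 2.0)
    a = (1.0 - gamma * dt / 2.0) / (1.0 + gamma * dt / 2.0)
    beta = rng.normal(0.0, np.sqrt(2.0 * m * gamma * kT * dt))
    f_old = force(x)
    x_new = x + b * dt * v + (b * dt**2 / (2 * m)) * f_old \
        + (b * dt / (2 * m)) * beta
    v_new = a * v + (dt / (2 * m)) * (a * f_old + force(x_new)) + (b / m) * beta
    return x_new, v_new

# Harmonic force f(q) = -q; the trajectory relaxes toward the thermal state.
x, v = 1.0, 0.0
for _ in range(20_000):
    x, v = gjf_step(x, v, force=lambda q: -q, m=1.0, gamma=1.0, kT=0.1, dt=5e-3)
```

The G-JF scheme is attractive here because it remains accurate for the configurational statistics even at relatively large time steps.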
Now, the core gate used in the learning algorithm is given as,
To keep the equations more compact, the noise terms are omitted. Now, start with the initial conditions qk(0)=qk^(0) and pk(0)=0. Also, denote the Hamiltonian during the clamped phase as Hc and the Hamiltonian during the un-clamped phase (e.g., when the visible neurons are not clamped to the data) as Huc. Lastly, to make the presentation as concise as possible, consider the evolution steps in the following.
Evolve (clamped phase) for time δt.
As can be seen from the above, with the appropriate choice of friction γ (which affects the a and b parameters) and ε, the desired update rules may be achieved.
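The core gate alternates an evolution under the clamped Hamiltonian Hc for time δt with an evolution under the un-clamped Hamiltonian Huc for time δt. The control flow can be sketched as follows; `evolve`, `clamp`, and `unclamp` are hypothetical helper names standing in for the physical operations.

```python
def self_learning_round(state, evolve, clamp, unclamp, dt):
    """One round of the core gate: clamp the visible neurons and evolve
    under Hc for δt, then release them and evolve under Huc for δt
    (evolve/clamp/unclamp are toy stand-ins for the hardware)."""
    state = clamp(state)
    state = evolve(state, dt, phase="clamped")
    state = unclamp(state)
    state = evolve(state, dt, phase="unclamped")
    return state

# Toy stand-ins that record which phase ran, in order.
phase_log = []
def _evolve(st, dt, phase):
    phase_log.append(phase)
    return st

out = self_learning_round({"q": 0.0}, _evolve,
                          clamp=lambda st: st, unclamp=lambda st: st, dt=0.01)
```

Repeating this round drives the parameter updates derived above, with γ and ε tuned so the two phases combine into the desired learning rule.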
In some embodiments, a thermodynamic computing system 1000 (as shown in
In some embodiments, a two-chip architecture may resemble that which is shown in
At block 1202, oscillators of a first thermodynamic chip (e.g. clamped chip 104) are clamped to training data values, wherein the oscillators of the first thermodynamic chip clamped to the training data values represent visible neurons, and wherein the first thermodynamic chip comprises other oscillators representing hidden neurons, weights and biases.
At block 1204, a set of thermodynamic chips (e.g. clamped chip 104, un-clamped chip 108, and server chip 106), such as are included in a self-learning neuro-thermodynamic computing device are allowed to evolve while clamped to the training data. This causes the weights and biases to be learned, as described above. In some embodiments, both clamped and un-clamped chips may have latent (hidden) neurons.
At block 1206, at least some of the oscillators of the first thermodynamic chip (e.g. clamped chip 104) are clamped to test data values, while other ones of the oscillators corresponding to neurons for which values are to be inferred are left un-clamped.
At block 1208, the set of thermodynamic chips is allowed to further evolve such that the other ones of the oscillators, corresponding to neurons for which values are to be inferred, take on values that can be measured to generate inference values.
At block 1210, the other ones of the oscillators are then sampled to generate the inference values.
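The flow of blocks 1202-1210 can be summarized in code. The classes and helpers below are toy stand-ins for the hardware (a real system would clamp physical oscillators and let the chips evolve thermodynamically); the comments map each step to its block number.

```python
from dataclasses import dataclass, field

@dataclass
class Chips:
    """Toy stand-in for the set of thermodynamic chips."""
    clamped: dict = field(default_factory=dict)
    steps: int = 0

def clamp_visible(chips, values):
    chips.clamped.update(values)  # clamp oscillators to the given values
    return chips

def evolve(chips):
    chips.steps += 1  # stand-in for thermodynamic evolution
    return chips

def sample(chips):
    return dict(chips.clamped)  # stand-in for measuring oscillators

def train_and_infer(chips, train_data, test_known):
    chips = clamp_visible(chips, train_data)   # block 1202: clamp to training data
    chips = evolve(chips)                      # block 1204: weights/biases learned
    chips = clamp_visible(chips, test_known)   # block 1206: clamp only known values
    chips = evolve(chips)                      # block 1208: un-clamped take on values
    return sample(chips)                       # block 1210: sample inference values

out = train_and_infer(Chips(), {"v0": 1.0}, {"v1": 0.5})
```

The same flow applies to the process of blocks 1302-1310, with the direct clamped/un-clamped coupling substituted for the server chip.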
In some embodiments, a process described via blocks 1302-1310 may resemble that which is described herein with regard to blocks 1202-1210 but for a self-learning neuro-thermodynamic computing device 1002 containing direct coupling between a clamped chip and an un-clamped chip.
Embodiments of the present disclosure may be described in view of the following clauses:
Clause 1. A system, comprising:
Clause 2. The system of clause 1, wherein positive and negative phase terms of the engineered Hamiltonian are used for the position and momentum couplings between the server thermodynamic chip and the first thermodynamic chip and between the server thermodynamic chip and the second thermodynamic chip,
Clause 3. The system of clause 1 wherein the evolution of the oscillators of the first and second thermodynamic chip coupled via the server thermodynamic chip is an evolution according to Langevin dynamics.
Clause 4. The system of clause 1, wherein the first thermodynamic chip, the server thermodynamic chip and the second thermodynamic chip are arranged in a stacked configuration with the server thermodynamic chip positioned between the first thermodynamic chip and the second thermodynamic chip.
Clause 5. The system of clause 1, wherein the engineered Hamiltonian comprises a three-body coupling term that couples, for a respective one of the thermodynamic chips, the visible neurons, the weight values, and the bias values.
Clause 6. The system of clause 1, wherein:
Clause 7. The system of clause 1, wherein the oscillators are implemented using single-well or double-well potential resonators.
Clause 8. The system of clause 1, wherein the inference values represent distributional values.
Clause 9. A method of performing training and inference generation using thermodynamic chips, the method comprising:
Clause 10. A system, comprising:
Clause 11. A system, comprising:
Clause 12. The system of clause 11, wherein positive and negative phase terms of the engineered Hamiltonian are used for position and momentum couplings between the first thermodynamic chip and the second thermodynamic chip,
Clause 13. The system of clause 11 wherein the evolution of the coupled oscillators of the first and second thermodynamic chip is an evolution according to Langevin dynamics.
Clause 14. The system of clause 11, wherein the first thermodynamic chip and the second thermodynamic chip are arranged in a stacked configuration.
Clause 15. The system of clause 11, wherein the engineered Hamiltonian comprises a three-body coupling term that couples, for a respective one of the thermodynamic chips, the visible neurons, the weight values, and the bias values.
Clause 16. The system of clause 11, wherein:
Clause 17. The system of clause 11, wherein the oscillators are implemented using single-well or double-well potential resonators.
Clause 18. The system of clause 11, wherein the inference values represent distributional values.
Clause 19. A method of performing training and inference generation using thermodynamic chips, the method comprising:
In the illustrated embodiment, computer system 1400 includes one or more processors 1410 coupled to a system memory 1420 (which may comprise both non-volatile and volatile memory modules) via an input/output (I/O) interface 1430. Computer system 1400 further includes a network interface 1440 coupled to I/O interface 1430. Classical computing functions may be performed on a classical computer system, such as computer system 1400.
Additionally, computer system 1400 includes computing device 1470 coupled to thermodynamic chip 1480. In some embodiments, computing device 1470 may be a field programmable gate array (FPGA), application specific integrated circuit (ASIC) or other suitable processing unit. In some embodiments, computing device 1470 may be a similar computing device as described in
In various embodiments, computer system 1400 may be a uniprocessor system including one processor 1410, or a multiprocessor system including several processors 1410 (e.g., two, four, eight, or another suitable number). Processors 1410 may be any suitable processors capable of executing instructions. For example, in various embodiments, processors 1410 may be general-purpose or embedded processors implementing any of a variety of instruction set architectures (ISAs), such as the x86, PowerPC, SPARC, or MIPS ISAs, or any other suitable ISA. In multiprocessor systems, each of processors 1410 may commonly, but not necessarily, implement the same ISA. In some implementations, graphics processing units (GPUs) may be used instead of, or in addition to, conventional processors.
System memory 1420 may be configured to store instructions and data accessible by processor(s) 1410. In at least some embodiments, the system memory 1420 may comprise both volatile and non-volatile portions; in other embodiments, only volatile memory may be used. In various embodiments, the volatile portion of system memory 1420 may be implemented using any suitable memory technology, such as static random-access memory (SRAM), synchronous dynamic RAM or any other type of memory. For the non-volatile portion of system memory (which may comprise one or more NVDIMMs, for example), in some embodiments flash-based memory devices, including NAND-flash devices, may be used. In at least some embodiments, the non-volatile portion of the system memory may include a power source, such as a supercapacitor or other power storage device (e.g., a battery). In various embodiments, memristor based resistive random-access memory (ReRAM), three-dimensional NAND technologies, Ferroelectric RAM, magneto resistive RAM (MRAM), or any of various types of phase change memory (PCM) may be used at least for the non-volatile portion of system memory. In the illustrated embodiment, program instructions and data implementing one or more desired functions, such as those methods, techniques, and data described above, are shown stored within system memory 1420 as code 1425 and data 1426.
In some embodiments, I/O interface 1430 may be configured to coordinate I/O traffic between processor 1410, system memory 1420, computing device 1470, and any peripheral devices in the computer system, including network interface 1440 or other peripheral interfaces such as various types of persistent and/or volatile storage devices. In some embodiments, I/O interface 1430 may perform any necessary protocol, timing or other data transformations to convert data signals from one component (e.g., system memory 1420) into a format suitable for use by another component (e.g., processor 1410).
In some embodiments, I/O interface 1430 may include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example. In some embodiments, the function of I/O interface 1430 may be split into two or more separate components, such as a north bridge and a south bridge, for example. Also, in some embodiments some or all of the functionality of I/O interface 1430, such as an interface to system memory 1420, may be incorporated directly into processor 1410.
Network interface 1440 may be configured to allow data to be exchanged between computer system 1400 and other devices 1460 attached to a network or networks 1450, such as other computer systems or devices. In various embodiments, network interface 1440 may support communication via any suitable wired or wireless general data networks, such as types of Ethernet network, for example. Additionally, network interface 1440 may support communication via telecommunications/telephony networks such as analog voice networks or digital fiber communications networks, via storage area networks such as Fibre Channel SANs, or via any other suitable type of network and/or protocol.
In some embodiments, system memory 1420 may represent one embodiment of a computer-accessible medium configured to store at least a subset of program instructions and data used for implementing the methods and apparatus discussed in the context of
Various embodiments may further include receiving, sending or storing instructions and/or data implemented in accordance with the foregoing description upon a computer-accessible medium. Generally speaking, a computer-accessible medium may include storage media or memory media such as magnetic or optical media, e.g., disk or DVD/CD-ROM, volatile or non-volatile media such as RAM (e.g., SDRAM, DDR, RDRAM, SRAM, etc.), ROM, etc., as well as transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link.
The various methods as illustrated in the Figures above and described herein represent exemplary embodiments of methods. The methods may be implemented in software, hardware, or a combination thereof. The order of the methods may be changed, and various elements may be added, reordered, combined, omitted, modified, etc.
It will also be understood that, although the terms first, second, etc., may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first contact could be termed a second contact, and, similarly, a second contact could be termed a first contact, without departing from the scope of the present invention. The first contact and the second contact are both contacts, but they are not the same contact.
Various modifications and changes may be made as would be obvious to a person skilled in the art having the benefit of this disclosure. It is intended to embrace all such modifications and changes and, accordingly, the above description is to be regarded in an illustrative rather than a restrictive sense.
This application claims benefit of priority to U.S. Provisional Application Ser. No. 63/508,718, entitled “Self-Learning Thermodynamic Computing System,” filed Jun. 16, 2023, and which is incorporated herein by reference in its entirety.
| Number | Date | Country |
|---|---|---|
| 63/508,718 | Jun. 16, 2023 | US |