This application claims priority under 35 U.S.C. § 119 to United Kingdom Application GB 2309523.5, filed Jun. 23, 2023, the entire contents of which are incorporated herein by reference.
The present application relates to a system and method for performing machine learning using a quantum computer.
Machine learning (ML) research has developed into a mature discipline with applications that impact many different aspects of society. Neural network and deep learning architectures have been deployed for tasks such as facial recognition, recommendation systems, time series modelling, and for analysing highly complex data in science. In addition, unsupervised learning and generative modelling techniques are widely used for text, image, and speech generation tasks, which many people encounter regularly via interaction with chat bots and virtual assistants. Thus, the development of new machine learning models and algorithms can have significant consequences for a wide range of industries, and more generally, for society as a whole.
Recently, researchers in quantum information science have started to investigate whether quantum algorithms which are implemented on quantum computing hardware may offer advantages over conventional machine learning algorithms implemented on classical computing devices. This has led to the development of quantum algorithms for computational tasks associated with various aspects of ML, such as gradient descent, classification, generative modelling, reinforcement learning, as well as many other tasks. Further examples of the development of quantum systems for use in ML can be found in U.S. Pat. No. 11,157,828 and U.S. Patent Publication 2020/0279185.
However, in most cases it is not straightforward to generalize results from the conventional (classical) ML realm into the quantum ML realm. Rather, various factors must be reconsidered, such as data encoding, training complexity and sampling in the quantum machine learning (QML) setting. For example, there are open questions as to how large data sets (such as may occur in many ML contexts) may be efficiently embedded into quantum states in such a way that a genuine quantum speedup is achieved. Furthermore, as quantum states prepared on quantum devices can only be accessed via sampling, one cannot estimate properties with arbitrary precision. One particular problem is gradient vanishing in the training of variational quantum algorithms, also known as “barren plateaus”. Accordingly, there is ongoing interest in further developing systems that include quantum computing platforms (also referred to herein as quantum hardware, quantum devices, quantum computers, quantum computing hardware and so on) to provide enhanced support for ML.
A system and method are provided for performing machine learning using a quantum computer. The method includes providing a model comprising a Quantum Boltzmann machine with a Hamiltonian ansatz having a set of operators and a set of parameters. The method further includes performing a first stage of training the model against data from a target using a selected subset of the set of operators to obtain optimized values for a subset of the set of parameters. The first stage of training is performed on classical computing hardware to provide a partly trained model. The method further includes performing a second stage of training the model against data from the target using a larger subset of the set of operators to obtain optimized values for a larger subset of the set of parameters for the model. The second stage of training is at least partly performed using quantum computer hardware. The optimized parameter values from the first stage of training are used to initialize the corresponding parameters for the second stage of training.
The second stage of the training can be performed iteratively with a larger subset of operators and/or a larger subset of parameters in each iteration, to provide a trained Quantum Boltzmann Machine (hereinafter “QBM”) in which the difference in expectation values between parameter values of a target probability distribution and values of parameters output by the model is reduced. The first stage and second stage training, and potential further iterations, can provide a trained QBM that more accurately represents the Hamiltonian. We show that performance of incremental QBM learning can take advantage of recent and expected future advances in quantum computing hardware, as described below with reference to example quantum computing hardware.
By performing a first stage of training of a Quantum Boltzmann Machine (for example “pre-training”) on a classical computing device, the second stage of training (and any subsequent iteration) starts with parameters that have been initialized to enable optimisation in the next stage of the training. A computer system for implementing embodiments may comprise classical binary computer hardware coupled to a quantum computer, making use of the resources of the classical computer for the first stage of training and then exploiting the quantum computer's probabilistic representation of quantum states of a target real-world quantum system for a second stage of training that improves the model.
Various examples and implementations of the disclosure will now be described in detail by way of example only with reference to the following figures:
Quantum Boltzmann machines (QBMs) are machine-learning models which can be used with both classical and quantum data. An operational definition of QBM learning is presented in terms of the difference in Gibbs expectation values between the model and target, taking into account the polynomial size of the data set.
In other words, the QBM acts as a model which is trained to emulate a target. The target in effect defines a system and associated behaviour. In general, the target is not known per se, but samples of the system behaviour may be obtained. The QBM learning (training) involves obtaining samples of the target and corresponding samples from the QBM (model), and updating the model such that the latter becomes more closely aligned with the former.
It is shown herein that with stochastic gradient descent, a machine learning solution may be obtained using at most a polynomial number of Gibbs states (where the Gibbs states can be regarded as providing samples of the model). One implication of this finding is that there are no barren plateaus in QBM learning (such as those without hidden units). It is also shown that pre-training on a subset of the QBM parameters can lower the sample complexity bounds. Various pre-training strategies are proposed based on mean-field, Gaussian Fermionic, and geometrically local Hamiltonians (additional models are available that likewise support training on a classical computer). The models and theoretical findings proposed herein have been verified numerically on a quantum and a classical data set. The results presented herein show that QBMs may provide promising machine learning models for training on present and future quantum devices.
In some implementations, a Hamiltonian ansatz is prepared that is very well suited for a particular quantum computing device. After exhausting all available classical computing resources during a first training phase (also referred to herein as pre-training), the model may be enlarged to continue the training on the quantum computing device to further enhance overall performance. As quantum hardware steadily matures, this supports the execution of deeper circuits and further increases of the model size. Incremental QBM training strategies may be designed to follow the quantum hardware road map towards training ever larger and more expressive quantum machine learning models.
As described herein, a system and method have been developed for training a quantum Boltzmann machine (QBM) to obtain optimal parameter values. The QBM training results in a model that emulates a target data set, and helps to address some of the issues identified above for implementing ML in a quantum environment. A QBM can be regarded as a generalisation of a classical Boltzmann machine, which is a form of stochastic neural network with nodes linked by weighted connections.
In particular, QBMs are physics-inspired ML models that generalize a classical Boltzmann machine to a quantum Hamiltonian ansatz (an ansatz can be considered as a trial solution to a given problem). A QBM can therefore be considered as providing a certain generic type of ML model, while the Hamiltonian ansatz particularizes the system to the given problem, for example by defining the input parameters for the QBM.
The (quantum) Hamiltonian ansatz can be defined on a graph where each vertex represents a qubit (or a qudit) and each edge represents an interaction (broadly, a qubit is a quantum computing counterpart of a hardware bit in a conventional/classical machine, whereas qudits can represent multi-level systems). The task is to learn the strengths of the interactions (weights), such that samples from the output quantum state of the QBM mimic samples taken from the target data set. For the present approach, the QBM may be trained with polynomial sample complexity on quantum computers. The power and benefits of such an approach will grow in parallel with the rapid development of quantum computing platforms (such as hardware systems that support increasing numbers of qubits and implement error detection or correction for fault tolerance).
The development of quantum generative models of this kind is expected to be useful in machine learning, for addressing (for example) science problems by learning approximate descriptions of the experimental data. QBMs may also play an important role as components of larger QML models (this is similar to how classical BMs can provide good weight initializations for the training of deep neural networks). One advantage of using a QBM rather than a classical BM is that a QBM is more expressive, since the Hamiltonian ansatz can contain more general non-commuting terms. This means that in some settings the QBM outperforms a classical BM, even for classical target data.
In order to help obtain results which have good practical relevance, an operational definition of QBM learning is adopted. Instead of focusing on an information-theoretic measure, we assess the QBM learning performance by the difference in Gibbs expectation values between the target and the model. This takes into account that the (classical) target data set comprises polynomially many data samples, hence its properties have a polynomial precision. Stochastic gradient descent methods are employed in combination with shadow tomography to show that this problem can be solved using polynomially many evaluations of the QBM model. Each evaluation of the model requires the preparation of one Gibbs state and, therefore, we refer to the sample complexity as the required number of Gibbs state preparations.
The Gibbs states used for QBM learning may be prepared and sampled on a quantum computer by a variety of methods. For present purposes, the focus is on the sampling complexity, rather than any specific Gibbs sampling implementation.
In practice, QBM learning allows for great flexibility in model design, and therefore time complexity. It is also shown below that the required number of Gibbs samples for QBM learning can be improved by pre-training on a subset of the parameters of the QBM. In other words, classically pre-training a simpler model can potentially reduce the (quantum) training complexity. For instance, it is possible to analytically pre-train a mean-field QBM and a Gaussian Fermionic QBM. In addition, it is shown below that a geometrically local QBM with gradient descent may be pre-trained, which provides some improved performance guarantees. As described herein, these exactly solvable models may be used for training and/or pre-training of QBMs. Further, classical numerical simulation results are presented which confirm the analytical findings.
We start by formally setting up the quantum Boltzmann machine (QBM) learning problem, providing the definitions of the target and model, and a description of how to assess the performance based on the precision of the expectation values. These definitions and assumptions help to obtain the results described herein, and are introduced below, along with their motivation. In addition, the problem definition described herein is compared to other related problems in the literature, such as quantum Hamiltonian learning.
We consider an n-qubit density matrix η as the target of the machine learning problem. If the target is classical, n could represent the number of features, e.g., the pixels in black-and-white pictures, or more complex features that have been extracted and embedded in the space of n qubits. If the target is quantum, n could represent spin-½ particles, but again more complex many-body systems can be embedded in the space of n qubits. In the literature, it is often assumed that algorithms have direct and simultaneous access to copies of η; however, this assumption is not adopted herein. Instead, a setup is considered in which access is limited to classical information about the target. For a data set {sμ} of N independent data samples sμ, and assuming the data set can be efficiently stored in a classical memory, the amount of memory required to store each data sample is polynomial in n, and there are polynomially many samples. For example, the samples sμ may be bitstrings, which includes data sets like binary images and time series data, categorical and count data, and binarized continuous data. As another example, the data may originate from measurements on a quantum system. In this case sμ identifies an element of the positive operator-valued measure describing the measurement.
Next, we define the machine learning model which is used herein for data fitting. The fully-visible QBM is an n-qubit mixed quantum state of the form
\[
\rho_\theta = \frac{e^{\mathcal{H}_\theta}}{Z}, \tag{1}
\]
where Z=Tr[e^{ℋθ}] is the partition function. The parameterized Hamiltonian is defined as
\[
\mathcal{H}_\theta = \sum_{i=1}^{m} \theta_i H_i, \tag{2}
\]
where θ∈ℝ^m is the parameter vector, and {Hi} is a set of m Hermitian and orthogonal operators acting on the 2^n-dimensional Hilbert space. For example, these could be n-qubit Pauli operators, Fermionic operators, or any other suitable operators. As the true form of the target density matrix is unknown, the operators {Hi} in the Hamiltonian are chosen without certainty that the choice is optimal. It is possible that, once the Hamiltonian ansatz is chosen, the space of QBM models does not contain the target, i.e., ρθ≠η, ∀θ. This is called a model mismatch, and it may be unavoidable in machine learning. In particular, since we require the number of operators m to be polynomial in n, ρθ cannot encode an arbitrary density matrix.
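By way of illustration only, the construction of a fully-visible QBM state from a Hamiltonian ansatz can be sketched numerically for a small system. The sketch below assumes a 2-qubit ansatz with three Pauli strings and illustrative parameter values (the operator choice, the parameter values, and the helper name `gibbs_state` are hypothetical and are not taken from the disclosure), and it uses dense matrices on a classical computer rather than a quantum device.

```python
import numpy as np

# Single-qubit matrices used to build the hypothetical ansatz.
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)


def gibbs_state(theta, ops):
    """Return rho_theta = exp(H_theta) / Z with H_theta = sum_i theta_i * H_i."""
    H = sum(t * h for t, h in zip(theta, ops))
    evals, evecs = np.linalg.eigh(H)           # H is Hermitian
    weights = np.exp(evals - evals.max())      # shift avoids overflow; it cancels in the ratio
    rho = (evecs * weights) @ evecs.conj().T   # V diag(weights) V^dagger
    return rho / np.trace(rho).real


# Hypothetical 2-qubit ansatz: three mutually orthogonal Pauli strings.
ops = [np.kron(Z, I2), np.kron(I2, Z), np.kron(X, X)]
theta = np.array([0.4, -0.2, 0.7])             # illustrative parameter values
rho = gibbs_state(theta, ops)

# Hilbert-Schmidt orthogonality of the operators: Tr[H_i H_j] = 0 for i != j.
gram = np.array([[np.trace(a @ b).real for b in ops] for a in ops])
print(np.round(gram, 12))
print("trace of rho:", np.trace(rho).real)
```

The dense representation grows as 2^n and is therefore only suitable for small illustrative cases.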
A natural measure to quantify how well the QBM ρθ approximates the target η is the quantum relative entropy:
\[
S(\eta\,\|\,\rho_\theta) = \mathrm{Tr}[\eta \log \eta] - \mathrm{Tr}[\eta \log \rho_\theta]. \tag{3}
\]
This measure generalizes the classical Kullback-Leibler divergence to density matrices. The relative entropy is exactly zero when the two densities are equal, η=ρθ, and S>0 otherwise. In addition, when S(η∥ρθ)≤ϵ, by Pinsker's inequality, all possible Pauli expectation values are within O(√ϵ), see Appendix C.
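As a purely illustrative numerical check, the relative entropy and the Pinsker-type bound on Pauli expectation values can be evaluated for a small example. The sketch below assumes natural logarithms, a hypothetical 2-qubit target that lies outside the model family, and hypothetical helper names (`pauli`, `gibbs_state`, `relative_entropy`).

```python
import numpy as np
from itertools import product

P = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}


def pauli(label):
    """Tensor product of single-qubit Paulis, e.g. 'ZI' -> Z (x) I."""
    out = np.eye(1, dtype=complex)
    for ch in label:
        out = np.kron(out, P[ch])
    return out


def gibbs_state(theta, ops):
    H = sum(t * h for t, h in zip(theta, ops))
    evals, evecs = np.linalg.eigh(H)
    w = np.exp(evals - evals.max())
    rho = (evecs * w) @ evecs.conj().T
    return rho / np.trace(rho).real


def relative_entropy(eta, rho):
    """S(eta || rho) = Tr[eta log eta] - Tr[eta log rho]; rho must be full rank."""
    ev = np.linalg.eigvalsh(eta)
    ev = ev[ev > 1e-12]                        # convention: 0 log 0 = 0
    rv, rvec = np.linalg.eigh(rho)
    log_rho = (rvec * np.log(rv)) @ rvec.conj().T
    return float(np.sum(ev * np.log(ev)) - np.trace(eta @ log_rho).real)


# Hypothetical target (contains a YY term that the model lacks) and model.
eta = gibbs_state([0.5, -0.3, 0.6, 0.4], [pauli(s) for s in ("ZI", "IZ", "XX", "YY")])
rho = gibbs_state([0.2, 0.1, 0.3], [pauli(s) for s in ("ZI", "IZ", "XX")])

S = relative_entropy(eta, rho)
# Largest deviation over all 2-qubit Pauli expectation values.
max_gap = max(
    abs(np.trace(pauli("".join(p)) @ (eta - rho)).real)
    for p in product("IXYZ", repeat=2)
)
print("S =", S, " max Pauli gap =", max_gap, " sqrt(2 S) =", np.sqrt(2 * S))
```

With natural logarithms, Pinsker's inequality bounds the trace distance by √(2S), and every Pauli expectation value differs by at most the trace distance, which is the relation the printed values illustrate.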
In theory one can minimize the relative entropy S(η∥ρθ) in order to find the optimal model parameters θopt=argminθS(η∥ρθ). The form of the partial derivatives of the relative entropy can be computed analytically and reads
\[
\frac{\partial S(\eta\,\|\,\rho_\theta)}{\partial \theta_i} = \langle H_i \rangle_{\rho_\theta} - \langle H_i \rangle_{\eta}. \tag{4}
\]
This is the difference between the target and model expectation values of the operators that are chosen in the ansatz. A stationary point of the relative entropy is obtained when ⟨Hi⟩ρθ=⟨Hi⟩η for i∈{1, . . . , m}. Since S is strictly convex, this stationary point is the unique global minimum.
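For illustration, the gradient expression (the difference between model and target expectation values of the ansatz operators) can be compared against a central finite-difference derivative of the relative entropy. The sketch below is a minimal check under assumed conditions: a hypothetical 2-qubit ansatz, a hypothetical target, and helper names that are not part of the disclosure.

```python
import numpy as np

P = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}


def pauli(label):
    out = np.eye(1, dtype=complex)
    for ch in label:
        out = np.kron(out, P[ch])
    return out


def gibbs_state(theta, ops):
    H = sum(t * h for t, h in zip(theta, ops))
    evals, evecs = np.linalg.eigh(H)
    w = np.exp(evals - evals.max())
    rho = (evecs * w) @ evecs.conj().T
    return rho / np.trace(rho).real


def relative_entropy(eta, rho):
    ev = np.linalg.eigvalsh(eta)
    ev = ev[ev > 1e-12]
    rv, rvec = np.linalg.eigh(rho)
    log_rho = (rvec * np.log(rv)) @ rvec.conj().T
    return float(np.sum(ev * np.log(ev)) - np.trace(eta @ log_rho).real)


ops = [pauli(s) for s in ("ZI", "IZ", "XX")]                     # hypothetical ansatz
eta = gibbs_state([0.5, -0.3, 0.6, 0.4], ops + [pauli("YY")])    # hypothetical target
theta = np.array([0.2, 0.1, -0.4])


def expectations(state):
    return np.array([np.trace(h @ state).real for h in ops])


# Analytic gradient: model expectation minus target expectation.
analytic = expectations(gibbs_state(theta, ops)) - expectations(eta)

# Central finite differences of S(eta || rho_theta).
eps = 1e-5
numeric = np.zeros_like(theta)
for i in range(len(theta)):
    step = np.zeros_like(theta)
    step[i] = eps
    numeric[i] = (
        relative_entropy(eta, gibbs_state(theta + step, ops))
        - relative_entropy(eta, gibbs_state(theta - step, ops))
    ) / (2 * eps)

print("analytic:", analytic)
print("numeric: ", numeric)
```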
Quantifying how well the QBM is trained by means of the relative entropy has some issues in practice. An accurate estimate of S(η∥ρθ) generally involves access to the entropy of the target and the partition function of the model. Due to the model mismatch, which is expected because we are choosing m operators out of exponentially many potential operators, the optimal QBM may have S(η∥ρθ
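Definition 1 (QBM learning problem). Given a polynomial-space data set {sμ} obtained from an n-qubit target density matrix η, a target precision ϵ>0, and a fully-visible QBM with Hamiltonian ℋθ=Σi θiHi (i=1, . . . , m), find a parameter vector θ such that with high probability
\[
\left| \langle H_i \rangle_{\rho_\theta} - \langle H_i \rangle_{\eta} \right| \le \epsilon, \quad \forall i. \tag{5}
\]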
A solution to the QBM learning problem always exists by Jaynes' principle: given a set of target expectations {⟨Hi⟩η}, there exists a Gibbs state ρθ such that |⟨Hi⟩ρθ−⟨Hi⟩η|=0, ∀i. However, due to the polynomial size of the data set we can only compute properties of the target (and model) to finite precision. (For example, suppose that sμ are data samples from some unknown probability distribution P(s) and that we are interested in the sample mean. An unbiased estimator for the mean is the sample average
\[
\frac{1}{M} \sum_{\mu=1}^{M} s^{\mu}.
\]
The variance of this estimator is σ²/M, where σ² is the variance of P(s). By Chebyshev's inequality, with high probability the estimation error is of order σ/√M. The polynomial size of the data set implies that the error decreases polynomially in general). Therefore, we say the QBM learning problem is solved for any precision ϵ>0 in Equation (5), whereby the expectation values of the QBM and the target should be close enough that one cannot distinguish them without enlarging the data set.
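The finite-precision argument above can be illustrated with a short simulation. The sketch below assumes a hypothetical Bernoulli distribution for P(s) and compares the empirical root-mean-square error of the sample mean with σ/√M for several data set sizes M; the distribution, the sizes, and the number of repetitions are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.3                               # hypothetical Bernoulli parameter, so the true mean is p
sigma = np.sqrt(p * (1 - p))          # standard deviation of P(s)


def rms_error(M, repeats=2000):
    """Empirical RMS error of the sample mean over many independent data sets of size M."""
    samples = rng.random((repeats, M)) < p
    means = samples.mean(axis=1)
    return float(np.sqrt(np.mean((means - p) ** 2)))


for M in (100, 400, 1600):
    print(M, rms_error(M), sigma / np.sqrt(M))
```

Quadrupling M halves the expected error, which is the polynomial (rather than exponential) improvement in precision referred to above.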
The expectation values of the target can be obtained from the data set in various ways. For example, for the generative modeling of a classical binary data set one can define a pure quantum state and obtain its expectation values (see Appendix E). For the modeling of a target quantum state (density matrix) one can estimate expectation values from the outcomes of measurements performed in different bases.
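As a hedged illustration of the classical case, the sketch below assumes one possible encoding in which the empirical distribution q(s) of a binary data set is embedded in the pure state |ψ⟩=Σs √q(s)|s⟩. This particular encoding, the toy data set, and the helper names are assumptions made for the example; the construction of Appendix E is not reproduced here.

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])


def kron_all(mats):
    out = np.eye(1)
    for mat in mats:
        out = np.kron(out, mat)
    return out


# Hypothetical binary data set: 6 samples of n = 2 bits.
data = np.array([[0, 0], [0, 1], [0, 1], [1, 1], [1, 1], [1, 1]])
n = data.shape[1]

# Empirical distribution q(s); the first bit is the most significant bit of the index.
index = data @ (2 ** np.arange(n - 1, -1, -1))
q = np.bincount(index, minlength=2 ** n) / len(data)

# Assumed encoding: amplitudes are the square roots of the empirical probabilities.
psi = np.sqrt(q)


def expectation(op):
    return float(psi @ op @ psi)


z1 = expectation(kron_all([Z, I2]))   # diagonal operator: reproduces a classical average
z2 = expectation(kron_all([I2, Z]))
x1 = expectation(kron_all([X, I2]))   # off-diagonal operator: depends on the coherences
print(z1, z2, x1)
```

Expectation values of diagonal (Z-type) operators coincide with the classical data averages, whereas off-diagonal operators probe the coherences introduced by the assumed encoding.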
As shown in Appendix C3, the solution to the QBM learning problem implies a bound on the optimal relative entropy, namely
\[
S(\eta\,\|\,\rho_\theta) - S(\eta\,\|\,\rho_{\theta^{\mathrm{opt}}}) \le 2\epsilon \, \|\theta - \theta^{\mathrm{opt}}\|_1. \tag{6}
\]
This indicates that if the QBM learning problem can be solved to precision ϵ≤ϵ′/(2∥θ−θopt∥1), one can also solve a stronger learning problem based on the relative entropy to precision ϵ′ (this involves bounding ∥θ−θopt∥1).
We approach the QBM learning problem by iteratively minimizing the quantum relative entropy, see Equation (3), in this example using stochastic gradient descent (SGD). This involves access to a stochastic gradient ĝθ
It is assumed that the stochastic gradient is unbiased, i.e., [ĝθ
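The parameters are then updated according to
\[
\theta^{t+1} = \theta^{t} - \gamma^{t}\, \hat{g}_{\theta^{t}}, \tag{7}
\]
where γt is the learning rate.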
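A minimal sketch of such a stochastic gradient descent loop is given below for illustration. It assumes a hypothetical 2-qubit ansatz and target, exact classical simulation of the Gibbs state, and zero-mean Gaussian noise added to the exact gradient as a stand-in for the finite-sample estimation error (the noise model, the learning rate, and the number of iterations are assumptions of the example and are not the values of Theorem 1).

```python
import numpy as np

P = {
    "I": np.eye(2),
    "X": np.array([[0.0, 1.0], [1.0, 0.0]]),
    "Z": np.diag([1.0, -1.0]),
}


def pauli(label):
    out = np.eye(1)
    for ch in label:
        out = np.kron(out, P[ch])
    return out


def gibbs_state(theta, ops):
    H = sum(t * h for t, h in zip(theta, ops))
    evals, evecs = np.linalg.eigh(H)
    w = np.exp(evals - evals.max())
    rho = (evecs * w) @ evecs.T
    return rho / np.trace(rho)


rng = np.random.default_rng(1)
ops = [pauli(s) for s in ("ZI", "IZ", "XX")]        # hypothetical ansatz
eta = gibbs_state([0.5, -0.3, 0.7], ops)            # hypothetical target inside the model family
target_exp = np.array([np.trace(h @ eta) for h in ops])


def model_expectations(theta):
    rho = gibbs_state(theta, ops)
    return np.array([np.trace(h @ rho) for h in ops])


def stochastic_gradient(theta, noise=0.02):
    # Assumption: Gaussian noise models the estimation error of both sets of expectations.
    exact = model_expectations(theta) - target_exp
    return exact + rng.normal(0.0, noise, size=exact.shape)


theta = np.zeros(len(ops))                          # start from the maximally mixed state
gamma = 0.1                                         # constant learning rate (illustrative)
initial_error = np.max(np.abs(model_expectations(theta) - target_exp))
for t in range(500):
    theta = theta - gamma * stochastic_gradient(theta)
final_error = np.max(np.abs(model_expectations(theta) - target_exp))
print("initial error:", initial_error, " final error:", final_error)
```

The quantity tracked here is the largest difference between model and target expectation values, which is the quantity bounded in Definition 1.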
With this method, the QBM learning problem may be solved with polynomial sample complexity. We state this in the following theorem, which is an important aspect of the approach described herein.
Theorem 1 (QBM training). We have a QBM defined by a set of n-qubit Pauli operators {Hi}, i=1, . . . , m, a precision κ for the QBM expectations, a precision ξ for the data expectations, and a target precision ϵ such that κ²+ξ²≥ϵ/2m. After
iterations of stochastic gradient descent on the relative entropy S(η∥ρθ) with constant learning rate
we have
where 𝔼[ · ] denotes the expectation with respect to the random variable θt. Each iteration t∈{0, . . . , T} involves
preparations of the Gibbs state ρθt, and the success probability of the full algorithm is λ. Here, δ0=S(η∥ρθ
The success probability is the probability that the QBM expectation values are determined correctly. It is a free parameter which can be set to a value for performing the experiment and determines how many measurements are to be performed.
A detailed proof of this theorem is given in Appendix C2 and involves carefully combining three important observations and results. First, it is shown that the quantum relative entropy for any QBM ρθ is L-smooth with L=2m maxj∥Hj∥₂². This is then combined with SGD convergence results from the machine learning literature to obtain the number of steps T. Finally, sampling bounds from quantum shadow tomography are used to obtain the number of preparations N. This last step focuses on the shadow tomography protocol, which normally restricts the results to Pauli observables Hi≡Pi, thus ∥Hi∥₂=1. It is possible to extend this to generic two-outcome observables with a polylogarithmic overhead compared to Equation (10), see Appendix C2. Furthermore, for k-local Pauli observables, we can improve the result to
with classical shadows constructed from randomized measurement or by using pure thermal shadows.
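The L-smoothness property used in the proof can be illustrated numerically for a small case. The sketch below assumes a hypothetical 2-qubit Pauli ansatz and estimates the Hessian of the relative entropy by finite differences of the model expectation values (the target-dependent part of the gradient does not depend on θ and therefore drops out of the Hessian). The random parameter draws and the helper names are assumptions of the example.

```python
import numpy as np

P = {
    "I": np.eye(2),
    "X": np.array([[0.0, 1.0], [1.0, 0.0]]),
    "Z": np.diag([1.0, -1.0]),
}


def pauli(label):
    out = np.eye(1)
    for ch in label:
        out = np.kron(out, P[ch])
    return out


def gibbs_state(theta, ops):
    H = sum(t * h for t, h in zip(theta, ops))
    evals, evecs = np.linalg.eigh(H)
    w = np.exp(evals - evals.max())
    rho = (evecs * w) @ evecs.T
    return rho / np.trace(rho)


ops = [pauli(s) for s in ("ZI", "IZ", "XX", "ZZ")]   # hypothetical ansatz
m = len(ops)


def model_expectations(theta):
    rho = gibbs_state(theta, ops)
    return np.array([np.trace(h @ rho) for h in ops])


def hessian(theta, eps=1e-5):
    """Finite-difference Hessian of S(eta || rho_theta) with respect to theta."""
    hess = np.zeros((m, m))
    for j in range(m):
        step = np.zeros(m)
        step[j] = eps
        hess[:, j] = (model_expectations(theta + step) - model_expectations(theta - step)) / (2 * eps)
    return 0.5 * (hess + hess.T)                      # symmetrize away numerical noise


rng = np.random.default_rng(0)
L_bound = 2 * m * 1.0                                 # Pauli strings have unit spectral norm
largest, smallest = -np.inf, np.inf
for _ in range(20):
    eigs = np.linalg.eigvalsh(hessian(rng.normal(0.0, 1.0, size=m)))
    largest, smallest = max(largest, eigs[-1]), min(smallest, eigs[0])
print("largest Hessian eigenvalue:", largest, " bound L:", L_bound)
print("smallest Hessian eigenvalue:", smallest)
```

In this sketch the largest eigenvalue stays below the stated bound and the smallest eigenvalue is non-negative up to numerical error, consistent with smoothness and convexity of the relative entropy.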
By combining Equations (8) and (10), we see that the final number of Gibbs state preparations Ntot≥T×N scales polynomially with m, the number of terms in the QBM Hamiltonian. According to our assumption of classical memory, we can only have m∈O(poly(n)). This means that the number of required measurements to solve QBM learning scales polynomially with the number of qubits (features). Consequently, there are no barren plateaus in the optimization landscape for this problem, where a barren plateau of a loss function f(θ) is defined by the vanishing of its gradient, E[∇θf(θ)]=0, and also by an exponentially decreasing variance of the gradient, var[∇θf(θ)]∈O(2^(−n)).
The following theorem is proved in Appendix C2.
Theorem 2 (α-strongly convex QBM training). We have a QBM defined by a Hamiltonian ansatz ℋθ such that S(η∥ρθ) is α-strongly convex, a precision κ for the QBM expectations, a precision ξ for the data expectations, and a target precision ϵ such that κ²+ξ²≥ϵ/2m.
After
iterations of stochastic gradient descent on the relative entropy S(η∥ρθ) with learning rate
(see Appendix C.2 for the specific learning rate schedule), we have:
Each iteration involves the number of samples given in Equation (10).
The sample bound in Theorem 1 depends on δ0, the relative entropy difference of the initial and optimal QBMs. This means that if we can lower the initial relative entropy, we also tighten the bound on the QBM learning sample complexity. In this respect, it is shown that δ0 can be reduced by pre-training a subset of the parameters in the Hamiltonian ansatz. Thus, pre-training reduces the number of steps to reach the global minimum.
Theorem 3 (QBM pre-training). Assume a target η and a QBM model ρθ=eΣ
where θpre=[χpre, 0m−m̃] (the pre-trained values followed by m−m̃ zeros) and the vector χpre of length m̃ contains the parameters for the terms {Hi}, i=1, . . . , m̃, at the end of pre-training. More precisely, starting from
and minimizing S(η∥ρθ) with respect to χ ensures Equation (14) for any S(η∥ρχ
We provide a detailed proof of Theorem 3 in Appendix D.1, which applies to any method that is able to minimize the relative entropy with respect to a subset of the parameters. All the other parameters are fixed to specific values, generally (but without limitation) zero, and the pre-training starts from the maximally mixed state 𝕀/2^n. For example, one could use SGD as described above, and apply updates only to the chosen subset of parameters. With a suitable learning rate, this ensures that pre-training lowers the relative entropy compared to that of the maximally mixed state, S(η∥𝕀/2^n). As a consequence, it is possible to add additional, linearly independent terms to the QBM ansatz without having to retrain the model from scratch. The performance is guaranteed to improve, specifically towards the global optimum, due to the strict convexity of the relative entropy. This is in contrast to other QML models which do not have a convex loss function. This is particularly useful if a certain subset of the QBM ansatz is pre-trained classically before training the full model on a quantum device. For example, in Appendix D.2 mean-field and Gaussian Fermionic QBM pre-training models are presented with closed-form expressions for the optimal subset of parameters.
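The two-stage procedure can be sketched, for illustration only, with exact gradient descent on a small classically simulated example. The sketch assumes a hypothetical 2-qubit ansatz in which the single-qubit terms form the pre-training subset, a hypothetical target with a term outside the ansatz, and an illustrative learning rate and step count; both stages are simulated classically here, whereas in the approach described herein the second stage would be at least partly performed on quantum hardware.

```python
import numpy as np

P = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}


def pauli(label):
    out = np.eye(1, dtype=complex)
    for ch in label:
        out = np.kron(out, P[ch])
    return out


def gibbs_state(theta, ops):
    H = sum(t * h for t, h in zip(theta, ops))
    evals, evecs = np.linalg.eigh(H)
    w = np.exp(evals - evals.max())
    rho = (evecs * w) @ evecs.conj().T
    return rho / np.trace(rho).real


def relative_entropy(eta, rho):
    ev = np.linalg.eigvalsh(eta)
    ev = ev[ev > 1e-12]
    rv, rvec = np.linalg.eigh(rho)
    log_rho = (rvec * np.log(rv)) @ rvec.conj().T
    return float(np.sum(ev * np.log(ev)) - np.trace(eta @ log_rho).real)


full_ops = [pauli(s) for s in ("ZI", "IZ", "XI", "IX", "ZZ", "XX")]   # hypothetical ansatz
m, m_pre = len(full_ops), 4                     # pre-train only the single-qubit terms
eta = gibbs_state([0.6, -0.4, 0.3, 0.2, 0.5, 0.7, 0.3], full_ops + [pauli("YY")])
target_exp = np.array([np.trace(h @ eta).real for h in full_ops])


def gradient(theta):
    rho = gibbs_state(theta, full_ops)
    return np.array([np.trace(h @ rho).real for h in full_ops]) - target_exp


def descend(theta, mask, steps=300, gamma=0.08):
    """Exact gradient descent that only updates the parameters selected by mask."""
    for _ in range(steps):
        theta = theta - gamma * mask * gradient(theta)
    return theta


pre_mask = np.array([1.0] * m_pre + [0.0] * (m - m_pre))
theta_pre = descend(np.zeros(m), pre_mask)      # first stage: theta_pre = [chi_pre, 0, ..., 0]
theta_full = descend(theta_pre, np.ones(m))     # second stage: warm start on the full ansatz

S_mixed = relative_entropy(eta, gibbs_state(np.zeros(m), full_ops))
S_pre = relative_entropy(eta, gibbs_state(theta_pre, full_ops))
S_full = relative_entropy(eta, gibbs_state(theta_full, full_ops))
print("maximally mixed:", S_mixed, " after pre-training:", S_pre, " after full training:", S_full)
```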
In particular, the input data comprises two components. The first component is a QBM associated with a Hamiltonian ansatz. This first component in effect represents the ML model which is to be trained. The second component comprises a set of data values (samples), which represent Hamiltonian expectation values for the target. For example, the data may represent measurements performed on a target quantum system. This set of samples has a polynomial size with respect to the size of the QBM (which corresponds to the number of qubits used for a quantum-based implementation of the QBM). Because of this polynomial scaling, the training can converge using accessible levels of computing resources.
The output data (right-hand image) corresponds to the QBM and ansatz Hamiltonian model shown as the input data (left-hand image) after training the QBM model using the target data set also shown in the input data. The right-hand image also depicts new samples s∼ρθ
The central box in
As discussed herein, in the procedure shown in
The central box of |⟨Hi⟩ρθ−⟨Hi⟩η|≤ϵ, ∀i, whereby the respective model and target expectations must be close to within a polynomial precision ϵ (see Theorems 1 and 2). As shown herein, the QBM learning can then be solved by minimizing the quantum relative entropy S(η∥ρθ) with SGD and using a polynomial number of Gibbs states (see Theorems 1 and 2). It is further shown with Theorem 3 that the pre-training strategies which optimize a selected subset of the QBM parameters are guaranteed to lower the initial quantum relative entropy. The SGD algorithm outputs the QBM model parameters θT in a polynomial number of steps (iterations) T, and these can be used as a trained system for using samples of new data to provide predicted outcomes.
Accordingly,
To further investigate the above theoretical findings, numerical experiments on QBM learning were performed using data sets constructed from a quantum source and a classical source. First, we focus on reducing the initial relative entropy S(η∥ρθ ℋθGF=Σi,j θ̃ij C⃗i†C⃗j of Fermionic creation and annihilation operators, where θ̃ij is the 2n×2n Hermitian parameter matrix which has n² free parameters. Here C⃗†=[c1†, . . . , cn†, c1, . . . , cn], with the operators satisfying {ci,cj}=0 and {ci†,cj}=δij, where {A,B}=AB+BA is the anti-commutator. The advantage of the MF and GF pre-training is that there exists a closed-form solution given target expectation values ⟨Hi⟩η. This is shown in Appendix D.
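By way of a hedged example, a closed-form mean-field solution can be sketched under the assumption that the mean-field ansatz consists of all single-qubit Pauli terms, so that the Gibbs state factorizes into single-qubit states exp(a·σ)/Z with ⟨σk⟩=tanh(|a|)ak/|a|. This assumed form, the hypothetical target, and the helper names are illustrative; the expressions of Appendix D are not reproduced here.

```python
import numpy as np

P = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}


def pauli(label):
    out = np.eye(1, dtype=complex)
    for ch in label:
        out = np.kron(out, P[ch])
    return out


def gibbs_state(theta, ops):
    H = sum(t * h for t, h in zip(theta, ops))
    evals, evecs = np.linalg.eigh(H)
    w = np.exp(evals - evals.max())
    rho = (evecs * w) @ evecs.conj().T
    return rho / np.trace(rho).real


n = 2


def single(k, i):
    """Pauli k acting on qubit i and identity elsewhere."""
    return pauli("".join(k if q == i else "I" for q in range(n)))


# Assumed mean-field ansatz: X, Y and Z on every qubit.
mf_ops = [single(k, i) for i in range(n) for k in "XYZ"]

# Hypothetical correlated target state.
eta = gibbs_state([0.6, -0.4, 0.3, 0.5, 0.7], [pauli(s) for s in ("ZI", "IZ", "XI", "ZZ", "XX")])

theta = []
for i in range(n):
    r = np.array([np.trace(single(k, i) @ eta).real for k in "XYZ"])   # target Bloch vector
    norm = np.linalg.norm(r)
    # Invert <sigma^k> = tanh(|a|) a_k / |a| for the single-qubit Gibbs state.
    theta.extend(np.arctanh(norm) * r / norm if norm > 1e-12 else np.zeros(3))
theta = np.array(theta)

rho_mf = gibbs_state(theta, mf_ops)
model_exp = np.array([np.trace(h @ rho_mf).real for h in mf_ops])
target_exp = np.array([np.trace(h @ eta).real for h in mf_ops])
print("model: ", np.round(model_exp, 6))
print("target:", np.round(target_exp, 6))
```

The resulting parameters reproduce all single-qubit expectation values of the target without any iterative optimization, which is what makes this kind of pre-training inexpensive on classical hardware.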
In contrast, the GL models are defined with a Hamiltonian ansatz
for which, in general, the parameter vector θ⃗≡{λ,σ} cannot be found analytically. Here the sum over ⟨i,j⟩ imposes some constraints on the (geometric) locality of the model, i.e., summing over all possible nearest neighbors in some d-dimensional lattice. In particular, we choose one- and two-dimensional locality constraints consistent with the assumptions given in the literature. In these specific cases the relative entropy is strongly convex, thus pre-training with SGD has the improved performance guarantees from Theorem 2.
In the left-hand portion of /Z is used as the Gibbs state of the one-dimensional XXZ model for the Quantum Data, and a target η which coherently encodes the binary salamander retina data was adopted for the Classical Data.
As mentioned above,
Accordingly, it is observed for both targets (quantum data and classical data) that all pre-training strategies are successful in reducing S(η∥ρθ
The effect of using the pre-trained models as a starting point for QBM learning with exact gradient descent was investigated for a fully-connected QBM with
In this context (compared to Equation (15)), any qubit can be connected to any other qubit, and there is no constraint on the geometric locality. This is a QBM Hamiltonian known in the literature. We consider data from the quantum target η, the Gibbs state of the one-dimensional XXZ model, for 8 qubits.
In
The performance of the MF pre-trained model (middle line) is better than the top line corresponding to no pre-training at all iterations, but the improvement is relatively modest. Using a 2D GL model (bottom line) for pre-training has a much more significant effect, with S(η∥ρθ
Therefore, choosing a larger learning rate might reduce the benefits of pre-training.
The bound on the number of SGD updates, as per Equation (8) for Theorem 1, was numerically confirmed. This involved considering data from the classical salamander retina target with 8 variables and a fully-connected QBM model on 8 qubits. As noted above, (109) on the number of steps in Theorem 1, which is the worst case scenario.
An operational definition of quantum Boltzmann machine (QBM) learning has been developed and it is shown that this problem can be solved with polynomially many preparations of quantum Gibbs states. To prove the relevant bounds, the properties of the quantum relative entropy are used in combination with the performance guarantees of stochastic gradient descent (SGD). There is no assumption as to the form of the QBM Hamiltonian, other than that it consists of polynomially many terms. This is in contrast with some earlier works that looked at the somewhat related Hamiltonian learning problem only for geometrically local models. In that context, strong convexity is required in order to relate the optimal Hamiltonian parameters to the Gibbs state expectation values. In the machine learning setting described herein, the form of the target Hamiltonian is not known a priori. Therefore, learning the exact parameters is not as directly relevant, and instead the focus is directly on the expectation values. For this reason, the bounds for the approach described herein only involve L-smoothness of the relative entropy and may be applied to all types of QBMs without hidden units.
It is also shown herein that the theoretical sampling bounds may be tightened by lowering the initial relative entropy of the learning process. Typically, one would start QBM learning from the maximally mixed state, i.e., the state with no prior information. It is shown herein that pre-training on any subset of the parameters performs better than (or equal to) the maximally mixed state. This is beneficial if one can efficiently perform the pre-training, as is shown herein to be the case for mean-field, Gaussian Fermionic, and geometrically local QBMs. The performance of these models and the theoretical bounds are verified with classical numerical simulations. These simulations also indicate that knowledge about the target (e.g., its dimension, degrees of freedom, etc.) can significantly improve the training process. Furthermore, it is found that the generic bounds adopted herein are quite loose, and in practice it may be feasible to use a much smaller number of samples.
In some implementations, the sample bound may be tightened by going beyond the plain SGD method described so far. This could be done in various ways, such as by adding momentum, by using other advanced update schemes, and/or by exploiting the convexity of the relative entropy. This may improve the
scaling in our bounds, and generally conforms to the approach described herein, whereby the QBM learning problem can be solved with polynomially many preparations of Gibbs states.
Another point of interest concerns the training performance of different ansätze. Generative models are often assessed in terms of training quality, and their generalization capabilities have recently been investigated by both classical and quantum machine learning researchers. For the case of QBMs, generalization may offer a path for further development.
The operations and results described herein may also be generalized to QBM models with hidden units. This generalization could involve showing L-smoothness of the relative entropy for a more general and challenging setup, and a positive result would provide a facility to train highly expressive models. In this respect, it is noted that the results presented herein already hold for the special case of a QBM with fixed hidden units, since this problem reduces to the one discussed above.
The pre-training result described herein may be useful for implementing QBM learning on near-term and early fault-tolerant quantum devices. To this end, a quantum computer may be used as a Gibbs sampler. There are many quantum algorithms in the literature that produce Gibbs states with a quadratic improvement in time complexity over the best existing classical algorithms. Moreover, the use of a quantum device gives an exponential reduction in space complexity in general. For example, Motta et al. implemented a 2 qubit Gibbs state for an anti-ferromagnetic Ising model Hamiltonian on the Aspen-1 quantum computer. It is anticipated that improved quantum processing devices with higher gate fidelities and higher qubit counts, such as (but not limited to) Quantinuum's system model H2 or the Aspen-M-3, may be able to prepare similar Gibbs states and potentially for more complex Hamiltonians (i.e. with more operators in the ansatz). A further possibility is to sidestep the Gibbs state preparation and use algorithms that directly estimate Gibbs-state expectation values, e.g., by constructing classical shadows of pure thermal quantum states. This reduces the number of qubits and, potentially, the circuit depth.
The results presented herein support a range of methods for incremental learning QBMs driven by the availability of both training data and quantum hardware. For example, one could select a Hamiltonian ansatz that is very well suited for a particular quantum device. After exhausting all available classical resources during the pre-training, the model may be enlarged, and the training then continues on a quantum device, which therefore improves the overall performance. As quantum hardware matures, it allows the execution of deeper circuits and so supports a further increase of the model size. Incremental QBM training strategies may be designed to follow the quantum hardware road map towards training larger and more expressive quantum machine learning models.
The results presented herein support the development of methods for incremental learning by QBMs driven by the availability of both training data and quantum hardware. For example, one could select a Hamiltonian ansatz that is very well suited for a particular quantum device. After exhausting all available classical resources during the pre-training phase on selected components of the model (such as by selecting subsets of the operators and parameters), the model is enlarged and continues the training on the quantum device, which is guaranteed to improve the performance (compared to the output at the end of the pre-training phase). As quantum hardware continues to develop further, this allows the execution of deeper circuits and a further increase in the model size. Incremental QBM training strategies may be designed to follow the quantum hardware road map, towards training larger and more expressive quantum machine learning models.
The Quantum Boltzmann machine with an ansatz Hamiltonian may be further provided with target expectation values for performing the first stage of training. For example, a QBM $\rho_\theta$ with ansatz Hamiltonian may be given by a set of operators $\{H_i\}_{i=1}^{m}$, parameters $\{\theta_i\}_{i=1}^{m}$, and the target expectation values $\{\langle H_i\rangle_\eta\}_{i=1}^{m}$.
Operation 320 performs a first stage of training the model against data from a target using a selected subset of the operators to obtain optimized values for a subset of the parameters. The first stage of training is performed on classical computing hardware to provide a partly trained model.
In the first stage of training, a subset $\{H_i\}_{i=1}^{\tilde{m}}$ of $\tilde{m}$ operators that can be trained classically may be selected. The selection of a subset of operators of a Hamiltonian may have regard to the operators that can be efficiently trained in a classical context. The relative entropy may be optimized on classical hardware with respect to a selected subset of the parameters while keeping other parameters set to zero (or any other suitable values). The optimal parameters obtained after the classical pre-training (having exhausted the available classical resources) may be saved. The pre-training may be iterated over t=1 to $T_{\mathrm{pre}}$, where $T_{\mathrm{pre}}$ represents a maximum number of iterations (if convergence does not occur beforehand). This pre-training seeks to optimize the relative entropy $S(\eta\|\rho_\theta)$ with respect to the subset $\{\theta_i\}_{i=1}^{\tilde{m}}$ of parameters while keeping the other (non-selected) parameters set to a fixed value such as zero.
Operation 330 performs a second stage of training the model against data from the target using the full set of operators to obtain optimized values for a larger subset of the set of parameters for the model. The second stage of training is performed on quantum computer hardware to provide a further trained model. The optimized parameter values saved from the first stage may be used to initialize the corresponding parameters for the second stage of training.
The larger subset of the set of parameters for the model may, in some implementations, comprise the full set of parameters for the model. Accordingly, the second phase of training may encompass the full set of parameters for the model. (It is implicit that the first phase of training does not involve training all the parameters of the set, because this would not allow the second phase of training to involve a larger subset.)
The second phase of training may be iterated over t=1 to Tq1, where Tq1 represents a maximum number of iterations (if convergence does not occur beforehand). In this second phase of training, the relative entropy may be optimized with respect to all the parameters in the model and target by computing Gibbs state expectation values on a quantum device. Before performing this optimization, the ansatz Hamiltonian is extended with a further set of operators and parameters (enlarging the model). These further operators and parameters are those that were not included in their respective subsets during the pre-training phase (and so have not yet been incorporated into the model).
Accordingly, the second phase optimizes the relative entropy $S(\eta\|\rho_\theta)$ with respect to all of the parameters $\{\theta_i\}_{i=1}^{m}$ by computing the Gibbs state expectation values $\{\langle H_i\rangle_{\rho_\theta}\}$ on a quantum device, such as by using thermal shadows. The parameters of the extended (complete) QBM may be initialized using the optimal values obtained at the end of the previous quantum optimization loop (iteration). As noted above, for the first iteration, the parameters are initialized using the optimal values from the first stage of training (the pre-training). For each iteration on the quantum hardware, the additional target expectation values are determined to optimize the relative entropy with respect to all the extended QBM parameters by obtaining the required Gibbs state expectation values on the quantum hardware.
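The two training stages described above can be illustrated with a small numerical sketch. The Python example below is an illustration only and is not taken from the disclosure: it assumes a hypothetical 2-qubit model with six Pauli operators, a hypothetical target η generated from the same operator set, the sign convention ρθ = exp(Σ θi Hi)/Z, and a step size of 1/(2m). All Gibbs-state expectation values are computed exactly with NumPy as a stand-in for the quantum device; the helper names (`gibbs`, `expectations`, `train`) are illustrative.

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])

# Hypothetical ansatz: four 1-local terms (pre-trained) and two 2-local terms.
ops = [np.kron(Z, I2), np.kron(I2, Z), np.kron(X, I2), np.kron(I2, X),
       np.kron(Z, Z), np.kron(X, X)]
m, m_pre = len(ops), 4

def gibbs(theta):
    """Gibbs state exp(sum_i theta_i H_i) / Z (sign convention assumed)."""
    H = sum(t * h for t, h in zip(theta, ops))
    w, v = np.linalg.eigh(H)
    p = np.exp(w - w.max())
    p /= p.sum()
    return (v * p) @ v.T

def expectations(rho):
    return np.array([np.trace(rho @ h) for h in ops])

def log_psd(r):
    w, v = np.linalg.eigh(r)
    return (v * np.log(w)) @ v.T

def rel_entropy(eta, rho):
    return float(np.trace(eta @ (log_psd(eta) - log_psd(rho))))

def train(theta, mask, steps, lr):
    """Gradient descent on S(eta || rho_theta); only masked parameters move."""
    for _ in range(steps):
        grad = expectations(gibbs(theta)) - target
        theta = theta - lr * mask * grad
    return theta

eta = gibbs(np.array([-0.2, 0.1, 0.3, 0.3, 0.5, -0.4]))  # hypothetical target
target = expectations(eta)
lr = 1.0 / (2 * m)  # assumed step size 1/L with L = 2m for Pauli operators

theta0 = np.zeros(m)
mask_pre = np.array([1.0] * m_pre + [0.0] * (m - m_pre))
theta_pre = train(theta0, mask_pre, steps=500, lr=lr)         # stage 1: subset only
theta_full = train(theta_pre, np.ones(m), steps=3000, lr=lr)  # stage 2: all parameters

S0 = rel_entropy(eta, gibbs(theta0))
S1 = rel_entropy(eta, gibbs(theta_pre))
S2 = rel_entropy(eta, gibbs(theta_full))
print(S0, S1, S2)
```

In this sketch the non-selected parameters stay at zero during the first stage, and the second stage starts from the pre-trained values, so the relative entropy does not increase when moving from one stage to the next.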
Depending on the quantum computing resources available, the above approach may be developed further such that in a third training phase, the ansatz Hamiltonian is further extended with a set of (orthogonal) operators $\{\tilde{H}_i\}_{i=1}^{n}$ and parameters $\{\tilde{\theta}_i\}_{i=1}^{n}$. The parameters of the extended QBM are initialized as $\lambda \equiv \{\theta, \tilde{\theta}\} = \{\theta_{\mathrm{opt}}, 0\}$, where $\theta_{\mathrm{opt}}$ are the optimal parameters obtained at the end of the previous quantum optimization loop. The additional target expectation values $\langle\tilde{H}_i\rangle_\eta$ are computed and used for the training.
In this further development, the third phase of training may be iterated over t=1 to Tq2, where Tq2 represents a maximum number of iterations. Each iteration then involves an optimization of the relative entropy S(η∥ρλ) with respect to all the extended QBM parameters λ by obtaining the required Gibbs state expectation values on a quantum device.
In some implementations, the second stage of the training (and/or the third stage of the training if relevant) may be performed on a hybrid system which includes both quantum computing hardware and classical computing hardware. For example, Gibbs states for the Quantum Boltzmann machine may be used to provide samples for machine learning. The Gibbs states may be prepared and sampled on the quantum computing hardware, whereas the parameters for the model may be maintained on classical computing hardware. Various other configurations of a hybrid system may also be used for the second and/or third training stages.
The classical computing platform 410 further includes an optimization (minimization) program 420, for example a program which performs stochastic gradient descent (SGD). In broad terms, the optimization program 420 may obtain samples, as represented by expectation values of the operators in the Hamiltonian ansatz 415, for comparison with training data, namely target data 480. The optimization program uses the results of these comparisons to update the parameters of the Hamiltonian ansatz 415 so as to reduce quantum relative entropy. The optimization program 420 performs multiple iterations of this machine learning process to reach a configuration of the model parameters which has a low (minimal) quantum relative entropy.
The first stage of the process (pre-training) is performed solely on the classical computing device 410. Such a device may not have enough processing capability to perform the whole optimization procedure. Thus, as described herein, the pre-training may be performed, for example, with respect to a subset of the model parameters. The remaining parameters (those not in the subset) may be held at a fixed value, such as zero. Using a subset of the parameters for the optimization (such as with SGD) generally reduces the computational resources used for this first stage of training.
The second stage of the process (after the pre-training) involves the use of the quantum computing platform 450. The quantum computing platform 450 includes a quantum circuit 452 associated with one or more qubits 455 to support computations running on the quantum computing platform 450. The quantum computing platform 450 also includes a QBM 425 associated with the Hamiltonian ansatz. This Hamiltonian on the quantum computer 450 generally matches the Hamiltonian ansatz 415 on the classical computing device 410, especially in terms of the associated model, but they are adapted to run on different hardware platforms as shown in
In the example of
By measuring the physical properties of the QBM (425) prepared on the quantum device (450), the optimizer (SGD) can search in parallel across parameter space to find parameter values that have the lowest relative entropy. This ability may offer the potential of performing machine learning on the quantum computer 450 that is not computationally feasible on a classical computer 410 (or is more computationally expensive on a classical computer). For example, the second phase of the searching may be performed with a larger subset (or complete set) of the parameters for the model. Accordingly, the approach described herein exploits the different properties and characteristics of classical and quantum computing devices to support an efficient approach for machine learning with respect to complex systems.
Various implementations and examples have been disclosed herein. It will be appreciated that these implementations and examples are not intended to be exhaustive, and the skilled person will be aware of many potential variations and modifications of these implementations and examples that fall within the scope of the present disclosure. It will also be understood that features of particular implementations and examples can typically be incorporated into other implementations and examples (unless the context clearly indicates to the contrary). In summary, the various implementations and examples herein are disclosed by way of illustration rather than limitation, and the scope of claimed embodiments is defined in the appended claims.
Here we identify some useful mathematical facts and relations and derive some useful results that are used in the proofs in later appendices.
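Definition 2 (Convexity). A multivariate function $f: \mathbb{R}^m \to \mathbb{R}$ is said to be convex when
$$f(tx + (1-t)y) \le t\,f(x) + (1-t)\,f(y) \quad \text{for all } x, y \in \mathbb{R}^m,\ t \in [0,1].$$
If additionally the gradient $\nabla f(x^*)$ is zero only for one unique vector $x^* \in \mathbb{R}^m$, then ƒ is said to be strictly convex.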
The following Lemma can be deduced from the standard definition of convexity; see Garrigos et al., “Handbook of Convergence Theorems for (Stochastic) Gradient Methods”, arXiv:2301.11235, 2023 (hereinafter “Garrigos”).
Lemma 1. Let ƒ be twice continuously differentiable. Then ƒ is convex if
$$v^{T}\,\nabla^2 f(x)\,v \ge 0 \quad \text{for all } x, v \in \mathbb{R}^m. \tag{A2}$$
A stronger version of convexity is used in some of our discussions.
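Definition 3 (α-Polyak-Lojasiewicz). Let $f: \mathbb{R}^m \to \mathbb{R}$, and α>0. We say that ƒ is α-Polyak-Lojasiewicz if
$$f(x) - \inf f \le \frac{1}{2\alpha}\,\|\nabla f(x)\|^2 \quad \text{for all } x \in \mathbb{R}^m,$$
where ∥⋅∥ is the Euclidean norm.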
An even stronger convexity condition is the following.
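Definition 4 (α-strong convexity). Let $f: \mathbb{R}^m \to \mathbb{R}$, and α>0. We say that ƒ is α-strongly convex if
$$f(tx + (1-t)y) + \frac{\alpha}{2}\,t(1-t)\,\|x-y\|^2 \le t\,f(x) + (1-t)\,f(y) \quad \text{for all } x, y \in \mathbb{R}^m,\ t \in [0,1].$$
Strong convexity implies the Polyak-Lojasiewicz condition: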
Lemma 2. If ƒ is α-strongly convex then ƒ is α-Polyak-Lojasiewicz.
The strong convexity of a function can be tested as follows.
Lemma 3. Let ƒ be twice continuously differentiable. Then ƒ is α-strongly convex if
$$v^{T}\,\nabla^2 f(x)\,v \ge \alpha\,\|v\|^2 \quad \text{for all } x, v \in \mathbb{R}^m.$$
Besides convexity, we also need to characterize the smoothness of a function.
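Definition 5 (L-smoothness). Let $f: \mathbb{R}^m \to \mathbb{R}$ and L>0. We say that ƒ is L-smooth if it is differentiable and if the gradient ∇ƒ is L-Lipschitz:
$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\| \quad \text{for all } x, y \in \mathbb{R}^m.$$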
For L-smooth functions we have the following useful property (see Garrigos).
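Lemma 4 (Descent lemma). Let $f: \mathbb{R}^m \to \mathbb{R}$ be a twice differentiable, L-smooth function, then
$$f(y) \le f(x) + \langle \nabla f(x),\, y - x\rangle + \frac{L}{2}\,\|y - x\|^2 \quad \text{for all } x, y \in \mathbb{R}^m.$$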
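The derivative of the matrix exponential $e^{H}$ with respect to a parameter is given by Duhamel's formula
$$\partial_\theta e^{H} = \int_0^1 e^{sH}\,(\partial_\theta H)\,e^{(1-s)H}\,ds.$$
Taking $H = W + \theta V$, with simple manipulations we find a useful alternative expression
$$\langle j|\,\partial_\theta e^{H}\,|k\rangle = V_{jk}\,\frac{e^{\lambda_j} - e^{\lambda_k}}{\lambda_j - \lambda_k}.$$
Here we use the basis diagonalizing the Hamiltonian, $H = \sum_j \lambda_j |j\rangle\langle j|$, and we introduce the notation $V_{jk} = \langle j|V|k\rangle$. The above expression is valid also for the diagonal entries, k=j, since
$$\lim_{\lambda_k \to \lambda_j} \frac{e^{\lambda_j} - e^{\lambda_k}}{\lambda_j - \lambda_k} = e^{\lambda_j}.$$
With the notation
$$\hat{f}(\omega) = \frac{\tanh(\omega/2)}{\omega/2},$$
we can write
$$\langle j|\,\partial_\theta e^{H}\,|k\rangle = V_{jk}\,\hat{f}(\lambda_j - \lambda_k)\,\frac{e^{\lambda_j} + e^{\lambda_k}}{2}.$$
Let us interpret $\hat{f}(\omega)$ as the Fourier transform of another function: $\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t)\,e^{-it\omega}\,dt$. Plugging this in the previous expression we obtain
$$\partial_\theta e^{H} = \frac{1}{2}\,\{\Phi(V),\, e^{H}\}.$$
Here {A, B}=AB+BA is the anti-commutator, and we have defined $\Phi(V) = \int_{-\infty}^{\infty} f(t)\,e^{itH}\,V\,e^{-itH}\,dt$.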
We have recovered, by different means, a result that is achievable via the method described in Hastings, “Quantum belief propagation: An algorithm for thermal quantum systems”, Phys. Rev. B 76, 201102 (2007).
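The spectral form of the derivative of the matrix exponential can be checked numerically. The following Python sketch is illustrative only: it assumes randomly generated Hermitian test matrices W and V (hypothetical inputs), compares a central finite difference of exp(W + θV) with the divided-difference expression V_jk (e^{λ_j} − e^{λ_k})/(λ_j − λ_k) in the eigenbasis of H, and also checks the equivalent form with the assumed function f̂(ω) = tanh(ω/2)/(ω/2).

```python
import numpy as np

rng = np.random.default_rng(1)

def random_hermitian(d):
    a = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (a + a.conj().T) / 2

def expm_hermitian(h):
    """Matrix exponential of a Hermitian matrix via eigendecomposition."""
    w, u = np.linalg.eigh(h)
    return (u * np.exp(w)) @ u.conj().T

d, theta, eps = 4, 0.3, 1e-6
W, V = random_hermitian(d), random_hermitian(d)  # hypothetical test matrices

# Central finite difference of exp(W + theta V) with respect to theta.
fd = (expm_hermitian(W + (theta + eps) * V)
      - expm_hermitian(W + (theta - eps) * V)) / (2 * eps)

lam, U = np.linalg.eigh(W + theta * V)
V_jk = U.conj().T @ V @ U
omega = lam[:, None] - lam[None, :]
e = np.exp(lam)
nonzero = np.abs(omega) > 1e-12
safe = np.where(nonzero, omega, 1.0)

# Divided-difference form; the diagonal uses the limit exp(lam_j).
ratio = np.where(nonzero, (e[:, None] - e[None, :]) / safe, e[:, None])
spectral = U @ (V_jk * ratio) @ U.conj().T

# Equivalent form with f_hat(w) = tanh(w/2) / (w/2) and f_hat(0) = 1.
f_hat = np.where(nonzero, np.tanh(safe / 2) / (safe / 2), 1.0)
tanh_form = U @ (V_jk * f_hat * (e[:, None] + e[None, :]) / 2) @ U.conj().T

err_fd = np.max(np.abs(fd - spectral))
err_forms = np.max(np.abs(spectral - tanh_form))
print(err_fd, err_forms)
```

Both errors should be small: the first is limited by the finite-difference step, the second only by floating-point rounding.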
Set out below is a proof of some properties of the quantum relative entropy S(η∥ρθ) of a generic QBM ρθ with respect to some arbitrary target η. These properties are used for the proof of the theorems in the main text. We start by showing the convexity and afterward we show the L-smoothness.
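In order to show (strict) convexity of S, we can use Lemma 1 above. We first show that the Hessian of the quantum relative entropy with respect to the QBM parameters, $\nabla^2 S$, is positive semidefinite. Afterwards, we show that S has only one unique global optimizer θ* for which $\nabla S(\eta\|\rho_{\theta^*})=0$, and apply the Lemma.
We recall from the main text that the QBM Hamiltonian, $\mathcal{H}_\theta=\sum_i\theta_i H_i$, is a sum over Hermitian, in general non-commuting, operators $H_i$. Using the derivative of the matrix exponential in Equation (A12), and writing $\rho_\theta = e^{\mathcal{H}_\theta}/Z$ with $Z=\mathrm{Tr}[e^{\mathcal{H}_\theta}]$, we have:
$$\partial_{\theta_i} S(\eta\|\rho_\theta) = -\mathrm{Tr}[\eta\,\partial_{\theta_i}\log\rho_\theta] = -\langle H_i\rangle_\eta + \frac{1}{2Z}\,\mathrm{Tr}\big[\{\Phi(H_i),\, e^{\mathcal{H}_\theta}\}\big] = \langle H_i\rangle_{\rho_\theta} - \langle H_i\rangle_\eta. \tag{B1}$$
In the last step we use the cyclic property of the trace. This is Equation (4) in the main text that precedes the appendix. We now take the second derivative starting from Equation (B1):
$$\partial_{\theta_j}\partial_{\theta_i} S(\eta\|\rho_\theta) = \partial_{\theta_j}\frac{\mathrm{Tr}[H_i\, e^{\mathcal{H}_\theta}]}{Z} = \frac{1}{2}\,\mathrm{Tr}\big[H_i\,\{\Phi(H_j),\,\rho_\theta\}\big] - \langle H_i\rangle_{\rho_\theta}\langle H_j\rangle_{\rho_\theta} = \frac{1}{2}\,\mathrm{Tr}\big[\rho_\theta\,\{H_i,\,\Phi(H_j)\}\big] - \langle H_i\rangle_{\rho_\theta}\langle H_j\rangle_{\rho_\theta}. \tag{B2}$$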
In the last step we used Tr[A{B, C}]=Tr[C{A, B}] to rearrange the terms.
As Φ(V) is a Hermitian operator for any Hermitian V we see that the Hessian has the form of a covariance matrix.
It is then readily shown to be positive semidefinite and satisfies Equation (A2). For any vector $v\in\mathbb{R}^m$
Here we define the Hermitian operator $W=\sum_n v_n H_n$. The last line is the expectation value of the square of a Hermitian operator, and as such it must be non-negative.
This means that the quantum relative entropy is convex. We now show strict convexity by a contradiction argument, following Proposition 17 in Anshu et al., “Sample-efficient learning of interacting quantum systems”, Nature Physics 17, 931 (2021) (hereinafter “Anshu”). Assume we have found one set of parameters θ* with $\nabla S(\eta\|\rho_{\theta^*})=0$. Then from Equation (B1) we have $\langle H_i\rangle_\eta=\langle H_i\rangle_{\rho_{\theta^*}}$ for all $H_i$. Note that we can always find at least one such θ* by Jaynes' principle, see Jaynes, “Information Theory and Statistical Mechanics”, Phys. Rev. 106, 620 (1957). Next, assume there exists a different set of parameters, χ≠θ*, with $\langle H_i\rangle_\eta=\langle H_i\rangle_{\rho_\chi}$.
Similarly, by swapping ρχ and ρθ*, we find
It follows that S(ρθ*∥ρχ)=0, implying ρθ*=ρχ. Now because the operators Hi are orthogonal we have θ*=χ. This contradicts the assumption in the beginning (θ*≠χ), and we can have only one unique θ* with ∇S(η∥ρθ*)=0. Hence S is strictly convex by Definition 2.
To show α-strong convexity of S one can use Lemma 3. To the best of our knowledge there is no proof in the literature showing that quantum relative entropy of Gibbs states is strongly convex in general. On the other hand, this property has been proven for particular classes of Hamiltonians. Anshu et al. prove strong convexity for k-local Hamiltonians defined on a finite dimensional lattice. They show that in this case
a polynomial decrease with respect to the system size. Strong convexity for the more general class of low-intersection Hamiltonians was proved in Haah et al., “Optimal learning of quantum Hamiltonians from high temperature Gibbs states”, IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 135-146 (2022) (hereinafter “Haah”). Low-intersection Hamiltonians have terms that act non-trivially only on a constant number of qubits, and each term intersects non-trivially with a constant number of other terms.
In this section, we use differentiable programming to numerically analyze the smallest eigenvalue of the Hessian, $\lambda_{\min}(\nabla^2 S)$, seeking evidence for strong convexity; see Baydin et al., “Automatic differentiation in machine learning: A survey”, arXiv:1502.05767 (2018). We consider a 1D nearest-neighbor Hamiltonian:
and a fully-connected one:
We sample the coefficients uniformly at random from [−μ,μ], where the scale parameter μ sets the maximum magnitude of each entry of the coefficient vector.
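A numerical experiment of this kind can be sketched as follows. The Python example is a sketch under stated assumptions and not the analysis reported in the text: it assumes a hypothetical open 3-qubit chain with single-site X, Y, Z fields and nearest-neighbor XX, YY, ZZ couplings, the convention ρθ = exp(Σ θi Hi)/Z, and central finite differences in place of automatic differentiation. It uses the fact that, under this convention, the Hessian of S(η∥ρθ) is the Jacobian of the Gibbs-state expectation values and does not depend on η.

```python
import numpy as np

PAULI = {
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def embed(n, placed):
    """Tensor product with the given {site: label} Paulis, identity elsewhere."""
    out = np.array([[1.0 + 0j]])
    for site in range(n):
        out = np.kron(out, PAULI[placed[site]] if site in placed else np.eye(2))
    return out

def nearest_neighbour_terms(n):
    # Hypothetical operator set: local fields and nearest-neighbour couplings.
    terms = []
    for label in "XYZ":
        terms += [embed(n, {i: label}) for i in range(n)]
        terms += [embed(n, {i: label, i + 1: label}) for i in range(n - 1)]
    return terms

def gibbs_expectations(theta, terms):
    H = sum(t * h for t, h in zip(theta, terms))
    w, v = np.linalg.eigh(H)
    p = np.exp(w - w.max())
    p /= p.sum()
    rho = (v * p) @ v.conj().T
    return np.array([np.trace(rho @ h).real for h in terms])

def hessian(theta, terms, eps=1e-4):
    """Finite-difference Jacobian of the expectation values, symmetrized."""
    m = len(theta)
    hess = np.zeros((m, m))
    for j in range(m):
        step = np.zeros(m)
        step[j] = eps
        hess[:, j] = (gibbs_expectations(theta + step, terms)
                      - gibbs_expectations(theta - step, terms)) / (2 * eps)
    return (hess + hess.T) / 2

n, mu = 3, 1.0
terms = nearest_neighbour_terms(n)
theta = np.random.default_rng(0).uniform(-mu, mu, size=len(terms))
eigs = np.linalg.eigvalsh(hessian(theta, terms))
lam_min, lam_max = eigs[0], eigs[-1]
print(lam_min, lam_max)
```

Repeating the calculation for several values of μ and system sizes gives an empirical picture of how the smallest eigenvalue behaves; the largest eigenvalue can be compared with the smoothness bound 2m for Pauli operators.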
We show that the quantum relative entropy S(η∥ρθ) is an L-smooth function of θ. To do so we need an upper bound on the largest eigenvalue of the Hessian in Equation B2. We begin with the following property:
where we use that $\hat{f}>0$ and $\hat{f}_{\max}=1$. In what follows we use the above result with $\|\cdot\|_2$, the operator norm induced by the Euclidean vector norm (p=2). Let us bound the entries of the Hessian
Here we use that expectations are bounded by the largest eigenvalue or, alternatively, by the p=2 operator norm. We also use the sub-multiplicative property of the operator norm, and Equation (B6) above. We are now able to put an upper-bound on the largest eigenvalue of the Hessian matrix:
The first equality uses the fact that the Hessian is a symmetric matrix, the first inequality is a consequence of the Gershgorin circle theorem.
We can use this result to prove the L-smoothness.
Let us define a function $h(t)=\nabla S(\eta\|\rho_{y+t(x-y)})$. Then we have
where in the last step we used Equation B8. Thus, the quantum relative entropy is L-smooth with $L=2m\max_j\|H_j\|_2^2$.
In this appendix we first review useful results from the machine learning literature, then prove Theorems 1 and 2 in the main text that precedes the appendix. We also discuss a few upper bounds for the relative entropy in the context of QBM learning.
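We begin by stating three convergence results from the SGD literature. Consider a loss function $f:\mathbb{R}^m\to\mathbb{R}$ that is L-smooth (Definition 5) and bounded from below by $\beta_{\inf}\in\mathbb{R}$. The stochastic gradient $\hat{g}$ is unbiased, i.e., $\mathbb{E}[\hat{g}]=\nabla f$, and satisfies
$$\mathbb{E}\,\|\hat{g}(x)\|^2 \le 2A\,\big(f(x)-\beta_{\inf}\big) + B\,\|\nabla f(x)\|^2 + C \tag{C5}$$
for some A, B, C≥0 and all $x\in\mathbb{R}^m$. SGD iteratively minimizes ƒ according to the update rule $x_t = x_{t-1} - \gamma_t\,\hat{g}(x_{t-1})$, where $\gamma_t$ is the step size.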
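The SGD update rule can be illustrated with a minimal Python sketch. The example is illustrative only: the quadratic loss f(x) = ½∥x∥², the bounded uniform noise, the constant step size, and the names `sgd` and `noisy_gradient` are all assumptions chosen so that the gradient estimator is unbiased with bounded variance (A = 0, B = 1 in the condition above).

```python
import random

def sgd(grad_estimate, x0, step_size, num_steps):
    """Plain SGD: x_t = x_{t-1} - gamma_t * g_hat(x_{t-1})."""
    x = list(x0)
    for t in range(1, num_steps + 1):
        g = grad_estimate(x)
        gamma = step_size(t)
        x = [xi - gamma * gi for xi, gi in zip(x, g)]
    return x

rng = random.Random(0)
noise = 0.1  # hypothetical bound on the per-component estimation error

def noisy_gradient(x):
    # Exact gradient of f(x) = 0.5 * ||x||^2 plus zero-mean bounded noise,
    # so the estimator is unbiased and its variance is bounded.
    return [xi + rng.uniform(-noise, noise) for xi in x]

x_final = sgd(noisy_gradient, x0=[1.0, -2.0, 0.5],
              step_size=lambda t: 0.05, num_steps=500)
norm_final = sum(xi * xi for xi in x_final) ** 0.5
print(norm_final)
```

With a constant step size the iterates settle into a neighborhood of the minimizer whose size is set by the noise level, which is why the convergence results are stated in terms of a target precision.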
Lemma 5 (restatement of Corollary 1 in Khaled). Choose precision ϵ>0 and step size
we have that SGD converges with
Here E[⋅] denotes the expectation with respect to xt, which is a random variable due to the stochasticity in the gradient. Let us now consider a loss function which, in addition to the previous conditions, is also α-Polyak-Lojasiewicz (Definition 3). We consider the following iterative learning rate scheme for γt.
Lemma 6 (restatement of Lemma 3 in Khaled). Consider a sequence $(r_t)_t$ satisfying
$$r_{t+1}\le(1-a\gamma_t)\,r_t+c\,\gamma_t^2,$$
where γt≤1/b for all t≥0 and a,c≥0 with a≤b. Fix T>0 and let
Then choosing the step size as
For this learning rate scheme, Khaled et al. proved the following SGD convergence result.
Lemma 7 (restatement of Corollary 2 in Khaled). Choose precision ϵ>0 and step size γt following Lemma 6 with
Then provided that
we have that SGD converges with
We prove Theorem 1, which is repeated here for completeness.
Theorem 1 (QBM training). Given a QBM defined by a set of n-qubit Pauli operators $\{H_i\}_{i=1}^{m}$, a precision κ for the QBM expectations, a precision ξ for the data expectations, and a target precision ϵ such that
iterations of stochastic gradient descent on the relative entropy S(η∥ρθ) with constant learning rate
we have
where $\mathbb{E}[\cdot]$ denotes the expectation with respect to the random variable t. Each iteration t∈{0, . . . , T} requires
preparations of the Gibbs state ρθ
Proof. The quantum relative entropy is L-smooth with $L=2m\max_i\|H_i\|_2^2$, and for Pauli operators $\|H_i\|_2=1$. Then, we can minimize the relative entropy by SGD and apply the convergence result in Lemma 5.
For the SGD algorithm we need an unbiased gradient estimator with bounded variance. We recall that the gradient of the relative entropy is given by $\partial_{\theta_i}S(\eta\|\rho_\theta)=\langle H_i\rangle_{\rho_\theta}-\langle H_i\rangle_\eta$. The target expectation values $\langle H_i\rangle_\eta$ are estimated as $\hat{h}_{i,\eta}$ from the data set, as described in Appendix E below. Note that $|\langle H_i\rangle_\eta-\hat{h}_{i,\eta}|\le\xi$, where ξ>0 is limited by the size of the data set. One can improve on ξ by collecting more data, as long as the number of samples is polynomial in n.
For estimating the QBM expectation values Hi
ρ
preparations of ρθ1. The success probability of the procedure is 1−{tilde over (λ)}. Thus, we can obtain estimators ĥi,ρ
We then use ĝθ
Since the variance can also be written as $\mathbb{E}\|\hat{g}_\theta\|^2-\|\nabla S(\eta\|\rho_\theta)\|^2$, we find that our setup is compatible with Equation (C5) for A=0, B=1, $C=m(\kappa^2+\xi^2)$. We choose
in Lemma 5. This yields a learning rate of
We conclude that after
iterations of SGD we have
Here δ0=S(η∥ρθ
in the shadow tomography protocol. This result, together with the sampling bound on the number of measurements of the shadow tomography
completes the proof of Theorem 1.
We now provide a proof for Theorem 2, which we restate here.
Theorem 2 (α-strongly convex QBM training). Given a QBM defined by a Hamiltonian ansatz $\mathcal{H}_\theta$ such that $S(\eta\|\rho_\theta)$ is α-strongly convex, a precision κ for the QBM expectations, a precision ξ for the data expectations, and a target precision ϵ such that
iterations of stochastic gradient descent on the relative entropy S(η∥ρθ) with learning rate
(see Appendix C.2 for the specific learning rate schedule), we have
Each iteration requires the number of samples given in Equation (C9).
In order to prove this theorem, we first show that η, ρopt and ρθ are ‘collinear’ with respect to the relative entropy.
Here, in going from the second to the third line, we used the fact that Tr[ηHi]=Tr[ρθ
Proof. S(ρθ
In addition, the α-strong convexity assumed by the theorem implies that S(ρθ
Looking at the case (see footnote 2) where
we find that after
iterations the expected relative entropy is S(ρθ
where we apply Pinsker's inequality in the first step, and we use the variational definition of trace distance in the last step. The maximization is over unitary matrices. Let us now consider unitary matrices defined as
These have the property that
Thus, we obtain
Footnote 2: Note that, depending on the problem-specific parameter δ0 and the free parameters κ and ξ, one could be in the other case of Lemma 7. One then follows the same steps shown here, and arrives at a slightly different, yet polynomial in n, number of steps.
Since ∥Hi∥=1 for Pauli operators. To solve the QBM learning to precision ϵ we choose
and conclude that
In this section we study the scenario where the user is interested in obtaining a certain precision on the quantum relative entropy, rather than on the difference in the expectation values. Again, due to a potential model mismatch, we discuss the relative entropy S(ρθ
We begin by training the QBM ρθ with SGD. Using Theorem 1, we can achieve |Hi
η−
Hi
ρ
Hi
η−
Hi
ρ
Hi
ρθ−
Hi
ρ
Similarly,
Thus
To minimize the quantum relative entropy to precision ϵ′, we choose
This determines the number of SGD iterations via Theorem 1. Note that the number of iterations remains polynomial in the system size n.
Finally we combine this result with Equation (C16) and obtain the implication
This proves Equation (6) in the main text.
In this appendix we first prove Theorem 3 in the main text, and then discuss various pre-training models.
For completeness we start by restating Theorem 3 from the main text.
Theorem 3 (QBM pre-training). Assume a target η and a QBM model ρθ=eΣ
where $\theta_{\mathrm{pre}}=[\chi_{\mathrm{pre}},0_{m-\tilde{m}}]$ and the vector $\chi_{\mathrm{pre}}$ of length $\tilde{m}$ contains the parameters for the terms $\{H_i\}_{i=1}^{\tilde{m}}$ at the end of pre-training. More precisely, starting from $\rho_\chi=e^{\sum_{i=1}^{\tilde{m}}\chi_i H_i}/Z$ and minimizing $S(\eta\|\rho_\chi)$ with respect to χ ensures Equation (D1) for any S(η∥ρχ
Proof. First we relate the difference in relative entropy between two parameter vectors in the full space to the difference in relative entropy of the pre-trained parameter space. In particular, for any real parameter vectors $\theta=[\chi,0_{m-\tilde{m}}]$ and $\theta'=[\chi',0_{m-\tilde{m}}]$ we have
Now using pre-training vectors $\theta_{\mathrm{pre}}=[\chi_{\mathrm{pre}},0_{m-\tilde{m}}]$ and $\theta_0=[\chi_0,0_{m-\tilde{m}}]=0$ we see that S(η∥ρχ
While conclusive, the above proof does not provide us with a method to find such a $\chi_{\mathrm{pre}}$, i.e., it is agnostic to the specific pre-training method. As a constructive example, let us consider finding $\chi_{\mathrm{pre}}$ with noiseless gradient descent on a subset of $\tilde{m}$ parameters. This means we update the subset parameters as χt=χt-1−γ{tilde over (∇)}S(η{tilde over (∥)}ρχ
Setting
we obtain S(η∥ρχ
S(η∥ρχ
which by our theorem above ensures Equation (D1). Note that the smoothness L here is the smoothness on the subset of parameters, which can be bounded by $L\le 2\tilde{m}\max_i\|H_i\|_2^2$.
Here we discuss possible pre-training models and strategies to optimize them. We focus on the models discussed in the main text: 1) a mean-field model, 2) a Gaussian Fermionic model, 3) nearest-neighbor quantum spin models. The advantage of the first two models is that they can be trained analytically. While for the nearest-neighbor models this is not possible, they satisfy the locality assumptions in Anshu and Haah, and hence have a strongly convex relative entropy.
2a. Mean-field Quantum Boltzmann Machine
We define the mean-field QBM by the parameterized Hamiltonian
Since this Hamiltonian has a simple structure, in which many terms commute, we can find the optimal parameters analytically. First, recall that the QBM expectation values are given by
For the mean-field Hamiltonian, we find
where we have defined ∥θi∥2=√{square root over (θix
From which the derivative follows as
In order to find the optimal QBM parameters for each qubit, i, we then solve the three coupled equations,
which corresponds to setting the QBM derivative in Equation (4) in the main text to zero. From the strict convexity of the relative entropy, we know this has one unique solution provided the target expectation values, $\langle\sigma_i^{x,y,z}\rangle_\eta$, form a consistent set, i.e. they come from a density matrix. We can find the solution by squaring the three equations, and adding them together, giving
Here we used that the argument of the tanh is always positive. Substituting this into Equation (D9) we then find the closed-form solution of the QBM parameters
In practice, the optimal parameters for an arbitrary mean-field QBM can be obtained by numerically evaluating this expression for the given target expectation values.
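The closed-form evaluation for a single qubit can be sketched in Python as follows. This is a minimal sketch under stated assumptions: a single-qubit Hamiltonian θx σx + θy σy + θz σz with the convention ρ = exp(H)/Z, for which the expectation values are (θk/∥θ∥) tanh(∥θ∥). The function names are hypothetical and the example target values are made up for illustration.

```python
import math

def mean_field_parameters(sx, sy, sz):
    """Invert <sigma^k> = (theta_k / |theta|) * tanh(|theta|) for one qubit."""
    r = math.sqrt(sx * sx + sy * sy + sz * sz)
    if r >= 1.0:
        raise ValueError("target expectations must lie strictly inside the Bloch ball")
    if r == 0.0:
        return (0.0, 0.0, 0.0)
    scale = math.atanh(r) / r  # |theta| = artanh(r), direction given by the target
    return (scale * sx, scale * sy, scale * sz)

def mean_field_expectations(tx, ty, tz):
    """Forward map: expectation values of the single-qubit Gibbs state."""
    norm = math.sqrt(tx * tx + ty * ty + tz * tz)
    if norm == 0.0:
        return (0.0, 0.0, 0.0)
    s = math.tanh(norm) / norm
    return (s * tx, s * ty, s * tz)

targets = (0.3, -0.2, 0.5)  # hypothetical target expectation values for one qubit
theta = mean_field_parameters(*targets)
recovered = mean_field_expectations(*theta)
print(theta, recovered)
```

Because the mean-field model factorizes over qubits, the same inversion can be applied independently to each qubit of the target.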
2b. Gaussian Fermionic Quantum Boltzmann Machine
The Gaussian Fermionic QBM has a parameterized, quadratic, Fermionic Hamiltonian
Here, $\vec{C}^\dagger=[c_1^\dagger,\ldots,c_n^\dagger,c_1,\ldots,c_n]$ is a vector containing n Fermionic mode creation and annihilation operators, which satisfy the Fermionic anticommutation relations $\{c_i,c_j^\dagger\}=\delta_{i,j}$ and $\{c_i,c_j\}=0$. These Fermionic operators can be expressed as strings of Pauli operators by the Jordan-Wigner transformation. $\tilde{\Theta}$ is the 2n×2n dimensional matrix containing the QBM model parameters θ, which can be identified as a Fermionic single-particle Hamiltonian. Note that this matrix needs to be Hermitian, and since terms like $c_1^\dagger c_1^\dagger$ are zero it has in total $n^2$ free parameters.
In order to find the optimal parameters, we use that the single-particle correlation matrix with entries $[\Gamma_{\rho_\theta}]_{ij}=\langle\vec{C}_i^\dagger\vec{C}_j\rangle_{\rho_\theta}$
We can solve this by first determining the target expectation values $\langle\vec{C}_i^\dagger\vec{C}_j\rangle_\eta$ and setting $\langle\vec{C}_i^\dagger\vec{C}_j\rangle_{\rho_\theta}=\langle\vec{C}_i^\dagger\vec{C}_j\rangle_\eta$. Then we use the fact that the Hamiltonian of a Gaussian Fermionic system can be written in the eigenbasis of the correlation matrix as
where $W_\eta$ and $\Lambda_\eta$ are given by the eigendecomposition $\Gamma_\eta=W_\eta\Lambda_\eta W_\eta^\dagger$, and $\sigma^{-1}(X)$ is the inverse sigmoid function. Thus, we (numerically) diagonalize $\Gamma_\eta$ and set the optimal Gaussian Fermionic QBM Hamiltonian equal to
Since the eigen decomposition of a Hermitian matrix is unique, we find one unique solution. This is in agreement with the strict convexity of the quantum relative entropy.
2c. Geometrically-Local Quantum Boltzmann Machine
The last type of restricted QBM model we discuss are the geometrically-local QBMs. We consider the same Hamiltonian as for a generic fully connected 2-local QBM [Equation (16)], but then with additional constraints on the locality of the Pauli operators. In particular, we focus on nearest-neighbor models on some d-dimensional lattice, e.g. a one-dimensional chain where each Pauli operator only acts on two neighbouring qubits. In full generality, the parameterized QBM Hamiltonian is given by
where we sum over the nearest-neighbour sites $\langle i,j\rangle$ of the lattice with periodic boundary conditions. In the main text we consider for example a d=1 lattice (a ring), and a d=2 square lattice.
In order to use these models for pre-training, we train them with SGD on the relative entropy until a fixed precision is reached. Importantly, as these Hamiltonians only have $m=\mathcal{O}(n)$ terms and a finite interaction range, Anshu and Haah show that the quantum relative entropy is strongly convex. Therefore, the optimization is guaranteed to converge quickly to the global optimum, recall Theorem 2. However, this requires obtaining Gibbs state expectation values of geometrically local Hamiltonians. This can be done with a quantum computer, or potentially classically with tensor networks; see Kuwahara et al., “Improved thermal area law and quasilinear time algorithm for quantum Gibbs states”, Phys. Rev. X 11, 011047 (2021) and Alhambra et al., “Locally accurate tensor networks for thermal states and time evolution”, PRX Quantum 2, 040331 (2021).
In this appendix we review how to embed classical data into a target density matrix η. We will follow the approach for quantum spin models in Kappen, “Learning quantum models from quantum or classical data”, Journal of Physics A: Mathematical and Theoretical 53, 214001 (2020) (hereinafter “Kappen”). We also show how to extend this formalism to Fermionic quantum models needed for the pre-training of our Gaussian Fermionic QBM. Lastly, we describe the two different targets used for the numerical simulations in the main text.
Following the approach in Kappen, one way to encode a classical dataset consisting of M bit strings $\{\vec{s}^{\,\mu}\in\{0,1\}^n\}_{\mu=1}^{M}$ into a quantum state is by defining the pure state
$$\eta=|\psi\rangle\langle\psi|$$
with
$$|\psi\rangle=\sum_{\vec{s}}\sqrt{q(\vec{s})}\,|\vec{s}\rangle.$$
Here
$$q(\vec{s})=\frac{1}{M}\sum_{\mu=1}^{M}\delta_{\vec{s},\vec{s}^{\,\mu}}$$
is the classical empirical probability for bitstring $\vec{s}$, and $|\vec{s}\rangle$ is a computational basis state indexed by $\vec{s}$. The $q(\vec{s})$ can be found by counting the bitstrings in the data set $\{\vec{s}^{\,\mu}\}$. From $|\psi\rangle$ one can compute expectation values such as
for the Pauli spin operator $\sigma_i^z$. This can be efficiently computed classically for a polynomially sized dataset, i.e. for polynomially many $\vec{s}^{\,\mu}$. Computing such expectation values from η is possible for all 1- and 2-local Pauli operators as shown in Kappen.
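The counting procedure described above can be sketched in a few lines of Python. The sketch is illustrative and rests on assumptions that are not fixed by the text: the data are encoded with amplitudes √q(s), bit value 0 is mapped to the +1 eigenvalue of σz, and the toy data set and function names (`empirical_distribution`, `expect_z`, `expect_x`) are hypothetical.

```python
from collections import Counter
import math

def empirical_distribution(data):
    """Empirical probability q(s) of each bit string, obtained by counting."""
    counts = Counter(tuple(s) for s in data)
    return {s: c / len(data) for s, c in counts.items()}

def expect_z(q, i):
    """<sigma^z_i> for |psi> = sum_s sqrt(q(s)) |s>, assuming bit 0 -> +1, bit 1 -> -1."""
    return sum(p * (1 - 2 * s[i]) for s, p in q.items())

def expect_x(q, i):
    """<sigma^x_i>: overlap between each bit string and its copy with bit i flipped."""
    total = 0.0
    for s, p in q.items():
        flipped = s[:i] + (1 - s[i],) + s[i + 1:]
        total += math.sqrt(p * q.get(flipped, 0.0))
    return total

data = [(0, 0), (0, 0), (1, 1), (0, 1)]  # toy data set of M = 4 bit strings
q = empirical_distribution(data)
print(expect_z(q, 0), expect_z(q, 1), expect_x(q, 0))
```

Only bit strings that actually occur in the data set are visited, so the cost scales with the number of distinct bit strings rather than with the 2^n dimension of the state.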
We now show that we can generalize this encoding to Fermionic QBMs, i.e. the terms in the Hamiltonian ansatz consist of Fermionic creation $c_i^\dagger$ and annihilation operators $c_i$. We define $|\vec{s}\rangle$ to be equal to the Fermionic Fock basis. This is analogous to the computational basis in the spin-picture (by the Jordan-Wigner transformation), but the bit-strings $\{\vec{s}^{\,\mu}\in\{0,1\}^n\}_{\mu=1}^{M}$ in the data set should now be interpreted as occupation-number vectors of Fermions. Note that the occupation number basis is defined by the eigenstates of the Fermionic number operator $\sum_i c_i^\dagger c_i$.
The creation and annihilation operators act on the Fock-basis states as follows
where is the unit bit-string with a 1 at position i and zeros everywhere else. With these relations we can derive the required expectation values for the target η to train the (Gaussian) Fermionic QBM
where $F_i$ flips the Fermion occupation number (from occupied to unoccupied and vice versa) of index i in the vector $\vec{s}$.
For the numerical simulations in the main text we use two different targets η: 1) a target constructed from a quantum source, and 2) a classical data set embedded into η 1 using the encoding above. For the quantum source we use the XXZ model Hamiltonian
Here J and Δ are the model parameters describing the Heisenberg interactions between the quantum spins on a one-dimensional lattice, and hz the strength of an external magnetic field. We set
with J=−0.5, Δ=−0.7 and hz=−0.8, and compute the expectation values $\langle H_i\rangle_\eta$ classically. This is intractable in general, but our aim is to replicate the scenario in which the expectation values are measured experimentally, for example from a state prepared on a quantum device.
For the classical source, we use the classical salamander retina dataset given in Tkačik et al., “Searching for collective behavior in a large network of sensory neurons”, PLOS Computational Biology 10, 1 (2014). This data set consists of bit-string data of different features of the response of cells in salamander retina. We select the first 8 features and trim the data to the first 10 data recordings. We then construct the expectation values $\langle H_i\rangle_\eta$ from the procedure outlined above.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2309523.5 | Jun 2023 | GB | national |