SYSTEM AND METHOD FOR PERFORMING MACHINE LEARNING USING A QUANTUM COMPUTER

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority under 35 U.S.C. § 119 to United Kingdom Application GB 2309523.5, filed Jun. 23, 2023, the entire contents of which are incorporated herein by reference.

FIELD OF THE DISCLOSURE

The present application relates to a system and method for performing machine learning using a quantum computer.

BACKGROUND

Machine learning (ML) research has developed into a mature discipline with applications that impact many different aspects of society. Neural network and deep learning architectures have been deployed for tasks such as facial recognition, recommendation systems, time series modelling, and for analysing highly complex data in science. In addition, unsupervised learning and generative modelling techniques are widely used for text, image, and speech generation tasks, which many people encounter regularly via interaction with chat bots and virtual assistants. Thus, the development of new machine learning models and algorithms can have significant consequences for a wide range of industries, and more generally, for society as a whole.

Recently, researchers in quantum information science have started to investigate whether quantum algorithms which are implemented on quantum computing hardware may offer advantages over conventional machine learning algorithms implemented on classical computing devices. This has led to the development of quantum algorithms for computational tasks associated with various aspects of ML, such as gradient descent, classification, generative modelling, reinforcement learning, as well as many other tasks. Further examples of the development of quantum systems for use in ML can be found in U.S. Pat. No. 11,157,828 and U.S. Patent Publication 2020/0279185.

However, in most cases it is not straightforward to generalize results from the conventional (classical) ML realm into the quantum ML realm. Rather, various factors must be reconsidered, such as data encoding, training complexity and sampling in the quantum machine learning (QML) setting. For example, there are some questions relating to how large data sets (such as may occur in many ML contexts) may be efficiently embedded into quantum states in such a way that a genuine quantum speedup is achieved. Furthermore, as quantum states prepared on quantum devices can only be accessed via sampling, one cannot estimate properties with arbitrary precision. One particular problem is gradient vanishing in the training of variational quantum algorithms, this is also known as “barren plateaus”. Accordingly, there is ongoing interest in further developing systems that include quantum computing platforms (also referred to herein as quantum hardware, quantum devices, quantum computers, quantum computing hardware and so on) to provide enhanced support for ML.

BRIEF SUMMARY OF EMBODIMENTS

A system and method are provided for performing machine learning using a quantum computer. The method includes providing a model comprising a Quantum Boltzmann machine with a Hamiltonian ansatz having a set of operators and a set of parameters. The method further includes performing a first stage of training the model against data from a target using a selected subset of the set of operators to obtain optimized values for a subset of the set of parameters. The first stage of training is performed on classical computing hardware to provide a partly trained model. The method further includes performing a second stage of training the model against data from the target using a larger subset of the set of operators to obtain optimized values for a larger subset of the set parameters for the model. The second stage of training is at least partly performed using quantum computer hardware. The optimized parameter values from the first stage of training are used to initialize the corresponding parameters for the second stage of training.

The second stage of the training can be performed iteratively with a larger subset of operators and/or a larger subset of parameters in each iteration, to provide a trained Quantum Boltzmann Machine (hereinafter “QBM”) in which the difference in expectation values between parameter values of a target probability distribution and values of parameters output by the model is reduced. The first stage and second stage training, and potential further iterations, can provide a trained QBM that more accurately represents the Hamiltonian. We show that performance of incremental QBM learning can take advantage of recent and expected future advances in quantum computing hardware, as described below with reference to example quantum computing hardware.

By performing a first stage of training of a Quantum Boltzmann Machine (for example “pre-training” on a classical computing device, the second stage of training (and any subsequent iteration) starts with parameters that have been initialized to enable optimisation in the next stage of the training. A computer system for implementing embodiments may comprise classical binary computer hardware coupled to a quantum computer, making use of the resources of the classical computer for the first stage of training and then exploiting the quantum computer's probabilistic representation of quantum states of a target real-world quantum system for a second stage of training that improves the model.

BRIEF DESCRIPTION OF THE FIGURES

Various examples and implementations of the disclosure will now be described in detail by way of example only with reference to the following figures:

FIG. 1 presents a high-level schematic diagram of an example of a method as disclosed herein for performing machine learning using a quantum computer.

FIGS. 2A, 2B and 2C (collectively referred to herein as FIG. 2), present schematic diagrams showing various results from using an example of a method as disclosed herein for performing machine learning.

FIG. 3 presents a high-level flowchart of an example of a method as disclosed herein for performing machine learning using a quantum computer.

FIG. 4 is a schematic diagram showing various hardware and software components of an example of the system described herein.

FIG. 5 presents two plots showing the minimum eigenvalue of a Hessian, as a function of the number of qubits, for a 1D nearest neighbour Hamiltonian and (b) for a fully-connected Hamiltonian.

DETAILED DESCRIPTION OF EMBODIMENTS
Overview

Quantum Boltzmann machines (QBMs) are machine-learning models which can be used with both classical and quantum data. An operational definition of QBM learning is presented in terms of the difference in Gibbs expectation values between the model and target, taking into account the polynomial size of the data set.

In other words, the QBM acts as a model which is trained to emulate a target. The target in effect defines a system and associated behaviour. In general, the target is not known per se, but samples of the system behaviour may be obtained. The QBM learning (training) involves obtaining samples of the target and corresponding samples from the QBM (model), and updating the model such that latter becomes more closely aligned with the former.

It is shown herein that with stochastic gradient descent, a machine learning solution may be obtained using at most a polynomial number of Gibbs states (where the Gibbs states can be regarded as providing samples of the model). One implication of this finding is that there are no barren plateaus in QBM learning (such as those without hidden units). It is also shown that pre-training on a subset of the QBM parameters can lower the sample complexity bounds. Various pre-training strategies are proposed based on mean-field, Gaussian Fermionic, and geometrically local Hamiltonians (additional models are available that likewise support training on a classical computer). The models and theoretical findings proposed herein have been verified numerically on a quantum and a classical data set. The results presented herein show that QBMs may provide promising machine learning models for training on present and future quantum devices.

In some implementations, a Hamiltonian ansatz is prepared that is very well suited for a particular quantum computing device. After exhausting all available classical computing resources during a first training phase (also referred to herein as pre-training), the model may be enlarged to continue the training on the quantum computing device to further enhance overall performance. As quantum hardware steadily matures, this supports the execution of deeper circuits and further increases of the model size. Incremental QBM training strategies may be designed to follow the quantum hardware road map towards training ever larger and more expressive quantum machine learning models.

INTRODUCTION

As described herein, a system and method have been developed for training a quantum Boltzmann machine (QBM) to obtain optimal parameter values. The QBM training results in a model that emulates a target data set, and helps to address some of the issues identified above for implementing ML in a quantum environment. A QBM can be regarded as a generalisation of a classical Boltzmann machine, which is a form of stochastic neural network with nodes linked by weighted connections.

In particular, QBMs are physics-inspired ML models that generalize a classical Boltzmann machine to a quantum Hamiltonian ansatz (an ansatz can be considered as a trial solution to a given problem). A QBM can therefore be considered as providing a certain generic type of ML model, while the Hamiltonian ansatz particularizes the system to the given problem, for example by defining the input parameters for the QBM.

The (quantum) Hamiltonian ansatz can be defined on a graph where each vertex represents a qubit (or a qudit) and each edge represents an interaction (broadly, a qubit is a quantum computing counterpart of a hardware bit in a conventional/classical machine, whereas qudits can represent multi-level systems). The task is to learn the strengths of the interactions (weights), such that samples from the output quantum state of the QBM mimic samples taken from the target data set. For the present approach, the QBM may be trained with polynomial sample complexity on quantum computers. The power and benefits of such an approach will grow in parallel with the rapid development of quantum computing platforms (such as hardware systems that support increasing numbers of qubits and implement error detection or correction for fault tolerance).

The development of quantum generative models of this kind is expected to be useful in machine learning, for addressing (for example) science problems by learning approximate descriptions of the experimental data. QBMs may also play an important role as components of larger QML models (this is similar to how classical BMs can provide good weight initializations for the training of deep neural networks). One advantage of using a QBM rather than a classical BM is that a QBM is more expressive, since the Hamiltonian ansatz can contain more general non-commuting terms. This means that in some settings the QBM outperforms a classical BM, even for classical target data.

In order to help obtain results which have good practical relevance, an operational definition of QBM learning is adopted. Instead of focusing on an information-theoretic measure, we assess the QBM learning performance by the difference in Gibbs expectation values between the target and the model. This takes into account that the (classical) target data set comprises polynomially many data samples, hence its properties have a polynomial precision. Stochastic gradient descent methods are employed in combination with shadow tomography to show that this problem can be solved using polynomially many evaluations of the QBM model. Each evaluation of the model requires the preparation of one Gibbs state and, therefore, we refer to the sample complexity as the required number of Gibbs state preparations.

The Gibbs states used for QBM learning may be prepared and sampled on a quantum computer by a variety of methods. For present purposes, the focus is on the sampling complexity, rather than any specific Gibbs sampling implementation.

In practice, QBM learning allows for great flexibility in model design, and therefore time complexity. It is also shown below that the required number of Gibbs samples for QBM learning can be improved by pre-training on a subset of the parameters of the QBM. In other words, classically pre-training a simpler model can potentially reduce the (quantum) training complexity. For instance, it is possible to analytically pre-train a mean-field QBM and a Gaussian Fermionic QBM. In addition, it is shown below that a geometrically local QBM with gradient descent may be pre-trained, which provides some improved performance guarantees. As described herein, these exactly solvable models may be used for training and/or pre-training of QBMs. Further, classical numerical simulation results are presented which confirm the analytical findings.

Problem Definition

We start by formally setting up the quantum Boltzmann machine (QBM) learning problem, providing the definitions of the target and model, and a description of how to assess the performance based on the precision of the expectation values. These definitions and assumptions help to obtain the results described herein, and are introduced below, along with their motivation. In addition, the problem definition described herein is compared to other related problems in the literature, such as quantum Hamiltonian learning.

We consider an n-qubit density matrix r as the target of the machine learning problem. If the target is classical, n could represent the number of features, e.g., the pixels in black-and-white pictures, or more complex features that have been extracted and embedded in the space of n qubits. If the target is quantum, n could represent spin-½ particles, but again more complex many-body systems can be embedded in the space of n qubits. In the literature, it is often assumed that algorithms have direct and simultaneous access to copies of η, however, this assumption is not adopted herein. Instead, a setup is considered in which access is limited to classical information about the target. For a data set custom-character ={s^μ} of N independent data samples s^μ, and assuming the data set can be efficiently stored in a classical memory, the amount of memory required to store each data sample is polynomial in n, and there are polynomially many samples. For example, s may be bitstrings—this includes data sets like binary images and time series data, categorical and count data, and binarized continuous data. As another example, the data may originate from measurements on a quantum system. In this case s identifies an element of the positive operator-valued measure describing the measurement.

Next, we define the machine learning model which is used herein for data fitting. The fully-visible QBM is an n-qubit mixed quantum state of the form

$\begin{matrix} ρ_{θ} = \frac{e^{ℋ_{θ}}}{Z}, & (1) \end{matrix}$

where Z=Tr[ custom-character ] is the partition function. The parameterized Hamiltonian is defined as

$\begin{matrix} ℋ_{θ} = \sum_{i = 1}^{m} θ_{i} H_{i}, & (2) \end{matrix}$

where θ∈ custom-character ^mis the parameter vector, and {H_i} is a set of m Hermitian and orthogonal operators acting on the 2ⁿ-dimensional Hilbert space. For example, these could be n-qubit Pauli operators, Fermionic operators, or any other suitable operator. As the true form of the target density matrix is unknown, the choice of operators {H_i} in the Hamiltonian is chosen without certainty that the choice is optimal. It is possible that, once the Hamiltonian ansatz is chosen, the space of QBM models does not contain the target, i.e., ρ₀≠η, ∀θ. This is called a model mismatch, and it may be unavoidable in machine learning. In particular, since we require the number of operators m to be polynomial in n, ρ_θcannot encode an arbitrary density matrix.

A natural measure to quantify how well the QBM ρ_θapproximates the target η is the quantum relative entropy:

$\begin{matrix} S (η  ρ_{θ}) = Tr [η \log η] - Tr [η \log ρ_{θ}] . & (3) \end{matrix}$

This measure generalizes the classical Kullback-Leibler divergence to density matrices. The relative entropy is exactly zero when the two densities are equal, η=ρ_θ, and S>0 otherwise. In addition, when S(η∥ρ_θ)≤ϵ, by Pinsker's inequality, all possible Pauli expectation values are within custom-character (√{square root over (ϵ)}), see Appendix C.

In theory one can minimize the relative entropy S(η∥ρ_θ) in order to find the optimal model parameters θ^opt=argmin_θS(η∥ρ_θ). The form of the partial derivatives of the relative entropy can be computed analytically and reads

$\begin{matrix} \frac{\partial S (η  ρ_{θ})}{\partial θ_{i}} = H_{i} ρ_{θ} - H_{i} η . & (4) \end{matrix}$

This is the difference between the target and model expectation values of the operators that are chosen in the ansatz. A stationary point of the relative entropy is obtained when custom-character H_i_τ_θ=H_i_ηfor i∈{1, . . . , m}. Since S is strictly convex, see FIG. 3 below and Appendix B, this stationary point is the unique global minimum.

Quantifying how well the QBM is trained by means of the relative entropy has some issues in practice. An accurate estimate of S(η∥ρ_θ) generally involves access to the entropy of the target and the partition function of the model. Due to the model mismatch, which is expected because we are choosing m operators out of exponentially many potential operators, the optimal QBM may have S(η∥ρ_θ_opt)>0, and the optimal value is not known in advance. Therefore, in this work, an operational definition of QBM learning is based instead on the size of the gradient ∇S(η∥ ρ_θ).

Definition 1 (QBM learning problem). Given a polynomial-space data set {s^μ} obtained from an n-qubit target density matrix η, a target precision ϵ>0, and a fully-visible QBM with Hamiltonian custom-character _θ=Σ_i=1^mθ_iH_i, find a parameter vector θ such that with high probability

$\begin{matrix} ❘ H_{i} ρ_{θ} - H_{i} η ❘ \leq ϵ, \forall i . & (5) \end{matrix}$

A solution to the QBM learning problem always exists by Jaynes' principle: given a set of target expectations { custom-character H_i_η}, there exists a Gibbs state ρ_θ_opt=e^Σθⁱ^opt^Hⁱ/Z such that H_i_ρ_θ_opt−H_i_η|=0, ∀i. However, due to the polynomial size of the data set we can only compute properties of the target (and model) to finite precision. (For example, suppose that s^μare data samples from some unknown probability distribution P(s) and that we are interested in the sample mean. An unbiased estimator for the mean is

$\hat{s} = \frac{1}{M} \sum_{μ = 1}^{M} s^{μ} .$

The variance of this estimator is σ²/M, where σ²is the variance of P(s). By Chebyshev's inequality, with high probability the estimation error is of order σ/√{square root over (M)}. The polynomial size of the data set implies that the error decreases polynomially in general). Therefore, we say the QBM learning problem is solved for any precision ε>0 in Equation (5), whereby the expectation values of the QBM and the target should be close enough that one cannot distinguish them without enlarging the data set.

The expectation values of the target can be obtained from the data set in various ways. For example, for the generative modeling of a classical binary data set one can define a pure quantum state and obtain its expectation values (see Appendix E). For the modeling of a target quantum state (density matrix) one can estimate expectation values from the outcomes of measurements performed in different bases.

As shown in Appendix C3, the solution to the QBM learning problem implies a bound on the optimal relative entropy, namely

$\begin{matrix} S (η  ρ_{θ}) - S (η  ρ_{θ^{opt}}) \leq 2 ϵ { θ - θ^{opt} }_{1} . & (6) \end{matrix}$

This indicates that if the QBM learning problem can be solved to precision ϵ≤ϵ′/(2∥θ−θ^opt∥₁), one can also solve a stronger learning problem based on the relative entropy to precision ϵ′ (this involves bounding∥θ−θ^opt∥₁)

Results

We approach the QBM learning problem by iteratively minimizing the quantum relative entropy, see Equation (3), in this example using stochastic gradient descent (SGD). This involves access to a stochastic gradient ĝ_θ_tcomputed from a set of samples at time t, and the gradient has the form given in Equation (4) above. For the target expectation values in 4) above these are estimated from a random subset of the data set (sometimes referred to as a mini-batch). The mini-batch size is a hyper-parameter and determines the precision of ξ of each target expectation. Similarly, the QBM model expectations are estimated using classical shadows of the Gibbs state ρ_θ_tapproximately prepared on a quantum device. The number of measurements is also a hyper-parameter and determines the precision κ of each QBM expectation.

It is assumed that the stochastic gradient is unbiased, i.e., custom-character [ĝ_θ_t]=∇S(η∥ρ_θ)|_θ=θ_t, and that each entry of the vector has bounded variance. At iteration t, SGD updates the parameters as

$\begin{matrix} θ^{t + 1} = θ^{t} - γ^{t} {\hat{g}}_{θ^{t}}, & (7) \end{matrix}$

where γ^tis the learning rate.

With this method, the QBM learning problem may be solved with polynomial sample complexity. We state this in the following theorem, which is an important aspect of the approach described herein.

Theorem 1 (QBM training). We have a QBM defined by a set of n-qubit Pauli operators {H_i}_i=1^m, a precision κ for the QBM expectations, a precision ξ for the data expectations, and a target precision ϵ such that κ²+ξ²≥ϵ/2m. After

$\begin{matrix} T = \frac{48 δ_{0} m^{2} (κ^{2} + ξ^{2})}{ϵ^{4}} & (8) \end{matrix}$

iterations of stochastic gradient descent on the relative entropy S(η∥ρ_θ) with constant learning rate

$γ^{t} = \frac{ϵ}{4 m^{2} (κ^{2} + ξ^{2})},$

we have

$\begin{matrix} \min_{t = 1, \dots T} 𝔼 ❘ {〈 H_{i} 〉}_{ρ_{θ^{t}}} - {〈 H_{i} 〉}_{η} ❘ \leq ϵ, \forall i, & (9) \end{matrix}$

where custom-character [ . . . ] denotes the expectation with respect to the random variable θ^t. Each iteration t∈{0, . . . , T} involves

$\begin{matrix} N \in 𝒪 (\frac{1}{κ^{4}} \log \frac{m}{1 - λ^{\frac{1}{T}}}) & (10) \end{matrix}$

preparations of the Gibbs state pet, and the success probability of the full algorithm is λ. Here, δ₀=S(η∥ρ_θ₀)−S(η∥ρ_θ_opt) is the relative entropy difference with the optimal model ρ_θ_opt.

The success probability is the probability that the QBM expectation values are determined correctly. It is a free parameter which can be set to a value for performing the experiment and determines how many measurements are to be performed.

A detailed proof of this theorem is given in Appendix C2 and includes carefully combining three important observations and results. First, it is shown that the quantum relative entropy for any QBM ρ_θis L-smooth with L=2m max_j∥H_j∥₂². This is then combined with SGD convergence results from the machine learning literature to obtain the number of steps T. Finally, sampling bounds from quantum shadow tomography are used to obtain the number of preparations N. This last step focuses on the shadow tomography protocol, which normally restricts the results to Pauli observables H_i≡P_i, thus ∥H_i∥₂=1. It is possible to extend this to generic two-outcome observables with a polylogarithmic overhead compared to Equation (10), see Appendix C2. Furthermore, for k-local Pauli observables, we can improve the result to

$\begin{matrix} N \in 𝒪 (\frac{3^{k}}{κ^{_{2}}} \log \frac{m}{1 - λ^{1 / T}}) & (11) \end{matrix}$

with classical shadows constructed from randomized measurement or by using pure thermal shadows.

By combining Equations (8) and (10), we see that the final number of Gibbs state preparations N_tot≥T×N scales polynomially with m, the number of terms in the QBM Hamiltonian. According to our assumption of classical memory, we can only have m∈ custom-character (poly(n)). This means that the number of required measurements to solve QBM learning scales polynomially with the number of qubits (features). Consequently, there are no barren plateaus in the optimization landscape for this problem, where a barren plateau of a loss function f(θ) is defined by the vanishing of its gradient E[∇_θf(θ)]=0, and also by an exponentially decreasing variance of the gradient var[∇_θf(θ)]<O(2⁻ⁿ).

The following theorem is proved in Appendix C2.

Theorem 2 (α-strongly convex QBM training). We have a QBM defined by a Hamiltonian ansatz custom-character _θsuch that S(ƒ∥ρ_θ) is α-strongly convex, a precision κ for the QBM expectations, a precision for the data expectations, and a target precision ϵ such that κ²+ξ²≥ϵ/2m.

After

$\begin{matrix} T \geq \frac{18 m^{2} (κ^{2} + ξ^{2})}{α^{2} ϵ^{2}} & (12) \end{matrix}$

iterations of stochastic gradient descent on the relative entropy S(η∥ρ_θ) with learning rate

$γ^{_{t}} \leq \frac{1}{4 m^{2}}$

(see Appendix C.2 for the specific learning rate schedule), we have:

$\begin{matrix} \min_{t = 1, \dots T} 𝔼 ❘ {〈 H_{i} 〉}_{ρ_{θ^{t}}} - {〈 H_{i} 〉}_{η} ❘ \leq ϵ, \forall i . & (13) \end{matrix}$

Each iteration involves the number of samples given in Equation (10).

The sample bound in Theorem 1 depends on δ₀, the relative entropy difference of the initial and optimal QBMs. This means that if we can lower the initial relative entropy, we also tighten the bound on the QBM learning sample complexity. In this respect, it is shown that δ₀can be reduced by pre-training a subset of the parameters in the Hamiltonian ansatz. Thus, pre-training reduces the number of steps to reach the global minimum.

Theorem 3 (QBM pre-training). Assume a target η and a QBM model ρ_θ=e^Σⁱ^θⁱ^Hⁱ/Z for which we seek to minimize the relative entropy S(η∥ρ_θ). Initializing at θ⁰=0 and pre-training S(η∥ρ₀) on any subset {tilde over (m)}≤m of the parameters (Hamiltonian operators {H_i}_i=1^{{tilde over (m)}}) ensures that

$\begin{matrix} S (η || p_{θ^{p r e}}) \leq S (η || p_{θ^{0}}), & (14) \end{matrix}$

where θ^pre=[χ^pre,0_{m-{tilde over (m)}}] and the vector χ^preof length {tilde over (m)} contains the parameters for the terms {H_i}_i=1^{{tilde over (m)}}at the end of pre-training. More precisely, starting from

$ρ_{χ} = e^{\sum_{i = 1}^{\tilde{m}} \sum_{i} χ_{i} H_{i}} / Z$

and minimizing S(η∥ρ_θ) with respect to χ ensures Equation (14) for any S(η∥ρ_χ_pre)≤S(η∥ρ_χ₀).

We provide a detailed proof of Theorem 3 in Appendix D.1 which applies to any method that is able to minimize the relative entropy with respect to a subset of the parameters. All the other parameters are fixed to specific particular values, generally (but without limitation) zero, and the pre-training starts from the maximally mixed state II/2ⁿ. For example, one could use SGD as described above, and apply updates only to the chosen subset of parameters. With a suitable learning rate, this ensures that pre-training lowers the relative entropy compared to the maximally mixed state S(η|II/2n). As a consequence, it is possible to add additional, linear independent, terms to the QBM ansatz without having to retrain the model from scratch. The performance is guaranteed to improve, specifically towards the global optimum due to the strict convexity of the relative entropy. This is in contrast to other QML models which do not have a convex loss function. This is particularly useful if a certain subset of the QBM ansatz is pre-trained classically before training the full model on a quantum device. For example, in Appendix D.2 mean-field and Gaussian Fermionic QBM pre-training models are presented with closed-form expressions for the optimal subset of parameters.

FIG. 1 presents a high-level schematic diagram of an example of a method as disclosed herein for performing machine learning using a quantum computer. FIG. 1 comprises three boxes—the left-hand box depicts the inputs, the right-hand box depicts the outputs, and the central box depicts the processing used to derive the outputs from the inputs.

In particular, the input data comprises two components. The first component is a QBM associated with a Hamiltonian ansatz. This first component in effect represents the ML model which is to be trained. The second component comprises a set of data values (samples), which represent Hamiltonian expectation values for the target. For example, the data may represent measurements performed on a target quantum system. This set of samples has a polynomial size with respect to the size of the QBM (which corresponds to the number of qubits used for a quantum-based implementation of the QBM). This polynomial scaling converges using accessible levels of computing resources.

The output data (right-hand image) corresponds to the QBM and ansatz Hamiltonian model shown as the input data (left-hand image) after training the QBM mode using the target data set also shown in the input data. The right-hand image also depicts new samples s˜ρ_θ_Twhich are sample outputs provided by the trained QBM.

The central box in FIG. 1 represents training the QBM model based on the data set from the target. This training is, in this example, performed using stochastic gradient descent (SGD) based on the relative entropy level between the target and the model. Thus FIG. 1 shows a sequence of models θ⁰, θ¹. . . θ^Tin effect representing successive generations of the trained QBM model. A minimum is taken to occur when the difference between the model Hamiltonian expectation and the target Hamiltonian expectation is less than a set threshold, ϵ, for each sample i (see Definition 1). For θ^opt, ϵ=0, is the exact solution given by Jaynes principle. In theory this is the best solution that can ever be achieved with SGD, in reality SGD cannot get arbitrarily close and is happy to achieve a fixed (specified) precision ϵ>0 corresponding to θ^T.

As discussed herein, in the procedure shown in FIG. 1, pre-training on a classical computer may be utilized to lower the relative entropy, thereby facilitating subsequent full (quantum-based) training, which can be initialized according to the lower relative entropy configuration produced by the pre-training.

The central box of FIG. 1 shows an operational definition of the QBM learning problem in terms of expectation values, namely | custom-character H_i_ρ_θ−H_i_η|≤ϵ, ∀i, whereby the respective model and target expectations must be close to within a polynomial precision ϵ (see Theorems 1 and 2). As shown herein, the QBM learning can then be solved by minimizing the quantum relative entropy S(η∥ρ_θ) with SGD and using a polynomial number of Gibbs states (see Theorems 1 and 2). It is further shown with Theorem 3 that the pre-training strategies which optimize a selected subset of the QBM parameters are guaranteed to lower the initial quantum relative entropy. The SGD algorithm outputs the QBM model parameters θ^Tin a polynomial number of steps (iterations) T, and these can be used as a trained system for using samples of new data to provide predicted outcomes.

Accordingly, FIG. 1 shows a configuration in which the problem input is a data set of size polynomial in the number of features/qubits, and an ansatz for the QBM model with parameters θ. In Definition 1 an operational definition is provided of the QBM learning problem where the model and target expectations must be close to within a polynomial precision ϵ. A solution θ^optis guaranteed to exist by Jayne's principle. With Theorems 1 and 2 it is established that QBM learning can be solved by minimizing the quantum relative entropy S(η∥ρ_θ) with respect to θ using SGD. This involves a polynomial number of Gibbs state preparations. With Theorem 3, it is shown that pre-training strategies that optimize a subset of θ^preof the QBM parameters are guaranteed to lower the initial quantum relative entropy. The algorithm outputs a solution θ^Tto the problem in a polynomial number of steps T. The trained QBM can be used, for example, to generate new synthetic data.

Numerical Experiments

To further investigate the above theoretical findings, numerical experiments were performed of QBM learning on data sets constructed from a quantum source and a classical source. First, we focus on reducing the initial relative entropy S(η∥ρ_θ₀) by QBM pre-training, following Theorem 3. Mean-field (MF), Gaussian Fermionic (GF), and geometrically local (GL) models are considered as potential pre-training strategies. The Hamiltonian ansatz of an MF model includes all possible one-qubit Pauli terms {H_i}_i=1³ⁿ={σ_i^x,σ_i^y,σ_i^z}_i=1ⁿas per Equation (2) and hence has 3n parameters. The Hamiltonian of the GF model has a quadratic form custom-character _θ^GF=Σ_i,j{tilde over (θ)}_im{right arrow over (C)}_i^†{right arrow over (C)}_jof Fermionic creation and annihilation operators, where {tilde over (θ)}_ijis the 2n×2n Hermitian parameter matrix which has n²free parameters. Here {right arrow over (C)}^†=[c₁^†, . . . , c_n^†, c₁, . . . , c_n], with the operators satisfying {c_i,c_j}=0 and {c_i^†,c_j}=δ_ij, where {A,B}=AB+BA is the anti-commutator. The advantage of the MF and GF pre-training is that there exists a closed-form solution given target expectation values custom-character H_i_η. This is shown in Appendix D.

In contrast, the GL models are defined with a Hamiltonian ansatz

$\begin{matrix} ℋ_{θ}^{G L} = \sum_{k = x, y, z} \sum_{〈 i, j 〉} λ_{i j}^{k} σ_{i}^{k} σ_{j}^{k} + \sum_{i}^{n} γ_{i}^{k} σ_{i}^{k}, & (15) \end{matrix}$

for which, in general, the parameter vector {right arrow over (θ)}≡{λ,σ} cannot be found analytically. Here the sum custom-character i,j imposes some constraints on the (geometric) locality of the model, i.e., summing over all possible nearest neighbors in some d-dimensional lattice. In particular we choose one- and two-dimensional locality constraints suitable with the assumptions given in the literature. In these specific cases the relative entropy is strongly convex, thus pre-training with SGD has the improved performance guarantees from Theorem 2.

FIGS. 2A, 2B and 2C (collectively referred to herein as FIG. 2) present schematic diagrams showing various results from using an example of a method as disclosed herein for performing machine learning. FIG. 2A shows the initial relative entropy S(η∥ρ_θ_pre) (y-axis) after various forms of pre-training using models for two 8-qubit problems. In particular, the forms of training in FIG. 2A are a mean-field (MF) model, a one-dimensional and two-dimensional geometrically local (GL) model, and a Gaussian Fermionic (GF) model. FIG. 2A further shows a comparison to the situation without pre-training (a maximally mixed state).

In the left-hand portion of FIG. 2A, the pre-training is performed with quantum data (e.g. data produced by quantum hardware); in the right-hand portion of FIG. 2A, the pre-training is performed with classical data. An 8-qubit target η= custom-character /Z is used as the Gibbs state of the one-dimensional XXZ model for the Quantum Data, and a target η which coherently encodes the binary salamander retina data was adopted for the Classical Data.

As mentioned above, FIG. 2A also shows the results without any pre-training, i.e., starting from a maximally mixed state S(η∥ρ_θ=0). In all cases, it can be seen from FIG. 2A that pre-training provides a reduction in the initial relative entropy for subsequent training of the model on quantum hardware. This reduction is particularly strong for classical data. For quantum data, the situation is a little more mixed, in that the reduction in relative entropy is rather modest for pre-training based on a mean-field, but is much more prominent for the other forms of pre-training shown in FIG. 2A.

Accordingly, it is observed for both targets (quantum data and classical data) that all pre-training strategies are successful in reducing S(η∥ρ_θ_pre), with a slightly better performance for the classical target. For the GL 1D ansatz, the target state is contained within the QBM model space, which means that the relative entropy becomes zero after pre-training using quantum data. This shows that having knowledge about the target (e.g., the fact that it is one-dimensional) may help to inform QBM ansatz design and significantly reduce the complexity of QBM learning. The Fermionic model, which has completely different terms in the ansatz, manages to reduce S(η∥ρ_θ_pre) by a factor of ≈5 for the quantum target and ≈4 for the classical target. By the Jordan-Wigner transformation, a 1D quantum XXZ target can be expressed in the Fermionic basis. In this representation, the target only has a small perturbation compared to the model space of the GF model—this at least partially explains the good performance of pre-training with the GF model.

The effect of using the pre-trained models as a starting point for QBM learning with exact gradient descent was investigated for a fully-connected QBM with

$\begin{matrix} ℋ_{θ} = \sum_{k = x, y, z} \sum_{i, j > i} λ_{i j}^{k} σ_{i}^{k} σ_{j}^{k} + \sum_{i}^{n} γ_{i}^{k} σ_{i}^{k} . & (16) \end{matrix}$

In this context (compared to Equation (15)), any qubit can be connected to any other qubit, and there is no constraint on the geometric locality. This is a QBM Hamiltonian known in the literature. We consider data from the quantum target η= custom-character /Z for 8 qubits.

In FIG. 2B, the decay of quantum relative entropy (y-axis) is plotted against the number of learning iterations (t, x-axis) for training that starts from various pre-training strategies as per FIG. 2A. We define θ⁰as the parameter vector at the end of pre-training, whereby ρ_θ₀:=ρ_θ_pre. The lines in FIG. 2B match the form of pre-training shown in FIG. 2A—the top line represents no pre-training, the slightly lower middle line is MF (mean-field), and the lowest line is GL (geometrically local) 2D. The left-hand portion of FIG. 2B (hatched background) shows the pre-training phase, while subsequent training on (simulated) quantum hardware is shown in the centre and right-hand portion of FIG. 2B (unhatched light background). The quantum data (from the left portion of FIG. 2A) is used in FIG. 2B and noise is not taken into account, i.e., κ=ξ=0. Indeed, the GL 2D model requires pre-training with noise-free gradient descent, for which the relative entropy reduction is shown in the hatched area. A pre-training learning rate of γ=1/{tilde over (m)} was used along with a learning rate of γ=1/(2m) in order to satisfy the assumptions in Theorems 1 and 3.

The performance of the MF pre-trained model (middle line) is better than the top line corresponding to no pre-training at all iterations, but the improvement is relatively modest. Using a 2D GL model (bottom line) for pre-training has a much more significant effect, with S(η∥ρ_θ_t) being an order of magnitude smaller than for the model without pre-training at all steps t. Furthermore, the 2D GL pre-training strategy involves very few gradient descent steps (see the hatched area). This may potentially stem from the strong convexity of this particular pre-training model. Note that in general the benefits of pre-training should be assessed on a case by case basis, as the size of the improvement depends on the particular target and the particular pre-training model used. In this respect, it is noted that Theorem 1 has been proved for a learning rate of

$γ = \min {\frac{1}{L}, \frac{ϵ}{4 m^{2} (κ^{2} + ξ^{2})}} .$

Therefore, choosing a larger learning rate might reduce the benefits of pre-training.

FIG. 2C plots the maximum error in the expectation values (model compared to target) on the y-axis. This phase of the training is performed on simulated quantum hardware with no noise, and is plotted versus the number of iterations of SGD (x-axis). Classical input data is used (as per the right-hand component of FIG. 2A) and two different noise strengths compared—the lower line corresponds to less noise (0.01) while the upper line corresponds to greater noise (0.05). A learning rate is used of γ=ϵ/(2m²(κ²+ξ²)). The dashed line indicates the target precision of ϵ=0.1. Expectation values of the Gibbs state for the 1D quantum XXZ model in an external field and expectation values of a classical salamander retina data set are used as targets. The specifics of these models and how to compute the expectation values for classical data are given in Appendix E.

The bound on the number of SGD updates, as per Equation (8) for Theorem 1, was numerically confirmed. This involved considering data from the classical salamander retina target with 8 variables and a fully-connected QBM model on 8 qubits. As noted above, FIG. 2C compares training with two different noise strengths κ²+ξ². These settings were implemented by adding Gaussian noise, but in reality (rather than simulations) the noise strength would be determined by the number of data points and measurements of the Gibbs state on a quantum device. Using a standard Monte Carlo estimate, each update includes a mini-batch of data samples of size 1/ξ², and a number of measurements 1/κ²(assuming these measurements can be performed without additional hardware noise). Potentially mini batches of size 1 and a single measurement could be used as long as the Gibbs state expectation values are unbiased. For both noise strengths, the desired target precision of ϵ=0.1 was obtained within 104 steps. This is well within the bound custom-character (10⁹) on the number of steps in Theorem 1, which is the worst case scenario.

Discussion and Conclusion

An operational definition of quantum Boltzmann machine (QBM) learning has been developed and it is shown that this problem can be solved with polynomially many preparations of quantum Gibbs states. To prove the relevant bounds, the properties of the quantum relative entropy are used in combination with the performance guarantees of stochastic gradient descent (SGD). There is no assumption as to the form of the QBM Hamiltonian, other than that it consists of polynomially many terms. This is in contrast with some earlier works that looked at the somewhat related Hamiltonian learning problem only for geometrically local models. In that context, strong convexity is required in order to relate the optimal Hamiltonian parameters to the Gibbs state expectation values. In the machine learning setting described herein, the form of the target Hamiltonian is not known a priori. Therefore, learning the exact parameters is not as directly relevant, and instead the focus is directly on the expectation values. For this reason, the bounds for the approach described herein only involve L-smoothness of the relative entropy and may be applied to all types of QBMs without hidden units.

It is also shown herein that the theoretical sampling bounds may be tightened by lowering the initial relative entropy of the learning process. Typically, one would start QBM learning from the maximally mixed state, i.e., the state with no prior information. It is shown herein that pre-training on any subset of the parameters performs better than (or equal to) the maximally mixed state. This is beneficial if one can efficiently perform the pre-training, as is shown herein to be the case for mean-field, Gaussian Fermionic, and geometrically local QBMs. The performance of these models and the theoretical bounds are verified with classical numerical simulations. These simulations also indicate that knowledge about the target (e.g., its dimension, degrees of freedom, etc.) can significantly improve the training process. Furthermore, it is found that the generic bounds adopted herein are quite loose, and in practice it may be feasible to use a much smaller number of samples.

In some implementations, the sample bound may be tightened by going beyond the plain SGD method described so far. This could be done in various ways, such as by adding momentum, by using other advanced update schemes, and/or by exploiting the convexity of the relative entropy. This may improve the

$𝒪 [poly (m, \frac{1}{ϵ})]$

scaling in our bounds, and generally conforms to the approach described herein, whereby the QBM learning problem can be solved with polynomially many preparations of Gibbs states.

Another point of interest concerns training performance of different ansatz. Generative models are often assessed in terms of training quality and generalization capabilities have recently been investigated by both classical and quantum machine learning researchers. For the case of QBMs, generalization may offer a path for further development.

The operations and results described herein may also be generalized to QBM models with hidden units. This generalization could involve showing L-smoothness of the relative entropy for a more general and challenging setup, and a positive result would provide a facility to train highly expressive models. In this respect, it is noted that the results presented herein already hold for the special case of a QBM with fixed hidden units, since this problem reduces to the one discussed above.

The pre-training result described herein may be useful for implementing QBM learning on near-term and early fault-tolerant quantum devices. To this end, a quantum computer may be used as a Gibbs sampler. There are many quantum algorithms in the literature that produce Gibbs states with a quadratic improvement in time complexity over the best existing classical algorithms. Moreover, the use of a quantum device gives an exponential reduction in space complexity in general. For example, Motta et al. implemented a 2 qubit Gibbs state for an anti-ferromagnetic Ising model Hamiltonian on the Aspen-1 quantum computer. It is anticipated that improved quantum processing devices with higher gate fidelities and higher qubit counts, such as (but not limited to) Quantinuum's system model H2 or the Aspen-M-3, may be able to prepare similar Gibbs states and potentially for more complex Hamiltonians (i.e. with more operators in the ansatz). A further possibility is to sidestep the Gibbs state preparation and use algorithms that directly estimate Gibbs-state expectation values, e.g., by constructing classical shadows of pure thermal quantum states. This reduces the number of qubits and, potentially, the circuit depth.

The results presented herein support a range of methods for incremental learning QBMs driven by the availability of both training data and quantum hardware. For example, one could select a Hamiltonian ansatz that is very well suited for a particular quantum device. After exhausting all available classical resources during the pre-training, the model may be enlarged, and the training then continues on a quantum device, which therefore improves the overall performance. As quantum hardware matures, it allows the execution of deeper circuits and so supports a further increase of the model size. Incremental QBM training strategies may be designed to follow the quantum hardware road map towards training larger and more expressive quantum machine learning models.

Example Implementations

The results presented herein support the development of methods for incremental learning by QBMs driven by the availability of both training data and quantum hardware. For example, one could select a Hamiltonian ansatz that is very well suited for a particular quantum device. After exhausting all available classical resources during the pre-training phase on selected components of the model (such as by selecting subsets of the operators and parameters), the model is enlarged and continues the training on the quantum device, which is guaranteed to improve the performance (compared to the output at the end of the pre-training phase). As quantum hardware continues to develop further, this allows the execution of deeper circuits and a further increase in the model size. Incremental QBM training strategies may be designed to follow the quantum hardware road map, towards training larger and more expressive quantum machine learning models.

FIG. 3 presents a high-level flowchart of an example of a method as disclosed herein for performing machine learning using a quantum computer. Operation 310 comprises providing a model comprising a Quantum Boltzmann machine with an ansatz Hamiltonian having a set of operators and a set of parameters.

The Quantum Boltzmann machine with an ansatz Hamiltonian may be further provided with target expectation values for performing the first stage of training. For example, a QBM ρ_θwith ansatz Hamiltonian may be given by a set of operators {H_i}_i=1^m, parameters {θ_i}_i=1^m, and the target expectation values {H_i}_η.

Operation 320 performs a first stage of training the model against data from a target using a selected subset of the operators to obtain optimized values for a subset of the parameters. The first stage of performing is performed on classical computing hardware to provide a partly trained model.

In the first stage of training, a subset {tilde over (m)} may be selected of the operators {H_i}_i=1^{{tilde over (m)}}, that can be trained classically. The selection of a subset of operators of a Hamiltonian may have regard to the operators that can be efficiently trained in a classical context. The relative entropy may be optimized on classical hardware with respect to a selected subset of the parameters while keeping other parameters set to zero (or any other suitable values). The optimal parameters obtained after the classical pre-training (having exhausted the available classical resources) may be saved. The pre-training may be iterated over t=1 to T_pre, where T_prerepresents a maximum number of iterations (if convergence does not occur beforehand). This pre-training seeks to optimize the relative entropy S(ηρ_θ) with respect to the subset {θ_i}_i=1^{{tilde over (m)}}of parameters while keeping the other (non-selected) parameters set to a fixed value such as zero.

Operation 330 performs a second stage of training the model against data from the target using the full set of operators to obtain optimized values for a larger subset of the set of parameters for the model. The second stage of performing is performed on quantum computer hardware to provide a further trained model. The optimized parameter values saved from the first stage may be used to initialize the corresponding parameters for the second stage of training.

The larger subset of the set of parameters for the model may, in some implementations, comprise the full set of parameters for the model. Accordingly, the second phase of training may encompass all the set of parameters for the model. (It is implicit that the first phase of training does not involve training all the parameters of the set, because this would not allow the second phase of training to involve a larger subset.

The second phase of training may be iterated over t=1 to T_q1, where T_q1represents a maximum number of iterations (if convergence does not occur beforehand). In this second phase of training, the relative entropy may be optimized with respect to all the parameters in the model and target by computing Gibbs state expectation values on a quantum device. Before performing this optimization, the ansatz Hamiltonian is extended with a further set of operators and parameters (enlarging the model). These further operators and parameters are those that were not included in their respective subsets during the pre-training phase (and so have not yet been incorporated into the model).

Accordingly the second phase optimizes the relative entropy S(η|ρ_θ) with respect to all of the parameters {θ_i}_i=1^mby computing the Gibbs state expectation values {H_i}_ρθon a quantum device, such as by using thermal shadows. The parameters of the extended (complete) QBM may be initialized using the optimal values obtained at the end of the previous quantum optimization loop (iteration). As noted above, for the first iteration, the parameters are initialized using the optimal values from the first stage of training (the pre training). For each iteration on the quantum hardware, the additional target expectation values are determined to optimize the relative entropy with respect to all the extended QBM parameters by obtaining the required Gibbs state expectation values on the quantum hardware.

Depending on the quantum computing resources available, the above approach may be developed further such that in a third training phase, the ansatz Hamiltonian is further extended with a set of (orthogonal) operators {{tilde over (H)}_i}_i=1ⁿand parameters {{tilde over (θ)}_i}_i=1ⁿ. The parameters of the extended QBM are initialized as λ≡{θ,{tilde over (θ)}}={θ_opt,0}, where θ_optare the optimal parameters obtained at the end of the previous quantum optimization loop. The additional target expectation values custom-character {tilde over (H)}_i_ηare computed and used for the training.

In this further development, the third phase of training may be iterated over t=1 to T_q2, where T_q2represents a maximum number of iterations. Each iteration then involves an optimization of the relative entropy S(η∥ρ_λ) with respect to all the extended QBM parameters λ by obtaining the required Gibbs state expectation values on a quantum device.

In some implementations, the second stage of the training (and/or the third stage of the training if relevant) may be performed on a hybrid system which includes both quantum computing hardware and classical computing hardware. For example, Gibbs states for the Quantum Boltzmann machine may be used to provide samples for machine learning. The Gibbs states may be prepared and sampled on the quantum computing hardware, whereas the parameters for the model may be maintained on classical computing hardware. Various other configurations of a hybrid system may also be used for the second and/or third training stages.

FIG. 4 is a schematic diagram showing various hardware and software components of an example of a system 400 described herein for machine learning. In particular, the system 400 comprises a classical computing platform 410 and a quantum computing platform 450. The classical computing platform 410 may comprise a known form of digital computer(s) including one or more processors for executing program instructions and memory for storing the program instructions and data. The quantum computing platform may comprise a known form of one or more quantum computers. It will be appreciated that the components and configuration shown in FIG. 4 are presented by way of example only and not by way of limitation.

FIG. 4 depicts three particular components implemented using the classical computing platform 410, namely a Hamiltonian ansatz 415, an optimization program 420, and a set of target data 480. The Hamiltonian ansatz 415 is structured in accordance with a Quantum Boltzmann Machine (QBM) and represents a model, for example relating to a complex physical system. The Hamiltonian 415 incorporates a set of operators and a set of parameters. The machine learning involves determining values for the set of parameters such that the output of the model, as represented by expectation values of the operators, mimics (largely coincides with) the system, which is being modelled, as represented by the target data 480.

The classical computing platform 410 further includes an optimization (minimization) program 420, for example a program which performs stochastic gradient descent (SGD). In broad terms, the optimization program 420 may obtain samples, as represented by expectation values of the operators in the Hamiltonian ansatz 415, for comparison with training data, namely target data 480. The optimization program uses the results of these comparisons to update the parameters of the Hamiltonian ansatz 415 so as to reduce quantum relative entropy. The optimization program 420 performs multiple iterations of this machine learning process to reach a configuration of the model parameters which has a low (minimal) quantum relative entropy.

The first stage of the process (pre-training) is performed solely on the classical computing device 410. Such a device may not have enough processing capability to perform the whole optimization procedure. Thus, as described herein, the pre-training may be performed, for example, with respect to a subset of the model parameters. The remaining parameters (those not in subset) may be held at a fixed value, such as zero. Using a subset of the parameters for the optimization such as using SGD generally reduces the computational resources used for this second stage of training.

The second stage of the process (after the pre-training) involves the use of the quantum computing platform 450. The quantum computing platform 450 includes a quantum circuit 452 associated with one or more qubits 455 to support computations running on the quantum computing platform 450. The quantum computing platform 450 also includes a QBM 425 associated with the Hamiltonian ansatz. This Hamiltonian on the quantum computer 450 generally matches the Hamiltonian ansatz 415 on the classical computing device 410, especially in terms of the associated model, but they are adapted to run on different hardware platforms as shown in FIG. 4. For example, the QBM 425 may be implemented using the quantum circuit 452 of the quantum computing platform 450.

In the example of FIG. 4, the optimization program 420 is also used to control the optimization procedure in the second stage in a similar manner to the first stage. Accordingly, the second stage can be regarded as hybrid, in that it involves computing operations on both the classical computing platform 410 and the quantum computing platform 450. The optimization program 420 (such as SGD) provides parameters to the QBM 425 and compares the QBM output with training data (target data 480), which can then be used to determine machine learning updates.

By measuring the physical properties of the QBM (425) prepared on the quantum device (450), the optimizer (SGD) can search in parallel across parameter space to find parameter values that have the lowest relative entropy. This ability may offer the potential of performing machine learning on the quantum computer 450 that is not computationally feasible on a classical computer 410 (or is more computationally expensive on a classical computer). For example, the second phase of the searching may be performed with a larger subset (or complete set) of the parameters for the model. Accordingly, the approach described herein exploits the different properties and characteristics of classical and quantum computing devices to support an efficient approach for machine learning with respect to complex systems.

Various implementations and examples have been disclosed herein. It will be appreciated that these implementations and examples are not intended to be exhaustive, and the skilled person will be aware of many potential variations and modifications of these implementations and examples that fall within the scope of the present disclosure. It will also be understood that features of particular implementations and examples can typically be incorporated into other implementations and examples (unless the context clearly indicates to the contrary). In summary, the various implementations and examples herein are disclosed by way of illustration rather than limitation, and the scope of claimed embodiments is defined in the appended claims.

APPENDICES
Appendix A: Preliminaries: Some Useful Mathematical Facts and Relations

Here we identify some useful mathematical facts and relations and derive some useful results that are used in the proofs in later appendices.

1. Convexity

Definition 2 (Convexity). A multivariate function ƒ: custom-character ^m is said to be convex when

$\begin{matrix} f (t x + (1 - t) y) \leq t f (x) + (1 - t) f (y), \forall x, y \in ℝ^{m}, t \in [0, 1] . & (A1) \end{matrix}$

If additionally the gradient ∇ƒ(x*) is zero only for one unique vector x*∈ custom-character ^m, then ƒ is said to be strictly convex.

The following Lemma can be deduced from the standard definition of convexity; see Garrigos et al., “Handbook of Convergence Theorems for (Stochastic) Gradient Methods”, arXiv:2301.11235, 2023 (hereinafter “Garrigos”).

Lemma 1. Let ƒ be twice continuously differentiable. Then ƒ is convex if

$\begin{matrix} v^{_{T}} \nabla^{2} f (x) v \geq 0, \forall x, v \in ℝ^{m} . & (A2) \end{matrix}$

A stronger version of convexity is used in some of our discussions.

Definition 3 (α-Polyak-Lojasiewicz). Let ƒ: custom-character ^M→, and α>0. We say that ƒ is α-Polyak-Lojasiewicz if

$\begin{matrix} \frac{1}{2 α} { \nabla f (x) }^{2} \geq f (x) - \min_{x} f (x) . & (A3) \end{matrix}$

Where ∥⋅∥ is the Euclidean norm.

An even stronger convexity condition is the following.

Definition 4 (α-strong convexity). Let ƒ: custom-character ^m→, and α>0. We say that ƒ is α-strongly convex if

$\begin{matrix} \frac{α t (1 - t)}{2} { x - y }^{2} + f (t x + (1 - t) y) \leq tf (x) + (1 - t) f (y) . & (A 4) \end{matrix}$

The former implies the latter

Lemma 2. If ƒ is α-strongly convex then ƒ is α-Polyak-Lojasiewicz.

The strong convexity of a function can be tested as follows.

Lemma 3. Let ƒ be twice continuously differentiable. Then ƒ is α-strongly convex if

$\begin{matrix} υ^{T} \nabla^{2} f (x) υ \geq α { υ }^{2}, & (A 5) \end{matrix}$

$\forall x, υ \in ℝ^{m} .$

Besides convexity, we also need to characterize the smoothness of a function.

Definition 5 (L-smoothness). Let ƒ: custom-character ^m→ and L>0. We say that ƒ is L-smooth if it is differentiable and if the gradient ∇ƒ is L-Lipschitz:

$\begin{matrix}  \nabla f (x) - \nabla f (y)  \leq L  x - y , & (A 6) \end{matrix}$

$\forall x, y \in ℝ^{m} .$

For L-smooth functions we have the following useful property (see Garrigos).

Lemma 4 (Descent lemma). Let ƒ: custom-character ^m→ be a twice differentiable, L-smooth function, then

$\begin{matrix} f (y) \leq f (x) + \nabla {f (x)}^{T} (y - x) + \frac{L}{2} { y - x }^{2} . & (A 7) \end{matrix}$

2. Derivative of a Matrix Exponential

The derivative of the matrix exponential e^Hwith respect to a parameter is given by Duhamel's formula

$\begin{matrix} \partial_{θ} e^{H} = \int_{0}^{1} e^{(1 - s) H} (\partial_{θ} H) e^{s H} ds . & (A 8) \end{matrix}$

Taking H=W+θV, with simple manipulations we find a useful alternative expression

$\begin{matrix} \begin{matrix} \partial_{θ} e^{H} = e^{H} \int_{0}^{1} e^{- sH} V e^{sH} ds \\ = \sum_{j, k} ❘ j k ❘ j ❘ V ❘ k e^{λ_{j}} \int_{0}^{1} e^{s (λ_{k} - λ_{j})} ds \\ = \sum_{j, k} ❘ k ❘ V_{j, k} e^{λ_{j}} \frac{e^{λ_{k} - λ_{j}} - 1}{λ_{k} - λ_{j}} . \end{matrix} & (A 9) \end{matrix}$

Here we use the basis diagonalizing the Hamiltonian, H=Σ_jλ_j| custom-character j|, and we introduce the notation V_jk=j|V|k The above expression is valid also for the diagonal entries, k=j, since

$\lim_{x \to 0} \frac{e^{x} - 1}{x} = 1 .$

Now,

$\begin{matrix} e^{λ_{j}} \frac{e^{λ_{k} - λ_{j}} - 1}{λ_{k} - λ_{j}} = ⁠ e^{λ_{j}} \frac{e^{λ_{k} - λ_{j}} - 1}{e^{λ_{k} - λ_{j}} + 1} \frac{e^{λ_{k} - λ_{j}} + 1}{ λ_{k} - λ_{j}} = \frac{\tanh (\frac{λ_{k} - λ_{j}}{2})}{\frac{λ_{k} - λ_{j}}{2}} \frac{e^{λ_{k}} + e^{λ_{j}}}{2} . & (A10) \end{matrix}$

With the notation

$\hat{f} (ω) = \frac{\tanh (\frac{ω}{2})}{\frac{ω}{2}}$

we can write

$\begin{matrix} \partial_{θ} e^{H} = \sum_{j, k} ❘ j k ❘ V_{jk} \hat{f} (λ_{k} - λ_{j}) \frac{1}{2} (e^{λ_{k}} + e^{λ_{j}}) . & (A11) \end{matrix}$

Let us interpret {circumflex over (ƒ)}(ω) as the Fourier transform of another function: {circumflex over (ƒ)}(ω)=∫_−∞^∞ƒ(t)e^−itωdt. Plugging this in the previous expression we obtain

$\begin{matrix} \begin{matrix} \partial_{θ} e^{H} = \sum_{j, k} ❘ j k ❘ V_{jk} \int_{- \infty}^{\infty} f (t) e^{- it (λ_{k} - λ_{j})} dt (\frac{e^{λ_{k}}}{2} + \frac{e^{λ_{j}}}{2}) \\ = \frac{1}{2} \int_{- \infty}^{\infty} f (t) \sum_{j} e^{it λ_{j}} ❘ j j ❘ V \sum_{k} e^{- it λ_{k} + λ_{k}} ❘ k k ❘ dt + \\ \frac{1}{2} \int_{- \infty}^{\infty} f (t) \sum_{j} e^{it λ_{j}} ❘ j j ❘ V \sum_{k} e^{- it λ_{k}} ❘ k k ❘ dt \end{matrix} & (A12) \end{matrix}$

$\begin{matrix} = \frac{1}{2} \int_{- \infty}^{\infty} f (t) e^{itH} {Ve}^{- itH} {dte}^{H} + \frac{1}{2} e^{H} \int_{- \infty}^{\infty} f (t) e^{itH} {Ve}^{- itH} dt \\ = \frac{1}{2} {Φ (V), e^{H}} . \end{matrix}$

Here {A, B}=AB+BA is the anti-commutator, and we have defined ΦV)=∫_−∞^∞ƒ(t)e^itHVe^−itHdt.

We have recovered, by different means, a result that is achievable via the method described in Hastings, “Quantum belief propagation: An algorithm for thermal quantum systems”, Phys. Rev. B 76, 201102 (2007).

Appendix B: Properties of the Quantum Relative Entropy for Quantum Boltzmann Machines

Set out below is a proof of some properties of the quantum relative entropy S(η∥ρ_θ) of a generic QBM ρ_θwith respect to some arbitrary target η. These properties are used for the proof of the theorems in the main text. We start by showing the convexity and afterward we show the L-smoothness.

1. Strict Convexity

In order to show (strict) convexity of S, we can use Lemma 1 above. We first show that the Hessian of the quantum relative entropy with respect to the QBM parameters, ∇²S, is positive semidefinite. Afterwards, we show that S has only one unique global optimizer θ* for which ∇S(η∥ρ_θ*)=0, and apply the Lemma.

We recall from the main text that the QBM Hamiltonian, custom-character _θ=Σ_iθ_iH_i, is a sum over Hermitian, in general non-commuting, operators H_i. Using the derivative of the matrix exponential in Equation (A12), we have:

$\begin{matrix} \begin{matrix} \frac{\partial S}{\partial θ_{i}} = \frac{\partial}{\partial θ_{i}} Tr [η (\log η - ℋ_{θ} + \log Tr [e^{ℋ_{θ}}])] \\ = - Tr [η H_{i}] + \frac{T r [{Φ (H_{i}), e^{ℋ_{θ}}}]}{2 T r [e^{ℋ_{θ}}]} \\ = - Tr [η H_{i}] + T r [ρ_{θ} Φ (H_{i})] \end{matrix} & (B1) \end{matrix}$

$= - Tr [η H_{i}] + T r [ρ_{θ} H_{i}] . (B 1)$

In the last step we use the cyclic property of the trace. This is Equation (4) in the main text that precedes the appendix. We now take the second derivative starting from Equation (B1):

$\begin{matrix} \begin{matrix} \frac{\partial^{2} s}{\partial θ_{i} \partial θ_{j}} = \frac{\partial}{\partial θ_{j}} Tr [ρ_{θ} Φ (H_{i})] \\ = Tr [(\frac{{Φ (H_{j}), e^{ℋ_{θ}}}}{2 T r [e^{ℋ_{θ}}]} - \frac{e^{ℋ_{θ}} T r [{Φ (H_{j}), e^{ℋ_{θ}}}]}{2 {(T r [e^{ℋ_{θ}}])}^{2}}) Φ (H_{i})] \\ = \frac{1}{2} Tr [ρ_{θ} {Φ (H_{i}), Φ (H_{j})}] - T r [ρ_{θ} Φ (H_{i})] T r [ρ_{θ} Φ (H_{j})] . \end{matrix} & (B2) \end{matrix}$

In the last step we used Tr[A{B, C}]=Tr[C{A, B}] to rearrange the terms.

As Φ(V) is a Hermitian operator for any Hermitian V we see that the Hessian has the form of a covariance matrix.

It is then readily shown to be positive semidefinite and satisfies Equation (A2). For any vector v∈

$\begin{matrix} \begin{matrix} v^{T} \nabla^{2} Sv = \sum_{n, m} v_{n} v_{m} (\frac{1}{2} T r [ρ_{θ} {Φ (H_{n}), Φ (H_{m})}] - \\ Tr [ρ_{θ} Φ (H_{n})] T r [ρ_{θ} Φ (H_{m})]) \\ = \frac{1}{2} T r [ρ_{θ} {\sum_{n} v_{n} Φ (H_{n}), \sum_{m} v_{m} Φ (H_{m})}] - \end{matrix} & (B3) \end{matrix}$

$\begin{matrix} Tr [ρ_{θ} \sum_{n} v_{n} Φ (H_{n})] T r [ρ_{θ} \sum_{m} v_{m} Φ (H_{m})] \\ = \frac{1}{2} T r [ρ_{θ} {Φ (W), Φ (W)}] - T r [ρ_{θ} Φ (W)] T r [ρ_{θ} Φ (W)] \\ = Tr [ρ_{θ} {Φ (W)}^{2}] - T {r [ρ_{θ} Φ (W)]}^{2} \end{matrix}$

$\begin{matrix} = Tr [{ρ_{θ} (Φ (W) - T r [ρ_{θ} Φ (W)] I)}^{2}] \\ = \geq 0. \end{matrix}$

Here we define Hermitian operator W=Σ_nv_nH_n. The last line is the expectation value of the square of a Hermitian operator, and as such it must be non-negative.

This means that the quantum relative entropy is convex. We now show strict convexity by a contradiction argument, following Proposition 17 in Anshu et al., “Sample-efficient learning of interacting quantum systems”, Nature Physics 17, 931 (2021) (hereinafter “Anshu”). Assume we have found one set of parameters θ* with ∇(η∥ρ_θ*)=0. Then from Equation (B1) we have

custom-character
H
_i

_η
=

H
_i

_ρ
_θ*

for all H_i. Note that we can always find at least one such θ* by Jaynes' custom-character principle, see Jaynes, “Information Theory and Statistical Mechanics”, Phys. Rev. 106, 620 (1957). Next, assume there exists a different set of parameters, χ≠θ*, with H_i_η=H_i_ρ_χfor all H_i. Then

$\begin{matrix} \begin{matrix} S (ρ_{χ}  ρ_{θ}^{*}) = Tr [ρ_{χ} \log ρ_{χ}] - T r [ρ_{χ} \log ρ_{θ^{*}}] \\ = Tr [ρ_{χ} \log ρ_{χ}] - \sum_{i} θ_{i}^{*} T r [ρ_{χ} H_{i}] + \log Z_{θ^{*}} \\ = Tr [ρ_{χ} \log ρ_{χ}] - \sum_{i} θ_{i}^{*} T r [ρ_{θ} * H_{i}] + \log Z_{θ^{*}} \end{matrix} & (B4) \end{matrix}$

$\begin{matrix} = Tr [ρ_{χ} \log ρ_{χ}] - T r [ρ_{θ} * \log ρ_{θ^{*}}] \\ = \geq 0. (B 4) \end{matrix}$

Similarly, by swapping ρ_χand ρ_θ*, we find

$\begin{matrix} S (ρ_{θ^{*}}  ρ_{χ}) = T r [ρ_{θ^{*}} \log ρ_{θ^{*}}] - T r [ρ_{χ} \log ρ_{χ}] \geq 0. & (B5) \end{matrix}$

It follows that S(ρ_θ*∥ρ_χ)=0, implying ρ_θ*=ρ_χ. Now because the operators H_iare orthogonal we have θ*=χ. This contradicts the assumption in the beginning (θ*≠χ), and we can have only one unique θ* with ∇S(η∥ρ_θ*)=0. Hence S is strictly convex by Definition 2.

2. Strong Convexity

To show α-strong convexity of S one can use Lemma 3. To the best of our knowledge there is no proof in the literature showing that quantum relative entropy of Gibbs states is strongly convex in general. On the other hand, this property has been proven for particular classes of Hamiltonians. Anshu et al. prove strong convexity for k-local Hamiltonians defined on a finite dimensional lattice. They show that in this case

$α \in O (\frac{1}{n}),$

a polynomial decrease with respect to the system size. Strong convexity for the more general class of low-intersection Hamiltonians was proved in Haah et al., “Optimal learning of quantum Hamiltonians from high temperature Gibbs states”, IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 135-146 (2022) (hereinafter “Haah”). Low-intersection Hamiltonians have terms that act non-trivially only on a constant number of qubits, and each term intersects non-trivially with a constant number of other terms.

In this section, we use differentiable programming to numerically analyze the smallest eigenvalue of the Hessian, λ_min(∇²S), seeking evidence for strong convexity; see Baydin et al., “Automatic differentiation in machine learning: A survey”, arXiv:1502.05767 (2018). We consider a 1D nearest-neighbor Hamiltonian:

$ℋ = \sum_{i = 1}^{n - 1} h_{i, i + 1}^{xx} X_{i} X_{i + 1} + h_{i, i + 1}^{yy} Y_{i} Y_{i + 1} + h_{i, i + 1}^{zz} Z_{i} Z_{i + 1};$

and a fully-connected one:

$ℋ = \sum_{i = 1}^{n - 1} \sum_{j > i}^{n} h_{i, j}^{xx} X_{i} X_{j} + h_{i, j}^{yy} Y_{i} Y_{j} + h_{i, j}^{zz} Z_{i} Z_{j} .$

We randomly sample coefficients uniformly in [−μ,μ] where μ is a scale parameter and determines the maximum size of random parameters of the vector of the coefficients. FIG. 5 shows the minimum eigenvalue (y-axis) of the Hessian, as a function of the number of qubits, n (x-axis), showing the median of 25 random instances for (a) the 1D nearest neighbour Hamiltonian and (b) a fully-connected Hamiltonian. The scale parameter p determines the maximum size of random parameters. In all cases, the smallest eigenvalue decreases with increasing the number of qubits, but appears to converge to a positive value (plateau) for larger values of n (especially in the case of the fully-connected Hamiltonian (b)). The fully-connected Hamiltonian has m∈O(n²) parameters and yields smaller eigenvalues than the 1D Hamiltonian which has m∈O(n) parameters instead. These results provide evidence of strong convexity with a decreasing polynomially with the system size.

3. L-Smoothness

We show that the quantum relative entropy S(η∥ρ_θ) is an L-smooth function of θ. To do so we need an upper bound on the largest eigenvalue of the Hessian in Equation B2. We begin with the following property:

$\begin{matrix}  Φ (V)  =  \sum_{j, k} | j 〉 〈 k | V_{j k} \hat{f} (λ_{k} - λ_{j})  \leq ❘ {\hat{f}}_{\max} ❘  \sum_{j, k} ❘ j 〉 〈 k ❘ V_{j k}  =  V , & (B6) \end{matrix}$

where we use that {circumflex over (ƒ)}>0 and {circumflex over (ƒ)}_max=1. In what follows we use the above result with ∥⋅∥₂, the operator norm induced by the Euclidean vector norm (p=2). Let us bound the entries of the Hessian

$\begin{matrix} {❘ \frac{\partial^{2} S}{\partial θ_{j} \partial θ_{k}} ❘ = ❘ \frac{1}{2} T r [ρ_{θ} {Φ (H_{j}), Φ (H_{k})}] - T r [ρ_{θ} Φ (H_{j})] T r [ρ_{θ} Φ (H_{k})] ❘ \leq \frac{1}{2} ❘ λ_{\max} ({Φ (H_{j}), Φ (H_{k})}) ❘ + ❘ λ_{\max} (Φ (H_{j})) ❘ ❘ λ_{\max} (Φ (H_{k})) ❘ \leq \frac{1}{2} { {Φ (H_{j}), Φ (H_{k})} }_{2} + { Φ (H_{j}) }_{2} { Φ (H_{k}) }_{2} \leq 2 ❘ Φ (H_{j}) }_{2} { Φ (H_{k}) }_{2} \leq 2 { H_{j} }_{2} { H_{k} }_{2} . & (B7) \end{matrix}$

Here we use that expectations are bounded by the largest eigenvalue or, alternatively, by the p=2 operator norm. We also use the sub-multiplicative property of the operator norm, and Equation (B6) above. We are now able to put an upper-bound on the largest eigenvalue of the Hessian matrix:

$\begin{matrix} { \nabla^{2} S }_{2} = ❘ λ_{\max} (\nabla^{2} S) ❘ \leq \max_{j} \sum_{k} | \frac{\partial^{2} S}{\partial θ_{j} \partial θ_{k}} | \leq m \max_{j} \max_{k} ❘ \frac{\partial^{2} S}{\partial θ_{j} \partial θ_{k}} ❘ \leq 2 m \max_{j} \max_{k} { H_{j} }_{2} { H_{k} }_{2} = 2 m \max_{j} { H_{j} }_{2}^{2} & (B8) \end{matrix}$

The first equality uses the fact that the Hessian is a symmetric matrix, the first inequality is a consequence of the Gershgorin circle theorem.

We can use this result to prove the L-smoothness.

Let us define a function h(t)=∇S(η∥ρ_y+t(x-y)). Then we have

$\begin{matrix} { \nabla S (η  ρ_{x}) - \nabla S (η  ρ_{y}) }_{2} = { h (1) - h (0) }_{2} = { \int_{0}^{1} h^{'} (t) dt }_{2} \leq \int_{0}^{1} { h^{'} (t) }_{2} dt = \int_{0}^{1}  \nabla^{2} S (η  ρ_{y + t (x - y)}) (x - y) _{2} dt \leq \int_{0}^{1}  \nabla^{2} S (η  ρ_{y + t (x - y)}) _{2} { x - y }_{2} dt \leq 2 m \max_{n} { H_{n} }_{2}^{2} { x - y }_{2} & (B9) \end{matrix}$

where in the last step we used Equation B8. Thus, the quantum relative entropy is L-smooth with L=2m max_j∥H_j∥₂².

Appendix C: Convergence Results of Stochastic Gradient Descent for Training Quantum Boltzmann Machines

In this appendix we first review useful results from the machine learning literature, then prove Theorems 1 and 2 in the main text that precedes the appendix. We also discuss a few upper bounds for the relative entropy in the context of QBM learning.

1. Review of Stochastic Gradient Descent Convergence Results

We begin by stating three convergence results from the SGD literature. Consider a loss function ƒ: custom-character ^m→ that is L-smooth (Definition 5 and bounded from below by β_inf∈. The stochastic gradient is unbiased, i.e., [ĝ]=∇ƒ, and satisfies

$\begin{matrix} 𝔼 { \hat{g} (x) }^{2} \leq 2 A (f (x) - f_{\inf}) + B { \nabla f (x) }^{2} + C, & (C1) \end{matrix}$

for some A, B, C≥0 and all x∈ custom-character ^m. SGD iteratively minimizes ƒ according to the update rule x^t=x^t-1γ^tĝ_x_t-1at time step t. Khaled et al., “Better theory for SGD in the nonconvex world”, arXiv:2002.03329 (2020) (hereinafter “Khaled”), proved the following SGD convergence result.

Lemma 5 (restatement of Corollary 1 in Khaled). Choose precision ϵ>0 and step size

$\begin{matrix} γ = \min {\frac{1}{\sqrt{LAT}}, \frac{1}{L B}, \frac{ϵ}{2 L C}}, and set δ_{0} = 𝔼 [f (x^{0})] - f_{\inf} . & (C2)] \end{matrix}$

$Then provided that$

$T \geq \frac{1 2 δ_{0} L}{ϵ^{2}} \max {B, \frac{1 2 δ_{0} A}{ϵ^{2}}, \frac{2 C}{ϵ^{2}}},$

we have that SGD converges with

$\begin{matrix} \min_{1 \leq t \leq T} 𝔼  \nabla f (x^{t})  \leq ϵ . & (C3) \end{matrix}$

Here E[⋅] denotes the expectation with respect to x^t, which is a random variable due to the stochasticity in the gradient. Let us now consider a loss function which, in addition to the previous conditions, is also α-Polyak-Lojasiewicz (Definition 3). We consider the following iterative learning rate scheme for γ_t.

Lemma 6 (restatement of Lemma 3 in Khaled). Considera sequence (r_t)_tsatisfying

r
_t+1(1−αγ_t)r_t+cγ_t²,

where γ_t≤1/b for all t≥0 and a,c≥0 with a≤b. Fix T>0 and let

$k_{0} = ⌈ \frac{T}{2} ⌉ .$

Then choosing the step size as

$\begin{matrix} γ_{t} = {\begin{matrix} \frac{1}{b}, & if T \leq \frac{b}{a} or t < k_{0}, \\ \frac{2}{a (s + t - k_{0})}, & if T \geq \frac{b}{a} and t > k_{0}, \end{matrix} & (C4) \end{matrix}$

$with s = \frac{2 b}{a} gives r_{T} \leq \exp {- \frac{aK}{2 b}} r_{0} + \frac{9 c}{a^{2} T}$

For this learning rate scheme, Khaled et al. proved the following SGD convergence result.

Lemma 7 (restatement of Corollary 2 in Khaled). Choose precision ϵ>0 and step size γ_tfollowing Lemma 6 with

$γ_{t} \leq \min {\frac{α}{2 A L}, \frac{1}{2 B L}} .$

Then provided that

$\begin{matrix} T \geq \frac{L}{α} \max {\frac{2 A}{α} \log \frac{2 δ_{0}}{ϵ}, \frac{2 B}{α} \log \frac{2 δ_{0}}{ϵ}, \frac{9 C}{2 αϵ}} & (C5) \end{matrix}$

we have that SGD converges with

$\begin{matrix} 𝔼 ❘ f (x^{T}) - f_{\inf} ❘ \leq ϵ . & (C6) \end{matrix}$

2. Proofs of Theorems 1 and 2 in the Main Text

We prove Theorem 1, which is repeated here for completeness.

Theorem 1 (QBM training). Given a QBM defined by a set of n-qubit Pauli operators {H_i}_i=1^m, a precision κ for the QBM expectations, a precision for the data expectations, and a target precision ϵ such that

$κ^{2} + ξ^{2} \geq \frac{ε}{2 m}$

After

$\begin{matrix} T \geq \frac{4 8 δ_{0} m^{2} (κ^{2} + ξ^{2})}{ϵ^{4}} & (C7) \end{matrix}$

iterations of stochastic gradient descent on the relative entropy S(η∥ρ_θ) with constant learning rate

$γ^{t} = \frac{ϵ}{4 m^{2} (κ^{2} + ξ^{2})},$

we have

$\begin{matrix} \min_{t = 1, \dots, T} 𝔼 ❘ {〈 H_{i} 〉}_{ρ_{θ^{t}}} - {〈 H_{i} 〉}_{η} ❘ \leq ϵ, \forall i, & (C8) \end{matrix}$

where custom-character [⋅] denotes the expectation with respect to the random variable t. Each iteration t∈{0, . . . , T} requires

$\begin{matrix} 𝒩 \geq 𝒪 (\frac{1}{κ^{4}} \log \frac{m}{1 - λ^{\frac{1}{T}}}) & (C9) \end{matrix}$

preparations of the Gibbs state ρ_θ_t, and the success probability of the full algorithm is λ. Here, δ₀=S(η∥ρ_θ₀)−S(η∥ρ_θ_opt) is the relative entropy difference with the optimal model ρ_θ_opt.

Proof. The quantum relative entropy is L-smooth with L=2m max_i∥H_i∥₂², and for Pauli operators ∥H_i∥₂=1. Then, we can minimize the relative entropy by SGD and apply the convergence result in Lemma 5.

For the SGD algorithm we need an unbiased gradient estimator with bounded variance. We recall that the gradient of the relative entropy is given by ∂_θ_iS(η∥ρ_θ)= custom-character H_i_ρ_θ−H_i_θ. The target expectation values H_i_θare estimated as from the data set, as described in Appendix E below. Note that |H_i_θ−ĥ_i,θ|≤ξ, where ξ>0 is limited by the size of the data set. One can improve on by collecting more data, as long as the amount of samples is polynomial in n.

For estimating the QBM expectation values custom-character H_i_ρ_θ, we can use a number of techniques. Here we focus on classical shadow tomography. As is known from Theorem 4 in Huang et al., “Information-theoretic bounds on quantum advantage in machine learning”, Phys. Rev. Lett. 126, 190505 (2021), for example, there exists a procedure that returns the expectation values of m different Pauli operators {H_i} to precision κ with

$𝒪 (\frac{\log (m / \tilde{λ)}}{κ^{4}})$

preparations of ρ_θ¹. The success probability of the procedure is 1−{tilde over (λ)}. Thus, we can obtain estimators ĥ_i,ρ_θsuch that

$\begin{matrix} \max_{i} ❘ {\hat{h}}_{i, ρ_{θ}} - {〈 H_{i} 〉}_{ρ_{θ}} ❘ \leq κ . & (C10) \end{matrix}$

We then use ĝ_θ_i=ĥ_i,ρ_θ−ĥ_i,ηas estimators for the partial derivatives of the quantum relative entropy. The variance of the norm of the gradient estimator is bounded as ¹Note that this procedure only applies to Pauli operators so from now on we define the H_iin the QBM Hamiltonian to be Pauli operators. We discuss in the main text that this result can be generalized to other types of operators resorting to other shadow tomography protocols.

$\begin{matrix} \begin{matrix} 𝔼 { {\hat{g}}_{θ} - \nabla S (η  ρ_{θ}) }^{2} = 𝔼 \sum_{i = 1}^{m} {({\hat{h}}_{i, ρ_{θ}} - {\hat{h}}_{i, η} - {〈 H_{i} 〉}_{ρ_{θ}} + {〈 H_{i} 〉}_{η})}^{2} \\ \leq 𝔼 \sum_{i = 1}^{m} {({\hat{h}}_{i, ρ_{θ}} - {〈 H_{i} 〉}_{ρ_{θ}})}^{2} + {({\hat{h}}_{i, η} - {〈 H_{i} 〉}_{η})}^{2} \\ \leq m (κ^{2} + ξ^{2}) \end{matrix} . & (C11) \end{matrix}$

Since the variance can also be written as custom-character ∥ĝ_θ∥²−∥∇S∥(η∥ρ_θ)∥²we find that our setup is compatible with Equation (C5) for A=0, B=1, C=m(κ²+ξ²). We choose

$ϵ < 1 and κ^{2} + ξ^{2} \geq \frac{ϵ}{2 m}$

in Lemma 5. This yields a learning rate of

$γ = \frac{ϵ}{4 m^{2} (κ^{2} + ξ^{2})} .$

We conclude that after

$\begin{matrix} T \geq \frac{4 8 δ_{0} m^{2} (κ^{2} + ξ^{2})}{ϵ^{4}} & (C12) \end{matrix}$

iterations of SGD we have

$\begin{matrix} \min_{1 \leq t \leq T} 𝔼  \nabla S (η  ρ_{θ^{t}})  \leq ϵ . & (C13) \end{matrix}$

Here δ₀=S(η∥ρ_θ₀)−S(η∥ρ_θ_opt) is the relative entropy at the initialization minus the relative entropy at the optimum. Importantly we note the QBM expectation values are computed with a success probability 1−{tilde over (λ)} at each iteration. Consequently, the total success probability of the whole training is equal to (1−{tilde over (λ)})^Tfor T update steps. Then to have a total success probability of λ we need to set

$\tilde{λ} = (1 - λ^{\frac{1}{T}})$

in the shadow tomography protocol. This result, together with the sampling bound on the number of measurements of the shadow tomography

$𝒪 (\frac{\log (m / \tilde{λ)}}{κ^{4}}),$

completes the proof of Theorem 1.

We now provide a proof for Theorem 2, which we restate here.

Theorem 2 (α-strongly convex QBM training). Given a QBM defined by a Hamiltonian ansatz custom-character _θsuch that S(η∥ρ_θ) is α-strongly convex, a precision κ for the QBM expectations, a precision ξ for the data expectations, and a target precision ϵ such that

$κ^{2} + ξ^{2} \geq \frac{ε}{2_{m}} .$

After

$\begin{matrix} T \geq \frac{1 8 m^{2} (κ^{2} + ξ^{2})}{α^{2} ε^{2}} & (C 14) \end{matrix}$

iterations of stochastic gradient descent on the relative entropy S(η∥ρ_θ) with learning rate

$γ^{t} \leq \frac{1}{4 m^{2}}$

(see Appendix C.2 for the specific learning rate schedule), we have

$\begin{matrix} \begin{matrix} \min_{t = 1 \dots . T} 𝔼 | H_{i} ρ_{At} - H_{i} η | \leq ϵ, & \forall i . \end{matrix} & (C 15) \end{matrix}$

Each iteration requires the number of samples given in Equation (C9).

In order to prove this theorem, we first show that η, ρ^optand ρ_θare ‘collinear’ with respect to the relative entropy.

$\begin{matrix} \begin{matrix} S (η  ρ_{θ}) - S (η  ρ_{θ^{opt}}) = - Tr [η \log ρ_{θ}] + T r [η \log ρ_{θ^{opt}}] \\ = - Tr [η ℋ_{θ}] + \log Z_{θ} + [η ℋ_{θ^{o p t}}] - \log Z_{θ^{opt}} \\ = - Tr [ρ_{θ^{o p t}} ℋ_{θ}] + \log Z_{θ} + [ρ_{θ^{o p t}} ℋ_{θ^{o p t}}] - \log Z_{θ^{opt}} \\ = - Tr [ρ_{θ^{opt}} \log ρ_{θ}] + T r [ρ_{θ^{opt}} \log ρ_{θ^{opt}}] \\ = S (ρ_{θ^{opt}}  ρ_{θ}) . \end{matrix} & (C 16) \end{matrix}$

Here, in the going from the second to the third line, we used the fact that Tr[ηH_i]=Tr[ρ_θ_optH_i], which follows from setting Equation (B1) to zero. Rearranging the terms we get the collinearity S(η∥ρ_θ)=S(η∥ρ_θ_opt)+S(ρ_θ_opt∥ρ_θ). This is a non-trivial result because the relative entropy is not a distance: it is not symmetric and does not satisfy the triangle inequality in general. With this relation we are now able to prove Theorem 2.

Proof. S(ρ_θ_opt∥ρ_θ) satisfies all the relevant assumptions for SGD convergence: it is L-smooth function with L=2m max_i∥H_i∥₂², it is bounded below by 0, and the stochastic gradient has bounded variance [Equations (C11) and (C9) apply]. Recall that we use Pauli terms in the Hamiltonian, so that ∥H_i∥₂=1 and L=2m.

In addition, the α-strong convexity assumed by the theorem implies that S(ρ_θ_opt∥ρ_θ) is α-Polyak-Lojasiewicz by Lemma 2. This means we can invoke Lemma 7. As before, we set A=0, B=1, C=m(κ²+ξ²) and choose ϵ′<1 in the Lemma, thus obtaining a maximum learning rate

$γ_{t} \leq \frac{1}{4 m^{2}} .$

Looking at the case²where

$\frac{2}{α} \log \frac{2 δ_{0}}{ϵ^{'}} \leq \frac{9 m (κ^{2} + ξ^{2})}{2 {αϵ}^{'}},$

we find that after

$T \geq \frac{9 m^{2} (κ^{2} + ξ^{2})}{α^{2} ϵ^{'}}$

iterations the expected relative entropy is custom-character S(ρ_θ_opt∥ρ_θ_T)≤ϵ′. It follows that

$\begin{matrix} \begin{matrix} 𝔼S (ρ_{θ^{opt}}  ρ_{θ^{T}}) & \geq \frac{l}{2 \ln 2} 𝔼  ρ_{θ^{opt}} - ρ_{θ^{T}} _{1}^{2} \end{matrix} & (C 17) \end{matrix}$

$\begin{matrix} \geq \frac{1}{2} {(𝔼  ρ_{θ^{opt}} - ρ_{θ^{T}})}^{2} \end{matrix}$

$\begin{matrix} = \frac{1}{2} {(𝔼 \max_{- I < U \leq I} | Tr [U (ρ_{θ^{opt}} - ρ_{θ^{T}}))}^{2} \end{matrix}$

where we apply Pinsker's inequality in the first step, and we use the variational definition of trace distance in the last step. The maximization is over unitary matrices. Let us now consider unitary matrices defined as

$U_{i} = \frac{H_{i}}{‖ H_{i} ‖} + \sqrt[i]{I - \frac{H_{i}^{2}}{‖ H_{i} ‖^{2}}} .$

These have the property that

$H_{i} =  H_{i}  \frac{(U_{i} + U_{i}^{†})}{2} .$

Therefore,

$\begin{matrix} \begin{matrix} \sqrt{2 𝔼S (ρ_{θ^{opt}}  ρ_{θ^{T}})} \geq 𝔼 \max_{- I < U \leq I} | Tr [\frac{1}{2} (U + U^{†}) (ρ_{θ^{opt}} - ρ_{θ^{T}})] | \\ \geq 𝔼 \underset{}{\max_{i}} | Tr [\frac{1}{2} (U_{i} + U_{i}^{†}) (ρ_{θ^{opt}} - ρ_{θ^{T}})] | \\ \geq 𝔼 \underset{}{\max_{i}} \frac{1}{‖ H_{i} ‖} | Tr [H_{i} ρ_{θ^{opt}}] - T r [H_{i} ρ_{θ^{T}}] | . \end{matrix} & (C 18) \end{matrix}$

Thus, we obtain ²Note that, depending on the problem specific parameter δ₀, and the free parameters κ and ξ, one could be in the other case of Lemma 7. One then follows the same steps shown here, and arrives at a slightly different, yet polynomial in n, number of steps.

$\begin{matrix} \begin{matrix} \sqrt{2 ϵ^{'}} \geq \frac{1}{‖ H_{i} ‖} 𝔼 | Tr [H_{i} ρ_{θ^{opt}}] - T r [H_{i} ρ_{θ^{T}}] |, & \forall i . \end{matrix} & (C 19) \end{matrix}$

Since ∥H_i∥=1 for Pauli operators. To solve the QBM learning to precision ϵ we choose

$ϵ^{'} = \frac{ϵ^{2}}{2}$

and conclude that

$\begin{matrix} \begin{matrix} 𝔼 | Tr [H_{i} ρ_{η}] - T r [H_{i} ρ_{θ^{T}}] | \leq ϵ, & \forall i . \end{matrix} & (C 20) \end{matrix}$

3. Achieving a Desired Precision on the Quantum Relative Entropy for Theorem 1

In this section we study the scenario where the user is interested in obtaining a certain precision on the quantum relative entropy, rather than on the difference in the expectation values. Again, due to a potential model mismatch, we discuss the relative entropy S(ρ_θ_opt∥ρ_θ) w.r.t. the optimal QBM ρ_θ_opt.

We begin by training the QBM ρ_θwith SGD. Using Theorem 1, we can achieve | custom-character H_i_η−H_i_ρ_θ|≤ϵ for all i with polynomial sampling complexity. This for same item implies a similar relation w.r.t. the optimal model: |H_i_η−H_i_ρ_θ_opt|≤ϵ. By the triangle inequality we have that |H_i_ρθ−H_i_ρ_θ_opt|≤2ϵ. Then

$\begin{matrix} \begin{matrix} S (ρ_{θ^{opt}}  ρ_{θ}) = Tr [ρ_{θ^{opt}} \log ρ_{θ^{opt}}] - \sum_{i} θ_{i} Tr [ρ_{θ^{opt}} H_{i}] + \log Z_{θ} \\ = Tr [ρ_{θ^{opt}} \log ρ_{θ^{opt}}] - \sum_{i} θ_{i} Tr [ρ_{θ^{opt}} H_{i}] + \log Z_{θ} + \\ \sum_{i} θ_{i} Tr [ρ_{θ} H_{i}] - \sum_{i} θ_{i} Tr [ρ_{θ} H_{i}] \\ = Tr [ρ_{θ^{opt}} \log ρ_{θ^{opt}}] - Tr [ρ_{θ} \log ρ_{θ}] + \\ \sum_{i} θ_{i} (Tr [ρ_{θ} H_{i}] - Tr [ρ_{θ^{opt}} H_{i}]) \end{matrix} . & (C21) \end{matrix}$

Similarly,

$\begin{matrix} S (ρ_{θ}  ρ_{θ^{opt}}) = - Tr [ρ_{θ^{opt}} \log ρ_{θ^{opt}}] + Tr [ρ_{θ} \log ρ_{θ}] - \sum_{i} θ_{i}^{opt} (Tr [ρ_{θ} h_{i}] - Tr [ρ_{θ^{opt}} H_{i}]) . & (C22) \end{matrix}$

Thus

$\begin{matrix} \begin{matrix} S (ρ_{θ^{opt}}  ρ_{θ}) \leq S (ρ_{θ^{opt}}  ρ_{θ}) + S (ρ_{θ}  ρ_{θ^{opt}}) \\ = \sum_{i} (Tr [ρ_{θ} H_{i}] - Tr [ρ_{θ^{opt}} H_{i}]) (θ_{i} - θ_{i}^{opt}) \\ \leq \sum_{i} ❘ Tr [ρ_{θ} H_{i}] - Tr [ρ_{θ^{opt}} H_{i}] ❘ \cdot ❘ θ_{i} - θ_{i}^{opt} ❘ \\ \leq 2 ϵ { θ - θ^{opt} }_{1} \end{matrix} . & (C23) \end{matrix}$

To minimize the quantum relative entropy to precision ϵ′, we choose

$ϵ \leq \frac{ϵ^{'}}{2 { θ - θ^{opt} }_{1}} .$

This determines the number of SGD iterations via Theorem 1. Note that the number of iterations remains polynomial in the system size n.

Finally we combine this result with Equation (C16) and obtain the implication

$\begin{matrix} ❘ H_{i} η - H_{i} ρ_{θ} ❘ \leq ϵ, \forall i \Rightarrow S (η  ρ_{θ}) - S (η  ρ_{θ^{opt}}) = S (ρ_{θ^{opt}}  ρ_{θ}) \leq 2 ϵ { θ - θ^{opt} }_{1} . & (C24) \end{matrix}$

This proves Equation (6) in the main text.

Appendix D: Pre-Training

In this appendix we first prove Theorem 3 in the main text, and then discuss various pre-training models.

1. Proof of Theorem Guaranteed Performance Improvement by Pre-Training

For completeness we start by restating Theorem from the main text.

Theorem 3 (QBM pre-training). Assume a target η and a QBM model ρ_θ=e^Σⁱ⁼¹^m^θⁱ^Hⁱ/Z for which we like to minimize the relative entropy S(η∥ρ_θ). Initializing at θ⁰=0 and pre-training S(η∥ρ_θ) with respect to any subset of {tilde over (m)}≤m parameters guarantees that

$\begin{matrix} S (η  ρ_{θ^{pre}}) \leq S (η  ρ_{θ^{0}}), & (D1) \end{matrix}$

where θ^pre=[χ^pre,0_{m-{tilde over (m)}}] and the vector χ^preof length i contains the parameters for the terms {H_i}_i=1^{{tilde over (m)}}at the end of pre-training. More precisely, starting from p_x=eΣ_i=1^{{tilde over (m)}}x_iH_i/Z and minimizing S(η∥ρ_χ) with respect to χ ensures Equation (D1) for any S(η∥ρ_χ_pre)≤S(η∥ρ_Ω₀).

Proof. First we relate the difference in relative entropy between two parameter vectors in the full space to the difference in relative entropy of the pre-trained parameter space. In particular, for any real parameter vectors θ=[χ,0_{m-{tilde over (m)}}] and θ′=[χ′,0_{m-{tilde over (m)}}] we have

$\begin{matrix} \begin{matrix} S (η  ρ_{θ}) - S (η  ρ_{θ'}) = Tr [η \log ρ_{θ'}] - Tr [η \log ρ_{θ}] \\ = \sum_{i = 1}^{m} (θ_{i}^{'} - θ_{i}) Tr [η H_{i}] - \log Tr [e^{\sum_{i = 1}^{m} θ_{i}^{'} H_{i}}] + \\ \log Tr [e^{\sum_{i = 1}^{m} θ_{i} H_{i}}] \\ = \sum_{i = 1}^{\tilde{m}} (χ_{i}^{'} - χ_{i}) Tr [η H_{i}] - \log Tr [e^{\sum_{i = 1}^{\tilde{m}} χ_{i}^{'} H_{i}}] + \\ \log Tr [e^{\sum_{i = 1}^{\tilde{m}} χ_{i} H_{i}}] \\ = Tr [η \log ρ_{χ^{'}}] - Tr [η \log ρ_{χ}] \\ = S (η  ρ_{χ}) - S (η  ρ_{χ'}) \end{matrix} & (D2) \end{matrix}$

Now using pre-training vectors θ^pre=[χ^pre,0_{m-{tilde over (m)}}] and θ⁰=[χ⁰,0_{m-{tilde over (m)}}]=0 we see that S(η∥ρ_χ_pre)≤S(η∥ρ_χ₀) implies S(η∥ρ_θ_pre)≤S(η∥ρ_θ₀). Thus, any method that finds such a χ^preguarantees Equation (14).

While conclusive, the above proof does not provide us with a method to find such a χ^pre, i.e., it is agnostic to the specific pre-training method. As a constructive example, let us consider minimizing χ^prewith noiseless gradient decent on a subset of {tilde over (m)} parameters. This means we update the subset parameters as χ^t=χ^t-1−γ{tilde over (∇)}S(η{tilde over (∥)}ρ_χ_t-1), where {tilde over (∇)}S(η∥ρ_χ_t-1) is the gradient of the subset of parameters, and γ the learning rate. Since S is L-smooth, we can use the descent Lemma 4 to bound the difference in relative entropy of the subset

$\begin{matrix} \begin{matrix} S (η  ρ_{χ^{t}}) - S (η  ρ_{χ^{t - 1}}) \leq \nabla {S (η  ρ_{χ^{t - 1}})}^{T} (- \tilde{γ} \nabla S (η  ρ_{χ^{t - 1}})) + \\ \frac{L}{2} { - \tilde{γ} \nabla S (η  ρ_{χ^{t - 1}}) }^{2} \\ = - γ (1 - \frac{γ L}{2}) { \tilde{\nabla} S (η  ρ_{χ^{t - 1}}) }^{2} \end{matrix} . & (D3) \end{matrix}$

Setting

$γ \leq \frac{2}{L}$

we obtain S(η∥ρ_χ_t)≤S(η∥ρ_χ_t-1). By recursively applying this inequality we obtain a λ^prewith

S(η∥ρ_χ_pre)≤S(η∥ρ_χ₀),

which by our theorem above ensures Equation (D1). Note that the smoothness L here is the smoothness on the subset of parameters, which can be bounded by L≤2{tilde over (m)} max_i∥H_i∥₂².

2. Pre-Training Methods

Here we discuss possible pre-training models and strategies to optimize them. We focus on the models discussed in the main text: 1) a mean-field model, 2) a Gaussian Fermionic model, 3) nearest-neighbor quantum spin models. The advantage of the first two models is that they can be trained analytically. While for the nearest-neighbor models this is not possible, they satisfy the locality assumptions in Anshu and Haah, and hence have a strongly convex relative entropy.

2a. Mean-field Quantum Boltzmann Machine

We define the mean-field QBM by the parameterized Hamiltonian

$\begin{matrix} ℋ_{θ} = \sum_{i}^{n} θ_{i}^{x} σ_{i}^{x} + θ_{i}^{y} σ_{i}^{y} + θ_{i}^{z} σ_{i}^{z} . & (D4) \end{matrix}$

Since this Hamiltonian has a simple structure, in which many terms commute, we can find the optimal parameters analytically. First, recall that the QBM expectation values are given by

$\begin{matrix} H_{i} ρ_{θ} = \frac{\partial}{\partial θ_{i}} \log Tr [e^{ℋ_{θ}}] = \frac{\partial}{\partial θ_{i}} \log Z_{θ} . & (D5) \end{matrix}$

For the mean-field Hamiltonian, we find

$\begin{matrix} Z_{θ} = Tr [e^{\sum_{i = 1}^{n} θ_{i}^{x} σ_{i}^{x} + θ_{i}^{y} σ_{i}^{y} + θ_{i}^{z} σ_{i}^{z}}] = \prod_{i = 1}^{n} Tr [e^{θ_{i}^{x} σ_{i}^{x} + θ_{i}^{y} σ_{i}^{y} + θ_{i}^{z} σ_{i}^{z}}] = \prod_{i = 1}^{n} 2 \cosh { θ_{i} }_{2}, & (D6) \end{matrix}$

where we have defined ∥θ_i∥₂=√{square root over (θ_i^x²+θ_i^y²+θ_i^z²)}. Here we have used the commutativity of single qubit operators in the first equality, and expanded the exponent for the second equality. We therefore get

$\begin{matrix} \log Z_{θ} = \sum_{i = 1}^{n} \log 2 \cosh { θ_{i} }_{2} . & (D7) \end{matrix}$

From which the derivative follows as

$\begin{matrix} \frac{\partial}{\partial θ_{i}^{x, y, z}} \log Z_{θ} = \frac{θ_{i}^{x, y, z}}{{ θ_{i} }_{2}} \tanh { θ_{i} }_{2} . & (D8) \end{matrix}$

In order to find the optimal QBM parameters for each qubit, i, we then solve the three coupled equations,

$\begin{matrix} \frac{θ_{i}^{x, y, z}}{{ θ_{i} }_{2}} \tanh { θ_{i} }_{2} = σ_{i}^{x, y, z} η, & (D9) \end{matrix}$

which corresponds to setting the QBM derivative in Equation (4) in the main text to zero. From the strict convexity of the relative entropy, we know this has one unique solution provided the target expectation values, custom-character σ_i^x,y,z_ηform a consistent set, i.e. it comes from a density matrix. We can find the solution by squaring the three equations, and adding them together, giving

$\begin{matrix} { θ_{i} }_{2} = \tan h^{- 1} (\sqrt{σ_{i}^{x} η_{2} + σ_{i}^{y} η_{2} + σ_{i}^{z} η_{2}}) . & (D10) \end{matrix}$

Here we used that the argument of the tanh is always positive. Substituting this into Equation (D9) we then find the closed-form solution of the QBM parameters

$\begin{matrix} θ_{i}^{x, y, z} = σ_{i}^{x, y, z} η \frac{\tan h^{- 1} (\sqrt{{〈 σ_{i}^{x} 〉}_{η}^{2} + {〈 σ_{i}^{y} 〉}_{η}^{2} + {〈 σ_{i}^{z} 〉}_{η}^{2}})}{\sqrt{{〈 σ_{i}^{x} 〉}_{η}^{2} + {〈 σ_{i}^{y} 〉}_{η}^{2} + {〈 σ_{i}^{z} 〉}_{η}^{2}}} . & (D11) \end{matrix}$

In practice, the optimal parameters for an arbitrary mean-field QBM can be obtained by numerically evaluating this expression for the given target expectation values.

2b. Gaussian Fermionic Quantum Boltzmann Machine

The Gaussian Fermionic QBM has a parameterized, quadratic, Fermionic Hamiltonian

$\begin{matrix} ℋ_{θ} = {\vec{C}}^{†} \tilde{Θ} \vec{C} \equiv \sum_{i, j} {\tilde{Θ}}_{ij} {\vec{C}}_{i}^{†} {\vec{C}}_{j} . & (D12) \end{matrix}$

Here, {right arrow over (C)}^†=[c₁^†, . . . , c_N^†, c₁, . . . c_n] is a vector containing n Fermionic mode creation and annihilation operators, which satisfy the Fermionic commutation relations {c_i,c_j^†}=δ_i,jand {c_i,c_j}=0. These Fermionic operators can be expressed as strings of Pauli operators by the Jordan-Wigner transformation. {tilde over (Θ)} is the 2n×2n dimensional matrix containing the QBM model parameters θ, which can be identified as a Fermionic single-particle Hamiltonian. Note that this matrix needs to be Hermitian, and since terms like c₁^†c₁^†are zero it has in total n²free parameters.

In order to find the optimal parameters, we use that the single-particle correlation matrix with entries [Γ_ρθ]_ij custom-character {right arrow over (C)}_i^†{right arrow over (C)}_j_ρ_θcontains sufficient information to compute all possible properties of the Gaussian quantum system. This includes all possible observables (via Wick's theorem), entanglement measures, and also sampling from ρ_θ; see Surace et al., “Fermionic Gaussian states: An introduction to numerical approaches”, SciPost Physics Lecture Notes (2022). In particular, the Gaussian Fermionic QBM gradient reduces to the difference in the correlation matrices of the target and the model

$\begin{matrix} \frac{\partial S}{\partial {\tilde{Θ}}_{ij}} = {\vec{C}}_{i}^{†} {\vec{C}}_{j} ρ_{θ} - {\vec{C}}_{i}^{†} \vec{C_{j}} η . & (D13) \end{matrix}$

We can solve this by first determining the target expectation values custom-character {right arrow over (C)}_i^†{right arrow over (C)}_j_ηand setting {right arrow over (C)}_i^†{right arrow over (C)}_j_ρ_θ*={right arrow over (C)}_i^†{right arrow over (C)}_j_η. Then we use the fact that the Hamiltonian of a Gaussian Fermionic system can be written in the eigenbasis of the correlation matrix as

$\begin{matrix} H_{η} = \frac{1}{2} W_{η} σ^{- 1} (Λ_{η}) W_{η}^{†}, & (D14) \end{matrix}$

where W_ηand Λ_ηis given by the eigen decomposition Γ_η=W_ηΛ_ηW_η^†, and σ⁻¹(X) the inverse sigmoid function. Thus, we (numerically) diagonalize Γ_ηand set the optimal Gaussian Fermionic QBM Hamiltonian equal to

$H_{θ^{*}} = \frac{1}{2} W_{η} σ^{- 1} (Λ_{η}) W_{η}^{†} .$

Since the eigen decomposition of a Hermitian matrix is unique, we find one unique solution. This is in agreement with the strict convexity of the quantum relative entropy.

2c. Geometrically-Local Quantum Boltmann Machine

The last type of restricted QBM model we discuss are the geometrically-local QBMs. We consider the same Hamiltonian as for a generic fully connected 2-local QBM [Equation (16)], but then with additional constraints on the locality of the Pauli operators. In particular, we focus on nearest-neighbor models on some d-dimensional lattice, e.g. a one-dimensional chain where each Pauli operator only acts on two neighbouring qubits. In full generality, the parameterized QBM Hamiltonian is given by

$\begin{matrix} ℋ_{θ} = \sum_{k = x, y, z} \sum_{〈 i, j 〉} λ_{ij}^{k} σ_{i}^{k} σ_{j}^{k} + \sum_{i}^{n} γ_{i}^{k} σ_{i}^{k}, & (D15) \end{matrix}$

where we sum over the nearest-neighbour sites custom-character i,j of the lattice with periodic boundary conditions. In the main text we consider for example a d=1 lattice (a ring), and a d=2 square lattice.

In order to use these models for pre-training, we train them with SGD on the relative entropy until a fixed precision is reached. Importantly, as these Hamiltonians only have m= custom-character (n) terms and a finite interaction range, Anshu and Haah show that the quantum relative entropy is strongly convex. Therefore, the optimization is guaranteed to converge quickly to the global optimum, recall Theorem 2. However, this includes obtaining Gibbs state expectation values of geometrically local Hamiltonians. This can be done with a quantum computer, or potentially classically with tensor networks; see Kuwahara et al., “Improved thermal area law and quasilinear time algorithm for quantum Gibbs states”, Phys. Rev. X 11, 011047 (2021) and Alhambra et al., “Locally accurate tensor networks for thermal states and time evolution”, PRX Quantum 2, 040331 (2021).

Appendix E: Construction of the Target State Expectation Values

In this appendix we review how to embed classical data into a target density matrix η. We will follow the approach for quantum spin models in Kappen, “Learning quantum models from quantum or classical data”, Journal of Physics A: Mathematical and Theoretical 53, 214001 (2020) (hereinafter “Kappen”). We also show how to extend this formalism to Fermionic quantum models needed for the pre-training of our Gaussian Fermionic QBM. Lastly, we describe the two different targets used for the numerical simulations in the main text.

1. Classical Data Encoding

Following the approach in Kappen, one way to encode a classical dataset consisting of N bit strings { custom-character ∈{0,1}ⁿ}_μ=1^Minto a quantum state is by defining the pure state

$\begin{matrix} η = ❘ ψ ψ ❘, & (E1) \end{matrix}$

with

$\begin{matrix} ❘ ψ 〉 = \sum_{\vec{s} \in {0, 1}^{n}} \sqrt{q (\vec{s})} ❘ \vec{s} 〉 . & (E2) \end{matrix}$

Here

$q (\vec{s}) = \frac{1}{N} \sum_{μ = 1}^{M} δ_{\vec{s}, {\vec{s}}^{μ}}$

is the classical empirical probability for bitstring {right arrow over (s)}, and |{right arrow over (s)} custom-character is a computational basis state indexed by {right arrow over (s)}. The q({right arrow over (s)}) can be found by counting the bitstrings in the data set {}. From |ψ one can compute expectation values such as

$\begin{matrix} 〈 ψ ❘ σ_{i}^{z} ❘ ψ 〉 = \sum_{\vec{s} \in {0, 1}^{n}} q (\vec{s}) {\vec{s}}_{i} & (E3) \end{matrix}$

for the Pauli spin operator σ_i^z. This can be efficiently computed classically for a polynomially sized dataset, i.e. for polynomially many custom-character . Computing such expectation values from η is possible for all 1- and 2-local Pauli operators as shown in Kappen.

We now show that we can generalize this encoding to Fermionic QBMs, i.e. the terms in the Hamiltonian ansatz consists of Fermionic creation c_i^†and annihilation operators c_i. We define |{right arrow over (s)} custom-character to be equal to the Fermionic Fock basis. This is analogous to the computational basis in the spin-picture (by the Jordan-Wigner transformation), but the bit-strings {∈{0,1}ⁿ}_μ=1^Min the data set should now be interpreted as occupation-number vectors of Fermions. Note that the occupation number basis is defined by the eigenstates of the Fermionic number operator Σ_ic_i^†c_i.

The creation and annihilation operators act on the Fock-basis states as follows

$\begin{matrix} c_{i}^{†} ❘ \vec{s} = (1 - {\vec{s}}_{i}) ❘ \vec{s} +, & (E4) \end{matrix}$

$c_{i} ❘ \vec{s} = {\vec{s}}_{i} ❘ \vec{s} -,$

where custom-character is the unit bit-string with a 1 at position i and zeros everywhere else. With these relations we can derive the required expectation values for the target η to train the (Gaussian) Fermionic QBM

$\begin{matrix} ψ ❘ c_{i}^{†} c_{i} ❘ ψ = \sum_{\vec{s} \in {0, 1}^{n}} q (\vec{s}) {\vec{s}}_{i}, & (E5) \end{matrix}$

$ψ ❘ c_{i}^{†} c_{j} ❘ ψ = \sum_{\vec{s} \in {0, 1}^{n}} \sqrt{q (\vec{s}) q (F_{i} F_{j} \vec{s})} (1 - {\vec{s}}_{i}) {\vec{s}}_{j},$

$ψ ❘ c_{i}^{†} c_{j}^{†} ❘ ψ = \sum_{\vec{s} \in {0, 1}^{n}} \sqrt{q (\vec{s}) q (F_{i} F_{j} \vec{s})} (1 - {\vec{s}}_{i}) (1 - {\vec{s}}_{j}), i \neq j$

$ψ ❘ c_{i} c_{j} ❘ ψ = \sum_{\vec{s} \in {0, 1}^{n}} \sqrt{q (\vec{s}) q (F_{i} F_{j} \vec{s})} {\vec{s}}_{i} {\vec{s}}_{j}, i \neq j$

$ψ ❘ c_{i} ❘ ψ = \sum_{\vec{s} \in {0, 1}^{n}} \sqrt{q (\vec{s}) q (F_{i} \vec{s})} {\vec{s}}_{i}$

$ψ ❘ c_{i}^{†} ❘ ψ = \sum_{\vec{s} \in {0, 1}^{n}} \sqrt{q (\vec{s}) q (F_{i} \vec{s})} (1 - {\vec{s}}_{i}),$

where F_iflips the Fermion occupation number (from occupied to unoccupied and vice versa) of index i in the vector {right arrow over (s)}.

2. Data Used for Numerical Simulations in the Main Text

For the numerical simulations in the main text we use two different targets η: 1) a target constructed from a quantum source, and 2) a classical data set embedded into η 1 using the encoding above. For the quantum source we use the XXZ model Hamiltonian

$\begin{matrix} ℋ_{XXZ} = \sum_{i = 1}^{n - 1} J (σ_{i}^{x} σ_{i + 1}^{x} + σ_{i}^{y} σ_{i + 1}^{y}) + Δ σ_{i}^{z} σ_{i + 1}^{z} + \sum_{i = 1}^{n} h_{z} σ_{i}^{z} . & (E6) \end{matrix}$

Here J and Δ are the model parameters describing the Heisenberg interactions between the quantum spins on a one-dimensional lattice, and h_zthe strength of an external magnetic field. We set

$η = \frac{e^{H} xxz}{Z}$

with J=−0.5, Δ=−0.7 and h_z=−0.8, and compute the expectation values custom-character H_i_η classically. This is intractable in general, but our aim is to replicate the scenario in which the expectation values are measured experimentally—for example, from a state prepared on a quantum device.

For the classical source, we use the classical salamander retina dataset given in Tkačik et al., “Searching for collective behavior in a large network of sensory neurons”, PLOS Computational Biology 10, 1 (2014). This data set consists of bit-string data of different features of the response of cells in salamander retina. We select the first 8 features and trim the data to the first 10 data recordings. We then construct the expectation values custom-character H, from the procedure outlined above.

SYSTEM AND METHOD FOR PERFORMING MACHINE LEARNING USING A QUANTUM COMPUTER

Information

Publication Number

Date Filed

Date Published

Inventors

Original Assignees

CPC

International Classifications

Abstract

Description

Claims

Priority Claims (1)