This application claims priority to Indian Patent Application No. 202141023642, “Quantum Orthogonal Neural Networks,” filed on May 27, 2021, the subject matter of which is incorporated herein by reference in its entirety.
This disclosure relates generally to neural networks, and more particularly, to training and using orthogonal neural networks using a quantum computing system or a classical computing system.
In the evolution of neural network structures, adding constraints to the weight matrices has often been an effective path. For example, orthogonal neural networks (OrthoNNs) have been proposed as a new type of neural network for which, at each layer, the weights matrix should remain orthogonal. This property is useful for reaching higher accuracy and for avoiding vanishing or exploding gradients in deep architectures. Several classical gradient descent methods have been proposed to preserve the orthogonality while updating the weights matrices. However, these techniques suffer from longer running times and sometimes only approximate the orthogonality. In particular, the main method for achieving orthogonality during training is to first perform gradient descent to update the weights matrix (which is then no longer orthogonal) and then perform a Singular Value Decomposition (SVD) to orthogonalize or almost orthogonalize the weights matrix. However, achieving orthogonality this way hinders a fast training process, since at every step an SVD computation needs to be performed.
In the emergent field of quantum machine learning, several proposals have been made to implement neural networks. Some algorithms rely on long term and perfect quantum computers, while others try to harness the existing quantum devices using variational circuits. However, it is unclear how such architectures scale and whether they provide efficient and accurate training.
This disclosure describes novel approaches for machine learning algorithms, such as deep learning algorithms. This disclosure describes a class of neural networks that has the property of having orthogonal weight matrices. This is an improved technique for approximating certain functions, such as those used for the classification of data, for the reasons described below. The neural networks described herein may also be optimized in terms of the number of gates, the scaling of the training time, and the type of gates in the circuit.
Orthogonal neural networks may provide an advantage for deep neural networks, that is, neural networks with a large number of layers. They may preserve norms during both the forward and backward passes. This property helps prevent gradient vanishing and explosion, which is prominent in deep neural networks. Such neural networks also have the property of non-redundancy in the weights, since the weight vectors are orthogonal and linearly independent, so that each of them gives “different” information about the input-output relation.
Some embodiments relate to a quantum architecture for a connected neural network that offers orthogonality in the weight matrices. In some embodiments, the neural network comprises a quantum circuit shaped like an inverted pyramid for each layer of the neural network.
Some embodiments relate to using a unary preserving quantum circuit (e.g., with BS gates as described in Section 2) to form a layer of an orthogonal neural network. The layer may be trained in O(n) time where n is the number of input nodes of the layer. Data may be loaded into the layer using a data loader e.g., as described in Section 2.
Other aspects include components, devices, systems, improvements, methods, processes, applications, computer readable mediums, and other technologies related to any of the above.
Embodiments of the disclosure have other advantages and features which will be more readily apparent from the following detailed description and the appended claims, when taken in conjunction with the examples in the accompanying drawings, in which:
The figures and the following description relate to preferred embodiments by way of illustration only. It should be noted that from the following discussion, alternative embodiments of the structures and methods disclosed herein will be readily recognized as viable alternatives that may be employed without departing from the principles of what is claimed.
This disclosure presents a new training method for neural networks that preserves (e.g., perfect) orthogonality while having the same running time as usual gradient descent methods without the orthogonality condition, thus achieving the best of both worlds: efficient training and perfect orthogonality.
One of the main ideas comes from the quantum world, where any quantum circuit corresponds to an operation described by a unitary matrix, which if we only use gates with real amplitudes, is an orthogonal matrix. In particular, this disclosure describes a novel special-architecture quantum circuit, for which there is an efficient way to map the elements of an orthogonal weights matrix to the parameters of the gates of the quantum circuit and vice versa. Thus, while performing a gradient descent on the elements of the weights matrix individually does not preserve orthogonality, performing a gradient descent on the parameters of the quantum circuit does preserve orthogonality (since any quantum circuit with real parameters corresponds to an orthogonal matrix) and is equivalent to updating the weights matrix. This disclosure also proves that performing gradient descent on the parameters of the quantum circuit can be done efficiently classically (with constant update cost per parameter) thus concluding that there exists a quantum-inspired, but fully classical way of efficiently training perfectly orthogonal neural networks.
Moreover, the special-architecture quantum circuit defined herein has many properties that make it a good candidate for NISQ (Noisy Intermediate-Scale Quantum) implementations: it may use only one type of quantum gate, may use a simple connectivity between the qubits, may have depth linear in the input and output node sizes, and may benefit from powerful error mitigation techniques that make it resilient to noise. This allows us to also propose an inference method running the quantum circuit on data which might offer a faster running time (e.g., given the shallow depth of the quantum circuit).
Some of our contributions are summarized in Table 1 (below), where we have considered the time to perform a feedforward pass, or one gradient descent step. A single neural network layer is considered, with input and output of size n. For example, the methods described in this disclosure are just as fast as other methods during the feedforward pass. Additionally, the algorithms in this disclosure are faster than other orthogonal methods and just as fast as non-orthogonal methods in the matrix update process.
In this section we define a special-architecture parametrized quantum circuit that may be useful for performing training and inference on orthogonal neural networks. As we said, the training may be (e.g., completely) classical in the end, but the intuition of the new method comes from this quantum circuit, while the inference can happen classically or by applying this quantum circuit. A basic introduction to quantum computing concepts for this work is given in Sections 9 and 12.
The quantum circuits proposed in this work that implement fully connected neural network layers with orthogonal weight matrices may use only one type of quantum gate: the Reconfigurable Beam Splitter (BS) gate. The BS gate is a parametrizable two-qubit gate. This two-qubit gate may be considered hardware efficient, and it may have one parameter: angle θ∈[0, 2π]. An example matrix representation of the BS gate, in the basis {|00⟩, |01⟩, |10⟩, |11⟩}, is given as:

BS(θ)=[[1, 0, 0, 0], [0, cos(θ), sin(θ), 0], [0, −sin(θ), cos(θ), 0], [0, 0, 0, 1]]   (1)
The BS gate may be represented by other similar matrices. For example, the rows and columns of the above matrix can be permuted, a phase element e^(iφ) may be introduced instead of the “1” at matrix position (4,4), or the two elements sin(θ) and −sin(θ) may be changed to, for example, i*sin(θ) and i*sin(θ). The above BS gate can also be decomposed into a set of two- and one-qubit parametrized gates. All these gates are practically equivalent, and our methods can use any of them. Thus, as used herein, “BS gate” may refer to any of these gates. Here are some specific examples of alternative BS gates, however, this list is not exhaustive:
BS1(θ)=[[1, 0, 0, 0], [0, cos(θ), −i*sin(θ), 0], [0, −i*sin(θ), cos(θ), 0], [0, 0, 0, 1]];
BS2(θ)=[[1, 0, 0, 0], [0, cos(θ), sin(θ), 0], [0, sin(θ), −cos(θ), 0], [0, 0, 0, 1]];
BS3(θ, φ)=[[1, 0, 0, 0], [0, cos(θ), −i*sin(θ), 0], [0, −i*sin(θ), cos(θ), 0], [0, 0, 0, e^(−iφ)]]; and
BS4(θ, φ)=[[1, 0, 0, 0], [0, cos(θ), sin(θ), 0], [0, −sin(θ), cos(θ), 0], [0, 0, 0, e^(−iφ)]].
We can think of the BS gate as a rotation in the two-dimensional subspace spanned by the basis {|01⟩, |10⟩}, while it acts as the identity in the remaining subspace {|00⟩, |11⟩}. Or equivalently, starting with two qubits, one in the |0⟩ state and the other one in the state |1⟩, the qubits can be swapped or not in superposition. The qubit |1⟩ stays on its wire with amplitude cos θ or switches with the other qubit with amplitude +sin θ if the new wire is below (|10⟩→|01⟩) or −sin θ if the new wire is above (|01⟩→|10⟩). Note that in the two other cases (|00⟩ and |11⟩) the BS gate acts as identity.
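As an illustration, this rotation picture can be checked numerically. The following minimal sketch (our own illustration; it assumes the sign convention of Eq. (1) and the basis ordering |00⟩, |01⟩, |10⟩, |11⟩) builds the BS matrix and applies it to the two unary two-qubit states.

```python
import numpy as np

def bs_gate(theta: float) -> np.ndarray:
    """4x4 BS (Reconfigurable Beam Splitter) gate in the basis |00>, |01>, |10>, |11>."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([
        [1, 0, 0, 0],
        [0, c, s, 0],
        [0, -s, c, 0],
        [0, 0, 0, 1],
    ])

theta = 0.3
U = bs_gate(theta)
e01 = np.array([0, 1, 0, 0])   # |01>
e10 = np.array([0, 0, 1, 0])   # |10>
print(U @ e01)                  # cos(theta)|01> - sin(theta)|10>
print(U @ e10)                  # sin(theta)|01> + cos(theta)|10>
print(np.allclose(U.T @ U, np.eye(4)))   # True: the gate is orthogonal
```

The last check illustrates why any circuit built only from such gates, with real parameters, corresponds to an orthogonal matrix.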
2.2 Quantum Pyramidal Circuit
We now propose a quantum circuit that implements an orthogonal layer of a neural network. The circuit is a pyramidal structure of BS gates, each with an independent angle. More details are provided below concerning the input loading and the equivalence with a neural network's orthogonal layer.
To mimic a given classical layer with a quantum circuit, the number of output qubits may be the size of the classical layer's output. We refer to the square case when the input and output sizes are equal, and to the rectangular case otherwise.
One property to note is that the number of parameters of the quantum pyramidal circuit corresponding to a neural network layer of size n×d is (2n−1−d)*d/2, which is the same as the number of degrees of freedom of an orthogonal matrix of dimension n×d (the least number of parameters that uniquely define the orthogonal matrix).
For simplicity, our analysis considers the square case (i.e., n input nodes and n output nodes) but everything can be easily extended to the rectangular case (i.e., n input nodes and p≠n output nodes). As stated, the pyramidal structure of the quantum circuit described above imposes the number of free parameters to be N=n(n−1)/2, which is the exact number of free parameters to specify an n×n orthogonal matrix. Said differently, there is an efficient one-to-one mapping between the N=n(n−1)/2 parameter angles {θi: i ∈[N]} of the gates in the inverted pyramid and the N=n(n−1)/2 degrees of freedom of an n×n orthogonal matrix W with entries wij. In the example case of
In Section 3 we show how the parameters of the gates of this pyramidal circuit can be related to the elements of the orthogonal matrix of size n×n that describes it. We note that alternative architectures can be imagined as long as the number of gate parameters is equal to the parameters of the orthogonal weights matrix and a (e.g., simple) mapping between them and the elements of the weights matrix can be found.
Note that this pyramid circuit has linear depth and is convenient for near-term quantum hardware platforms with restricted connectivity. Indeed, the example distribution of the BS gates uses only nearest-neighbor connectivity between qubits in the circuit diagram. However, alternative versions may or may not use nearest-neighbor connectivity (examples later).
2.3 Loading the Data
Before applying the quantum pyramidal circuit, the classical data may be uploaded into the quantum circuit. We may use one qubit per feature of the data. For this, we use a unary amplitude encoding of the input data. Let's consider an input sample x=(x0, . . . , xn−1) ∈ ℝ^n, such that ∥x∥2=1. The sample can be encoded in a superposition of unary states:

|x⟩ = x0|10 . . . 0⟩ + x1|010 . . . 0⟩ + . . . + xn−1|0 . . . 01⟩   (2)
The previous state can be rewritten using |ei⟩ to represent the ith unary state, with a |1⟩ in the ith position (|0 . . . 010 . . . 0⟩), as:

|x⟩ = Σ_{i=0}^{n−1} xi|ei⟩   (3)
Although a logarithmic depth data loader circuit can be used for loading such states, a simpler circuit may be used. It is a linear depth cascade of n−1 BS gates which, due to the structure of our quantum pyramidal circuit, may only add 2 extra steps to our pyramid circuit. An example of this linear depth cascade circuit (also referred to as the “diagonal loader”) is illustrated in the accompanying figures (data loader circuit 405). The angle parameters α0, . . . , αn−2 may be classically pre-computed from the input vector. Note that the data loader circuit 405 includes an X gate (not illustrated) that flips the first qubit from the |0⟩ state to the |1⟩ state.
Generally, a data loader circuit starts in the all-|0⟩ state and flips a first qubit using an X gate, in order to obtain the unary state |10 . . . 0⟩ (e.g., as shown in the accompanying figures). A cascade of BS gates then loads the state |x⟩ using a set of n−1 angles α0, . . . , αn−2. Using Eq.(1), the angles are chosen such that, after the first BS gate of the loader, the qubits would be in the state x0|100 . . .⟩ + sin(α0)|010 . . .⟩ and, after the second one, in the state x0|100 . . .⟩ + x1|010 . . .⟩ + sin(α0)sin(α1)|001 . . .⟩, and so on, until obtaining |x⟩ as in Eq.(2). To this end, classical preprocessing may be performed to compute recursively the n−1 loading angles, in time O(n): α0 = arccos(x0), and more generally αj = arccos(xj/(sin(α0) · · · sin(αj−1))) for j ≥ 1.
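As an illustration, this recursive angle computation can be implemented classically in O(n) time. The sketch below is a minimal version under the conventions above (x0 = cos α0, x1 = sin α0 cos α1, and so on); the function name and the handling of edge cases are our own assumptions, not a reference implementation.

```python
import numpy as np

def diagonal_loader_angles(x: np.ndarray) -> np.ndarray:
    """Angles alpha_0..alpha_{n-2} such that the cascade of BS gates maps
    |10...0> to sum_i x_i |e_i>, for a normalized input x.

    Note: with this simple recursion the last component is assumed non-negative."""
    x = np.asarray(x, dtype=float)
    x = x / np.linalg.norm(x)              # enforce ||x||_2 = 1
    n = len(x)
    alphas = np.zeros(n - 1)
    remaining = 1.0                        # product sin(alpha_0)...sin(alpha_{k-1})
    for k in range(n - 1):
        if remaining < 1e-12:              # the tail of the vector is (numerically) zero
            break
        alphas[k] = np.arccos(np.clip(x[k] / remaining, -1.0, 1.0))
        remaining *= np.sin(alphas[k])
    return alphas

# quick check: rebuild the amplitudes from the angles
x = np.array([0.5, 0.5, 0.5, 0.5])
a = diagonal_loader_angles(x)
amps, prod = [], 1.0
for k in range(len(x) - 1):
    amps.append(prod * np.cos(a[k]))
    prod *= np.sin(a[k])
amps.append(prod)
print(np.allclose(amps, x))   # True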
The ability to load data in such a way relies on the assumption that each input vector is normalized, i.e., ∥x∥2=1. This normalization constraint could seem arbitrary and could impact the ability to learn from the data. In fact, in the case of orthogonal neural networks, this normalization shouldn't degrade the training because orthogonal weight matrices are in fact orthonormal and thus norm-preserving. Hence, changing the norm of the input vector, by dividing each component by ∥x∥2, in both the classical and quantum settings is not a problem. The normalization would impose that each input has the same norm, or the same “luminosity” in the context of images, which can be helpful or harmful depending on the use case.
2.4 Additional Information on Data Loader Circuits
The first step of the data loading, given access to a classical data point (e.g., x=(x1, x2, . . . , xd)), is to pre-process the classical data efficiently, e.g., spending only O(d) total time (where the logarithmic factors are hidden), in order to create a set of parameters (e.g., θ=(θ1, θ2, . . . , θd−1)), that will be the parameters of the (d−1) two-qubit gates used in our quantum data loader circuit. During pre-processing, we may also keep track of the norms of the vectors. Note that these angle parameters are different depending on which data loader circuit is used.
We may use three different types of data loader circuits.
The shallowest data loader circuit is the parallel data loader circuit (example in
An example method for constructing a parallel data loader circuit is the following. We start with all qubits initialized to the 0 state. In the first step, we apply an X gate on the first qubit. Then, the circuit is constructed by adding BS gates in layers, using the angles θ we constructed before. The first layer has 1 BS gate, the second layer has 2 BS gates, the third layer has 4 BS gates, until the log(d)-th layer that has d/2 gates. The qubits to which the gates are added follow a tree structure (e.g., a binary tree structure). In the first layer we have one BS gate between qubits (0,d/2) with angle θ1, in the second layer we have two BS gates between (0,d/4) with angle θ2 and (d/2,3d/4) with angle θ3, in the third layer there are four BS gates between qubits (0,d/8) with angle θ4, (d/4,3d/8) with angle θ5, (d/2,5d/8) with angle θ6, (3d/4,7d/8) with angle θ7, and so forth for the other layers. Parallel data loader circuits are also described in U.S. patent application Ser. No. 16/986,553 filed on Aug. 6, 2020, which is incorporated herein by reference.
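The tree-structured gate placement described in this paragraph can be generated programmatically. The following sketch (illustrative only; the helper name and the assumption that d is a power of two are ours) lists the qubit pairs of each of the log(d) layers.

```python
import math

def parallel_loader_layout(d: int):
    """Qubit pairs on which BS gates act, layer by layer, for the parallel loader.

    Assumes d is a power of two. Layer k (k = 1..log2(d)) contains 2**(k-1)
    BS gates; the gate rooted at qubit q couples qubits (q, q + d // 2**k)."""
    layers = []
    depth = int(math.log2(d))
    for k in range(1, depth + 1):
        step = d // (2 ** k)
        roots = range(0, d, 2 * step)
        layers.append([(q, q + step) for q in roots])
    return layers

for layer in parallel_loader_layout(8):
    print(layer)
# [(0, 4)]
# [(0, 2), (4, 6)]
# [(0, 1), (2, 3), (4, 5), (6, 7)]
```

For d = 8 this reproduces the pairs (0, d/2), then (0, d/4) and (d/2, 3d/4), then the four nearest-neighbor pairs, matching the description above.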
The two other types of data loader circuits may have worse asymptotic depth (in other words, larger depth) but fewer BS gates that are applied to non-nearest-neighbor qubits.
The diagonal data loader uses d qubits and d−1 BS gates that may be applied to nearest neighboring qubits in the circuit diagram (e.g., see
The semi-diagonal loader similarly uses d qubits and d−1 BS gates that may be applied to nearest neighboring qubits in the circuit diagram (e.g., see
To determine which data loader circuit to use, we typically choose a data loader that increases the depth the least. With the pyramid circuit, for instance, the diagonal data loader circuit fits well, despite its large intrinsic depth (as described above). However, for other neural network layer circuits, this may not be the case. When there is no such trick, the parallel loader is typically preferred because of its small depth.
3.1 Brief Description of Feedforward Pass
Given the angles, one can find the unique matrix, and given the matrix one can uniquely specify the angles. To get a wij entry of the weight matrix for a layer, we take the sum of expressions from (e.g., all) possible paths from qubit j to qubit i using the following rules:
(A) If we pass by any gate with angle θn, we multiply with cos(θn).
(B) If we go up on any gate with angle θn, we multiply with −sin(θn).
(C) If we go down on any gate with angle θn, we multiply with sin(θn).
Calculating the weight matrix in this or similar manner can be done efficiently using various techniques like recursion, dynamic programming, or applying the gates to the weight matrix in the appropriate order since this is similar to the implementation of the BS gates described above.
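One concrete way to carry this out is to apply each BS gate's 2×2 planar rotation to an identity matrix, gate by gate; the resulting matrix is the orthogonal weight matrix whose entries are exactly the path sums of rules (A)–(C). The sketch below is a minimal version; the gate enumeration in pyramid_gates is one possible ordering with the correct gate count, not necessarily the ordering of the figures.

```python
import numpy as np

def pyramid_gates(n):
    """One possible ordering of the n(n-1)/2 nearest-neighbor BS gates of a
    pyramid circuit (our own enumeration; the figures may order them differently)."""
    gates = []
    for col in range(n - 1):
        for i in range(n - 2, col - 1, -1):
            gates.append(i)                 # gate acts on wires (i, i+1)
    return gates

def weight_matrix(n, gates, thetas):
    """Build W by applying each gate's planar rotation to the rows of an identity."""
    W = np.eye(n)
    for i, th in zip(gates, thetas):
        c, s = np.cos(th), np.sin(th)
        row_i, row_j = W[i].copy(), W[i + 1].copy()
        W[i]     = c * row_i - s * row_j    # "stay" (cos) and "go up" (-sin) amplitudes
        W[i + 1] = s * row_i + c * row_j    # "go down" (+sin) and "stay" (cos) amplitudes
    return W

n = 4
gates = pyramid_gates(n)                    # 6 gates for n = 4
rng = np.random.default_rng(0)
thetas = rng.uniform(0, 2 * np.pi, len(gates))
W = weight_matrix(n, gates, thetas)
print(np.allclose(W.T @ W, np.eye(n)))      # True: W is orthogonal by construction
```

Applying the gates in order, as above, is simply the dynamic-programming version of the explicit path enumeration.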
To obtain the angles from a given orthogonal matrix, we traverse the orthogonal matrix column by column from right to left and going from bottom to top (until before the anti-diagonal element) in each column. For example, see
We can combine such layers sequentially to create a larger quantum neural network. Between each layer, one can measure the quantum states, apply a non-linearity, and then upload the data to the next layer. For example, see
We can also add another quantum layer before or after the pyramidal structure to construct different architectures that encompass our construction.
To load the data for each layer, we can use the construction in
3.2 Detailed Description of Feedforward Pass
The following paragraphs further describe subject matter in Section 3.1 above.
In this section we detail the effect of the quantum pyramidal circuit on an input encoded in a unary basis, as in Eq.(2). We will also see in the end how to simulate this quantum circuit classically with a small overhead and thus be able to provide a fully classical scheme.
Let's first consider one pure unary input, where only the qubit j is in state |1⟩ (e.g., |00000010⟩). This unary input is transformed into a superposition of unary states, each with an amplitude. If we consider again only one of these possible unary outputs, where only the qubit i is in state |1⟩, its amplitude can be interpreted as a conditional amplitude to transfer the |1⟩ from qubit j to qubit i. Intuitively, this value is the sum of the quantum amplitudes associated to each possible path that connects qubit j to qubit i, as shown in the accompanying figures. Using this image of connectivity between input and output qubits, we can construct a matrix W ∈ ℝ^(n×n), where each element Wij is the overall conditional amplitude to transfer the |1⟩ from qubit j to qubit i.
W56 = −s(θ16)c(θ22)s(θ23) − s(θ16)c(θ17)c(θ23)c(θ24) + s(θ16)s(θ17)c(θ18)s(θ24)   (5)

where s(·) and c(·) denote sin and cos.
In fact, the n×n matrix W can be seen as the unitary matrix of our quantum circuit if we solely consider the unary basis, which is specified by the parameters of the quantum gates. A unitary matrix is in general complex, but in our case, since we only use real operations, the matrix is orthogonal. This shows the correspondence between the matrix W and the pyramidal quantum circuit.
The full unitary UW in the Hilbert space of our n-qubit quantum circuit is a 2^n×2^n matrix with the n×n matrix W embedded in it as a submatrix on the unary basis. This is achieved by loading the data as unary states and by using only BS gates that keep the number of 0s and 1s constant.
Let's consider an input vector x ∈ ℝ^n encoded as a quantum state |x⟩ = Σ_{i=0}^{n−1} xi|ei⟩, where |ei⟩ represents the ith unary state, as explained in Eq.(3). By definition of W, each unary state |ei⟩ will undergo a proper evolution |ei⟩ → Σ_{j=0}^{n−1} Wij|ej⟩. This yields, by linearity, the following mapping:

|x⟩ → Σ_{j=0}^{n−1} (Σ_{i=0}^{n−1} Wij xi) |ej⟩   (6)
As explained above, our quantum circuit is equivalently described by the sparse unitary UW ∈ ℝ^(2^n×2^n). This can be summarized with:

UW|x⟩ = |Wx⟩   (7)
We see from Eq.(6) and Eq.(7) that the output is in fact |y⟩, the unary encoding of the vector y = Wx, which is the output of a matrix multiplication between the n×n orthogonal matrix W and the input x ∈ ℝ^n. As expected, each element of y is given by yk = Σ_{i=0}^{n−1} Wik xi.
Therefore, for any given neural network's orthogonal layer, there may be a quantum pyramidal circuit that reproduces it. On the other hand, any quantum pyramidal circuit may be implementing an orthogonal layer of some sort.
Additional details concerning multi-layers branching, the tomography at the end of each layer, and the way to apply the non linearities are given in Section 10.
Thus, the quantum circuits proposed in this work can rightfully be called “quantum neural networks,” even though this term has also been employed for arbitrary variational circuits that present some conceptual similarities to neural networks. With our quantum pyramidal circuits, we control and understand the quantum mapping: it implements each layer and its non-linearity in a modular way. Our orthogonal quantum neural networks are also different regarding the training strategies (see Section 4 for details).
3.1 Classical Implementation
While the quantum pyramidal circuit is presented as the inspiration of the new methods for orthogonal neural networks, these quantum circuits can be simulated classically on a classical computing system with a small overhead, thus yielding classical methods for orthogonal neural networks.
The classical algorithm may be the simulation of the quantum pyramidal circuit, where each BS gate is replaced by a planar rotation between its two inputs.
As shown in the accompanying figures, the classical implementation applies, timestep by timestep, the n(n−1)/2 planar rotations of the pyramid, for a total of O(n²) basic operations. Therefore, our single layer feedforward pass has the same complexity O(n²) as the usual matrix multiplication y=Wx. The angles and the weights can be chosen such that our classical pyramidal circuit is equivalent to the quantum pyramidal circuit described above.
One may still have an advantage in performing the quantum circuit for inference, since the quantum circuit has depth O(n), instead of the O(n²) classical complexity of the matrix-vector multiplication. Nevertheless, as discussed below, an advantage of our methods is that orthogonal weights matrices may be trained classically in time O(n²), instead of the previously best-known O(n³).
Described differently, inference on any input data can be done by sequentially applying each layer of the neural network. This is equivalent to multiplying the input by the generated orthogonal weight matrix. For an (n×n) layer, classically this takes time O(n²), the time to multiply an n×n matrix with an n-dimensional input vector, while the quantum circuit can perform this multiplication with O(n) steps, since the depth of the quantum circuit is O(n).
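A minimal sketch of this classical inference path is shown below: the input vector is rotated in place, one BS gate at a time, so a full pyramid costs O(n²) operations while preserving the norm. The toy gate sequence here is deliberately short and is not the full pyramid of the figures.

```python
import numpy as np

def apply_bs_sequence(x, gates, thetas):
    """Apply a sequence of BS planar rotations directly to the input vector x.

    `gates` is an ordered list of wire indices i (each gate acts on wires i, i+1);
    the cost is one 2x2 rotation per gate, i.e. O(n^2) for a full pyramid."""
    y = np.array(x, dtype=float)
    for i, th in zip(gates, thetas):
        c, s = np.cos(th), np.sin(th)
        y[i], y[i + 1] = c * y[i] - s * y[i + 1], s * y[i] + c * y[i + 1]
    return y

x = np.array([0.6, 0.8, 0.0])
gates  = [0, 1, 0]                    # wires on which successive BS gates act (toy example)
thetas = [0.3, 1.1, -0.4]
y = apply_bs_sequence(x, gates, thetas)
print(y, np.linalg.norm(y))           # the norm is preserved: 1.0
```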
4 OrthoNN training: Angle's Gradient Estimation and Orthogonal Matrix Update
4.1 Brief Description of OrthoNN Training
For clarity, the remaining paragraphs of this section rephrase the above description.
Unlike in classical feed-forward neural networks, gradient descent is performed on the BS gate angles directly and not on the weight matrix elements. This can be performed in multiple ways, such as batch gradient descent, stochastic gradient descent, etc., with a suitable learning rate. Mathematically, the update rule may be θi ← θi − η(∂C/∂θi), and can use different kinds of optimizers, like Adam, RMSprop, Yogi, etc.
To calculate the gradient of the cost function C with respect to the angle of the BS gates, the error may be backpropagated not just over the layers of the network, but also over the mini layers (also referred to as “timesteps”), which we denote by Δℓ and δλ, respectively.
Δℓ is the vector representing the error (gradient) with respect to zℓ for the layer, that is Δℓ = ∂C/∂zℓ. δλ is the vector representing the error (gradient) with respect to the input ζλ to the mini layer, that is δλ = ∂C/∂ζλ.
The values of δλ may be calculated in the following way: δλ = (wλ)T·δλ+1, where wλ is the matrix corresponding to the mini layer (timestep) λ. For the last timestep, the first to be calculated, we have δλmax = (wλmax)T·Δℓ.
For calculating the values of Δℓ, the following equation can be used: Δℓ−1 = δ0⊙σ′(zℓ−1), where σ′ is the derivative of the activation function σ.
For a BS gate with angle θ acting on the qubits i and i+1 in the mini layer λ, the gradient ∂C/∂θ can be derived as a simple closed-form expression involving cos(θ), sin(θ), and the components i and i+1 of the vectors ζλ and δλ+1, which can be calculated in constant time. See also Equation 9 and the corresponding figure.
With a correct and efficient implementation of the above architecture and learning algorithm, we observe that the time taken for each layer to calculate and update the weights scales as O(nm) for a layer with n inputs and m outputs, for each data point. This is as good as classical non-orthogonal neural networks and provides the advantages offered by orthogonality. The forward pass (inference only), once the model is trained, gives a quadratic speedup on a quantum computer, as it scales as O(n) instead of O(nm) in the classical case.
4.2 Detailed Description of OrthoNN Training
The following paragraphs further describe subject matter in Section 4.1 above.
An introduction and notation for backpropagation in fully connected neural networks is described in Section 8.
When using quantum circuits to implement layers of a neural network, the parameters to update are no longer the individual elements of the weight matrices directly but may be the angles of the BS gates that give rise to these matrices. Thus, we design an adaptation of the backpropagation method to our setting based on the angles.
We start by introducing some notation for a single layer of the neural network; the layer index is not made explicit in the notation for simplicity. We assume we have as many output qubits as input qubits, but this can easily be extended to the rectangular case.
We first introduce the notion of timesteps inside each neural network layer, which correspond to the computational steps in the pyramidal structure of the circuit. We note ζλ the inner vector obtained after timestep λ, and we use the layer notation of Section 8. In fact, we have the correspondences ζ0 = aℓ−1 for the first inner layer, which is the input of the actual layer, and zℓ = wλmax·ζλmax for the last timestep. We also have Wℓ = wλmax . . . w1w0. We use the same kind of notation for the backpropagation errors. At each timestep λ we define an inner error δλ = ∂C/∂ζλ. This definition is similar to the layer error Δℓ = ∂C/∂zℓ. In fact, the same backpropagation formulas may be used, without non-linearities, to retrieve each inner error vector δλ = (wλ)T·δλ+1. In particular, for the last timestep, the first to be calculated, we have δλmax = (wλmax)T·Δℓ. Finally, we can retrieve the error at the previous layer ℓ−1 using the correspondence Δℓ−1 = δ0⊙σ′(zℓ−1).
The reason for this breakdown into timesteps is the ability to efficiently obtain the gradient with respect to each angle. Let's consider one gate with angle θi, acting at the timestep λ on qubits i and i+1. We decompose the gradient ∂C/∂θi using each component, indexed by the integer k, of the inner layer and inner error vectors: ∂C/∂θi = Σ_k (∂C/∂ζλ+1_k)(∂ζλ+1_k/∂θi) = Σ_k δλ+1_k (∂ζλ+1_k/∂θi). Since timestep λ is only composed of separated BS gates, the matrix wλ consists of diagonally arranged 2×2 block submatrices of the form given in Eq. (1). Only one of these submatrices depends on the angle θi considered here, at the positions i and i+1 in the matrix. We can thus rewrite the above gradient keeping only the terms k = i and k = i+1: ∂C/∂θi = δλ+1_i (∂ζλ+1_i/∂θi) + δλ+1_{i+1} (∂ζλ+1_{i+1}/∂θi), where each partial derivative is a simple combination of ±sin(θi), cos(θi) and the stored components ζλ_i and ζλ_{i+1}.
Therefore, we have shown a way to compute each angle gradient: during the feedforward pass, sequentially apply each of the 2n−3=O(n) timesteps and store the resulting vectors (the inner layers ζλ). During the backpropagation, obtain the inner errors δλ by applying the timesteps in reverse. To do this, we “back-propagate” the errors by calculating first δλ+1 and then δλ, from λmax down to 0. Afterwards, a gradient descent method may be used on each angle θi, while preserving the orthogonality of the overall equivalent weight matrix: θi ← θi − η(∂C/∂θi).
An interesting aspect of this gradient descent is the fact that the optimization is performed in the angle landscape, and not on the equivalent weight landscape. These landscapes can potentially be different and hence the optimization can produce different models.
As one can see from the above description, this is a classical algorithm to obtain the angle's gradients, which allows the OrthoNN to be trained efficiently classically while preserving the strict orthogonality.
To obtain the angle's gradient, 2n−3 inner layers ζλ may be stored during the feedforward pass. Next, given the error at the following layer, a backward loop on each timestep may be performed (see
Thus, this classical algorithm allows the gradients of the n(n−1)/2 angles to be computed in time O(n²), in order to perform a gradient descent respecting the strict orthogonality of the weight matrix. This is considerably faster than previous methods based on Singular Value Decomposition and provides a training method which is as fast as for normal neural networks (e.g., see Table 1), while providing the extra property of orthogonality.
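The following self-contained sketch illustrates the training step described above on a single orthogonal layer: a forward pass that stores the inner vectors ζλ, a backward pass that propagates the inner errors δλ through the transposed rotations and accumulates one gradient per angle in constant time, and a finite-difference check on a toy quadratic cost. For simplicity each timestep here holds a single gate; the helper names and the toy loss are our own.

```python
import numpy as np

def forward(x, gates, thetas):
    """Forward pass; returns the output and the stored inner vectors zeta^lambda.

    Each 'timestep' here holds a single BS gate (wire index i); grouping disjoint
    gates into the 2n-3 timesteps described in the text is an easy change."""
    zetas = [np.array(x, dtype=float)]
    for i, th in zip(gates, thetas):
        z = zetas[-1].copy()
        c, s = np.cos(th), np.sin(th)
        z[i], z[i + 1] = c * z[i] - s * z[i + 1], s * z[i] + c * z[i + 1]
        zetas.append(z)
    return zetas[-1], zetas

def backward(delta_out, gates, thetas, zetas):
    """Back-propagate the output error through the BS gates; return the per-angle
    gradients and the error at the layer input."""
    grads = np.zeros(len(gates))
    delta = np.array(delta_out, dtype=float)
    for k in reversed(range(len(gates))):
        i, th = gates[k], thetas[k]
        c, s = np.cos(th), np.sin(th)
        zi, zj = zetas[k][i], zetas[k][i + 1]          # inputs of this gate
        # dC/dtheta uses only the two touched components (constant time per gate)
        grads[k] = delta[i] * (-s * zi - c * zj) + delta[i + 1] * (c * zi - s * zj)
        # propagate the error through the transposed rotation
        delta[i], delta[i + 1] = c * delta[i] + s * delta[i + 1], -s * delta[i] + c * delta[i + 1]
    return grads, delta

# numerical check against finite differences on a toy loss C = 0.5*||y - t||^2
rng = np.random.default_rng(1)
gates, thetas = [0, 1, 0], rng.uniform(0, 2 * np.pi, 3)
x, t = np.array([0.6, 0.8, 0.0]), np.array([0.0, 1.0, 0.0])
y, zetas = forward(x, gates, thetas)
grads, _ = backward(y - t, gates, thetas, zetas)
eps, k = 1e-6, 1
tp = thetas.copy(); tp[k] += eps
num = (0.5 * np.sum((forward(x, gates, tp)[0] - t) ** 2)
       - 0.5 * np.sum((y - t) ** 2)) / eps
print(np.isclose(grads[k], num, atol=1e-4))            # True
```

Because only the angles are ever updated, the equivalent weight matrix stays exactly orthogonal after every gradient step.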
We performed basic numerical experiments to verify the learning abilities of the pyramidal circuit using a classical simulation. In these experiments, we use a dataset of handwritten digits (in this case, the standard MNIST dataset) to compare our pyramidal OrthoNN with an SVB algorithm.
This disclosure describes training methods for orthogonal neural networks (OrthoNNs) that run in quadratic time, which is a significant improvement over previous methods based on Singular Value Decomposition.
One idea of our methods is to replace the usual weights and orthogonal matrices by an equivalent pyramidal circuit made of two-dimensional rotations. Each rotation is parametrizable by an angle, and the gradient descent takes place in the angle's optimization landscape. This unique type of gradient backpropagation may ensure a perfect orthogonality of the weights matrices while improving the running time compared to previous works. Moreover, both classical and quantum methods may be used for inference, where the forward pass on a near term quantum computing system may provide a provable advantage in the running time. This disclosure also expands the field of quantum deep learning by introducing new tools, concepts, and equivalences with classical deep learning theory.
The idea behind Orthogonal Neural Networks (OrthoNNs) is to add a constraint to the weight matrices corresponding to the layers of a neural network. Imposing orthogonality on these matrices has theoretical and practical benefits for the generalization error. Orthogonality may ensure low weight redundancy and may preserve the magnitude of the weight matrix's eigenvalues to avoid vanishing gradients. In terms of complexity, for a single layer, the feedforward pass of an OrthoNN is a matrix multiplication, hence has a running time of O(n²) if n×n is the size of the orthogonal matrix.
A difficulty of OrthoNNs is to preserve the orthogonality of the matrices while updating them during gradient descent. Several algorithms have been proposed to this end, but they all indicate that strict orthogonality is computationally hard to preserve.
As used herein, an orthogonal matrix refers to a real square matrix whose columns and rows are orthonormal vectors. One way to express this is QTQ=QQT=I, where QT is the transpose of Q and I is the identity matrix.
Backpropagation in a fully connected neural network is an efficient procedure to update the weight matrix at each layer. At layer ℓ, we note its weight matrix Wℓ and biases bℓ. Each layer is followed by a nonlinear function σ, and can therefore be written as

aℓ = σ(Wℓ·aℓ−1 + bℓ) = σ(zℓ)   (10)
After the last layer, one can define a cost function C that compares the output to the ground truth. The goal is to calculate the gradient of C with respect to each weight and bias, namely ∂C/∂Wℓ and ∂C/∂bℓ.
In the backpropagation, the method calculates these gradients for the last layer, then propagates back to the first layer.
The error vector at layer ℓ may be defined by Δℓ = ∂C/∂zℓ. One can show the backward recursive relation Δℓ = ((Wℓ+1)T·Δℓ+1)⊙σ′(zℓ), where ⊙ symbolizes the Hadamard product, or entry-wise multiplication. Note that the previous computation applies the layer (matrix multiplication) in reverse. We can then show that each element of the weight gradient matrix at layer ℓ is given by ∂C/∂Wℓ_jk = Δℓ_j·aℓ−1_k. Similarly, the gradient with respect to the biases is given by ∂C/∂bℓ_j = Δℓ_j. Once these gradients are computed, the parameters may be updated using the gradient descent rule, with learning rate η (note that η may be the same or different than the η used in Section 4): Wℓ_jk ← Wℓ_jk − η(∂C/∂Wℓ_jk) and bℓ_j ← bℓ_j − η(∂C/∂bℓ_j).
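For concreteness, the recursions of this section can be written in a few lines of numpy for a toy two-layer network with a quadratic cost (the tanh activation, sizes, and cost are our own choices). Note that, unlike the angle update of Section 4, this update acts on the weight matrix entries directly and therefore does not preserve orthogonality.

```python
import numpy as np

def sigma(z):  return np.tanh(z)
def dsigma(z): return 1.0 - np.tanh(z) ** 2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(3, 4)), rng.normal(size=3)
x, target = rng.normal(size=4), np.array([1.0, 0.0, 0.0])

# forward pass, Eq. (10)
z1 = W1 @ x + b1;  a1 = sigma(z1)
z2 = W2 @ a1 + b2; a2 = sigma(z2)
C = 0.5 * np.sum((a2 - target) ** 2)

# backward pass: Delta^l = dC/dz^l, then weight and bias gradients
Delta2 = (a2 - target) * dsigma(z2)
Delta1 = (W2.T @ Delta2) * dsigma(z1)
grad_W2, grad_b2 = np.outer(Delta2, a1), Delta2   # dC/dW_jk = Delta_j * a_prev_k
grad_W1, grad_b1 = np.outer(Delta1, x),  Delta1

# gradient descent step with learning rate eta
eta = 0.1
W2 -= eta * grad_W2; b2 -= eta * grad_b2
W1 -= eta * grad_W1; b1 -= eta * grad_b1
```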
This section provides a succinct quantum information background that may be helpful for this work.
9.1 Qubits
In classical computing, a bit can be either 0 or 1. From a quantum information perspective, a quantum bit or qubit can be in state |0⟩ or |1⟩. We use the braket notation |⋅⟩ to specify the quantum nature of the bit. The qubits can be in a superposition of both states, α|0⟩+β|1⟩, where α, β ∈ ℂ are such that |α|²+|β|²=1. The coefficients α and β are called amplitudes. The probabilities of observing either 0 or 1 when measuring the qubit are linked to the amplitudes:

p(0)=|α|², p(1)=|β|²   (12)
As quantum physics teaches us, any superposition is possible before the measurement, which gives special abilities in terms of computation. With n qubits, 2^n possible binary combinations (e.g., |01 . . . 1001⟩) can exist simultaneously, each with its own amplitude.
An n-qubit system can be represented as a normalized vector in a 2^n dimensional Hilbert space. A multiple qubit system is called a quantum register. If |p⟩ and |q⟩ are two quantum states or quantum registers, the whole system can be represented as a tensor product |p⟩⊗|q⟩, also written as |p⟩|q⟩ or |p, q⟩.
9.2 Quantum Computation
As with logical gates in classical circuits, qubits or quantum registers are processed using quantum gates. A gate is a unitary mapping in the Hilbert space, preserving the unit norm of the quantum state vector. Therefore, a quantum gate acting on n qubits is a unitary matrix U ∈ ℂ^(2^n×2^n).
Common single qubit gates include the Hadamard gate H = (1/√2)[[1, 1], [1, −1]], that maps |0⟩ to (|0⟩+|1⟩)/√2 and |1⟩ to (|0⟩−|1⟩)/√2, creating the quantum superposition; the NOT gate X = [[0, 1], [1, 0]], that permutes |0⟩ and |1⟩; and the Ry rotation gate parametrized by an angle θ, given by Ry(θ) = [[cos(θ/2), −sin(θ/2)], [sin(θ/2), cos(θ/2)]].
Common two-qubit gates include the CNOT gate, which is a NOT gate applied on the second qubit only if the first one is in state |1⟩, or similarly the CZ gate, which applies a Z (phase flip) gate on the second qubit only if the first one is in state |1⟩.
In this work, we use the BS gate. In some embodiments, this gate can be implemented either as a native gate, known as FSIM, or using four Hadamard gates, two Ry rotation gates, and two two-qubit CZ gates. An example of this circuit is illustrated in the accompanying figures.
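The exact decomposition is shown in the referenced figure; one decomposition consistent with the gate counts above (four Hadamards, two Ry rotations, two CZ gates) can be checked numerically as follows. The specific ordering and the Ry angle convention used here are our assumptions and may differ from the figure.

```python
import numpy as np

H  = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CZ = np.diag([1, 1, 1, -1])

def ry(t):
    """Standard Ry rotation, exp(-i*t*Y/2)."""
    return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                     [np.sin(t / 2),  np.cos(t / 2)]])

def bs(theta):
    """BS gate of Eq. (1) in the basis |00>, |01>, |10>, |11>."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0, 0],
                     [0, c, s, 0],
                     [0, -s, c, 0],
                     [0, 0, 0, 1]])

theta = 0.7
decomp = (np.kron(H, H) @ CZ @ np.kron(ry(theta), ry(-theta)) @ CZ @ np.kron(H, H))
print(np.allclose(decomp, bs(theta)))   # True
```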
One advantage of quantum gates is their ability to be applied to a superposition of inputs. Indeed, given a gate U such that U|x⟩ = |f(x)⟩, it can be applied to all possible combinations of x at once: U(Σ_x αx|x⟩) = Σ_x αx|f(x)⟩.
10.1 Tomography and Error Mitigation
As shown above, the quantum pyramidal circuit outputs the quantum state |y⟩ = |Wx⟩.
. As often in quantum machine learning, it may be important to go all the way and consider the cost of retrieving classical outputs, using a procedure called tomography. In our case, this may be important since between each layer, the quantum output is converted into a classical one in order to apply a nonlinear function, and then reloaded for the next layer.
10.1.1 Error Mitigation
Before detailing the tomography procedure, it is interesting to notice that with our restriction to unary states, a strong benefit appears for error mitigation purposes. Indeed, since we may expect to obtain only quantum superpositions of unary states at every layer, measurements may be post-processed to discard measurements that include non-unary states (i.e., states with more than one qubit in state |1⟩, or the all-zero ground state). The most expected error is a bit flip between |1⟩ and |0⟩, which produces a non-unary state and is therefore discarded. The case where two bit flips happen, which would pass through our error mitigation, is even less probable.
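A minimal post-processing sketch of this mitigation step is shown below; it assumes raw measurement outcomes are available as bitstring counts (the counts themselves are made up).

```python
from collections import Counter

def postselect_unary(counts: Counter) -> Counter:
    """Keep only outcomes that are valid unary strings (exactly one qubit in state 1);
    everything else is attributed to noise and discarded."""
    return Counter({b: c for b, c in counts.items() if b.count("1") == 1})

raw = Counter({"0100": 480, "0010": 470, "0000": 30, "0110": 20})  # hypothetical shots
clean = postselect_unary(raw)
total = sum(clean.values())
print({b: c / total for b, c in clean.items()})   # probabilities over unary outcomes only
```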
10.1.2 Tomography
Retrieving the amplitudes of a quantum state comes at the cost of multiple measurements, which requires running the circuit multiple times, hence adding a multiplicative overhead in the running time. A finite number of samples is also a source of approximation in the final result. In this work, we allow for ℓ∞ errors. The ℓ∞ tomography on a quantum state |y⟩ with unary encoding on n qubits may require O(log(n)/δ²) measurements, where δ>0 is the error threshold allowed. For each j∈[n], the estimate of yj is obtained with an absolute error δ, and if |yj|<δ, it will most probably not be measured, hence set to 0. In practice, one would perform as many measurements as is convenient during the experiment and deduce the equivalent precision δ from the number of measurements made.
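The classical side of the ℓ∞ tomography is then a simple estimate-and-threshold step, sketched below with made-up counts (signs are recovered separately, as described next); in practice the number of shots would be chosen on the order of log(n)/δ².

```python
import numpy as np

def linf_tomography(counts: dict, n: int, shots: int, delta: float) -> np.ndarray:
    """Estimate |y_j| for each unary state from measurement counts; zero out
    components below the precision threshold delta (signs handled separately)."""
    est = np.zeros(n)
    for j in range(n):
        unary = "0" * j + "1" + "0" * (n - j - 1)
        p = counts.get(unary, 0) / shots
        est[j] = np.sqrt(p)
    est[est < delta] = 0.0
    return est

counts, shots = {"100": 4900, "010": 2600, "001": 2500}, 10_000   # hypothetical data
print(linf_tomography(counts, n=3, shots=shots, delta=0.05))
```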
In some embodiments, the sign of each component of the vector may be determined. Indeed, since we measure probabilities that are the squared moduli of the quantum amplitudes, the sign may not be readily apparent. In the case of neural networks, it may be important to obtain the sign of the layer's components in order to apply certain types of non-linearities. For instance, the ReLU activation function is often used to set all negative components to 0.
To the output state |y⟩ = |Wx⟩, ℓ∞ tomography may be applied. ℓ∞ tomography is a method that determines how many samples from a quantum state to take to retrieve an approximated description of it. The approximation is made with a relative error with respect to the ‘infinite norm’ (instead of the usual ‘L2’ or Euclidean norm).
The sign retrieval procedure may include three parts.
The circuit is first applied as described above, and ℓ∞ tomography is performed. The probability of measuring the unary state |e1⟩ (i.e., |100 . . .⟩) is p(e1)=y1².
The same steps are applied a second time on a modified circuit that includes additional BS gates acting on pairs of neighboring qubits. The probabilities of measuring |e1⟩ and |e2⟩ are now given by p(e1)=(y1+y2)² and p(e2)=(y1−y2)². Therefore, if p(e1)<p(e2), we have sign(y1)≠sign(y2), and if p(e1)>p(e2), we have sign(y1)=sign(y2). The same holds for the pairs (y3, y4), and so on.
The same steps are applied again, except the additional BS gates are shifted by one position below, now comparing the pairs (y2, y3), (y4, y5), and so on.
Each value yj with its sign may then be determined (e.g., assuming that y1>0). This procedure has the benefit of only adding a constant depth (in other words, the added depth doesn't grow with the number of qubits); in this case, the depth increases by one. However, this process may use three times more runs. The overall cost of the tomography procedure with sign retrieval is given by Õ(n/δ²).
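The classical post-processing of the three runs reduces to chaining pairwise sign relations, as in the following sketch (indexing from 0, and assuming the comparisons from the two modified circuits have been collected into a single list of neighboring pairs; the function name and toy numbers are ours).

```python
import numpy as np

def recover_signs(magnitudes, prob_pairs):
    """Chain pairwise sign relations, assuming the first component is positive.

    prob_pairs[j] = (p_plus, p_minus) are the probabilities of the two unary
    outcomes comparing y_j and y_{j+1}: p_plus ~ (y_j + y_{j+1})^2 and
    p_minus ~ (y_j - y_{j+1})^2, so p_plus > p_minus means equal signs."""
    y = np.array(magnitudes, dtype=float)
    sign, signed = 1.0, [y[0]]
    for j, (p_plus, p_minus) in enumerate(prob_pairs):
        sign = sign if p_plus >= p_minus else -sign
        signed.append(sign * y[j + 1])
    return np.array(signed)

mags = [0.7, 0.5, 0.5]                                  # |y| from the first run
pairs = [((0.7 - 0.5) ** 2, (0.7 + 0.5) ** 2),          # p_plus < p_minus -> sign flips
         ((0.5 + 0.5) ** 2, 0.0)]                       # p_plus > p_minus -> same sign
print(recover_signs(mags, pairs))                       # [ 0.7 -0.5 -0.5]
```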
An alternative procedure for obtaining the signs of the components of y = Wx may also be used. Compared to the above procedure, it executes a single circuit, but it may require an additional qubit, and the depth of the circuit may be 3n+O(1) instead of 2n+O(1). This circuit initializes the qubits in the state (|0⟩+|1⟩)|0 . . . 0⟩, where the last register |0 . . . 0⟩ corresponds to the n qubits that will be processed by the pyramidal circuit and the loaders. Next, applying the data loader for the normalized input vector x and then the pyramidal circuit maps the state, according to Eq.(6), to:
Then, we use an additional data loader for the uniform norm-1 vector (1/√n, . . . , 1/√n).
Note that this loader is built in the reverse order to fit the pyramid and limit the augmentation of the depth. We also apply the adjoint of this loader after a controlled operation on the first extra qubit. Recall that if a circuit U is followed by U†, it is equivalent to the identity. Therefore, this loads the uniform state only in some part of the superposition of the extra qubit:
Afterwards, a Hadamard gate mixes both parts of the amplitudes on the extra qubit:
On this state, we can see that the probability of measuring the extra qubit in state 0 and rest in the unary state ej is given by
Therefore, for each j, if after several measurements we observe
we can deduce Wjx>0. Having the sign, we can get the value
Combining with the ℓ∞ tomography and the non-linearity, the overall cost of this tomography is given by Õ(n/δ²) as well.
10.2 Multiple Quantum Layers
In the previous sections, we have seen how to implement a quantum circuit to perform the evolution of one orthogonal layer. In classical deep learning, such layers are stacked to gain in expressivity and accuracy. Between each layer, a non-linear function may be applied to the resulting vector.
The benefit of using our quantum pyramidal circuit is the ability to simply concatenate them to mimic a multi-layer neural network. After each layer, a tomography of the output state |z⟩ is performed to retrieve each component, corresponding to its quantum amplitudes. A nonlinear function σ is then applied classically to obtain a=σ(z). The next layer starts with a new unary data loader. This scheme allows us to keep the depth of the quantum circuits reasonable for NISQ devices, by applying the neural network layer by layer.
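Put together, a multi-layer forward pass alternates a quantum (or simulated) orthogonal layer with a classical read-out and nonlinearity, as in the following sketch. The function run_orthogonal_layer is a hypothetical stand-in for the load → pyramid → tomography pipeline; here it simply returns y = Wx.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def run_network(x, layers, run_orthogonal_layer):
    """Layer-by-layer execution: normalize and load the data, apply the orthogonal
    layer (quantum device or classical simulator), read the result out by
    tomography, apply the nonlinearity classically, and feed it to the next layer."""
    a = np.asarray(x, dtype=float)
    for params in layers:
        norm = np.linalg.norm(a)
        a = a / norm if norm > 0 else a       # unary amplitude encoding needs ||a|| = 1
        y = run_orthogonal_layer(a, params)   # stand-in for load -> pyramid -> tomography
        a = relu(y)                           # classical nonlinearity between layers
    return a

# toy stand-in: each "layer" is just multiplication by a random orthogonal matrix
rng = np.random.default_rng(0)
def toy_layer(x, W):
    return W @ x

Ws = [np.linalg.qr(rng.normal(size=(4, 4)))[0] for _ in range(2)]
print(run_network(rng.normal(size=4), Ws, toy_layer))
```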
In some embodiments, the circuit can include additional entangling gates after each pyramid layer (composed for instance of CNOT or CZ). This would mark a step out of the unary basis but may effectively allow to explore more interactions in the Hilbert Space.
This section describes example quantum circuits with different architectures that can be used to implement a layer of an orthogonal neural network. The details of them are summarized below in Table 2:
Descriptions of the circuits listed in Table 2 are provided below.
The pyramid circuit is described in other sections. An example pyramid circuit is illustrated in
The butterfly circuit was inspired by the butterfly circuits of the Cooley-Tukey FFT algorithm. The butterfly circuit described herein is an efficient way to characterize a reduced yet powerful class of orthogonal layers. This circuit is a low depth circuit as compared to others (log(n) depth). The butterfly layer does not characterize all the orthogonal matrices (with determinant 1) due to reduced number of parameters (n log(n)/2) but still covers a class of orthogonal matrices, like the unary Fourier Transform. This circuit may require all-to-all qubit connectivity. A parallel data loader may be preferred with this circuit.
An example butterfly circuit is illustrated in
The brick circuit is the most depth efficient orthogonal layer circuit with BS gates in Table 2 which can characterize the entire class of orthogonal matrices with determinant 1. The brick circuit may have the same number of parameters as the Pyramid circuit (n(n−1)/2) but about half the depth. Some embodiments of the brick circuit use nearest neighbor qubit connectivity. However, loading data using a data loader may add an additional depth (e.g., n/2 for a semi-diagonal loader or log(n) for a parallel loader). In many cases, the brick circuit may be preferred (e.g., optimal) due to its small depth.
An example brick circuit is illustrated in
The V circuit is a reduced version of the pyramid circuits. The V circuit is designed for NISQ hardware. This layer provides a path from every qubit to every qubit but has only linear parameters (2n−3). It may be preferred to use a diagonal data loader with the V circuit.
An example V circuit is illustrated in
The X circuit (not to be confused with an X gate) is a reduced version of the brick circuit. The X circuit is designed for NISQ hardware. This layer provides a path from every qubit to every qubit but has only linear parameters (2n−3). The additional loader depth may be the same as brick circuit.
An example X circuit is illustrated in
The training methods for the above circuits may be the same as described in Section 4. For example, we go inner-layer-by-inner-layer and update each angle using the same update rule as described above for the pyramid layers.
As stated above, the above circuits provide a path from each qubit to every other qubit. For example, looking at one of the circuit diagrams in
The above circuits may characterize the special orthogonal group, i.e., the orthogonal matrices with determinant +1. They may be generalized to incorporate the ones with determinant −1 as well by applying a Z gate in the end on the last qubit.
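To make these layouts concrete, the following sketch enumerates the wire pairs acting at each timestep for the pyramid and butterfly variants (one possible scheduling consistent with the depths and parameter counts discussed above; the figures may use mirrored or shifted but equivalent arrangements).

```python
import math

def pyramid_timesteps(n):
    """Wire pairs (i, i+1) acting at each of the 2n-3 timesteps of a pyramid
    circuit (one possible scheduling; total gate count is n(n-1)/2)."""
    steps = []
    for lam in range(2 * n - 3):
        lo, hi = max(0, lam - (n - 2)), min(lam, n - 2)
        steps.append([(i, i + 1) for i in range(lo, hi + 1) if (lam - i) % 2 == 0])
    return steps

def butterfly_timesteps(n):
    """Wire pairs for the log(n) layers of a butterfly circuit (n a power of two);
    layer k couples qubits whose indices differ by n / 2**k."""
    steps = []
    for k in range(1, int(math.log2(n)) + 1):
        stride = n // 2 ** k
        steps.append([(i, i + stride) for i in range(n) if (i // stride) % 2 == 0])
    return steps

print(sum(len(s) for s in pyramid_timesteps(8)))   # 28 = 8*7/2 parameters
print([len(s) for s in butterfly_timesteps(8)])    # [4, 4, 4] -> n*log(n)/2 = 12 parameters
```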
11.1 Performing Unary Fourier Transform using Butterfly Circuits
Classically, the matrix that implements a Fourier transform (FFT) is given (up to normalization) by the matrix with entries Fjk = ω^(jk), where the omegas are roots of unity.
A Fourier Transform in the unary domain may be performed by using the butterfly circuit architecture with one additional single qubit gate per BS gate.
The circuit also uses another type of one-qubit rotation gate, represented by a white square with −ω^k in it, for which the matrix is given by [[1, 0], [0, −ω^k]], where ω is the corresponding root of unity. Thus, the resulting circuit performs the Fourier Transform on the unary basis.
12.1 Example Method of Executing a Quantum Circuit to Implement a Layer of a Neural Network.
The computing system executes 1610 at least O(log(n)) layers of the quantum circuit that apply BS gates, each BS gate being a single parameterized two-qubit gate, the number of BS gates being equal to the number of degrees of freedom of the orthogonal weight matrix. In some embodiments, the BS gates are applied to x>0 qubits of a quantum computer. In some embodiments, execution of the at least O(log(n)) layers of the quantum circuit is performed by a classical computer simulating a quantum computer (e.g., see Section 3.1 for more information). In some embodiments, the layer of the neural network is fully connected. In some embodiments, n=d.
The number of BS gates in the O(log(n)) layers may be equal to (2n−1−d)*d/2 (e.g., see Section 2 for more information). In some embodiments, the O(log(n)) layers only include BS gates (e.g., see
In some embodiments, each BS gate is applied to adjacent qubits of the quantum computer. Adjacent qubits may refer to nearest neighbor qubits on a qubit register of the quantum computer. Generally, a pair of non-adjacent qubits are qubits that are far enough apart, or with sufficiently many obstructing qubits or other components between them, that the mechanism used to couple the qubits in the physical platform they are implemented with does not work to implement a two-qubit interaction directly between the pair without some modification to the coupling procedure or hardware. Adjacent qubits on a qubit register may be adjacent to each other on a circuit diagram. For example, with respect to
In some embodiments, the at least O(log(n)) layers apply BS gates to the qubits according to a pyramid pattern (e.g., see
In some embodiments, the computing system prepares a unary quantum state on the x qubits of the quantum computer (e.g., by executing quantum gates), the unary quantum state corresponding to input data (e.g., a vector) to be applied to the layer of the neural network (e.g., see Section 2.3 for more information). The unary quantum state may be a superposition of unary states corresponding to the input data vector. In some embodiments, an output quantum state formed on the x qubits by executing the at least O(log(n)) layers is also a unary quantum state, the output unary quantum state corresponding to output data of the layer of the neural network (e.g., see Section 3 for more information). In some embodiments, the computing system prepares the unary quantum state on the x qubits by: executing a first layer that applies an X gate to one of the x qubits; and after executing the first layer, executing n−1 layers that apply n−1 BS gates to the x qubits of the quantum computer (e.g., see
12.2 Example Method of Training a Layer of a Neural Network.
The computing system executes 1710 layers of BS gates of a quantum circuit. Each BS gate is a single parameterized two-qubit gate. Weights of the weight matrix are based on values of parameters of the BS gates. In some embodiments, a quantum computing system executes the layers of the BS gates of the quantum circuit.
The computing system determines 1720 gradients of a cost function with respect to parameters of the BS gates of the quantum circuit.
The computing system updates 1730 values of parameters of the BS gates of the quantum circuit based on the gradients of the cost function. The updated values of the parameters preserve the orthogonality of the weight matrix.
In some embodiments, determining gradients of the cost function comprises determining gradients of the cost function with respect to the parameter of each BS gate of the quantum circuit.
In some embodiments, executing layers of BS gates of the quantum circuit includes the computing system measuring a resulting quantum state ζλ after each layer λ of the quantum circuit is executed. In some embodiments, the computing system determines errors δ for layers λ of the quantum circuit. In some embodiments, determining errors δ for layers λ of the quantum circuit comprises the computing system determining errors for each layer of the quantum circuit in reverse order according to: δλ=(wλ)T·δλ+1, where δλ is the error for layer λ of the quantum circuit and wλ is a matrix representation of BS gates in layer λ of the quantum circuit. The gradient of the cost function C with respect to a parameter θi of a BS gate acting on qubits i and i+1 may be defined by:
In some embodiments, updating values of the parameters of the BS gates of the quantum circuit based on the gradients of the cost function includes the computing system updating a value of a parameter θi of a BS gate of the quantum circuit according to θi ← θi − η(∂C/∂θi), where η is a learning rate.
The quantum computing system 2020 executes 2065 the program and computes 2070 a result (referred to as a shot or run). Computing the result may include performing a measurement of a quantum state generated by the quantum computing system 2020 that resulted from executing the program. Practically, this may be performed by measuring values of one or more of the qubits 2050. The quantum computing system 2020 typically performs multiple shots to accumulate statistics from probabilistic execution. The number of shots and any changes that occur between shots (e.g., parameter changes) may be referred to as a schedule. The schedule may be specified by the program. The result (or accumulated results) is recorded 2075 by the classical computing system 2010. Results may be returned after a termination condition is met (e.g., a threshold number of shots occur).
An example computer 2100 is illustrated in the accompanying figures.
The storage device 2108 is any non-transitory computer-readable storage medium, such as a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. Such a storage device 2108 can also be referred to as persistent memory. The pointing device 2114 may be a mouse, track ball, or other type of pointing device, and is used in combination with the keyboard 2110 to input data into the computer 2100. The graphics adapter 2112 displays images and other information on the display 2118. The network adapter 2116 couples the computer 2100 to a local or wide area network.
The memory 2106 holds instructions and data used by the processor 2102. The memory 2106 can be non-persistent memory, examples of which include high-speed random access memory, such as DRAM, SRAM, DDR RAM, ROM, EEPROM, flash memory.
As is known in the art, a computer 2100 can have different or other components than those shown in
As is known in the art, the computer 2100 is adapted to execute computer program modules for providing functionality described herein. As used herein, the term “module” refers to computer program logic utilized to provide the specified functionality. Thus, a module can be implemented in hardware, firmware, or software. In one embodiment, program modules are stored on the storage device 2108, loaded into the memory 2106, and executed by the processor 2102.
A quantum circuit is an ordered collection of one or more gates. A sub-circuit may refer to a circuit that is a part of a larger circuit. A gate represents a unitary operation performed on one or more qubits. Quantum gates may be described using unitary matrices. The depth of a quantum circuit is the least number of steps needed to execute the circuit on a quantum computing system. The depth of a quantum circuit may be smaller than the total number of gates because gates acting on non-overlapping subsets of qubits may be executed in parallel. A layer of a quantum circuit may refer to a step of the circuit, during which multiple gates may be executed in parallel. In some embodiments, a quantum circuit is executed by a quantum computing system. In this sense a quantum circuit can be thought of as comprising a set of instructions or operations that a quantum computing system can execute. To execute a quantum circuit on a quantum computing system, a user may inform the quantum computing system what circuit is to be executed. A quantum computing system may include both a core quantum device and a classical peripheral/control device (e.g., a qubit controller) that is used to orchestrate the control of the quantum device. It is to this classical control device that the description of a quantum circuit may be sent when one seeks to have a quantum computer execute a circuit.
A variational quantum circuit may refer to a parameterized quantum circuit that is executed many times, where each time some of the parameter values may be varied. The parameters of a parameterized quantum circuit may refer to parameters of the gate unitary matrices. For example, a gate that performs a rotation about the y axis may be parameterized by a real number that describes the angle of the rotation. Variational quantum algorithms are a class of hybrid quantum-classical algorithm in which a classical computer is used to choose and vary the parameters of a variational quantum circuit. Typically, the classical processor updates the variational parameters based on the outcomes of measurements of previous executions of the parameterized circuit.
The description of a quantum circuit to be executed on one or more quantum computers may be stored in a non-transitory computer-readable storage medium. The term “computer-readable storage medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store instructions. The term “computer-readable medium” shall also be taken to include any medium that is capable of storing instructions for execution by the quantum computing system and that cause the quantum computing system to perform any one or more of the methodologies disclosed herein. The term “computer-readable medium” includes, but is not limited to, data repositories in the form of solid-state memories, optical media, and magnetic media.
The approaches described above may be amenable to a cloud quantum computing system, where quantum computing is provided as a shared service to separate users. One example is described in patent application Ser. No. 15/446,973, “Quantum Computing as a Service,” which is incorporated herein by reference.
The disclosure above describes example embodiments for purposes of illustration only. Any features that are described as essential, important, or otherwise implied to be required should be interpreted as only being required for that embodiment and are not necessarily included in other embodiments.
Additionally, the above disclosure often uses the phrase “we” (and other similar phases) to reference an entity that is performing an operation (e.g., a step in an algorithm). These phrases are used for convenience. These phrases may refer to a computing system (e.g., including a classical computing system and a quantum computing system) that is performing the described operations.
Some portions of above description describe the embodiments in terms of algorithmic processes or operations. These algorithmic descriptions and representations are commonly used by those skilled in the computing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs comprising instructions for execution by a processor or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times, to refer to these arrangements of functional operations as modules, without loss of generality. In some cases, a module can be implemented in hardware, firmware, or software.
As used herein, any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment. Similarly, use of “a” or “an” preceding an element or component is done merely for convenience. This description should be understood to mean that one or more of the elements or components are present unless it is obvious that it is meant otherwise. As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).
In addition, use of the “a” or “an” are employed to describe elements and components of the embodiments. This is done merely for convenience and to give a general sense of the disclosure. This description should be read to include one or at least one and the singular also includes the plural unless it is obvious that it is meant otherwise. Where values are described as “approximate” or “substantially” (or their derivatives), such values should be construed as accurate +/−10% unless another meaning is apparent from the context. For example, “approximately ten” should be understood to mean “in a range from nine to eleven.”
Alternative embodiments are implemented in computer hardware, firmware, software, and/or combinations thereof. Implementations can be implemented in a computer program product tangibly embodied in a machine-readable storage device for execution by a programmable processor; and method steps can be performed by a programmable processor executing a program of instructions to perform functions by operating on input data and generating output. As used herein, ‘processor’ may refer to one or more processors. Embodiments can be implemented advantageously in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. Each computer program can be implemented in a high-level procedural or object-oriented programming language, or in assembly or machine language if desired; and in any case, the language can be a compiled or interpreted language. Suitable processors include, by way of example, both general and special purpose microprocessors. Generally, a processor will receive instructions and data from a read-only memory and/or a random-access memory. Generally, a computer will include one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks. Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM disks. Any of the foregoing can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits) and other forms of hardware.
Although the above description contains many specifics, these should not be construed as limiting the scope of the invention but merely as illustrating different examples. It should be appreciated that the scope of the disclosure includes other embodiments not discussed in detail above. Various other modifications, changes, and variations which will be apparent to those skilled in the art may be made in the arrangement, operation, and details of the methods and apparatuses disclosed herein without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
202141023642 | May 2021 | IN | national |