The present invention relates to a learning apparatus, a learning method, and a program.
In the fields of artificial intelligence, machine learning, and the like, models called neural networks (NNs) have been widely used and, recently, a neural network using a unitary matrix as a weight matrix has been attracting attention. A unitary matrix refers to a matrix W that has a complex number as an element and that satisfies WW† = I, where W† represents a conjugate transpose matrix of W and I represents an identity matrix. A neural network including a weight matrix having complex numbers as elements is also called a “complex neural network”.
The following two reasons are mainly considered as reasons why a neural network using a unitary matrix as a weight matrix has been attracting attention.
The first reason is that there have been reports on the effectiveness of a method using a unitary matrix as a weight matrix in order to mitigate a “vanishing or exploding gradient problem” that may occur when learning a deep neural network (DNN: Deep NN). In particular, because backpropagation, which is normally used in learning of a DNN, propagates a gradient through repeated multiplication by weight matrices, and because a unitary matrix preserves the norm of any vector it multiplies, using a unitary matrix keeps the gradient from vanishing or exploding and is therefore effective in terms of learning efficiency.
The second reason is that, in an implementation of an optical neural network (Optical NN, Photonic NN), the matrix-vector multiplication part is composed of Mach-Zehnder interferometers (MZIs), each of which is an implementation of a Givens rotation matrix (Non-Patent Literature 1).
In this case, methods of restricting a weight matrix to a unitary matrix can be roughly classified into two methods.
The first method is a method of imposing a constraint when optimizing a weight matrix. In this method, after optimizing the weight matrix so as to satisfy the constraint, projection or retraction is required to obtain a strict unitary matrix. Therefore, it is difficult to obtain a strict unitary matrix in a state where accuracies of learning and inference of a neural network are maintained. The method also makes learning difficult because, while an arbitrary unitary matrix can be constructed, further constraints are required to construct a specific unitary matrix.
The second method is a method of using a unitary matrix as a fundamental matrix to construct an arbitrary unitary matrix with a product of a plurality of unitary matrices and a diagonal matrix (Non-Patent Literature 2). This method uses a property that a product of unitary matrices is a unitary matrix and always enables a strict unitary matrix to be constructed. In addition, a specific unitary matrix can also be constructed by changing the composition of a matrix product. Hereinafter, a weight matrix constructed by this method will also be referred to as a “structurally-constrained weight matrix”.
When an arbitrary unitary matrix is constructed by the second method described above, a Givens rotation matrix may also be used as a fundamental matrix. A method of constructing an arbitrary unitary matrix using a Givens rotation matrix as a fundamental matrix is known (Non Patent Literature 3) and is referred to as Clements’ method or the like.
However, when learning a neural network using a unitary matrix constructed by Clements’ method as a weight matrix by automatic differentiation, a computational graph becomes a deep graph including a large number of nodes and a large amount of calculation is required during backpropagation. As a result, learning the neural network also takes much time.
An embodiment of the present invention has been made in view of the foregoing and an object thereof is to efficiently learn a neural network including a structurally-constrained weight matrix.
In order to achieve the object described above, a learning apparatus according to an embodiment is a learning apparatus that learns a neural network including a linear transformation layer achieved by a weight matrix with a complex number as an element, the learning apparatus including: a formulating unit configured to formulate a differential equation of a loss function with respect to each of conjugate variables corresponding to input variables of the linear transformation layer and a differential equation of the loss function with respect to each of parameters of the neural network; and a learning unit configured to learn parameters of the neural network by backpropagation using the differential equations formulated by the formulating unit.
A neural network including a structurally-constrained weight matrix can be learned in an efficient manner.
Hereinafter, embodiments of the present invention will be described.
In the present embodiment, a learning apparatus 10 that can efficiently learn a neural network (NN) including a structurally-constrained weight matrix will be described. In particular, a case will be described in which a Givens rotation matrix is assumed as a fundamental matrix and a neural network including an arbitrary unitary matrix formed by a product of a plurality of Givens rotation matrices and a diagonal matrix as a weight matrix is learned in an efficient manner.
Hereinafter, a theoretical configuration of the present embodiment will be described.
As illustrated in
Generally, two matrices do not commute with respect to multiplication; however, two Givens rotation matrices acting on mutually disjoint coordinate pairs are commutative. In other words, let R denote a Givens rotation matrix acting on coordinates p and q and let R′ denote a Givens rotation matrix acting on coordinates p′ and q′, where the index pairs {p, q} and {p′, q′} have no element in common. In this case, RR′ = R′R. Hereinafter, a matrix represented by a product of a plurality of such commutative Givens rotation matrices will also be referred to as a Givens rotation matrix as long as no confusion arises.
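As a reference, the commutativity described above can be confirmed numerically with the minimal sketch below. The 2×2 block uses a Clements-style parameterization assumed here purely for illustration (the block actually used in the present embodiment is the one given in Math. 1), and the helper name `givens` is likewise hypothetical.

```python
import numpy as np

def givens(n, p, q, theta, phi):
    """n x n complex Givens rotation acting only on coordinates p and q.

    The 2x2 block is a Clements-style parameterization assumed for illustration.
    """
    R = np.eye(n, dtype=complex)
    R[p, p] = np.exp(1j * phi) * np.cos(theta)
    R[p, q] = -np.sin(theta)
    R[q, p] = np.exp(1j * phi) * np.sin(theta)
    R[q, q] = np.cos(theta)
    return R

n = 4
R1 = givens(n, 0, 1, 0.3, 0.7)    # acts on coordinates (0, 1)
R2 = givens(n, 2, 3, 1.1, -0.4)   # acts on coordinates (2, 3): no index in common

assert np.allclose(R1 @ R1.conj().T, np.eye(n))   # each factor is unitary
assert np.allclose(R1 @ R2, R2 @ R1)              # rotations on disjoint pairs commute
```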
For the following description, a symbol representing the Mach-Zehnder interferometer (MZI) is introduced. The MZI symbol is also used in Non Patent Literature 2 described above and represents a product of a Givens rotation matrix and a vector. For example, when X = [x1, ..., xn]t and Y = [y1, ..., yn]t are n-dimensional vectors and R denotes an n×n Givens rotation matrix, then Y = RX can be represented by an MZI symbol. Note that the superscript t denotes transposition.
Specifically, when n = 2, Y = RX is
[Math. 1]
which can be represented by an MZI symbol as illustrated in
As described above, Clements’ method is known as a method for constructing an arbitrary unitary matrix from a product of Givens rotation matrices. Clements’ method enables an arbitrary n×n unitary matrix to be constructed as a product of n Givens rotation matrices and one diagonal matrix (whose diagonal elements are points on the unit circle of the complex plane). In other words, when the n Givens rotation matrices are denoted as R1, ···, Rn and the diagonal matrix is denoted as D, an arbitrary n×n unitary matrix U can be constructed as U = DRn⋯R1. Because the unitary matrix U constructed in this way can be expressed, apart from the diagonal matrix D, by an n-layer structure formed by the product of the n Givens rotation matrices, the unitary matrix U is also called a Givens rotation matrix product with an n-layer structure.
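The n-layer structure itself can be sketched as follows. The pairing pattern of coordinates in each layer and the 2×2 block are assumptions made only for illustration; what the sketch demonstrates is the structural property that the resulting product is always exactly unitary.

```python
import numpy as np

def givens(n, p, q, theta, phi):
    # Assumed Clements-style 2x2 block embedded at coordinates (p, q).
    R = np.eye(n, dtype=complex)
    R[p, p] = np.exp(1j * phi) * np.cos(theta)
    R[p, q] = -np.sin(theta)
    R[q, p] = np.exp(1j * phi) * np.sin(theta)
    R[q, q] = np.cos(theta)
    return R

def mesh_layer(n, layer, rng):
    # One layer of the mesh: commuting rotations on disjoint pairs.
    # Even layers pair (0,1), (2,3), ...; odd layers pair (1,2), (3,4), ...
    R = np.eye(n, dtype=complex)
    for p in range(layer % 2, n - 1, 2):
        theta, phi = rng.uniform(-np.pi, np.pi, 2)
        R = givens(n, p, p + 1, theta, phi) @ R
    return R

rng = np.random.default_rng(0)
n = 4
Rs = [mesh_layer(n, layer, rng) for layer in range(n)]    # R_1, ..., R_n
D = np.diag(np.exp(1j * rng.uniform(0, 2 * np.pi, n)))    # diagonal matrix D

U = D
for R in reversed(Rs):                                    # U = D R_n ... R_1
    U = U @ R

assert np.allclose(U @ U.conj().T, np.eye(n))             # the product is exactly unitary
```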
However, only in the case of n = 2 can an arbitrary unitary matrix be constructed as U = DR1. While the unitary matrix may be called a Givens rotation matrix product with a two-layer structure even in this case, it is instead referred to as, for example, a product of a Givens rotation matrix and a diagonal matrix in order to avoid misunderstanding.
A linear transformation layer of a neural network using a Givens rotation matrix product with an n-layer structure as a structurally-constrained weight matrix can be achieved by Clements’ method described above. For example, when n = 4, a linear transformation layer of a neural network using a Givens rotation matrix product with an n-layer structure as a structurally-constrained weight matrix is as illustrated in
is obtained. Similarly, R3 is also a matrix represented by a product of two commutative Givens rotation matrices (a product of a Givens rotation matrix including parameters (φ31, θ31) and a Givens rotation matrix including parameters (φ32, θ32)).
When a Givens rotation matrix product with an n-layer structure is used as a structurally-constrained weight matrix, an input vector of a linear transformation layer achieved by the weight matrix is transformed into an output vector through a sequence of n+1 linear transformations. In contrast, when an n×n matrix with arbitrary complex numbers as elements is used as a weight matrix, an input vector of a linear transformation layer achieved by the weight matrix becomes an output vector through a single linear transformation.
In this way, a weight matrix constructed from a product of a plurality of Givens rotation matrices has a fine-grained layer structure in which the linear transformation layer is decomposed into even finer layers. For this reason, when parameters of a neural network including a structurally-constrained weight matrix whose fundamental matrix is a Givens rotation matrix are learned by automatic differentiation, the computational graph becomes a deep graph formed by a large number of nodes and a large amount of calculation is required during backpropagation. As a result, learning the neural network also takes much time.
In consideration thereof, in the present embodiment, two kinds of partial differentials necessary for backpropagation are formulated in advance, and the resulting partial differential equations are used to efficiently perform parameter learning of a neural network including a structurally-constrained weight matrix using a Givens rotation matrix as a fundamental matrix. Here, the two kinds of partial differentials are: an equation obtained by partially differentiating a loss function with respect to each parameter; and an equation obtained by partially differentiating the loss function with respect to each conjugate variable of an input variable (in other words, a variable representing each element of an input vector) of the linear transformation layer.
In the following description, as an example, it is assumed that n = 2 (in other words, an input vector and an output vector of the linear transformation layer are two-dimensional). Using the product of a Givens rotation matrix and a diagonal matrix as a structurally-constrained weight matrix, a Forward calculation and a Backward calculation of the linear transformation layer achieved by this weight matrix are as illustrated in
In this case, the partial differential equation of the first kind is an equation obtained by partially differentiating the loss function L by each of the parameters φ and θ, the partial differential equation of the second kind is an equation obtained by partially differentiating the loss function L by each of the conjugate variables x1* and x2*, and the equations are formulated in the following manner.
where Re(·) represents a real part and Im(·) represents an imaginary part. In this way, the four partial differentials described above can be formulated into a relatively simple form.
By formulating the partial differential equations illustrated in Math. 3 in advance, the number of nodes of the computational graph used during backpropagation can be reduced and the computational graph can be made relatively shallow. This is because, while an ordinary computational graph requires many elementary operations such as sums, differences, and products to calculate the two kinds of partial differentials described above, the number of such operations can be reduced by formulating the partial differential equations illustrated in Math. 3 in advance.
The reduction of the number of operations will be described in more detail. The partial differentials in the third and fourth lines in Math. 3 described above (the partial differential of the loss function L with respect to the conjugate variable x1* and the partial differential of the loss function L with respect to the conjugate variable x2*) can be expressed in a matrix form as follows.
The matrix on the right side of Math. 4 is the conjugate transpose of the matrix illustrated in Math. 1 described above and used during the Forward calculation. Therefore, Math. 4 above can be calculated by simply retaining the matrix element values from the Forward calculation and taking their complex conjugates.
In addition, the partial differential in the first line of Math. 3 described above (the partial differential of the loss function L with respect to the parameter φ) can also be readily calculated from the conjugate variable x1* of the input variable x1 and the result of the partial differential in the third line of Math. 3 described above (the partial differential of the loss function L with respect to the conjugate variable x1*).
In this manner, by reusing values obtained during the Forward calculation, the number of calculations during the Backward calculation can be reduced and the calculation cost can be suppressed.
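The two reuse points above can be checked numerically with the following minimal sketch. Because Math. 1 and Math. 3 are not reproduced here, a Clements-style 2×2 block is assumed, and the closed form for the partial differential with respect to φ in terms of x1* and ∂L/∂x1* is derived under that assumption; all names (`R_mat`, `x`, `t`, ...) are illustrative, and both parameter gradients are verified against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)

def R_mat(theta, phi):
    # Assumed Clements-style 2x2 Givens block (stand-in for Math. 1).
    return np.array([[np.exp(1j*phi)*np.cos(theta), -np.sin(theta)],
                     [np.exp(1j*phi)*np.sin(theta),  np.cos(theta)]])

def loss(theta, phi, x, t):
    y = R_mat(theta, phi) @ x
    return np.sum(np.abs(y - t) ** 2)     # real-valued loss

theta, phi = 0.8, -0.3
x = rng.normal(size=2) + 1j*rng.normal(size=2)
t = rng.normal(size=2) + 1j*rng.normal(size=2)

# Forward calculation: keep R so that its conjugate can be reused in the Backward calculation.
R = R_mat(theta, phi)
y = R @ x
dLdy_conj = y - t                          # dL/dy* for the squared-error loss

# Backward calculation, coarse-grained: dL/dx* = R^dagger dL/dy*  (reuse of forward values).
dLdx_conj = R.conj().T @ dLdy_conj

# Parameter gradients by the Wirtinger chain rule: dL/dp = 2 Re sum_k (dL/dy_k)(dy_k/dp).
dR_dphi   = np.array([[1j*np.exp(1j*phi)*np.cos(theta), 0],
                      [1j*np.exp(1j*phi)*np.sin(theta), 0]])
dR_dtheta = np.array([[-np.exp(1j*phi)*np.sin(theta), -np.cos(theta)],
                      [ np.exp(1j*phi)*np.cos(theta), -np.sin(theta)]])
dLdphi   = 2*np.real(np.sum(np.conj(dLdy_conj) * (dR_dphi   @ x)))
dLdtheta = 2*np.real(np.sum(np.conj(dLdy_conj) * (dR_dtheta @ x)))

# dL/dphi is also obtainable directly from x1* and dL/dx1*, as stated in the text
# (this closed form is derived under the assumed parameterization).
assert np.isclose(dLdphi, 2*np.imag(np.conj(x[0]) * dLdx_conj[0]))

# Numerical check of the two parameter gradients against central finite differences.
eps = 1e-6
assert np.isclose(dLdphi,   (loss(theta, phi+eps, x, t) - loss(theta, phi-eps, x, t))/(2*eps), rtol=1e-4)
assert np.isclose(dLdtheta, (loss(theta+eps, phi, x, t) - loss(theta-eps, phi, x, t))/(2*eps), rtol=1e-4)
```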
Hereinafter, the computational graph obtained by formulating the partial differential equation illustrated in Math. 3 described above in advance is also referred to as a “coarse-grained computational graph”. The coarse-grained computational graph is a graph that includes a smaller number of nodes and that is shallower than a normal computational graph (in other words, a graph before performing the formulation of the partial differentials illustrated in Math. 3 described above).
The derivation of the partial differentials illustrated in Math. 3 described above will be described. A chain rule and Wirtinger derivative (or Wirtinger operator) are used to derive these partial differentials.
The linear transformation and the conjugate thereof in the linear transformation layer when n = 2 are expressed as:
In addition, the Wirtinger derivative is expressed as follows.
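For reference, the standard Wirtinger operators, to which the omitted expression presumably corresponds, are as follows for z = x + iy.

```latex
% Standard Wirtinger derivatives for z = x + iy (assumed to match the omitted expression):
\frac{\partial}{\partial z} = \frac{1}{2}\left(\frac{\partial}{\partial x} - i\,\frac{\partial}{\partial y}\right),
\qquad
\frac{\partial}{\partial z^{*}} = \frac{1}{2}\left(\frac{\partial}{\partial x} + i\,\frac{\partial}{\partial y}\right)
```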
In this case, the partial differential equation of the loss function L with respect to the parameter φ is derived as follows.
Next, the partial differential equation of the loss function L with respect to the parameter θ is derived as follows.
Next, the partial differential equation of the loss function L with respect to the conjugate variable x1* is derived as follows.
In Math. 9, a relationship expressed as
is utilized.
Next, the partial differential equation of the loss function L with respect to the conjugate variable x2* is derived as follows.
In Math. 11, a relationship expressed as
is utilized.
In this manner, by utilizing a chain rule and Wirtinger derivative and utilizing the properties of a Givens rotation matrix, a partial differential of the loss function L can be formulated into a relatively simple form.
While the case of n = 2 has been mainly described in the present embodiment, n = 2 is merely an example and the two kinds of partial differentials can be formulated in the same manner even when n is 3 or greater. In addition, while the matrix on the right side of Math. 1 described above has been described as an application example of Clements’ method, applications of Clements’ method are not limited thereto. For example, Non Patent Literature 2 described earlier describes a transposed matrix of the matrix on the right side of Math. 1 described above in which θ has been replaced by -θ, and the present embodiment can be similarly applied to such a matrix. In addition, generally, “complex Givens rotation matrix” is a generic term, and matrices obtained by adding or subtracting constants to or from the parameters θ and φ, or by multiplication by a point on the unit circle of the complex plane, are also complex Givens rotation matrices to which the present embodiment can be similarly applied.
In addition to the partial differential of the linear transformation layer using the product of commutative Givens rotation matrices (for example, R1 and R3 in
Next, a hardware configuration of the learning apparatus 10 according to the present embodiment will be described.
The input apparatus 101 is, for example, a keyboard, a mouse, or a touch panel. The display apparatus 102 is, for example, a display. It is sufficient as long as the learning apparatus 10 includes either the input apparatus 101 or the display apparatus 102.
The external I/F 103 is an interface with an external device such as a recording medium 103a. The learning apparatus 10 can perform reading and writing from and to the recording medium 103a via the external I/F 103. Examples of the recording medium 103a include a CD (Compact Disc), a DVD (Digital Versatile Disk), an SD memory card (Secure Digital memory card), a USB (Universal Serial Bus) memory card, and so forth.
The communication I/F 104 is an interface for connecting the learning apparatus 10 to a communication network. Examples of the processor 105 include various calculating apparatuses such as a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit). Examples of the memory apparatus 106 include various storage apparatuses such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), a RAM (Random Access Memory), a ROM (Read Only Memory), and a flash memory.
The learning apparatus 10 according to the present embodiment can implement various functional units described hereinafter by including the hardware components illustrated in
Next, a functional configuration of the learning apparatus 10 according to the present embodiment will be described.
The formulating unit 201 formulates various partial differential equations related to the loss function L according to Math. 3 described above when, for example, n = 2. The learning unit 202 uses the partial differential equations formulated by the formulating unit 201 to learn parameters of the neural network by automatic differentiation using a coarse-grained computational graph. A value (partial differential value) of the partial differential equation formulated by the formulating unit 201 is calculated by forward calculation and backward calculation of the neural network.
The storage unit 203 stores various kinds of data (for example, the partial differential equations formulated by the formulating unit 201 and the parameters of the neural network).
In the learning apparatus 10 according to the present embodiment, the formulating unit 201 formulates various partial differential equations related to the loss function L (for example, when n = 2, the partial differential equations represented by Math. 3 described above), and the learning unit 202 then learns the parameters of the neural network by using these partial differential equations. Accordingly, the amount of calculation required when learning the parameters of a neural network including a structurally-constrained weight matrix using a Givens rotation matrix as a fundamental matrix is reduced, and high-speed learning can be achieved. Note that because a Givens rotation matrix product with an n-layer structure can construct an arbitrary unitary matrix (naturally, a given specific unitary matrix can also be constructed), learning of a neural network using an arbitrary unitary matrix as a weight matrix can be accelerated.
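As a concrete illustration of how such a coarse-grained node can sit inside automatic differentiation, the sketch below wraps the n = 2 linear transformation as a single custom autograd function in PyTorch. The 2×2 parameterization is again an assumed Clements-style stand-in for Math. 1, the class and variable names are hypothetical, and the backward pass applies pre-formulated gradients (reusing R and the forward values) instead of tracing every elementary operation.

```python
import torch

class GivensLayer(torch.autograd.Function):
    """A 2x2 Givens linear transformation as one coarse-grained autograd node.

    The block parameterization is an assumption standing in for Math. 1.
    """

    @staticmethod
    def forward(ctx, x, theta, phi):
        ephi = torch.exp(1j * phi)
        c = torch.cos(theta).to(ephi.dtype)
        s = torch.sin(theta).to(ephi.dtype)
        R = torch.stack([torch.stack([ephi * c, -s]),
                         torch.stack([ephi * s,  c])])
        ctx.save_for_backward(x, R)
        return R @ x

    @staticmethod
    def backward(ctx, grad_y):                     # grad_y corresponds to dL/dy* (real loss)
        x, R = ctx.saved_tensors
        grad_x = R.conj().T @ grad_y               # dL/dx* = R^dagger dL/dy*
        zero = torch.zeros((), dtype=R.dtype)
        dR_dphi = torch.stack([torch.stack([1j * R[0, 0], zero]),
                               torch.stack([1j * R[1, 0], zero])])
        dR_dtheta = torch.stack([torch.stack([-R[1, 0], -R[1, 1]]),
                                 torch.stack([ R[0, 0],  R[0, 1]])])
        grad_theta = 2 * torch.real(torch.sum(grad_y.conj() * (dR_dtheta @ x)))
        grad_phi = 2 * torch.real(torch.sum(grad_y.conj() * (dR_dphi @ x)))
        return grad_x, grad_theta, grad_phi

theta = torch.tensor(0.8, dtype=torch.float64, requires_grad=True)
phi = torch.tensor(-0.3, dtype=torch.float64, requires_grad=True)
x = torch.randn(2, dtype=torch.complex128)
t = torch.randn(2, dtype=torch.complex128)

y = GivensLayer.apply(x, theta, phi)
loss = torch.sum(torch.abs(y - t) ** 2)
loss.backward()
print(theta.grad, phi.grad)   # gradients produced by the coarse-grained backward pass
```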
Next, an experiment for comparing the learning apparatus 10 according to the present embodiment with a conventional method will be described.
In the present embodiment, an Elman-type simple recurrent neural network (RNN) illustrated in
From the input terminal (Input), one pixel at a time (in minibatches) is input to an input unit (Input unit). The input unit is a linear transformation unit whose weight matrix Win (128 rows, 1 column) has arbitrary complex numbers as elements.

A hidden layer (Hidden unit) is a linear transformation unit whose weight matrix (128 rows, 128 columns) is a Givens rotation matrix product W.

The output of the input unit and the output of the hidden layer are added together and input to a ReLU activation function (Activation function). The output of the ReLU is fed back to the hidden layer and is also input to an output unit (Output unit). The output unit is a linear transformation unit whose weight matrix Wout (10 rows, 128 columns) has arbitrary complex numbers as elements.

The complex numbers output by the output unit are converted into real numbers by a real-number generator that computes the squared absolute value (power) of each complex number. Once processing of the 784 pixels of one image is completed, the class identification problem is evaluated. In the evaluation, a Softmax function and a Cross entropy loss function are used, and correct-answer numeral data, serving as the target, is input to the Cross entropy loss function.
In this experiment, the Givens rotation matrix product serving as the hidden layer is formed by four layers, which can realize only a limited class of unitary matrices, rather than the 128 layers that would be required to construct an arbitrary unitary matrix, and the diagonal matrix is omitted.
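A minimal sketch of this forward dataflow is shown below. The random unitary stand-in for the hidden weight, the way the ReLU is applied to a complex value (separately to its real and imaginary parts), and all names are assumptions made only to show the structure of the recurrence, not the configuration actually used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 128                                                          # hidden dimension
W_in  = rng.normal(size=(H, 1)) + 1j*rng.normal(size=(H, 1))     # arbitrary complex, 128 x 1
W     = np.linalg.qr(rng.normal(size=(H, H)) +                   # stand-in unitary, 128 x 128
                     1j*rng.normal(size=(H, H)))[0]
W_out = rng.normal(size=(10, H)) + 1j*rng.normal(size=(10, H))   # arbitrary complex, 10 x 128

def crelu(z):
    # Assumed complex activation: ReLU applied to real and imaginary parts separately.
    return np.maximum(z.real, 0) + 1j*np.maximum(z.imag, 0)

def forward(pixels):
    """One image fed pixel by pixel (784 steps); returns real class scores."""
    h = np.zeros((H, 1), dtype=complex)
    for p in pixels:                      # p is a single real pixel value
        h = crelu(W_in * p + W @ h)       # input unit + hidden unit, then activation
    logits = W_out @ h                    # output unit
    return np.abs(logits) ** 2            # real-number generator: squared absolute value

scores = forward(rng.uniform(size=784))
print(scores.shape)                       # (10, 1); Softmax + Cross entropy follow in training
```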
In the setting described above, learning of the Elman-type simple RNN was performed using the learning apparatus 10 according to the present embodiment and using a conventional method, respectively. As the conventional method, PyTorch code relying on the default automatic differentiation of the PyTorch platform was adopted (hereinafter referred to as “AD-py”). In the learning apparatus 10 according to the present embodiment, the linear transformation layer using the formulated partial differential equations (the coarse-grained computational graph) was implemented in C++ (hereinafter referred to as “BP-cpp”).
In addition, a result of a comparison of elapsed times per one epoch is illustrated in Table 1 below.
As illustrated in Table 1 above, when the elapsed time per one epoch of AD-py is 1.00, the elapsed time in BP-cpp is 0.16, which demonstrates that an increase in speed by a factor of approximately 6.2 is realized.
In addition, using the perf tool, which reads the performance counters provided in a CPU, a comparison was performed of quantities that act as speed degradation factors: the number of instructions (#instructions: retired or completed instructions), the number of data loads of the last-level cache (#LLC-loads: Last-level cache data loads), and the number of data load misses of the last-level cache (#LLCM: Last-level cache data load misses). A result of the comparison is illustrated in Table 2 below.
As illustrated in Table 2 above, it is found that all of #instructions, #LLC-loads, and #LLCM are fewer in BP-cpp than in AD-py. Therefore, it is found that BP-cpp can reduce speed deterioration factors.
Next, a second embodiment will be described. In the present embodiment, a case of efficiently performing parameter learning of a neural network including a linear transformation layer achieved by a Fang-type matrix will be described.
In the second embodiment, differences from the first embodiment will be mainly described and descriptions of components substantially the same as those in the first embodiment will be omitted. In particular, the learning apparatus 10 according to the present embodiment can be implemented with a hardware configuration and a functional configuration substantially the same as those of the first embodiment.
A Fang-type matrix is a matrix expressed as R = BS2·PSθ·BS1·PSφ. In this case:
Therefore, the Fang-type matrix R is expressed as:
For details of a Fang-type matrix, for example, refer to Reference Literature 1 “Michel Y.-S. Fang, Sasikanth Manipatruni, Casimir Wierzynski, Amir Khosrowshahi, and Michel R. DeWeese, “Design of optical networks with component imprecisions”, Optics Express, vol. 27, No. 10, pp. 14009-14029, 2019” and the like.
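For reference, the structure of the Fang-type matrix can be sketched numerically as follows. The explicit beam splitter and phase shifter matrices below are assumed conventional forms (they are not reproduced from Math. 13), so only the structural property that R is unitary is being illustrated.

```python
import numpy as np

def BS():
    # 50:50 beam splitter (assumed standard form).
    return np.array([[1, 1j], [1j, 1]], dtype=complex) / np.sqrt(2)

def PS(alpha):
    # Phase shifter acting on one arm (assumed convention).
    return np.diag([np.exp(1j * alpha), 1.0]).astype(complex)

def fang(theta, phi):
    # R = BS2 . PS_theta . BS1 . PS_phi
    return BS() @ PS(theta) @ BS() @ PS(phi)

R = fang(0.8, -1.3)
assert np.allclose(R @ R.conj().T, np.eye(2))          # R is unitary
assert np.isclose(abs(np.linalg.det(R)), 1.0)          # det R lies on the unit circle
```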
In this case, the Fang-type matrix R illustrated in Math. 14 above can be modified to
Therefore, the partial differential equation of the loss function L with respect to each of the parameters φ and θ and the partial differential equation of the loss function L with respect to each of the conjugate variables x1* and x2* are formulated as follows.
In consideration thereof, in the learning apparatus 10 according to the present embodiment, after the various partial differential equations illustrated in Math. 16 are formulated by the formulating unit 201, the learning unit 202 learns the parameters of the neural network including the linear transformation layer achieved by the Fang-type matrix by using these partial differential equations.
In addition to the Fang-type matrix, for example, a partial differential of a transposed matrix of the matrix described in Non Patent Literature 2 mentioned earlier can also be formulated in a similar manner. Specifically, Non Patent Literature 2 describes a matrix representation of a transposed matrix of the Givens rotation matrix illustrated in Math. 1 described earlier in which θ has been replaced with -θ (Expression (9) in Non Patent Literature 2). The partial differential equation of the loss function L with respect to each of the parameters φ and θ and the partial differential equation of the loss function L with respect to each of the conjugate variables x1* and x2* can also be formulated in a similar manner with respect to a linear transformation layer using this matrix as a weight matrix.
Next, a third embodiment will be described. In the present embodiment, a case will be described in which, after a Fang-type matrix is decomposed into a product of two matrices, parameter learning of a neural network including a linear transformation layer achieved by matrices including the matrix product is efficiently performed.
In the third embodiment, differences from the second embodiment will be mainly described and descriptions of components substantially the same as those in the second embodiment will be omitted. In particular, the learning apparatus 10 according to the present embodiment can be implemented with a hardware configuration and a functional configuration substantially the same as those of the second embodiment.
The Fang-type matrix R can be decomposed into a matrix product of two matrices as follows.
In this case, a first term and a second term of the matrix product described above are identical matrix representations which only differ from each other in parameter names. Therefore, the linear transformation layer using the Fang-type matrix R as a weight matrix can be decomposed into two linear transformation layers (a first linear transformation layer and a second linear transformation layer) which only differ from each other in parameter names as illustrated in
In this case, in the first linear transformation layer, a transformation expressed as
is performed and, in the second linear transformation layer, a transformation expressed as
is performed.
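A minimal sketch of this two-stage application is shown below; the grouping of the components into the two factors and their explicit forms are assumptions consistent with the description above, not a reproduction of the actual decomposition.

```python
import numpy as np

def T(alpha):
    # One decomposed layer: BS . PS(alpha), with assumed component forms.
    BS = np.array([[1, 1j], [1j, 1]], dtype=complex) / np.sqrt(2)
    PS = np.diag([np.exp(1j * alpha), 1.0]).astype(complex)
    return BS @ PS

theta, phi = 0.8, -1.3
x = np.array([1.0 + 0.5j, -0.2 + 1.0j])
y = T(phi) @ x            # first linear transformation layer (parameter phi)
z = T(theta) @ y          # second linear transformation layer (parameter theta)

# The composition of the two identical-form layers equals a single matrix product acting on x.
assert np.allclose(z, (T(theta) @ T(phi)) @ x)
```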
Therefore, in the first linear transformation layer, a partial differential equation of the loss function L related to the parameter φ and a partial differential equation of the loss function L related to each of the conjugate variables x1* and x2* are formulated as follows.
In the second linear transformation layer, by replacing φ above with θ, x above with y, and y above with z, a partial differential equation of the loss function L related to the parameter θ and a partial differential equation of the loss function L related to each of the conjugate variables y1* and y2* are formulated in a similar manner.
In consideration thereof, in the learning apparatus 10 according to the present embodiment, after the various partial differential equations described earlier are formulated by the formulating unit 201, the learning unit 202 learns the parameters of the neural network including the first linear transformation layer and the second linear transformation layer described above by using these partial differential equations.
A 2×2 unitary matrix A ∈ U(2) can be expressed as a product of exp(iρ/2) ∈ U(1), where ρ ∈ ℝ, and U ∈ SU(2). Here, U(n) represents the n-th order unitary group, SU(n) represents the n-th order special unitary group, and ℝ represents the set of all real numbers.
In other words, the 2×2 unitary matrix A is expressed as
where ajh ∈ ℂ and j, h = 1, 2. In addition, A satisfies AA† = A†A = I and, since |det A| = 1, det A = exp(iρ) for some ρ ∈ ℝ. Here, det A represents the determinant of the matrix A.
In addition, a 2×2 special unitary matrix U is expressed as
where α, β ∈ ℂ. In addition, U satisfies det U = αα* + ββ* = +1. Note that U includes three independent variables (because the fourth variable is uniquely determined by the other three).
In this case, A can be expressed as A = exp(iρ/2)U.
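This decomposition is easy to confirm numerically; the sketch below strips the global phase from a randomly generated 2×2 unitary matrix (the QR-based construction of the random unitary is simply a convenient stand-in).

```python
import numpy as np

rng = np.random.default_rng(0)

# Random 2x2 unitary matrix via QR decomposition of a complex Gaussian matrix.
A = np.linalg.qr(rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2)))[0]

rho = np.angle(np.linalg.det(A))          # det A = exp(i rho)
U = np.exp(-1j * rho / 2) * A             # strip the global phase

assert np.allclose(A, np.exp(1j * rho / 2) * U)   # A = exp(i rho / 2) U
assert np.isclose(np.linalg.det(U), 1.0)          # U is special unitary
assert np.allclose(U @ U.conj().T, np.eye(2))
```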
In this case, the special unitary matrix U described above can be expressed as a linear sum of the Pauli matrices σ1, σ2, σ3 and the identity matrix σ4 = I. Specifically, when p1, p2, p3, p4 ∈ ℝ, α = p4 + i·p3, and β = p2 + i·p1, then U can be expressed as follows, where i represents the imaginary unit.
Furthermore,
is provided. Therefore, U can be expressed as U = i·p1σ1 + i·p2σ2 + i·p3σ3 + p4σ4. Note that, for j = 1, 2, 3, Trace(σj) = 0, det(σj) = -1, and σj² = I. Furthermore, σ1σ2σ3 = iI, σ1σ2 = -σ2σ1 = iσ3, σ2σ3 = -σ3σ2 = iσ1, and σ3σ1 = -σ1σ3 = iσ2 are satisfied.
From the above, it is found that σ1, σ2, σ3, and σ4 are mutually linearly independent and form an orthogonal basis of the four-dimensional complex vector space formed by 2×2 complex matrices.
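The relations above can be confirmed with the short sketch below, which assumes the standard convention for the Pauli matrices (consistent with the relations just quoted) and checks that a point on the unit 3-sphere yields a special unitary matrix.

```python
import numpy as np

s1 = np.array([[0, 1], [1, 0]], dtype=complex)
s2 = np.array([[0, -1j], [1j, 0]], dtype=complex)
s3 = np.array([[1, 0], [0, -1]], dtype=complex)
s4 = np.eye(2, dtype=complex)

# Algebraic relations quoted in the text (standard Pauli algebra).
assert np.allclose(s1 @ s2 @ s3, 1j * s4)
assert np.allclose(s1 @ s2, 1j * s3)
assert np.allclose(s2 @ s3, 1j * s1)
assert np.allclose(s3 @ s1, 1j * s2)

# A point (p1, p2, p3, p4) on the unit 3-sphere gives a special unitary matrix.
rng = np.random.default_rng(0)
p = rng.normal(size=4)
p /= np.linalg.norm(p)
U = 1j*p[0]*s1 + 1j*p[1]*s2 + 1j*p[2]*s3 + p[3]*s4
assert np.allclose(U @ U.conj().T, np.eye(2))
assert np.isclose(np.linalg.det(U), 1.0)
```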
As an example, when σ2 and σ3 are adopted as matrix generators, the special unitary matrix U can be expressed as U(ω, θ, φ) = U3(ω)U2(θ)U1(φ), where
In this case, e^X denotes the matrix exponential of a matrix X and is defined by
where X^0 = I.
Thus, the special unitary matrix U can be expressed as:
On the right side of Math. 27 described above, the first and second factors are diagonal matrices and the third and fourth factors are special unitary matrices.
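The matrix exponential of a Pauli generator has the closed form e^(iaσ) = cos(a)I + i·sin(a)σ, and a product of such exponentials is again a special unitary matrix. The sketch below checks this; the particular Euler-style choice of generators is an assumption for illustration and is not necessarily the exact composition used above.

```python
import numpy as np
from scipy.linalg import expm

s2 = np.array([[0, -1j], [1j, 0]], dtype=complex)
s3 = np.array([[1, 0], [0, -1]], dtype=complex)

theta, omega, phi = 0.7, -0.4, 1.2

# exp(i a sigma) = cos(a) I + i sin(a) sigma for any Pauli matrix sigma (sigma^2 = I).
U2 = expm(1j * (theta / 2) * s2)
assert np.allclose(U2, np.cos(theta/2)*np.eye(2) + 1j*np.sin(theta/2)*s2)

# A product of such exponentials (one common Euler-style choice of generators)
# is again a special unitary matrix.
U = expm(1j*(omega/2)*s3) @ U2 @ expm(1j*(phi/2)*s3)
assert np.allclose(U @ U.conj().T, np.eye(2))
assert np.isclose(np.linalg.det(U), 1.0)
```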
Accordingly, when σ2 and σ3 are adopted as matrix generators, an arbitrary 2×2 unitary matrix A can be expressed as:
On the right side of Math. 28 described above, the first to third factors are diagonal matrices and the fourth and fifth factors are special unitary matrices. In other words, when σ2 and σ3 are adopted as matrix generators, an arbitrary 2×2 unitary matrix A can be expressed as a product of a diagonal matrix and a special unitary matrix. Therefore, in the following description, the formulation of partial differentials for the special unitary part, denoted as V, will be considered. While σ2 and σ3 have been adopted as matrix generators as an example in the present embodiment, the present embodiment can also be applied to a case where, for example, σ1 and σ3 are adopted as matrix generators.
When the product of the fourth and fifth factors of Math. 28 described above is denoted as V,
is obtained.
Hereinafter, for the sake of simplicity, after each element of the matrix in the second factor of Math. 29 described above is multiplied by the first factor, φ/2 is replaced by φ and θ/2 is replaced by θ to obtain a matrix W (this is possible without loss of generality). In other words, the following matrix is defined as W.
The determinant det W of the matrix W is +1, and W is a representation matrix of SU(2). The complex Givens rotation matrix illustrated in Math. 1 described earlier becomes a representation matrix of SU(2) when multiplied by exp(-iφ/2), and the Fang-type matrix illustrated in Math. 13 described earlier becomes a representation matrix of SU(2) when multiplied by i·exp(-i(θ + φ)/2). These representation matrices can be called rotation matrices.
In this case, a linear transformation by the matrix W and a conjugate thereof are:
Therefore, the partial differential equation of the loss function L with respect to each of the parameters φ and θ and the partial differential equation of the loss function L with respect to each of the conjugate variables x1* and x2* are formulated as follows.
Accordingly, the parameter learning of the neural network including the linear transformation layer using an arbitrary 2×2 unitary matrix A as a weight matrix can be efficiently performed. Note that the formulation of the partial differentials is performed by the formulating unit 201 and the parameter learning is performed by the learning unit 202.
Derivation of the partial differential equation of the loss function L with respect to each of the parameters φ and θ will be described below. First, note that
is satisfied. Using this relationship enables the partial differential equation of the loss function L with respect to the parameter φ to be derived as follows.
Similarly, the partial differential equation of the loss function L with respect to the parameter θ is derived as follows.
The present invention is not limited to the specifically disclosed embodiments described above and various modifications, changes, and combinations with existing techniques can be made without departing from the scope of the claims.
10 Learning apparatus
101 Input apparatus
102 Display apparatus
103 External I/F
103a Recording medium
104 Communication I/F
105 Processor
106 Memory apparatus
107
201 Formulating unit
202 Learning unit
203 Storage unit
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2020/045699 | 12/8/2020 | WO |