The present invention relates to a conversion device, a training device, a conversion method, a training method, and a program.
A type of mathematical model called a neural network is conventionally known. A classical neural network performs calculation for converting input data expressed by a vector into output data expressed by a vector, a scalar, or the like. Calculation in this type of neural network can be described in the format of nested functions that express layers.
In recent years, neural networks have come to be applied in various fields, and neural networks are often used to handle complex problems. When such a complex problem is handled using a neural network, it is often the case that the input data and the output data are data that is structured (hereinafter, also called “structured data”). Here, structured data is not simple vector data or the like, but rather is data that has some sort of structure, examples of which include data whose elements have a structured relationship with each other, and data that has a structured relationship with other data. Specific examples of structured data include a word sequence that makes up a text document, and a vector or matrix that expresses a correspondence relationship between pieces of time-series data.
In order to handle such structured data with a neural network, a method is known in which dynamic programming computation is used as a neural network layer. Dynamic programming computation is a technique in which a target problem is recursively broken down into sub-problems, and the sub-problems are successively solved in order to obtain a solution. Note that because of the versatility and expressive power of dynamic programming computation, such computation is often non-differentiable.
When parameters in a neural network are trained through back propagation for example, the derivative of a predetermined loss function is calculated based on prediction output of the neural network and correct answer data. The computation performed in the layers of the neural network thus needs to be differentiable.
However, because dynamic programming computation is often non-differentiable computation, parameter training is sometimes difficult in a neural network that has dynamic programming layers. To address this, methods have been proposed in which a CRF (Conditional Random Field) is used to convert dynamic programming computation into differentiable computation (e.g., NPL 1 and NPL 2).
However, with the methods proposed in NPL 1 and NPL 2, the output data of the dynamic programming computation layer loses sparsity, and therefore the interpretability of the output data has sometimes decreased.
In the case of a problem addressed by dynamic programming, interpretability between the structured data that is input (hereinafter, also called “structured input data”) and the structured data that is output (hereinafter, also called “structured output data”) is often very important. For example, in the case where the structured input data is a word sequence that makes up a text document and the structured output data is a matrix indicating the tagging of words in the word sequence (e.g., tags indicating the parts of speech or categories of words), it is often preferable to obtain structured output data in which one tag is associated with each word. However, with the methods proposed in NPL 1 and NPL 2, the output data of the dynamic programming computation layer loses sparsity, and therefore sometimes the structured output data is data in which multiple tags are associated with a word. For this reason, it is sometimes difficult to make an interpretation such as specifying one part of speech for a certain word.
An embodiment of the present invention was achieved in light of the foregoing situation, and an object thereof is to realize dynamic programming computation that is differentiable and has high interpretability.
In order to achieve the aforementioned object, an embodiment of the present invention is a conversion device that converts input first data X into second data Y using a neural network, the conversion device including: calculating means for calculating an approximation DPΩ(θ) of a solution of dynamic programming that addresses a problem expressed by a weighted directed acyclic graph G, with use of third data θ obtained by predetermined preprocessing performed on the first data X, and with use of a DPΩ function recursively defined using a maxΩ function in which a strongly-convex regularization function Ω is implemented in a max function; and outputting means for outputting, as the second data Y, at least one of DPΩ(θ) calculated by the calculating means and a gradient ∇DPΩ(θ) of DPΩ(θ).
According to this embodiment of the present invention, it is possible to realize dynamic programming computation that is differentiable and has high interpretability.
Embodiments of the present invention are described below. The following describes a conversion device 100 that converts structured input data into structured output data, as an embodiment of the present invention. Here, the conversion device 100 of this embodiment of the present invention uses differentiable dynamic programming computation to convert structured input data into structured output data that has high interpretability. The dynamic programming computation executed by the conversion device 100 of this embodiment of the present invention is realized as a neural network layer.
A training device 200 that trains a neural network including a layer realized by the aforementioned dynamic programming computation will also be described as an embodiment of the present invention.
Here, part-of-speech tagging is one example of a task in which structured input data is converted into structured output data. In part-of-speech tagging, the structured input data is a word sequence that makes up a text document, and the structured output data is a matrix that indicates the tagging of words included in the word sequence (e.g., tags indicating the parts of speech of the words), for example. In this case, the conversion device 100 of this embodiment of the present invention functions as a text analysis device.
Translation is another example of a task in which structured input data is converted into structured output data. In translation, the structured input data is a word sequence that makes up a text document in the source language, and the structured output data is a word sequence obtained by translating the word sequence into the target language, for example. In this case, the conversion device 100 of this embodiment of the present invention functions as a translation device.
The alignment of pieces of time-series data is another example of a task in which structured input data is converted into structured output data. In this alignment, the structured input data is data indicating pieces of time-series data, and the structured output data is a vector, a matrix, or the like expressing a correspondence relationship between the pieces of time-series data (e.g., the similarity between elements included in the pieces of time-series data), for example. In this case, the conversion device 100 of this embodiment of the present invention functions as a time-series data alignment device.
Note that the structured input data is not limited to the above-described word sequence or pieces of time-series data. Any data expressed by a series, a sequence, or the like can be used as the structured input data. For example, the structured input data can be image data, video data, data expressing an acoustic signal, data expressing a biological signal, or the like.
The following describes the theoretical background of conversion, training, and the like executed with use of dynamic programming by the conversion device 100 and the training device 200 of embodiments of the present invention. In the following embodiments of the present invention, the structured input data is denoted by X, and the structured output data is denoted by Y. Also, a set of the structured input data X will be denoted as follows.
𝒳 [Formula 1]
Also, a set of structured output data Y will be denoted as follows.
𝒴 [Formula 2]
When some sort of task for converting the structured input data X into the structured output data Y is performed, the procedures shown in Expressions 1 and 2 below, for example, are performed.
Here, θ is a matrix or a tensor having real numbers as elements, and Θ is a set of θ. Also, the bold letter R indicates the set of all real numbers. For the sake of convenience in the notation used in this specification, all real numbers will hereinafter also be simply denoted as “R”.
Also, “preprocessing” is for converting (projecting) the structured input data X into θ in accordance with the problem addressed by dynamic programming, and is realized by a neural network, for example. Specifically, in the case of a problem that involves the above-described part-of-speech tagging for example, the preprocessing is realized by BLSTM (Bi-directional Long Short-Term Memory).
Expression 1 above obtains the optimal solution (value) of the objective function of the problem addressed by dynamic programming, and Expression 2 above obtains the argument (Y*) of the objective function that gives the optimal solution. In Expression 1, the optimal solution (value) is obtained by solving the objective function of the problem addressed by dynamic programming. On the other hand, in Expression 2, the argument (Y*) of the objective function is obtained by performing backtracking after the optimal solution (value) has been obtained.
Whether the optimal solution (value) of the objective function is needed or the argument (Y*) of the objective function that gives the optimal solution is needed depends on the problem addressed by dynamic programming. For example, in the case of a problem that involves part-of-speech tagging or a problem that involves the alignment of pieces of time-series data, the argument (Y*) of the objective function that gives the optimal solution is needed. Note that there are also cases where both the optimal solution (value) of the objective function and the argument (Y*) of the objective function that gives the optimal solution are needed.
In general, in the case of obtaining the structured output data Y from the structured input data X, the procedure of Expression 2 is performed to obtain the argument (Y*) of the objective function that gives the optimal solution. At this time, structured output data Y=Y*. However, in the case of obtaining some sort of value when the structured output data Y=Y* has been obtained from the structured input data X (e.g., obtaining the accuracy of part-of-speech tagging when Y* has been obtained), the procedure of Expression 1 is performed to obtain the solution (value) of the objective function.
Also, in general, the optimal solution (value) of the objective function of the problem addressed by dynamic programming is often called the “dynamic programming solution”, but the argument (Y*) of the objective function that gives the optimal solution is also sometimes called the “dynamic programming solution”. In the embodiments of the present invention, the optimal solution (value) of the objective function of the problem addressed by dynamic programming will be called the “dynamic programming solution”.
Here, assuming that θ has been obtained through the above-described preprocessing, the processing for obtaining the optimal solution (value) of the objective function of the problem addressed by dynamic programming can be formulated into a problem of finding the path having the highest predetermined score among paths from the start node to the end node in a weighted directed acyclic graph (DAG).
Here, the weighted directed acyclic graph is expressed as G=(ν,ε), where ν is a set of nodes and ε is a set of edges. Also, let the number of nodes N be N=|ν|≥2. The edges in the set of edges ε are directed edges, and in the case of a directed edge from one node to another node, the one node is the “parent node”, and the other node is the “child node”.
Without loss of generality, the nodes can be ordered by sequentially giving numbers (IDs) to the nodes such that each node has a smaller number than its child node. Let the node with the ID 1 be the start node, and the node with the ID N be the end node. This can be expressed as follows.
ν=[N]≜{1, . . . ,N} [Formula 5]
Hereinafter, the node with the ID n will be indicated as “node n”. Note that the symbol “≜” (an equals sign with a triangle above it) means that the left-hand side is defined by the right-hand side.
In the weighted directed acyclic graph G, node 1 is the only node that does not have a parent node, and node N is the only node that does not have a child node. Also, in the weighted directed acyclic graph G, the directed edge (i,j) from the parent node j to the child node i has the weight θi,j∈R.
Let θ∈Θ⊆RN×N be a matrix whose elements are the weights θi,j in the weighted directed acyclic graph G. Note that the weight θi,j for a directed edge (i,j) not included in the set of edges ε is θi,j=−∞.
Let the following be the set of all paths from node 1 to node N in the weighted directed acyclic graph G.
𝒴′ [Formula 6]
The following arbitrary path
Y′∈𝒴′ [Formula 7]
can be expressed as an N×N binary matrix. Specifically, letting y′ij be the element of the component (i,j), the path Y′ is a matrix in which y′ij=1 if the path Y′ passes through the directed edge (i,j), and y′ij=0 if it does not. A path Y′ expressed in this way is in one-to-one correspondence with the structured output data Y. Accordingly, hereinafter, the path Y′ will be regarded as being the same as the structured output data Y, and will be indicated as “path Y” (yij being the element of the component (i,j) of the path Y). Similarly, the set 𝒴′ of paths will be regarded as being the same as the set 𝒴 of the structured output data Y.
Here, letting <Y,θ> be the Frobenius inner product of Y and θ, <Y,θ> corresponds to the sum of the weights θi,j of the edges (i,j) along the path Y. Accordingly, letting the Frobenius inner product <Y,θ> be the score, the following combinatorial problem LP(θ) is solved to obtain the path Y=Y* having the highest score out of all of the paths Y.
Here, the number of elements of
𝒴 [Formula 9]
increases exponentially with N, but LP(θ) can be calculated using dynamic programming by following the node ordering in the weighted directed acyclic graph G. In view of this, let the following be the set of parent nodes of the node i in the weighted directed acyclic graph G.
𝒫i [Formula 10]
Then vi(θ) is recursively defined by Expression 3 below.
Accordingly, the ultimately calculated vN(θ) is DP(θ). In other words, DP(θ) is expressed as follows.
DP(θ)≜vN(θ) [Formula 12]
Because it can be proven that the solution calculated through dynamic programming is optimal, DP(θ)=LP(θ) is true for any θ∈Θ. In other words, the dynamic programming solution (“value” in Expression 1 above) can be obtained by calculating recursively-defined Expression 3 above.
Here, when the argument (Y*) of the objective function that gives the optimal solution (the dynamic programming solution) is to be obtained as shown in Expression 2, the problem can be said to be the problem of obtaining the following, namely the path Y that gives the highest score.
The argument (Y*) shown in Expression 4 above can be obtained by first performing the recursive calculation of Expression 3, and then performing backtracking.
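For reference, the recursive calculation of Expression 3 and the subsequent backtracking of Expression 4 can be sketched in Python as follows. This is an illustrative sketch only: the graph representation (a list parents giving the parent nodes of each node) and the names dp_value_and_path, best_parent, and so on are hypothetical and are not part of the embodiments.

```python
import numpy as np

def dp_value_and_path(theta, parents):
    """Illustrative sketch of the recursion (Expression 3) and backtracking (Expression 4).

    theta   : N x N matrix of edge weights theta[i, j] for the directed edge (i, j)
              from parent node j to child node i (missing edges are -inf).
    parents : parents[i] is the list of parent-node indices of node i
              (nodes are 0-indexed here; node 0 is the start node, node N-1 the end node).
    Returns DP(theta) = v_N(theta) and the binary path matrix Y*.
    """
    N = theta.shape[0]
    v = np.full(N, -np.inf)
    best_parent = np.full(N, -1, dtype=int)
    v[0] = 0.0
    for i in range(1, N):                       # nodes are assumed topologically ordered
        scores = [theta[i, j] + v[j] for j in parents[i]]
        k = int(np.argmax(scores))
        v[i] = scores[k]
        best_parent[i] = parents[i][k]
    # Backtracking from the end node recovers the argmax path Y*.
    Y = np.zeros((N, N))
    i = N - 1
    while i != 0:
        j = best_parent[i]
        Y[i, j] = 1.0
        i = j
    return v[N - 1], Y
```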
However, DP(θ) is non-differentiable, and Y*(θ) is a discontinuous function. For this reason, if the dynamic programming computation is realized as a layer in a neural network, a derivative (derivative of a predetermined loss function) cannot be calculated through back propagation or the like, and therefore neural network training cannot be performed using gradient descent or the like.
In view of this, in these embodiments of the present invention, the procedures shown in Expressions 1′ and 2′ below are used in place of the procedures shown in Expressions 1 and 2.
Here, DPΩ is an approximation of DP, and the processing following DPΩ (i.e., the processing in a layer following the dynamic programming computation layer in the neural network) can also be accurately defined similarly to the case of using DP. Also, ∇DPΩ is the gradient of DPΩ, and the following is true.
∇DPΩ(θ)∈conv(𝒴) [Formula 16]
Here, conv(𝒴) is the convex hull of 𝒴, and this convex hull is defined as follows.
conv(𝒴)≜{ΣY∈𝒴λYY: λ∈Δ|𝒴|} [Formula 17]
Also, ΔD is a D-dimensional simplex, and is defined as follows.
ΔD≜{λ∈R+D: ∥λ∥1=1} [Formula 18]
Here, unlike DP and Y*, DPΩ and ∇DPΩ are differentiable. Also, letting γ be arbitrary precision (in other words, letting γ be the difference between DPΩ and DP), the relationship between DPΩ and DP and the relationship between ∇DPΩ and Y* are expressed as follows.
In order to handle the dynamic programming problem using procedures approximated by Expressions 1′ and 2′, consider replacing the max function with the maxΩ function defined as follows.
Here, Ω:ΔD→R is a strongly-convex regularization function.
Also, for the maxΩ function applied to a function of the form
f:𝒴→R [Formula 21]
the following notation is introduced for the sake of convenience.
By then replacing the max function in Expression 3 with the maxΩ function, Expression 5 below is defined recursively.
Hereinafter, Expression 5 will also be expressed as follows for the sake of convenience.
(vi(θ))i=1N [Formula 24]
vN(θ) ultimately calculated by Expression 5 is DPΩ(θ). In other words, DPΩ(θ) is expressed as follows.
DPΩ(θ)≜vN(θ) [Formula 25]
Accordingly, the dynamic programming computation layer can be expressed by the following two layers (Value layer and Gradient layer).
Value layer: DPΩ(θ)∈R
Gradient layer: ∇DPΩ(θ)∈conv(𝒴) [Formula 26]
Note that when a dynamic programming solution is to be obtained, it is sufficient to use the Value layer as the neural network layer. On the other hand, when the value of the argument of the objective function that gives the dynamic programming solution is to be obtained, it is sufficient to use the Gradient layer as the neural network layer.
DPΩ(θ) in the Value layer can be used as a differentiable approximation of DP(θ). For example, DPΩ(θ) can be used when defining a loss function (let this loss function be L1) that indicates how close a correct answer output Ytrue and a prediction output ∇DPΩ(θ) of the neural network are to each other in neural network training. The loss function L1 is defined by Expression 6 below, for example.
[Formula 27]
DPΩ(θ)−<Ytrue,θ>∈R (6)
The smaller the value of the loss function L1 is, the closer the obtained prediction output ∇DPΩ(θ) is to the correct answer output Ytrue.
When using the Value layer (i.e., the layer for calculating DPΩ(θ)) as a layer in a neural network, the gradient ∇DPΩ(θ) of DPΩ(θ) needs to be calculated in order to train the parameters of the neural network. The gradient ∇DPΩ(θ) can be calculated through back propagation using Expression 5. More specifically, letting E=∇DPΩ(θ)∈RN×N, Q=(qij)∈RN×N, and h=(h1, . . . , hN)∈RN, E can be obtained using the procedures from Step 1-1 to Step 1-3 described below. Note that it is assumed that θ∈RN×N is given.
Step 1-1: As an initialization procedure, the following is set: v1←0∈R, hN←1∈R, Q←0∈RN×N, E←0∈RN×N. Note that “←” means substituting the right-hand side for the left-hand side.
Step 1-2: As a forward procedure, the following calculations and substitutions are performed sequentially for i=2, . . . , N.
Step 1-3: As a backward procedure, the following calculations and substitutions are performed sequentially for j=N−1, . . . , 1.
Here, Cj represents the set of child nodes of the node j.
E, which is ultimately obtained by the above procedures, is ∇DPΩ(θ).
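As a purely illustrative Python sketch of Step 1-1 to Step 1-3, assume that the forward procedure computes, for each node i, the value vi as maxΩ over the quantities θi,j+vj of its parent nodes j together with the corresponding local gradient qi,·, and that the backward procedure accumulates h and E from the end node toward the start node. The functions max_omega and grad_max_omega are assumed to be supplied (for example, those of Example 1 or Example 2 described later), and all names are hypothetical.

```python
import numpy as np

def dp_omega_value_and_gradient(theta, parents, max_omega, grad_max_omega):
    """Illustrative forward pass for DP_Omega(theta) and backward pass for
    E = grad DP_Omega(theta), following Steps 1-1 to 1-3.

    max_omega(x)      : smoothed max of a 1-D array x (e.g. gamma * logsumexp(x / gamma))
    grad_max_omega(x) : its gradient, a probability vector over the entries of x
    parents[i]        : list of parent nodes of node i (0-indexed, topological order)
    """
    N = theta.shape[0]
    v = np.zeros(N)
    Q = np.zeros((N, N))              # Q[i, j]: local gradient weight of the edge (i, j)
    # Step 1-2 (forward): v_i = max_Omega over the parents of node i
    for i in range(1, N):
        p = parents[i]
        x = theta[i, p] + v[p]
        v[i] = max_omega(x)
        Q[i, p] = grad_max_omega(x)
    # Step 1-3 (backward): propagate h from the end node back to the start node
    h = np.zeros(N)
    h[N - 1] = 1.0
    E = np.zeros((N, N))
    children = [[] for _ in range(N)]
    for i in range(1, N):
        for j in parents[i]:
            children[j].append(i)
    for j in range(N - 2, -1, -1):
        for i in children[j]:
            E[i, j] = Q[i, j] * h[i]
        h[j] = sum(E[i, j] for i in children[j])
    return v[N - 1], E                # DP_Omega(theta) and grad DP_Omega(theta)
```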
On the other hand, the Gradient layer ∇DPΩ(θ) can be used as a differentiable approximation of Y*(θ) defined by Expression 4. For example, ∇DPΩ(θ) can be used when defining a loss function (let this loss function be L2) that indicates how close a correct answer output Ytrue and a prediction output ∇DPΩ(θ) of the neural network are to each other in neural network training. The loss function L2 is defined by Expression 7 below, for example.
[Formula 30]
Δ(Ytrue,∇DPΩ(θ)) (7)
Here, Δ is a divergence such as a Euclidean distance or a Kullback-Leibler divergence. The smaller the value of the loss function L2 is, the closer the obtained prediction output ∇DPΩ(θ) is to the correct answer output Ytrue.
In the case of using the Gradient layer (i.e., the layer for calculating ∇DPΩ(θ)) as a layer in a neural network, in order to train the parameters of the neural network, it is necessary to calculate the product of the Jacobian ∇∇DPΩ(θ) of ∇DPΩ(θ) (i.e., the Hessian ∇2DPΩ(θ)) and a given matrix Z∈RN×N. This can be calculated by Pearlmutter's method disclosed in Reference Literature 1 below.
Note that the Gradient layer ∇DPΩ(θ) can also be used as a neural network attention mechanism.
Here, it is sufficient that the maxΩ function used in DPΩ(θ) and ∇DPΩ(θ) is appropriately set according to the problem addressed by dynamic programming, and the following are two specific examples of the maxΩ function.
Example 1 of the maxΩ function uses negative entropy as the strongly-convex regularization function Ω.
Take the following, where γ>0.
Accordingly, the maxΩ function, the gradient ∇maxΩ, and the Hessian ∇2maxΩ are expressed as follows.
Now take the following.
JΩ(q)≜(Diag(q)−qqT)/γ [Formula 33]
Also, Diag(q) is a square matrix whose diagonal components are the elements of q. Note that if γ=1, ∇maxΩ matches softmax.
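As one possible implementation of Example 1 (given for illustration only; the function names are hypothetical), the maxΩ function, its gradient, and the Hessian JΩ(q) can be written in Python as follows. With γ=1 the gradient reduces to the ordinary softmax, as noted above.

```python
import numpy as np

def max_omega_entropy(x, gamma=1.0):
    """max_Omega with negative-entropy regularization: gamma * logsumexp(x / gamma).
    Computed in a numerically stable way by shifting by the maximum of x."""
    m = np.max(x)
    return m + gamma * np.log(np.sum(np.exp((x - m) / gamma)))

def grad_max_omega_entropy(x, gamma=1.0):
    """Gradient of the above, equal to softmax(x / gamma); it matches the
    ordinary softmax when gamma = 1."""
    z = np.exp((x - np.max(x)) / gamma)
    return z / np.sum(z)

def hessian_max_omega_entropy(x, gamma=1.0):
    """Hessian J_Omega(q) = (Diag(q) - q q^T) / gamma with q = softmax(x / gamma)."""
    q = grad_max_omega_entropy(x, gamma)
    return (np.diag(q) - np.outer(q, q)) / gamma
```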
Example 2 of the maxΩ function uses squared 2-norm as the strongly-convex regularization function Ω.
Take the following, where γ>0.
Accordingly, the maxΩ function, the gradient ∇maxΩ, and the Hessian ∇2maxΩ are expressed as follows.
Now take the following.
JΩ(q)≜(Diag(s)−ssT/∥s∥1)/γ [Formula 36]
Also, s∈{0,1}D is the vector indicating the support of the vector q (si=1 if qi≠0, and si=0 otherwise). Note that ∇maxΩ is the Euclidean projection onto the simplex.
∇maxΩ in Example 2 matches “sparsemax” described in Reference Literature 2 below. Accordingly, if the maxΩ function in Example 2 is used, it can be expected to obtain structured output data Y that has high sparsity.
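As one possible implementation of Example 2 (again given for illustration only), ∇maxΩ can be computed as the Euclidean projection of x/γ onto the simplex, that is, as sparsemax, using the standard sorting-based threshold computation; the value of the maxΩ function then follows from the obtained q. The names and the exact form below are a sketch, not a definitive implementation.

```python
import numpy as np

def sparsemax(x, gamma=1.0):
    """Euclidean projection of x / gamma onto the probability simplex
    (the gradient of max_Omega in Example 2); the result is typically sparse."""
    z = np.asarray(x, dtype=float) / gamma
    z_sorted = np.sort(z)[::-1]                    # sort in decreasing order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = k * z_sorted > cumsum - 1            # candidate support condition
    k_max = k[support][-1]                         # size of the support
    tau = (cumsum[k_max - 1] - 1) / k_max          # threshold
    return np.maximum(z - tau, 0.0)

def max_omega_sq2norm(x, gamma=1.0):
    """Value of max_Omega with squared 2-norm regularization:
    <q, x> - (gamma / 2) * ||q||^2 evaluated at q = sparsemax(x, gamma)."""
    q = sparsemax(x, gamma)
    return float(np.dot(q, x) - 0.5 * gamma * np.dot(q, q))
```

Because the projection typically sets many components of q exactly to zero, using this maxΩ in the Gradient layer is what can be expected to yield the sparse, easily interpretable structured output data Y mentioned above.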
<Function Configuration>
The following describes the function configurations of the conversion device 100 and the training device 200 of embodiments of the present invention.
(Conversion Device 100)
First, the function configuration of the conversion device 100 of this embodiment of the present invention will be described with reference to
As shown in the figure, the conversion device 100 includes a preprocessing unit 101 and a conversion processing unit 102.
The preprocessing unit 101 and the conversion processing unit 102 convert the structured input data X into the structured output data Y (=∇DPΩ(θ)). Alternatively, the preprocessing unit 101 and the conversion processing unit 102 convert the structured input data X into a dynamic programming solution (=DPΩ(θ)). Note that as previously mentioned, DPΩ(θ) is more accurately an approximation of the dynamic programming solution DP(θ).
The preprocessing unit 101 and the conversion processing unit 102 are realized by one or more neural networks. For example, as previously mentioned, the preprocessing unit 101 is realized by a neural network such as a BLSTM, and the conversion processing unit 102 is realized by a neural network that has a dynamic programming computation layer.
Note that the preprocessing unit 101 and the conversion processing unit 102 may be realized by a neural network that is a combination of a neural network that realizes the preprocessing unit 101 and a neural network that realizes the conversion processing unit 102. In this case, the neural network that realizes the preprocessing unit 101 and the conversion processing unit 102 has a layer for converting the structured input data X into θ (a layer for performing the computation of the preprocessing unit 101), and a layer for converting θ into the structured output data Y (=∇DPΩ(θ) or a dynamic programming solution (=DPΩ(θ)) (a layer for performing the computation of the conversion processing unit 102).
The preprocessing unit 101 performs the preprocessing in Expression 1′ or 2′ using a trained neural network. Specifically, the preprocessing unit 101 converts the structured input data X into θ. This preprocessing is predetermined preprocessing that is determined according to the problem addressed by dynamic programming. For example, as previously mentioned, if the problem addressed by dynamic programming is part-of-speech tagging, the preprocessing is realized by BLSTM.
Note that instead of the conversion device 100 including the preprocessing unit 101, a device different from the conversion device 100 may include the preprocessing unit 101. In this case, it is sufficient that the structured input data X is converted into θ by the preprocessing unit 101 in the other device, and then θ is input to the conversion device 100.
The conversion processing unit 102 performs computation corresponding to DPΩ or ∇DPΩ in Expression 1′ or 2′ using a trained neural network. Specifically, the conversion processing unit 102 converts θ, which was obtained by the preprocessing performed by the preprocessing unit 101, into the structured output data Y (=∇DPΩ(θ)) or the dynamic programming solution (=DPΩ(θ)). The conversion result (DPΩ(θ) or ∇DPΩ(θ)) is then output to a predetermined output destination. Examples of the predetermined output destination include a display device such as a display, a storage device such as an auxiliary storage device, another program, another device, or the next layer in the neural network.
In the case of performing computation corresponding to DPΩ, it is sufficient that the conversion processing unit 102 performs the recursively-defined computation in Expression 5. Accordingly, DPΩ(θ)=vN(θ) is obtained.
On the other hand, in the case of performing computation corresponding to ∇DPΩ, it is sufficient that the conversion processing unit 102 performs the computation shown in the procedures of Step 1-1 to Step 1-3. Accordingly, ∇DPΩ(θ) is obtained.
Note that as described above, whether DPΩ(θ) or ∇DPΩ(θ) is to be obtained as the conversion result of the conversion processing unit 102 is determined according to the problem addressed by dynamic programming. Note that both DPΩ(θ) and ∇DPΩ(θ) may be obtained as conversion results of the conversion processing unit 102.
(Training Device 200)
Next, the function configuration of the training device 200 of this embodiment of the present invention will be described with reference to
As shown in the figure, the training device 200 includes a training data input unit 201, a preprocessing unit 101, a conversion processing unit 102, and a parameter updating unit 202.
Note that the preprocessing unit 101 and the conversion processing unit 102 of the training device 200 are similar to the preprocessing unit 101 and the conversion processing unit 102 of the conversion device 100 described above. However, predetermined initial values or the like have been set as the parameters of the neural networks that realize the preprocessing unit 101 and the conversion processing unit 102 of the training device 200. These parameters are updated through training.
The training data input unit 201 receives a training data set. The training data set is a set of training data made up of sets of structured input data Xtrain for use in training and correct answer output Ytrue that corresponds to the structured input data Xtrain.
The preprocessing unit 101 and the conversion processing unit 102 perform preprocessing and conversion processing (computation corresponding to DPΩ or computation corresponding to ∇DPΩ) on the pieces of structured input data Xtrain included in the training data received by the training data input unit 201, and DPΩ(θ) or ∇DPΩ(θ) is calculated as a conversion result.
The parameter updating unit 202 calculates the derivative of a predetermined loss function based on the conversion result DPΩ(θ) or ∇DPΩ(θ) obtained by the conversion processing unit 102 and the correct answer output Ytrue that corresponds to the structured input data Xtrain subjected to preprocessing and conversion processing, and updates the parameters of the neural network using the calculated result. The derivative of the loss function is calculated using back propagation, for example. Also, the loss function is Expression 6 if the conversion result obtained by the conversion processing unit 102 is DPΩ(θ), but is Expression 7 if the conversion result obtained by the conversion processing unit 102 is ∇DPΩ(θ).
At this time, the parameter updating unit 202 repeatedly updates the parameters of the neural network until a predetermined condition is satisfied. This predetermined condition is for determining whether or not convergence has been obtained in the training of the neural network, and examples of the condition include whether or not the value of the loss function is less than or equal to a predetermined threshold value, and whether or not a predetermined repetition count has been reached.
If the predetermined condition has been satisfied, the parameter updating unit 202 outputs the values of the parameters of the neural network, for example, and then ends the processing.
As Example 1 of operations of the conversion device 100, the following describes the case where the conversion processing unit 102 performs calculation corresponding to the Viterbi algorithm. The Viterbi algorithm is one of the most famous examples of an algorithm used in dynamic programming, and is an algorithm for finding, as an output sequence, the most likely sequence of states among sequences of states for an input sequence in a state transition model of transitions from one state to another state with a predetermined probability at certain times. Letting the states be nodes, the transitions from one state to another state be directed edges, and the probabilities of transitions from one state to another state be weights, the state transition model can be expressed as a weighted directed acyclic graph (DAG). In this case, the sequences of states can be expressed as paths from the start node to the end node in the weighted directed acyclic graph.
Accordingly, letting the structured input data X be the input sequence X=(x1, x2, . . . , xT), the Viterbi algorithm finds, as the solution (output sequence), the most likely sequence of states y (i.e., the most likely path y in the directed acyclic graph) among the sequences of states y=(y1, y2, . . . , yT) for the input sequence X, for example. Here, each xt (t=1, 2, . . . , T) is a D-dimensional real vector, and each yt (t=1, 2, . . . , T) is an element of [S]. Note that [S] expresses the set {1, . . . , S}.
As a specific example, consider the case where the input sequence X is a word sequence X in which each xt is a word, and the output sequence y is a sequence of tags yt corresponding to xt. In this case, the Viterbi algorithm can be thought to be processing for performing part-of-speech tagging on the input sequence X.
Here, letting yt,i,j=1 indicate the case of a transition from node j to node i at the time t, and yt,i,j=0 indicate the case otherwise, a sequence of states y can be expressed as a binary tensor Y of T×S×S whose element of the (t,i,j) component is yt,i,j.
Also, let θt,i,j be the probability of a transition from node j to node i at the time t, and let θ be the real tensor of T×S×S whose element of the (t,i,j) component is θt,i,j. This θ is obtained by the preprocessing unit 101 with use of BLSTM, for example. In other words, in this case, the preprocessing unit 101 of the conversion device 100 obtains the real tensor θ of T×S×S with use of BLSTM, for example.
Accordingly, the Frobenius inner product <Y,θ> corresponds to the sum of the weights θt,i,j of the edges along the path expressed by the sequence of states y. This is shown in the figure.
Here, if this Frobenius inner product <Y,θ>=θ1,3,1+θ2,1,3+θ3,2,1 has the highest score, the path y shown in the figure is obtained as the most likely sequence of states (the solution of the Viterbi algorithm).
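As a small numerical illustration (the T=3, S=3 setting, the random weights, and the specific transitions below are hypothetical), the binary tensor Y of such a state sequence and the corresponding score <Y,θ> can be computed as follows.

```python
import numpy as np

# Illustrative only: T = 3 time steps, S = 3 states, and transitions
# 1 -> 3 at t=1, 3 -> 1 at t=2, 1 -> 2 at t=3 (1-indexed), which matches the
# score theta[1,3,1] + theta[2,1,3] + theta[3,2,1] mentioned above.
T, S = 3, 3
theta = np.random.rand(T, S, S)                   # theta[t, i, j]: weight of transition j -> i at time t
transitions = [(1, 3, 1), (2, 1, 3), (3, 2, 1)]   # (t, i, j), 1-indexed as in the text

Y = np.zeros((T, S, S))
for t, i, j in transitions:
    Y[t - 1, i - 1, j - 1] = 1.0                  # y_{t,i,j} = 1 on the path, 0 otherwise

score = np.sum(Y * theta)                         # Frobenius inner product <Y, theta>
assert np.isclose(score, theta[0, 2, 0] + theta[1, 0, 2] + theta[2, 1, 0])
```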
Note that if Ω=−H (negative entropy), the linear-chain CRFs (Conditional Random Fields) disclosed in Reference Literature 3 below can be reconstructed.
In order to obtain the solution of the Viterbi algorithm, the conversion processing unit 102 of the conversion device 100 need only calculate VitΩ(θ) defined below and ∇VitΩ(θ) calculated from VitΩ(θ), based on the real tensor θ of T×S×S obtained by the preprocessing unit 101.
VitΩ(θ)≜maxΩ(vT(θ)) [Formula 37]
Here, vt(θ)(t=1, . . . , T) is vt(θ)=(vt,1(θ), . . . , vt,S(θ)). Also, the i-th element vt,i(θ) of vt(θ) is defined as follows.
Note that VitΩ(θ) is a convex function for an arbitrary Ω.
Here, VitΩ(θ) can be calculated through the procedures of Step 2-1 to Step 2-3 below (forward procedures). Also, ∇VitΩ(θ) can be calculated by performing the procedures of Step 3-1 and Step 3-2 below (backward procedures) after the procedures of Step 2-1 to Step 2-3 below have been performed. Note that in Step 2-1 to Step 2-3 and in Step 3-1 and Step 3-2 below, Q is a tensor of (T+1)×S×S, and U is a matrix of (T+1)×S, as shown below.
Q≜(qt,i,j) for t=1, . . . ,T+1 and i,j=1, . . . ,S, U≜(ut,j) for t=1, . . . ,T+1 and j=1, . . . ,S [Formula 39]
Also, it is assumed that θ∈RT×S×S is given.
Step 2-1: Let v0=0∈RS.
Step 2-2: For t=1, . . . , T, successively perform the following calculation for each i∈[S].
vt,i=maxΩ(θt,i+vt−1)
qt,i=∇maxΩ(θt,i+vt−1) [Formula 40]
Step 2-3: Calculate maxΩ(vT) using vT=(vT,1, . . . , vT,S) obtained above. This maxΩ(vT) is VitΩ(θ). Also, assume the following for later-described Step 3-1 and Step 3-2.
vT+1,1=maxΩ(vT)
qT+1,1=∇maxΩ(vT) [Formula 41]
Step 3-1: Let uT+1=(1, 0, . . . , 0)∈RS.
Step 3-2: For t=T, . . . , 0, successively perform the following calculation for each j∈[S].
et,⋅,j=qt+1,⋅,j∘ut+1
ut,j=<et,⋅,j, 1S> [Formula 42]
Here, ∘ represents an element-wise product (Hadamard product).
VitΩ(θ) and ∇VitΩ(θ) are thus obtained through the above procedures.
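As an illustrative Python sketch of Steps 2-1 to 3-2 (the functions max_omega and grad_max_omega are assumed to act on 1-D arrays, for example as in Example 1 or Example 2, and all array names are hypothetical), the forward and backward procedures can be written as follows.

```python
import numpy as np

def soft_viterbi(theta, max_omega, grad_max_omega):
    """Illustrative sketch of Steps 2-1 to 3-2 for Vit_Omega(theta) and its gradient.

    theta[t, i, j] : weight of the transition j -> i at time t+1 (shape T x S x S).
    """
    T, S, _ = theta.shape
    v = np.zeros((T + 1, S))
    Q = np.zeros((T + 2, S, S))           # Q[t, i, :]: local gradients (room for t = T+1)
    # Steps 2-1 and 2-2 (forward)
    for t in range(1, T + 1):
        for i in range(S):
            x = theta[t - 1, i, :] + v[t - 1, :]
            v[t, i] = max_omega(x)
            Q[t, i, :] = grad_max_omega(x)
    # Step 2-3: final smoothing over the last values
    vit = max_omega(v[T, :])
    Q[T + 1, 0, :] = grad_max_omega(v[T, :])
    # Steps 3-1 and 3-2 (backward): E accumulates the gradient
    u = np.zeros((T + 2, S))
    u[T + 1, 0] = 1.0
    E = np.zeros((T + 1, S, S))
    for t in range(T, -1, -1):
        for j in range(S):
            e = Q[t + 1, :, j] * u[t + 1, :]      # e_{t,.,j} = q_{t+1,.,j} o u_{t+1}
            E[t, :, j] = e
            u[t, j] = e.sum()                     # u_{t,j} = <e_{t,.,j}, 1_S>
    # E[t, i, j] for t = 1, ..., T gives the gradient with respect to theta[t-1, i, j].
    return vit, E[1:, :, :]
```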
Also, after the procedures of Step 3-1 and Step 3-2 have been performed, the procedures of Step 4-1 to Step 4-5 below can be used to calculate <∇VitΩ(θ),Z> and ∇2VitΩ(θ)Z for the given Z∈RT×S×S. Note that the procedures of Step 4-1 to Step 4-3 are forward procedures, and the procedures of Step 4-4 and Step 4-5 are backward procedures.
Step 4-1: First, assume the following.
v̇0=0S [Formula 44]
Step 4-2: For t=1, . . . , T, successively perform the following calculation for each i∈[S].
v̇t,i=<qt,i, zt,i+v̇t−1>
q̇t,i=JΩ(qt,i)(zt,i+v̇t−1) [Formula 45]
Step 4-3: Calculate the following.
v̇T+1,1=<qT+1,1, v̇T>
q̇T+1,1=JΩ(qT+1,1)v̇T [Formula 46]
Step 4-4: Next, assume the following.
u̇T+1=0S
Q̇T+1=0S×S [Formula 47]
Step 4-5: For t=T, . . . , 0, successively perform the following calculation for each j∈[S].
ėt,⋅,j=q̇t+1,⋅,j∘ut+1+qt+1,⋅,j∘u̇t+1
u̇t,j=<ėt,⋅,j, 1S> [Formula 48]
<∇VitΩ(θ),Z> and ∇2VitΩ(θ)Z are thus obtained through the above procedures.
As Example 2 of operations of the conversion device 100, the following describes the case where the conversion processing unit 102 performs calculation corresponding to DTW (Dynamic Time Warping). Dynamic time warping is used when analyzing the correlation (similarity) between two sequences of time-series data.
Let NA be the sequence length of time-series data A, and NB be the sequence length of time-series data B. Also, let ai be the i-th observed value in the time-series data A, and bj be the j-th observed value in the time-series data B.
Letting yij=1 indicate the case where ai and bj are similar to each other, and yij=0 indicate the case otherwise, when considering a binary matrix Y of NA×NB having yij as elements, the binary matrix Y is an alignment Y expressing the correspondence relationship (similarity relationship) between the time-series data A and the time-series data B.
Also, let θ be a matrix of NA×NB, and the elements of θ be θi,j. As a classical example, a differentiable distance measure d is used such that θi,j=d(ai,bj). This θ is obtained by the preprocessing unit 101 of the conversion device 100. Note that θ will also be called the distance matrix.
Accordingly, letting sets (ai,bj) of observed values be the nodes, the alignment Y expresses a path in a weighted directed acyclic graph (DAG).
Here, the following is the set of all monotone alignment matrices.
𝒴 [Formula 50]
A monotone alignment matrix is a matrix in which the only path that is allowed is a non-backtracking path from the upper left (1,1) component to the lower right (NA,NB) component of the matrix, that is to say, a path that moves from the (i,j) component only to the component to its right, the component below it, or the component diagonally below it to the right. In other words, if yij=1, at least one of yi+1,j, yi,j+1, and yi+1,j+1 is 1.
Due to using the monotone alignment matrix Y, the Frobenius inner product <Y,θ> corresponds to the sum of the weights θi,j of the edges along the path shown by the monotone alignment matrix Y. In other words, the Frobenius inner product <Y,θ> can be used in the alignment cost. In the case of the path shown in
Here, letting vi,j(θ) be the cost of the (i,j) component (cell) of the alignment, vi,j(θ) can be expressed as follows.
vi,j(θ)=θi,j+minΩ(vi,j−1(θ),vi−1,j−1(θ),vi−1,j(θ)) [Formula 51]
Also, the minΩ function, the gradient ∇minΩ, and the Hessian ∇2minΩ are defined and implemented as follows.
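Although the precise definitions are given in the corresponding expressions, one natural choice consistent with the maxΩ function described above is minΩ(x)=−maxΩ(−x), whose gradient is ∇maxΩ(−x) and whose Hessian is −JΩ evaluated at the corresponding q. The short Python sketch below assumes this choice and reuses a maxΩ implementation such as those of Example 1 or Example 2; the names are hypothetical.

```python
import numpy as np

def min_omega(x, max_omega):
    """min_Omega(x) = -max_Omega(-x), assuming the relation stated above."""
    return -max_omega(-np.asarray(x, dtype=float))

def grad_min_omega(x, grad_max_omega):
    """Gradient of min_Omega: grad_max_Omega(-x), a probability vector."""
    return grad_max_omega(-np.asarray(x, dtype=float))
```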
Accordingly, in order to obtain the most likely alignment Y, the conversion processing unit 102 of the conversion device 100 need only calculate the below-defined DTWΩ(θ) and ∇DTWΩ(θ) calculated from DTWΩ(θ), based on the distance matrix θ of NA×NB obtained by the preprocessing unit 101.
DTWΩ(θ)≜vNA,NB(θ)
Here, DTWΩ(θ) can be calculated through the procedures of Step 5-1 and Step 5-2 below (forward procedures). Also, ∇DTWΩ(θ) can be calculated through the procedures of Step 6-1 and Step 6-2 below (backward procedures). Note that in Step 5-1 and Step 5-2 and in Step 6-1 and Step 6-2 below, Q is a tensor of (NA+1)×(NB+1)×3, and E is a matrix of (NA+1)×(NB+1), as shown below.
Q≜(qi,j,k) for i=1, . . . ,NA+1, j=1, . . . ,NB+1, and k=1,2,3
E≜(ei,j) for i=1, . . . ,NA+1 and j=1, . . . ,NB+1
Also, assume that the following is given.
θ∈RNA×NB
Step 5-1: Let v0,0=0. Also, let vi,0=v0,j=∞ for i=1, . . . , NA, j=1, . . . , NB.
Step 5-2: Successively perform the following calculation for i=1, . . . , NA, j=1, . . . , NB.
vi,j=di,j+minΩ(vi,j−1,vi−1,j−1,vi−1,j)
qi,j=∇minΩ(vi,j−1,vi−1,j−1,vi−1,j)∈R3 [Formula 56]
Step 6-1: Next, assume the following for i=1, . . . , NA, j=1, . . . , NB.
qi,NB+1=qNA+1,j=03
ei,NB+1=eNA+1,j=0
qNA+1,NB+1=(0,1,0)
eNA+1,NB+1=1 [Formula 57]
Step 6-2: Successively perform the following calculation for j=NB, . . . , 1, i=NA, . . . , 1.
ei,j=qi,j+1,1ei,j+1+qi+1,j+1,2ei+1,j+1+qi+1,j,3ei+1,j [Formula 58]
The following is obtained through the above procedures.
DTWΩ(θ)=vNA,NB
∇DTWΩ(θ)=(ei,j) for i=1, . . . ,NA and j=1, . . . ,NB
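Steps 5-1 to 6-2 can be sketched in Python as follows, assuming that min_omega and grad_min_omega are supplied as functions acting on the three candidate costs (for example, defined from maxΩ as suggested above); the function and array names are illustrative only.

```python
import numpy as np

def soft_dtw(theta, min_omega, grad_min_omega):
    """Illustrative sketch of Steps 5-1 to 6-2 for DTW_Omega(theta) and its gradient.

    theta : NA x NB distance matrix with elements theta[i, j] = d(a_i, b_j).
    """
    NA, NB = theta.shape
    v = np.full((NA + 1, NB + 1), np.inf)
    v[0, 0] = 0.0
    Q = np.zeros((NA + 2, NB + 2, 3))
    E = np.zeros((NA + 2, NB + 2))
    # Steps 5-1 and 5-2 (forward)
    for i in range(1, NA + 1):
        for j in range(1, NB + 1):
            costs = np.array([v[i, j - 1], v[i - 1, j - 1], v[i - 1, j]])
            v[i, j] = theta[i - 1, j - 1] + min_omega(costs)
            Q[i, j] = grad_min_omega(costs)
    # Steps 6-1 and 6-2 (backward)
    Q[NA + 1, NB + 1] = np.array([0.0, 1.0, 0.0])
    E[NA + 1, NB + 1] = 1.0
    for j in range(NB, 0, -1):
        for i in range(NA, 0, -1):
            E[i, j] = (Q[i, j + 1, 0] * E[i, j + 1]
                       + Q[i + 1, j + 1, 1] * E[i + 1, j + 1]
                       + Q[i + 1, j, 2] * E[i + 1, j])
    # E[i, j] for i = 1..NA, j = 1..NB gives the gradient with respect to theta[i-1, j-1].
    return v[NA, NB], E[1:NA + 1, 1:NB + 1]
```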
Also, after the procedures of Step 5-1 and Step 5-2 have been performed, the procedures of Step 7-1 to Step 7-4 below can be used to calculate <∇DTWΩ(θ),Z> and ∇2DTWΩ(θ)Z for the following given matrix.
Z∈RNA×NB [Formula 60]
Note that the procedures of Step 7-1 and Step 7-2 are forward procedures, and the procedures of Step 7-3 and Step 7-4 are backward procedures.
Step 7-1: First, assume the following for i=0, . . . , NA, j=1, . . . , NB.
v̇i,0=v̇0,j=0 [Formula 61]
Step 7-2: Successively perform the following calculation for i=1, . . . , NA, j=1, . . . , NB.
v̇i,j=zi,j+qi,j,1v̇i,j−1+qi,j,2v̇i−1,j−1+qi,j,3v̇i−1,j
q̇i,j=−JΩ(qi,j)(v̇i,j−1,v̇i−1,j−1,v̇i−1,j)∈R3 [Formula 62]
Step 7-3: Next, assume the following for i=0, . . . , NA, j=1, . . . , NB.
q̇i,NB+1=q̇NA+1,j=03
ėi,NB+1=ėNA+1,j=0 [Formula 63]
Step 7-4: Successively perform the following calculation for j=NB, . . . , 1, i=NA, . . . , 1.
The following is obtained through the above procedures.
<∇DTWΩ(θ),Z>=v̇NA,NB
∇2DTWΩ(θ)Z=(ėi,j) for i=1, . . . ,NA and j=1, . . . ,NB
The following describes effects of the present invention, taking Example 2 of the operations described above as an example, with reference to
In
As shown in (a) and (b) of
Note that ∇DTWΩ(θ) shown as a heat map in (a) and (b) in
<Hardware Configuration>
Lastly, a hardware configuration of the conversion device 100 and the training device 200 of embodiments of the present invention will be described with reference to
As shown in the figure, the conversion device 100 and the training device 200 each have a hardware configuration that includes an input device 301, a display device 302, an external I/F 303, a RAM 304, a ROM 305, an arithmetic device 306, a communication I/F 307, and an auxiliary storage device 308.
The input device 301 is a keyboard, a mouse, a touch panel, or the like, and is used for the input of various operations by a user. The display device 302 is a display or the like, and displays processing results of the conversion device 100. Note that at least either the input device 301 or the display device 302 may be omitted from the conversion device 100 and the training device 200.
The external I/F 303 is an interface with an external apparatus. One example of the external apparatus is a recording medium 303a. The conversion device 100 can read data from and write data to the recording medium 303a or the like via the external I/F 303. The recording medium 303a may have recorded thereon one or more programs for realizing the function units of the conversion device 100 or the function units of the training device 200, for example.
Examples of the recording medium 303a include a flexible disk, a CD (Compact Disc), a DVD (Digital Versatile Disk), an SD memory card (Secure Digital memory card), and a USB (Universal Serial Bus) memory card.
The RAM 304 is a volatile semiconductor memory for temporarily holding programs and data. The ROM 305 is a non-volatile semiconductor memory that can hold programs and data even if the power is cut. The ROM 305 has stored therein settings for an OS (Operating System), settings for a communication network, and the like.
The arithmetic device 306 is a CPU, a GPU (Graphics Processing Unit), or the like, and is for reading out programs and data from the ROM 305 and the auxiliary storage device 308 to the RAM 304 and executing processing.
The communication I/F 307 is an interface for connecting the conversion device 100 to a communication network. One or more programs for realizing the function units of the conversion device 100 or the function units of the training device 200 may be acquired (downloaded) from a predetermined server device or the like via the communication I/F 307.
The auxiliary storage device 308 is a non-volatile storage device such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive), and stores programs and data. The programs and data stored in the auxiliary storage device 308 include an OS and one or more programs for realizing the function units of the conversion device 100 or the function units of the training device 200.
The conversion device 100 and the training device 200 of embodiments of the present invention can realize the various types of processing described above due to having the hardware configuration shown in the figure.
The present invention is not intended to be limited to the embodiments that have been disclosed in detail above, and various modifications and changes can be made without departing from the scope of the claims.