Field of the Invention
The present invention relates to sequence-to-sequence learning and, more particularly, to sequence-to-sequence learning with Recurrent Neural Networks (RNNs) that relies on the agreement between a plurality of RNNs trained on different orderings of the target sequence.
Description of the Background Art
RNNs are now a popular tool for so-called Artificial Intelligence. Unlike Feed-Forward Neural Networks (FFNNs), RNNs have internal memory that stores the history, or context, of their internal states; therefore, RNNs are suitable for processing a series of inputs that arrive in sequence. For the basic architecture of RNNs, see Reference 9 (Mikolov et al., listed at the end of this specification), which is incorporated herein by reference.
RNNs, particularly Long Short-Term Memory networks (LSTMs), provide a universal and powerful solution for various tasks that have traditionally required carefully designed, task-specific solutions. For the details of LSTMs, see Reference 5 (Hochreiter et al.) and Reference 4 (Graves 2013), both of which are incorporated herein by reference. On classification tasks, they can readily summarize an unbounded context, which is difficult for traditional solutions, and this leads to more reliable estimation. They also have advantages over traditional solutions on more general and challenging tasks such as sequence-to-sequence learning (see Reference 11 (Sutskever et al.), which is incorporated herein by reference), where a series of local but dependent estimations is required. RNNs make use of the contextual information of the entire source sequence and, critically, are also able to exploit the entire sequence of previous estimations. On various sequence-to-sequence transduction tasks, RNNs have been shown to be comparable or superior to the state of the art.
In the estimating phase, an RNN operates as follows. An input (f1 f2 f3 &lt;eos&gt;) is prepared. f1, f2 and f3 are fed to the RNN in this order. At the end of the sequence, &lt;eos&gt; is input to the RNN. In response, an output t1 is obtained from the RNN. Next, the output t1 is fed back to the RNN as the next input, and a next output t2 appears at the output of the RNN. This process is repeated until the output sequence (t1 t2 t3 t4) is obtained. If the parameters of the RNN have been well adjusted, the output sequence (t1 t2 t3 t4) will be (e1 e2 e3 e4). This is the estimating phase of the RNN. During the estimating phase, the RNN decoder must compute a large number of probabilities. Because computing resources are limited and a fast response is required, the decoder utilizes a beam search as schematically shown in
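The estimating phase described above can be sketched as follows. The `step` function is a hypothetical stand-in for a trained RNN cell, and the lookup table is a toy model used only for illustration; none of these names come from the specification.

```python
# Sketch of the estimating phase: feed the source, then feed each output
# back as the next input until the end-of-sequence symbol is produced.

def greedy_decode(step, source, eos="<eos>", max_len=10):
    state = None
    for token in source + [eos]:          # consume f1 f2 f3 <eos>
        out, state = step(token, state)
    outputs = []
    while out != eos and len(outputs) < max_len:
        outputs.append(out)
        out, state = step(out, state)     # previous output becomes next input
    return outputs

# Toy "RNN": deterministically maps each input token to the next output.
TABLE = {"<eos>": "e1", "e1": "e2", "e2": "e3", "e3": "e4", "e4": "<eos>"}

def toy_step(token, state):
    return TABLE.get(token, "<eos>"), state

print(greedy_decode(toy_step, ["f1", "f2", "f3"]))  # ['e1', 'e2', 'e3', 'e4']
```

A real decoder would emit a probability distribution at each step rather than a single token, which is why the beam search discussed below is needed.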
Despite their successes on sequence-to-sequence learning, RNNs suffer from a fundamental and crucial shortcoming, which has surprisingly been overlooked. When making estimations, an LSTM needs to encode the previous local estimations as a part of the contextual information. If some of the previous estimations are incorrect, the context for subsequent estimations may include noise, which undermines the quality of subsequent estimations, as shown in
In
A statistical analysis of real estimation results from an LSTM was performed in order to motivate the work reported here. The analysis supports our hypothesis: on test examples longer than 10 tokens, the precision of estimations for the first two characters was higher than 77%, while for the last two characters it was only about 65%.
We conclude that this shortcoming may limit the potential of an RNN, especially for long sequences.
Therefore, there is a need for a new framework for training sequence-to-sequence estimation models, such as RNNs, that is more reliable for long sequences.
To address the above shortcoming, the present invention proposes a simple yet efficient approach. The basic idea of the embodiments of the present invention relies on the agreement between two target-specific directional LSTM models: one generates target sequences from left to right as usual, while the other generates target sequences in another direction, for example, from right to left. Specifically, we first jointly train both directional LSTM models; then, for testing (estimating), we search for target sequences that have support from both models. In this way, the final outputs are expected to contain both good prefixes and good suffixes. Since the joint search problem has been shown to be NP-hard, its exact solution is intractable, and we have therefore developed two approximate alternatives that are simple yet efficient. Even though the proposed search techniques consider only a tiny subset of the entire search space, our empirical results show them to be almost optimal in terms of sequence-level losses.
The first aspect of the present invention is directed to a computer-implemented method of training a first sequence-to-sequence estimation model and a second sequence-to-sequence estimation model. The method includes:
Preferably, the permuting function is a function reversing the order of tokens in an input sequence.
More preferably, each of the first and the second sequence-to-sequence estimation models is an RNN.
The second aspect of the present invention is directed to a computer-implemented joint estimation method utilizing the first and the second sequence-to-sequence estimation models trained by the method described above. The method includes the steps of: receiving an input sequence as an input of the computer; decoding the input utilizing the first and the second sequence-to-sequence estimation models, thereby producing a prescribed number of best hypotheses from each of the first and the second sequence-to-sequence models; permuting tokens, by a second permuting function executed by the computer, in each of the prescribed number of best hypotheses output from the second sequence-to-sequence estimation model; re-scoring each of the best hypotheses output from the first sequence-to-sequence estimation model and the best hypotheses with tokens permuted in the permuting step, utilizing the first and the second sequence-to-sequence estimation models; and selecting the hypothesis with the highest score in the re-scoring step as an estimated output corresponding to the input.
Preferably, the second permuting function is an inverse of the first permuting function.
More preferably, the permuting function is a function reversing the order of tokens in an input sequence.
Further preferably, each of the first and the second sequence-to-sequence estimation models is an RNN.
The re-scoring step may include the steps of: calculating a union set of the best hypotheses output from the first sequence-to-sequence estimation model and the best hypotheses output from the second sequence-to-sequence estimation model with tokens permuted in the permuting step; computing a first score of each of the hypotheses in the union set utilizing the first sequence-to-sequence estimation model; computing a second score of each of the hypotheses in the union set utilizing the second sequence-to-sequence estimation model; and re-scoring each of the hypotheses in the union set by multiplying the first score by the second score.
Preferably, the re-scoring step may include the steps of: calculating a union set of the best hypotheses output from the first sequence-to-sequence estimation model and the best hypotheses output from the second sequence-to-sequence estimation model with tokens permuted in the permuting step; generating a set of new hypotheses by concatenating any one of prefixes in the hypotheses in the union set and any one of suffixes in the hypotheses in the union set; computing a first score of each of the new hypotheses utilizing the first sequence-to-sequence estimation model; computing a second score of each of the new hypotheses utilizing the second sequence-to-sequence estimation model; and re-scoring each of the new hypotheses by multiplying the first score by the second score.
The third aspect of the present invention is directed to a computer-implemented joint estimation apparatus utilizing the first and the second sequence-to-sequence estimation models trained by the method described above. The apparatus includes: a data receiving interface connected to the computer, configured to receive an input sequence as an input; a storage device connected to the computer, for storing the first and the second sequence-to-sequence estimation models; and a control unit. The control unit is configured to: decode the input utilizing the first and the second sequence-to-sequence estimation models, thereby producing a prescribed number of best hypotheses from each of the first and the second sequence-to-sequence models; permute tokens, by executing a second permuting function, in each of the prescribed number of best hypotheses output from the second sequence-to-sequence estimation model; re-score each of the best hypotheses output from the first sequence-to-sequence estimation model and the best hypotheses with tokens permuted, utilizing the first and the second sequence-to-sequence estimation models; and select the hypothesis with the highest score in the re-scoring as the estimated output corresponding to the input.
The second permuting function may be an inverse of the first permuting function.
Preferably, the permuting function is a function reversing the order of tokens in an input sequence.
More preferably, each of the first and the second sequence-to-sequence estimation models is an RNN.
Further preferably, in re-scoring, the control unit is configured to: calculate a union set of the best hypotheses output from the first sequence-to-sequence estimation model and the best hypotheses output from the second sequence-to-sequence estimation model with tokens permuted; compute a first score of each of the hypotheses in the union set utilizing the first sequence-to-sequence estimation model; compute a second score of each of the hypotheses in the union set utilizing the second sequence-to-sequence estimation model; and re-score each of the hypotheses in the union set by multiplying the first score by the second score.
Preferably, in rescoring, the control unit is configured to calculate a union set of the best hypotheses output from the first sequence-to-sequence estimation model and the best hypotheses output from the second sequence-to-sequence estimation model with tokens permuted; generate a set of new hypotheses by concatenating any one of prefixes in the hypotheses in the union set and any one of suffixes in the hypotheses in the union set; compute a first score of each of the new hypotheses utilizing the first sequence-to-sequence estimation model; compute a second score of each of the new hypotheses utilizing the second sequence-to-sequence estimation model; and re-score each of the new hypotheses by multiplying the first score by the second score.
The present invention makes the following contributions: It proposes an efficient approximation of the joint search problem, and demonstrates empirically that it can achieve close to optimal performance. This approach is general enough to be applied to any deep recurrent neural networks.
The foregoing and other objects, features, aspects and advantages of the present invention will become more apparent from the following detailed description of the present invention when taken in conjunction with the accompanying drawings.
In the following, the same components are denoted by the same reference numerals. Their names and functions are the same; therefore, their detailed description will not be repeated.
Although the following embodiments are directed to machine transliteration and grapheme-to-phoneme tasks for reasons of simplicity, the present invention has the potential to be applied to any sequence-to-sequence learning tasks including machine translation, in which the generation of long sequences is a challenging task.
Suppose x denotes a general (either source or target) sequence of tokens (characters in this case), its tth character (character at time step t) is xt and its length is |x|. In particular, a source sequence is denoted by f while a target sequence is denoted by e. θ denotes the overall model parameters of recurrent neural networks: θsuperscript denotes a component parameters of θ depending on superscript, and it is either a bias vector (if superscript includes character b) or a matrix; θ(xt) is a vector representing the word embedding of xt, which is either a source character or a target character; I(θ, xt) denotes the index of word xt in the source or target vocabulary specified by x. Note that in the rest of this paper, the subscript is reserved as the time step in a sequence for easier reading.
The sequence-to-sequence learning model by RNN is defined as follows (Reference 4):
where g is a softmax function, p is an operator over a vector dependent on the specific instance of RNN, [•] denotes the subscript operator of a vector, and the vector ht(x) is the recurrent hidden state of sequence x at time step t, with base cases h−1(f)=0 and h−1(e)=h|f|−1(f).
Given a source sequence f and parameters θ, decoding can be formulated as follows:
where P is given by Equation (1), and Ω(f) is the set of all possible target sequences e that can be generated for f using the target vocabulary. Since the prediction at time step t (i.e. et) is dependent on all the previous estimations, it is NP-hard to find the exact solution of Equation (2). Instead, an approximate solution, beam search, is widely applied (Reference 11 (Sutskever et al., 2014)). It generates the target sequence from left to right, that is, it generates targets from the beginning t=0 to the end (at which point a special sequence-termination symbol is generated). During the search, at each t, a set of top-k partial hypotheses (i.e. prefixes) is maintained in a priority queue, and only these hypotheses are extended. The priority queue is ordered with respect to the model scores of the partial hypotheses. The search process for sequence-to-sequence transduction using LSTM RNNs usually employs a small beam size (a typical beam size being 12), which has been shown empirically to be sufficient for high quality results (Reference 11 (Sutskever et al., 2014)).
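The beam search just described can be sketched as follows. The `next_log_probs` function is a hypothetical stand-in for the locally normalized LSTM distribution P(et | e&lt;t, f), and the toy distribution at the end is illustrative only.

```python
# Minimal left-to-right beam search: keep the top-k partial hypotheses
# (prefixes), extend only those, and stop when every beam entry has
# produced the termination symbol.
import math
import heapq

def beam_search(next_log_probs, beam_size=12, max_len=20, eos="</s>"):
    beams = [(0.0, [])]                                # (log score, prefix)
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((score, prefix))     # finished hypothesis
                continue
            for tok, lp in next_log_probs(prefix).items():
                candidates.append((score + lp, prefix + [tok]))
        # priority ordering by model score; keep only the top-k
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
        if all(p and p[-1] == eos for _, p in beams):
            break
    return beams

# Toy distribution: prefers "a" for two steps, then terminates.
def toy_dist(prefix):
    if len(prefix) < 2:
        return {"a": math.log(0.9), "b": math.log(0.1)}
    return {"</s>": 0.0}

best_score, best_seq = beam_search(toy_dist, beam_size=2)[0]
print(best_seq)  # ['a', 'a', '</s>']
```

With a beam size of 12, as in the experiments, only 12 prefixes survive at each step regardless of vocabulary size.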
Recently, it has been shown that substantial gains in performance can be obtained by using an ensemble of multiple LSTMs, and therefore the following embodiments adopted this approach for the experiments described later in this specification. Decoding with an ensemble model is similar to that of a single LSTM, except that the ensemble's model scores are defined to be the sum of the individual models' scores. This sum can be efficiently calculated during the search process.
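The ensemble scoring rule above is a one-liner; the sketch below uses hypothetical per-model scorers returning log probabilities, so that summing scores corresponds to multiplying probabilities.

```python
# The ensemble score of a hypothesis is the sum of the individual
# models' log scores; `models` is a list of hypothetical scorers.
import math

def ensemble_log_score(models, hypothesis, source):
    return sum(m(hypothesis, source) for m in models)

# Two toy models assigning fixed log probabilities to any hypothesis.
m1 = lambda e, f: math.log(0.8)
m2 = lambda e, f: math.log(0.5)
print(round(ensemble_log_score([m1, m2], ["e1"], ["f1"]), 4))  # -0.9163, i.e. log(0.4)
```

During beam search this sum is maintained incrementally, so the ensemble adds only a constant factor per model to the search cost.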
Despite their successes on various tasks, RNNs still suffer from a fundamental shortcoming. Suppose that at time step t, when estimating et, there is an incorrect estimation et′ for some t′ with 0≦t′&lt;t. Then the hidden states ht″ encode this incorrect information for each t″ in the range t′≦t″≦t, and this can be expected to degrade the quality of all the estimations made using the noisy ht. If the probability of a correct estimation at time step t′ is pt′, then ht will contain noisy information with probability 1−Π0≦t′&lt;t pt′. As t increases, the probability of noise in the context increases quickly, and therefore it becomes more difficult for an RNN to make correct estimations as the sequence length increases. As a result, generic LSTMs cannot maintain the quality of their earlier estimations in their later estimations, as has been explained with reference to
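The growth of this noise probability is easy to verify numerically; the sketch below is an illustration of the formula above with an assumed uniform per-step precision.

```python
# Probability that the context at step t is noisy, given per-step
# precisions p_0, p_1, ...: 1 - prod_{t' < t} p_{t'}.

def noise_probability(step_precisions):
    clean = 1.0
    probs = []
    for p in step_precisions:
        probs.append(1.0 - clean)   # noise probability before this step
        clean *= p
    return probs

# With 90% per-step precision, the 10th step already sees a noisy
# context with probability 1 - 0.9**9.
print(round(noise_probability([0.9] * 10)[9], 2))  # 0.61
```

Even with a fairly high per-step precision, more than half of the late-sequence estimations are conditioned on noisy context, which is consistent with the prefix/suffix precision gap reported above.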
In the subsequent sections, we describe in detail embodiments that overcome this shortcoming.
As explained in the previous section, although the generic (left-to-right) LSTM struggles when estimating suffixes, fortunately, it is very capable of estimating prefixes. By contrast, a complementary LSTM which generates targets from right to left is proficient at estimating suffixes. Inspired by work in the field of word alignment (Reference 8 (Liang et al. 2006)), we propose an agreement model for sequence-to-sequence learning. It encourages the agreement between both target-directional LSTM models.
Formally, we develop the following joint target-bidirectional LSTM model:
Pjnt(e|f;{right arrow over (θ)},{left arrow over (θ)})={right arrow over (P)}(e|f;{right arrow over (θ)})×{left arrow over (P)}(e|f;{left arrow over (θ)})  (3)
where {right arrow over (P)} and {left arrow over (P)} are the left-to-right and right-to-left LSTM models respectively, with definitions similar to Equation (1); {right arrow over (θ)} and {left arrow over (θ)} denote their parameters. This model is called an agreement model or joint model in this specification, and θ=({right arrow over (θ)}, {left arrow over (θ)}) denotes its parameters for simplicity. The training can be written as the minimization of the following equation:
where the example (f, e) ranges over a given training set. To perform the optimization, we employ AdaDelta (Reference 12 (Zeiler 2012)), a mini-batch stochastic gradient method. The gradient is calculated using back-propagation through time (Reference 10 (Rumelhart et al., 1986), which is incorporated herein by reference), with the number of time steps unlimited in the experiments described later. We employ the MAP strategy for testing.
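One natural reading of the minimized objective follows directly from Equation (3): the negative log of the agreement probability decomposes into the sum of the two directional negative log-likelihoods, summed over the training set. The sketch below illustrates this; the directional scorers and the toy training pair are hypothetical stand-ins for trained LSTM models.

```python
# Joint training loss sketch: since P_jnt = P_l2r * P_r2l, its negative
# log is the sum of the two directional negative log-likelihoods.
import math

def joint_loss(log_p_l2r, log_p_r2l, training_set):
    total = 0.0
    for f, e in training_set:
        # the r2l model scores the target in reversed order
        total -= log_p_l2r(e, f) + log_p_r2l(list(reversed(e)), f)
    return total

# Toy scorers assigning fixed log probabilities.
l2r = lambda e, f: math.log(0.5)
r2l = lambda e, f: math.log(0.25)
data = [(["f1"], ["e1", "e2"])]
print(round(joint_loss(l2r, r2l, data), 4))  # 2.0794, i.e. -log(0.5 * 0.25)
```

In practice the gradient of this sum is what back-propagation through time computes for each mini-batch.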
In summary, the training of the bidirectional model encourages the agreement of two unidirectional models by minimizing a joint objective function. Then, for each test sequence, a joint search is performed to find the target sequence with the highest score from the agreement model. In the next section we will introduce the proposed methods for the joint search.
In this section we will first analyze the question of how to search for the best hypothesis during the testing of our bidirectional model, and then propose two possible solutions. The first embodiment is directed to the first solution, and the second embodiment is directed to the second.
The exact inference for an agreement model is usually intractable, even in the cases where the individual models can be factorized locally. In order to address this, on an agreement task using HMMs, Reference 8 (Liang et al. 2006) applies an approximate inference method which depends on the tractable calculation of the marginal probability of each local estimation according to the individual models. Unfortunately, this approximate method cannot be used in our case, because our individual model (the LSTM) is globally dependent and therefore such marginal calculations are not possible (tractable).
The beam search method used for generic LSTMs mentioned before is also impracticable. The reason is that the generation processes proceed in different directions: the joint model generates partial sequences either in a left-to-right or in a right-to-left manner during the search, and it is impossible to calculate both left-to-right and right-to-left model scores simultaneously for each partial sequence.
We propose two simple approximate methods for joint search, which explore a smaller space than that of beam search. Their basic idea is aggressive pruning followed by exhaustive search: we first aggressively prune the entire exponential search space and then obtain the 1-best result via exhaustive search over the pruned space with respect to the agreement model. Critical to the success of this approach is that the aggressive pruning must not eliminate promising hypotheses from the search space prior to the exhaustive search phase.
k-best Approximation
Suppose Ll2r and Lr2l are the two top-k target sequence sets from the generic left-to-right and right-to-left LSTM models, respectively. Then we construct the first search space S1 as the union of these two sets:
S1=Ll2r∪Lr2l
In this way, exhaustively rescoring S1 with the agreement model has complexity O(k). One advantage of this method is that the search space is at most twice the size of that of its component LSTM models, and since the k-best size for generic LSTMs is typically very small, this method is computationally light. To make this explicit, in all the experiments reported here, the k-best size was 12, and the additional rescoring time was negligible.
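The k-best approximation can be sketched as follows. The log-probability tables stand in for the two trained directional LSTMs, and all names are illustrative; the only structural assumptions are that the r2l model emits targets in reversed order and that the agreement score is the product of the two directional probabilities (a sum of log scores).

```python
# k-best approximation: rescore the union S1 of the two directional
# k-best lists with the agreement model and return the 1-best.
import math

def rescore_union(l2r_kbest, r2l_kbest, score_l2r, score_r2l):
    # Reverse the r2l hypotheses back into left-to-right order, then union.
    s1 = {tuple(h) for h in l2r_kbest} | {tuple(reversed(h)) for h in r2l_kbest}
    # Agreement (joint) log score = sum of the two directional log scores.
    return max(s1, key=lambda e: score_l2r(e) + score_r2l(tuple(reversed(e))))

# Toy log-probability tables standing in for the two trained LSTMs.
P1 = {("e1", "e2"): math.log(0.6), ("e1", "e3"): math.log(0.3),
      ("e1", "e4"): math.log(0.1)}
P2 = {("e2", "e1"): math.log(0.2), ("e3", "e1"): math.log(0.3),
      ("e4", "e1"): math.log(0.5)}
best = rescore_union([["e1", "e2"], ["e1", "e3"]], [["e4", "e1"]],
                     lambda e: P1.get(e, math.log(1e-9)),
                     lambda e: P2.get(e, math.log(1e-9)))
print(best)  # ('e1', 'e2')
```

With k=12, S1 contains at most 24 hypotheses, which is why the additional rescoring time is negligible.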
Observing that both the prefixes of sequences in Ll2r and the suffixes of sequences in Lr2l are of high quality, we construct the second search space S2 as follows:
S2={e[:t]∘e′[t′:]|e∈Ll2r, e′∈Lr2l, 0≦t≦|e|, 0≦t′≦|e′|}
where ∘ is a string concatenation operator, [:t] is a prefix operator that yields the first t tokens (characters) of a string, and [t′:] is a suffix operator that yields the suffix of a string starting at position t′. Exhaustively rescoring over this space has complexity O(k2N2), where N is the length of the longest target sequence. In our implementation, the speed of rescoring over this space was approximately 0.1 seconds per sentence, thanks to efficient use of a GPU. The search space of this method includes that of the first method as a proper subset (S2⊃S1), and thus this method can be expected to lead to higher 1-best agreement model scores than the previous method.
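The construction of S2 can be sketched by enumerating every prefix-suffix concatenation; the function name and toy k-best lists are illustrative, and the r2l hypotheses are assumed to be stored in reversed order, as in the previous sketch.

```python
# Build the second search space S2: every concatenation of a prefix of an
# l2r hypothesis with a suffix of a (reversed) r2l hypothesis.

def build_s2(l2r_kbest, r2l_kbest):
    s2 = set()
    for e in l2r_kbest:
        for e2 in (list(reversed(h)) for h in r2l_kbest):
            for t in range(len(e) + 1):            # 0 <= t <= |e|
                for t2 in range(len(e2) + 1):      # 0 <= t' <= |e'|
                    s2.add(tuple(e[:t]) + tuple(e2[t2:]))
    return s2

# One hypothesis per direction: l2r "a b"; r2l "c b" reads "b c" reversed.
space = build_s2([["a", "b"]], [["c", "b"]])
print(len(space), ("a", "b", "c") in space)  # 8 True
```

Note that taking t=|e| and t′=|e′| recovers e itself, which is why S2 contains S1 as a subset; the O(k²N²) bound is just the four nested loops above.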
Referring to
Bidirectional learner 100 includes a left-to-right learning data generator 120 for generating left-to-right learning sequences by concatenating each of the source sequences and its counterpart target sequence, a learner 122 for training left-to-right model 106 in the manner described above with reference to
Joint search apparatus 140 includes: a left-to-right-decoder 160 for decoding source input 142 utilizing left-to-right model 106 and for outputting left-to-right k-best 162 with respective decoding scores; a right-to-left decoder 164 for decoding source input 142 utilizing right-to-left model 108 and for outputting right-to-left k-best 166 with respective decoding scores; and a re-scorer 168 for rescoring each hypothesis in the union of left-to-right k-best 162 and right-to-left k-best 166 by multiplying the respective scores of the hypotheses. The 1-best of the result of rescoring of re-scorer 168 is output as target output 144.
Referring to
Bidirectional learner 100 and joint search apparatus 140 operate as follows.
Referring to
Referring to
When source input 142 is input to joint search apparatus 140, left-to-right-decoder 160 decodes the input utilizing left-to-right model 106 and outputs the left-to-right k-best 162. Right-to-left decoder 164 decodes the input and outputs right-to-left k-best 166.
Referring to
The second embodiment is directed to the polynomial approximation. Referring to
In this embodiment, the search space is substantially larger than that of the first embodiment; however, it is still sufficiently small, and the required amount of computation is reasonably small.
The first embodiment and the second embodiment are directed to joint estimation using left-to-right and right-to-left models. The present invention is not limited to such embodiments. The right-to-left model may be replaced with any model that is trained with the permuted target sequence as long as the permutation G(x) has an inverse permutation H(x) such that e=H(G(e)). The third embodiment is directed to such a generalized version of the first and the second embodiments. Note that the permutation function may be different depending on the number of tokens in a sequence.
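The generalized scheme only requires a permutation G with an inverse H such that e=H(G(e)); reversal, used in the first and second embodiments, is one such permutation. The sketch below checks this property for reversal and for one other (illustrative, hypothetical) length-dependent permutation.

```python
# Reversal is its own inverse: H(G(e)) = e.
def G_reverse(seq):
    return list(reversed(seq))

H_reverse = G_reverse

# Another valid choice: an "even indices first" permutation and its inverse.
def G_even_first(seq):
    return seq[0::2] + seq[1::2]

def H_even_first(seq):
    n = len(seq)
    half = (n + 1) // 2                 # number of even positions
    out = [None] * n
    out[0::2] = seq[:half]
    out[1::2] = seq[half:]
    return out

e = ["e1", "e2", "e3", "e4", "e5"]
print(H_reverse(G_reverse(e)) == e, H_even_first(G_even_first(e)) == e)  # True True
```

As noted above, the permutation may depend on the number of tokens; `H_even_first` illustrates this, since its index arithmetic uses the sequence length.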
Referring to
The process at step 258 can be written in a pseudo code as follows:
while not converged:
  for each training pair (f, e) in a mini-batch:
    compute the gradient of the joint loss of the first model on (f, e) and of the second model on (f, G(e))
    update the parameters of both models by a gradient descent algorithm
return model parameters
The gradient descent algorithm may be AdaGrad, for example.
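The AdaGrad update mentioned above accumulates squared gradients and scales each parameter's step by the inverse square root of that accumulator. The sketch below applies it to a toy one-dimensional objective; the function names and constants are illustrative.

```python
# AdaGrad parameter update: accumulate squared gradients per parameter
# and divide the step by the square root of the accumulator.
import math

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    for i, g in enumerate(grads):
        accum[i] += g * g                             # running sum of g^2
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return params, accum

params, accum = [1.0], [0.0]
# Minimize f(x) = x^2 (gradient 2x) for a few steps.
for _ in range(50):
    params, accum = adagrad_step(params, [2 * params[0]], accum)
print(abs(params[0]) < 0.5)  # parameter has moved toward the minimum at 0
```

AdaDelta, used for the experiments described earlier, extends this idea by replacing the ever-growing accumulator with decaying averages.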
With reference to the right side of
The above-described first and second embodiments are particular cases of this third embodiment.
We evaluated our approach on machine transliteration and grapheme-to-phoneme conversion tasks. For the machine transliteration task, we conducted both Japanese-to-English and English-to-Japanese directional subtasks. The transliteration training, development and test sets were taken from Wikipedia inter-language link titles: the training data consisted of 59000 sequence pairs composed of 313378 Japanese katakana characters and 445254 English characters; the development and test data were manually cleaned and each of them consisted of 1000 sequence pairs. For grapheme-to-phoneme conversion, the training set was the CMU dictionary consisting of about 110000 sequence pairs. We split the available test set consisting of 12374 sequence pairs into two equal-sized parts: the first part was used as the development set and the other was used as the test set. We used both ACC (a sequence-level metric) and FSCORE (a non-sequence-level metric) as the evaluation metrics.
Six baseline systems were used and are listed below. The first four used open source implementations, and the last two were re-implemented:
In addition we use the following notation: nl2r or nr2l denotes the number of left-to-right or right-to-left LSTMs in the ensembles of the ELSTM and BELSTM. For example, BELSTM (5l2r+5r2l) denotes ensembles of five l2r and five r2l LSTMs in the BELSTM.
For a fair comparison, the stopping iteration was selected using the development set for all systems except Moses (which has its own termination criteria). For all of the re-implemented models, the numbers of word embedding units and hidden units were set to 500 to match the configuration used in the NMT.
Suppose the parameters θ of our agreement model are fixed after training, e is the reference sequence for f, S denotes the search space (either S1 or S2) of our approximate methods, and ê(f; θ, Ω), defined as in Equation (2), is the best target sequence for f in the search space Ω∈{S1, S2, Ω(f)}.
If Pjnt(e|f;θ)=Pjnt(ê(f;θ,S)|f;θ), then our approximate search has resulted in the reference, as desired. Otherwise, we have the following possible outcomes:
ST:
Using this as a basis, we designed a scheme to evaluate the potential of our search methods as follows: we randomly select examples from the development set and compare the model scores of the references and the 1-best results from the approximate search methods; we then analyze the distributions of the two cases GT and LT, where our model fails. In addition, to alleviate the dependency on θ, we tried 100 parameter sets optimized by our training algorithm starting from different initializations.
Table 1 shows the comparison between the approximate search methods on the JP-EN test set. We can see that they perform almost identically in terms of ACC and FSCORE. This result is not surprising, because both of them are near optimal (as illustrated in the previous section). Therefore, in the remainder of the experiments, we only report the results using the k-best approximate search.
Table 2 shows the results on the test sets of all three tasks:
JP-EN, EN-JP and GM-PM. Firstly, we can see that the unidirectional neural networks (NMT and GLSTM) have lower performance than the strongest non-neural-network baseline (Sequitur G2P), although they achieve comparable performance on EN-JP. Our agreement model BLSTM shows substantial gains over both the GLSTM and NMT on all three tasks.
More specifically, the gain was up to 5.8 percentage points in terms of ACC and up to 2.2 percentage points in terms of FSCORE. Moreover, BLSTM showed comparable performance relative to Sequitur G2P on both JP-EN and GM-PM, and was markedly better on the EN-JP task.
Secondly, the BELSTM which used ensembles of five LSTMs in both directions consistently achieved the best performance on all the three tasks, and outperformed Sequitur G2P by up to 5.5 points in ACC and 4.7 points in FSCORE. To the best of our knowledge, this method has achieved a new state-of-the-art performance on GM-PM. In addition, BELSTM outperformed the ELSTM by a substantial margin on all tasks, showing that our bidirectional agreement is effective in improving the performance of the unidirectional ELSTM on which it is based.
Furthermore it is clear that the gains of the BELSTM relative to the ELSTM on JP-EN were larger than those on both EN-JP and GM-PM. We believe the explanation is likely to be that the relative length of target sequences with respect to the source sequences on JP-EN is much larger than those on EN-JP and GM-PM, and our agreement model is able to draw greater advantage from the relatively longer target sequences. The relative length of the target for JP-EN was 1.43, whereas the relative lengths for EN-JP and GM-PM were only 0.70 and 0.85, respectively.
One of the main weaknesses of RNNs is their unbalanced outputs, which have high quality prefixes but low quality suffixes, as discussed earlier. Table 3 shows that for the GLSTM (l2r), the difference in precision between prefixes and suffixes is 12%. This gap narrowed with the BLSTM, which outperformed the GLSTM (l2r) on both prefixes and suffixes (with the largest difference on the suffix) and outperformed the GLSTM (r2l) on the prefix. A similar effect was observed with the BELSTM, which generated better, more balanced outputs compared to the ELSTM(5l2r) and ELSTM(5r2l) models.
Our agreement model worked well for long sequences, and this is shown in Table 4. The BLSTM obtained large gains over GLSTM(l2r) and GLSTM(r2l), (the gains were up to 7.7 and 3.1 in terms of ACC and FSCORE, respectively). Furthermore, the BELSTM obtained gains of 1.2 points in terms of FSCORE over the ELSTM(5r2l), but gave no improvements in terms of ACC. This is to be expected, since for long sequences it is hard to generate targets that exactly match the references and thus it is more difficult to improve ACC.
Even though our agreement model can be applied on top of an ensemble, we compare them in order to put the advantage of our model in perspective. To ensure a fair comparison, the number of individual LSTMs in both the ensemble and our agreement model were identical in the experiments. As shown in Table 5, although the BLSTM(r2l+l2r) explores a much smaller search space than the ELSTM(2r2l), it substantially outperformed it. As the total number of LSTMs used was increased to ten, the BELSTM(5l2r+5r2l) still obtained substantial gains over the ELSTM(10l2r). Incorporating more directional LSTMs in the BELSTM(10l2r+10r2l) further increased the performance of the BELSTM.
The bidirectional learner 100 and joint search apparatus 140 in accordance with the above-described embodiments can be realized by computer hardware and computer program or programs executed on the computer hardware.
Referring to
Referring to
The computer program or programs causing computer system 330 to function as various functional units of the embodiments above are stored in a DVD 362 or a removable memory 364 loaded to DVD drive 350 or memory port 352, and transferred to hard disk drive 354. Alternatively, the program or programs may be transmitted to computer 340 through a network 368, and stored in hard disk 354. At the time of execution, the program or programs are loaded to RAM 360. Alternatively, the program or programs may be directly loaded to RAM 360 from DVD 362, from removable memory 364, or through the network.
The program or programs include a sequence or sequences of instructions each consisting of a plurality of instructions causing computer 340 to function as various functional units of the system in accordance with the embodiments above. Some of the basic functions necessary to carry out such functions may be provided by the operating system running on computer 340, by a third-party program, or various programming tool kits or program library installed in computer 340. Therefore, the program or programs might not include all functions required to realize the system and method of the present embodiments. The program or programs may include only the instructions that call appropriate functions or appropriate program tools in the programming tool kits provided by the system in a controlled manner to attain a desired result and thereby to realize the functions of the systems described above. The program or programs may include all necessary functions.
In the embodiment shown in
The operation of computer system 330 executing the computer program is well known. Therefore, details thereof will not be repeated here.
When generating the target in a unidirectional process with RNNs, the character-level precision falls off with distance from the start of the sequence, and the generation of long sequences therefore becomes a problem. We proposed an agreement model based on target-bidirectional LSTMs that symmetrizes the generative process. The exact search for this agreement model is NP-hard, and we therefore developed two approximate search alternatives and analyzed their behavior empirically, finding them to be near optimal. Extensive experiments showed our approach to be very promising, delivering substantial gains over a range of strong baselines on both machine transliteration and grapheme-to-phoneme conversion. Furthermore, our method has achieved the highest reported accuracy on a standard grapheme-to-phoneme conversion dataset.
In principle it is possible to apply our method to other sequence-to-sequence learning tasks, and in future research we plan to study its application to machine translation.
The embodiments as have been described here are mere examples and should not be interpreted as restrictive. The scope of the present invention is determined by each of the claims with appropriate consideration of the written description of the embodiments and embraces modifications within the meaning of, and equivalent to, the languages in the claims.
7. Koehn, P.; Hoang, H.; Birch, A.; Callison-Burch, C.; Federico, M.; Bertoldi, N.; Cowan, B.; Shen, W.; Moran, C.; Zens, R.; Dyer, C.; Bojar, O.; Constantin, A.; and Herbst, E. 2007. Moses: open source toolkit for statistical machine translation. In Proceedings of ACL: Demonstrations.