The invention relates to statistical machine translation, which uses statistical techniques to automate translation between natural languages.
The Decoding problem in Statistical Machine Translation (SMT) is as follows: given a French sentence f and probability distributions Pr(f|e) and Pr(e), find the most probable English translation e of f:

ê = argmax_e Pr(e|f) = argmax_e Pr(f|e) Pr(e)   (1)
French and English are used as the language pair by convention: the formulation of Equation (1) applies to any language pair. This and other background material is established in P. Brown, S. Della Pietra, R. Mercer, 1993, “The mathematics of machine translation: Parameter estimation”, Computational Linguistics, 19(2):263-311. The content of this reference is incorporated herein in its entirety, and is referred to henceforth as Brown et al.
Because of the particular structure of the distribution Pr(f|e) employed in SMT, the above problem can be recast in the following form:

(ê, â) = argmax_(e,a) Pr(f, a|e) Pr(e)   (2)

where a is a many-to-one mapping from the words of the sentence f to the words of e. Pr(f|e), Pr(e), and a are known in SMT parlance as the Translation Model, the Language Model, and the alignment respectively.
Several solutions exist for the decoding problem. The original solution employed a restricted stack-based search, as described in U.S. Pat. No. 5,510,981 issued Apr. 23, 1996 to Berger et al. This approach takes exponential time in the worst case. An adaptation of the Held-Karp dynamic programming algorithm for the Travelling Salesman Problem (TSP) to the decoding problem runs in O(l³m⁴) ≈ O(m⁷) time (where m and l are the lengths of the sentence and its translation respectively) under certain assumptions. For small sentence lengths, an optimal solution to the decoding problem can be found using either the A* heuristic or integer linear programming. The fastest existing decoding algorithm employs a greedy decoding strategy and finds a suboptimal solution in O(m⁶) time. A more complex greedy decoding algorithm finds a suboptimal solution in O(m²) time. Both algorithms are described in U. Germann, “Greedy decoding for statistical machine translation in almost linear time”, Proceedings of HLT-NAACL 2003, Edmonton, Canada.
An algorithmic framework for solving the decoding problem is described in Udupa et al., full publication details for which are: R. Udupa, T. Faruquie, H. Maji, “An algorithmic framework for the decoding problem in statistical machine translation”, Proceedings of COLING 2004, Geneva, Switzerland. The content of this reference is incorporated herein in its entirety. The substance of this reference is also described in U.S. patent application Ser. No. 10/890,496 filed 13 Jul. 2004 in the names of Raghavendra U Udupa and Tanveer A Faruquie, and assigned to International Business Machines Corporation (IBM Docket No JP9200300228US1). The content of this reference is also incorporated herein in its entirety.
The framework described in the above references is referred to as alternating optimization, in which the decoding problem of translating a source sentence to a target sentence is divided into two sub-problems, each of which can be solved efficiently and combined to iteratively refine the solution. The first sub-problem finds an alignment between a given source sentence and a target sentence. The second sub-problem finds an optimal target sentence for a given alignment and source sentence. The final solution is obtained by alternately solving these two sub-problems, such that the solution of one sub-problem is used as the input to the other sub-problem. This approach provides computational benefits not available with some other approaches.
As is apparent from the foregoing description, a decoding algorithm is assessed in terms of speed and accuracy. Improved speed and accuracy relative to competing systems is desirable for the system to be useful in a variety of applications. The speed of the decoding algorithm primarily determines its suitability for real-time translation applications, such as web page translation, bulk document translation, real-time speech-to-speech systems and so on. Accuracy is more highly valued in applications that require high-quality translations but do not require real-time results, such as translations of government documents and technical manuals.
Though progressive improvements have been made in solving the decoding problem, some of which are described above, further improvements—such as in speed and accuracy—are clearly desirable.
A decoding system takes a source text and, using a language model and a translation model, generates a set of target sentences and associated scores, each score representing the probability of the corresponding target sentence. The sentence with the highest probability is the best translation of the given source sentence.
The source sentence is decoded in an iterative manner. In each iteration, two problems are solved. First, an alignment family consisting of exponentially many alignments is constructed and the optimal translation for this family of alignments is found. To construct the alignment family, a set of alignment transformation operators is employed; these operators are applied systematically to a starting alignment, also called the generator alignment. Second, the optimal alignment between the source sentence and the solution obtained in the first step is computed. This alignment is used as the starting alignment for the next iteration.
The described decoding procedure uses the Alternating Optimization framework described in above-mentioned U.S. patent application Ser. No. 10/890,496 filed 13 Jul. 2004 and uses dynamic programming. The time complexity of the procedure is O(m²), where m is the length of the sentence to be translated.
An advantage of the decoding procedure described herein is that it builds a large sub-space of the search space and uses computationally efficient methods to find a solution in this sub-space. This is achieved by providing an effective solution to the first sub-problem of the alternating optimization search. The alternating iterations build and search many such search sub-spaces. Pruning and caching techniques are used to speed up this search.
The decoding procedure solves the first sub-problem by first building a family of alignments containing an exponential number of alignments. This family of alignments represents a sub-space within the search space. Four operations, COPY, GROW, MERGE and SHRINK, are used to build this family of alignments. Dynamic programming techniques are then used to find the “best” translation within this family of alignments, in m phases, where m is the length of the source sentence. Each phase maintains a set of partial hypotheses which are extended in subsequent phases using one of the four operators mentioned above. At the end of the m phases, the hypothesis with the best score is reported.
The reported hypothesis is the optimal translation, which is then used as the input to the second sub-problem of the alternating optimization search. When the first sub-problem of finding the optimal translation is revisited in the next iteration, a new family of alignments is explored. The optimal translation (and its associated alignment) found in the last iteration is used as a foundation to find the best swap of “tablets” that improves the score of the previous alignment. This new alignment is then taken as the generator alignment, and a new family of alignments can be built using the operators.
The algorithm uses pruning and caching to speed performance. Though any pruning method can be used, generator guided pruning is a new pruning technique described herein. Similarly, any of the parameters can be cached, and the caching of language model and distortion probabilities improves performance.
As the search space explored by the procedure is large, two pruning techniques are used. Empirical results obtained by extensive experimentation on test data show that the new algorithm's runtime grows only linearly with m when either of the pruning techniques is employed. The described procedure outperforms existing decoding algorithms and a comparative experimental study shows that an implementation 10 times faster than the implementation of the Greedy decoding algorithm can be achieved.
One or more embodiments of the invention will now be described with reference to the following drawings.
FIGS. 9 to 24 present various experimental results, as briefly outlined below and subsequently described in context.
Decoding is one of the three fundamental problems in SMT and the only discrete optimization problem of the three. The problem is NP-hard even in the simplest setting. In applications such as speech-to-speech translation and automatic webpage translation, the translation system is expected to have a very good throughput. In other words, the Decoder should generate reasonably good translations in a very short duration of time. A primary goal is to develop a fast decoding algorithm which produces satisfactory translations.
An O(m²) algorithm in the alternating optimization framework is described (Section 2.3). The key idea is to construct a reasonably large subspace of the search space of the problem and to design a computationally efficient search scheme for finding the best solution in that subspace. A family of alignments (with Θ(4ᵐ) alignments) is constructed starting with any alignment (Section 3). Four alignment transformation operations are used to build a family of alignments from the initial alignment (Section 3.1).
A dynamic programming algorithm is used to find the optimal solution for the decoding problem within the family of alignments thus constructed (Section 3.3). Although the number of alignments in the subspace is exponential in m, the dynamic programming algorithm is able to compute the optimal solution in O(m²) time. The algorithm is extended to explore several such families of alignments iteratively (Section 3.4). Heuristics can be used to speed up the search (Section 3.5). By caching some of the data used in the computations, the speed is further improved (Section 3.6).
2.1 Preliminaries
Let f and e denote a French sentence and an English sentence respectively. Suppose f has m>0 words and e has l>0 words. These sentences can be represented as f=f1f2 . . . fm and e=e1e2 . . . el, where fj and ei denote the jth word of the French sentence and the ith word of the English sentence respectively. For technical reasons, the null word e0 is prepended to every English sentence. The null word is necessary to account for French words that are not associated with any of the words in e.
An alignment, a, is a mapping which associates each word fj, j=1, . . . ,m in the French sentence f to some word eaj, aj∈{0, . . . ,l}, in the English sentence e; the alignment can be written as a=(a1, . . . ,am).
The fertility of ei, i=0, . . . ,l in an alignment a is the number of words of f mapped to it by a. Let φi denote the fertility of ei, i=0, . . . ,l.
Associated with every alignment are a tableau and a permutation. A tableau is a partition of the words of the sentence f induced by the alignment, and a permutation is an ordering of the words within the partition.
2.1.1 Tableau
Let τ be a mapping from [0, . . . l] to subsets of {f1, . . . fm} defined as follows:
τi = {fj : j ∈ {1, . . . ,m}, aj = i}, ∀ i = 0, . . . ,l
τi is the set of French words which are mapped to the word position i in the translation by the alignment. τi, i=0, . . . l are called the tablets induced by the alignment a and τ is called a tableau. The kth word in the tablet τi is denoted by τik.
2.1.2 Permutation
Let permutation π be a mapping from [0, . . . l] to subsets of {1, . . . ,m} defined as follows:
πi = {j : j ∈ {1, . . . ,m}, aj = i}, ∀ i = 0, . . . ,l.
πi is the set of positions that are mapped to position i by the alignment a. The fertility of ei is φi=|πi|. Assume that the positions in the set πi are ordered, i.e. πik<πi(k+1), k=1, . . . ,φi−1. Further assume that τik=fπik, so that the kth word of the tablet τi is the French word at the kth position in πi.
There is a unique alignment corresponding to a tableau and a permutation.
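To make these definitions concrete, the following sketch derives the tableau and permutation induced by an alignment. The Python list representation (1-indexed French positions stored in 0-indexed lists) is an assumption of this illustration, not part of the described method.

```python
# Minimal sketch: deriving the tableau and permutation induced by an
# alignment. f is a list of m French words; a[j-1] = i means the French
# word fj is mapped to English position i (0 denotes the null word e0).

def tableau_and_permutation(f, a, l):
    """Return (tau, pi): tau[i] is the tablet of French words mapped to
    English position i; pi[i] is the ordered list of French positions
    mapped to i, so tau[i][k] == f[pi[i][k] - 1]."""
    tau = [[] for _ in range(l + 1)]
    pi = [[] for _ in range(l + 1)]
    for j, i in enumerate(a, start=1):   # j runs over French positions 1..m
        pi[i].append(j)                  # positions arrive in increasing order,
        tau[i].append(f[j - 1])          # so each pi[i] is already sorted
    return tau, pi

# Example: a = (1, 0, 1) puts f1 and f3 in tablet 1 and f2 in the null
# tablet, giving fertilities |pi[0]| = 1 and |pi[1]| = 2.
tau, pi = tableau_and_permutation(["f1", "f2", "f3"], [1, 0, 1], l=2)
assert tau[1] == ["f1", "f3"] and pi[0] == [2] and tau[2] == []
```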
2.2 Probability Models
Every English sentence e is a “translation” of f, though some translations are more likely than others. The probability of e being the translation of f is Pr(e|f). In the SMT literature, the distribution Pr(e|f) is replaced by the product Pr(f|e)Pr(e) (by applying Bayes' rule) for technical reasons. Furthermore, a hidden alignment is assumed to exist for each pair (f,e) with probability Pr(f,a|e), and the translation model Pr(f|e) is expressed as a sum of Pr(f,a|e) over all alignments: Pr(f|e)=Σa Pr(f,a|e).
Pr(f,a|e) and Pr(e) are modeled using models that work at the level of words. Brown et al. propose a set of five translation models, commonly known as IBM models 1-5. IBM-4, along with the trigram language model, is known in practice to give better translations than the other models. Therefore, the decoding algorithm is described in the context of IBM-4 and the trigram language model only, although the described methods can be applied to the other IBM models as well.
2.2.1 Factorization of Models
While IBM models 1-5 can be factorized in many ways, a factorization which is useful in solving the decoding problem efficiently is used here. The factorization is along the words of the translation:

Pr(f, a|e) = Πi Ti Di Ni and Pr(e) = Πi Li

and therefore

Pr(f, a|e) Pr(e) = Πi Ti Di Ni Li.
Here, the terms Ti, Di, Ni, and Li are associated with ei. The terms Ti, Di, Ni are determined by the tableau and the permutation induced by the alignment. Only Li is Markovian.
IBM-4 employs the distributions t( ) (word translation model), n( ) (fertility model), d1( ) (head distortion model) and d>1( ) (non-head distortion model), and the language model employs the distribution tri( ) (trigram model).
For IBM-4 and the trigram language model, the terms are:

Ti = Πk t(τik|ei)

Ni = n(φi|ei) for i=1, . . . ,l, and N0 = C(m−φ0, φ0) p0^(m−2φ0) p1^(φ0)

Di = d1(πi1 − cρi | A(eρi), B(τi1)) Πk>1 d>1(πik − πi(k−1) | B(τik))

Li = tri(ei|ei−2 ei−1)

Here A and B are word classes, ρi is the position of the previous fertile English word, cρ is the center of the French words connected to the English word eρ, p1 is the probability of connecting a French word to the null word (e0), and p0=1−p1.
Although IBM-4 is a complex model, the factorization into T, D, N and L terms can be used, as described herein, to design an efficient decoding algorithm.
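The shape of this factorized computation can be illustrated with a short scoring sketch. The placeholder distributions inside the function are assumptions standing in for the trained IBM-4 and trigram models, so the sketch shows only how the per-word T, D, N and L log-terms accumulate.

```python
import math

# Illustrative sketch of scoring with the T, D, N, L factorization. The
# four inner functions are placeholders for the trained distributions
# t(), n(), d1()/d>1() and tri(); their signatures and constant values
# are assumptions made only so the sketch runs.

def log_score(e, tau, pi):
    """e = [e0, e1, ..., el] with e0 the null word; tau and pi are the
    tableau and permutation of the alignment being scored."""
    def t(fw, ew): return 0.1        # placeholder for t(f | e)
    def n(phi, ew): return 0.1       # placeholder for n(phi | e)
    def d(i): return 0.1             # placeholder for the d1/d>1 product D_i
    def tri(w, u, v): return 0.1     # placeholder for tri(e_i | e_{i-2} e_{i-1})

    score = 0.0
    for i in range(1, len(e)):
        score += math.log(n(len(tau[i]), e[i]))                # N_i
        score += sum(math.log(t(fw, e[i])) for fw in tau[i])   # T_i
        score += math.log(d(i))                                # D_i
        u = e[i - 2] if i >= 2 else None
        score += math.log(tri(e[i], u, e[i - 1]))              # L_i
    return score
```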
2.3 Alternating Optimization Framework
The decoder attempts to solve the following search problem:

(ê, â) = argmax_(e,a) Pr(f, a|e) Pr(e)
where Pr(f, a|e) and Pr(e) are defined as described in the previous section.
In the alternating optimization framework, instead of joint optimization, one alternates between optimizing e and a:

ê = argmax_e Pr(f, a|e) Pr(e)   (3)

â = argmax_a Pr(f, a|e)   (4)
In the search problem specified by Equation (3), the length of the translation (l) and the alignment (a) are kept fixed, while in the search problem specified by Equation (4), the translation (e) is kept fixed. An initial alignment is used as a basis for finding the best translation for f with that alignment. Next, keeping the translation fixed, a new alignment is determined which is at least as good as the previous one. Both the alignment and the translation are iteratively refined in this manner. The framework does not require that the two problems be solved exactly. Suboptimal solutions to the two problems in every iteration are sufficient for the algorithm to make progress.
The alternating optimization framework is useful in designing fast decoding algorithms for the following reason:
Lemma 1. Fixed Alignment Decoding: The solution to the search problem specified by Equation 3 can be found in O(m) time by Dynamic Programming.
A suboptimal solution to the search problem specified by Equation (4) can be computed in O(m) time by local search. Further details concerning this proposition can be obtained from Udupa et al., referenced above and incorporated herein in its entirety.
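A minimal sketch of the dynamic program behind Lemma 1 follows. Under the factorization of Section 2.2.1, only the trigram terms couple neighbouring positions when the alignment is fixed, so a Viterbi pass over states keyed by the last two words suffices. The two callables are assumptions of the illustration: local_score(i, w) stands in for log(Ti Ni Di) when ei = w, and tri_score for the log trigram probability.

```python
# Hedged sketch of fixed-alignment decoding (Lemma 1). With bounded
# candidate sets per position, each phase does constant work, so the
# whole pass is linear in the number of positions.

def fixed_alignment_decode(candidates, local_score, tri_score):
    """candidates[i-1] is the list of candidate English words for
    position i = 1..l; returns (best log-score, best translation)."""
    # state: (e_{i-1}, e_i) -> (best log-score, partial translation)
    states = {(None, None): (0.0, [])}
    for i, cands in enumerate(candidates, start=1):
        new = {}
        for (u, v), (s, e) in states.items():
            for w in cands:
                ns = s + local_score(i, w) + tri_score(w, u, v)
                if (v, w) not in new or ns > new[(v, w)][0]:
                    new[(v, w)] = (ns, e + [w])   # recombine on last two words
        states = new
    return max(states.values(), key=lambda sv: sv[0])
```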
A family of alignments can be constructed starting from any alignment.
3.1 Alignment Transformation Operations
Let a, a′ be any two alignments. Let (τ,π) and (τ′,π′) be the tableau and permutation induced by a and a′ respectively. A relation R is defined between alignments: a′Ra if a′ can be derived from a by performing one of the operations COPY, GROW, SHRINK and MERGE on each of (τi,πi), 0 ≤ i ≤ l, starting with (τ1,π1). Let i and i′ be the counters for (τ,π) and (τ′,π′) respectively. Initially, (τ′0,π′0)=(τ0,π0) and i′=i=1. The operations are as follows:
1. Copy:
(τ′i′,π′i′)=(τi,πi);
i=i+1;i′=i′+1.
2. Grow:
(τ′i′,π′i′)=({},{});
(τ′i′+1,π′i′+1)=(τi,πi);
i=i+1;i′=i′+2.
3. Shrink:
(τ′0,π′0)=(τ′0∪τi,π′0∪πi);
i=i+1.
4. Merge:
(τ′i′−1,π′i′−1)=(τ′i′−1∪τi,π′i′−1∪πi);
i=i+1.
The four alignment transformation operations generate alignments that are related to the starting alignment but have some structural difference. The COPY operations maintain structural similarity in some parts between the starting alignment and the new alignment. The GROW operations increase the size of the alignment and therefore, the length of the translation. The SHRINK operations reduce the size of the alignment and therefore, the length of the translation. MERGE operations increase the fertility of words.
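The following sketch illustrates how one sequence of the four operations transforms the tableau and permutation of a generator alignment, and how enumerating all sequences yields the family of Section 3.2. The list-of-lists representation is an assumption of the illustration.

```python
from itertools import product

# Hedged sketch of the four transformation operations acting on the
# tablets (tau[1], pi[1]) ... (tau[l], pi[l]) of a generator alignment.
# ops holds one operation name per tablet. Re-sorting of the null tablet
# after SHRINK is omitted for brevity.

def transform(tau, pi, ops):
    tau2, pi2 = [list(tau[0])], [list(pi[0])]    # null tablet is carried over
    for i, op in enumerate(ops, start=1):
        if op == "COPY":
            tau2.append(list(tau[i])); pi2.append(list(pi[i]))
        elif op == "GROW":
            tau2.append([]); pi2.append([])      # new infertile English word
            tau2.append(list(tau[i])); pi2.append(list(pi[i]))
        elif op == "SHRINK":
            tau2[0] += tau[i]; pi2[0] += pi[i]   # absorbed by the null word
        elif op == "MERGE":
            tau2[-1] += tau[i]; pi2[-1] += pi[i] # folded into previous tablet
    return tau2, pi2

def family(tau, pi):
    """Enumerate the derived tableau/permutation pairs: up to 4^l of
    them, matching the Θ(4ᵐ) family size for a one-to-one generator."""
    l = len(tau) - 1
    for ops in product(("COPY", "GROW", "SHRINK", "MERGE"), repeat=l):
        yield transform(tau, pi, ops)
```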
3.2 A Family of Alignments
Given an alignment a, the relation R defines the following family of alignments: A={a′:a′Ra}. Further, if a is one-to-one, the size of this family of alignments is |A|=Θ(4ᵐ), and a is called the generator of the family A.
A family of alignments A is determined and the optimal solution in this family is computed:

(ê, â) = argmax_(e, a∈A) Pr(f, a|e) Pr(e)   (5)
3.3 A Dynamic Programming Algorithm
Computing the optimal solution in a family of alignments is now described.
Lemma 2. The solution to the search problem specified by Equation 5 can be computed in O(m²) time by Dynamic Programming when A is a family of alignments as defined in Section 3.2.
The dynamic programming algorithm builds a set of hypotheses and reports the hypothesis with the best score and the corresponding translation, tableau and permutation. The algorithm works in m phases and in each phase it constructs a set of partial hypotheses by expanding the partial hypotheses from the previous phase. A partial hypothesis after the ith phase, h, is a tuple (e0 . . . ei′, τ′0 . . . τ′i′, π′0 . . . π′i′, C), where e0 . . . ei′ is the partial translation, τ′0 . . . τ′i′ is the partial tableau, π′0 . . . π′i′ is the partial permutation, and C is the score of the partial hypothesis.
At the beginning of the first phase, there is only one partial hypothesis, (e0,τ′0,π′0,0). In the ith phase, a hypothesis is extended as follows:
1. Do an alignment transformation operation on the pair (τi,πi)
2. For each pair (τ′i′,π′i′) added by the operation, extend the partial translation accordingly, as described below.
As observed in Section 3.2, an alignment transformation operation can result in the addition of 0, 1 or 2 new tablets. Since each tablet corresponds to an English word, the expansion of a partial hypothesis appends 0, 1 or 2 new words to the partial sentence (a sketch of this expansion follows the list below):
1. COPY: An English word ei′ is appended to the partial translation (i.e. the partial translation grows from e0 . . . ei′−1 to e0 . . . ei′). The word ei′ is chosen from the set of candidate translations of the French words in the tablet τi. If the number of candidate translations a French word can have in the English vocabulary is bounded by NF, then the number of new partial hypotheses resulting from the COPY operation is at most NF.
2. GROW: Two English words ei′,ei′+1 are appended to the partial translation as a result of which the partial translation grows from e0 . . . ei′−1 to e0 . . . ei′ei′+1. The word ei′ is chosen from the set of infertile English words and ei′+1 from the set of English translations of the French words in the tablet τi. If the number of infertile words in the English vocabulary is N0, then the number of new partial hypotheses resulting from the GROW operation is at most NFN0.
3. SHRINK, MERGE: The partial translation remains unchanged. Only one new partial hypothesis is generated.
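A sketch of one phase of this expansion might look as follows, with the score updates via the T, D, N and L terms omitted for brevity. The tuple layout of a hypothesis and the candidate-lookup callables are assumptions of the illustration.

```python
# Hedged sketch of hypothesis expansion for tablet i. cand_translations
# bounds the branching by NF; infertile_words by N0, as in the text.

def expand(h, tau_i, pi_i, cand_translations, infertile_words):
    """Yield extensions of h = (e, tau, pi, score); score updates omitted."""
    e, tau, pi, score = h
    # COPY: one new word, chosen among candidate translations of tablet i
    for w in cand_translations(tau_i):
        yield (e + [w], tau + [tau_i], pi + [pi_i], score)
    # GROW: an infertile word followed by a translation of tablet i
    for w0 in infertile_words:
        for w in cand_translations(tau_i):
            yield (e + [w0, w], tau + [[], tau_i], pi + [[], pi_i], score)
    # MERGE: tablet i is folded into the last tablet; no new word
    yield (e, tau[:-1] + [tau[-1] + tau_i], pi[:-1] + [pi[-1] + pi_i], score)
    # SHRINK: tablet i is absorbed by the null tablet; no new word
    yield (e, [tau[0] + tau_i] + tau[1:], [pi[0] + pi_i] + pi[1:], score)
```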
At the end of a phase of expansion, there is a set of partial hypotheses. These hypotheses can be classified based on the following:
1. The last two words in the partial translation (ei′−1, ei′),
2. Fertility of the last word in the partial translation (|π′i′|) and
3. The center of the tablet corresponding to the last word in the partial translation.
If two partial hypotheses in the same class are extended using the same operation, then their scores increase by an equal amount. Therefore, for each class of hypotheses the algorithm retains only the one with the highest score.
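The recombination step can be sketched as follows. The dict-based hypothesis layout is assumed for illustration, and the tablet center is taken here as the ceiling of the mean position; the exact definition follows the distortion model.

```python
import math

# Minimal sketch of hypothesis recombination: each partial hypothesis is
# reduced to its class key, and within a class only the highest-scoring
# hypothesis can lead to the optimum, so the others are discarded.

def recombine(hypotheses):
    """hypotheses: iterable of dicts with keys 'e' (partial translation,
    e[0] is the null word), 'pi' (partial permutation) and 'score'."""
    best = {}
    for h in hypotheses:
        e, last = h["e"], h["pi"][-1]
        center = math.ceil(sum(last) / len(last)) if last else 0
        key = (e[-2] if len(e) >= 2 else None,  # last two words,
               e[-1],
               len(last),                        # fertility of the last word,
               center)                           # center of its tablet
        if key not in best or h["score"] > best[key]["score"]:
            best[key] = h
    return list(best.values())
```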
3.3.1 Analysis
The algorithm has m phases, and in each phase a set of partial hypotheses is expanded. The number of partial hypotheses generated in any phase is bounded by the product of the number of hypothesis classes in that phase and the number of partial hypotheses yielded by the alignment transformation operations. The number of hypothesis classes in phase i is determined as follows. There are at most |VE|² choices for (ei′−1, ei′), at most φmax choices for the fertility of ei′ and m choices for the center of the tablet corresponding to ei′. Therefore, the number of hypothesis classes in phase i is at most φmax|VE|²m. The alignment transformation operations on a partial hypothesis result in at most NF(1+N0)+2 new partial hypotheses. Therefore, the number of partial hypotheses generated in phase i is at most φmax(NF(1+N0)+2)|VE|²m. As there are m phases in total, the total number of partial hypotheses generated by the algorithm is at most φmax(NF(1+N0)+2)|VE|²m². Note that φmax, NF and N0 are constants independent of the length of the French sentence. Therefore, the number of operations in the algorithm is O(m²). In practice φmax<10, NF≤11, and N0≤100, so the branching constant NF(1+N0)+2 is at most 11·101+2=1113.
3.4 Iterative Search Algorithm
Several alignment families are explored iteratively using the alternating optimization framework. In each iteration two problems are solved. In the first problem, a generator alignment a is used to build an alignment family A, and the best solution in that family is determined using the dynamic programming algorithm. In the second problem, a new generator is determined for the next iteration: the tablets in the solution found in the first step are swapped, and the best swap of tablets that improves the score of the solution is determined. The resulting alignment ã is not part of the alignment family A, and is used as the generator in the next iteration.
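The overall iterative search can be sketched as follows. The two callables are assumed stand-ins: search_family for the dynamic programming algorithm of Section 3.3 over the family generated by a, and best_tablet_swap for the swap search described above.

```python
# Hedged sketch of the iterative search of Section 3.4.

def decode(f, search_family, best_tablet_swap, max_iterations=10):
    m = len(f)
    a = list(range(1, m + 1))            # starting generator a_j = j, l = m
    best_e, best_score = None, float("-inf")
    for _ in range(max_iterations):
        e, a_opt, score = search_family(f, a)        # best solution in A(a)
        if score > best_score:
            best_e, best_score = e, score
        a, improved = best_tablet_swap(f, e, a_opt)  # new generator, outside A(a)
        if not improved:                             # no score-improving swap left
            break
    return best_e
```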
3.5 Pruning
Although the dynamic programming algorithm takes O(m²) time to compute the translation, the constant hidden in the O-notation is prohibitively large. In practice, the number of partial hypotheses generated by the algorithm is substantially smaller than the bound derived in Section 3.3.1, but still large enough to make the algorithm slow. Two partial-hypothesis pruning schemes that are helpful in speeding up the algorithm are described below.
3.5.1 Pruning with the Geometric Mean
At each phase of the algorithm, the geometric mean of the scores of the partial hypotheses generated in that phase is computed. Only those partial hypotheses whose scores are at least as good as the geometric mean are retained for the next phase; the rest are discarded. Although conceptually simple, pruning the partial hypotheses with the geometric mean as the cutoff is an efficient pruning scheme, as demonstrated by empirical results.
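Assuming scores are kept as log-probabilities, the geometric-mean cutoff reduces to an arithmetic mean of log-scores, as the following sketch shows.

```python
# Minimal sketch of geometric-mean pruning: the geometric mean of the
# raw scores corresponds to the arithmetic mean of the log-scores.

def prune_geometric_mean(hypotheses):
    mean_log = sum(h["score"] for h in hypotheses) / len(hypotheses)
    return [h for h in hypotheses if h["score"] >= mean_log]
```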
3.5.2 Generator Guided Pruning
In this scheme, the generator of the alignment family A is used to find the best translation (and tableau and permutation) using the O(m) algorithm for Fixed Alignment Decoding. The score C(i) of the hypothesis that generated this optimal solution is then determined at each of the m phases. These scores are used to prune the partial hypotheses of the dynamic programming algorithm: in the ith phase, only those partial hypotheses whose scores are at least C(i) are retained for the next phase and the rest are discarded. This pruning strategy incurs the overhead of running the Fixed Alignment Decoding algorithm to compute the cutoff scores. However, this overhead is insignificant in practice.
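A sketch of the scheme follows; phase_scores is an assumed helper that runs the O(m) fixed-alignment decoder on the generator and returns the per-phase scores C(i) of its optimal hypothesis.

```python
# Hedged sketch of generator-guided pruning: the per-phase scores of the
# fixed-alignment solution become cutoffs for the dynamic program.

def make_generator_guided_pruner(f, generator, phase_scores):
    cutoffs = phase_scores(f, generator)   # C(1..m), computed once
    def prune(phase, hypotheses):          # phase is 1-indexed
        return [h for h in hypotheses if h["score"] >= cutoffs[phase - 1]]
    return prune
```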
3.6 Caching
The probability distributions (n, d1, d>1, t and tri) are loaded into memory by the algorithm before decoding. However, it is better to cache the most frequently used data in smaller data structures so that subsequent accesses are relatively faster.
3.6.1 Caching of Language Model
While decoding the French sentence, one knows a priori the set of all trigrams that could potentially be accessed by the algorithm. This is because these trigrams are formed from the set of all candidate English translations of the French words in the sentence and the set of infertile words. Therefore, a unique id can be assigned to every such trigram. When a trigram is accessed for the first time, it is stored in an array indexed by its id. Subsequent accesses to the trigram make use of the cached value.
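One possible realization of such a cache, sketched under the assumption that the per-sentence vocabulary (candidate translations plus infertile words) is known up front and is small:

```python
# Minimal sketch of the language-model cache: each reachable trigram
# gets a dense id, and the underlying model is consulted only on the
# first access. The flat array is affordable because the per-sentence
# vocabulary is small; a dict would suit larger vocabularies.

class TrigramCache:
    def __init__(self, vocab, tri_prob):
        self.index = {w: k for k, w in enumerate(vocab)}  # word -> dense id
        self.n = len(vocab)
        self.cache = [None] * (self.n ** 3)
        self.tri_prob = tri_prob          # the underlying trigram model

    def prob(self, w1, w2, w3):
        i = (self.index[w1] * self.n + self.index[w2]) * self.n + self.index[w3]
        if self.cache[i] is None:         # first access: consult the model
            self.cache[i] = self.tri_prob(w1, w2, w3)
        return self.cache[i]              # later accesses hit the array
```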
3.6.2 Caching of Distortion Model
As with the language model, the actual number of distortion probability data values accessed by the decoder while translating a sentence is relatively small compared to the total number of distortion probability data values. Further, distortion probabilities are not dependent on the French words but on the position of the words in the French sentence. Therefore, while translating a batch of sentences of roughly the same length, the same set of data is accessed repeatedly. The distortion probabilities required by the algorithm are cached.
3.6.3 Starting Generator Alignment
The algorithm requires a starting alignment to serve as the generator for the family of alignments. The alignment aj=j, i.e., l=m and a=(1, . . . ,m) is used as the starting alignment.
This section provides an overview of the procedures involved in determining optimal alignments. The following flowcharts are used to describe the procedure.
The components of the computer system 800 include a computer 820, a keyboard 810 and mouse 815, and a video display 890. The computer 820 includes a processor 840, a memory 850, an input/output (I/O) interface 860, a communications interface 865, a video interface 845, and a storage device 855. All of these components are operatively coupled by a system bus 830 to allow particular components of the computer 820 to communicate with each other via the system bus 830.
The processor 840 is a central processing unit (CPU) that executes the operating system and the computer software program executing under the operating system. The memory 850 includes random access memory (RAM) and read-only memory (ROM), and is used under direction of the processor 840.
The video interface 845 is connected to video display 890 and provides video signals for display on the video display 890. User input to operate the computer 820 is provided from the keyboard 810 and mouse 815. The storage device 855 can include a disk drive or any other suitable storage medium.
The computer system 800 can be connected to one or more other similar computers via a communications interface 865 using a communication channel 885 to a network, represented as the Internet 880.
The computer software program may be recorded on a storage medium, such as the storage device 855. Alternatively, the computer software can be accessed directly from the Internet 880 by the computer 820. In either case, a user can interact with the computer system 800 using the keyboard 810 and mouse 815 to operate the computer software program executing on the computer 820. During operation, the software instructions of the computer software program are loaded to the memory 850 for execution by the processor 840.
Other configurations or types of computer systems can be equally well used to execute computer software that assists in implementing the techniques described herein.
6.1 Experimental Setup
The results of several experiments are presented. These experiments are designed to study the following:
1. Effectiveness of the pruning techniques.
2. Effect of caching on the performance.
3. Effectiveness of the alignment transformation operations.
4. Effectiveness of the iterative search scheme.
Fixed Alignment Decoding is used as the baseline algorithm in the experiments. To compare the performance of the algorithm with a state-of-the-art decoding algorithm, the Greedy decoder, available from http://www.isi.edu/licensed-sw/rewrite-decoder, is used. In the empirical results from the experiments, the logscore (i.e. the negative logarithm) of the translation score is used in place of the translation score. When reporting scores for a set of sentences, the geometric mean of the translation scores is treated as the statistic of importance and the average logscore is reported.
6.1.1 Training of the Models
A French-English translation model (IBM-4) is built by training over a corpus of 100 K sentence pairs from the Hansard corpus. The translation model is built using the GIZA++ toolkit. Further details can be obtained from http://www-i6.informatik.rwth-aachen.de/Colleagues/och/software/GIZA++.html and Och and Ney, “Improved statistical alignment methods”, ACL00, pages 440-447, Hong Kong, China, 2000. The content of both these references is incorporated herein in its entirety. There were 80 word classes, which were determined using the mkcls tool. Further details can be obtained from http://www-i6.informatik.rwth-aachen.de/Colleagues/och/software/mkcls.html. The content of this reference is incorporated herein in its entirety. An English trigram language model is built by training over a corpus of 100 K English sentences. The CMU-Cambridge Statistical Language Modeling Toolkit v2, developed by R. Rosenfeld and P. Clarkson and available from http://mi.eni.cam.ac.uk/˜prc14/toolkit documentation.html, is used for training the language model. While training the translation and language models, the default settings of the corresponding tools are used. The corpora used for training the models were tokenized using an in-house tokenizer.
6.1.2 Test Data
The data used in the experiments consisted of 11 sets of 100 French sentences picked randomly from the French part of the Hansard corpus. The sets are formed based on the number of words in the sentences: the 11 sets cover the length ranges 6-10, 11-15, . . . , 56-60.
6.2 Decoder Implementation
The algorithm is implemented in C++ and compiled using gcc with the -O3 optimization setting. Methods with fewer than 15 lines of code are inlined.
6.2.1 System
The experiments are conducted on an Intel Dual Processor machine (2.6 GHz CPU, 2 GB RAM) with Linux as the OS, with no other job running.
6.3 Starting Generator Alignment
The algorithm requires a starting alignment to serve as the generator for the family of alignments. The alignment aj=j, i.e., l=m and a=(1, . . . ,m) is used as the starting alignment. This particular alignment is a natural choice for French and English as their word orders are closely related.
6.4 Effect of Pruning
The following measures are indicative of the effectiveness of pruning:
1. Percentage of partial hypotheses retained by the pruning technique at each phase of the dynamic programming algorithm.
2. Time taken by the algorithm for decoding.
3. Logscores of the translations.
6.4.1 Pruning with the Geometric Mean (PGM)
6.4.2 Generator Guided Pruning (GGP)
6.4.3 Performance
The logscores of the translations found by PGM are compared with those of the translations found by the dynamic programming algorithm without pruning; the logscores were identical. This means that the pruning techniques are very effective in identifying and removing inconsequential partial hypotheses.
6.5 Effect of Caching
In caching, the number of cache hits is a measure of the repeated use of the cached data. Also of interest is the improvement in runtime due to caching.
6.5.1 Language Model Caching
6.5.2 Distortion Model Caching
6.6 Alignment Transformation Operations
To understand the effect of the alignment transformation operations on the performance of the algorithm, experiments are conducted in which each of the GROW, MERGE and SHRINK operations is removed in turn, with the decoder using Generator Guided Pruning.
The MERGE operation, while not contributing significantly to the runtime of the algorithm, plays a role in improving the scores.
6.7 Iterative Search
6.8 Comparison with the Greedy Decoder
The performance of the algorithm is compared with that of the Greedy decoder.
A suitable decoding algorithm is key to the speed and accuracy of a statistical machine translation system. Decoding is in essence an optimization procedure for finding a target sentence. While every problem instance has an “optimal” target sentence, finding that target sentence under time and computational constraints is a central challenge for such systems. Since the space of possible translations is large, decoding algorithms, which typically examine only a portion of that space, risk overlooking satisfactory solutions. Various alterations and modifications can be made to the techniques and arrangements described herein, as would be apparent to one skilled in the relevant art.