The present invention relates to automatic translation systems and, in particular, to statistical machine translation systems and methods.
Recently, significant progress has been made in the application of statistical techniques to the problem of translation between natural languages. The promise of statistical machine translation (SMT) is the ability to produce translation engines automatically without significant human effort for any language pair for which training data is available. However, current SMT approaches based on the classic word-based IBM models (Brown et al. 1993) are known to work better on language pairs with similar word ordering. Recently, strides toward correcting this problem have been made by bilingually learning phrases that can improve the translation accuracy. However, these experiments (Wang 1988, Yamada and Knight 2001, Och et al. 2000, Koehn et al. 2002, Zhang et al. 2003) have neither gone far enough in harnessing the full power of phrasal-translation, nor successfully solved the structural problems in the output translations.
This motivates the present invention of syntactic chunk-based, two-level machine translation methods, which learn vocabulary translations within syntactically and semantically independent units and learn global structural relationships among the chunks separately. The invention not only produces higher-quality translations but also needs much less training data than other statistical models, since it is considerably more modular and less dependent on training data.
The object of the present invention is to provide a chunk-based statistical machine translation system.
Briefly, the present invention performs two separate levels of training to learn lexical and syntactic properties, respectively. To achieve this new model of translation, the present invention introduces chunk alignment into a statistical machine translation system.
Syntactic chunking segments a sentence into syntactic phrases such as noun phrases, prepositional phrases, and verbal clusters without hierarchical relationships between the phrases. In this invention, part-of-speech information and a handful of chunking rules suffice to perform accurate chunking. Syntactic chunking is performed on both source and target languages independently. The aligned chunks serve not only as the direct source for chunk translation but also as the training material for statistical chunk translation. Translation models such as the lexical model, fertility model, and distortion model within chunks are learned from the aligned chunks in the chunk-level training.
The translation component of the system comprises chunk translation, reordering, and decoding. The system chunk-parses the sentence into syntactic chunks and translates each chunk both by looking up candidate translations in the aligned chunk table and by a statistical decoding method that uses the translation models obtained during the chunk-level training. Reordering is performed using blocks of chunk translations instead of words, and multiple candidate translations of chunks are decoded using both a word language model and a chunk head language model.
The foregoing and other objects, aspects and advantages of the invention will be better understood from the following detailed description of preferred embodiments of this invention when taken in conjunction with the accompanying drawings in which:
In the preferred embodiment of the present invention, a chunk-based statistical machine translation system offers many advantages over other known statistical machine translation systems. A presently preferred embodiment of the present invention can be constructed in a two-step process. The first step is the training step, in which models are created for translation purposes. The second step is the translation step, in which the models are used to translate input sentences.
In the preferred embodiments of the present invention, two separate levels of training are performed to learn lexical and syntactic properties, respectively. To achieve this new model of translation, chunk alignment is provided in a statistical machine translation system.
In the second step, the translation step, referring to
Referring to
The aligned chunks 22 produced by chunk alignment 16 serve not only as the source for the direct chunk translation table 32 but also as the training material for statistical chunk translation to produce translation models 34. Translation models such as the lexical model, fertility model, and distortion model within chunks are learned from the aligned chunks in the chunk-level training 24. This second level of SMT training is one of the important novel features of the invention. Models learned in this way tend to be more accurate than those learned from aligned sentences.
The initial target-side corpus is used to build a word language model 38. The word language model is a statistical n-gram language model trained on the target language corpus.
The chunked target sentences go through a chunk-head extractor 18 to generate target sentences of chunk heads, which are used to build a chunk-head language model 36. The chunk-head language model is a statistical n-gram language model trained on the chunk head sequences of the target language. The head word of a chunk is determined by linguistic rules. For instance, the noun is the head of a noun phrase, and the verb is the head of a verb phrase. The chunk-head language model can capture long-distance relationships between words by omitting structurally unimportant modifiers. The chunk-head language model is made possible by syntactic chunking, and it is another advantage of the invention.
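As an illustration only, a word n-gram language model of the kind described above might be sketched as follows. The sketch uses bigrams with add-one smoothing purely for brevity; this smoothing is a stand-in chosen for the example and is not stated to be the method of the invention. The function name `train_bigram_lm` and the boundary symbols `<s>` and `</s>` are assumptions made for the example.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Train a toy bigram LM on tokenized target-language sentences.

    Add-one smoothing is an illustrative assumption, not the patent's method.
    Returns a function logprob(prev, word).
    """
    context_counts, bigram_counts = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        context_counts.update(padded[:-1])            # every token that has a successor
        bigram_counts.update(zip(padded, padded[1:]))
    # Possible successors: every observed word plus the end-of-sentence symbol.
    vocab_size = len({w for s in sentences for w in s} | {"</s>"})

    def logprob(prev, word):
        numerator = bigram_counts[(prev, word)] + 1
        denominator = context_counts[prev] + vocab_size
        return math.log(numerator / denominator)

    return logprob
```

The same routine could in principle be applied to chunk-head sequences instead of word sequences to obtain a chunk-head language model.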
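By way of a hedged example, chunk-head extraction could be sketched as below. The head table `HEAD_TAG_PREFIXES`, the Penn-Treebank-style tags, the choice of the rightmost matching word, and the fallback to the last word of the chunk are all assumptions made for the illustration; the specification only states that heads are determined by linguistic rules.

```python
# Hypothetical head rules: chunk type -> POS-tag prefixes that may head the chunk.
HEAD_TAG_PREFIXES = {
    "NP": ("NN",),  # the noun heads a noun phrase
    "VP": ("VB",),  # the verb heads a verb phrase
}


def chunk_head(chunk_type, tagged_words):
    """Return the rightmost word whose tag matches the head rule.

    Falls back to the last word of the chunk when no rule applies
    (an assumption made for this sketch).
    """
    prefixes = HEAD_TAG_PREFIXES.get(chunk_type, ())
    for word, tag in reversed(tagged_words):
        if tag.startswith(prefixes):
            return word
    return tagged_words[-1][0]


def head_sequence(chunks):
    """Map a chunked sentence to its sequence of chunk heads."""
    return [chunk_head(chunk_type, words) for chunk_type, words in chunks]
```

For a sentence chunked as "the old man / has eaten / a red apple", this sketch yields the head sequence "man eaten apple", in which the subject, verb, and object heads become adjacent for n-gram modeling.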
Referring to
Reordering 108 is performed using blocks of chunk translations instead of words, and multiple candidate translations of chunks are decoded using a word language model 38 and a chunk head language model 36. Reordering can be performed before the decoder or integrated with the decoder.
Depending on the language, linguistic processing such as morphological analysis and stemming is performed to reduce vocabulary size and to balance the source and target languages. When a language is inflectionally rich, like Korean, many suffixes are attached to the stem to form one word. As a result, one stem has many different forms, all of which are translated into one word in another language. Since a statistical system cannot tell that these various forms are related and therefore treats them as different words, a potentially severe data sparseness problem may result. By decomposing a complex word into prefixes, stem, and suffixes, and optionally removing semantically unimportant parts, we can reduce vocabulary size and mitigate the data sparseness problem.
Part-of-speech tagging is performed on the source and target languages before chunking. Part-of-speech tagging provides syntactic properties especially necessary for chunk parsing. One can use any available part-of-speech tagger such as Brill's tagger (Brill 1995) for the languages in question.
With respect to chunker as illustrated by
The most common way of chunking (Tjong 2000) in the natural language processing field is to learn chunk boundaries from manually parsed training data. The acquisition of such data, however, is time-consuming.
In this invention, part-of-speech information and a handful of manually built chunking rules suffice to perform accurate chunking. For better performance, idioms can be used, which can be found with the aid of dictionaries or statistical methods. Syntactic chunking is performed on both source and target languages independently. Since the chunking is rule-based and the rules are written in a very simple form of regular expressions comprising part-of-speech tags and lexical items, it is easy to modify the rules depending on the language pair. Syntactic chunks are easily definable, as shown in
This method requires fewer resources and is easy to adapt to new language pairs. Chunk rules for each language may be developed independently. Ideally, however, they should take the other language of the pair into consideration in order to achieve superior chunk alignment. For instance, when one deals with English and Korean, the latter of which freely drops pronouns, one can add a chunk rule that combines pronouns and verbs in English so that a Korean verb without a pronoun has a better chance of aligning to an English chunk consisting of a verb and a pronoun. Multiple sets of chunking rules may be used to achieve better chunk alignment.
Generally, chunk rules are part-of-speech tag sequences, but they may also be mixed, comprising both part-of-speech tags and lexical items, or may even comprise lexical items only, to accommodate idioms as illustrated in
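The rule-based chunking described above can be illustrated with a minimal sketch. The three rules in `CHUNK_RULES`, the Penn-Treebank-style tags, and the `word/TAG` encoding are hypothetical choices for the example and are not the rule set of the invention. The `VP` rule shows the pronoun-plus-verb grouping mentioned for the English and Korean pair.

```python
import re

# Hypothetical rules: regular expressions over space-terminated "word/TAG " tokens.
CHUNK_RULES = [
    ("VP", r"(?:\S+/PRP )?(?:\S+/MD )?(?:\S+/VB[DGNPZ]? )+"),  # optional pronoun + verbs
    ("NP", r"(?:\S+/DT )?(?:\S+/JJ )*(?:\S+/NNS? )+"),
    ("PP", r"\S+/IN (?:\S+/DT )?(?:\S+/JJ )*(?:\S+/NNS? )+"),
]
COMPILED_RULES = [(label, re.compile(pattern)) for label, pattern in CHUNK_RULES]
SINGLE_TOKEN = re.compile(r"\S+ ")


def chunk(tagged_words):
    """Segment a POS-tagged sentence into flat (label, words) chunks.

    Rules are tried in order at each position; a token that no rule
    covers becomes its own chunk labeled "O".
    """
    text = "".join(f"{word}/{tag} " for word, tag in tagged_words)
    chunks, pos = [], 0
    while pos < len(text):
        for label, pattern in COMPILED_RULES:
            match = pattern.match(text, pos)
            if match:
                break
        else:
            label, match = "O", SINGLE_TOKEN.match(text, pos)
        words = [token.rsplit("/", 1)[0] for token in match.group().split()]
        chunks.append((label, words))
        pos = match.end()
    return chunks
```

Because each rule is a plain pattern over tags and words, adapting the chunker to another language pair in this sketch amounts to editing the rule list; a rule made only of lexical items (for an idiom) would use literal words in place of `\S+`.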
When there is no existing parallel corpus, and one has to build one from scratch, one can even build a parallel chunk corpus. As syntactic chunks are usually psychologically independent units of expression, one can generally translate them without context.
Referring to
Referring to
In IBM models 1-5 (Brown et al. 1993), the relationship between word alignment and lexical model is restricted to a one-to-one mapping, and only one specific model is used to estimate the parameters of the statistical translation model. In contrast to the IBM models, the approach of the present invention combines different lexicon model estimation approaches with different ML word alignments in each iteration of the model training. As a result, the system is more flexible in integrating the lexicon model and the word alignment during the recursive estimation, and thus can improve both the predictability and the precision of the estimated lexicon model and word alignment. Different probabilistic models are introduced in order to estimate the association between the source and target words. First, a maximum a posteriori (MAP) algorithm is introduced to estimate the word translation model, with the word occurrences in the parallel sentences used as a posteriori information. Furthermore, we estimate the lexicon model parameters from the marginal probabilities in the parallel sentence, in addition to the global information in the entire training corpus. By considering the local context information embedded in the parallel sentence, this approach increases the discriminative power of the learned lexical model and word alignment. As a result, this approach is capable of increasing the recall ratio of word alignment and the lexicon size without decreasing the alignment precision, which is especially important for applications with a limited parallel training corpus.
Referring to
where ΦL denotes the set of all possible alignment matrices subject to the lexical constraints. The conditional probability of a target sentence generated by a source sentence depends on the lexicon translation model. Lexicon translation probability can be modeled in numerous ways, e.g. using the source-target word co-occurrence frequency, context information from the parallel sentence, and the alignment constraints. During each iteration of the word alignment, the lexical translation probabilities for each sentence pair are re-estimated using the lexical model learned from previous iterations and the specific source-target word pairs occurring in the sentence.
Referring to
Referring to
One of the main problems of word alignment in other SMT systems is that many words are incorrectly left unaligned. In other words, the recall ratio of word alignment tends to be low. Chunk alignment, however, is able to mitigate this problem. Chunks are aligned if at least one word of a chunk in the source language is aligned to a word of a chunk in the target language. The underlying assumption is that chunk alignments are more nearly one-to-one than word alignments. In this way, many words that would not be aligned by the word alignment are included in chunk alignment, which in turn improves training for chunk translation. This improvement is possible because both target language sentences and source language sentences are independently pre-segmented in this invention. For a phrase-based SMT system such as the Alignment Template Model (Och et al. 2000), this kind of improvement is less feasible. The “phrases” of the Alignment Template Model are determined solely by the word alignment information, so the quality of the word alignment is more or less the only factor that determines the quality of the phrases found in that model.
Another major problem of word alignment is that a word is incorrectly aligned to another word. This low-precision problem is much harder to solve and potentially leads to greater degradation of translation quality. This invention overcomes this problem in part by adding a constraint that uses part-of-speech information to select the more confident alignment information. For instance, we can filter out certain word alignments if the parts of speech of the aligned words are incompatible. In this way, possible errors in word alignment are filtered out in chunk alignment.
Compared to word alignment, the one-to-one alignment ratio is high in chunk alignment (i.e. the fertility is lower), but there are cases in which one chunk is aligned to more than one chunk in the other language. To achieve a one-to-one chunk alignment, the preferred embodiment of the present invention allows chunks to be merged or split.
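The alignment criterion stated above (two chunks are aligned if at least one word of one is aligned to a word of the other) can be sketched as follows. The span representation and the function name `align_chunks` are assumptions made for the illustration.

```python
def align_chunks(src_spans, tgt_spans, word_links):
    """Project word alignment links onto chunk pairs.

    src_spans, tgt_spans: lists of (start, end) word-index ranges, end exclusive,
    one per chunk, covering the sentence.
    word_links: iterable of (source_word_index, target_word_index) pairs.
    Returns sorted (source_chunk_index, target_chunk_index) pairs.
    """
    def chunk_of_word(spans):
        return {i: k for k, (start, end) in enumerate(spans) for i in range(start, end)}

    src_chunk, tgt_chunk = chunk_of_word(src_spans), chunk_of_word(tgt_spans)
    # One linked word pair is enough to align the two chunks that contain it.
    return sorted({(src_chunk[i], tgt_chunk[j]) for i, j in word_links})
```

In this sketch, a single word link pulls every other word of the two chunks into the aligned pair, which is how words left unaligned at the word level still end up in the chunk-level training material.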
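A part-of-speech constraint of this kind might look like the following sketch. The coarse tag names and the contents of the compatibility table are hypothetical; the specification does not list which tag pairs are treated as compatible.

```python
# Hypothetical compatibility table over coarse (source_tag, target_tag) pairs.
COMPATIBLE_TAGS = {
    ("NOUN", "NOUN"),
    ("VERB", "VERB"),
    ("ADJ", "ADJ"),
    ("ADJ", "VERB"),  # assumed: adjectives in one language may surface as verbs in the other
}


def filter_word_links(word_links, src_tags, tgt_tags, compatible=COMPATIBLE_TAGS):
    """Keep only word links whose source and target tags are compatible."""
    return {
        (i, j)
        for i, j in word_links
        if (src_tags[i], tgt_tags[j]) in compatible
    }
```

The surviving links would then be the ones passed on to chunk alignment.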
Referring to
The direct chunk translation uses the direct chunk translation table 32, with probabilities constructed from the chunk alignment. The chunk translation probability is estimated from the co-occurrence frequency of the aligned source-target chunk pair and the frequency of the source chunk in the chunk alignment table. Direct chunk translation has the advantage of handling both word order problems within chunks and translation problems of non-compositional expressions, which covers many translation divergences (Dorr 2002). While the quality of direct chunk translation is very high, the coverage may be low. Several ways of chunking with different rules may be tested to construct a better direct chunk translation table that balances quality and coverage.
The second method is a statistical method 110, which is basically the same as other statistical methods except that the training is performed on the aligned chunks rather than the aligned sentences. As a result, training time is significantly reduced and more accurate parameters can be learned to produce better translation models 34. To make a more complete training corpus for chunk translation, we can use not only the aligned chunks but also statistical phrases generated by another phrase-based SMT system. One can also add the lexicon table from the first SMT training. The addition of the lexicon table significantly reduces the number of out-of-vocabulary (OOV) items.
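The probability estimate described above can be read as a relative frequency: the count of an aligned source-target chunk pair divided by the count of the source chunk. A minimal sketch under that reading follows; the function name `build_chunk_table` and the placeholder chunk strings in the usage note are assumptions for the example.

```python
from collections import Counter, defaultdict


def build_chunk_table(aligned_chunk_pairs):
    """Estimate P(target_chunk | source_chunk) by relative frequency.

    aligned_chunk_pairs: list of (source_chunk, target_chunk) strings,
    one entry per aligned occurrence in the chunk alignment output.
    """
    pair_counts = Counter(aligned_chunk_pairs)
    source_counts = Counter(source for source, _ in aligned_chunk_pairs)
    table = defaultdict(dict)
    for (source, target), count in pair_counts.items():
        table[source][target] = count / source_counts[source]
    return dict(table)
```

For instance, if a source chunk occurs four times, three times aligned to one target chunk and once to another, the table in this sketch assigns the two candidates probabilities 0.75 and 0.25.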
As shown in
Referring to
In contrast, syntactic chunks are syntactically meaningful units and they are useful to handle word order problems. Word order problems can be local, such as the relation between the head noun and its modifiers within a noun phrase, but more serious word order problems deal with long distance relationships, such as the order of subject, object and the verb in a sentence. These long distance word order problems become tractable when we shift the unit of reordering from words to syntactic chunks.
The “phrases” found by a phrase-based statistical machine translation model (Och et al. 2000) are bilingual word sequence pairs in which words are aligned with each other. As they are derived from word alignment, the phrase pairs are good translations of each other, but they are not good syntactic units. Hence, reordering using such phrases may not be as advantageous as reordering based on syntactic chunks.
For language pairs with very different word order, one can perform heuristic transformations that move chunks to another position, making the word order of one language more similar to that of the other in order to improve translation quality. For instance, English is an SVO (subject-verb-object) language, while Korean is an SOV (subject-object-verb) language. If the Korean noun phrases marked by the object marker are moved after the main verb, the transformed Korean sentences will be more similar to English in terms of word order.
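A heuristic transformation of this kind could be sketched as follows, under the assumption that the goal is to turn a subject-object-verb chunk sequence into a subject-verb-object one by placing object-marked chunks directly after the verb chunk that follows them. The chunk labels `NP-OBJ` and `VP` are hypothetical names for this illustration.

```python
def move_objects_after_verb(chunks, object_label="NP-OBJ", verb_label="VP"):
    """Reorder an SOV chunk sequence toward SVO order.

    chunks: list of (label, text) pairs. Object-marked chunks are held back
    and emitted right after the next verb chunk; all other chunks keep
    their relative order.
    """
    reordered, pending_objects = [], []
    for label, text in chunks:
        if label == object_label:
            pending_objects.append((label, text))
        elif label == verb_label:
            reordered.append((label, text))
            reordered.extend(pending_objects)
            pending_objects = []
        else:
            reordered.append((label, text))
    reordered.extend(pending_objects)  # objects with no following verb stay at the end
    return reordered
```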
In terms of reordering, the decoder need only consider permutations of chunks and not words, which is a more tractable problem.
In the preferred embodiment of the invention, chunk reordering is modeled as a combination of the traveling salesman problem (TSP) and a global search over the ordering of the target language chunks. The TSP is an optimization problem that seeks the lowest-cost path covering all the nodes in a directed graph under a defined cost function. For short chunks, we perform a global search for the optimally reordered chunks using target language model (LM) scores as the cost function. For long chunks, we use a TSP algorithm to search for a sub-optimal solution using LM scores as the cost function.
For chunk reordering, the LM score between contiguous chunks acts as the transitional cost between two chunks. The LM score is obtained through the log-linear interpolation of an n-gram based lexicon LM and an n-gram based chunk head LM. For example, a 3-gram LM with Good-Turing discounting is used as the target language LM. Due to the efficiency of the combined global search and TSP algorithm, a distortion model is not necessary to guide the search for optimal chunk reordering paths. The performance of reordering in this model is superior to word-based SMT not only in quality but also in speed, due to the reduction in search space.
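The reordering scheme above might be sketched as follows. Several details are assumptions made for the illustration: the cutoff `exhaustive_limit` is applied to the number of chunks, a greedy nearest-neighbor heuristic stands in for the unspecified TSP algorithm, the interpolation weight is arbitrary, and the caller supplies a `cost(previous_chunk, next_chunk)` function.

```python
import itertools


def interpolated_cost(word_logprob, head_logprob, weight=0.5):
    """Transition cost as the negated log-linear mix of two LM log-probabilities."""
    return -(weight * word_logprob + (1.0 - weight) * head_logprob)


def reorder(chunks, cost, exhaustive_limit=6, start="<s>"):
    """Order target chunks so that the summed transition cost is low.

    Few chunks: exhaustive search over all permutations (global optimum).
    Many chunks: greedy nearest-neighbor walk (sub-optimal TSP heuristic).
    """
    def path_cost(order):
        path = (start,) + tuple(order)
        return sum(cost(a, b) for a, b in zip(path, path[1:]))

    if len(chunks) <= exhaustive_limit:
        return list(min(itertools.permutations(chunks), key=path_cost))

    ordered, remaining, current = [], list(chunks), start
    while remaining:
        current = min(remaining, key=lambda c: cost(current, c))
        remaining.remove(current)
        ordered.append(current)
    return ordered
```

In practice `cost` would wrap `interpolated_cost` around the word LM and chunk-head LM scores for the boundary between two chunks. The exhaustive branch grows factorially with the number of chunks, which is why the sketch switches to a heuristic beyond a small limit.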
An embodiment of a decoder of this invention, as depicted in
Referring to
Referring to
Referring to
Referring to
The merged and normalized chunk segments are organized into a two-level chunk lattice in order to facilitate the re-ranking of source-target chunk pairs with multi-segmentation schemes, and the search algorithm. The first level of chunk lattice consists of source chunks starting at different positions in the source sentence. The second level of the lattice contains source chunks with the same starting position, and different ending positions in the source sentence, and their corresponding target chunks merged from different translation models.
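One possible in-memory layout for such a two-level lattice is sketched below: the first level is keyed by the start position of a source chunk, and the second level by its end position, with the candidate target chunks stored under each span. The tuple layout of the candidates and the sorting by score are assumptions made for the illustration.

```python
from collections import defaultdict


def build_chunk_lattice(candidates):
    """Organize translation candidates into a two-level lattice.

    candidates: iterable of (start, end, target_chunk, score) tuples, where
    (start, end) is the source span and the score comes from whichever
    translation model proposed the candidate.
    Returns {start: {end: [(target_chunk, score), ...]}} with each candidate
    list sorted from highest to lowest score.
    """
    lattice = defaultdict(lambda: defaultdict(list))
    for start, end, target_chunk, score in candidates:
        lattice[start][end].append((target_chunk, score))
    for spans_by_end in lattice.values():
        for options in spans_by_end.values():
            options.sort(key=lambda option: option[1], reverse=True)
    return {start: dict(spans_by_end) for start, spans_by_end in lattice.items()}
```

With this layout, a search procedure can enumerate alternative segmentations by following, from each start position, every available end position to the next start position.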
Referring to
Referring to
Referring to
Referring to
Referring to
While the present invention has been described with reference to certain preferred embodiments, it is to be understood that the present invention is not limited to such specific embodiments. Rather, it is the inventor's contention that the invention be understood and construed in its broadest meaning as reflected by the following claims. Thus, these claims are to be understood as incorporating not only the preferred embodiments described herein but also all those other and further alterations and modifications as would be apparent to those of ordinary skill in the art.