This section provides background information related to the present disclosure which is not necessarily prior art.
Statistical machine translation (SMT) generally utilizes statistical models to provide a translation from a source language to a target language. One type of SMT is phrase-based statistical machine translation. Phrase-based SMT can map sets of words (phrases) from a source language to a target language. Phrase-based SMT may rely on lexical information, e.g., the surface form of the words. The source language and the target language, however, may have significant lexical differences, such as when one of the languages is morphologically-rich.
A computer-implemented technique is presented. The technique can include receiving, at a computing system including one or more processors, a translation model including a plurality of pairs of phrases, each of the plurality of pairs of phrases including a first phrase of one or more words in a first language and a second phrase of one or more words in a second language, wherein a specific first phrase is aligned with a specific second phrase for a specific pair of phrases. The technique can include determining, at the computing system, one or more features for each of the plurality of pairs of phrases based on linguistic differences between the first and second languages to obtain a plurality of sets of features. The technique can include associating, at the computing system, the plurality of sets of features with the plurality of pairs of phrases, respectively, to obtain a modified translation model. The technique can also include performing, at the computing system, statistical machine translation from the first language to the second language using the modified translation model.
In some embodiments, the modified translation model has lexical and inflectional agreement for each of the plurality of pairs of phrases.
In other embodiments, one of the first and second languages is a morphologically-rich language.
In some embodiments, the morphologically-rich language is characterized by morphological processes that produce a large number of word forms for a given root word.
In other embodiments, the morphologically-rich language is a synthetic language.
In some embodiments, one of the first and second languages is a non-morphologically-rich language.
In other embodiments, the non-morphologically-rich language is an isolating language or an analytic language.
In some embodiments, the one or more features include at least one of parts of speech features and dependency features.
In other embodiments, the parts of speech features include at least one of personal pronouns, possessive pronouns, determiners, particles, and wh-nouns.
In some embodiments, the dependency features include syntactic divergences including at least one of word order, verb-less phrases, and possessiveness.
In other embodiments, the dependency features include lexical divergences including at least one of idiomatic expressions, prepositions, named entities, ambiguity, and alignment difficulties.
In some embodiments, performing the statistical machine translation using the modified translation model further includes: receiving, at the computing system, one or more words in the first language; generating, at the computing system, one or more potential translations of the one or more words using the modified translation model, the one or more potential translations having one or more probability scores, respectively; selecting, at the computing system, one of the one or more potential translations based on the one or more probability scores to obtain a selected translation; and outputting, at the computing system, the selected translation.
Another computer-implemented technique is also presented. The technique can include receiving, at a computing system including one or more processors, a translation model configured for translation between a first language and a second language. The technique can include receiving, at the computing system, a plurality of pairs of phrases, each of the plurality of pairs of phrases including a first phrase of one or more words in the first language and a second phrase of one or more words in the second language, wherein a specific first phrase is aligned with a specific second phrase for a specific pair of phrases. The technique can include receiving, at the computing system, a source phrase for translation from the first language to the second language. The technique can include determining, at the computing system, a translated phrase based on the source phrase using the translation model. The technique can include determining, at the computing system, a selected second phrase from the plurality of pairs of phrases based on the translated phrase. The technique can include predicting, at the computing system, one or more features for each word of the translated phrase based on the selected second phrase, a selected first phrase associated with the selected second phrase, and linguistic differences between the first and second languages. The technique can include modifying, at the computing system, the translated phrase based on the one or more features to obtain a modified translated phrase. The technique can also include outputting, from the computing system, the modified translated phrase.
In some embodiments, the translated phrase has lexical and inflectional agreement with the source phrase.
In other embodiments, the one or more features include at least one of lemma, parts of speech features, morphological features, and dependency features.
In some embodiments, predicting the one or more features for each word in the translated phrase further includes determining, at the computing system, at least one of the lemma, the parts of speech features, and the morphological features for each word in the selected first phrase and for each word in the selected second phrase.
In other embodiments, predicting the one or more features for each word in the translated phrase further includes projecting, at the computing system, dependency relations from the selected first phrase to the selected second phrase based on an alignment between the selected first phrase and the selected second phrase and the parts of speech features of both the selected first phrase and the selected second phrase.
In some embodiments, the parts of speech features include at least one of personal pronouns, possessive pronouns, determiners, particles, and wh-nouns.
In other embodiments, the dependency features include syntactic divergences including at least one of word order, verb-less phrases, and possessiveness.
In some embodiments, the dependency features include lexical divergences including at least one of idiomatic expressions, prepositions, named entities, ambiguity, and alignment difficulties.
In other embodiments, one of the first and second languages is a morphologically-rich language.
In some embodiments, the morphologically-rich language is characterized by morphological processes that produce a large number of word forms for a given root word.
In other embodiments, the morphologically-rich language is a synthetic language.
In some embodiments, one of the first and second languages is a non-morphologically-rich language.
In other embodiments, the non-morphologically-rich language is an isolating language or an analytic language.
A system is also presented. The system can include one or more computing devices configured to perform operations. The operations can include receiving a translation model including a plurality of pairs of phrases, each of the plurality of pairs of phrases including a first phrase of one or more words in a first language and a second phrase of one or more words in a second language, wherein a specific first phrase is aligned with a specific second phrase for a specific pair of phrases. The operations can include determining one or more features for each of the plurality of pairs of phrases based on linguistic differences between the first and second languages to obtain a plurality of sets of features. The operations can include associating the plurality of sets of features with the plurality of pairs of phrases, respectively, to obtain a modified translation model. The operations can also include performing statistical machine translation from the first language to the second language using the modified translation model.
In some embodiments, the modified translation model has lexical and inflectional agreement for each of the plurality of pairs of phrases.
In other embodiments, one of the first and second languages is a morphologically-rich language.
In some embodiments, the morphologically-rich language is characterized by morphological processes that produce a large number of word forms for a given root word.
In other embodiments, the morphologically-rich language is a synthetic language.
In some embodiments, one of the first and second languages is a non-morphologically-rich language.
In other embodiments, the non-morphologically-rich language is an isolating language or an analytic language.
In some embodiments, the one or more features include at least one of parts of speech features and dependency features.
In other embodiments, the parts of speech features include at least one of personal pronouns, possessive pronouns, determiners, particles, and wh-nouns.
In some embodiments, the dependency features include syntactic divergences including at least one of word order, verb-less phrases, and possessiveness.
In other embodiments, the dependency features include lexical divergences including at least one of idiomatic expressions, prepositions, named entities, ambiguity, and alignment difficulties.
In some embodiments, the operation of performing the statistical machine translation using the modified translation model further includes: receiving one or more words in the first language; generating one or more potential translations of the one or more words using the modified translation model, the one or more potential translations having one or more probability scores, respectively; selecting one of the one or more potential translations based on the one or more probability scores to obtain a selected translation; and outputting the selected translation.
Another system is also presented. The system can include one or more computing devices configured to perform operations. The operations can include receiving a translation model configured for translation between a first language and a second language. The operations can include receiving a plurality of pairs of phrases, each of the plurality of pairs of phrases including a first phrase of one or more words in the first language and a second phrase of one or more words in the second language, wherein a specific first phrase is aligned with a specific second phrase for a specific pair of phrases. The operations can include receiving a source phrase for translation from the first language to the second language. The operations can include determining a translated phrase based on the source phrase using the translation model. The operations can include determining a selected second phrase from the plurality of pairs of phrases based on the translated phrase. The operations can include predicting one or more features for each word of the translated phrase based on the selected second phrase, a selected first phrase associated with the selected second phrase, and linguistic differences between the first and second languages. The operations can include modifying the translated phrase based on the one or more features to obtain a modified translated phrase. The operations can also include outputting the modified translated phrase.
In some embodiments, the translated phrase has lexical and inflectional agreement with the source phrase.
In other embodiments, the one or more features include at least one of lemma, parts of speech features, morphological features, and dependency features.
In some embodiments, the operation of predicting the one or more features for each word in the translated phrase further includes determining at least one of the lemma, the parts of speech features, and the morphological features for each word in the selected first phrase and for each word in the selected second phrase.
In other embodiments, the operation of predicting the one or more features for each word in the translated phrase further includes projecting dependency relations from the selected first phrase to the selected second phrase based on an alignment between the selected first phrase and the selected second phrase and the parts of speech features of both the selected first phrase and the selected second phrase.
In some embodiments, the parts of speech features include at least one of personal pronouns, possessive pronouns, determiners, particles, and wh-nouns.
In other embodiments, the dependency features include syntactic divergences including at least one of word order, verb-less phrases, and possessiveness.
In some embodiments, the dependency features include lexical divergences including at least one of idiomatic expressions, prepositions, named entities, ambiguity, and alignment difficulties.
In other embodiments, one of the first and second languages is a morphologically-rich language.
In some embodiments, the morphologically-rich language is characterized by morphological processes that produce a large number of word forms for a given root word.
In other embodiments, the morphologically-rich language is a synthetic language.
In some embodiments, one of the first and second languages is a non-morphologically-rich language.
In other embodiments, the non-morphologically-rich language is an isolating language or an analytic language.
The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
The drawings described herein are for illustrative purposes only of selected embodiments and not all possible implementations, and are not intended to limit the scope of the present disclosure.
Corresponding reference numerals indicate corresponding parts throughout the several views of the drawings.
Statistical Machine Translation (SMT) is a method for translating text from a source language to a target language. SMT, however, can provide inaccurate or imprecise translations when compared to high quality human translation, especially when one of the two languages is morphologically rich. A morphologically-rich language can be characterized by morphological processes that produce a large number of word forms for a given root word. For example, a morphologically-rich language may be a synthetic language. Alternatively, for example, a non-morphologically-rich language may be an isolating language or an analytic language. Also, the greater the lexical and syntactic divergences between the two languages, the greater the need for incorporating linguistic information in the translation process.
Because Arabic is a polysynthetic language in which every word can have many attachments, segmentation of Arabic words is expected to improve translation from or to English, which is an isolating language (i.e., each unit of meaning is represented by a separate word). Segmentation also alleviates the sparsity problem of morphologically rich languages. Segmentation has been considered for Arabic-English translation because this direction does not require post-processing to connect the Arabic segments into whole words. Additionally, segmentation of Arabic can help achieve better translation quality.
Tokenization and normalization of Arabic data may be utilized to improve SMT. Post processing may be needed to reconnect the morphemes into valid Arabic words, e.g., to de-tokenize and de-normalize the previously tokenized and normalized surface forms. Training SMT models on Arabic lemmas instead of surface forms may increase translation quality.
To help with syntactic divergences, specifically word order, reordering of the source sentence as preprocessing can be used. Because word order in Arabic is different from English, reordering may help alignment as well as the order of words in the target sentence.
Another reordering approach focuses on verb-subject constructions for Arabic-to-English SMT. This approach may differentiate between main clauses and subordinate clauses and apply different rules for each case. Reordering based on automatic learning can have the advantage of being language independent. Some techniques try to decrease the divergences between the two languages through preprocessing and post-processing to make the two languages more similar. Other techniques incorporate linguistic information in the translation and language models. One technique uses lexical, shallow syntactic, syntactic, and positional context features. Adding context features may help with disambiguation as well as with other specification problems, such as choosing whether a noun should be accusative, dative, or genitive in German. The features can be added to a log-linear translation model, and Minimum Error Rate Training (MERT) can then be performed to estimate the mixture coefficients.
Source-side contextual features have been considered in which grammatical dependency relations are incorporated in Phrase-Based SMT (PB-SMT) as a number of features. Other efforts toward improving target phrase selection include applying source-similarity features in PB-SMT. In some techniques, the source sentences are maintained along with the phrase pairs in the phrase table. In translating a source sentence, the similarity between the source sentence and the sentences from which the phrase pairs were extracted is considered as a feature in the log-linear translation model.
Language models that rely on morphological features in addition to lexical features were developed to overcome sparsity as well as inflectional agreement errors. The sparsity problem impacts SMT both in the bilingual translation model and in the language model used. Because Arabic is morphologically rich, e.g., most base words are inflected by adding affixes that indicate gender, case, tense, number, and other features, its vocabulary is very large. This can lead to incorrect language model probability estimation because of the sparsity and the high Out-of-Vocabulary (OOV) rate. A joint morphological-lexical language model (JMLLM) can combine the lexical information with the information extracted from a morphological analyzer. Predicting the correct inflection of words in morphologically rich languages has been performed on Russian and Arabic SMT outputs, including by applying a Maximum Entropy Markov Model (k-MEMM), a structured prediction model where the current prediction is conditioned on the previous k predictions.
Integrating other sources of linguistic information, such as morphological or syntactic information, in the translation process can provide improvements, especially for translation between language pairs with large typological divergences. English-Arabic is an example of such a language pair. While Arabic-English translation is also difficult, it does not require the generation of rich morphology at the target side. Translation from English to Arabic is one focus of this specification; however, the techniques used are also useful for translation into other morphologically rich languages.
One goal of the techniques described in this specification is to solve some of the problems resulting from the large gap between English and Arabic. This specification introduces two techniques for improving English-Arabic translation through incorporating morphology and syntax in SMT. The first part applies changes to the statistical translation model, while the second part is post-processing. In the first part, adding syntactic features to the phrase table is described. The syntactic features are based on the English syntactic information and include part-of-speech (POS) and dependency features. POS features are suggested to penalize phrases that include English words that do not correspond to Arabic words. These phrases are sources of error because they usually translate to Arabic words with a different meaning or even a different POS tag. An example is a phrase containing the English word "the", which should map in Arabic to the noun prefix "Al" and never appear as a separate word. This applies, for example, in scenarios in which the Arabic segmentation used does not separate the "Al" from nouns; in general, the choice of POS features can depend on the Arabic segmentation used.
The techniques described in this specification are motivated at least in part by the structural and morphological divergences between the two languages. Two reasons behind adding these syntactic features are the complex affixation to Arabic words as well as the lexical and inflectional agreement.
Dependency features are features that rely on the syntactic dependency parse tree of the sentences from which a certain phrase was extracted. These features are suggested because they can solve a number of error categories, the main two of which are lexical agreement and inflectional morphological agreement. An example of lexical agreement is phrasal verbs where a verb takes a specific preposition to convey a specific meaning. When the verb and the preposition are in separate phrases, they are less likely to translate correctly. However, selecting a phrase containing both words in the same phrase may increase the likelihood of their lexical agreement.
Inflectional agreement is a syntactic-morphological feature of the Arabic language. Some words should have morphological agreement with other words in the sentence, e.g., an adjective should morphologically agree with the noun it modifies in gender, number, etc. Morphological agreement also applies to other related words such as verbs and their subjects, words connected by conjunction and others. To increase the likelihood of correct inflectional agreement of two syntactically related words, a phrase containing both words should be selected by the decoder. This increases the likelihood of their agreement since phrases are extracted from morphologically correct training sentences. The weights of the added features are then evaluated automatically using the Minimum Error Rate Training (MERT) algorithm. The results show an improvement in the automatic evaluation score BLEU over the baseline system.
The second part of the specification introduces a post-processing framework for fixing inflectional agreement in MT output. In particular, the present specification focuses on specific constructions, e.g., morphological agreement between syntactically dependent words. The framework is also a probabilistic framework which models each syntactically extracted morphological agreement relation separately. Also, the framework predicts each feature such as gender, number, etc. separately instead of predicting the surface forms, which decreases the complexity of the system and allows training with smaller corpora. The predicted features along with the lemmas are then passed to the morphology generation module which generates the correct inflections.
In contrast to the first part of the specification, which aims at improving morphology by adding features and thus modifying the main pipeline of SMT, the second part introduces a probabilistic framework for morphological generation incorporating syntactic, morphological and lexical information sources through post-processing. While dependency features also aim at solving inflectional agreement, they may have limitations that can be overcome by post-processing. First, dependency features are added for words which are at small distances in the sentence. This is because phrase based SMT systems may limit the length of phrases. Related words separated by more than the maximum phrase length are not helped. Second, phrases that contain related words could be absent from the phrase table because they were not in the training data or were filtered out because they were not frequent enough. Finally, other features that have more weight than the dependency features could lead to selecting other phrases.
Using the decoder of the baseline system, the component that can motivate selecting the correctly inflected words is the N-gram language model. For example, 3- or 4-gram language models may be used, which means agreement between close words can be captured. The language model can fix the agreement issues where:
If both conditions apply, the correct inflected form of a word can be generated if the agreement relation is with a close word. However, the above two conditions may be difficult to apply, for example, because of the following reasons:
Therefore, a different approach to solving agreement issues in SMT through post processing is described. The approach can avoid the above problems because it:
One embodiment of the described subject matter improves the inflectional agreement of the Arabic translation output as proven by the automatic and human evaluations.
A log linear statistical machine translation system can be used. The log linear approach to SMT uses the maximum-entropy framework to search for the best translation into a target language of a source language text given the following decision rule:

$\hat{e}_1^{I} = \operatorname{argmax}_{e_1^{I}} \sum_{m=1}^{M} \lambda_m h_m(e_1^{I}, f_1^{J})$  (1)

where $e_1^{I} = e_1, e_2, \ldots, e_I$ is the best translation for the input foreign language text string, e.g., sentence, $f_1^{J} = f_1, f_2, \ldots, f_J$, and the $h_m(e_1^{I}, f_1^{J})$ are the feature functions used, including, for example, translation and language model probabilities. For example, $e_1^{I}$ may be the translated English output target language text string for an input source language text string in Arabic. The unknown parameters $\lambda_1^{M}$ are the weights of the feature functions and are estimated using development data as will be discussed below.
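For illustration only, the decision rule can be sketched in a few lines of Python; the candidate set, feature functions, and weights below are hypothetical placeholders rather than components of any particular SMT system.

```python
def best_translation(candidates, feature_fns, weights, f):
    """Return the candidate e maximizing sum_m lambda_m * h_m(e, f),
    i.e., the decision rule of equation 1 over an explicit candidate set.
    """
    def score(e):
        return sum(lam * h(e, f) for lam, h in zip(weights, feature_fns))
    return max(candidates, key=score)
```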
Training the translation model starts by aligning the words of the sentence pairs in the training data using, for example, one of the IBM translation models. To move from word-level translation systems to phrase-based systems which can capture context more powerfully, a phrase extraction algorithm can be used. Subsequently, feature functions could be defined at the phrase level.
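For illustration, a minimal sketch of consistency-based phrase extraction follows, assuming the word alignment is given as a set of (source index, target index) pairs; it omits the expansion over unaligned boundary words that full implementations perform.

```python
def extract_phrases(n_src, alignment, max_len=7):
    """Extract phrase pairs whose alignment links stay inside the pair.

    alignment: set of (i, j) links between source and target positions.
    """
    phrases = []
    for i1 in range(n_src):
        for i2 in range(i1, min(i1 + max_len, n_src)):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: no word in [j1, j2] may align outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2)
                   for (i, j) in alignment):
                continue
            phrases.append(((i1, i2), (j1, j2)))
    return phrases
```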
Using a translation model which translates a source-language sentence f into a target-language sentence e through maximizing a linear combination of features and weights allows easily extending it by defining new feature functions.
For every sentence pair $(e_1^{I}, f_1^{J})$, the Viterbi alignment is the alignment $\hat{a}_1^{J}$ such that:

$\hat{a}_1^{J} = \operatorname{argmax}_{a_1^{J}} \, p(f_1^{J}, a_1^{J} \mid e_1^{I})$

where $a_j$ is the index of the word in $e_1^{I}$ to which $f_j$ is aligned.
Word alignments can be calculated using GIZA++, which uses the IBM models 1 to 5 and the Hidden Markov Model (HMM) alignment model, all of which do not permit a source word to align to multiple target words. GIZA++ allows many-to-many alignments by combining the Viterbi alignments of both directions: source-to-target and target-to-source using some heuristics.
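The combination of the two directional alignments can be sketched as follows; this is a simplified rendering in the spirit of the grow-diag family of heuristics, not the exact procedure of GIZA++.

```python
NEIGHBORS = [(-1, 0), (1, 0), (0, -1), (0, 1),
             (-1, -1), (-1, 1), (1, -1), (1, 1)]

def symmetrize(src2tgt, tgt2src):
    """src2tgt, tgt2src: sets of (i, j) Viterbi links from each direction."""
    alignment = src2tgt & tgt2src          # start from the intersection
    union = src2tgt | tgt2src
    grew = True
    while grew:                            # grow with adjacent union links
        grew = False
        for (i, j) in sorted(union - alignment):
            i_free = all(ii != i for (ii, _) in alignment)
            j_free = all(jj != j for (_, jj) in alignment)
            adjacent = any((i + di, j + dj) in alignment
                           for di, dj in NEIGHBORS)
            if (i_free or j_free) and adjacent:
                alignment.add((i, j))
                grew = True
    return alignment
```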
After alignment and phrase extraction, the phrases and associated features are stored in a phrase table. Given an aligned phrase pair consisting of a source-language phrase and a target-language phrase, phrase-level translation probabilities can be estimated from the co-occurrence counts of the pair in the training data.
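Equations 3 and 4 are not reproduced above; a plausible reading, assuming the conventional relative-frequency estimates over the extracted phrase pairs, is:

```latex
% Assumed forms of equations (3) and (4): relative-frequency
% phrase translation probabilities (the originals are not reproduced).
\phi(\bar{f} \mid \bar{e}) =
  \frac{\mathrm{count}(\bar{f}, \bar{e})}{\sum_{\bar{f}'} \mathrm{count}(\bar{f}', \bar{e})}
\qquad (3)
\qquad
\phi(\bar{e} \mid \bar{f}) =
  \frac{\mathrm{count}(\bar{f}, \bar{e})}{\sum_{\bar{e}'} \mathrm{count}(\bar{f}, \bar{e}')}
\qquad (4)
```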
The log linear model is very easy to extend by adding new features. After adding new features, the feature weights need to be calculated using MERT, which is described below.
For log linear SMT translation systems, the output translation is governed by equation 1. In such systems, the best translation is the one that maximizes the linear combination of weighted features. Equation 5 shows a model using the feature functions discussed above in addition to a feature function representing the language model probability:

$\hat{e}_1^{I} = \operatorname{argmax}_{e_1^{I}} \left[ \lambda_1 \log p(f_1^{J} \mid e_1^{I}) + \lambda_2 \log p(e_1^{I} \mid f_1^{J}) + \lambda_3 \log p_{LM}(e_1^{I}) \right]$  (5)
where the translation and inverse translation probabilities are calculated as the multiplication of the separate phrase probabilities shown in equations 3 and 4, respectively. The third feature function is the language model probability of the output sentence.
These weights $\lambda_1^{M}$ can be calculated using, for example, gradient descent to maximize the likelihood of the data according to the following equation:

$\hat{\lambda}_1^{M} = \operatorname{argmax}_{\lambda_1^{M}} \sum_{s=1}^{S} \log p_{\lambda_1^{M}}(e_s \mid f_s)$
using a parallel training corpus consisting of S sentence pairs. This method corresponds to maximizing the likelihood of the training data, but it does not maximize translation quality for unseen data. Therefore, Minimum Error Rate Training (MERT) is used, with a different objective function that takes translation quality into account by using automatic evaluation metrics such as the BLEU score. MERT aims at optimizing the following equation instead:

$\hat{\lambda}_1^{M} = \operatorname{argmin}_{\lambda_1^{M}} \sum_{s=1}^{S} E(r_s, \hat{e}(f_s; \lambda_1^{M}))$
where $E(r_s, e)$ is the result of computing a score based on an automatic evaluation metric, e.g., BLEU, and $\hat{e}(f_s; \lambda_1^{M})$ is the best output translation according to equation 1.
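MERT's exact line search is beyond the scope of a short example, but the objective can be illustrated with a toy random-search stand-in; decode and corpus_error are assumed helper functions, not part of any MERT implementation.

```python
import random

def tune_weights(dev_set, decode, corpus_error, n_samples=200, dim=3):
    """Toy stand-in for MERT: sample weight vectors and keep the one
    minimizing the corpus-level error E(r, e) on development data.

    decode(f, weights) and corpus_error(refs, hyps) are assumed helpers;
    real MERT performs an exact line search per weight instead.
    """
    best_w, best_err = None, float("inf")
    refs = [r for (_, r) in dev_set]
    for _ in range(n_samples):
        w = [random.uniform(-1.0, 1.0) for _ in range(dim)]
        hyps = [decode(f, w) for (f, _) in dev_set]
        err = corpus_error(refs, hyps)
        if err < best_err:
            best_w, best_err = w, err
    return best_w
```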
Arabic morphology is complex compared to English morphology. Similar to English, Arabic morphology has two functions: derivation and inflection, both of which are discussed below. In addition, there are two different types of morphology, i.e., two different ways of applying changes to the stem or the base word: templatic morphology and affixational morphology. The functions and types of morphology are discussed in this section. As will be shown, Arabic affixational morphology is the most complex and accounts for the majority of English-Arabic translation problems.
Derivational Morphology is about creating words from other words (root or stem) while the core meaning is changed. An example from English is creating “writer” from “write”. Similarly, generating kAtb “writer” from ktb “to write” is an example of derivational morphology in Arabic.
Inflectional Morphology is about creating words from other words (root or stem) while the core meaning remains unchanged. An example from English is inflecting the verb "write" to "writes" in the case of the third person singular. Another example is generating the plural of a noun, e.g., "writers" from "writer". An example in Arabic is generating AlbnAt "the girls" from Albnt "the girl". Arabic inflectional morphology is much more complex than English inflectional morphology. English nouns are inflected for number (singular/plural) and verbs are inflected for number and tense (singular/plural, present/past/past-participle). English adjectives are not inflected. In contrast, Arabic nouns and adjectives are inflected for gender (feminine/masculine), number (singular/dual/plural), and state (definite/indefinite). Arabic verbs are also inflected for gender and number, in addition to tense (command/imperfective/perfective), voice (active/passive), and person (1st/2nd/3rd).
Templatic Morphology
In Arabic, the root morpheme consists of three, four, or five consonants. Every root has an abstract meaning that is shared by all its derivatives. According to known templates, the root is modified by adding additional vowels and consonants in a specific order at certain positions to generate different words. For example, the word kAtb "writer" is derived from the root ktb "to write" by adding alef "A" between the first and second letters of the root.
Affixational Morphology
This morphological type is common in most languages. It is about creating a new word from other words (roots or stems) by adding affixes: prefixes and suffixes. Affixes added to Arabic base-words are either inflectional markings or attachable clitics. Assuming inflectional markings are included in the BASE WORD, attachable clitics in Arabic follow a strict order, as in: [cnj+[prt+[art+BASE WORD+pro]]]
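The strict clitic order can be made concrete with a toy greedy splitter over Buckwalter transliterations; the clitic inventories and length guards below are illustrative assumptions, not a real Arabic segmenter.

```python
# Toy illustration of the clitic order [cnj+[prt+[art+BASE WORD+pro]]].
CNJ = ("w", "f")                         # conjunctions, e.g., w+ "and"
PRT = ("s", "b", "l", "k")               # particles, e.g., s+ (future marker)
PRO = ("hA", "hm", "km", "h", "k", "y")  # pronominal suffixes

def strip_clitics(word):
    """Greedily peel clitics off a word in the prescribed order."""
    cnj = prt = art = pro = ""
    for c in CNJ:
        if word.startswith(c) and len(word) > len(c) + 2:
            cnj, word = c, word[len(c):]
            break
    for p in PRT:
        if word.startswith(p) and len(word) > len(p) + 2:
            prt, word = p, word[len(p):]
            break
    if word.startswith("Al") and len(word) > 4:
        art, word = "Al", word[2:]
    for s in PRO:
        if word.endswith(s) and len(word) > len(s) + 2:
            pro, word = s, word[:-len(s)]
            break
    return cnj, prt, art, word, pro

# strip_clitics("wsyEtyhA") -> ("w", "s", "", "yEty", "hA"):
# w+ "and", s+ future marker, base word, +hA "her/it".
```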
Prefixes include:
By contrast, English affixational morphology is simpler because there are no clitics attached to the words. Affixational morphology in English is used for both inflection and derivation. Examples include:
English words do not have attachable clitics like Arabic. Examples I and II illustrate how one Arabic word can correspond to five and four English words respectively.
wsyEtyhA "and he will give her"
As shown above, Arabic affixational and inflectional morphology are very complex, especially compared to English. The complex morphology is the main reason behind the problems of sparsity, agreement, and lexical divergences, as will be explained in more detail below.
In Arabic, there are rules that govern the inflection of words according to their relations with other words in the sentence. These rules are referred to throughout the specification as "agreement rules." Agreement rules can involve information such as the grammatical type of the words, e.g., Part-of-Speech (POS) tags, the relationship type between the words, and other specific variables. In this section, a few agreement rules are briefly explained as examples.
Verbs should agree morphologically with their subjects. Verbs that follow the subject (in SVO order) agree with the subject in number and gender (see example III). On the other hand, verbs that precede the subject (VSO) agree with the subject in gender while having the singular 3rd person inflection (see example IV).
Adjectives always follow their noun in definiteness, gender, and number. There are many other factors that add more rules, for example whether the adjective is describing a human or an object, whether the noun is plural but not in the regular plural form (e.g., broken plural), etc. Example V shows how the adjective "polite" follows the noun "sisters" in being definite, feminine, and plural. This is an example where the noun refers to humans and is in the regular feminine plural form.
Example VI shows how the adjective polite follows the noun in being definite, masculine and plural. In this example, the noun is in the masculine broken plural form.
In example VII, the adjective follows the noun in definiteness; however, the adjective is feminine and singular, while the noun is masculine and plural. This is because the noun is a broken plural representing more than one object (e.g., books).
If a noun is modified by a number, the noun is inflected differently according to the value of the number. For example, if the number is 1, the noun will be singular. If the number is 2, the noun should have the dual inflection. For numbers 3-10, the noun should be plural. For numbers 11 and above, the noun should be singular.
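The number rule can be restated compactly; the function below is a direct transcription of the rule as described, not a complete treatment of Arabic numeral syntax.

```python
def noun_number_after_numeral(n):
    """Noun inflection required after a cardinal number: singular for 1,
    dual for 2, plural for 3-10, and singular again for 11 and above."""
    if n == 1:
        return "singular"
    if n == 2:
        return "dual"
    if 3 <= n <= 10:
        return "plural"
    return "singular"
```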
Words that have conjunction relations always agree with each other.
There are other cases of agreement. For example, demonstrative and relative pronouns should agree with the nouns that they co-refer to.
Sparsity is a result of Arabic's complex inflectional morphology and the various attachable clitics. In some implementations, while the number of Arabic words in a parallel corpus is 20% less than the number of English words, the number of unique Arabic words can be over 50% more than the number of unique English words.
Sparsity causes many errors in SMT output in a number of ways:
Word Order
Subjects
The main sentence structure in Arabic is Verb-Subject-Object (VSO), while the English sentence structure is Subject-Verb-Object (SVO). The order SVO also occurs in Arabic but is less frequent. Therefore, subjects can be pre-verbal or post-verbal. Additionally, subjects can be pro-dropped, i.e., subject pronouns do not need to be expressed because they are inferable from the verb conjugation. Example VIII shows a case of a pro-dropped subject. The subject is a masculine third-person pronoun that is dropped because it can be inferred from the verb inflection.
Adjectives
In Arabic, adjectives follow the nouns that they modify as opposed to English where the adjectives precede the nouns. Example IX shows the order of nouns and adjectives.
Verb-Less Sentences
In Arabic, verb-less sentences are nominal sentences which have no verbs. They usually exhibit the zero copula phenomenon, i.e., the noun and the predicate are joined without overt marking. These sentences are usually mapped to affirmative English sentences containing the copular verb "to be" in the present tense. Example X shows an example of a nominal sentence.
One possible problem that can result from this syntactic divergence is when none of the three phrases “The weather is wonderful”, “The weather is”, or “is wonderful” exists in the phrase table, in which case “is” would be translated separately to an incorrect word. This results from the bad alignment of the word “is” to other Arabic words during training.
Possessiveness
The Arabic equivalent of possessiveness between nouns and of the of-relationship is called idafa. The idafa construct is achieved by having the word indicating the possessed entity precede a definite form of the possessing entity. Refer to example XI for an illustration of this construct.
Lexical Divergences
Lexical divergences are the differences between the two languages at the lexical level. They result in translation problems, some of which are discussed in this section.
Idiomatic Expressions
Idiomatic expressions are usually translated incorrectly because they are translated as separate words. Mapping each word or a few words to their corresponding meaning in Arabic usually results in a meaningless translation, or at least a translation whose meaning does not correctly match the English expression. Examples XII and XIII illustrate the problem.
Prepositions
Verbs that take prepositions cause problems in translation. Translating the verb alone to an Arabic verb and the preposition to a corresponding Arabic preposition usually results in errors. The same applies to prepositions needed by nouns. In example XIV, although “meeting” is translated correctly to its corresponding Arabic word, the direct translation of the preposition leads to a wrong phrase.
Named Entities
Named entities cause a problem in translation. Translating named entities word-by-word results in wrong Arabic output.
Ambiguity
Differences between the two languages sometimes cause translation ambiguity errors. For example, the word "just" can translate in Arabic to EAdl, as in "a just judge". It can also translate to fqt, meaning "only". Therefore, sense disambiguation is required to achieve high quality translations.
Alignment Difficulties
Direct mapping of English words to Arabic words is not possible because of the lexical, morphological and grammatical differences. During alignment, this problem generates errors that are transferred to the phrase table. Some examples include:
Error Analysis Summary
Manual error analysis can be performed, in one example, on a small sample of 30 sentences which were translated using a state-of-the-art phrase-based translation system. Despite the small sample size, most errors described appeared in the output sentences. Morphological, syntactic and lexical divergences contributed to the errors. These divergences make the alignment of sentences from both languages very difficult and consequently result in problems in phrase extraction and mapping. Therefore, errors in the phrase table were very common.
Phrase table errors can directly lead to errors in the final translation output. They can result, for example, in missing or additional clitics in Arabic words and sometimes extra Arabic words. In addition, it is very common for English verbs to map to Arabic nouns in the phrase table, which results in problems in the final grammatical structure of the output sentence. Ambiguity is also a phrase table problem because the phrase table is based on surface forms, without taking context into consideration. Seventeen sentences out of the thirty had errors because of these phrase table problems.
Morphological agreement is a major problem in the Arabic output. The main problems are the agreement of the adjective with the noun and the agreement of the verb with the subject. Nine sentences had problems with adjective-noun agreement, while two had problems with verb-subject agreement.
Named Entities and acronyms which were translated directly resulted in errors in nine sentences.
POS Features
In general, most POS features are added to penalize the incorrectly mapped phrase pairs. The English part of these phrase pairs usually does not have a corresponding Arabic translation (see examples above). Therefore, it is usually paired with incorrect Arabic phrases. The POS features can be added to discourage these phrase pairs from being selected by the decoder. These features mark phrases that consist of one or more of personal and possessive pronouns, prepositions, determiners, particles and wh-words. Example POS features are summarized in Table A. After adding the features, MERT is used to calculate their weights.
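The marking can be sketched as follows; the Penn Treebank tag sets chosen for each category are assumptions for illustration, since Table A is not reproduced here.

```python
# Hypothetical tag sets per category (Table A is not reproduced here).
POS_CATEGORIES = {
    "pronoun": {"PRP", "PRP$"},
    "preposition": {"IN"},
    "determiner": {"DT"},
    "particle": {"RP", "TO"},
    "wh": {"WDT", "WP", "WRB"},
}

def pos_features(english_tags):
    """One binary feature per category, firing when every word on the
    English side of the phrase carries a tag from that category."""
    return {"pos_" + name: int(all(t in tags for t in english_tags))
            for name, tags in POS_CATEGORIES.items()}
```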
Personal Pronouns
Personal pronouns in Arabic can be separate or attached. Similar to English, there are subject pronouns and object pronouns. In addition to the singular and plural pronouns, Arabic has also dual pronouns. Personal pronouns can attach to the end of verbs (subject or object) and to the end of prepositions.
Possessive Pronouns
Referring now to
Because possessive pronouns in English are separate words, there are entries for them in the phrase table. These entries usually map to Arabic words with different meanings. Table B shows some phrase table entries to illustrate what they are mapped to in Arabic. Sometimes, these phrases are selected by the decoder, which usually results in erroneous translations. Therefore, penalizing those phrases should prevent them from being selected.
Prepositions and Particles
As shown in
Therefore, selecting phrases containing just prepositions should be avoided. By adding a feature to mark these phrases, the feature is expected to receive a negative weight, and these phrases are therefore penalized compared to other available phrases.
Particles, when translated separately, usually result in additional Arabic words because a phrasal verb, including the verb and its particle, can map to a single Arabic verb.
Determiners
The determiner class (DT) in English includes, in addition to other words, the definite and indefinite articles "the" and "a" or "an", respectively. In Arabic, the definite article corresponds to "Al" attached as a prefix to the noun, and there is no indefinite article. Having determiners in separate phrases introduces noise. Table C shows their entries in the phrase table. As shown, they correspond to prepositions, which is very harmful to the adequacy and fluency of the output sentence.
Wh-Nouns
Wh-nouns include wh-determiners, wh-pronouns, and wh-adverbs having the POS tags WDT, WP, and WRB, respectively. Features for these POS tags can be added to discourage selecting separate phrases which are limited to wh-nouns. The motivation for this is mainly gender and number agreement. When a wh-noun is attached in one phrase with the word it refers to, it will probably be translated in the correct form.
Dependency Features
These features are based on the syntactic dependency parse tree of the English source sentence. They mark the phrases which have the two words of a specific set of relations in the phrase. For example, a feature amod (adjective modifier) is added to a phrase which contains both the adjective and the noun. These features are expected to get positive weights when trained by MERT and thus make the decoder favor these phrases over others. The suggested dependency features are summarized in Table D. The relations' names follow the Stanford typed dependencies.
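A sketch of the marking step follows, assuming the English parse is available as (label, head index, dependent index) triples and each phrase carries its source-side token span; the relation list mirrors Table D only loosely, since the table is not reproduced here.

```python
RELATIONS = ("amod", "nsubj", "det", "aux", "acomp", "conj", "num", "ref")

def dependency_features(phrase_span, dependencies, relations=RELATIONS):
    """One binary feature per relation type, set when both the head and
    the dependent of a relation fall inside the phrase's source span."""
    start, end = phrase_span
    feats = {"dep_" + r: 0 for r in relations}
    for label, head, dep in dependencies:
        if (label in relations
                and start <= head <= end and start <= dep <= end):
            feats["dep_" + label] = 1
    return feats
```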
The motivation behind these dependency features is mainly agreement: morphological or lexical. Assume that a1 and a2 are Arabic words that should have morphological agreement. Because phrases are extracted from the training data which are assumed to be morphologically correct, using a phrase that contains a1 and a2 assures that they agree.
As explained above, interesting morphological agreement relations include for example, noun-adjective and verb-subject relations. Lexical agreement relations include, for example, relations between phrasal verbs and their prepositions. For example, “talk about” is correct while “talk on” is not. Selecting a phrase where “talk” and its preposition “about” are attached guarantees their agreement.
Some of the dependency features are also motivated by the alignment problems discussed above. These problems arise from trying to align English sentences containing words that have no corresponding separate Arabic words in the Arabic sentences. For example, the acomp relation should favor selecting the phrase "is beautiful" over selecting the two separate phrases "is" and "beautiful", because the separate phrase "is" would translate to an incorrect Arabic word. The aux relation is motivated by the same reason, because most auxiliaries have no corresponding words in Arabic.
The relations amod, nsubj, num, ref, and conj are all motivated by inflectional agreement. The relation nsubj is specifically useful when the subject is a pronoun, in which case it will most often be omitted in the Arabic and helps in generating the correct verb inflection.
Adding det is beneficial in two ways. First, it discourages selecting a phrase with a separate "the", which would result in a wrong Arabic translation as shown in Table C. Second, attaching the determiner to its noun causes the Arabic word to take the correct form, with or without the "Al" prefix, depending on whether the English determiner is "the" or "a", respectively.
Fixing Inflectional Agreement through Post-Processing
In this portion of the specification, a post-processing framework is described. The goal of this system is to fix inflectional agreement between syntactically related words in machine translation output.
The post-processor is based on a learning framework. Given a training corpus of aligned sentence pairs, the system is trained to predict inflections of words in MT output sentences. The system uses a number of multi-class classifiers, one for each feature. For example, there is a separate classifier for gender and a separate classifier for number.
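A minimal sketch of the per-feature training setup, assuming scikit-learn and dictionary-valued feature vectors; the feature inventory and model family are placeholders rather than the system's actual choices.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

MORPH_FEATURES = ["gender", "number"]  # one classifier per feature

def train_classifiers(rows):
    """rows: list of (feature_vector_dict, gold_labels_dict) pairs, where
    gold_labels_dict maps each morphological feature to its gold value."""
    classifiers = {}
    for feat in MORPH_FEATURES:
        X = [fv for fv, _ in rows]
        y = [labels[feat] for _, labels in rows]
        clf = make_pipeline(DictVectorizer(),
                            LogisticRegression(max_iter=1000))
        clf.fit(X, y)
        classifiers[feat] = clf
    return classifiers
```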
Referring now to
For the prediction of the correct features, the MT translation output as well as the source sentence and the alignments are required. This can be referred to as a classification phase 328. Data from a machine translation aligned input/output datastore 332 goes through the same steps as in the training phase 304. The extracted feature vectors are then used to make predictions for each feature separately using a classifier 336.
After prediction/classification by the classifier 336, the correct features are then used along with the lemmas of the words to generate the correct inflected word by a morphology generator 340. Output of the morphology generator 340 can be stored in a first post-processed output datastore 344. Finally, an LM filter 348 uses an N-gram language model to add some robustness to the system 300 against the errors of alignment, morphology analysis and generation and classification. Output of the LM filter 348 can be stored in a second post-processed output datastore 352. If the generated sentence has a lower LM score than the baseline sentence, the baseline sentence is not updated.
As mentioned above, a system is described that can predict the correct inflections of specific words, e.g., words whose inflection is governed by agreement relations. A number of separate multi-class classifiers are trained, one for each morphological feature.
The way certain parts of a sentence should be inflected in correspondence with the inflection of other parts, e.g., the inflection of a verb based on its subject's inflection or the inflection of an adjective based on its noun, could in principle be encoded in a finite set of rules. However, such rules can be very difficult to enumerate. The rules can differ from one part of speech to another and from one language to another. The difficulty of writing manual rules also arises from the existence of exceptional cases to all rules. Therefore, taking this approach requires both writing a full set of POS- and language-dependent rules and handling all the special cases.
For example, consider the inflection of an adjective in agreement with the modified noun.
The general rule: an adjective should follow the noun in gender, number, case and state.
If all the dimensions affecting the correct word inflection could be encoded in a feature vector, many state-of-the-art probabilistic approaches could be used to predict the correct inflections. For example, a structured probabilistic model based on sentence-order decomposition can be used. Such a system can have limitations in modeling agreement because its probabilistic model does not use the dependencies effectively. Although the prediction of a word's inflection strongly depends on the inflection of the parent of the agreement relation, the feature vector in such a system is composed of the stem of the parent.
A tree-based structured probabilistic model, such as a k-MEMM or CRF, that uses the dependency tree is theoretically very effective. However, dependency trees for Arabic sentences may be of poor quality and would result in a very noisy model that might degrade the MT output quality.
Predicting the inflection of each word according to its agreement relation separately can be very effective. As will be explained below, the relations are independent; for example, fixing the inflection of the adjective in an adjective-noun agreement relation is independent of fixing the inflection of the verb in a verb-subject agreement. Therefore, separating the predictions adds robustness to the system and allows training with smaller corpora.
For Arabic analysis, the Morphological Analysis and Disambiguation for Arabic (MADA) system can be used. The system is built on the Buckwalter analyzer, which generates multiple analyses for every word. MADA uses another analyzer and generator tool, ALMORGEANA, to convert the output of Buckwalter from a stem-affix format to a lexeme-and-feature format.
Afterwards, it uses an implementation of support vector machines which includes Viterbi decoding to disambiguate the results of the ALMORGEANA analyses. The result is a list of the morphological features of every word, taking the context (neighboring words in the sentence) into consideration. The morphological features evaluated by MADA are illustrated in Table E. The last four rows of the table represent the attachable clitics whose positions in the word are governed by [prc3 [prc2 [prc1 [prc0 BASEWORD enc0]]]]. For more details about those clitics and their functions, the reader is referred to the MADA+TOKAN manual. In addition to the features listed in the table, the analysis output includes the diacritized form (diac), the lexeme/lemma (lex), the Buckwalter tag (bw), and the English gloss (gloss).
For generation, the lexeme, POS tag, and all other known features from Table E are input to the ALMORGEANA tool. The system searches the lexicon for the word which has the most similar analysis.
The analysis and generation tools can be used to change the declension of a word. For example, to change a word w from the feminine to the masculine form, the following steps are taken:
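The individual steps are not reproduced above; a sketch of the general analyze-modify-generate flow is given below, where analyze and generate are assumed wrapper functions rather than the tools' actual APIs.

```python
def to_masculine(word, analyze, generate):
    """Change a word from the feminine to the masculine form.

    analyze(word) is assumed to return an object with a lexeme and
    feature-value pairs (in the style of MADA/ALMORGEANA); generate(...)
    is assumed to return the closest matching word in the lexicon.
    """
    analysis = analyze(word)
    features = dict(analysis.features)
    features["gen"] = "m"  # override the gender feature
    return generate(analysis.lexeme, features)
```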
To extract the morphologically dependent pairs (agreement pairs), syntax relations are needed. Although Arabic dependency tree parsers exist, for example the Berkeley and Stanford parsers, they can have poor quality. Parallel aligned data can instead be used to project the syntax tree from English to Arabic. The English parse tree is a labeled dependency tree following the grammatical representations described above. Two approaches to projection can be considered.
Given a source sentence consisting of a set of tokens $s_1 \ldots s_n$, a dependency relation is a function $h_s$ such that, for any $s_i$, $h_s(i)$ is the head of $s_i$ and $l_s(i)$ is the label of the relation between $s_i$ and $h_s(i)$.
Given an aligned target sentence $t_1 \ldots t_m$, $A$ is a set of pairs $(s_i, t_j)$ such that $s_i$ is aligned to $t_j$. Similarly, $h_t(j)$ is the head of $t_j$ and $l_t(j)$ is the label of the relation between $t_j$ and $h_t(j)$. Similar to the unlabeled tree projection, projection can be performed according to the following rule:

$h_t(i) = j \iff \exists (s_m, t_i), (s_n, t_j) \in A \text{ such that } h_s(m) = n$  (7)
Labels can also be projected using:

$l_t(i) = x \iff \exists (s_m, t_i), (s_n, t_j) \in A \text{ such that } l_s(m) = x$  (8)
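Rules (7) and (8) can be applied directly over the alignment set; the data layout in the sketch below is an assumption for illustration.

```python
def project_dependencies(alignment, h_s, l_s):
    """Project heads and labels from source to target per rules (7)-(8).

    alignment: set of (m, i) pairs, source token s_m aligned to target
    token t_i; h_s and l_s map a source index to its head index and to
    the label of the relation with that head.
    """
    h_t, l_t = {}, {}
    for (m, i) in alignment:
        n = h_s.get(m)
        if n is None:
            continue
        for (n2, j) in alignment:
            if n2 == n:              # the head of s_m aligns to t_j
                h_t[i], l_t[i] = j, l_s[m]
                break
    return h_t, l_t
```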
Although this approach is helpful for identifying some Arabic dependency relations, it has a number of limitations.
Referring now to
Because of the above limitations, a different approach to partial tree projection can be used. This approach makes use of the Arabic analysis for robustness. It also takes syntactic divergences between the two languages into account.
One goal of the dependency tree projection is the extraction of dependencies between pairs of words that should have morphological agreement, e.g., agreement links. There is thus no need to first project the English tree to a full Arabic tree from which agreement links would then be extracted, which could introduce more errors. Instead, agreement links can be extracted directly using the lexical and syntactic information of both the English and Arabic sentences, taking into consideration the typological differences between the two languages. The projection of some of the relations of interest is explained below.
Adjective Relation (amod)
For an amod relation, an Arabic agreement relation is extracted if the English adjective aligns to an Arabic adjective, while the English noun aligns to an Arabic noun.
In the case when the English word aligns to multiple Arabic words, selecting the noun for the amod relation is based on the heuristic that the first noun after a preposition is marked as the noun of the relation. The motivation behind this rule is illustrated by example XV. If the first word of the multiple word alignment were selected as the noun of the relation, a link amod(AlfwtwgrAfy, mwad) would be extracted although amod(AlfwtwgrAfy, IltSwyr) is the correct link. Linguistic analysis of erroneous agreement links led to the mentioned rule.
Example XVI illustrates the ambiguity problem in a case of one-to-many alignment from English to Arabic. The word airline aligns to two Arabic words. The first word can be selected as the word being described by the adjective. However, in some cases, this rule introduces error.
In an Arabic verb-less sentence, the predicate follows the subject of the sentence.
The agreement link of interest is the link from the verb to the subject, which is the reverse of the link in the English dependency parse tree.
Relative Words (ref)
One way to extract the noun to which a relative word refers is through the English dependency parse tree.
The feature vector extractor 320 as shown in the framework diagram in
For Arabic Features, morphological features include the features returned by the morphology analyzer 312 of
The number of the English gloss is also added to overcome the persistent error of the analyzer in analyzing broken plurals as singular. The reason is that the analyzer identifies a plural by whether the stem is attached to a clitic for plural marking. In the case of broken plurals, however, no affix is added to the stem; instead, the plural is derived from the singular form, a case of derivational morphology. As a solution to this problem, a feature is added to indicate whether any of the English glosses of the word is plural.
The feature “Plural Type” is added because it, significantly affects the decision about the correct inflection. For example, a regular masculine plural noun has its modifying adjective in masculine plural form, while an irregular plural noun usually has its modifying adjective in feminine singular form.
Syntactic features for Arabic include part of speech tags of the current and head words. Lexical features include the stem of the head word and the English gloss.
English features include the part of speech tags and the surface forms of the aligned and head words. On the other hand, general features include the dependency relation type and whether the head comes before or after the current word in the sentence. The latter feature is useful for example in the case of verbs where the verb inflection rules are different for the SVO order versus the VSO order.
In order to perform classifier training (see classifier trainer 324 of
For training the classifiers, an automatic tool can be used for selecting the best classification model for each feature and also for selecting the best parameters for this model using cross-validation. The reported accuracy is actually the mean accuracy of the folds of the cross validation.
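One concrete way to realize this selection, assuming scikit-learn, is shown below; the model family and parameter grid are illustrative, not necessarily the tool actually used.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def select_model(X, y):
    """Choose a model configuration for one feature's classifier by
    cross-validation; returns the fitted model and the mean accuracy
    across the folds (the figure reported above)."""
    grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
    search = GridSearchCV(SVC(), grid, cv=5)
    search.fit(X, y)
    return search.best_estimator_, search.best_score_
```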
In classification, after agreement relations and then feature vectors are extracted, prediction is done separately for each feature using the corresponding classifier.
An N-gram language model is a probabilistic model, specifically a model which predicts the next word in a sentence given the previous N−1 words based on the Markov assumption. The N-gram language model probability of a sentence is approximated as a multiplication of N-gram probabilities, as shown in the second part of equation 9:

$P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})$  (9)
The language model probability can be used as an indicator of the correctness and fluency of a modification. A comparison can be made between P(output sentence) and P(post-processed sentence). If the post-processed sentence has a much lower probability than the output translation (e.g., lower by more than a certain threshold), the changes to the sentence are cancelled. Change filtering using a language model is expected to provide some robustness against all sources of error in the system. However, the language model is not fully reliable: for example, the generated inflected word may be out of vocabulary (OOV) for the language model even though it is morphologically correct.
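A minimal sketch of this change filter follows, assuming a hypothetical log_prob(sentence) function backed by the N-gram model; the comparison is done in log space, and the threshold value shown is an assumption.

import math

def keep_post_processed(output_sentence, post_processed, log_prob,
                        threshold=math.log(10.0)):
    """Accept the post-processed sentence unless the LM scores it much worse.

    The change is cancelled when log P(output) - log P(post-processed)
    exceeds the threshold (here, a factor of 10 in probability).
    """
    if log_prob(output_sentence) - log_prob(post_processed) > threshold:
        return output_sentence   # cancel the changes
    return post_processed        # keep the inflection fixes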
To evaluate the system performance, the accuracy of the classifiers is evaluated and compared to two other prediction algorithms. Prediction accuracy does not, however, measure the performance of the whole system. To evaluate the final output of the system, the BLEU score is used: the BLEU score of the output is compared to that of the baseline MT system output. Because of the BLEU score's limitations in evaluating morphological agreement, human evaluation is also used.
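The BLEU comparison could be computed, for instance, with NLTK, as in the sketch below; this is one possible implementation, not necessarily the tool actually used.

from nltk.translate.bleu_score import corpus_bleu

def compare_bleu(references, baseline_output, post_processed_output):
    """references: one list of tokenized reference translations per sentence;
    the two outputs are lists of tokenized output sentences."""
    return (corpus_bleu(references, baseline_output),
            corpus_bleu(references, post_processed_output))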
BLEU proved to be unreliable for evaluating morphological agreement, which motivated the human evaluation described below.
Side-by-side human evaluation is therefore used to evaluate the techniques; the goal is to rate translation quality. The human raters are provided with the source sentence and two Arabic output sentences, one being the output of the baseline system and the other the post-processed sentence. The sentences are shuffled, so the raters score them without knowing their sources. Raters give a rating between 0 and 6 according to meaning and grammar: 6 is the best rating, for perfect meaning and grammar, while 0 is the lowest, for cases in which no meaning is preserved and the grammar is therefore irrelevant. Ratings from 5 down to 3 are for sentences whose meaning is preserved but which contain increasing grammar mistakes. Ratings below 3 are for sentences with no preserved meaning, in which case the grammar becomes irrelevant and has minimal effect on the quality score.
Consequently, the human evaluation results are not expected to directly reflect whether inflectional agreement, a grammatical feature, is fixed in the sentences. For sentences whose meaning is well preserved, correct inflectional agreement should correspond to an increased sentence score. However, sentences with no preserved meaning are not expected to receive higher scores for correct morphological agreement.
Referring now to
Referring now to
Embodiments of the subject matter and the operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on computer storage medium for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. A computer storage medium can be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them. Moreover, while a computer storage medium is not a propagated signal, a computer storage medium can be a source or destination of computer program instructions encoded in an artificially generated propagated signal. The computer storage medium can also be, or be included in, one or more separate physical components or media (e.g., multiple CDs, disks, or other storage devices). The operations described in this specification can be implemented as operations performed by a data processing apparatus on data stored on one or more computer-readable storage devices or received from other sources.
The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, a system on a chip, or multiple ones, or combinations, of the foregoing. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or a combination of one or more of them. The apparatus and execution environment can realize various different computing model infrastructures, such as web services, distributed computing and grid computing infrastructures.
A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform actions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for performing actions in accordance with instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive), to name just a few. Devices suitable for storing computer program instructions and data include all forms of nonvolatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user, and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.
Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), an inter-network (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks).
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data (e.g., an HTML page) to a client device (e.g., for purposes of displaying data to and receiving user input from a user interacting with the client device). Data generated at the client device (e.g., a result of the user interaction) can be received from the client device at the server.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any inventions or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.
This application claims the benefit of U.S. Provisional Application No. 61/495,928, filed on Jun. 10, 2011. The entire disclosure of the above application is incorporated herein by reference.