This application is based upon and claims the benefit of priority from prior Chinese Patent Application No. 200710186545.6, filed Dec. 7, 2007, the entire contents of which are incorporated herein by reference.
1. Field of the Invention
The present invention relates to target language word inflection (TLWI) in corpus based automatic machine translation technology and, specifically, to a method and apparatus for training a target language word inflection (TLWI) model based on a bilingual corpus, a TLWI method and apparatus, and a translation method and system for translating a source language text into a target language translation.
2. Description of the Related Art
Word inflection exists in many languages. In English, for example, verbs are inflected for tense and nouns are inflected for number. Information such as time, number and sensibility can therefore be obtained from the word inflection and used to understand an English sentence accurately.
Currently, there are two main techniques for automatic machine translation: the rule-based approach and the corpus-based approach. The rule-based approach utilizes translation rules to train and build a translation model and translates based on the trained translation model. The corpus-based approach utilizes a bilingual corpus to train and build the translation model.
In the rule-based approach, the target language word inflection can be produced by using the translation rules. However, the translation rules are generally written manually, which is time-consuming. In addition, the translation rules must use deep syntax parsing information. In spoken language translation, the structure of a sentence is very loose, so it is very difficult to parse the sentence accurately.
In the corpus-based approach, the target language word inflection comes from the bilingual corpus. Only if the bilingual corpus contains a given target language word inflection can the translation model based on this bilingual corpus output that inflection. Therefore the size of the bilingual corpus affects the accuracy of the translation.
The rule-based approach and the corpus-based approach have been described in detail, for example, in the book “Machine Translation Theory”, Tiejun ZHAO, etc. (Harbin Institute of Technology Press, May, 2001), and in the book “Machine Translation: an Introductory Guide”, D. J. Arnold, Lorna Balkan, Siety Meijer, R. Lee Humphreys and Louisa Sadler (Blackwells-NCC, 1994), and in the article “Machine Translation over Fifty Years”, John Hutchins, in Histoire, Epistemologies, Language, Tome XXII, pp. 7-31, 2001.
The present invention is directed to the above technical problems and provides a method and apparatus for training a target language word inflection (TLWI) model based on a bilingual corpus, a TLWI method and apparatus, and a translation method and system for translating a source language text into a target language translation.
According to one aspect of the invention, there is provided a method for training a target language word inflection model based on a bilingual corpus, wherein the bilingual corpus includes a plurality of aligned corpus pairs of source language and target language, the method comprising: building an initial TLWI model; pre-processing the source language corpus and the target language corpus; extracting patterns containing TLWI information, based on the pre-processed source language corpus and the target language corpus; and training the TLWI model by using the patterns.
According to another aspect of the invention, there is provided a TLWI method, wherein a source language text is translated into a target language translation and the source language text is pre-processed so that each of source language words in the source language text is prototypical and tagged with POS, the method comprising: training a TLWI model by using the above method for training a target language word inflection model based on a bilingual corpus; and inflecting target language words in the target language translation based on the TLWI model.
According to another aspect of the invention, there is provided a translation method for translating a source language text into a target language translation, comprising: pre-processing the source language text to obtain a sequence of source language words each of which is prototypical and tagged with POS; translating the pre-processed source language text into an initial target language translation based on a corpus based translation model; and editing the initial target language translation to obtain the final target language translation by using the above TLWI method.
According to another aspect of the invention, there is provided an apparatus for training a TLWI model based on a bilingual corpus, wherein the bilingual corpus includes a plurality of aligned corpus pairs of source language and target language, the apparatus comprising: an initial model builder configured to build an initial TLWI model; a corpus pre-processing unit configured to pre-process the source language corpus and the target language corpus; a pattern extractor configured to extract patterns containing TLWI information based on the pre-processed source language corpus and the target language corpus; and a training unit configured to train the TLWI model by using the patterns.
According to another aspect of the invention, there is provided a TLWI apparatus, wherein a source language text is translated into a target language translation and the source language text is pre-processed so that each of source language words in the source language text is prototypical and tagged with POS, the apparatus comprising: a TLWI model trained by the above apparatus for training a TLWI model based on a bilingual corpus; and a word inflection unit configured to inflect target language words in the target language translation based on the TLWI model.
According to another aspect of the invention, there is provided a translation system for translating a source language text into a target language translation, comprising: a text pre-processing unit configured to pre-process the source language text to obtain a sequence of source language words each of which is prototypical and tagged with POS; a corpus based translation model configured to translate the pre-processed source language text into an initial target language translation; and the above TLWI apparatus configured to edit the initial target language translation to obtain the final target language translation.
It is believed that the above and other objectives, characteristics and advantages of the present invention will be more apparent with the following detailed description of the specific embodiments for carrying out the present invention taken in conjunction with the drawings.
In this embodiment, the bilingual corpus includes a plurality of aligned corpus pairs of source language and target language and the corpus can be in phrase form, sentence form or paragraph form. In order to facilitate the description, in the present and later described embodiments, it is assumed that the corpus is in sentence form. That is, the bilingual corpus is the bilingual example corpus in which the source language sentences and the target language sentences are aligned.
As shown in
Then at Step 105, the source language sentences and the target language sentences in the bilingual example corpus are pre-processed. Specifically, for each pair of the plurality of aligned sentence pairs of source language and target language, the source language sentence is pre-processed so that each of source language words in the pre-processed source language sentence is prototypical and tagged with Part of Speech (POS). At the same time, the target language sentence is pre-processed so that each of target language words in the pre-processed target language sentence is prototypical and tagged with POS.
Next, Step 105 will be described assuming that the source language is Chinese and the target language is English. Firstly, the Chinese sentence is segmented into a sequence of Chinese words, each of which is tagged with POS. The segmentation method is known to the person skilled in the art and its description will be omitted. Then, each of the English words in the English sentence is stemmed and tagged with POS.
At Step 110, based on the pre-processed plurality of aligned sentence pairs of source language and target language, patterns containing TLWI information can be extracted.
Then at Step 1105, the target language words that are inconsistent between the original target language sentence and the pre-processed target language sentence are searched out. That is, the inflected target language words are found in the target language sentence.
At Step 1110, the source language words aligned with the inconsistent target language words searched in Step 1105 can be obtained from the pre-processed source language sentence, based on the word alignment information.
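Steps 1105 and 1110 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names, the representation of the word alignment as a dictionary from target index to source index, and the alignment values themselves are all assumptions made for the example.

```python
def find_inflected(original, stemmed):
    """Step 1105 (sketch): return (index, stem, surface form) for every
    target word whose pre-processed form differs from the original."""
    return [
        (i, stem, surface)
        for i, (surface, stem) in enumerate(zip(original, stemmed))
        if surface.lower() != stem.lower()
    ]


def aligned_sources(inflected, alignment):
    """Step 1110 (sketch): map each inflected target index to the index
    of the source word aligned with it, when such an alignment exists."""
    return {i: alignment[i] for i, _, _ in inflected if i in alignment}


original = ["The", "girl", "just", "washed", "these", "apples", "."]
stemmed = ["The", "girl", "just", "wash", "these", "apple", "."]
# Hypothetical word alignment: target index -> source index.
alignment = {0: 0, 1: 1, 2: 2, 3: 3, 4: 5, 5: 6, 6: 7}

inflected = find_inflected(original, stemmed)
print(inflected)  # [(3, 'wash', 'washed'), (5, 'apple', 'apples')]
print(aligned_sources(inflected, alignment))  # {3: 3, 5: 6}
```

The comparison is done position by position, which assumes that pre-processing keeps the number and order of target language words unchanged.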
Then at Step 1115, according to the inconsistent target language words and the aligned source language words and contexts of the aligned source language words in the original source language sentence, the patterns containing TLWI information can be generated.
In this embodiment, the TLWI information can include: POS of the source language word; combinations of the contexts of the source language word as conditions; inflection behavior of the target language word aligned with the source language words as action. That is, the pattern is composed of POS portion, condition portion and action portion.
Further, the combinations of the contexts of the source language word in the condition portion can be pre-determined, for example, including: a) previous source language word; b) previous source language word and next source language word; c) source language word before the previous source language word; and d) source language word after the next source language word.
For example, the Chinese sentence contains 7 Chinese words, i.e. “C1/P1 C2/P2 C3/P3 C4/P4 C5/P5 C6/P6 C7/P7”, wherein Ci represents the Chinese word and Pi represents the POS. Assuming that “C4/P4” is the Chinese word aligned with the inflected English word “W4/P4”, when the above example is used as the combinations of the contexts, the conditions of the extracted pattern are: a) −1 C3; b) −1 C3 +1 C5; c) −2 C2; d) +2 C6.
The person skilled in the art will understand that the combinations of the contexts are not limited to the above-described examples and can include other combinations.
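The condition portion for the seven-word example above can be generated as in the following sketch. The names `CONTEXT_COMBINATIONS`, `extract_conditions` and `make_patterns`, the dictionary layout of a pattern, and the choice to emit one pattern per context combination are assumptions made for illustration; the action value is likewise hypothetical.

```python
# Context combinations a) to d) from the text, as tuples of relative offsets.
CONTEXT_COMBINATIONS = [(-1,), (-1, +1), (-2,), (+2,)]


def extract_conditions(words, index, combinations=CONTEXT_COMBINATIONS):
    """Return one condition per combination whose offsets all fall inside
    the sentence; a condition is a tuple of (offset, word) pairs."""
    conditions = []
    for offsets in combinations:
        if all(0 <= index + off < len(words) for off in offsets):
            conditions.append(tuple((off, words[index + off]) for off in offsets))
    return conditions


def make_patterns(words, tags, index, action):
    """Build patterns made of a POS portion, a condition portion and an
    action portion for the source word at the given index."""
    return [
        {"pos": tags[index], "condition": cond, "action": action}
        for cond in extract_conditions(words, index)
    ]


words = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]
tags = ["P1", "P2", "P3", "P4", "P5", "P6", "P7"]

# C4 sits at index 3; the action stands in for the observed inflection.
for pattern in make_patterns(words, tags, 3, ("add_suffix", "ed")):
    print(pattern)
```

For C4 the four conditions produced are −1 C3; −1 C3 together with +1 C5; −2 C2; and +2 C6, matching the conditions listed in the example.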
Return to
The method for training a TLWI model based on a bilingual corpus of the embodiment will be described in detail in conjunction with a specific example.
A pair of aligned Chinese sentence and English sentence is:
At first, the two sentences are pre-processed as follows:
Chs: /pron /n /adv /v /u /pron /no /w
Eng: The/art girl/n just/adv wash/v these/pron apple/n ./w
The pre-processed Chinese sentence is shown in Table 1.
The pre-processed English sentence is shown in Table 2.
Then the word alignment is performed on the pre-processed Chinese sentence and the pre-processed English sentence to obtain the word alignment information, as shown in Table 3.
Then, the English words in the pre-processed English sentence that are inconsistent with the original English sentence are searched out. By comparison, two inconsistent English words are obtained, i.e. “wash” (“washed” in the original sentence) and “apple” (“apples” in the original sentence).
Thus, the Chinese words aligned with the two inconsistent English words in the Chinese sentence are and
According to the two inconsistent English words, the aligned Chinese words and the contexts of the aligned Chinese words in the original Chinese sentence, two patterns containing the English word inflection information can be generated, as shown in Table 4.
In Table 4, the pattern P1 is generated from “wash|washed” inflection, which means that for a Chinese word with POS “v” in a Chinese sentence, if the previous Chinese word is and the next Chinese word is the inflection of the English word aligned with the Chinese word is to add “ed” to the termination. The pattern P2 is generated from “apple|apples” inflection, which means that for a Chinese word with POS “n” in a Chinese sentence, if the previous Chinese word is the inflection of the English word aligned with the Chinese word is to add “s” to the termination.
Finally, after all patterns are extracted based on the bilingual example corpus, the TLWI model is trained by these patterns.
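The training algorithm is not fixed by the description. As one possible reading, assuming the TLWI model is a simple probability model, the score of each action can be estimated by relative frequency over the extracted patterns. The function name `train_tlwi`, the tuple layout of a pattern and the sample patterns below are hypothetical.

```python
from collections import Counter, defaultdict


def train_tlwi(patterns):
    """Estimate P(action | POS, condition) by relative frequency.

    Each pattern is a (pos, condition, action) tuple. This is only one
    possible training scheme, chosen here for illustration.
    """
    counts = Counter(patterns)
    totals = Counter((pos, cond) for pos, cond, _ in patterns)
    model = defaultdict(dict)
    for (pos, cond, action), n in counts.items():
        model[(pos, cond)][action] = n / totals[(pos, cond)]
    return dict(model)


# Hypothetical extracted patterns; English glosses stand in for source words.
patterns = [
    ("n", "-1 these", "+s"),
    ("n", "-1 these", "+s"),
    ("n", "-1 these", "+es"),
    ("v", "-1 just", "+ed"),
]

model = train_tlwi(patterns)
print(model[("v", "-1 just")])  # {'+ed': 1.0}
```

A relative-frequency estimate of this kind also yields a pattern score that can later be combined with a language model score when several patterns apply to the same word.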
It can be seen from the above description that the method for training a TLWI model based on a bilingual corpus of the embodiment can train the TLWI model on the basis of the pre-processed bilingual corpus and uses only shallow parsing information. The trained TLWI model can be applied to a spoken language translation system and other corpus based translation systems and can improve the translation quality.
Under the same inventive concept,
The TLWI method of the embodiment can be used to further make a target language translation more accurate. In this embodiment, the target language translation is obtained by translating a source language text based on a corpus based translation model, and the source language text is pre-processed so that each of source language words in the source language text is prototypical and tagged with POS.
The corpus based translation model can be any existing or future corpus based translation model, for example, the statistical machine translation (SMT) model.
As shown in
Then at Step 310, the target language words in the target language translation are inflected based on the trained TLWI model.
If there are corresponding patterns, at Step 3105, for each of the patterns, it is verified whether the contexts of the source language word satisfy the conditions in the pattern. If the conditions in the pattern are satisfied, the action in the pattern is performed on the target language word aligned with the source language word in the target language translation. If the conditions are not satisfied, Step 3101 is performed on the next source language word.
If it is determined in Step 3101 that there is no pattern corresponding to the source language word, Step 3101 is performed on the next source language word.
By using the above steps, the target language words to be inflected can be found in the target language translation and inflected.
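The lookup, verification and action steps described above can be sketched as follows. The sketch assumes that patterns are looked up by the POS of the source language word, that an action simply appends a suffix, and that the first satisfied pattern is applied; the source words are replaced by placeholders `S0` to `S6`, and the patterns and alignment are hypothetical.

```python
def condition_holds(words, index, condition):
    """Check every (offset, expected word) pair of a condition."""
    for offset, expected in condition:
        pos = index + offset
        if not (0 <= pos < len(words)) or words[pos] != expected:
            return False
    return True


def apply_action(word, action):
    """Perform an inflection action; only suffix addition is sketched."""
    kind, suffix = action
    return word + suffix if kind == "add_suffix" else word


def inflect(src_words, src_tags, tgt_words, alignment, patterns):
    """For each source word, find patterns for its POS, verify their
    conditions, and apply the first satisfied pattern to the aligned
    target word. Words without patterns are skipped."""
    result = list(tgt_words)
    for i, tag in enumerate(src_tags):
        if i not in alignment:
            continue
        for pattern in patterns.get(tag, []):
            if condition_holds(src_words, i, pattern["condition"]):
                j = alignment[i]
                result[j] = apply_action(tgt_words[j], pattern["action"])
                break
    return result


src_words = ["S0", "S1", "S2", "S3", "S4", "S5", "S6"]  # placeholders
src_tags = ["pron", "n", "adv", "v", "u", "no", "w"]
tgt_words = ["These", "boy", "just", "watch", "TV", "."]
alignment = {0: 0, 1: 1, 2: 2, 3: 3, 5: 4, 6: 5}  # source -> target
patterns = {
    "n": [{"condition": ((-1, "S0"),), "action": ("add_suffix", "s")}],
    "v": [{"condition": ((-1, "S2"), (1, "S4")), "action": ("add_suffix", "ed")}],
}

print(" ".join(inflect(src_words, src_tags, tgt_words, alignment, patterns)))
# These boys just watched TV .
```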
Further, when the verification result of Step 3105 is that the conditions in more than one pattern are satisfied, at Step 3110, the actions in these patterns are performed respectively on the target language word aligned with the source language word to obtain more than one target language translation candidate.
Then at Step 3115, for each of the more than one candidates, a fluency score of the candidate is calculated based on a language model of the target language, and at Step 3120, a pattern score of the pattern used to obtain the candidate is calculated based on the TLWI model. Next at Step 3125, the fluency score and the pattern score are combined together and the score of the combination can be obtained. For example, the combination can be a product or a weighted summation. Thus the score of the combination is the score of the candidate.
Finally, at Step 3130, the candidate corresponding to the highest score is selected as final target language translation.
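Steps 3115 to 3130 can be sketched as below. The function `select_best`, the `weight` parameter of the weighted summation, and all score values are assumptions made for illustration; a real system would obtain the fluency score from a language model of the target language and the pattern score from the trained TLWI model.

```python
def select_best(candidates, fluency, pattern_score, weight=None):
    """Return the candidate with the highest combined score.

    With weight=None the two scores are combined as a product; otherwise
    a weighted summation weight * fluency + (1 - weight) * pattern score
    is used.
    """
    def combined(candidate):
        f = fluency(candidate["words"])
        p = pattern_score(candidate["pattern"])
        return f * p if weight is None else weight * f + (1 - weight) * p

    return max(candidates, key=combined)


# Hypothetical scores standing in for a language model and a TLWI model.
FLUENCY = {
    "These boys just watched TV .": 0.020,
    "These boys just watching TV .": 0.001,
}
PATTERN_SCORE = {"P1": 0.6, "P2": 0.4}

candidates = [
    {"words": "These boys just watched TV .", "pattern": "P1"},
    {"words": "These boys just watching TV .", "pattern": "P2"},
]

best = select_best(candidates, FLUENCY.get, PATTERN_SCORE.get)
print(best["words"])  # These boys just watched TV .
```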
The steps of selecting the final target language translation from the more than one target language translation candidates can be represented by the equation in the following:
where e represents the candidate, PLM(•) represents the language model of the target language, fTLWI(•) represents the TLWI model, argmax{•} represents a function used to select maximum value, and ê represents the final target language translation.
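As an illustration, if the combination is taken to be a product, the selection described by these symbols can be written as in the first line below. The second line shows the weighted-summation variant; its weight λ is an assumption of this sketch and is not a symbol defined in the description.

```latex
\hat{e} = \operatorname*{argmax}_{e} \left\{ P_{LM}(e) \cdot f_{TLWI}(e) \right\}

% Weighted-summation variant, with a hypothetical weight \lambda \in [0, 1]:
\hat{e} = \operatorname*{argmax}_{e} \left\{ \lambda \, P_{LM}(e) + (1 - \lambda) \, f_{TLWI}(e) \right\}
```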
It can be seen from the above description that the TLWI method of the embodiment can utilize the trained TLWI model to inflect the target language words in the target language translation, so that the translation quality can be improved. Further, the TLWI method can select the optimal target language word inflection from the multiple target language translation candidates by combining the language model and the TLWI model, and thereby obtain the optimal target language translation.
Under the same inventive concept,
As shown in
Then at Step 505, the pre-processed source language text is translated into an initial target language translation based on a corpus based translation model. As described above, the corpus based translation model can be an SMT model or the like.
Then at Step 510, the initial target language translation is edited to obtain the final target language translation by using the TLWI method described in the above embodiment.
The translation method of the embodiment will be described in detail in conjunction with one example. It is assumed that the source language is Chinese and the target language is English and the corpus based translation model is the SMT model. The inputted sentence is Firstly the sentence is pre-processed and the pre-processed sentence is /pron/n /adv /v /u /no /w”. Then based on the SMT model, the initial English translation is “These/pron boy/n just/adv watch/v TV/n ./w”. And the initial English translation is edited based on the TLWI model. That is, the English word “boy” is inflected into “boys” and the “watch” is inflected into “watched”. Thus the final English translation is “These boys just watched TV.”.
It can be seen from the above description that the translation method for translating a source language text into a target language translation of the embodiment can translate based on the corpus based translation model and further use the TLWI model to inflect the target language words in the target language translation, so that the translation can be more accurate.
Under the same inventive concept,
As described above, the bilingual corpus includes a plurality of aligned corpus pairs of source language and target language and the corpus can be in phrase form, sentence form or paragraph form. Commonly, the bilingual corpus is the bilingual example corpus.
As shown in
As described above, the TLWI model can be a probability model or a pattern recognition model or the like. The training unit 604 can use the corresponding training algorithm to train the TLWI model.
In the corpus pre-processing unit 602, a source language corpus pre-processing unit pre-processes the source language corpus so that each of source language words in the pre-processed source language corpus is prototypical and tagged with POS. At the same time, a target language corpus pre-processing unit pre-processes the target language corpus so that each of target language words in the pre-processed target language corpus is prototypical and tagged with POS.
For example, when the source language corpus is a Chinese sentence and the target language corpus is an English sentence, in the source language corpus pre-processing unit, firstly a segmenting unit segments the Chinese sentence into a sequence of Chinese words, and then a tagging unit tags each of the Chinese words with POS. In the target language corpus pre-processing unit, each English word in the English sentence is stemmed and tagged with POS.
As described above, the TLWI information can include: POS of the source language word; combinations of the contexts of the source language word as conditions; inflection behavior of the target language word aligned with the source language words as action. The combinations of the contexts of the source language word can be pre-determined, for example, including: previous source language word; previous source language word and next source language word; source language word before the previous source language word; and source language word after the next source language word. Of course, the combinations of the contexts are not limited to the above-described examples and can include other combinations.
It should be noted that the apparatus 600 for training a TLWI model based on a bilingual corpus of this embodiment and its components can be implemented with specifically designed circuits or chips, and also can be implemented by executing corresponding programs on a general computer (processor). Also, the apparatus 600 for training a TLWI model based on a bilingual corpus in the present embodiment may operationally perform the method for training a TLWI model based on a bilingual corpus of the embodiment shown in
Under the same inventive concept,
In this embodiment, a source language text can be translated into the target language translation based on a corpus based translation model, and the source language text is pre-processed so that each of source language words in the source language text is prototypical and tagged with POS, and the pre-processed source language text is stored in a related storage unit.
As shown in
Further, when the verification result of the condition verifier 8022 is that the conditions in more than one pattern are satisfied, the action performing unit 8023 performs the actions in these patterns respectively on the target language word aligned with the source language word to obtain more than one target language translation candidate. These target language translation candidates are stored in a storage unit. For each of the candidates, a fluency calculator calculates a fluency score of the candidate based on a language model of the target language, and a pattern score calculator calculates a pattern score of the pattern used to obtain the candidate based on the TLWI model 801. Then a combination score obtaining unit obtains a score of a combination of the fluency score and the pattern score, as the score of the candidate. The combination can be a product or a weighted summation. Finally, a selector selects the candidate corresponding to the highest score as the final target language translation.
It should be noted that the TLWI apparatus 800 of this embodiment and its components can be implemented with specifically designed circuits or chips, and also can be implemented by executing corresponding programs on a general computer (processor). Also, the TLWI apparatus 800 in the present embodiment may operationally perform the TLWI method of the embodiment shown in
Under the same inventive concept,
As shown in
For example, when the source language corpus is a Chinese sentence, in the text pre-processing apparatus 1001, the Chinese sentence is segmented into a sequence of Chinese words, and then each of the Chinese words is tagged with POS.
As described above, the corpus based translation model can be any existing or future corpus based translation model, such as the SMT model.
It should be noted that the translation system 1000 for translating a source language text into a target language translation of this embodiment and its components can be implemented with specifically designed circuits or chips, and also can be implemented by executing corresponding programs on a general computer (processor). Also, the translation system 1000 for translating a source language text into a target language translation in the present embodiment may operationally perform the translation method for translating a source language text into a target language translation of the embodiment shown in
Although a method and apparatus for training a target language word inflection model based on a bilingual corpus, a TLWI method and apparatus, and a translation method and system for translating a source language text into a target language translation have been described in detail above in conjunction with concrete embodiments, the present invention is not limited thereto. It should be understood by persons skilled in the art that the above embodiments may be varied, replaced or modified without departing from the spirit and the scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
200710186545.6 | Dec 2007 | CN | national |