Claims
- 1. A method comprising:
obtaining a named entity from text input of a source language; generating potential translations of the named entity from the source language to a target language using a pronunciation-based and spelling-based transliteration model; searching a monolingual resource in the target language for information relating to usage frequency; and providing output comprising at least one of the potential translations based on the usage frequency information.
- 2. The method of claim 1, wherein generating the potential translations of the named entity comprises:
using a first probabilistic model to generate words in the target language and first transliteration scores for the words based on language pronunciation characteristics; using a second probabilistic model to generate second transliteration scores for the words based on a mapping of letter sequences from the target language into the source language; and combining the first transliteration scores and the second transliteration scores into third transliteration scores for the words.
- 3. The method of claim 2, wherein:
using the first probabilistic model comprises generating at least a portion of the words according to unigram probabilities P(w), generating phoneme sequences corresponding to the words with pronunciation probabilities P(e|w), and converting the phoneme sequences into the source language with conversion probabilities P(a|e), the first transliteration scores being governed by $P_p(w|a) \cong \sum_{\forall e} P(w)\,P(e|w)\,P(a|e)$; and using the second probabilistic model comprises generating letters in the source language for the words using the letter sequences mapping with probabilities P(a|w), and generating at least a portion of the words according to a letter trigram model with extended probabilities P(w), the second transliteration scores being governed by $P_s(w|a) \cong P(w)\,P(a|w)$.
- 4. The method of claim 3, wherein combining the first transliteration scores and the second transliteration scores comprises calculating a linear combination, the third transliteration scores being governed by
- 5. The method of claim 1, wherein said obtaining the named entity comprises:
obtaining phrase boundaries of the named entity; and obtaining a category of the named entity.
- 6. The method of claim 5, wherein generating the potential translations of the named entity comprises selectively using a bilingual resource based on the category of the named entity.
- 7. The method of claim 6, wherein selectively using the bilingual resource comprises: if the category comprises an organization or location name, translating one or more words in the named entity using a bilingual dictionary, transliterating the one or more words in the named entity using the pronunciation-based and spelling-based transliteration model, combining the translated one or more words with the transliterated one or more words into a regular expression defining available permutations of the translated one or more words and the transliterated one or more words, and matching the regular expression against a monolingual resource in the target language.
- 8. The method of claim 7, wherein combining the translated one or more words with the transliterated one or more words comprises combining the translated one or more words with n-best transliterations of the transliterated one or more words.
- 9. The method of claim 7, wherein matching the regular expression against the monolingual resource comprises generating scores for the potential translations according to:
- 10. The method of claim 1, wherein providing the output based on the usage frequency information comprises adjusting probability scores of the potential translations based on the usage frequency information.
- 11. The method of claim 10, wherein providing the output further comprises selecting a translation of the named entity from the potential translations based on the adjusted probability scores.
- 12. The method of claim 10, wherein providing the output further comprises selecting a list of likely translations of the named entity from the potential translations based on the adjusted probability scores and a threshold.
- 13. The method of claim 10, wherein the usage frequency information comprises normalized full-phrase hit counts for the potential translations in the monolingual resource, and adjusting the probability scores comprises multiplying the probability scores by the normalized full-phrase hit counts for the potential translations.
- 14. The method of claim 10, wherein adjusting the probability scores comprises:
comparing the named entity with other named entities of a common type in the text input; and if the named entity is a sub-phrase of one of the other named entities, adjusting the probability scores based on normalized full-phrase hit counts corresponding to the one other named entity.
- 15. The method of claim 10, further comprising identifying contextual information in the text input, and wherein searching the monolingual resource comprises searching multiple documents for the potential translations in conjunction with the contextual information to obtain the usage frequency information.
- 16. The method of claim 10, wherein searching the monolingual resource comprises searching multiple documents available over a communications network.
- 17. The method of claim 16, wherein the multiple documents comprise news stories in the target language.
- 18. The method of claim 17, wherein the target language is English.
- 19. The method of claim 18, wherein the source language is Arabic.
- 20. The method of claim 1, further comprising identifying contextual information in the text input, and wherein generating the potential translations of the named entity comprises:
discovering documents in the target language that include the contextual information; identifying named entities in the documents; generating transliteration scores for the named entities in the documents, in relation to the named entity in the text input, using a probabilistic model that uses language pronunciation characteristics and a mapping of letter sequences from the target language into the source language; and adding the scored named entities to the potential translations.
- 21. The method of claim 1, wherein generating the potential translations of the named entity comprises:
generating phrases in the target language and corresponding transliteration scores with a probabilistic model that uses language pronunciation characteristics and a mapping of letter sequences from the target language into the source language, the potential translations comprising the scored phrases; identifying sub-phrases in the generated phrases; discovering documents in the target language using the sub-phrases; identifying, in the discovered documents, named entities that include one or more of the sub-phrases; generating transliteration scores for the identified named entities in the discovered documents using the probabilistic model; and adding the scored named entities to the potential translations.
- 22. An article comprising a machine-readable medium embodying information indicative of instructions that when performed by one or more machines result in operations comprising:
generating potential translations of a named entity from a source language to a target language using a pronunciation-based and spelling-based transliteration model; searching a monolingual resource in the target language for information relating to usage frequency; and providing output comprising at least one of the potential translations based on the usage frequency information.
- 23. The article of claim 22, wherein generating the potential translations of the named entity comprises:
using a first probabilistic model to generate words in the target language and first transliteration scores for the words based on language pronunciation characteristics; using a second probabilistic model to generate second transliteration scores for the words based on a mapping of letter sequences from the target language into the source language; and combining the first transliteration scores and the second transliteration scores into third transliteration scores for the words.
- 24. The article of claim 23, wherein:
using the first probabilistic model comprises generating at least a portion of the words according to unigram probabilities P(w), generating phoneme sequences corresponding to the words with pronunciation probabilities P(e|w), and converting the phoneme sequences into the source language with conversion probabilities P(a|e), the first transliteration scores being governed by $P_p(w|a) \cong \sum_{\forall e} P(w)\,P(e|w)\,P(a|e)$; and using the second probabilistic model comprises generating letters in the source language for the words using the letter sequences mapping with probabilities P(a|w), and generating at least a portion of the words according to a letter trigram model with extended probabilities P(w), the second transliteration scores being governed by $P_s(w|a) \cong P(w)\,P(a|w)$.
- 25. The article of claim 24, wherein combining the first transliteration scores and the second transliteration scores comprises calculating a linear combination, the third transliteration scores being governed by
- 26. The article of claim 22, wherein generating the potential translations of the named entity comprises selectively using a bilingual resource based on a category of the named entity.
- 27. The article of claim 26, wherein selectively using the bilingual resource comprises: if the category comprises an organization or location name, translating one or more words in the named entity using a bilingual dictionary, transliterating the one or more words in the named entity using the pronunciation-based and spelling-based transliteration model, combining the translated one or more words with the transliterated one or more words into a regular expression defining available permutations of the translated one or more words and the transliterated one or more words, and matching the regular expression against a monolingual resource in the target language.
- 28. The article of claim 27, wherein combining the translated one or more words with the transliterated one or more words comprises combining the translated one or more words with n-best transliterations of the transliterated one or more words.
- 29. The article of claim 27, wherein matching the regular expression against the monolingual resource comprises generating scores for the potential translations according to:
- 30. The article of claim 22, wherein providing the output based on the usage frequency information comprises adjusting probability scores of the potential translations based on the usage frequency information.
- 31. The article of claim 30, wherein providing the output further comprises selecting a translation of the named entity from the potential translations based on the adjusted probability scores.
- 32. The article of claim 30, wherein providing the output further comprises selecting a list of likely translations of the named entity from the potential translations based on the adjusted probability scores and a threshold.
- 33. The article of claim 30, wherein the usage frequency information comprises normalized full-phrase hit counts for the potential translations in the monolingual resource, and adjusting the probability scores comprises multiplying the probability scores by the normalized full-phrase hit counts for the potential translations.
- 34. The article of claim 30, wherein adjusting the probability scores comprises:
comparing the named entity with other named entities of a common type in input containing the named entity; and if the named entity is a sub-phrase of one of the other named entities, adjusting the probability scores based on normalized full-phrase hit counts corresponding to the one other named entity.
- 35. The article of claim 22, wherein the operations further comprise identifying contextual information in input containing the named entity, and wherein searching the monolingual resource comprises searching multiple documents for the potential translations in conjunction with the contextual information to obtain the usage frequency information.
- 36. The article of claim 22, wherein searching the monolingual resource comprises searching multiple documents available over a communications network.
- 37. The article of claim 36, wherein the multiple documents comprise news stories in the target language.
- 38. The article of claim 37, wherein the target language is English.
- 39. The article of claim 38, wherein the source language is Arabic.
- 40. The article of claim 22, wherein the operations further comprise identifying contextual information in the text input, and wherein generating the potential translations of the named entity comprises:
discovering documents in the target language that include the contextual information; identifying named entities in the documents; generating transliteration scores for the named entities in the documents, in relation to the named entity, using a probabilistic model that uses language pronunciation characteristics and a mapping of letter sequences from the target language into the source language; and adding the scored named entities to the potential translations.
- 41. The article of claim 22, wherein generating the potential translations of the named entity comprises:
generating phrases in the target language and corresponding transliteration scores with a probabilistic model that uses language pronunciation characteristics and a mapping of letter sequences from the target language into the source language, the potential translations comprising the scored phrases; identifying sub-phrases in the generated phrases; discovering documents in the target language using the sub-phrases; identifying, in the discovered documents, named entities that include one or more of the sub-phrases; generating transliteration scores for the identified named entities in the discovered documents using the probabilistic model; and adding the scored named entities to the potential translations.
- 42. A system comprising:
an input/output (I/O) system; and a potential translations generator coupled with the I/O system, the potential translations generator incorporating a combined pronunciation-based and spelling-based transliteration model used to generate translation candidates for a named entity.
- 43. The system of claim 42, wherein the I/O system comprises a network interface providing access to a monolingual resource, the system further comprising a re-ranker module that adjusts scores of the translation candidates based on usage frequency information discovered in the monolingual resource using the network interface.
- 44. The system of claim 43, further comprising a bilingual resource, wherein the potential translations generator selectively uses the bilingual resource based on a category of the named entity.
- 45. The system of claim 44, wherein the potential translations generator comprises:
a person entity handling module; a location and organization entity handling module that accesses the bilingual resource; and a re-matcher module that accesses a news corpus to generate scores for translation candidates generated by the location and organization entity handling module.
- 46. The system of claim 43, wherein the re-ranker module incorporates multiple separate re-scoring modules that apply different re-scoring factors.
- 47. The system of claim 43, wherein the re-ranker module adjusts scores of the translation candidates based at least in part on context information corresponding to the named entity.
- 48. The system of claim 42, wherein the potential translations generator generates the translation candidates based at least in part on context information corresponding to the named entity.
- 49. The system of claim 42, wherein the potential translations generator generates the translation candidates based at least in part on sub-phrases identified in an initial set of translation candidates.
- 50. A system comprising:
means for generating potential translations of a named entity from a source language to a target language using spelling-based transliteration; and means for adjusting probability scores of the generated potential translations based on usage frequency information discovered in a monolingual resource.
- 51. The system of claim 50, wherein the means for generating comprises means for selectively using a bilingual dictionary and a news corpus.
- 52. The system of claim 51, wherein the means for adjusting comprises means for re-ranking the potential translations based on context information and identified sub-phrases of the potential translations.
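The scoring and re-ranking steps recited in the claims above (combining pronunciation-based and spelling-based transliteration scores, then adjusting them by usage frequency) can be illustrated with a minimal Python sketch. It is not the claimed implementation. The function names, the weight `lam`, and the sample candidates are hypothetical. The governing formula for the linear combination is not reproduced in this text, so the convex-weight form used here is an assumption, as is normalizing hit counts by their sum over the candidate set.

```python
from typing import Dict, List, Tuple


def combine_scores(p_pron: Dict[str, float],
                   p_spell: Dict[str, float],
                   lam: float = 0.5) -> Dict[str, float]:
    """Linearly combine per-candidate scores from the two models.

    ASSUMPTION: a convex combination lam * P_s + (1 - lam) * P_p.
    The source only says "a linear combination".
    """
    words = set(p_pron) | set(p_spell)
    return {
        w: lam * p_spell.get(w, 0.0) + (1.0 - lam) * p_pron.get(w, 0.0)
        for w in words
    }


def rerank_by_hit_counts(scores: Dict[str, float],
                         hit_counts: Dict[str, int]) -> List[Tuple[str, float]]:
    """Multiply each score by its normalized full-phrase hit count.

    ASSUMPTION: counts are normalized by their sum over all candidates.
    If no candidate has any hits, the original scores are kept.
    """
    total = sum(hit_counts.get(w, 0) for w in scores)
    if total == 0:
        adjusted = dict(scores)
    else:
        adjusted = {w: s * hit_counts.get(w, 0) / total
                    for w, s in scores.items()}
    # Highest adjusted score first.
    return sorted(adjusted.items(), key=lambda kv: kv[1], reverse=True)


def select_above_threshold(ranked: List[Tuple[str, float]],
                           threshold: float) -> List[str]:
    """Keep the candidates whose adjusted score meets the threshold."""
    return [w for w, s in ranked if s >= threshold]


# Hypothetical candidates for one source-language name.
p_pron = {"Clinton": 0.6, "Klinton": 0.4}
p_spell = {"Clinton": 0.5, "Klinton": 0.5}
hits = {"Clinton": 900, "Klinton": 100}  # invented counts, for illustration

combined = combine_scores(p_pron, p_spell, lam=0.5)
ranked = rerank_by_hit_counts(combined, hits)
likely = select_above_threshold(ranked, threshold=0.1)
```

With these invented inputs, the candidate that is both well scored and frequent in the monolingual resource ranks first, and the threshold removes the rarely attested spelling.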
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of the priority of U.S. Provisional Application Serial No. 60/363,443, filed Mar. 11, 2002 and entitled “NAMED ENTITY TRANSLATION”.
STATEMENT AS TO FEDERALLY SPONSORED RESEARCH
[0002] The invention described herein was made in the performance of work under Defense Advanced Research Projects Agency (DARPA) grant no. N66001-00-1-8914, pursuant to which the Government has certain rights to the invention, and is subject to the provisions of Public Law 96-517 (35 U.S.C. 202) in which the contractor has elected to retain title.
Provisional Applications (1)
| Number | Date | Country |
| --- | --- | --- |
| 60363443 | Mar 2002 | US |