Claims
- 1. A method for building a translation lexicon from non-parallel corpora, the method comprising:
identifying identically spelled words in a first corpus and a second corpus, the first corpus including words in a first language and the second corpus including words in a second language; generating a seed lexicon including identically spelled words; and expanding the seed lexicon by identifying possible translations of words in the first and second corpora using one or more clues.
- 2. The method of claim 1, wherein said expanding comprises using the identically spelled words in the seed lexicon as accurate translations.
- 3. The method of claim 1, further comprising:
identifying substantially identical words in the first and second corpora; and adding said substantially identical words to the seed lexicon.
- 4. The method of claim 3, wherein said identifying substantially identical words comprises:
applying transformation rules to words in the first corpus to form transformed words; and comparing said transformed words to words in the second corpus.
- 5. The method of claim 1, wherein said one or more clues include similar spelling.
- 6. The method of claim 1, wherein said identifying comprises identifying cognates.
- 7. The method of claim 1, wherein said identifying comprises identifying word pairs having a minimum longest common subsequence ratio.
- 8. The method of claim 1, wherein said one or more clues include similar context.
- 9. The method of claim 1, wherein said identifying comprises:
identifying a plurality of context words; and identifying a frequency of context words in an n-word window around a target word.
- 10. The method of claim 9, further comprising generating a context vector.
- 11. The method of claim 1, wherein said identifying comprises identifying frequencies of occurrence of words in the first and second corpora.
- 12. The method of claim 1, further comprising:
generating matching scores for each of a plurality of clues.
- 13. The method of claim 12, further comprising adding the matching scores.
- 14. The method of claim 13, further comprising weighting the matching scores.
- 15. A method for generating parallel corpora from non-parallel corpora, the method comprising:
aligning text segments in two non-parallel corpora, the corpora including a source language corpus and a target language corpus; matching strings in the two non-parallel corpora; and generating a parallel corpus including the matched strings as translation pairs.
- 16. The method of claim 15, wherein said matching comprises using a bilingual lexicon comprising translation pairs including corresponding source language words and target language words.
- 17. The method of claim 15, wherein said aligning comprises:
generating a Bilingual Suffix Tree.
- 18. The method of claim 17, further comprising:
traversing the Bilingual Suffix Tree on edges labeled with word pairs; and extracting paths that end at one of a leaf and a node having outgoing edges labeled only with source language words.
- 19. The method of claim 17, further comprising:
generating a Generalized Suffix Tree from the source language corpus; generating a Generalized Suffix Tree from the target language corpus; and matching strings in said Generalized Suffix Trees.
- 20. The method of claim 15, further comprising:
identifying words in the two corpora surrounded by matching strings, one of the words being unknown.
- 21. The method of claim 20, further comprising:
identifying said words as a translation pair.
- 22. The method of claim 20, further comprising:
generating a Bilingual Suffix Tree from the two corpora; generating a reverse Bilingual Suffix Tree; and identifying words in the two corpora surrounded by aligned sequences.
- 23. An apparatus comprising:
a word comparator operative to identify identically spelled words in a first corpus and a second corpus and build a seed lexicon including said identically spelled words, the first corpus including words in a first language and the second corpus including words in a second language; and a lexicon builder operative to expand the seed lexicon by identifying possible translations of words in the first and second corpora using one or more clues.
- 24. The apparatus of claim 23, wherein the lexicon builder is operative to use the identically spelled words in the seed lexicon as accurate translations.
- 25. An apparatus for generating parallel corpora from non-parallel corpora, the apparatus comprising:
an alignment module operative to align text segments in two non-parallel corpora, the corpora including a source language corpus and a target language corpus; and a matching module operative to match strings in the two non-parallel corpora and generate a parallel corpus including the matched strings as translation pairs.
- 26. The apparatus of claim 25, wherein the alignment module is operative to build a Bilingual Suffix Tree from a text segment from one of said two non-parallel corpora.
- 27. An article comprising a machine-readable medium including machine-executable instructions, the instructions operative to cause a machine to:
identify identically spelled words in a first corpus and a second corpus, the first corpus including words in a first language and the second corpus including words in a second language; generate a seed lexicon including identically spelled words; and expand the seed lexicon by identifying possible translations of words in the first and second corpora using one or more clues.
- 28. The article of claim 27, wherein the instructions operative to cause the machine to expand comprise instructions operative to cause the machine to use the identically spelled words in the seed lexicon as accurate translations.
- 29. An article comprising a machine-readable medium including machine-executable instructions, the instructions operative to cause a machine to:
align text segments in two non-parallel corpora, the corpora including a source language corpus and a target language corpus; match strings in the two non-parallel corpora; and generate a parallel corpus including the matched strings as translation pairs.
- 30. The article of claim 29, wherein the instructions operative to cause the machine to match comprise instructions operative to cause the machine to use a bilingual lexicon comprising translation pairs including corresponding source language words and target language words.
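By way of non-limiting illustration only, the lexicon-building steps recited in claims 1, 7, 9, 10, and 12 through 14 might be sketched as follows. The sketch is not part of the claims and does not limit them. All function names, the window size, and the weights are hypothetical, and the ratio is assumed here to be the longest common subsequence length divided by the length of the longer word, a definition the claims do not themselves state.

```python
from collections import Counter


def seed_lexicon(first_corpus, second_corpus):
    """Map each word that is spelled identically in both tokenized corpora to itself."""
    shared = set(first_corpus) & set(second_corpus)
    return {word: word for word in shared}


def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(a, b):
    """Assumed definition: LCS length divided by the length of the longer word."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def context_vector(tokens, target, context_words, window=2):
    """Count each context word within `window` tokens on either side of the target."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in context_words:
                counts[tokens[j]] += 1
    return [counts[word] for word in context_words]


def combined_score(scores, weights):
    """Weighted sum of per-clue matching scores (weights are hypothetical)."""
    return sum(weights[clue] * score for clue, score in scores.items())


if __name__ == "__main__":
    german = ["der", "computer", "ist", "neu"]
    english = ["the", "computer", "is", "new"]
    print(seed_lexicon(german, english))
    print(lcs_ratio("elektrisch", "electric"))
    print(combined_score({"spelling": 0.7, "context": 0.5},
                         {"spelling": 2.0, "context": 1.0}))
```

In this sketch, a candidate word pair would be accepted as a cognate when `lcs_ratio` meets a chosen minimum, and the per-clue scores would be combined by `combined_score` to rank candidate translations. How the weights and the minimum ratio are chosen is left open.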
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional Application Serial No. 60/368,070, filed on Mar. 26, 2002, and U.S. Provisional Application Serial No. 60/368,447, filed on Mar. 27, 2002, the disclosures of which are incorporated by reference.
ORIGIN OF INVENTION
[0002] The research and development described in this application were supported by DARPA under grant number N66001-00-1-8914. The U.S. Government may have certain rights in the claimed inventions.
Provisional Applications (2)
| Number | Date | Country |
| 60368070 | Mar 2002 | US |
| 60368447 | Mar 2002 | US |