Claims
- 1. A method comprising:
developing an initial language model from a lexicon and segmentation derived from a received corpus; iteratively refining the initial language model by dynamically updating the lexicon and re-segmenting the corpus according to statistical principles until a threshold of predictive capability is achieved; and utilizing the iteratively refined language model in an application to predict a likelihood of another corpus.
- 2. A method according to claim 1, wherein the application is one or more of a spelling and/or grammatical checker, a word-processing application, a language translation application, a speech recognition application, and the like.
- 3. A method according to claim 1, wherein the step of developing an initial language model comprises:
generating a prefix tree data structure from items dissected from the received corpus; identifying sub-strings of N items or less from the prefix tree data structure; and populating the lexicon with the identified sub-strings.
- 4. A method according to claim 3, wherein N is equal to three (3).
- 5. A method according to claim 1, wherein predictive capability is quantitatively expressed as a perplexity measure.
- 6. A method according to claim 5, wherein the language model is refined until the perplexity measure is reduced below an acceptable predictive threshold.
- 7. A storage medium comprising a plurality of executable instructions including at least a subset of which, when executed, implement a method according to claim 1.
- 8. A computer system comprising:
a storage device having stored therein a plurality of executable instructions; and an execution unit, coupled to the storage device, to execute at least a subset of the plurality of executable instructions to implement a method according to claim 1.
- 9. A computer system comprising:
a storage device having stored therein a plurality of executable instructions; and an execution unit, coupled to the storage device, to execute at least a subset of the plurality of executable instructions to implement a method according to claim 1.
- 10. A method comprising:
developing an initial language model from a lexicon and segmentation derived from a received corpus; iteratively refining the initial language model by dynamically updating the lexicon and re-segmenting the corpus according to statistical principles until a threshold of predictive capability is achieved, wherein: the step of iteratively refining the language model comprises:
re-segmenting the corpus by determining, for each segment, a probability of occurrence for that segment; and updating the lexicon from the re-segmented corpus; and computing a predictive measure for the language model using the updated lexicon and the re-segmented corpus, wherein the predictive measure is language model perplexity.
- 11. A method according to claim 10, further comprising:
determining whether the predictive capability of the language model improved as a result of the steps of updating and re-segmenting; and performing additional updating and re-segmenting if the predictive capability improved until no further improvement is identified.
- 12. A method according to claim 10, wherein determining the probability of occurrence for a segment is calculated using an N-gram language model.
- 13. A method according to claim 12, wherein the N-gram language model is a tri-gram language model.
- 14. A method according to claim 10, wherein determining the probability of occurrence for a segment is calculated using two prior segments.
- 15. A storage medium comprising a plurality of executable instructions including at least a subset of which, when executed, implement a method according to claim 10.
- 16. A computer system comprising:
a storage device having stored therein a plurality of executable instructions; and an execution unit, coupled to the storage device, to execute at least a subset of the plurality of executable instructions to implement a method according to claim 10.
- 17. A method comprising:
developing an initial language model from a lexicon and segmentation derived from a received corpus; and iteratively refining the initial language model by dynamically updating the lexicon and re-segmenting the corpus according to statistical principles until a threshold of predictive capability is achieved, wherein the initial language model is derived using a maximum match technique.
- 18. A storage medium comprising a plurality of executable instructions including at least a subset of which, when executed, implement a method according to claim 17.
- 19. A computer system comprising:
a storage device having stored therein a plurality of executable instructions; and an execution unit, coupled to the storage device, to execute at least a subset of the plurality of executable instructions to implement a method according to claim 17.
- 20. A storage medium comprising a plurality of executable instructions including at least a subset of which, when executed, implement a language modeling agent, the language modeling agent including a function to develop an initial language model from a lexicon and segmentation derived from a received corpus, and a function to iteratively refine the initial language model by dynamically updating the lexicon and re-segmenting the corpus according to statistical principles until a threshold of predictive capability is achieved, wherein the language modeling agent quantitatively determines predictive capability using a perplexity measure.
- 21. A storage medium according to claim 20, wherein the function to develop the initial language model generates a prefix tree data structure from items dissected from the received corpus, identifies sub-strings of N items or less from the prefix tree, and populates the lexicon with the identified sub-strings.
- 22. A storage medium according to claim 20, further comprising at least a subset of instructions which, when executed, implements an application utilizing the language model developed by the language modeling agent.
- 23. A system comprising:
a storage medium drive, to removably receive a storage medium according to claim 20; and an execution unit, coupled to the storage medium drive, to access and execute at least a subset of the plurality of executable instructions populating the removably received storage medium to implement the language modeling agent.
- 24. A storage medium comprising a plurality of executable instructions including at least a subset of which, when executed, implement a language modeling agent, the language modeling agent including a function to develop an initial language model from a lexicon and segmentation derived from a received corpus, and a function to iteratively refine the initial language model by dynamically updating the lexicon and re-segmenting the corpus according to statistical principles until a threshold of predictive capability is achieved, wherein the language modeling agent derives the lexicon and segmentation from the received corpus using a maximum matching technique.
- 25. A storage medium according to claim 24, wherein the function to develop the initial language model generates a prefix tree data structure from items dissected from the received corpus, identifies sub-strings of N items or less from the prefix tree, and populates the lexicon with the identified sub-strings.
- 26. A storage medium comprising a plurality of executable instructions including at least a subset of which, when executed, implement a language modeling agent, the language modeling agent including a function to develop an initial language model from a lexicon and segmentation derived from a received corpus, and a function to iteratively refine the initial language model by dynamically updating the lexicon and re-segmenting the corpus according to statistical principles until a threshold of predictive capability is achieved, wherein:
the function to iteratively refine the initial language model refines the model by determining, for each segment, a probability of occurrence for that segment, and re-segmenting the corpus to reflect an improved segment probability; and the language modeling agent utilizes hidden Markov probability measures to determine the probability of occurrence for each segment.
- 27. A storage medium according to claim 26, wherein the function to develop the initial language model generates a prefix tree data structure from items dissected from the received corpus, identifies sub-strings of N items or less from the prefix tree, and populates the lexicon with the identified sub-strings.
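For illustration only, and not as part of the claims, the following Python sketch shows one way the elements recited above might fit together: a prefix tree built from the items of a corpus, a lexicon of sub-strings of N items or less read from that tree, a greedy maximum match segmentation, and a perplexity measure. It rests on several assumptions that the claims do not state: items are taken to be single characters, the `min_count` frequency cutoff is hypothetical, perplexity is computed from unigram relative frequencies rather than the tri-gram model of claim 13, and all function and class names are invented for the example.

```python
import math
from collections import Counter


class Node:
    """One prefix-tree node: an occurrence count plus child nodes keyed by item."""

    def __init__(self):
        self.count = 0
        self.children = {}


def build_prefix_tree(items, max_len=3):
    """Insert every sub-string of at most max_len items into a prefix tree."""
    root = Node()
    for start in range(len(items)):
        node = root
        for item in items[start:start + max_len]:
            node = node.children.setdefault(item, Node())
            node.count += 1
    return root


def lexicon_from_tree(root, min_count=2):
    """Read sub-strings out of the tree; min_count is an assumed cutoff."""
    lexicon = {}
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        for item, child in node.children.items():
            entry = prefix + item
            # Single items are always kept so that every corpus can be segmented.
            if len(entry) == 1 or child.count >= min_count:
                lexicon[entry] = child.count
            stack.append((entry, child))
    return lexicon


def max_match(items, lexicon, max_len=3):
    """Greedy forward maximum match: take the longest lexicon entry at each position."""
    segments, i = [], 0
    while i < len(items):
        for n in range(min(max_len, len(items) - i), 0, -1):
            piece = items[i:i + n]
            if n == 1 or piece in lexicon:
                segments.append(piece)
                i += n
                break
    return segments


def perplexity(segments):
    """Unigram perplexity of a segmentation under its own relative frequencies."""
    counts = Counter(segments)
    total = len(segments)
    log_prob = sum(math.log2(counts[s] / total) for s in segments)
    return 2 ** (-log_prob / total)


corpus = "abcabcab"
tree = build_prefix_tree(corpus, max_len=3)
lexicon = lexicon_from_tree(tree, min_count=2)
segments = max_match(corpus, lexicon, max_len=3)
print(segments, perplexity(segments))
```

An iterative refinement in the sense of claims 10 and 11 would wrap these steps in a loop that updates the lexicon from the new segmentation, re-segments, and stops once the perplexity no longer improves; that loop is omitted here to keep the sketch short.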
RELATED APPLICATIONS
[0001] This is a continuation of U.S. patent application Ser. No. 09/609,202, filed Jun. 20, 2000, now U.S. Pat. No. ______, which claims priority to provisional patent application No. 60/163,850, entitled “An iterative method for lexicon, word segmentation and language model joint optimization”, filed on Nov. 5, 1999 by the inventors of this application, each of which is incorporated herein by reference.
Provisional Applications (1)
| Number | Date | Country |
| --- | --- | --- |
| 60163850 | Nov 1999 | US |
Continuations (1)
|  | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | 09609202 | Jun 2000 | US |
| Child | 10842264 | May 2004 | US |