Claims
- 1. A method to predict the reading of Japanese ideographs of Japanese words and/or sentences comprising the steps of:
creating underlying readings for a data store having Japanese words with Japanese ideographs, said underlying readings created employing data comprising any of base kanji readings and quasi-phonological rules; generating a decision tree, said decision tree setting forth steps for predicting readings of said Japanese ideographs; and processing said Japanese words and/or sentences to provide readings of said Japanese ideographs of said Japanese words and/or sentences.
- 2. The method as recited in claim 1, wherein said creating step further comprises the step of providing a reading analyzer, said reading analyzer accepting as input said base kanji readings, said quasi-phonological rules, and a training corpus for processing to create said underlying readings, wherein said training corpus comprises said data store having Japanese words with Japanese ideographs.
- 3. The method as recited in claim 1, wherein said generating step further comprises the step of providing a learning algorithm, said learning algorithm setting forth steps for creating said decision tree.
- 4. The method as recited in claim 3, wherein said providing step comprises the step of furnishing an ID3-type machine learning algorithm.
- 5. The method as recited in claim 4, further comprising the steps of:
treating each Japanese ideograph in each Japanese word of said data store having Japanese words with Japanese ideographs as an event, wherein the outcome of each event is the correct underlying reading of said each Japanese ideograph in said Japanese word; classifying said events into sets having the same outcome, wherein said classifying step further comprises the steps of
dividing said sets into subsets where each member of said subsets has the same value of a classification attribute, wherein said classification attribute is a known fact about the event other than the outcome; calculating the entropy of each set before and after being divided to produce an entropy gain; and searching for the sequence of attribute tests that maximizes the entropy gain at each division to create a sequence of tests that classifies the events into homogenous subsets sharing the same outcome.
- 6. The method as recited in claim 1, wherein said processing step further comprises the step of:
accepting as input various data sources comprising any of said decision tree, said underlying readings, said quasi-phonological rules, and morphological analysis by a reading predictor, said reading predictor using said data sources to parse Japanese words and/or sentences to identify Japanese ideographs and their respective readings, wherein said morphological analysis is produced by a morphological analyzer using linguistic morphology rules.
- 7. The method as recited in claim 6, further comprising the steps of:
analyzing Japanese words and/or sentences by a morphological analyzer to determine their structure, wherein said structure comprises Japanese ideographs; calculating the classification attributes for said Japanese ideographs; walking said decision tree according to the value of said calculated attributes; selecting the appropriate underlying readings for said Japanese ideographs; and applying said quasi-phonological rules to said underlying readings to produce surface readings.
- 8. A computer readable storage medium comprising computer-executable instructions for instructing a computer to perform the acts recited in claim 1.
- 9. A system to predict readings of Japanese ideographs comprising:
a Japanese reading analyzer, said reading analyzer accepting Japanese language data as input to produce underlying readings for Japanese ideographs in said corpus of Japanese words, and a decision tree used in predicting the reading of Japanese ideographs; and a Japanese reading predictor, said reading predictor accepting said produced decision tree, said Japanese language data, and a morphological analysis as input to operate on Japanese words and/or sentences to provide reading predictions for Japanese ideographs present in said inputted Japanese words and/or sentences.
- 10. The system as recited in claim 9, wherein said Japanese language data comprises any of basic kanji readings, a corpus of Japanese words and morphemes, and quasi-phonological rules.
- 11. The system as recited in claim 9, wherein said morphological analysis is created by a morphological analyzer, said morphological analyzer having the ability to process Japanese words and/or sentences according to pre-defined Japanese language morphology rules.
- 12. The system as recited in claim 10, wherein said morphological analyzer accepts as input Japanese words and/or sentences to calculate classification attributes for Japanese ideographs present in said inputted Japanese words and/or sentences, wherein said classification attributes assist said reading predictor to create a surface reading for Japanese ideographs in said inputted Japanese words and/or sentences.
- 13. The system as recited in claim 12, wherein said classification attributes comprise any of: IsBoundMorpheme, IsStemMorpheme, IsMorphInitial, IsMorphFinal, PrecedesKanji, FollowsKanji, PrecedesHiragana, FollowsHiragana, PrecedesKatakana, FollowsKatakana, AllKanji, IsUnigram, IsBigram, IsTrigram, IsTetragram, IsFactoid, IsBoundR, IsBoundL, MorphIDEquals(X), WordIDEquals(X), NextCharEquals(X), ThirdCharEquals(X), and PrevCharEquals(X).
- 14. The system as recited in claim 13, wherein said classification attributes are rooted in Japanese linguistic rules.
- 15. The system as recited in claim 9, wherein said reading analyzer comprises a learning algorithm, said learning algorithm providing steps to facilitate the creation of said decision tree.
- 16. The system as recited in claim 15, wherein said learning algorithm is an ID3-type machine learning algorithm.
- 17. The system as recited in claim 9, wherein said system is incorporated as part of a computing application, said computing application providing features that allow for the reading of Japanese ideographs for style checking.
- 18. A method to allow for effective and reliable reading predictions of Japanese ideographs performing the acts of:
providing a reading analyzer, said reading analyzer accepting as input various Japanese language data; operating said reading analyzer in a learning mode, wherein said reading analyzer operates on said inputted data to produce underlying readings for said Japanese language data and to generate a decision tree for use when predicting readings of Japanese ideographs; providing a reading predictor, said reading predictor employing said produced underlying readings and said generated decision tree to determine characteristics for Japanese ideographs in inputted Japanese words and/or sentences, wherein said characteristics contribute to the prediction of readings for said Japanese ideographs.
- 19. The method as recited in claim 18, wherein said providing said reading analyzer act further comprises the act of providing Japanese language data comprising any of base kanji readings, Japanese lexicon, and quasi-morphological rules.
- 20. The method as recited in claim 18, wherein said providing said reading predictor act further comprises the act of furnishing a morphological analysis for said inputted Japanese words and/or sentences, said morphological analysis generated by a morphological analyzer operating on said inputted Japanese words and/or sentences using Japanese linguistic morphology rules.
- 21. A computer readable storage medium comprising computer-executable instructions for instructing a computer to perform the acts recited in claim 18.
- 22. In a computer system having storage, a method of representing analysis of an input string of natural language characters useful to identify readings of said characters comprising the portions of the input string, comprising the computer-implemented steps of:
processing the input string to identify the natural language characters in the string and morphemes in the string; and creating a structure in storage that holds characteristics of said natural language characters, such that the structure may be used to identify the readings of said natural language characters that comprise said input string, said characteristics representative of a decision tree comprising connected nodes including root and leaves, wherein each path of the decision tree from the root to a leaf represents an alternative reading analysis for said natural language characters.
- 23. The method as recited in claim 22, wherein the input string comprises Japanese characters having Japanese ideographs.
- 24. The method as recited in claim 22, wherein the step of processing said input string comprises processing the input string using linguistic morphology rules.
- 25. The method as recited in claim 24 further comprising the step of processing said input string by a morphological analyzer.
- 26. The method as recited in claim 22, wherein the step of creating said structure comprises employing a learning algorithm to generate said decision tree.
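For illustration only, the entropy-gain search recited in claim 5 and the tree-walking step recited in claim 7 can be sketched in Python as follows. This is a minimal ID3-style sketch under stated assumptions, not the implementation described in the application. The function names (`entropy`, `entropy_gain`, `build_tree`, `walk_tree`), the tuple-based tree representation, the boolean meaning given to the attribute names, and the four toy events for the ideograph 人 are all hypothetical. The sketch does not cover the creation of underlying readings or the application of quasi-phonological rules.

```python
import math
from collections import Counter


def entropy(outcomes):
    """Shannon entropy, in bits, of a list of outcome labels."""
    total = len(outcomes)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(outcomes).values())


def entropy_gain(events, attribute):
    """Entropy of the set before dividing on `attribute`, minus the
    size-weighted entropy of the resulting subsets."""
    before = entropy([outcome for _, outcome in events])
    subsets = {}
    for attrs, outcome in events:
        subsets.setdefault(attrs[attribute], []).append(outcome)
    after = sum(len(s) / len(events) * entropy(s) for s in subsets.values())
    return before - after


def build_tree(events, attributes):
    """Recursively pick the attribute test with the largest entropy gain.
    A leaf is a reading string; an inner node is (attribute, branches)."""
    outcomes = [outcome for _, outcome in events]
    if len(set(outcomes)) == 1 or not attributes:
        return Counter(outcomes).most_common(1)[0][0]
    best = max(attributes, key=lambda a: entropy_gain(events, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {attrs[best] for attrs, _ in events}:
        subset = [(attrs, o) for attrs, o in events if attrs[best] == value]
        branches[value] = build_tree(subset, remaining)
    return (best, branches)


def walk_tree(tree, attrs):
    """Follow attribute tests from the root until a leaf reading is reached.
    Raises KeyError for an attribute value never seen in training."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[attrs[attribute]]
    return tree


# Hypothetical toy events for the ideograph 人: (attributes, underlying reading).
# The attribute names come from claim 13; their boolean meaning is assumed here.
EVENTS = [
    ({"IsUnigram": True, "PrecedesKanji": False, "FollowsKanji": False}, "hito"),
    ({"IsUnigram": True, "PrecedesKanji": False, "FollowsKanji": False}, "hito"),
    ({"IsUnigram": False, "PrecedesKanji": True, "FollowsKanji": False}, "jin"),
    ({"IsUnigram": False, "PrecedesKanji": False, "FollowsKanji": True}, "jin"),
]
ATTRIBUTES = ["IsUnigram", "PrecedesKanji", "FollowsKanji"]

TREE = build_tree(EVENTS, ATTRIBUTES)
PREDICTED = walk_tree(
    TREE, {"IsUnigram": False, "PrecedesKanji": True, "FollowsKanji": False})
print(TREE, PREDICTED)
```

On this toy data the `IsUnigram` test alone separates the two readings, so it yields the largest entropy gain and becomes the root of the tree. A realistic training corpus would require many more events and attribute tests per ideograph.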
PRIORITY
[0001] This application is related to and claims priority under 35 U.S.C. §119(e) to U.S. Provisional Patent Application Serial No. 60/219,981, filed Jul. 21, 2000, entitled “METHOD FOR PREDICTING THE READINGS OF JAPANESE IDEOGRAPHS,” the contents of which are hereby incorporated by reference in their entirety.
Provisional Applications (1)
| Number | Date | Country |
| --- | --- | --- |
| 60219981 | Jul 2000 | US |