Claims
- 1. A method in a computer system for identifying individual words occurring in a sentence of Chinese text, the method comprising the steps of:
for each of a plurality of textual characters:
storing an indication of the characters that occur in the second position of words that begin with the character; and storing an indication of the positions in which the character occurs in words;
for each of a plurality of contiguous groups of characters occurring in the sentence:
determining whether the character occurring in the second position of the group is indicated to occur in words that begin with the character occurring in the first position of the group; if it is determined that the character occurring in the second position of the group is indicated to occur in words that begin with the character occurring in the first position of the group, determining whether every character of the group is indicated to occur in words in the position in which it occurs in the group; if it is determined that every character of the group is indicated to occur in words in the position in which it occurs in the group, comparing the group of characters to a list of words used in the Chinese language to determine whether the group of characters is a word used in the Chinese language; submitting the groups of characters determined to be words used in the Chinese language to a syntactic parser; receiving from the syntactic parser a parse structure identifying the syntactic structure of the sentence, the parse structure specifying a subset of the submitted groups of characters as being part of the syntactic structure of the sentence; and identifying the groups of characters specified by the parse structure as words occurring in the sentence.
- 2. The method of claim 1, further comprising the steps of:
reading the list of words used in the Chinese language; and generating the stored indications based upon the occurrence of characters in the read list of words.
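As an illustration of claim 2, the following is a minimal sketch of how the two stored indications might be generated from a word list. The function name `build_indications`, the use of Python sets, and the choice to record positions as (word length, 1-based position) pairs are assumptions made for this example; claim 1 itself only requires "an indication of the positions in which the character occurs in words". ASCII letters stand in for Chinese characters.

```python
from collections import defaultdict

def build_indications(word_list):
    """Build the two per-character lookup tables from a list of words."""
    # second_chars[c]: characters seen in the second position of words beginning with c
    second_chars = defaultdict(set)
    # positions[c]: (word_length, position) pairs in which c has been seen (1-based);
    # pairing position with length is an assumption of this sketch
    positions = defaultdict(set)
    for word in word_list:
        if len(word) >= 2:
            second_chars[word[0]].add(word[1])
        for i, ch in enumerate(word, start=1):
            positions[ch].add((len(word), i))
    return second_chars, positions

# Toy word list; each letter stands in for one Chinese character.
second_chars, positions = build_indications(["ab", "abc", "bd", "c"])
```

Both tables are derived in a single pass over the word list, so they can be regenerated whenever the list changes.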
- 3. The method of claim 1 wherein the steps performed for a plurality of contiguous groups of characters occurring in the sentence are not performed for at least one contiguous group of characters containing characters that are also contained by a larger group of characters determined to be a word that cannot contain smaller words.
- 4. The method of claim 3 wherein the steps performed for a plurality of contiguous groups of characters occurring in the sentence are performed for at least one selected contiguous group of characters, the selected group including a first and second character, the first character being the last character of a preceding group of characters determined to be a word that cannot contain smaller words where:
the second character is indicated to occur in words that begin with the first character; and the preceding group of characters without the first character is a word used in the Chinese language.
- 5. The method of claim 1, further comprising the step of, before the performance of the determining steps, retrieving the stored indications.
- 6. A method in a computer system for identifying individual words occurring in a sentence of text of an unsegmented language, the method comprising the steps of:
for each of a plurality of textual characters:
storing an indication of the characters that occur in the second position of words that begin with the character; and storing an indication of the positions in which the character occurs in words;
for each of a plurality of contiguous groups of characters occurring in the sentence:
determining whether the character occurring in the second position of the group is indicated to occur in words that begin with the character occurring in the first position of the group; if it is determined that the character occurring in the second position of the group is indicated to occur in words that begin with the character occurring in the first position of the group, determining whether every character of the group is indicated to occur in words in the position in which it occurs in the group; if it is determined that every character of the group is indicated to occur in words in the position in which it occurs in the group, comparing the group of characters to a list of words to determine whether the group of characters is a word; submitting the groups of characters determined to be words to a syntactic parser; receiving from the syntactic parser a parse structure identifying the syntactic structure of the sentence, the parse structure specifying a subset of the submitted groups of characters as being part of the syntactic structure of the sentence; and identifying the groups of characters specified by the parse structure as words occurring in the sentence.
- 7. A method in a computer system for identifying individual words occurring in a sentence of text, the method comprising the steps of:
for each of a plurality of textual characters:
storing an indication of the characters that occur in the second position of words that begin with the character; and storing an indication of the positions in which the character occurs in words;
for each of a plurality of contiguous groups of characters occurring in the sentence:
determining whether the character occurring in the second position of the group is indicated to occur in words that begin with the character occurring in the first position of the group; if it is determined that the character occurring in the second position of the group is indicated to occur in words that begin with the character occurring in the first position of the group, determining whether every character of the group is indicated to occur in words in the position in which it occurs in the group; if it is determined that every character of the group is indicated to occur in words in the position in which it occurs in the group, comparing the group of characters to a list of words to determine whether the group of characters is a word; submitting the groups of characters determined to be words to a syntactic parser; receiving from the syntactic parser a parse structure identifying the syntactic structure of the sentence, the parse structure specifying a subset of the submitted groups of characters as being part of the syntactic structure of the sentence; and identifying the groups of characters specified by the parse structure as words occurring in the sentence.
- 8. A method in a computer system for identifying the words of which a body of natural language text is comprised, the body of natural language text comprising an ordered sequence of characters starting with a first character, ending with a last character, and containing a selected interior character between the first and last characters, the method comprising the steps of:
identifying within the sequence of characters a first word containing the first character and the selected interior character; identifying within the sequence of characters a second word containing the last character but not the selected interior character, such that the first and second words may be concatenated to form the sequence of characters; identifying within the sequence of characters a third word containing the first character but not the selected interior character; identifying within the sequence of characters a fourth word containing the selected interior character and the last character, such that the third and fourth words may be concatenated to form the sequence of characters; submitting the first, second, third, and fourth words to a syntactic parser to generate a parse tree representing the syntactic structure of the sequence of characters, the parse tree containing either the first and second words or the third and fourth words; if the parse tree contains the first and second words, indicating that the first and second words comprise the body of natural language text; and if the parse tree contains the third and fourth words, indicating that the third and fourth words comprise the body of natural language text.
- 9. The method of claim 8 wherein the submitting step includes the step of submitting to the syntactic parser a supersequence of characters containing the sequence of characters and comprising a sentence to generate a parse tree representing the syntactic structure of the sentence.
- 10. A computer-readable medium whose contents cause a computer system to identify the words of which a body of natural language text is comprised, the body of natural language text comprising an ordered sequence of characters starting with a first character, ending with a last character, and containing a selected interior character between the first and last characters, by performing the steps of:
identifying within the sequence of characters a first word containing the first character and the selected interior character; identifying within the sequence of characters a second word containing the last character but not the selected interior character, such that the first and second words may be concatenated to form the sequence of characters; identifying within the sequence of characters a third word containing the first character but not the selected interior character; identifying within the sequence of characters a fourth word containing the selected interior character and the last character, such that the third and fourth words may be concatenated to form the sequence of characters; submitting the first, second, third, and fourth words to a syntactic parser to generate a parse tree representing the syntactic structure of the sequence of characters, the parse tree containing either the first and second words or the third and fourth words; if the parse tree contains the first and second words, indicating that the first and second words comprise the body of natural language text; and if the parse tree contains the third and fourth words, indicating that the third and fourth words comprise the body of natural language text.
- 11. The computer-readable medium of claim 10 wherein the submitting step includes the step of submitting to the syntactic parser a supersequence of characters containing the sequence of characters and comprising a sentence to generate a parse tree representing the syntactic structure of the sentence.
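The overlapping-word situation recited in claims 8 to 11 can be sketched as follows. This is only an illustration under stated assumptions: `two_word_splits`, `resolve`, and `toy_parse` are hypothetical names, and `toy_parse` is a stand-in that returns a fixed set of words rather than a real syntactic parser. ASCII letters stand in for characters of the natural language text.

```python
def two_word_splits(seq, lexicon):
    """Return every way to cut seq into two words that are both in the lexicon."""
    return [(seq[:i], seq[i:])
            for i in range(1, len(seq))
            if seq[:i] in lexicon and seq[i:] in lexicon]

def resolve(seq, lexicon, parse):
    """Submit all candidate words to a parser and keep the split its result contains."""
    splits = two_word_splits(seq, lexicon)
    candidates = {word for pair in splits for word in pair}
    chosen = parse(candidates)  # hypothetical: words appearing in the parse tree
    for pair in splits:
        if set(pair) <= chosen:
            return pair
    return None

def toy_parse(words):
    """Stand-in for a syntactic parser; always prefers the words 'a' and 'bc'."""
    return {"a", "bc"} & set(words)

lexicon = {"a", "ab", "bc", "c"}
splits = two_word_splits("abc", lexicon)
result = resolve("abc", lexicon, toy_parse)
```

Here the interior character `b` belongs to the first word in one split (`ab` + `c`) and to the last word in the other (`a` + `bc`); the parser's output decides which pair is reported.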
- 12. A method in a computer system for selecting from a sequence of natural language characters combinations of characters that may be words using indications for each of a plurality of characters of the characters that occur in the second position of words that begin with the character and of the positions in which the character occurs in words, the method comprising the steps of, for each of a plurality of contiguous combinations of characters occurring in the sequence:
determining whether the character occurring in the second position of the combination is indicated to occur in words that begin with the character occurring in the first position of the combination; if it is determined that the character occurring in the second position of the combination is indicated to occur in words that begin with the character occurring in the first position of the combination, determining whether every character of the combination is indicated to occur in words in the position in which it occurs in the combination; and if it is determined that every character of the combination is indicated to occur in words in the position in which it occurs in the combination, determining that the combination of characters may be a word.
- 13. The method of claim 12, further comprising the step of comparing the combination of characters to a list of words to determine whether the combination of characters is a word.
- 14. A computer-readable medium whose contents cause a computer system to select from a sequence of natural language characters combinations of characters that may be words using indications for each of a plurality of characters of the characters that occur in the second position of words that begin with the character and of the positions in which the character occurs in words, by performing the steps of, for each of a plurality of contiguous combinations of characters occurring in the sequence:
determining whether the character occurring in the second position of the combination is indicated to occur in words that begin with the character occurring in the first position of the combination; if it is determined that the character occurring in the second position of the combination is indicated to occur in words that begin with the character occurring in the first position of the combination, determining whether every character of the combination is indicated to occur in words in the position in which it occurs in the combination; and if it is determined that every character of the combination is indicated to occur in words in the position in which it occurs in the combination, determining that the combination of characters may be a word.
- 15. The computer-readable medium of claim 14, further comprising the step of comparing the combination of characters to a list of words to determine whether the combination of characters is a word.
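The two-stage filtering of claims 12 to 15 might be sketched as below. The function name `candidate_words`, the `max_len` bound, the literal toy tables, and the (word length, 1-based position) encoding of positions are all assumptions for illustration. Groups of a single character are not covered, since the second-position test needs at least two characters.

```python
def candidate_words(text, second_chars, positions, lexicon, max_len=4):
    """Return (start, group) pairs that pass both filters and the word-list lookup."""
    found = []
    for start in range(len(text)):
        for length in range(2, max_len + 1):
            group = text[start:start + length]
            if len(group) < length:
                break  # ran off the end of the text
            # Filter 1: may the second character follow the first at the start of a word?
            if group[1] not in second_chars.get(group[0], ()):
                break  # every longer group from this start shares the same first two characters
            # Filter 2: is every character known to occur at this position for this length?
            if not all((length, i) in positions.get(ch, ())
                       for i, ch in enumerate(group, start=1)):
                continue
            # Only groups surviving both cheap filters reach the word-list comparison.
            if group in lexicon:
                found.append((start, group))
    return found

# Toy tables; each letter stands in for one character of the language.
lexicon = {"ab", "abc", "bd"}
second_chars = {"a": {"b"}, "b": {"d"}}
positions = {
    "a": {(2, 1), (3, 1)},
    "b": {(2, 2), (3, 2), (2, 1)},
    "c": {(3, 3)},
    "d": {(2, 2)},
}
found = candidate_words("abcbd", second_chars, positions, lexicon)
```

The point of the ordering is that the two table checks are cheap set lookups, so most non-word groups are discarded before the comparatively expensive word-list comparison.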
- 16. A computer memory containing a word segmentation data structure for use in identifying individual words occurring in natural language text, the data structure comprising:
for each of a plurality of characters:
an identification of characters that occur in the second position of words that begin with the character, and for words containing the character an identification of the length of the word and the character position within the word occupied by the character; and
for each of a plurality of words:
an indication of whether the sequence of characters that comprises the word may also comprise a series of shorter words.
- 17. The computer memory of claim 16 wherein for each of a plurality of words the data structure further comprises an indication of probability of whether the word occurs in natural language text as a function of adjacent characters.
- 18. The computer memory of claim 17 wherein for each of a plurality of words having an indication of probability, the data structure further comprises an associated list of characters.
- 19. A computer memory containing a word segmentation data structure for use in identifying individual words occurring in natural language text, the data structure comprising:
for each of a plurality of characters:
for words containing the character:
an identification of the length of the word and the character position within the word occupied by the character, such that, when a word candidate having a length is encountered while word segmenting natural language text, a determination that the identification for one of the characters comprising the word candidate does not identify the length of the word candidate and the character position of the character within the word candidate may be used to determine that the word candidate does not constitute a word in the natural language text.
- 20. A computer memory containing a word segmentation data structure for use in identifying individual words occurring in natural language text, the data structure comprising:
for each of a plurality of characters:
an identification of characters that occur in the second position of words that begin with the character, such that, when first and second characters are encountered while word segmenting natural language text, a determination that the identification for the first character does not identify the second character may be used to determine that the first and second characters do not constitute the beginning of a word in the natural language text.
- 21. A computer memory containing a word segmentation data structure for use in identifying individual words occurring in natural language text, the data structure comprising:
for each of a plurality of words:
an indication of whether the sequence of characters that comprises the word may also comprise a series of shorter words, such that, when a word is encountered while word segmenting natural language text, it may be determined with reference to the indication for the word whether to investigate whether the word should be segmented into shorter words.
- 22. A computer memory containing a word segmentation data structure for use in identifying individual words occurring in natural language text, the data structure comprising:
for each of a plurality of words:
an indication of probability of whether the word occurs in natural language text as a function of adjacent characters.
- 23. The computer memory of claim 22 wherein for each of a plurality of words having an indication of probability, the data structure further comprises an associated list of characters.
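One possible in-memory layout for the data structure of claims 16 to 23 is sketched below. The class and field names (`CharEntry`, `WordEntry`, `SegmentationTables`, `probability_hint`, `associated_chars`) are hypothetical, and representing the indication of probability as an optional float is an assumption; the claims do not fix a representation or say how the associated list of characters is interpreted.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

@dataclass
class CharEntry:
    # Characters that occur in the second position of words beginning with this character
    second_chars: Set[str] = field(default_factory=set)
    # (word length, 1-based position) pairs this character occupies in known words
    length_positions: Set[Tuple[int, int]] = field(default_factory=set)

@dataclass
class WordEntry:
    # Whether the word's characters may also form a series of shorter words
    may_contain_shorter_words: bool = False
    # Assumed representation of the probability indication (claims 17 and 22)
    probability_hint: Optional[float] = None
    # Associated list of characters (claims 18 and 23); semantics left open here
    associated_chars: Set[str] = field(default_factory=set)

@dataclass
class SegmentationTables:
    chars: Dict[str, CharEntry] = field(default_factory=dict)
    words: Dict[str, WordEntry] = field(default_factory=dict)

    def can_begin_word(self, first, second):
        """Claim 20: may `first` followed by `second` begin a word?"""
        entry = self.chars.get(first)
        return entry is not None and second in entry.second_chars

    def positions_allow(self, candidate):
        """Claim 19: does every character occur at its position for this length?"""
        n = len(candidate)
        for i, ch in enumerate(candidate, start=1):
            entry = self.chars.get(ch)
            if entry is None or (n, i) not in entry.length_positions:
                return False
        return True

tables = SegmentationTables()
tables.chars["a"] = CharEntry(second_chars={"b"}, length_positions={(2, 1)})
tables.chars["b"] = CharEntry(length_positions={(2, 2)})
tables.words["ab"] = WordEntry(may_contain_shorter_words=False)
```

Keeping the per-character and per-word entries in separate maps mirrors the claim language, which describes them as independent parts of the structure.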
- 24. For a selected sequence of characters comprising a single selected Chinese word, a method in a computer system for determining whether the selected sequence of characters may also comprise a series of two or more words, the method comprising the steps of:
if the selected sequence of characters is at least four characters long, determining that the selected sequence of characters may not also comprise a series of two or more words; if all of the characters of the selected sequence do not constitute single-character words, determining that the selected sequence of characters may not also comprise a series of two or more words; if the word contains a word commonly used as a derivational affix, determining that the selected sequence of characters may not also comprise a series of two or more words; and if it is not determined that the selected sequence of characters may not also comprise a series of two or more words, determining that the selected sequence of characters may also comprise a series of two or more words.
- 25. The method of claim 24, further comprising the step of, if an adjacent pair of characters in the selected sequence are often divided into separate words where they appear adjacently, determining that the selected sequence of characters may also comprise a series of two or more words.
- 26. For a selected sequence of characters comprising a single selected natural language word, a computer-readable medium whose contents cause a computer system to determine whether the selected sequence of characters may also comprise a series of two or more words by performing the steps of:
if the selected sequence of characters is at least four characters long, determining that the selected sequence of characters may not also comprise a series of two or more words; if all of the characters of the selected sequence do not constitute single-character words, determining that the selected sequence of characters may not also comprise a series of two or more words; if the word contains a word commonly used as a derivational affix, determining that the selected sequence of characters may not also comprise a series of two or more words; and if it is not determined that the selected sequence of characters may not also comprise a series of two or more words, determining that the selected sequence of characters may also comprise a series of two or more words.
- 27. The computer-readable medium of claim 26 wherein the contents of the computer-readable medium further cause the computer system to perform the step of, if an adjacent pair of characters in the selected sequence are often divided into separate words where they appear adjacently, determining that the selected sequence of characters may also comprise a series of two or more words.
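The tests of claims 24 to 27 can be read as a short decision procedure, sketched below. Several points are assumptions of this sketch rather than statements of the claims: the function and parameter names are hypothetical, the phrase "all of the characters ... do not constitute single-character words" is read as "not every character is a single-character word", and the often-divided-pair test of claims 25 and 27 is given precedence over the other three tests because the claims do not state an order.

```python
def may_contain_shorter_words(word, single_char_words, affixes,
                              often_split_pairs=frozenset()):
    """Decide whether a known word may also be a series of two or more words."""
    # Claims 25 and 27: an adjacent pair that is often divided allows splitting.
    # Checking this first is an assumption; the claims do not fix the precedence.
    if any(word[i:i + 2] in often_split_pairs for i in range(len(word) - 1)):
        return True
    # Words of four or more characters are treated as indivisible.
    if len(word) >= 4:
        return False
    # Every character must itself be a single-character word.
    if not all(ch in single_char_words for ch in word):
        return False
    # A word containing a common derivational affix is treated as indivisible.
    if any(affix in word for affix in affixes):
        return False
    # No test ruled it out, so the word may also be a series of shorter words.
    return True

# Toy data; each letter stands in for one Chinese character.
singles = set("abcdz")
affixes = {"z"}
```

For example, `may_contain_shorter_words("abc", singles, affixes)` passes all three exclusion tests, while `"abz"` is excluded by the affix test and `"abcd"` by the length test unless one of its adjacent pairs is listed in `often_split_pairs`.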
CROSS REFERENCE TO RELATED APPLICATION
[0001] This application is a continuation-in-part application of application Ser. No. 09/023586 filed Feb. 13, 1998, the content of which is incorporated herein in its entirety.
Continuation in Parts (1)
| | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | 09023586 | Feb 1998 | US |
| Child | 09087468 | May 1998 | US |