The present invention relates to a method, apparatus, and a storage device or storage medium storing a program for causing a computer to classify a document for each document style using frozen patterns included in the document.
A large number of methods have been proposed to extract information from a large quantity of electronic documents. However, there are various document styles, such as (1) a formal written document having grammatically correct sentences, e.g., a newspaper article, (2) a somewhat informal document having sentences that can be understood but are not grammatically correct and often include the spoken language, e.g., a comment on an electronic bulletin board, and (3) a hurriedly written, very informal document, such as a daily report. Because there is, to our knowledge, no document processing technique that can consistently handle documents of these various document styles, it is necessary to select a document processing technique suitable for each document style. Therefore, it is necessary to classify documents by document style.
A known document classification method classifies documents on the basis of statistical information of words appearing in the documents. For example, JP 6-75995 A and the like disclose a method of using frequencies of appearance or the like of respective keywords in documents belonging to categories as relevance ratios to the categories. The relevance ratios of words appearing in an input document are added or otherwise combined for each category to calculate a relevance ratio to each category. The input document is classified into the category having the largest relevance ratio. In JP 9-16570 A, a decision tree for deciding a classification is formed in advance on the basis of the presence or absence of document information. The decision tree uses keywords to decide a classification. In JP 11-45247 A, the similarity between an input document and a typical document in a category is calculated to classify the input document. Other prior art references of interest are: JP 6-75995 A; JP 9-16570 A; JP 11-45247 A; "Natural Language Processing" (edited by Makoto Nagao et al., Iwanami Shoten); J. Ross Quinlan, "C4.5: Programs for Machine Learning," Morgan Kaufmann Publishers (1993); and Yoav Freund and Robert Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 55(1): 119-139, 1997.
In these methods, a document is divided into word units. As a result, in order to acquire a keyword, it is necessary to apply natural language processing, such as morphological analysis, to a document that is not “written word by word” such as a document in Japanese or Chinese.
However, since documents have various document styles, such as a newspaper article, a thesis, and an e-mail, it is difficult to accurately resolve documents of these various styles into word units, even when natural language processing is applied using a dictionary or the like, because the styles differ in their proportions of new words, abbreviations, errors in writing, grammatical errors, and the like. In addition, since these methods mainly use words, such as nouns or keywords, to indicate content, the methods are suitable for classifying documents by topic. However, the prior art methods are not suitable for classifying documents by document style, such as classifying input documents into a newspaper article style, a comment style, and so on.
It is an object of the present invention to provide a new and improved apparatus for and method of classifying a document by document style, on the basis of document style information, rather than by topic.
It is another object of the invention to realize document classification based on textual analysis without depending upon morphological analysis.
In a set of documents having the same document style, common characteristic patterns are found in expressions, ends of words, and/or the like. In accordance with an aspect of the present invention, frozen patterns that frequently appear in each document style in this way (hereinafter referred to as style-specific frozen patterns) are prepared as a reference dictionary for each document style. A frozen pattern list is extracted for an unclassified input document on the basis of an appearance state of style-specific frozen patterns present in the document. Confidence is calculated for each document style on the basis of the frozen pattern list. A document style to which the input document belongs is determined on the basis of the confidence to classify the document.
As described above, according to one aspect of the present invention, classification according to document style is realized rather than classification according to each document topic. Document processing suitable for a specific document style is selected by classifying documents for each document style. Since a frozen pattern is an expression specific to a document style, there is an advantage that the frozen pattern is less likely to be affected by unknown words, coined words, and the like that generally cause a problem in document classification.
The above and still further objects, features and advantages of the present invention will become apparent upon consideration of the following detailed descriptions of the specific embodiment thereof, especially when taken in conjunction with the accompanying drawings.
Examples of the document style classifications are (1) an introductory article, which is a written, grammatically correct document, (2) an electronic bulletin board, which is a document in a spoken language, and (3) a daily report, which is a hurriedly written document. In this specification, the document style of an introductory article (document style 1) and the document style of an electronic bulletin board (document style 2) are used as examples of document styles to be classified.
The style-specific frozen patterns are stored for each document style in the style-specific frozen pattern dictionary which is referred to by the textual analyzer 202. An example of style-specific frozen patterns stored in the style-specific frozen pattern dictionary for the document style 1 is shown in Table 1 below.
Next, an example of style-specific frozen patterns stored in the style-specific frozen pattern dictionary 105 for document style 2 is shown in Table 2.
Style-specific frozen patterns to be stored in the style-specific frozen pattern dictionary 105 are automatically extracted from a set of documents. The documents are classified in advance for each document style. The patterns extracted from the classified documents are stored in the style-specific frozen pattern dictionary 105.
The first step of the extraction method is to extract, from a set of documents, character strings with a high frequency among character strings of an arbitrary length. The extracted strings are considered to be candidate strings. A method of efficiently calculating a frequency statistic of character strings of an arbitrary length is described in detail in "Natural Language Processing" (edited by Makoto Nagao et al., Iwanami Shoten). Then, for each candidate string, the front side entropy Ef of the candidate string is calculated from the character set Wf={wf1, wf2, . . . , wfn} adjacent to the front of the candidate string, while the rear side entropy Er of the candidate string is calculated from the character set Wr={wr1, wr2, . . . , wrm} adjacent to the rear of the candidate string. The entropies Ef and Er are calculated in accordance with Expressions (1)-(4).
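Expressions (1)-(4) themselves are not reproduced in this text. Consistent with the symbol definitions given below, a standard reconstruction of the boundary entropies is:

Ef = −Σ (i=1..n) P(wfi|S) log P(wfi|S), where P(wfi|S) = f(wfiS)/f(S)

Er = −Σ (i=1..m) P(wri|S) log P(wri|S), where P(wri|S) = f(Swri)/f(S)

The grouping of the two entropies and the two probability definitions into four numbered expressions is an assumption; only the entropies Ef and Er are used in what follows.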
In Expressions (1)-(4), S is a candidate string, f(S) is the number of times the candidate string appears, f(wfiS) is the number of times a character string wfiS, in which wfi is adjacent to the front of S, appears, and f(Swri) is the number of appearances of a character string Swri, in which wri is adjacent to the rear of S. The entropy of Expression (1) has a large value if the character string S is adjacent to various characters in front of the string with roughly equal occurrence probabilities; that is, if there is a boundary of expression in front of the character string. Conversely, it has a small value if there are fewer kinds of characters to which the character string S is adjacent and the occurrence probabilities are biased; that is, if the character string S is a part of a larger expression including an adjacent character. Similarly, the entropy of Expression (2) has (1) a large value if there is an expression boundary at the rear of the character string S and (2) a small value if the character string S is a part of a larger expression. Then, only a candidate string having both front and rear entropies larger than an appropriate threshold value is extracted as a style-specific frozen pattern.
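The extraction criterion above can be sketched in a short program. The function names and the toy corpus below are illustrative assumptions, not taken from the embodiment; adjacency frequencies are counted over raw characters as described.

```python
import math
from collections import Counter

def boundary_entropies(corpus, s):
    """Front and rear adjacency entropies of candidate string s.

    corpus: list of document strings. Returns (Ef, Er); a high value on a
    side suggests an expression boundary on that side of s.
    """
    front, rear = Counter(), Counter()
    for doc in corpus:
        start = doc.find(s)
        while start != -1:
            if start > 0:
                front[doc[start - 1]] += 1      # character adjacent to the front of s
            end = start + len(s)
            if end < len(doc):
                rear[doc[end]] += 1             # character adjacent to the rear of s
            start = doc.find(s, start + 1)

    def entropy(counts):
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    return entropy(front), entropy(rear)

def extract_frozen_patterns(corpus, candidates, threshold):
    """Keep only candidates whose front AND rear entropies exceed the threshold."""
    return [s for s in candidates
            if min(boundary_entropies(corpus, s)) > threshold]
```

For example, a candidate followed by a fixed character (so that the rear entropy is zero) is rejected, while a candidate with varied neighbors on both sides is kept.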
Table 3 is an example of candidate strings obtained from a set of documents belonging to the document style 1 and entropies thereof, while Table 4 is an example of candidate strings obtained from a set of documents belonging to the document style 2 and entropies thereof.
The frozen pattern list generator 203 generates a frozen pattern list for each sentence. For example, in the case in which an input document has N sentences and there are M document styles to be classified, N×M frozen pattern lists are generated by the frozen pattern list generator 203. Each generated frozen pattern list is a list in which the style-specific frozen patterns appearing in each sentence, among the style-specific frozen patterns stored in the style-specific frozen pattern dictionary 105, are enumerated for each document style. In this document, a sample Japanese sentence is considered as inputted example sentence 1. Table 5 is a frozen pattern list for document style 1 and document style 2 for the inputted example sentence 1.
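The enumeration described above can be sketched as follows. The data shapes and names are illustrative assumptions: the dictionary is taken to map a style name to its pattern list, and a pattern is matched by simple substring presence.

```python
def generate_frozen_pattern_lists(sentences, style_dictionary):
    """For each sentence and each document style, enumerate the
    style-specific frozen patterns from the dictionary that occur
    in the sentence.

    sentences: list of N sentence strings.
    style_dictionary: dict mapping a style name to its pattern list.
    Returns a dict keyed by (style, sentence_index), i.e. N x M
    frozen pattern lists for M styles.
    """
    lists = {}
    for style, patterns in style_dictionary.items():
        for j, sentence in enumerate(sentences):
            lists[(style, j)] = [p for p in patterns if p in sentence]
    return lists
```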
A decision tree for document style is stored for each document style in sets of decision trees for document style that are referred to by the calculator 302 of document style confidence. The document style decision tree has a style-specific frozen pattern, which is extracted for each document style, as a characteristic and finds a classification of the document style and confidence at that point. There are two classes of document styles to be classified by the decision tree for document style. For example, in the case of the decision tree for document style 1, the classes are document style 1 and other document styles. The decision tree for document style is learned from a set of documents classified for each document style.
A decision tree algorithm generates classification rules in the form of a tree, on the basis of an information theoretical standard, from a data set having characteristic vectors and classes. Structuring of the decision tree is performed by dividing the data set recursively according to a characteristic. Details of the decision tree are described in J. Ross Quinlan, "C4.5: Programs for Machine Learning," Morgan Kaufmann Publishers (1993), and the like. Using the same method, a decision tree for document style for the document style 1, for example, is constructed by producing a data set represented by a characteristic vector, which is characterized by the style-specific frozen patterns of the document style 1, and a class to which the document belongs (document style 1/another document style).
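The information theoretical standard used in C4.5-style construction selects, at each node, the characteristic with the greatest information gain. A minimal pure-Python sketch of this split criterion, using binary pattern-presence features, is shown below; the function names and data layout are illustrative, not from the cited reference.

```python
import math
from collections import Counter

def class_entropy(labels):
    """Entropy of the class distribution in a data subset."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_split(rows, labels, n_features):
    """Pick the binary feature (pattern present / absent) with the
    largest information gain, as in ID3/C4.5 node construction."""
    base = class_entropy(labels)
    best, best_gain = None, 0.0
    for f in range(n_features):
        yes = [l for r, l in zip(rows, labels) if r[f]]
        no = [l for r, l in zip(rows, labels) if not r[f]]
        if not yes or not no:
            continue  # the split does not separate the data at all
        gain = (base
                - (len(yes) / len(labels)) * class_entropy(yes)
                - (len(no) / len(labels)) * class_entropy(no))
        if gain > best_gain:
            best, best_gain = f, gain
    return best, best_gain

def leaf_confidence(labels, cls):
    """Confidence of class `cls` at a leaf: its relative frequency there."""
    return Counter(labels)[cls] / len(labels)
```

Recursing on the two subsets produced by the winning feature, until a stopping condition is met, yields the tree; the class frequency at each leaf provides the confidence value used later.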
A document style to which an inputted sentence belongs, and the confidence at that point, can be found using the document style decision trees.
Since the inputted example sentence 1 does not include any style-specific frozen pattern for document style 1, document style 1 is obtained as the class to which the inputted example sentence 1 belongs, and 0.533 is obtained as the confidence from the decision tree for document style for the document style 1.
For example, in the case of the decision tree for document style for the document style 1, when the obtained class is another document style with confidence C, the confidence C′ for the document style 1 is calculated in accordance with Expression (5):

C′ = 1 − C (5)
Table 6 is an example of confidence for the inputted example sentence 1. In Table 6, with respect to the inputted example sentence 1, the confidence of document style 1 is calculated using the decision tree for document style for the document style 1.
Details of a method of combining plural classifiers are described in "A decision-theoretic generalization of on-line learning and an application to boosting" (Yoav Freund and Robert Schapire, Journal of Computer and System Sciences, 55(1): 119-139, 1997). A similar method is used in the classifier.
Operation of the document classifier, which uses the decision trees described above, is described herein with reference to a flowchart.
The document classifier initially receives (during step 401) a frozen pattern list V of M×N lists, which is found by the frozen pattern information extractor from the input document D. Then, in step 405, a confidence vector Cij=(Cij1, Cij2, . . . , Cijk, . . . , Cijl) is calculated using the document style decision trees for document style i stored in the sets of document style decision trees. The vector Cij is calculated from the list of frozen patterns Vij for the document style i. Here, Cijk is the confidence of the document style i that is calculated using the k-th document style decision tree from the frozen pattern list of the document style i for the j-th sentence, and l is the number of document style decision trees for the document style i stored in the sets of document style decision trees. In the embodiment, since the document style 2 is divided into cluster 1 and cluster 2 and decision trees are found for the respective clusters, l=2. Subsequently, in step 406, the style likelihood Lij of document style i for the j-th sentence is calculated from the confidence vector Cij in accordance with:
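Expression (6) itself is not reproduced in this text. Given that the weights αik described next sum to one over the l decision trees, the natural reconstruction is the weighted sum of the per-tree confidences:

Lij = Σ (k=1..l) αik·Cijk (6)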
In Expression (6), αik is a weighting factor representing the confidence of the k-th document style decision tree for the document style i, and a value satisfying 0≦αik≦1 and Σαik=1 is given. The value of αik is preferably selected to maximize the rate of correct answers for a training document with the calculated style likelihood Lij. The processing of steps 405 and 406 is repeated with respect to the list of frozen patterns Vij (1≦j≦N) for the document style i of each sentence of the input document D. The document style likelihood SLi of the document style i for the inputted document is found in step 408 from the N style likelihoods, calculated in accordance with Expression (7).
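Expression (7) itself is not reproduced in this text. Given the sentence weights βj described next, the natural reconstruction is the weighted sum of the per-sentence style likelihoods:

SLi = Σ (j=1..N) βj·Lij (7)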
In Expression (7), Lij is the style likelihood of the j-th sentence for the document style i, and βj is a weighting factor for each sentence, where a value satisfying 0≦βj≦1 and Σβj=1 is given. The value of βj is preferably the value that maximizes the rate of correct answers for a training document with the calculated document style likelihood SLi. The processing of steps 405 to 408 is repeated with respect to each document style i (1≦i≦M). Then, during step 410, the document style having the maximum likelihood is determined, from the M calculated document style likelihoods SLi, to be the document style of the inputted document.
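Steps 401 through 410 can be sketched as follows, assuming the weighted-sum forms of Expressions (6) and (7) described above. The function and variable names are illustrative, and each decision tree is abstracted as a callable that maps a frozen pattern list to a confidence value.

```python
def classify_document(pattern_lists, trees, alpha, beta, styles):
    """Combine per-tree confidences into style likelihoods (steps 405-406),
    accumulate per-sentence likelihoods into document style likelihoods
    (step 408), and pick the most likely style (step 410).

    pattern_lists[(i, j)]: frozen pattern list of style i for sentence j.
    trees[i]: list of l classifiers for style i, each returning a confidence.
    alpha[i]: weights over the l trees of style i (sum to 1).
    beta: weights over the N sentences (sum to 1).
    """
    n_sentences = len(beta)
    doc_likelihood = {}
    for i in styles:
        sl = 0.0
        for j in range(n_sentences):
            # Confidence vector Cij from the l decision trees of style i.
            c = [tree(pattern_lists[(i, j)]) for tree in trees[i]]
            # Expression (6): style likelihood of sentence j for style i.
            l_ij = sum(a * ck for a, ck in zip(alpha[i], c))
            # Expression (7): sentence-weighted document style likelihood.
            sl += beta[j] * l_ij
        doc_likelihood[i] = sl
    # Step 410: the style with the maximum likelihood wins.
    return max(doc_likelihood, key=doc_likelihood.get), doc_likelihood
```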
While there has been described and illustrated a specific embodiment of the invention, it will be clear that variations in the details of the embodiment specifically illustrated and described may be made without departing from the true spirit and scope of the invention as defined in the appended claims. For example, the invention is applicable to alphabet based languages and is not limited to character based languages, such as the given Japanese example.
Number | Date | Country | Kind
---|---|---|---
2003-348600 | Oct 2003 | JP | national