This application is based upon and claims the benefit of priority from prior Japanese Patent Application No. 2005-354939, filed Dec. 8, 2005, the entire contents of which are incorporated herein by reference.
1. Field of the Invention
The present invention relates to a technique for learning a classification model which evaluates whether or not an event indicating a specific content is written in text data accumulated in a computer.
2. Description of the Related Art
As a technique for collecting and screening training examples, the technique described in “Addressing the Curse of Imbalanced Training Sets: One-Sided Selection”, Proc. of 14th International Conference on Machine Learning, 179-186, 1997, Miroslav Kubat and Stan Matwin is known. This technique uses the training examples that include an event as they are, and screens the training examples by removing similar training examples from the large number of training examples that do not include an event. The technique randomly selects a first training example from the training examples which do not include an event and evaluates whether or not it should be kept as a training example. Consequently, the training examples that are eventually removed differ depending on which training example is selected first, and it is not always possible to keep suitable training examples which do not include an event. In addition, in order to evaluate similarities between the training examples, the distance between the training examples needs to be measured. For this reason, when the training examples comprise a large number of attributes or when there are a large number of training examples, a great deal of time is required to evaluate whether or not a training example which does not include an event should be kept.
Alternatively, JP-A 2002-222083 (KOKAI) discloses a technique which deduces a classification class corresponding to an evaluation example by generating an inference rule from a group of training examples. In doing so, training examples are collected by asking the user whether or not the inference result for the evaluation example is correct. With this technique, it is likely that well-balanced training examples can be collected for each classification class by providing the inference rule with an evaluation example which is to be the basis for generating a training example. However, as the technique does not specify how to select the evaluation example, it is not always possible to generate suitable training examples. In addition, since the training examples must be generated through interactions with users, the burden on users is extremely high.
To address the problem of deducing, by assessing a text, whether or not a particular event is described in it, learning texts important for distinguishing the event are screened from learning texts each composed of a collected text and a classification class indicating whether or not the event is written therein. By making use of the screened learning texts, a classification model for distinguishing the event is learned with high accuracy, even for an event which occurs rarely. By using the learned classification model, when a new text is provided, a classification class for the text is deduced.
When a classification model which assesses whether or not a particular event is included in a text is obtained by machine learning, it is necessary to compose the training examples by collecting texts including the event and texts not including the event in a balanced manner. However, when texts are merely collected, the texts not including an event tend to outnumber the texts including an event. Thus, an imbalanced set of training examples dominated by texts not including an event is generated. From such an imbalanced set of training examples, there is a high possibility of learning a biased classification model which tends to judge too often that an event is not included. For this reason, it is required to screen suitable training examples from the generated training examples and to learn a classification model which distinguishes, with high accuracy, whether or not an event is included.
A classification model learning apparatus according to an aspect of the present invention learns a classification model for extracting a particular event from a text to be assessed for the existence or nonexistence of the particular event, based on a plurality of learning texts each having both a text and information on the existence or nonexistence of the particular event. The apparatus is characterized by comprising: an evaluation unit configured to evaluate the existence or nonexistence of the particular event for the plurality of learning texts by applying an event related expression for evaluating the existence or nonexistence of the particular event to each learning text of the plurality of learning texts; an extracting unit configured to extract a learning text in accordance with the existence or nonexistence of the particular event evaluated by the evaluation unit; and a learning unit configured to learn a classification model based on the learning text extracted by the extracting unit. Further, the present invention is not limited to an apparatus and may also be embodied as a method and as a program realizing the method.
Embodiments of the present invention will be explained with reference to the drawings.
Hereinafter, a technique is disclosed for conveniently performing text analysis which, by using an acquired classification model, automatically evaluates whether or not an event is written in a new text. Here, the term “text data” refers to, for example, postings written on the message board of a web site, daily reports in the retail sector containing written business reports, and e-mails received at the customer centers of companies.
The classification model learning apparatus shown in
The learning text storing unit 10 stores a group of learning texts, each of which is a pair of a text and the existence or nonexistence of a particular event. The event related expression storing unit 20 stores a group of expressions related to the event. The event related expression evaluation unit 30 evaluates the existence or nonexistence of the particular event in each text by applying the group of expressions stored in the event related expression storing unit 20 to each text included in the group of learning texts. The learning text extracting unit 40 extracts a subset of the group of learning texts based on the evaluation result of each text provided by the event related expression evaluation unit 30 and on the existence or nonexistence of the particular event paired with that text. The classification model learning unit 50 learns a classification model based on the subset of the learning texts extracted by the learning text extracting unit 40. The classification model storing unit 60 stores the classification model learnt by the classification model learning unit 50. The evaluation text storing unit 70 stores texts to be evaluated for the existence or nonexistence of an event. The model event evaluation unit 80 applies the classification model stored in the classification model storing unit 60 to a text stored in the evaluation text storing unit 70 in order to evaluate the existence or nonexistence of an event.
In the above configuration, the classification model learning apparatus according to the embodiment can be realized by, for example, a general-purpose computer (for instance, a personal computer), and the event related expression evaluation unit 30, the learning text extracting unit 40, the classification model learning unit 50 and the model event evaluation unit 80 can each be configured by a program (such as a program module) which realizes the above functions. Alternatively, the classification model learning apparatus may be configured by hardware (such as a chip) which realizes the above functions, or may be realized by connecting the units via a network. Further, in the case of a general-purpose computer, the learning text storing unit 10, the event related expression storing unit 20, the classification model storing unit 60 and the evaluation text storing unit 70 may, for instance, be an external memory unit such as a magnetic storage device or an optical storage device, or may be a server connected via a communication line.
The operation of the classification model learning apparatus configured as above will be explained in reference to
First, the event related expression evaluation unit 30 reads in an event related expression (word) from the event related expression storing unit 20 (step S1). Here, the “event related expression” denotes a keyword or key phrase which is used when evaluating whether or not a particular event exists in a text. For example, when evaluating whether or not a text includes an event such as “unsatisfied”, a keyword shown in
Next, the event related expression evaluation unit 30 reads in a learning text given description or no description of an event from the learning text storing unit 10 (step S2). Whether or not to describe an event on a learning text is usually evaluated by a user who has read the learning text. A learning text given description or no description of an event is thus generated. At this time, since the number of texts including an event is smaller than the number of texts not including an event, the majority of learning texts are learning texts not including an event. Here, an example of a learning text including an event “unsatisfied” is shown in
Next, the event related expression evaluation unit 30 takes out one of the learning texts not including an event from the read in learning text (step S3). In step S3, when there is a learning text to take out, the event related expression evaluation unit 30 evaluates whether or not the taken out learning text includes an event related expression with reference to the read in event related expressions (step S4). In this case, for instance, in the example shown in
In step S4, when the event related expression evaluation unit 30 evaluates that an event related expression is not included in the learning text, the process goes back to step S3. In step S3, when there is no learning text left to take out, the classification model learning unit 50 learns, by using a text mining method, a classification model in tree structure form from the learning texts not including an event and the learning texts including an event which have been extracted by the learning text extracting unit 40 (step S6). A text mining method is, for example, described in “Acquisition of a Knowledge Dictionary Symposium, ISMIS 2002, 103-113, 2002, Shigeaki Sakurai, Yumi Ichimura, and Akihiro Suyama”.
The classification model learning unit 50 learns as follows. The text part of each learning text is decomposed into a group of words by morphological analysis. Evaluation values for the keywords and key phrases collected from all learning texts are calculated based on their frequency. The group of keywords and key phrases whose evaluation values are greater than or equal to a designated threshold value is regarded as an attribute vector which characterizes the group of learning texts. By evaluating, for each learning text, whether or not the keyword or key phrase corresponding to each attribute of the attribute vector occurs, the value of the attribute vector corresponding to that learning text is determined. A training example is generated by pairing this attribute vector with a classification class which indicates whether the event is described or not described. The classification model of a tree structure is learnt from the group of these training examples.
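The generation of training examples described above can be illustrated by the following minimal sketch. It is not the embodiment itself and rests on stated assumptions: whitespace splitting stands in for morphological analysis, the evaluation value is taken to be the number of learning texts containing a word, and the function names, sample texts and threshold are hypothetical.

```python
from collections import Counter

def tokenize(text):
    # Assumption: whitespace splitting stands in for morphological analysis.
    return text.lower().split()

def build_attributes(learning_texts, threshold):
    # Assumption: the evaluation value of a word is the number of
    # learning texts in which it occurs (one frequency-based choice).
    doc_freq = Counter()
    for text, _label in learning_texts:
        doc_freq.update(set(tokenize(text)))
    # Keep the words whose evaluation value reaches the threshold.
    return sorted(w for w, n in doc_freq.items() if n >= threshold)

def to_training_example(text, label, attributes):
    # Each attribute is 1 if its keyword occurs in the text, else 0.
    words = set(tokenize(text))
    vector = [1 if a in words else 0 for a in attributes]
    return vector, label

# Hypothetical learning texts: (text, classification class).
learning_texts = [
    ("the clerk was rude", "event"),
    ("the clerk was kind", "no_event"),
    ("delivery was late and the clerk was rude", "event"),
]
attributes = build_attributes(learning_texts, threshold=2)
examples = [to_training_example(t, c, attributes) for t, c in learning_texts]
```

In this sketch the resulting list `examples` would be handed to whatever tree learner is used; the learner itself is outside the scope of the sketch.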
For example, when considering learning a classification model from the learning texts of
In this way, the learning texts not including any event related expression are removed from the learning texts which do not include an event. Thus, a classification model can be learnt which reflects training examples that would tend to be regarded as noise if all learning texts were used.
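The screening summarized above can be sketched as follows. This is only an illustration under assumptions: the event related expressions are matched by case-insensitive substring search rather than by the matching actually used in the embodiment, and the function names, expressions and sample texts are hypothetical.

```python
def contains_expression(text, expressions):
    # Assumption: a simple case-insensitive substring match.
    lowered = text.lower()
    return any(e.lower() in lowered for e in expressions)

def screen_learning_texts(learning_texts, expressions):
    kept = []
    for text, has_event in learning_texts:
        # Learning texts including the event are always kept; learning
        # texts not including the event are kept only when they contain
        # at least one event related expression.
        if has_event or contains_expression(text, expressions):
            kept.append((text, has_event))
    return kept

# Hypothetical event related expressions and learning texts.
expressions = ["rude", "late"]
learning_texts = [
    ("The clerk was rude to me", True),
    ("The clerk was not rude at all", False),
    ("I bought a new phone", False),
    ("The parcel arrived late", True),
]
kept = screen_learning_texts(learning_texts, expressions)
```

Each learning text is examined exactly once, so the cost of the screening grows linearly with the number of learning texts and no pairwise distance between texts is computed.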
Learning examples of the classification model are shown in
When considering a part of the classification model shown in
The classification model learning unit 50 stores the classification model acquired as above in the classification model storing unit 60 (step S7).
The classification model learning ends with the above steps. Subsequently, by using the acquired classification model, a text is evaluated in steps S8 to S10.
The model event evaluation unit 80 reads in the evaluation text stored in the evaluation text storing unit 70 (step S8). For example, as an evaluation text, a text shown in
The model event evaluation unit 80 takes out one evaluation text from the evaluation texts which it has read in (step S9). At this time, when there is no evaluation text to take out, the process terminates, and when there is an evaluation text to take out, the model event evaluation unit 80 evaluates the model event for the evaluation text (step S10).
More specifically, the model event evaluation unit 80 first performs morphological analysis on the taken out evaluation text and evaluates whether or not it includes the keywords corresponding to each attribute of the attribute vector determined by the classification model learning unit 50. Based on the evaluation result, the model event evaluation unit 80 generates, for instance, an evaluation example as shown in
Thus, by learning the classification model from the selected learning text, the classification class corresponding to the evaluation text can be deduced with high accuracy.
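How a classification model in tree structure form may be applied to an evaluation text can be sketched as follows. The tree shown here is hand-written and purely hypothetical, not a model learnt by the embodiment; the nested-dictionary representation, the key names `keyword`, `present` and `absent`, and the whitespace tokenization are all assumptions made for illustration.

```python
def classify(tree, words):
    # Descend from the root: each inner node tests whether its keyword
    # occurs in the evaluation text; each leaf is a classification class.
    node = tree
    while isinstance(node, dict):
        branch = "present" if node["keyword"] in words else "absent"
        node = node[branch]
    return node

# Hypothetical tree: inner nodes are dicts, leaves are class names.
tree = {
    "keyword": "rude",
    "present": "event",
    "absent": {
        "keyword": "late",
        "present": "event",
        "absent": "no_event",
    },
}

# Assumption: whitespace splitting stands in for morphological analysis.
words = set("the parcel arrived late".split())
result = classify(tree, words)
```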
The classification model learning apparatus according to the present invention is not restricted to the above embodiment. For instance, the keywords or key phrases stored in the event related expression storing unit 20 may be given with category information attached. In that case, the morphological analysis performed on the text decomposes the text into words with category information attached.
Alternatively, when the classification model learning unit 50 selects the keywords and key phrases comprising the attribute vector, in addition to using the evaluation value calculated based on the frequency, it may select only the keywords and key phrases having a certain alignment of categories.
Additionally, although a text mining method for learning a classification model in a tree structure has been used in the classification model learning unit 50, a classification model expressed as a hyperplane can be learnt as well by using, for instance, a text mining method based on SVM (Shigeaki Sakurai, Chong Goh, Ryohei Orihara: “Analysis of Textual Data with Multiple Classes”, Symposium on Methodologies for Intelligent Systems (ISMIS2005), 112-120, Saratoga, USA, (2005-05)).
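As a rough illustration of what a classification model expressed as a hyperplane means, the following sketch classifies an attribute vector by the sign of a weighted sum plus a bias. The weights and bias are hypothetical hand-set values, not the result of SVM training, and the function name is an assumption.

```python
def hyperplane_class(vector, weights, bias):
    # The hyperplane is the set of points where the score equals zero;
    # the classification class depends on which side the vector lies.
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    return "event" if score >= 0 else "no_event"

# Hypothetical parameters for a four-attribute vector.
weights = [0.0, 1.5, -0.2, 0.1]
bias = -1.0

with_keyword = hyperplane_class([1, 1, 1, 1], weights, bias)
without_keyword = hyperplane_class([1, 0, 1, 1], weights, bias)
```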
As mentioned above, by specifying a group of expressions related to the existence of an event and collecting the learning texts resembling the related expressions, the imbalance of the learning texts can be corrected. In addition, it is possible to acquire a classification model which distinguishes a learning text which resembles the expressions and does not include an event from a learning text which resembles the expressions and includes a rare event. Thus, a text including a rare event can be extracted with high accuracy. Further, since the evaluation based on whether an expression related to the existence of such an event is included is performed only once for each text, the screening of the learning texts can be carried out at high speed. In addition, since the number of learning texts itself can be reduced, the classification model can be learnt at high speed.
As mentioned above, a suitable training example can be screened from the generated training examples, and a classification model for accurately distinguishing whether or not the event is included can be learnt.
Additional advantages and modifications will readily occur to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details and representative embodiments shown and described herein. Accordingly, various modifications may be made without departing from the spirit or scope of the general inventive concept as defined by the appended claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
2005-354939 | Dec 2005 | JP | national |