The present invention relates to a text mining apparatus, a text mining method, and a text mining program that structure and analyze an electronic text stored on a computer with syntax analysis, etc. In particular, the present invention relates to a text mining apparatus, a text mining method, and a text mining program that are capable of determining and analyzing sentence structures having a similar meaning as an identical structure.
In general, as an example of a text mining apparatus, a structure shown in
The conventional text mining apparatus shown in
In general, the following structures generated by the language analysis device are frequently used.
Herein, the information about the attached word indicates an attached concept including tense such as present or perfect, modality such as easy or difficult, and negation. The information about the attached word is added to a clause by the attached word.
Further, all the information in the structure can be expressed by a structure comprising the nodes having labels without the attribute values and only the directional branch without the attribute value.
Clauses “kare (He)”, “shashu A (type A of vehicle)”, “kakaku (price)”, “sagenu (has been down)”, and “shiru (know)” in the sentence are represented by nodes having labels without the attribute value (e.g., a label “surface case ha” is added to the node “kare (He)”, labels “information about the attached word perfect” and “surface case wo” are added), and the directional branch from the node on the modifier to the modifiee does not have the attribute value.
The above-mentioned conventional system has the following problems. The following problems and the analysis for them are based on the research and examination result of the present inventors. Contents shown in
A first problem is that, upon detecting a frequent pattern, patterns with structures having a similar meaning and different connecting configurations are determined as entirely different patterns.
The connecting configuration is the configuration obtained by considering only the nodes of the structure, the character strings of the words, the connecting relationships of the directional branches, and their directions, while omitting the attached attribute information.
The first problem arises because the conventional text mining apparatus does not comprise means that determines structures having different connecting configurations and a similar meaning as the identical structure.
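The notion of a connecting configuration can be illustrated with a minimal sketch. It assumes, purely for illustration, that a structure is a list of nodes (each a word plus a set of attribute labels) together with directional branches given as (modifier index, modifiee index) pairs. The names `Node` and `connecting_configuration` are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical node: a word plus its attribute labels."""
    word: str
    attributes: frozenset = frozenset()  # e.g. {"surface case: ha"}


def connecting_configuration(nodes, branches):
    """Keep words, branches, and branch directions; drop attribute information.

    branches: iterable of (modifier_index, modifiee_index) pairs.
    """
    words = tuple(node.word for node in nodes)
    edges = frozenset((nodes[m].word, nodes[h].word) for m, h in branches)
    return words, edges


# Two structures that differ only in an attribute label (illustrative data).
with_ha = [Node("kare", frozenset({"surface case: ha"})), Node("shiru")]
with_ga = [Node("kare", frozenset({"surface case: ga"})), Node("shiru")]

same = connecting_configuration(with_ha, [(0, 1)]) == connecting_configuration(with_ga, [(0, 1)])
```

Under these assumptions the two structures share one connecting configuration, while reversing the branch direction yields a different one.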
Examples of the difference between the structures having the different connecting configurations and the similar meaning are as follows upon using a sentence structure with the attribute value.
(B1) Difference between directions of the dependency,
(B2) Difference between dependency orders,
(B3) Difference due to replacement with synonyms, and
(B4) Difference between parallel syntax structures and meaning structures.
In the example shown in
In the example shown in
In the example shown in
In the example shown in
A second problem is that, upon detecting a frequent pattern, structures having different attribute values and a similar meaning are determined as completely different patterns.
The second problem arises because the conventional text mining apparatus does not consider determining structures having different attribute values as an identical one.
Examples of the difference between the structures having different attribute values and the similar meaning upon using the sentence structure with the attribute value are the difference between the information about the attached word, the difference between the surface cases etc.
In the example shown in
In the example shown in
A third problem is that a user of the text mining apparatus cannot adjust, upon detecting the frequent pattern, how similar structures must be in order to be determined as an identical one.
The third problem arises because the conventional text mining apparatus does not consider letting a user adjust how similar structures must be in order to be determined as an identical one upon detecting the frequent pattern.
Accordingly, it is one object of the present invention to provide a text mining apparatus, method, and program in which structures having a similar meaning and different connecting configurations are determined as an identical pattern and a frequent pattern is detected.
It is another object of the present invention to provide a text mining apparatus, method, and program capable of determining structures having a similar meaning and different attribute values as an identical one and of detecting a frequent pattern.
It is a further object of the present invention to provide a text mining apparatus, method, and program that allow a text mining user to adjust how similar structures must be in order to be determined as an identical one, and that detect a frequent pattern accordingly.
The present invention disclosed in this application has the following schematic structure so as to accomplish the objects.
According to a first aspect of the present invention, a text mining apparatus comprises means that generates a sentence structure from an input document, means that generates a similar structure of patterns having a similar meaning of a partial structure of the sentence structure by performing predetermined conversion operation of the partial structure, and means that determines the patterns having the similar meaning as the identical pattern and detects the pattern.
According to the present invention, the means for generating the similar structure comprises means that performs parallel modification of the sentence structure, means that generates a partial structure of the sentence structure, means that performs non-directional branching of a directional branch of the sentence structure and/or partial structure, means that replaces a synonym in the sentence structure and/or partial structure by referring to a synonym dictionary, and means that performs non-ordering of ordering trees of the sentence structure and/or partial structure, and uses the similar structures as an equivalent class of the partial structure of the sentence structure. The equivalent class is a set of structures whose elements are treated as an identical structure. When two equivalent classes include at least one identical element, the two equivalent classes are determined as the identical equivalent class. According to the present invention, the generated similar structure is used as the equivalent class of the sentence structure on the generation side, and the frequent pattern is detected.
According to a second aspect of the present invention, a text mining apparatus comprises frequent-similar-pattern detection means that ignores the difference between the attribute values in the structure and detects the frequent pattern, in place of the frequent-pattern detection means included in the text mining apparatus according to the first aspect. The frequent-similar-pattern detection means determines similar structures having different attribute values as an identical one, and detects the frequent pattern. According to the present invention, the similar structures having different attribute values therein are determined as an identical one, and the frequent pattern is detected.
According to a third aspect of the present invention, a text mining apparatus comprises a storage unit that stores a set of documents as a text mining object, an analyzing unit that reads and analyzes the documents from the storage unit and obtains a sentence structure, a similar-structure generation adjustment unit that generates, from a user input, a first determination item for determining whether or not structures are identical for every type of difference between the sentence structures, a similar-structure determination adjustment unit that generates, from a user input, a second determination item for determining whether or not structures are identical for every type of difference between attribute values, a similar-structure generating unit that performs a predetermined conversion operation on a partial structure of the sentence structure obtained by the analyzing unit in accordance with the first determination item generated by the similar-structure generation adjustment unit and generates similar structures having a meaning similar to that of the partial structure, and a similar-pattern detecting unit that uses the similar structure generated by the similar-structure generating unit as an equivalent class of the partial structure on the generation source and detects the frequent pattern by ignoring the difference between the attribute values in accordance with the second determination item of the similar-structure determination adjustment unit. According to the present invention, a determination input for adjusting whether or not the structures are identical is received.
Further, according to a fourth aspect of the present invention, a method comprises
a step of generating a sentence structure from an input document,
a step of generating a similar structure of patterns having a similar meaning of a partial structure of the sentence structure by performing predetermined conversion operation of the partial structure, and
a step of determining the patterns having the similar meaning as the identical pattern and detecting the pattern.
Furthermore, according to a fifth aspect of the present invention, a method comprises
a step of analyzing a document in a storage unit that stores a set of documents as a text mining object and obtaining a sentence structure,
a step of generating a similar structure of patterns having a similar meaning of a partial structure of the sentence structure, and
a step of using the generated similar structure as an equivalent class of the partial structure on the generation source and detecting a pattern by ignoring the difference between attribute values.
In addition, according to a sixth aspect of the present invention, a method comprises
a step of analyzing a document from a storage unit that stores a set of documents as a text mining object and obtaining the sentence structure,
a step of generating, from input information that a user enters through an input device, a first determination item for determining whether or not structures are identical for every type of difference between sentence structures (connecting configurations) and a second determination item for determining whether or not structures are identical for every type of difference between attribute values,
a step of generating a similar structure having a meaning similar to that of the partial structure of the sentence structure in accordance with the first determination item, and
a step of using the generated similar structure as an equivalent class of the partial structure on the generation source and detecting the frequent pattern by ignoring the difference between the attribute values in accordance with the second determination item.
In addition, according to a seventh aspect of the present invention, a program enables a computer forming a text mining apparatus to execute
processing for analyzing a document in a storage unit that stores a set of documents as a text mining object and obtaining a sentence structure,
processing for performing predetermined conversion operation of a partial structure of the sentence structure and generating a similar structure having a similar meaning of the partial structure, and
processing for using the generated similar structure as an equivalent class of the partial structure on the generation source and detecting a predetermined pattern.
Hereinbelow, a specific description is given of embodiments of the present invention with reference to drawings.
Referring to
The data processing device 2 comprises language analysis means 21, similar-structure generating means 22, and frequent-pattern detection means 23. These means are schematically operated as follows.
The language analysis means 21 reads a set of texts from the text DB 11, then analyzes each text in the set, and obtains a sentence structure.
The similar-structure generating means 22 extracts all partial structures forming each sentence structure in the set of sentence structures sent from the language analysis means 21, generates all similar structures in each partial structure, and thus sets the similar structure and the partial structure on the generation source as an equivalent class.
The frequent-pattern detection means 23 detects the frequent pattern from the set of equivalent classes of the partial structure sent from the similar-structure generating means 22, and sends the detected frequent pattern to the output device 3.
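The extraction of partial structures mentioned above can be illustrated with a small sketch. It assumes, for illustration only, that a sentence structure is given as a list of (modifier, modifiee) word pairs and that a partial structure is any connected subset of those branches. The function names and the brute-force enumeration are hypothetical choices, not the method of the specification, and single-node partial structures are left out for brevity.

```python
from itertools import combinations


def is_connected(edges):
    """Return True if the branches form one connected piece (direction ignored)."""
    nodes = {word for edge in edges for word in edge}
    seen = {next(iter(nodes))}
    frontier = list(seen)
    while frontier:
        current = frontier.pop()
        for a, b in edges:
            for x, y in ((a, b), (b, a)):
                if x == current and y not in seen:
                    seen.add(y)
                    frontier.append(y)
    return seen == nodes


def partial_structures(edges):
    """Enumerate every connected, non-empty subset of the branches."""
    edges = list(edges)
    return [
        subset
        for size in range(1, len(edges) + 1)
        for subset in combinations(edges, size)
        if is_connected(subset)
    ]


# Illustrative dependency chain: kakaku -> sagenu -> shiru <- kare
sentence = [("kakaku", "sagenu"), ("sagenu", "shiru"), ("kare", "shiru")]
parts = partial_structures(sentence)
```

Exhaustive enumeration grows exponentially with sentence length, so this form is only suitable for short sentences; it is meant to show what "all partial structures" covers, not how to compute them efficiently.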
First, the language analysis means 21 reads the set of texts from the text DB 11. The language analysis means 21 analyzes the texts in the set of texts, generates the sentence structure as the analysis result, and sends the generated sentence structure to the similar-structure generating means 22 (step A1 in
Subsequently, the similar-structure generating means 22 generates all similar structures of the partial structure in the set of given sentence structures and thus sets the similar structure as the equivalent class of the partial structure on the generation source. Thereafter, the similar-structure generating means 22 sends a set of the equivalent classes to the frequent-pattern detection means 23 (step A2 in
Further, the frequent-pattern detection means 23 detects the frequent pattern from the equivalent class of the given partial structure (step A3 in
The frequent-pattern detection means 23 outputs the detected frequent pattern to the output device 3 (step A4 in
Referring to
Subsequently, “Generate the partial structure” is performed so as to detect the pattern from the partial structure as well as from all the sentence structures (step A2-2 in
Subsequently, “Non-directional branching of a directional branch” corresponding to the difference between dependency directions is performed (step A2-3 in
Subsequently, “Replace synonym” corresponding to the difference between the synonyms is performed (step A2-4 in
“Non-ordering of ordering trees” corresponding to the difference between the dependency orders is performed (step A2-5 in
Finally, the similar structure is set as an element of the equivalent class in the partial structure on the generation source, thereby performing “Generate the equivalent class” (step A2-6 in
Hereinbelow, a description is given of the operation and the advantage of the apparatus according to the first embodiment of the present invention.
The apparatus according to the first embodiment uses the similar structure generated by the similar-structure generating means 22 as the equivalent class of the original structure and detects the frequent pattern. Thus, structures having different connecting configurations and a similar meaning can be determined as the identical one, and the frequent pattern can be detected.
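A rough sketch of how similar structures might be collected into an equivalent class follows. It assumes, for illustration, that a partial structure is a tuple of (modifier, modifiee) word pairs, that the synonym dictionary maps a word to one representative word, and that the synonym pair "kousoku"/"hayai" exists in that dictionary. The function names are hypothetical, and parallel modification is not covered.

```python
def replace_synonyms(edges, synonyms):
    """Replace each word by its dictionary representative, if any."""
    return tuple((synonyms.get(a, a), synonyms.get(b, b)) for a, b in edges)


def undirected(edges):
    """Non-directional branching: forget which end is the modifier."""
    return tuple(sorted(tuple(sorted(edge)) for edge in edges))


def equivalent_class(edges, synonyms):
    """Collect the source structure and its similar structures in one set."""
    source = tuple(edges)
    members = {("directed", source)}
    for variant in (source, replace_synonyms(source, synonyms)):
        members.add(("directed", variant))
        # Non-ordering: the order of sibling branches no longer matters.
        members.add(("directed", tuple(sorted(variant))))
        members.add(("undirected", undirected(variant)))
    return frozenset(members)


synonyms = {"kousoku": "hayai"}  # assumed dictionary entry
class_a = equivalent_class([("hayai", "shashu A")], synonyms)
class_b = equivalent_class([("shashu A", "kousoku")], synonyms)
shared = class_a & class_b  # non-empty, so the two classes count as identical
```

The two source structures differ in dependency direction and in word choice, yet their classes share the undirected, synonym-replaced form, which is the condition under which two equivalent classes are treated as the same pattern.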
Next, a specific description is given of the second embodiment of the present invention with reference to the drawings.
Referring to
According to the second embodiment, the frequent-similar-pattern detection means 24 ignores the difference between the attribute values and detects the frequent pattern from the set of the equivalent classes in the partial structure sent from the similar-structure generating means 22, and sends the detected frequent pattern to the output device 3.
According to the first embodiment, the frequent-pattern detection means 23 detects the frequent pattern without determining structures having the identical connecting configuration and different attribute values as the identical one.
However, according to the second embodiment, the frequent-similar-pattern detection means 24 determines, for the set of the equivalent classes given from the similar-structure generating means 22, the structures having the identical connecting configuration and different attribute values as the identical structure, detects the frequent pattern, and sends the detected frequent pattern to the output device 3 (step B3 in
Next, a description is given of the operation and the advantage of the apparatus according to the second embodiment of the present invention.
According to the second embodiment of the present invention, the frequent-similar-pattern detection means 24 determines even the structures having the identical connecting configuration and different attribute values as the identical structure and detects the frequent pattern. Therefore, the structures having different attribute values and the similar meaning can be determined as the identical structure and the frequent pattern can be detected.
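The effect of ignoring attribute values during counting can be sketched as follows. The representation is an assumption made for illustration: each structure is a tuple of (modifier, modifiee, attributes) triples, and `pattern_key` and `count_patterns` are hypothetical names.

```python
from collections import Counter


def pattern_key(structure, ignore_attributes):
    """Return the key under which a structure is counted."""
    if ignore_attributes:
        return tuple((modifier, modifiee) for modifier, modifiee, _ in structure)
    return structure


def count_patterns(structures, ignore_attributes):
    """Count how often each pattern key occurs."""
    return Counter(pattern_key(s, ignore_attributes) for s in structures)


# Same connecting configuration, different surface case (illustrative data).
s1 = (("shashu A", "yasui", ("surface case: ga",)),)
s2 = (("shashu A", "yasui", ("surface case: ha",)),)

strict = count_patterns([s1, s2], ignore_attributes=False)
loose = count_patterns([s1, s2], ignore_attributes=True)
```

With attributes respected, each structure is counted once under its own key; with attributes ignored, both fall under a single key with a count of two.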
Next, a specific description is given of the third embodiment of the present invention with reference to the drawings.
Referring to
The input device 6 receives, from a user,
The determination inputs received by the input device 6 are as follows.
The similar-structure generation adjustment means 25 determines, in accordance with the determination given from the input device 6, whether or not the structures are identical for every type of difference between the connecting configurations, and sends the determination item to the similar-structure generating means 22.
Further, the similar-structure determination adjustment means 26 determines, in accordance with the determination given from the input device 6, whether or not the difference between the attribute values is ignored for every type of attribute value, and sends the determination item to the frequent-similar-pattern detection means 24.
The similar-structure generating means 22 generates the similar structure of the partial structures in the individual structures of the set given from the language analysis means 21 in accordance with the similar-structure generation adjustment means 25, and thus sets the generated similar structure as the equivalent class of the partial structure on the generation source.
The frequent-similar-pattern detection means 24 detects the frequent pattern from the set of equivalent classes given from the similar-structure generating means 22 in accordance with the determination from the similar-structure determination adjustment means 26 by ignoring the difference between the attribute values.
First, the language analysis means 21 reads the set of texts from the text DB 11.
The language analysis means 21 analyzes each text in the set of texts, generates the sentence structure as the analysis result, and sends the generated sentence structure to the similar-structure generating means 22 (step A1 in
Subsequently, the input device 6 receives, from a user, an input for determining whether or not the structures are identical for every type of difference between the sentence structures and an input for determining whether or not the difference between the attribute values is ignored for every type of attribute value, and sends the received inputs to the similar-structure generation adjustment means 25 and the similar-structure determination adjustment means 26, respectively (step C1 in
The similar-structure generation adjustment means 25 receives the determination from the input device 6, generates a determination item for determining whether or not the structures are identical for every type of difference between the sentence structures, and sends the generated determination item to the similar-structure generating means 22. Further, the similar-structure determination adjustment means 26 receives the determination from the input device 6, generates a determination item for determining whether or not the difference between the attribute values is ignored for every type of attribute value, and sends the generated determination item to the frequent-similar-pattern detection means 24 (step C2 in
The similar-structure generating means 22 generates the similar structure of the partial structure forming the sentence structure in the set given from the language analysis means 21 in accordance with the determination from the similar-structure generation adjustment means 25, thus sets the generated similar structure as the equivalent class of the partial structure on the generation source, and sends the set of equivalent classes to the frequent-similar-pattern detection means 24 (step C3 in
The frequent-similar-pattern detection means 24 ignores the attribute value in accordance with the determination from the similar-structure determination adjustment means 26, and detects the frequent pattern from the set of equivalent classes given from the similar-structure generating means 22 (step C4 in
Finally, the frequent-similar-pattern detection means 24 outputs the detected frequent pattern to the output device 3 (step A4 in
Referring to
In the determination in step C3-2 whereupon the non-directional branching of the directional branch is determined, the similar-structure generating means 22 performs the non-directional branching of the directional branch (step A2-3 in
In the determination in step C3-3 whereupon the replacement of the synonym is determined, the similar-structure generating means 22 replaces the synonym (step A2-4 in
In the determination in step C3-4 whereupon the non-ordering of ordering trees is determined, the non-ordering of ordering trees is performed (step A2-5 in
In step A2-6, the equivalent class is generated. The non-ordering of ordering trees and the generation of the equivalent class are the same as those in steps A2-5 and A2-6 in
As mentioned above, according to the third embodiment, it is adjusted, in accordance with the determination given from the similar-structure generation adjustment means 25, whether or not the parallel modification (step A2-1 in
A user refers to the output pattern, returns to step C1 whereupon the user inputs the determination as how similar structures are identical, and detects the frequent pattern again according to the present invention.
Next, a description is given of the operation and the advantage of the apparatus according to the third embodiment of the present invention.
According to the third embodiment, the similar-structure generation adjustment means and the similar-structure determination adjustment means adjust, in accordance with the user determination, how similar structures are determined as the identical one. As a consequence, the user can adjust the determination as how similar structures are identical and the detection of the frequent pattern.
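One way to picture the two user-adjustable determination items is a small settings object consulted during comparison. The sketch below is an assumption-laden illustration: `Adjustment` and `normalise` are hypothetical names, attributes are modelled as a dictionary from attribute type to value, and the structures are reduced to a single canonical form rather than expanded into similar structures as the specification describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Adjustment:
    # First determination item: structural differences to treat as identical.
    undirect_branches: bool = True
    replace_synonyms: bool = True
    # Second determination item: attribute types whose values are ignored.
    ignored_attribute_types: frozenset = frozenset()


def normalise(structure, adjustment, synonyms):
    """structure: tuple of (modifier, modifiee, {attribute_type: value}) triples."""
    result = []
    for modifier, modifiee, attributes in structure:
        if adjustment.replace_synonyms:
            modifier = synonyms.get(modifier, modifier)
            modifiee = synonyms.get(modifiee, modifiee)
        pair = (modifier, modifiee)
        if adjustment.undirect_branches:
            pair = tuple(sorted(pair))
        kept = tuple(sorted(
            (kind, value) for kind, value in attributes.items()
            if kind not in adjustment.ignored_attribute_types
        ))
        result.append((pair, kept))
    return tuple(result)


a = (("shashu A", "yasui", {"surface case": "ga"}),)
b = (("yasui", "shashu A", {"surface case": "ha"}),)

loose = Adjustment(ignored_attribute_types=frozenset({"surface case"}))
strict = Adjustment(undirect_branches=False)
```

Under the `loose` settings the two structures normalise to the same form; under `strict` they stay distinct, which mirrors how a user could tighten or relax the notion of "identical" between runs.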
Next, the fourth embodiment of the present invention will be described in detail with reference to the drawings.
Referring to
A text mining program 7 is read into a data processing device 8 and controls the operation of the data processing device 8. Under the control of the text mining program 7, the data processing device 8 executes the same processing as that of the data processing devices 2, 4, and 5 according to the first to third embodiments.
Next, a specific description is given of examples according to the present invention.
First, a first example of the present invention will be described with reference to the drawings. The first example of the present invention is an example of the first embodiment.
An apparatus according to the first example comprises a personal computer serving as the data processing device 2 shown in
A personal computer 2 comprises a central processing unit (CPU) functioning as the language analysis means 21, the similar-structure generating means 22, and the frequent-pattern detection means 23. The magnetic disk storage device stores a set of texts serving as the text DB 11.
The language analysis means 21 analyzes the language of each text in the set of texts shown in
Subsequently, the similar-structure generating means 22 generates all similar structures in the partial structure forming the sentence structures shown in
In the first example, a description is given of how the equivalent class of the partial structure is generated from the sentence structure of the sentence 2 (“Hayaku yasui shashu A (a fast and cheap type A of vehicle)”) shown in
Referring to
Referring to
Further, the similar-structure generating means 22 generates a partial structure 2b-0 indicating a relationship between two words that are not included in the partial structure 2a-0 from the similar structure 2a-1.
Incidentally, the structures generated from both the partial structure 2a-0 and the similar structure 2a-1 are used as one.
Further, the partial structure 2a-0 and the similar structure 2a-1 used for generating the partial structure herein are used as the partial structure and the similar structure in the future generation of the similar structure.
Subsequently, the similar-structure generating means 22 performs the non-directional branching of the directional branch (step A2-3 in
Subsequently, the synonym is replaced (step A2-4 in
Referring to
The partial structure and the similar structure generated at this time do not include the replaced word “kousoku (high velocity)”. Therefore, in step A2-4, the modification is not performed. Herein, a diagram for modification in step A2-4 is omitted.
Subsequently, the ordering trees are non-ordered (step A2-5 in
Incidentally, other methods for non-ordering the ordering trees may be used as follows.
Among the generated partial structure and similar structure, the partial structure and the similar structure excluding similar structures 2a-1 and 2a-3 (refer to
Finally, the similar structure is set as the equivalent class of the partial structure on the generation source, thereby generating the equivalent class (step A2-6 in
An equivalent class 2b comprises the partial structure 2b-0, and the similar structure 2b-1 generated by performing the non-directional branching of the directional branch of the partial structure 2b-0. An equivalent class 2c comprises the partial structure 2c-0, and the similar structure 2c-1 generated by performing the non-directional branching of the directional branch of the partial structure 2c-0. An equivalent class 2g comprises the partial structure 2g-0, and the similar structure 2g-1 generated by performing the non-directional branching of the directional branch of the partial structure 2g-0. The partial structure 2d-0, 2e-0, and 2f-0 have the identical structure and the identical partial structure.
Referring to FIGS. 18 to 21, in the examples in which the similar-structure generating means 22 generates the equivalent classes from the sentence structures (refer to
Referring to
First, the partial structure 3a-0 indicating the sentence structure of the sentence 3 is subjected to the parallel modification (step A2-1 in
Subsequently, the partial structure is generated from the partial structure 3a-0 (step A2-2 in
Subsequently, the directional branch is non-directional branched in the partial structure 3a-0 (step A2-3 in
Subsequently, the synonym is replaced in the similar structure 3a-1 (step A2-4 in
Subsequently, the ordering trees are non-ordered in the similar structure 3a-1 (step A2-5 in
For the above-generated similar structure, the equivalent class is generated (step A2-6 in
As mentioned above, the similar-structure generating means 22 generates the partial structure, the similar structure, and the equivalent class, thereby generating an equivalent class shown in
Originally, in the middle steps of the modification in
Subsequently, the frequent-pattern detection means 23 detects the frequent pattern (frequent equivalent class) from the set of equivalent classes shown in FIGS. 23 to 25 (step A3 in
In this case, the frequent-pattern detection means 23 determines the equivalent classes having at least one identical element as the identical equivalent class and detects the frequent pattern.
For example, in the examples, in both a similar structure 1c-1 serving as an element of an equivalent class 1c shown in
Therefore, the frequent-pattern detection means 23 determines the equivalent class 1c shown in
Referring to FIGS. 23 to 25,
“the similar structure 1c-1, the similar structure 2b-1, and the similar structure 3c-1”,
“a partial structure 1d-0, a partial structure 2d-0, and a similar structure 3e-1”,
“a partial structure 1e-0, a partial structure 2f-0, and a partial structure 3f-0”, and
“a partial structure 1f-0 and a partial structure 2e-0” have the identical structure.
On the basis of the feature of the equivalent class that “The equivalent classes having at least one identical element are determined as the identical equivalent class”, among the equivalent classes shown in FIGS. 23 to 25,
“the equivalent classes 1c, 2b, and 3c”,
“the equivalent classes 1d, 2d, and 3e”,
“the equivalent classes 1e, 2f, and 3f”, and
“the equivalent classes 1f and 2e”
are determined as the identical equivalent classes.
In the examples, an equivalent class that appears three or more times is determined as the frequent pattern. Before executing the text mining, a user can set how many times an equivalent class must appear in order to be detected as the frequent pattern.
In this case,
“the equivalent classes 1c, 2b, and 3c”,
“the equivalent classes 1d, 2d, and 3e”, and
“the equivalent classes 1e, 2f, and 3f”
are detected as the frequent patterns.
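The merging and thresholding just described can be sketched as follows. The sketch assumes that each equivalent class is a set of hashable structures and that the number of merged source classes stands in for the number of appearances; the strings "S" and "T" are placeholders for the shared structures (for instance, "S" stands for the common form of similar structures 1c-1, 2b-1, and 3c-1). The function names are hypothetical.

```python
def merge_classes(classes):
    """Merge equivalent classes that share at least one element.

    classes: dict mapping a class name to a set of structures.
    Returns a list of (names, members) pairs.
    """
    merged = []
    for name, members in classes.items():
        names, pool = {name}, set(members)
        untouched = []
        for group_names, group_pool in merged:
            if group_pool & pool:
                # Shared element: absorb the whole existing group.
                names |= group_names
                pool |= group_pool
            else:
                untouched.append((group_names, group_pool))
        merged = untouched + [(names, pool)]
    return merged


def frequent_patterns(classes, min_count=3):
    """Keep merged classes that appear at least min_count times."""
    return [
        sorted(names)
        for names, _ in merge_classes(classes)
        if len(names) >= min_count
    ]


classes = {
    "1c": {"1c-0", "S"},
    "2b": {"2b-0", "S"},
    "3c": {"3c-0", "S"},
    "1f": {"T"},
    "2e": {"T"},
}
frequent = frequent_patterns(classes)
```

With the threshold of three used in the examples, the merged class of 1c, 2b, and 3c is kept, while the merged class of 1f and 2e, which appears only twice, is not.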
Finally, the structure indicating the frequent pattern extracted above is displayed on the output device 3 (step A4 in
The similar structure is generated, the equivalent class is generated, and the frequent pattern is detected. As a consequence, the “partial structure 1c-0 (
Next, the second example of the present invention will be described with reference to the drawings. The second example corresponds to the second embodiment.
An apparatus in the second example comprises a personal computer as the data processing device 4, a magnetic disk storage device as the memory device 1, and a display as the output device 3.
The personal computer 4 comprises a central processing unit (CPU) functioning as the language analysis means 21, the similar-structure generating means 22, and the frequent-similar-pattern detection means 24. The magnetic disk storage device stores a set of texts as the text DB 11. Similarly to the first example, the sentences 1 to 3 shown in
The language analysis means 21 analyzes the language of each text in the set of texts shown in
Subsequently, the similar-structure generating means 22 generates all similar structures of the partial structures forming the sentence structures shown in
Further, the frequent-similar-pattern detection means 24 detects the frequent pattern (frequent equivalent class) by ignoring the difference between the attribute values from the set of equivalent classes shown in FIGS. 23 to 25 (step B3 in
The frequent-similar-pattern detection means 24 determines the equivalent classes having at least one identical element as the identical equivalent class and detects the frequent pattern. However, the frequent-similar-pattern detection means 24 in the second example determines the similar structures as the identical structure by ignoring the difference between the surface cases or the difference between the attribute values of the information about the attached word. In view of this point, the frequent-similar-pattern detection means 24 is different from the frequent-pattern detection means 23 in the first example.
For example, both the similar structure 1a-1 shown in
In the second example, referring to FIGS. 23 to 25, the frequent-similar-pattern detection means 24 individually determines, as the identical structures,
“the similar structure 1a-1, the similar structure 2a-3, and the similar structure 3a-1”,
“the similar structure 1b-1, the similar structure 2c-1, and the similar structure 3b-1”,
“the similar structure 1c-1, the similar structure 2b-1, and the similar structure 3c-1”,
“the partial structure 1d-0, the partial structure 2d-0, and the similar structure 3e-1”,
“the partial structure 1e-0, the partial structure 2f-0, and the partial structure 3f-0”, and
“the partial structure 1f-0, the partial structure 2e-0, and the partial structure 3d-0”.
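The determination above, in which structures are treated as identical once differences between surface cases and between information about the attached word are ignored, can be sketched as follows. This is only a minimal illustration under stated assumptions, not the actual implementation of the frequent-similar-pattern detection means 24: the `Node` class, the `comparison_key` function, and the romanized example words are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical node: a label plus optional attribute values."""
    label: str
    surface_case: str = ""               # e.g. "wo" or "ga"; "" if none
    attached_word: Tuple[str, ...] = ()  # e.g. ("perfect",)
    children: Tuple["Node", ...] = ()


def comparison_key(node, ignore_surface_case=True, ignore_attached_word=True):
    """Build a hashable key in which ignored attribute values are blanked out.

    Assumption: child keys are sorted, so child order is also ignored here.
    """
    case = "" if ignore_surface_case else node.surface_case
    attached = () if ignore_attached_word else tuple(sorted(node.attached_word))
    kids = tuple(sorted(
        comparison_key(c, ignore_surface_case, ignore_attached_word)
        for c in node.children
    ))
    return (node.label, case, attached, kids)


# Two hypothetical structures that differ only in their attribute values.
a = Node("sageru", attached_word=("perfect",),
         children=(Node("kakaku", surface_case="wo"),))
b = Node("sageru", children=(Node("kakaku", surface_case="ga"),))

same_when_ignored = comparison_key(a) == comparison_key(b)
same_when_strict = comparison_key(a, False, False) == comparison_key(b, False, False)
```

In this sketch the two structures compare equal only when the attribute values are ignored, which corresponds to the difference between the second example and the first example.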
The frequent-similar-pattern detection means 24 determines the equivalent classes having at least one identical element as the identical equivalent classes and therefore individually determines, as the identical equivalent classes,
“the equivalent classes 1a, 2a, and 3a”,
“the equivalent classes 1b, 2c, and 3b”,
“the equivalent classes 1c, 2b, and 3c”,
“the equivalent classes 1d, 2d, and 3e”,
“the equivalent classes 1e, 2f, and 3f”, and
“the equivalent classes 1f, 2e, and 3d”.
In the second example, similarly to the first example, the equivalent class that appears three or more times is determined as the frequent pattern. In this case,
“the equivalent classes 1a, 2a, and 3a”,
“the equivalent classes 1b, 2c, and 3b”,
“the equivalent classes 1c, 2b, and 3c”,
“the equivalent classes 1d, 2d, and 3e”,
“the equivalent classes 1e, 2f, and 3f”, and
“the equivalent classes 1f, 2e, and 3d”
are detected as the frequent patterns.
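The merging of equivalent classes that share at least one identical element, followed by the three-or-more threshold, can be sketched as follows. This is a hedged illustration rather than the method of the apparatus itself: the function names, the element names, and the use of a union-find structure (which merges classes transitively) are assumptions made for this sketch.

```python
def merge_classes(classes):
    """Group equivalent classes that share at least one element.

    classes: list of (class_name, set_of_structure_keys).
    Returns a list of groups, each a list of class names.
    Assumption: sharing is applied transitively via union-find.
    """
    parent = list(range(len(classes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    owner = {}  # structure key -> index of the first class containing it
    for i, (_, elements) in enumerate(classes):
        for element in elements:
            if element in owner:
                parent[find(i)] = find(owner[element])
            else:
                owner[element] = i

    groups = {}
    for i, (name, _) in enumerate(classes):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


def frequent_patterns(classes, min_count=3):
    """Keep only merged groups that appear min_count or more times."""
    return [g for g in merge_classes(classes) if len(g) >= min_count]


# Hypothetical data: element names such as "x" stand in for structure keys.
example = [
    ("1a", {"x", "y"}), ("2a", {"y", "z"}), ("3a", {"z"}),
    ("1b", {"p"}), ("2c", {"p"}), ("3d", {"q"}),
]
result = frequent_patterns(example)
```

Here only the group made up of the classes 1a, 2a, and 3a reaches the threshold of three, so it is the only group returned.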
Finally, the structure indicating the above-extracted frequent pattern is displayed on the output device 3 (step A4 in
In the second example, the frequent pattern output by the output device 3 is expressed as shown in
As mentioned above, because the frequent pattern is detected while ignoring differences between the attribute values, the following partial structures, which have similar meanings but different attribute values, are determined to be identical partial structures. That is,
“the partial structure 1b-0 (
“the partial structure 1f-0 (
These partial structures can therefore be detected as the frequent patterns.
Next, a description is given of the third example of the present invention with reference to the drawings. The third example corresponds to the third embodiment of the present invention.
An apparatus in the third example comprises a personal computer as the data processing device 5, a magnetic disk storage device as the memory device 1, a display as the output device 3, and a keyboard as the input device 6.
The personal computer comprises a central processing unit (CPU) functioning as the language analysis means 21, the similar-structure generating means 22, the frequent-similar-pattern detection means 24, the similar-structure generation adjustment means 25, and the similar-structure determination adjustment means 26. The magnetic disk storage device stores a set of texts as the text DB 11. The sentences shown in
The language analysis means 21 analyzes the language of each text in the set of texts shown in
Subsequently, a user performs, with the input device, (in step C1 in
In the third example, e.g., it is assumed that
“with respect to differences between the connecting configurations, structures are determined to be identical even if they differ in dependency direction or dependency order, whereas structures that differ by synonym replacement are not determined to be identical. With respect to differences between the attribute values, structures are determined to be identical even if they differ in the information about the attached word or in the surface case”.
The input device 6 sends the inputs received from the user to the similar-structure generation adjustment means 25 and the similar-structure determination adjustment means 26.
Subsequently, the similar-structure generation adjustment means 25 receives the user determination from the input device 6, and adjusts the operation of the similar-structure generating means 22 (step C2 in
In the third example, the similar-structure generation adjustment means 25 receives from the input device 6,
“with respect to differences between the connecting configurations, structures are determined to be identical even if they differ in dependency direction or dependency order, whereas structures that differ by synonym replacement are not determined to be identical. With respect to differences between the attribute values, structures are determined to be identical even if they differ in the information about the attached word or in the surface case”.
In this case, the similar-structure generating means 22 executes modification processing upon generating the similar structure from the partial structure of the sentence structure, i.e., the modification of the parallel structure (step A2-1 in
On the other hand, the similar-structure determination adjustment means 26 receives the user inputs from the input device 6, and adjusts the operation of the frequent-similar-pattern detection means 24 (step C2 in
In the third example, the similar-structure determination adjustment means 26 receives from the input device 6 the determination that “with respect to differences between the connecting configurations, structures are determined to be identical even if they differ in dependency direction or dependency order, whereas structures that differ by synonym replacement are not determined to be identical. With respect to differences between the attribute values, structures are determined to be identical even if they differ in the information about the attached word or in the surface case”. It then adjusts the operation so that the frequent-similar-pattern detection means 24 determines whether or not the attribute values are identical while ignoring the difference between the surface cases and the difference between the information about the attached word.
Subsequently, the similar-structure generating means 22 skips the synonym replacement (step A2-4 in
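One way to picture how the user inputs could switch individual generation steps on or off is sketched below. This is an assumption-laden illustration, not the actual design of the similar-structure generation adjustment means 25: the `SimilaritySettings` class, its field names, and the mapping of each setting to a step are hypothetical, and only the step labels A2-1 to A2-6 come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimilaritySettings:
    """Hypothetical container for the user's determinations (step C1)."""
    ignore_dependency_direction: bool = True
    ignore_dependency_order: bool = True
    ignore_synonyms: bool = True
    ignore_attached_word: bool = True
    ignore_surface_case: bool = True


def generation_steps(settings):
    """Return the generation steps to run, in order.

    Assumption: direction, synonym, and order settings control steps
    A2-3, A2-4, and A2-5 respectively; the other steps always run.
    """
    steps = ["A2-1 parallel modification",
             "A2-2 partial structure generation"]
    if settings.ignore_dependency_direction:
        steps.append("A2-3 non-directional branching")
    if settings.ignore_synonyms:
        steps.append("A2-4 synonym replacement")
    if settings.ignore_dependency_order:
        steps.append("A2-5 non-ordering of ordered trees")
    steps.append("A2-6 equivalent class generation")
    return steps


# Settings matching the third example: synonym differences are NOT ignored.
third_example = SimilaritySettings(ignore_synonyms=False)
steps = generation_steps(third_example)
```

With these settings the synonym replacement step is omitted while the remaining five steps are kept, which mirrors the skip described above.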
Hereinbelow, a description is given of the modification of one partial structure of the sentence structure of the sentence 3 shown in
First, the partial structure 3a-0 indicating the sentence structure of the sentence 3 is subjected to the parallel modification (step A2-1 in
Subsequently, the partial structure is generated from the partial structure 3a-0 (step A2-2 in
Subsequently, the directional branch of the partial structure 3a-0 is subjected to the non-directional branching (step A2-3 in
The synonym replacement (step A2-4 in
Subsequently, the ordered trees of the similar structure 3a-2 are converted to unordered trees (step A2-5 in
The equivalent class of the above-generated similar structure is generated (step A2-6 in
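The effect of the non-directional branching (step A2-3) and the non-ordering (step A2-5) can be illustrated with a small sketch. The data representations used here, an edge list of (modifier, modifiee) pairs and a (label, children) tuple for trees, as well as the function names and example words, are assumptions chosen for illustration and are not taken from the description.

```python
def undirected_edges(directed_edges):
    """Drop branch direction: (modifier, modifiee) becomes an unordered pair."""
    return frozenset(frozenset(edge) for edge in directed_edges)


def unordered_tree(tree):
    """Drop sibling order: children are put into a canonical sorted order."""
    label, children = tree
    return (label, tuple(sorted(unordered_tree(c) for c in children)))


# Hypothetical structures that differ only in dependency direction.
forward = [("kakaku", "sageru")]
backward = [("sageru", "kakaku")]
direction_ignored = undirected_edges(forward) == undirected_edges(backward)

# Hypothetical structures that differ only in dependency order.
tree_1 = ("shiru", [("kare", []), ("sageru", [])])
tree_2 = ("shiru", [("sageru", []), ("kare", [])])
order_ignored = unordered_tree(tree_1) == unordered_tree(tree_2)
```

After either conversion, structures that differed only in direction or only in sibling order become equal, so they can be placed in the same equivalent class.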
In the modification in the third example, since the synonym replacement (step A2-4 in
In the third example, as mentioned above, the similar-structure generating means 22 generates the partial structure, the similar structure, and the equivalent class. Thus, the equivalent class shown in
Subsequently, the frequent-similar-pattern detection means 24 detects the frequent pattern from the set of the equivalent classes shown in
The frequent-similar-pattern detection means 24 determines the equivalent classes having at least one identical element as the identical equivalent class, and detects the frequent pattern.
In the third example, the frequent-similar-pattern detection means 24 determines which attribute-value differences to ignore when deciding whether or not the similar structures are identical, on the basis of the adjustment made by the similar-structure determination adjustment means 26.
In the third example, the similar-structure determination adjustment means 26 adjusts the operation so that the similar structures are determined to be the identical structure by
“ignoring the difference between the surface cases”, and
“ignoring the difference between the information about the attached word”. Therefore, the frequent-similar-pattern detection means 24 determines whether or not the similar structures are identical in the same manner as in the second example.
In the third example, referring to
“the similar structure 1a-1 and the similar structure 2a-3”,
“the partial structure 2c-0 and the partial structure 3b-0”,
“the similar structure 1b-1, the similar structure 2c-1, and the similar structure 3b-1”,
“the partial structure 1c-0 and the partial structure 2b-0”,
“the similar structure 1c-1 and the similar structure 2b-1”,
“the partial structure 1d-0 and the partial structure 2d-0”,
“the partial structure 1e-0, the partial structure 2f-0, and the partial structure 3f-0”, and
“the partial structure 1f-0, the partial structure 2e-0, and the partial structure 3d-0”.
The frequent-similar-pattern detection means 24 determines the equivalent classes having at least one identical element as the identical equivalent class and thus individually determines, as the identical equivalent classes,
“the equivalent classes 1a, 2a, and 3a”,
“the equivalent classes 1b, 2c, and 3b”,
“the equivalent classes 1c, 2b, and 3c”,
“the equivalent classes 1d, 2d, and 3e”,
“the equivalent classes 1e, 2f, and 3f”, and
“the equivalent classes 1f, 2e, and 3d”.
In the third example, similarly to the first and second examples, the equivalent class that appears three or more times is determined as the frequent pattern.
In this case,
“the equivalent classes 1b, 2c, and 3b”,
“the equivalent classes 1e, 2f, and 3f”, and
“the equivalent classes 1f, 2e, and 3d”
are detected as the frequent patterns.
Finally, the structures indicating the frequent pattern as extracted above are displayed on the output device 3 (step A4 in
In the third example, the frequent pattern output by the output device 3 is expressed as shown in
When the user is not satisfied with the detected frequent patterns, the processing returns to step C1 in
As mentioned above, on the basis of the user determination,
“if the difference due to the synonym replacement exists, it is not determined that the structures are identical”, referring to
“the partial structure 1a-0, the partial structure 2a-0, and the partial structure 3a-0”,
“the partial structure 1c-0, the partial structure 2b-0, and the partial structure 3c-0”, and
“the partial structure 1d-0, the partial structure 2d-0, and the partial structure 3e-0”
that have similar meanings are not determined to be identical structures, in accordance with the user inputs, and the frequent pattern is detected accordingly. Thus, the user can adjust the criterion for how similar structures must be in order to be determined identical.
According to the present invention, it is possible to determine the structures having different connecting configurations and the similar meanings as the identical structure and to detect the frequent pattern. Further, according to the present invention, it is possible to determine, as the identical structure, the similar structures of the set of structures without the attribute value and to detect the frequent pattern.
This is because, according to the present invention, the generated similar structures are used as the equivalent class of the original structure when the frequent pattern is detected. According to the present invention, it is also possible to determine similar structures in a set of structures having attribute values as the identical structure and to detect the frequent pattern.
Further, according to the present invention, it is possible to determine the structures having the similar meaning and different attribute values as the identical structure and to detect the frequent pattern.
This is because, according to the present invention, the frequent-similar-pattern detection means ignores the difference between the attribute values when detecting the frequent pattern.
Furthermore, according to the present invention, the user of the text mining apparatus can adjust how similar structures must be in order to be determined identical, and the frequent pattern can be detected accordingly.
This is because, according to the present invention, the similar-structure generation adjustment means and the similar-structure determination adjustment means adjust, on the basis of the inputs from the user, the operation for determining how similar structures must be in order to be determined identical.
The present invention can be applied to a text mining apparatus that is frequently used to analyze features of a complaint email or survey result from a client, stored on a computer, and a program for enabling the computer to form the text mining apparatus.
Number | Date | Country | Kind |
---|---|---|---|
2004-079077 | Mar 2004 | JP | national |

Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP05/05440 | 3/17/2005 | WO | | 2/26/2007 |