This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2009-112045, filed on May 1, 2009, the entire contents of which are incorporated herein by reference.
The embodiments described herein relate to an apparatus and a method of creating an information map.
Generally, text mining products, patent analysis systems, etc., have an information-map creating and display function for assisting search and analysis. An information map illustrates a relationship among words or data items included in retrieved information or information to be analyzed (bibliographic information etc. of a patent or a document) as a network chart as shown in
To improve the legibility of drawings in information maps, it is important to simplify the drawings. To simplify the drawings, a technology for thinning out edges has been developed. For thinning out edges, a method for deleting the edges in increasing order of the strength of the association is common. However, such a simple thinning-out method has the possibility of concentrating edges on a specific node. This sometimes results in creating an information map having no sense (no information) as a network chart. For example,
Thus, a technology for avoiding concentration of edges on a specific node by inventing a method for thinning out edges (for example, limiting the maximum number of edges of each node) is proposed (for example, Japanese Patent No. 4167855). The technology described in this document can avoid concentration of edges on a specific node, as shown in
According to an embodiment of the invention, an information-map creating apparatus that creates an information map representing associations among information elements, the apparatus including duplicating unit for summing first strengths of the associations of individual information elements and creating a duplicate of an information element selected on the basis of a sum of the strengths, along with the associations of the selected information element.
Degree-of-association calculating unit for calculating second strengths, including direct paths, of the associations among the information elements in a state in which some association whose strength is relatively low is excluded from the associations of one of the information elements of a duplicate origin and the information element of a duplicate target.
Degree-of-association summing unit for summing the second strengths of the associations of each of the information elements of the duplicate origin and the duplicate target.
Duplication eliminating unit for excluding, from the object to be displayed, an association whose second strength is relatively low among the associations of one information element, whose strength summed by the summing unit is higher than the others, of the information elements of the duplicate origin and the duplicate target.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed. Additional aspects and/or advantages will be set forth in part in the description which follows and, in part, will be apparent from the description, or may be learned by practice of the invention.
These and/or other aspects and advantages will become apparent and more readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
Reference will now be made in detail to the embodiments, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The embodiments are described below to explain the present invention by referring to the figures.
An embodiment will be specifically described with reference to the drawings.
First, an outline of a method for thinning out relation lines or the association lines (hereinafter referred to as “edges”) of an information map, disclosed in an embodiment, will be described. An embodiment is directed to avoiding concentration of edges on a specific node on which edges are concentrated in an information map by duplicating (or dividing) the node.
In
However, when duplicating a node, a problem occurs in determining which of the duplicate origin and the duplicate target should be the distribution destination for the individual edges concentrated on the node. In the case of
This will be described specifically hereinbelow.
A program for achieving the processes of the information-map creating apparatus 10 is provided from a recording medium 101, such as a CD-ROM. When the recording medium 101 on which the program is recorded is set in the drive unit 100, the program is installed from the recording medium 101 through the drive unit 100 into the auxiliary storage unit 102. However, the program may not necessarily be installed using the recording medium 101 and may be downloaded from another computer via a network. The auxiliary storage unit 102 stores the installed program as well as necessary files, data, etc.
When an instruction to start the program is given, the memory unit 103 reads the program from the auxiliary storage unit 102 and stores it. The CPU 104 achieves the function of the information-map creating apparatus 10 in accordance with the program stored in the memory unit 103. The display unit 105 displays a GUI (graphical user interface) etc. according to the program. The input unit 106 is a keyboard, a mouse, or the like and is used to input various operational instructions.
The document management DB 11 is a database that systematically manages documents (document data) using the auxiliary storage unit 102 (
The searching unit 12 searches the document management DB 11 for document data that meets input search criteria and outputs a set of retrieved documents as an object document set 21. In other words, the object document set 21 contains content(s) of the individual documents (bibliographic information, sentences, etc.). In an embodiment, the object document set 21 is an information set corresponding to the object of the information map. The object document set 21 may be provided from outside the information-map creating apparatus 10. Specifically, the object document set 21 may be input to the information-map creating apparatus 10 via a network, a portable recording medium, or the like. Accordingly, the information-map creating apparatus 10 may not necessarily have the document management DB 11 and the searching unit 12.
The information-extraction summing unit 13 executes an analysis including a common (known or well-known) information analysis process to extract information elements from the object document set 21, output statistical information on the information elements, and analyze the associations among the information elements, etc. The processing result of the information-extraction summing unit 13 is output as an extraction summation result 22. Accordingly, the extraction summation result 22 includes the extracted information elements, the statistical information, and the information of the associations (association information). The common information analysis process includes a morphological analysis process for dividing sentences into words, a modification analysis process for extracting subjects, predicates, objects, modification relations, etc., a statistical process for determining the frequency of appearance and the level of importance of words, etc., and a co-occurrence relation summation process for summing the number of times two words appear at the same time. The information elements are the components of an information set, which is the object of an information map, such as words extracted from the information set or the values of items in bibliographic information in a document, and can be nodes in the information map. In an embodiment, since the information elements are elements extracted from a document (i.e., the components of the document), they can also be referred to as document elements. The bibliographic information in the document includes, using a patent document as an example, items in an application or the title of the invention in a specification.
The output-element selecting unit 14 selects (chooses) information elements, to be displayed as nodes on an information map, from the information elements included in the extraction summation result 22. The selected information elements, statistical information and association information about the information elements, etc. are output as a selection result 23.
The extended thinning-out unit 15 executes a process for thinning out associations among the information elements. The associations are information represented as edges on an information map. Accordingly, the thinning-out of associations is substantially synonymous with thinning-out of edges (excluding from the object to be displayed). The former is an expression based on the viewpoint of computer processing, and the latter is an expression based on a visual viewpoint of an information map. Likewise, the information element and the node are substantially synonymous.
The extended thinning-out unit 15 includes a degree-of-association calculating section (unit) 151, a degree-of-concentration-of-associations calculating section (unit) 152, a duplicating section (unit) 153, a duplication eliminating section (unit) 154, and a thinning-out section (unit) 155. The degree-of-association calculating unit 151 calculates the degrees of associations of individual information elements with the other information elements including indirect associations (indirect paths). The degree-of-concentration-of-associations calculating unit 152 calculates the degrees of concentration of associations of the individual information elements by calculating the total sum of the degrees of associations of the individual information elements. The duplicating unit 153 creates duplicates of information elements, including the associations thereof, selected on the basis of the degree of concentration of associations. The duplication eliminating unit 154 eliminates the duplication of the associations duplicated by the duplicating unit 153. In other words, the duplication eliminating unit 154 thins out one of the associations of the duplicate origin and the associations of the duplicate target. The thinning-out unit 155 executes an association thinning out process, which may use a known method. Accordingly, the object to be thinned out by the thinning-out unit 155 is not limited to an association duplicated due to duplication.
The visualization processing unit 16 visualizes the information map on the basis of the result of the thinning out process of the extended thinning-out unit 15.
The procedure of the information-map creating apparatus 10 will be described below.
In response to, for example, an input of search criteria by the user, the searching unit 12, for example, searches the document management DB 11 for a set of documents that meets the search criteria and records the retrieved object document set 21 in the memory unit 103 or the auxiliary storage unit 102 (hereinafter referred to as “recording unit”) (S101). The object document set 21 sometimes includes only one document depending on the search criteria. Next, the information-extraction summing unit 13, for example, analyzes the object document set 21 and outputs the extraction summation result 22 including the statistical information, the association information, etc. of the information elements to the recording unit (S102).
Subsequently, the output-element selecting unit 14 selects (extracts) information elements to be displayed on an information map on the basis of the statistical information 221 and records the selection result 23 in the recording unit (select object to be displayed at S103 in
Next, the extended thinning-out unit 15 executes an association thinning out process (S104). The details of the association thinning out process are described below. Subsequently, the visualization processing unit 16 visualizes an information map on the basis of the information generated by the extended thinning-out unit 15 (S105). For example, the visualization processing unit 16 displays the information map on the display unit 105. Alternatively, the information map may be printed by a printer (not shown). On the information map, the placement positions of the individual nodes are determined depending on the degrees of associations of relation lines connecting the nodes. In other words, the individual relation lines are regarded as springs, and the lengths and strengths of the springs are determined in accordance with the degrees of associations of the relation lines. By causing repulsive force to be exerted on the individual nodes, the placement positions of the nodes are determined at positions where the relationship between the tensions and the initial lengths of the relation lines that have become springs and the repulsive force between the nodes becomes stable. A method for determining the placement positions of the nodes is described in detail in “Visualization of Keyword Association for Text Mining”, article by Watanabe, Isamu, and Kazuo Miki, Information Processing Society of Japan, the 55th Fundamental Informatics, 1999.
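The spring-based placement described above can be illustrated with a generic force-directed sketch. This is not the method of the cited article; it only assumes, for illustration, that a relation line with degree of association d acts as a spring with stiffness d and rest length 1/d, and that all node pairs repel with an inverse-square force. The function name `spring_layout`, its parameters, and the example values are hypothetical.

```python
import math

def spring_layout(positions, edges, iterations=300, step=0.05, repulsion=0.1):
    """Relax node positions until spring tension and repulsion roughly balance.

    positions: dict node -> (x, y) start position
    edges: dict (node, node) -> degree of association (assumed > 0)
    """
    pos = {name: list(xy) for name, xy in positions.items()}
    nodes = list(pos)
    for _ in range(iterations):
        force = {name: [0.0, 0.0] for name in nodes}
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                dist = math.hypot(dx, dy) or 1e-9
                # Repulsive force between every pair of nodes.
                net = repulsion / dist ** 2
                degree = edges.get((u, v), edges.get((v, u)))
                if degree:
                    # Assumption: stronger association -> stiffer, shorter spring.
                    net -= degree * (dist - 1.0 / degree)
                fx, fy = net * dx / dist, net * dy / dist
                force[u][0] += fx
                force[u][1] += fy
                force[v][0] -= fx
                force[v][1] -= fy
        for name in nodes:
            pos[name][0] += step * force[name][0]
            pos[name][1] += step * force[name][1]
    return pos

# Hypothetical example: X-A is strongly associated, X-B weakly.
start = {"X": (0.0, 0.0), "A": (1.0, 0.0), "B": (0.0, 1.0)}
layout = spring_layout(start, {("X", "A"): 2.0, ("X", "B"): 0.5})
```

Under these assumptions the strongly associated pair settles closer together than the weakly associated pair, which is the qualitative behavior the spring model is meant to produce.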
Subsequently, the details of operation S104 are described below.
In operation S201, the degree-of-association calculating unit 151, for example, adds the degrees of associations of indirect paths to the degrees of associations (degrees of direct associations) of the individual information elements and records the result of calculation in the recording unit.
To add the degrees of associations of indirect paths to the individual degrees of associations stored in the table A1, a matrix operation of squaring the matrix shown in the table A1 should be executed. The result of the matrix operation is shown in a table B1. Accordingly, in operation S201, the contents of the table B1 are recorded in the recording unit. The contents of the table A1 are also stored in the recording unit.
Meanwhile, the matrix is squared in consideration of an indirect path (i.e., an indirect path having a distance corresponding to two associations). Accordingly, for an indirect path using three or more associations, the matrix may be raised to the third power or more. In an embodiment, digits are rounded to one decimal place for the sake of convenience. The degrees of associations obtained by adding the degrees of associations of indirect paths (the degrees of associations stored in the table B1) are hereinafter referred to as “indirect-path-added degree of association”.
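The matrix operation of operation S201 can be sketched as follows. The values below are hypothetical and are not those of the table A1; the sketch also assumes that each information element has a degree of association of 1.0 with itself, so that the squared matrix retains the direct associations while adding the two-step indirect paths. The function name `add_indirect_paths` is illustrative.

```python
def add_indirect_paths(direct, power=2, digits=1):
    """Raise the association matrix to `power` and round each entry."""
    n = len(direct)
    result = [row[:] for row in direct]
    for _ in range(power - 1):
        # Plain matrix product: entry (i, j) sums over every intermediate k.
        result = [
            [sum(result[i][k] * direct[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return [[round(value, digits) for value in row] for row in result]

# Hypothetical 3-element table (X, A, B); the 1.0 diagonal is an assumption.
table_a = [
    [1.0, 0.4, 0.0],
    [0.4, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]
table_b = add_indirect_paths(table_a)
```

In this example X and B have no direct association, but `table_b[0][2]` becomes 0.1 because of the indirect path through A. Passing `power=3` would take paths of up to three associations into account.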
Subsequently, the degree-of-concentration-of-associations calculating unit 152 calculates the total sum of the degrees of associations (degrees of concentration of associations) for the individual information elements (sums up the degrees of associations) (S202).
Subsequently, the duplicating unit 153 determines whether there is an information element whose degree of concentration of associations exceeds a threshold value (that is, whose associations are concentrated) (S203). The threshold value should be determined depending on the degree of concentration of associations permitted to a desired information map. For example, assuming that the threshold value is 10, an affirmative determination is made due to the presence of the information element X.
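Operations S202 and S203 can be sketched as a row sum followed by a threshold test. The table values are hypothetical, and excluding each element's entry for itself from the sum is an assumption of this sketch; the threshold of 10 follows the example in the text. The function names are illustrative.

```python
def concentration(table, labels):
    """Sum each element's degrees of association with the other elements."""
    return {
        labels[i]: round(sum(v for j, v in enumerate(row) if j != i), 1)
        for i, row in enumerate(table)
    }

def over_threshold(conc, threshold):
    """List the elements whose associations are considered concentrated."""
    return [name for name, total in conc.items() if total > threshold]

# Hypothetical indirect-path-added degrees of association.
labels = ["X", "A", "B", "C"]
table = [
    [2.0, 4.0, 3.5, 3.0],
    [4.0, 1.5, 1.0, 0.5],
    [3.5, 1.0, 1.5, 0.5],
    [3.0, 0.5, 0.5, 1.2],
]
conc = concentration(table, labels)
concentrated = over_threshold(conc, threshold=10)
```

Here only X (total 10.5) exceeds the threshold, so the determination in S203 would be affirmative.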
If there is an information element whose degree of concentration of associations exceeds a threshold value (S203: YES), the duplicating unit 153 duplicates an information element whose degree of concentration of associations is the highest among information elements whose degrees of concentration of associations exceed the threshold value and the associations of the information element (S204).
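The duplication in operation S204 can be sketched as copying the row and column of the selected information element in the association matrix. The sketch assumes that the duplicate origin and the duplicate target are not associated with each other (their mutual entry is set to 0.0), which the text does not specify; the function name and the example values are hypothetical.

```python
def duplicate_element(table, labels, index):
    """Split element `index` into an origin (name + "1") and a target (name + "2")."""
    n = len(table)
    # Every existing row gains a column for the duplicate target.
    new_table = [row[:] + [row[index]] for row in table]
    # The duplicate target starts with the same associations as the origin.
    new_table.append(table[index][:] + [table[index][index]])
    # Assumption: origin and target are not linked to each other.
    new_table[index][n] = 0.0
    new_table[n][index] = 0.0
    new_labels = labels[:]
    new_labels[index] = labels[index] + "1"
    new_labels.append(labels[index] + "2")
    return new_table, new_labels

# Hypothetical table in which X is the most concentrated element.
labels = ["X", "A", "B"]
table = [
    [0.0, 0.8, 0.6],
    [0.8, 0.0, 0.3],
    [0.6, 0.3, 0.0],
]
dup_table, dup_labels = duplicate_element(table, labels, 0)
```

After the call, X1 and X2 have identical degrees of association with A and B, which is the situation the subsequent duplication-elimination steps resolve.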
In
Subsequently, the duplication eliminating unit 154 selects one of the information elements of the duplicate origin and the duplicate target at random (randomly) and thins out an association whose degree of association is the lowest of the associations of the selected information element (S205). Thus, the duplication of one of associations duplicated due to the duplication is eliminated. The selection of an information element may not be performed at random but by a predetermined method (“selecting a duplicate origin” or the like).
Subsequently, the degree-of-association calculating unit 151 adds the degrees of associations of indirect paths to the degrees of associations of the individual information elements (the degrees of associations of direct associations) on the basis of the table A1 in
Subsequently, the thinning-out unit 155 determines an association to be thinned out on the basis of the indirect-path-added degrees of associations of the table C1 and thins out the association (S207). However, with the thinning out by the thinning-out unit 155, the degree of association of the association to be thinned out is not set at zero at this time (the time in operation S207). This is because, if the degree of association of the association thinned out by the thinning-out unit 155 is brought to zero, information on the indirect path is lost, which makes calculation using information on the indirect path difficult thereafter. Accordingly, the thinning-out unit 155 gives the association to be thinned out information indicating that it is thinned out (for example, flag information).
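One way to keep the degree of association while marking an association as thinned out is a per-association flag, as in the sketch below. The class and field names are hypothetical; the text only states that information such as a flag is attached.

```python
from dataclasses import dataclass

@dataclass
class Association:
    source: str
    target: str
    degree: float
    thinned_out: bool = False  # set at S207 instead of zeroing `degree`

def thin_out(association):
    """Exclude the association from display but keep its degree."""
    association.thinned_out = True

def displayed(associations):
    """Associations that remain objects to be displayed."""
    return [a for a in associations if not a.thinned_out]

# Hypothetical associations of a duplicated element X1.
edges = [Association("X1", "A", 0.8), Association("X1", "B", 0.6)]
thin_out(edges[1])
visible = displayed(edges)
```

Because `edges[1].degree` is still 0.6 after thinning out, a later recalculation of indirect paths can continue to use it.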
For example, if a method in which the thinning-out unit 155 thins out all the associations of the information elements other than the one association whose degree of association is the highest is employed, the result of thinning-out is as shown in
For an information element having a plurality of associations whose degree of association is the highest, the plurality of associations are not thinned out at this point in time. However, the thinning-out method used by the thinning-out unit 155 is not limited to a specific one. The thinning-out may be executed using another known or well-known method. For example, instead of thinning out all associations other than the one whose degree of association is the highest, the associations up to the Nth lowest association may be thinned out. Alternatively, associations up to the Nth lowest association may be thinned out not for each information element but for the set of all the associations. The thinning out method described in Japanese Patent No. 4167855 may be used. Alternatively, for associations duplicated due to duplication, the association with the lower degree of association on the table C1 may be thinned out. If the method of thinning out all the associations other than the one whose degree of association is the highest is adopted, one of the associations duplicated due to duplication is thinned out with high probability. In other words, although elimination of the duplication of associations is the responsibility of the duplication eliminating unit 154, the duplication of associations may also be eliminated by the thinning-out unit 155, intentionally or accidentally. Since elimination of the duplication of associations is one of the conditions for terminating the process, as described in detail below, the process can be expected to speed up when the thinning-out unit 155 also eliminates the duplication of associations.
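The "keep only the strongest association" policy, including the rule that ties for the highest degree are all kept, can be sketched per information element as follows. The table values are hypothetical, and how the per-element results are combined into the final set of displayed edges is left open here because the text does not fix it. The function name is illustrative.

```python
def strongest_partners(table, labels):
    """For each element, list the partners tied for its highest degree of association."""
    kept = {}
    for i, row in enumerate(table):
        others = [(value, j) for j, value in enumerate(row) if j != i and value > 0]
        if not others:
            kept[labels[i]] = []
            continue
        best = max(value for value, _ in others)
        # Every association tied for the highest degree survives.
        kept[labels[i]] = [labels[j] for value, j in others if value == best]
    return kept

# Hypothetical indirect-path-added degrees of association.
labels = ["A", "B", "C", "D"]
table = [
    [0.0, 0.9, 0.4, 0.4],
    [0.9, 0.0, 0.7, 0.7],
    [0.4, 0.7, 0.0, 0.7],
    [0.4, 0.7, 0.7, 0.0],
]
kept = strongest_partners(table, labels)
```

In this example C and D each keep two associations because two partners are tied for the highest degree.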
Subsequently, the degree-of-concentration-of-associations calculating unit 152 calculates the total sum of the indirect-path-added degrees of associations (degree of concentration of associations) for each information element (S208).
Subsequently, if, among the duplicated associations, there are associations for which both the duplicate origin and the duplicate target are not excluded from the object to be displayed, the duplication eliminating unit 154 thins out the association whose degree of association is the lowest among the associations of whichever of the duplicate-origin and duplicate-target information elements has the higher degree of concentration of associations (S209).
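A single pass of operation S209 can be sketched as below. The sketch makes two assumptions that the text does not state: the candidates for thinning out are limited to associations that are still duplicated, and a tie in the degree of concentration is resolved in favor of the duplicate origin. The function name, the dictionary layout, and the values are hypothetical.

```python
def eliminate_one_duplicate(degrees, conc, origin, target):
    """Thin out one still-duplicated association from the more concentrated element.

    degrees: dict (element, partner) -> degree, for displayed associations only
    conc: dict element -> degree of concentration of associations
    Returns the removed (element, partner) pair, or None if nothing is duplicated.
    """
    # Partners still connected to both the duplicate origin and the duplicate target.
    shared = [p for (e, p) in degrees if e == origin and (target, p) in degrees]
    if not shared:
        return None
    # Assumption: a tie in concentration is resolved in favor of the origin.
    heavier = origin if conc[origin] >= conc[target] else target
    weakest = min(shared, key=lambda p: degrees[(heavier, p)])
    del degrees[(heavier, weakest)]
    return heavier, weakest

# Hypothetical indirect-path-added degrees after duplicating X into X1 and X2.
degrees = {
    ("X1", "A"): 1.2, ("X2", "A"): 0.9,
    ("X1", "B"): 0.8, ("X2", "B"): 1.1,
}
conc = {"X1": 6.3, "X2": 5.8}
removed = eliminate_one_duplicate(degrees, conc, "X1", "X2")
```

Here X1 has the higher degree of concentration, so its weakest duplicated association (X1-B) is removed, while the association with A remains duplicated until a later pass.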
In
Subsequently, the degree-of-association calculating unit 151 adds the degrees of associations of indirect paths to the degrees of associations of the individual information elements (the degrees of associations of direct associations) on the basis of the table A1 in
Subsequently, the thinning-out unit 155 determines an association to be thinned out by the same process as in operation S207 and thins out the association (S211). As a result, the table A1 becomes the table shown in
Subsequently, the duplication eliminating unit 154 determines whether there is a duplicated association left on the basis of the table A1 (S212). Specifically, if the degree of one of the associations in one row of the column of the information element X1 and the column of the information element X2 on the table A1 is not zero, the associations of the row are determined to be duplicated.
If there are duplicated associations (S212: YES), the extended thinning-out unit 15 repeats operations S208 to S211 until the duplicated associations are eliminated (until one of the duplicate origins and the duplicate targets of all duplicated associations is excluded from the object to be displayed).
If the duplication of associations is eliminated (S212: NO), the process returns to operation S202, where the degree of concentration of associations is calculated by the degree-of-concentration-of-associations calculating unit 152 on the basis of the table D1. If the highest value of the calculated degree of concentration of associations is smaller than or equal to the threshold value (S203), the process in
If the highest value of the calculated degree of concentration of associations exceeds the threshold value, the process after operation S204 is repeated. Accordingly, an information element whose degree of concentration of associations is the highest is duplicated, and the associations of the information element are distributed to a duplicate origin and a duplicate target. The duplication is repeatedly executed until an information element whose degree of concentration of associations is larger than the threshold value is eliminated, so that concentration of associations is appropriately eliminated. However, the numbers of rows and columns of the table A1 increase in accordance with the duplication of the information elements. Accordingly, in the case where the degree of concentration of associations is calculated on the basis of the indirect-path-added degree of association, there is a possibility that the degree of concentration of associations becomes higher than that before the duplication. Accordingly, in this case, the threshold value in operation S203 may be changed with the number of times of duplication.
Thereafter, in operation S105 described in
As described above, according to an embodiment, an information element (node) to which associations (edges) concentrate is duplicated (divided), and the associations of the information element are distributed to a duplicate origin and a duplicate target. This allows concentration of edges to a specific node to be avoided while leaving an edge corresponding to a strong association.
Furthermore, when an information element is duplicated, associations connected to the information element are also duplicated. At that time, the associations of the duplicated information element X1 and the associations of the information element X2 have completely the same degrees of associations (strengths) (for example, an association X1-A and an association X2-A have the same degree of association). This makes it impossible to determine which of the associations should be given a higher priority for deletion. Thus, an embodiment introduces a degree of association reflecting an indirect path so that a group of information elements having mutually strong associations gathers around the duplicated information element.
Specifically, the degrees of associations (indirect-path-added degrees of associations) of individual information elements that take indirect paths into account are calculated, and the thinning out process by the thinning-out unit 155 is performed on the basis of the indirect-path-added degrees of associations. This allows information elements having mutually strong associations to be gathered around the duplicate origin or the duplicate target. In other words, a node group connected to one node can be divided into node groups having mutually strong associations. For example, for the information elements A, B, C, D, E, and F in
In the elimination of the duplication of associations by the duplication eliminating unit 154, the associations of information elements whose degrees of concentration of associations are higher are thinned out. This allows, for example, the same associations of the duplicate origin and the duplicate target to be thinned out in balance (allows the associations to be distributed to the information element of the duplicate origin and the information element of the duplicate target in balance).
Here, the number of associations that the duplication eliminating unit 154 thins out may be two or more; however, eliminating duplication one by one in the loop of operations S208 to S211, as in an embodiment, can improve the balance of distribution of associations.
Thus, according to an embodiment, a highly legible information map can be created without losing important information, which can contribute to improving the accuracy of document search and analysis, saving time and labor, and improving overall efficiency.
According to an embodiment, a computer-readable medium having a program stored therein causes a computer to execute an information-mapping operation including duplicating a node whose degree of relationship with related elements exceeds a predetermined threshold, and creating an information map by eliminating associations of one or more elements of a duplicate origin and a duplicate target.
In an embodiment, although the process in
While embodiments of the present invention have been described in detail, the invention is not limited to such a specific embodiment, and various modifications and changes can be made without departing from the spirit of the invention described in the claims.
The embodiments can be implemented in computing hardware (computing apparatus) and/or software, such as (in a non-limiting example) any computer that can store, retrieve, process and/or output data and/or communicate with other computers. The results produced can be displayed on a display of the computing hardware. A program/software implementing the embodiments may be recorded on computer-readable media comprising computer-readable recording media. The program/software implementing the embodiments may also be transmitted over transmission communication media. Examples of the computer-readable recording media include a magnetic recording apparatus, an optical disk, a magneto-optical disk, and/or a semiconductor memory (for example, RAM, ROM, etc.). Examples of the magnetic recording apparatus include a hard disk device (HDD), a flexible disk (FD), and a magnetic tape (MT). Examples of the optical disk include a DVD (Digital Versatile Disc), a DVD-RAM, a CD-ROM (Compact Disc—Read Only Memory), and a CD-R (Recordable)/RW.
Further, according to an aspect of the embodiments, any combinations of the described features, functions and/or operations can be provided.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the principles of the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although a few embodiment(s) of the present invention(s) has(have) been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention, the scope of which is defined in the claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
2009-112045 | May 2009 | JP | national |