1. Field of Invention
The present invention relates to the data mining field, and more particularly, to time-series relation mining. According to the present invention, an apparatus and a method for categorizing entities based on time-series relation graphs are provided.
2. Description of Prior Art
With the rapid development of globalization, corporations form more complicated business relations than ever before. Corporations also develop much faster than before, and the other corporations with which a corporation has business relations play a critical role in its development.
Meanwhile, as information technology spreads, a large amount of business news appears in media such as the Internet. This news contains a great deal of information about business relations among corporations, and the news accumulated to date may cover almost all such relations in all industries. Taken together, this information forms a time-series business information process. A technology that allows the business consulting trade to obtain this information, build a time-series business information process from it, and derive relations among industries and sub-industries, together with corresponding business events useful to users (mainly corporate consultants), is therefore promising.
The business relations form a network that varies over time. Once a time-series model is created for this network, the problem is how to find the industry structure from it: how many industries there are, how many sub-industries each industry contains, and which corporation is representative of each industry and each sub-industry.
Generalizing the business relation to a general relation such as a social relation, the problem becomes: given a time-series relation graph, how to determine which nodes belong to a category, how to divide a category into sub-categories, and how to find a representative of each category and each sub-category.
In existing methods, there are technologies for categorizing connection-graph-based relations, such as those described in reference 1, C. H. Ding, X. He, H. Zha, M. Gu, and H. D. Simon, A min-max cut algorithm for graph partitioning and data clustering, Proceedings of IEEE ICDM 2001, pp. 107-114, 2001, and in reference 2, J. Shi and J. Malik, Normalized cut and image segmentation, IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(8): 888-905, August 2000. However, these technologies only apply to simple graphs, and there is no method for categorizing the graphs created for the time-varying business relations.
Further, in detecting business events, there is a technology for detecting important nodes based on time sequence, such as that disclosed in Japanese Patent No. JP 2005-352817. However, there is no technology for detecting events after categorizing a time-series graph into industries.
The present invention creates time-series relation graphs for time-varying relations, performs graph-partition-based categorizing on the time-series relation graphs, and then carries out post-processing, so as to achieve finally categorized nodes and corresponding relations.
Also, when the present invention is applied to the business field, corporations and relations in the business field are further divided in terms of industries based on the categorized nodes and relations, and finally business events are obtained by detecting business events in the individual industries.
To achieve the above object, the present invention provides an apparatus for categorizing entities based on time-series relation graphs, wherein in each of the time-series relation graphs within a prescribed time period, nodes represent entities, and links between nodes represent entity relations in a corresponding time unit, the apparatus for categorizing entities based on time-series relation graphs comprising: a time-series relation graph categorizing means for categorizing the nodes in each of the time-series relation graphs to generate a node category result for the corresponding time unit in time sequence; and a category result post-processing means for post-processing all the node category results for the corresponding time units in time sequence generated by the time-series relation graph categorizing means to generate finally categorized nodes.
Preferably, the apparatus for categorizing entities based on time-series relation graphs further comprises: a time-series relation graph generating means for processing inputted relation instances to generate corresponding time-series relation graphs.
Preferably, the time-series relation graph generating means comprises: a time-series relation generating unit for calculating scores for the relation instances, resolving internal conflicts, performing interpolation on absent time points, to obtain time-series relations; a relation synthesizing unit for synthesizing various types of the time-series relations among entities generated by the time-series relation generating unit to obtain respective time-series comprehensive relations between respective two entities; and a time-series relation graph creating unit for creating one graph for the relations for each time unit within the prescribed time period so as to form the time-series relation graphs.
Preferably, the time-series relation graph categorizing means performs categorization on the nodes in the time-series relation graph for each time unit by using a hierarchical categorizing method.
Preferably, the category result post-processing means comprises: a category result mapping unit for mapping each category of all the node category results for the corresponding time units in time sequence generated by the time-series relation graph categorizing means to obtain a merged node category structure; a node occurrence counting unit for counting, for each category of the merged node category structure, the occurring times of each node therein based on the merged node category structure generated by the category result mapping unit and a mapping relation of each node category result therewith; and a node categorizing unit for allocating each node to a corresponding category of the merged node category structure based on the counting result of the node occurrence counting unit.
Preferably, the category result post-processing means further generates a merged node category result, and the apparatus for categorizing entities based on time-series relation graphs further comprises: an event detecting means for performing event detection on the entity relations based on the merged node category result and outputting event results.
Preferably, the entities are corporations, the relations are business relations, and the categories are industries.
To achieve the above object, the present invention provides a method for categorizing entities based on time-series relation graphs, wherein in each of the time-series relation graphs within a prescribed time period, nodes represent entities, and links between nodes represent entity relations in a corresponding time unit, the method for categorizing entities based on time-series relation graphs comprising: a time-series relation graph categorizing step of categorizing the nodes in each of the time-series relation graphs to generate a node category result for the corresponding time unit in time sequence; and a category result post-processing step of post-processing all the node category results for the corresponding time units in time sequence generated in the time-series relation graph categorizing step to generate finally categorized nodes.
Preferably, the method for categorizing entities based on time-series relation graphs further comprises: a time-series relation graph generating step of processing inputted relation instances to generate corresponding time-series relation graphs.
Preferably, the time-series relation graph generating step comprises: a time-series relation generating sub-step of calculating scores for the relation instances, resolving internal conflicts, performing interpolation on absent time points, to obtain time-series relations; a relation synthesizing sub-step of synthesizing various types of the time-series relations among entities generated in the time-series relation generating sub-step to obtain respective time-series comprehensive relations between respective two entities; and a time-series relation graph creating sub-step of creating one graph for the relations for each time unit within the prescribed time period so as to form the time-series relation graphs.
Preferably, in the time-series relation graph categorizing step, categorization on the nodes in the time-series relation graph for each time unit is performed by using a hierarchical categorizing method.
Preferably, the category result post-processing step comprises: a category result mapping sub-step of mapping each category of all the node category results for the corresponding time units in time sequence generated in the time-series relation graph categorizing step to obtain a merged node category structure; a node occurrence counting sub-step of counting, for each category of the merged node category structure, the occurring times of each node therein based on the merged node category structure generated in the category result mapping sub-step and a mapping relation of each node category result therewith; and a node categorizing sub-step of allocating each node to a corresponding category of the merged node category structure based on the counting result of the node occurrence counting sub-step.
Preferably, in the category result post-processing step, a merged node category result is further generated, and the method for categorizing entities based on time-series relation graphs further comprises: an event detecting step of performing event detection on the entity relations based on the merged node category result and outputting event results.
Preferably, the entities are corporations, the relations are business relations, and the categories are industries.
According to the present invention, the following technical problems are efficiently solved:
Creating the time-series relations from the time-varying relation instances, and categorizing the nodes; and
Performing business event detection based on the time-series business relations and the results of categorizing the same.
The above and further objects, features and advantages of the present invention will be more apparent from the following description of the preferred embodiments thereof with reference to the drawings, wherein:
FIG. 1a is an overall block diagram showing a system for categorizing and analyzing time-series relations;
FIG. 1b is an overall block diagram showing a system for categorizing and analyzing time-series business relations;
FIG. 2a is a block diagram and also a data flow chart showing a time-series relation graph generating module 2;
FIGS. 2b-2e show illustrations of detailed time-series relations and time-series comprehensive relation graphs (hereinafter, the time-series comprehensive relation graph is referred to as “time-series relation graph”) generated by the time-series relation generating unit 21 during processing, wherein
FIG. 3a shows an example of a category result;
FIGS. 3b and 3c show the category result at time point t1 corresponding to
FIG. 4a is a block diagram and also a data flow chart showing a category result post-processing module 4;
FIG. 4b shows a merged category result corresponding to
The preferred embodiments of the present invention are described in detail hereinafter with reference to the drawings. Details and functions which are not necessary for the present invention are omitted so as not to confuse the understanding of the present invention. Further, in the following description, an apparatus and a method for categorizing entities based on time-series relation graphs according to the present invention are described in detail with corporations as an example of the entities and business relations as an example of the relations. It is to be noted, however, that the entities set forth in the present invention are not limited to the corporations, and may represent entities such as natural persons, nations or products. Accordingly, the relations set forth in the present invention are not limited to the business relations, and may be applicable to other social relations such as human relations and relations among nations.
System Overview
FIG. 1a is an overall block diagram showing a system for categorizing and analyzing time-series relations according to the first embodiment of the present invention. The reference symbol 1 denotes inputted relation instances. A time-series relation graph generating module 2 processes the inputted relation instances 1 to generate corresponding time-series relation graphs. A time-series relation graph categorizing module 3 categorizes the time-series relation graphs generated by the time-series relation graph generating module 2 to generate a category result for each time unit in time sequence. A category result post-processing module 4 post-processes the category results generated by the time-series relation graph categorizing module 3 to generate a time-series comprehensive category result and generate finally categorized nodes and relations.
Detailed Description of the Modules
The relation instance 1 means that there is a relation between two entities, and has the following data structure.
For example, in the business field, the entity may represent a corporation, and the type of relation may be competition, cooperation, share holding, supply, incorporation, acquisition and so on. In the following expressions, RI(A,B,X,t′) is used to denote a relation instance, which means that there is a relation instance X between entity A and entity B at time point t′.
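By way of illustration only, the relation instance RI(A,B,X,t′) may be held as a simple record. The following Python sketch is a minimal example; the field names, the `time_unit_of` helper and the choice of a monthly time unit are assumptions made for this example and are not taken from the data structure referred to above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RelationInstance:
    """One observed relation RI(A, B, X, t'). Field names are illustrative."""
    entity_a: str        # entity A
    entity_b: str        # entity B
    relation_type: str   # relation type X, e.g. "competition"
    time_point: date     # time point t' at which the instance was observed


def time_unit_of(instance: RelationInstance) -> tuple[int, int]:
    """Map a time point t' to a (year, month) time unit t (monthly units assumed)."""
    return (instance.time_point.year, instance.time_point.month)


ri = RelationInstance("A", "B", "competition", date(2000, 1, 15))
```

Grouping instances by `time_unit_of` gives the per-time-unit view that the subsequent modules operate on.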
A block diagram and a data flow chart of the time-series relation graph generating module 2 are shown in
Specifically, a time-series relation generating unit 21 calculates scores for the relation instances, resolves internal conflicts, and performs interpolation on absent time points so as to obtain time-series relations. These steps may be implemented by existing methods, such as a business relation mining apparatus and method as described in attorney docket No. IA078650. It is to be noted, however, that the business relation is only an example of the relations involved in the present invention, and is not intended to limit the scope of the present invention. Finally, various types of time-series entity relations with scores are obtained. That is, within a period of a prescribed time unit, there is a type of time-series relation as well as a score thereof between two entities, wherein the score refers to a credibility at which there exists this relation during such time unit. An example of the data structure thereof is shown in Table 2.
$s_{A,B,X}(t)$ is used to denote the score for the business relation X between entity A and entity B in the time unit t.
For example,
A relation synthesizing unit 22 synthesizes the various types of time-series entity relations to obtain time-series comprehensive relations between each pair of entities. $s_{A,B}(t)$ is used to denote the comprehensive relation between two entities. This comprehensive relation is undirected, that is, $s_{A,B}(t)=s_{B,A}(t)$. For example, the comprehensive relation between corporations represents how closely the corporations are associated with each other. The more closely two corporations are associated, the more likely they are to belong to the same industry or sub-industry. The comprehensive relations may be calculated by accumulating the various types of relations using summing methods or weighted summing methods. The calculating formula is shown as follows.
Wherein $f_X(\cdot)$ is any monotonically increasing or monotonically decreasing function corresponding to relation X, and $g(\cdot)$ is any monotonically increasing function for standardizing or normalizing the final score.
An example of the above function is provided as follows.
Wherein $w(X)$ is the weight of the respective relation, which may be an empirical value or may be obtained by a statistical method. For example, the probability that a relation occurs may be counted and used as the weight.
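As a non-limiting illustration, a weighted-sum synthesis may be sketched as follows. The sketch assumes that each per-type function simply scales the score by w(X) and that the normalizing function clips the sum to [0, 1] (a non-decreasing normalizer); these choices, the function name `comprehensive_score` and the example weights are assumptions made only for this example.

```python
def comprehensive_score(type_scores, weights):
    """Weighted sum of per-type scores, clipped to [0, 1].

    type_scores: {relation type X: s_{A,B,X}(t)} for one pair and one time unit.
    weights:     {relation type X: w(X)}; types without a weight count as 0.
    """
    total = sum(weights.get(x, 0.0) * s for x, s in type_scores.items())
    return max(0.0, min(1.0, total))  # clipping normalizer (an assumption)


# Hypothetical weights and scores for one pair (A, B) in one time unit.
weights = {"cooperation": 0.5, "competition": 0.3, "supply": 0.2}
scores_t = {"cooperation": 0.8, "competition": 0.4}
s_ab = comprehensive_score(scores_t, weights)
```

Because the inputs are keyed only by relation type, the same value serves for both orderings of the pair, which matches the undirected nature of the comprehensive relation.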
Another example is provided as follows.
A time-series relation graph creating unit 23 creates one graph for the relations for each time unit within the range of the time sequence. The nodes of the graph are the entities, the links between the nodes represent the time-series comprehensive relations between the respective two entities, and the weights of the respective links are the scores of the time-series comprehensive relations between the respective two entities. Thus, an undirected graph with weights is generated for each time unit.
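The graph creation described above may be sketched as follows. This is a minimal example that stores each undirected weighted graph as a dictionary keyed by an unordered node pair; the record layout, the function name and the rule of skipping zero scores are assumptions.

```python
from collections import defaultdict


def build_time_series_graphs(comprehensive_relations):
    """Group (entity_a, entity_b, time_unit, score) records into one graph per time unit.

    Each graph is {frozenset({a, b}): weight}, so the link A-B equals the link B-A.
    """
    graphs = defaultdict(dict)
    for a, b, t, score in comprehensive_relations:
        if a != b and score > 0:  # skip self-links and absent relations (assumption)
            graphs[t][frozenset((a, b))] = score
    return dict(graphs)


# Hypothetical comprehensive relation scores for two monthly time units.
records = [
    ("A", "B", "2000-01", 0.52),
    ("B", "C", "2000-01", 0.30),
    ("A", "B", "2000-02", 0.61),
]
graphs = build_time_series_graphs(records)
```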
For example,
The time-series relation graph categorizing module 3 performs categorization on the nodes in the time-series relation graph for each time unit by using a hierarchical categorizing method. For example, a graph-bipartition-based categorization may be performed on the graph for each time unit by using existing graph based categorizing methods. The existing methods comprise, for example, those described in reference 1, C. H. Ding, X. He, H. Zha, M. Gu, and H. D. Simon, A min-max cut algorithm for graph partitioning and data clustering, Proceedings of IEEE ICDM 2001, pp. 107-114, 2001, and in reference 2, J. Shi and J. Malik, Normalized cut and image segmentation, IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(8): 888-905, August 2000. The category result is a bipartite structure of multiple levels.
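The hierarchical bipartition may be illustrated by the following sketch, which recursively splits a small graph by exhaustively searching for the bipartition with the smallest normalized cut value, cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V), the objective of reference 2. Exhaustive search is substituted here for the solvers of the references and is practical only for a handful of nodes; the function names and the example graph are assumptions.

```python
from itertools import combinations


def ncut(part_a, part_b, weights):
    """Normalized cut value of a bipartition of the subgraph on part_a | part_b."""
    cut = assoc_a = assoc_b = 0.0
    for edge, w in weights.items():
        u, v = tuple(edge)
        in_a = (u in part_a) + (v in part_a)
        in_b = (u in part_b) + (v in part_b)
        if in_a + in_b != 2:
            continue  # edge leaves the subgraph being split
        if in_a == 1:
            cut += w  # one endpoint on each side
        assoc_a += w * in_a
        assoc_b += w * in_b
    if assoc_a == 0 or assoc_b == 0:
        return float("inf")
    return cut / assoc_a + cut / assoc_b


def best_bipartition(nodes, weights):
    """Try every split of `nodes` (at least two nodes) and keep the lowest ncut."""
    nodes = sorted(nodes)
    first, rest = nodes[0], nodes[1:]
    best = None
    for k in range(len(rest)):  # part_a always holds `first`; part_b is never empty
        for extra in combinations(rest, k):
            part_a = {first, *extra}
            part_b = set(nodes) - part_a
            value = ncut(part_a, part_b, weights)
            if best is None or value < best[0]:
                best = (value, part_a, part_b)
    return best


def hierarchical_bipartition(nodes, weights, depth):
    """Return a nested (left, right) tree of sorted node lists, `depth` levels deep."""
    nodes = set(nodes)
    if depth == 0 or len(nodes) < 2:
        return sorted(nodes)
    value, part_a, part_b = best_bipartition(nodes, weights)
    if value == float("inf"):
        return sorted(nodes)  # no links left to cut
    return (hierarchical_bipartition(part_a, weights, depth - 1),
            hierarchical_bipartition(part_b, weights, depth - 1))


# Two tightly linked triangles joined by one weak link C-D (hypothetical data).
example = {frozenset(p): 1.0 for p in ["AB", "AC", "BC", "DE", "DF", "EF"]}
example[frozenset("CD")] = 0.1
tree = hierarchical_bipartition("ABCDEF", example, depth=1)
```

With `depth=1` the sketch yields a single split; larger depths split each part again and so give a multi-level structure.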
In the category result as shown in
FIGS. 3b and 3c show the category result at time point t1 corresponding to
The category result post-processing module 4 post-processes the time-series category results generated by the time-series relation graph categorizing module 3. It comprehensively processes the category results for all the time units within the prescribed time period to obtain the category result for the prescribed time period.
Specifically,
For each time unit within the prescribed time period, there is one category result such as one shown in
A category result mapping unit 41 maps each category of the n category graphs by using, for example, a Kuhn-Munkres algorithm (L. Lovasz and M. Plummer, Matching Theory), and finally obtains a category structure merged from the n graphs.
A node occurrence counting unit 42 counts the number of times each node occurs in each category of the merged category structure, based on the category structure generated by the category result mapping unit 41 and the mapping relation of each category graph to it.
A node categorizing unit 43 allocates each node to a corresponding category of the merged category structure based on the counting result of the node occurrence counting unit 42.
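The three post-processing units may be sketched together as follows. The sketch makes several simplifying assumptions: each time unit yields a flat list with the same number of categories, the first time unit serves as the reference structure, and categories are matched by brute-force search over permutations (maximizing node overlap) instead of the Kuhn-Munkres algorithm named above. All function names are illustrative.

```python
from collections import Counter, defaultdict
from itertools import permutations


def map_to_reference(reference, result):
    """Return perm where result[i] maps to reference[perm[i]], maximizing overlap."""
    k = len(reference)
    return max(
        permutations(range(k)),
        key=lambda perm: sum(len(result[i] & reference[perm[i]]) for i in range(k)),
    )


def merge_category_results(results):
    """results: one list of k node sets per time unit. Returns {node: category index}."""
    reference = results[0]
    counts = defaultdict(Counter)  # node -> occurrences per merged category
    for result in results:
        perm = map_to_reference(reference, result)
        for i, category in enumerate(result):
            for node in category:
                counts[node][perm[i]] += 1
    # Allocate each node to the merged category in which it occurs most often.
    return {node: c.most_common(1)[0][0] for node, c in counts.items()}


# Hypothetical category results for three time units; C drifts at the second one.
results = [
    [{"A", "B", "C"}, {"D", "E"}],
    [{"C", "D", "E"}, {"A", "B"}],
    [{"A", "B", "C"}, {"D", "E"}],
]
merged = merge_category_results(results)
```

Node C is placed with A and B because it occurs there in two of the three time units, which is the effect of the counting and allocation steps.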
FIG. 4b shows the merged comprehensive category result corresponding to
Example of Categorizing and Analyzing Business Relations
FIG. 1b is an overall block diagram showing a system for categorizing and analyzing time-series business relations. In
The business events 7 refer to high-level events derived from an industry analyzing perspective, which have heuristic meanings for users or other corporations. For example, corporation A was a core corporation in its industry from January 1998 to January 2001; corporation B had developed rapidly from January 1999 to January 2000 and so on.
An industry classifying unit 61 divides all the relations and nodes in terms of industries for each time unit, selects the time-series category results according to an industry subdividing threshold, and for each category (each industry), classifies all the nodes and links in the time-series relation graphs to classify all the corporations and business relations into the respective industries.
A corporation importance calculating unit 62 calculates, for each industry within each time unit, the importance of each corporation in the industry. Existing algorithms may be adopted, such as the PageRank method or the HITS algorithm, or any other feasible method.
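As one possible illustration of the importance calculation, the following sketch runs a weighted PageRank by power iteration on the undirected graph of one industry in one time unit. The damping factor, the iteration count, the handling of isolated nodes and the example graph are assumptions; any other feasible method could be used instead.

```python
def pagerank(nodes, weights, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration on an undirected graph.

    weights: {frozenset({u, v}): weight}, with u and v both in `nodes`.
    Isolated nodes spread their rank uniformly over all nodes.
    """
    nodes = sorted(nodes)
    n = len(nodes)
    neighbours = {u: {} for u in nodes}
    for edge, w in weights.items():
        u, v = tuple(edge)
        neighbours[u][v] = w
        neighbours[v][u] = w
    strength = {u: sum(neighbours[u].values()) for u in nodes}
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        dangling = sum(rank[u] for u in nodes if strength[u] == 0)
        new_rank = {}
        for v in nodes:
            inflow = sum(rank[u] * w / strength[u] for u, w in neighbours[v].items())
            new_rank[v] = (1 - damping) / n + damping * (inflow + dangling / n)
        rank = new_rank
    return rank


# Hypothetical industry in which corporation H is linked to A, B and C.
industry = {frozenset(("H", x)): 1.0 for x in "ABC"}
importance = pagerank("HABC", industry)
```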
A business event detecting unit 63 selects, for each industry within each time unit, only the corporations and business relations of the industry, and detects the business events in conjunction with the corporation importances.
Specifically,
$s_A(t)$ is used to denote the importance of corporation A in a certain industry at time t.
If the importance of corporation A in a certain industry satisfies $s_A(t) > Th_1$ for all $t_0 \le t \le t_1$, then A is a key corporation in the certain industry from t0 to t1;
For corporation A in a certain industry, if
then A has developed rapidly in the certain industry from t0 to t1;
For corporation A in a certain industry, if
then there is something wrong with A in the certain industry from t0 to t1;
For corporations A and B in a certain industry, if
then the relation between A and B has developed rapidly from t0 to t1;
For corporations A and B in a certain industry, if
then the relation between A and B has deteriorated from t0 to t1.
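Of the rules listed above, the key-corporation rule is fully stated in the text, and it may be sketched as follows. The sketch scans one corporation's importance series and reports each maximal run of consecutive time units in which the importance exceeds the threshold. The function name, the integer time units and the example values are assumptions; the conditions of the remaining rules are not reproduced here.

```python
def key_intervals(importance, threshold):
    """Return maximal (t0, t1) runs of consecutive time units with importance > threshold.

    importance: list of (time_unit, score) pairs sorted by time unit.
    """
    intervals, start, last = [], None, None
    for t, score in importance:
        if score > threshold:
            if start is None:
                start = t  # a new run begins
            last = t
        elif start is not None:
            intervals.append((start, last))  # the run ended at the previous unit
            start = None
    if start is not None:
        intervals.append((start, last))
    return intervals


# Hypothetical importance series of corporation A in one industry.
series = [(1, 0.2), (2, 0.7), (3, 0.8), (4, 0.3), (5, 0.9)]
events = key_intervals(series, threshold=0.5)
```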
The present invention is described with reference to the preferred embodiments thereof. It is to be understood that, for those skilled in the art, various changes, replacements and additions may be made thereto without departing from the spirit and scope of the invention. Therefore, the scope of the present invention is not limited to those embodiments described above, and is only defined by the appended claims.
Appendix
* relevant contents of attorney docket No. IA078650 (
Time-series Corporation Relation Extracting Sub-module 22″
A corporation business relation instance strength calculating unit 221″ calculates a strength SI(A,B,X,t) of the corporation business relation of A, B, X within a corresponding time unit of t based on each corporation business relation instance RI(A,B,X,t′).
Within the time unit of t, the corporation business relation instance A, B, X may occur several times. For example, it may be mentioned on different news websites, and may be mentioned several times within t. $C_t$ is used to denote the number of times the corporation business relation instance occurs within the time unit of t. Thus, SI(A,B,X,t) may be calculated by the following equation.
$$SI(A,B,X,t) = \sum_{i=1}^{C_t} ms(n_i)$$
where $n_i$ is the corresponding i-th instance and $ms(n_i)$ is the matching score of the news of this instance. In effect, the strength is the sum of the scores of all the instances within the time unit of t.
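The summation described above may be sketched as follows. The record layout, the function name and the example matching scores are assumptions made for illustration.

```python
from collections import defaultdict


def instance_strengths(matched_instances):
    """Sum matching scores per (A, B, X, time unit).

    matched_instances: iterable of (a, b, relation_type, time_unit, matching_score).
    """
    strengths = defaultdict(float)
    for a, b, x, t, ms in matched_instances:
        strengths[(a, b, x, t)] += ms
    return dict(strengths)


# Hypothetical news matches: two mentions in January 2000, one in February 2000.
news = [
    ("A", "B", "competition", "2000-01", 0.5),
    ("A", "B", "competition", "2000-01", 0.25),
    ("A", "B", "competition", "2000-02", 0.75),
]
si = instance_strengths(news)
```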
A time-series interpolating unit 222″ calculates, by interpolation, a score for a corporation relation at time points within a prescribed period at which no corporation business relation instance occurs, so that every continuous relation between any corporations has a score at every time point within the prescribed period. A continuous corporation relation is one that continues for a period, rather than a one-time event-like relation. For example, competition, cooperation, share holding and supply are all continuous business relations. For example, no competition relation instance between corporation A and corporation B occurred in June 2000, but this relation had occurred before, in January 2000. The score in June 2000 is then calculated by interpolation using the preceding score of this relation. For example, the method for performing interpolation is as follows.
It is assumed that a relation RI between two corporations first occurs at $t_0$, and last occurs at $t_m$.
For calculating the corporation relation strength at a time point $t_n$ ($t_0 \le t_n \le t_m$), it is assumed that the instance occurring just before $t_n$ occurs at $t_k$, and the instance occurring just after $t_n$ occurs at $t_l$, then
In the above example, the score of the relation exponentially decreases or increases over time. However, as is well-known to those skilled in the art, the variation may be linear decrease or increase over time.
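One way in which a score may vary exponentially between two neighbouring instances is log-linear interpolation, sketched below. This is only an assumed form chosen for illustration and is not the interpolation formula of the referenced application; the function name, the numeric time axis and the requirement that both neighbouring scores be positive are likewise assumptions.

```python
def interpolate_score(t_n, t_k, s_k, t_l, s_l):
    """Log-linear (exponential) interpolation of a score at t_n, with t_k < t_n < t_l.

    Assumes s_k > 0 and s_l > 0; times are numbers such as month indices.
    """
    fraction = (t_n - t_k) / (t_l - t_k)
    return s_k * (s_l / s_k) ** fraction


# Hypothetical scores: 0.8 at month 0 and 0.2 at month 4, interpolated at month 2.
mid_score = interpolate_score(2, 0, 0.8, 4, 0.2)
```

A linear variant would replace the last line of the function with `s_k + (s_l - s_k) * fraction`.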
An event-like business relation and conflict processing unit 223″ processes the event-like business relations. An event-like business relation is a one-time event rather than a continuous business relation. For example, incorporation and acquisition are both event-like business relations, while competition, cooperation, share holding and supply are all continuous business relations. The process comprises processing of the scores of such relations per se, processing upon conflict, and processing of other affected relations. For example, the processing method is as follows.
First, the problem of conflict is handled. The solution of conflict is as follows.
Time conflict: Theoretically, the event-like relation should occur only once. However, the information on the Internet is not completely reliable. Therefore, there may be a conflict. If there is a conflict, that is, there are both RI(A,B,X,t1) and RI(A,B,X,t2) (t1<t2), then an adjusted new corporation relation strength is:
$$s_{A,B,X}(t_1) = si_{A,B,X}(t_1) + si_{A,B,X}(t_2)$$
$$s_{A,B,X}(t_2) = 0.$$
Direction conflict: The direction conflict deals specifically with directional event-like relations such as acquisition. For such relations, there is only one correct direction for two corporations. When there are both RI(A,B,X,t1) and RI(B,A,X,t2) (t1<t2), if
$$s_{A,B,X}(t_1) \ge s_{B,A,X}(t_2),$$
then
$$s_{A,B,X}(t_1) = s_{A,B,X}(t_1), \quad s_{B,A,X}(t_2) = 0;$$
otherwise
$$s_{A,B,X}(t_1) = 0, \quad s_{B,A,X}(t_2) = s_{B,A,X}(t_2).$$
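The two conflict rules may be sketched as follows. The sketch keeps all scores in one dictionary keyed by (first corporation, second corporation, relation type, time) and updates it in place; this layout, the function names and the example values are assumptions.

```python
def resolve_time_conflict(scores, a, b, x, t1, t2):
    """Fold the later duplicate of an event-like relation into the earlier one (t1 < t2)."""
    scores[(a, b, x, t1)] = scores.get((a, b, x, t1), 0.0) + scores.get((a, b, x, t2), 0.0)
    scores[(a, b, x, t2)] = 0.0


def resolve_direction_conflict(scores, a, b, x, t1, t2):
    """Keep only the stronger direction of a directional event-like relation."""
    if scores[(a, b, x, t1)] >= scores[(b, a, x, t2)]:
        scores[(b, a, x, t2)] = 0.0
    else:
        scores[(a, b, x, t1)] = 0.0


# Hypothetical reports: "A acquires B" at times 1 and 3, "B acquires A" at time 2.
scores = {
    ("A", "B", "acquisition", 1): 0.5,
    ("A", "B", "acquisition", 3): 0.25,
    ("B", "A", "acquisition", 2): 0.5,
}
resolve_time_conflict(scores, "A", "B", "acquisition", 1, 3)
resolve_direction_conflict(scores, "A", "B", "acquisition", 1, 2)
```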
Next, the influences on other business relations are handled. If X is a relation of incorporation or acquisition and $s_{A,B,X}(t_1) > TH$, where TH is a predetermined threshold, then A and B are incorporated into one corporation after t1, and there is no continuous relation maintained between A and B. After incorporation, the scores of the relations between corporation A (B) and other corporations are adjusted as follows.
$$s_{A',C,X}(t) = s_{A,C,X}(t) + s_{B,C,X}(t)$$
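The score adjustment after incorporation may be sketched as follows. As a simplification, the sketch rewrites only the first corporation of each key and treats every time later than the merge time as "after incorporation"; the dictionary layout, the function name and the example values are assumptions.

```python
def merge_entities(scores, a, b, merged, t_merge):
    """Re-key scores so that, after t_merge, relations of a and b belong to `merged`.

    scores: {(entity, other, relation_type, t): score}.
    """
    adjusted = {}
    for (e, other, x, t), s in scores.items():
        if t > t_merge and e in (a, b):
            if other in (a, b):
                continue  # no continuous relation is kept between A and B
            key = (merged, other, x, t)
        else:
            key = (e, other, x, t)
        adjusted[key] = adjusted.get(key, 0.0) + s  # sums the A-C and B-C scores
    return adjusted


# Hypothetical scores; A and B are incorporated into A' at time 2.
before = {
    ("A", "C", "supply", 5): 0.25,
    ("B", "C", "supply", 5): 0.5,
    ("A", "B", "cooperation", 5): 0.75,
    ("A", "C", "supply", 1): 0.125,
}
after = merge_entities(before, "A", "B", "A'", t_merge=2)
```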
After completing the above process, the event-like business relation and conflict processing unit 223″ outputs the time-series scored corporation business relation 32″.
A time-series comprehensive corporation business relation score calculating unit 224″ calculates the time-series comprehensive business relation score between two corporations and the average total business relation score (in the invention of the attorney docket No. IA078649, there is no need to calculate the time-series comprehensive business relation score, and the calculation of the time-series comprehensive entity relations is achieved by the relation synthesizing unit 22). Specifically, a weighted average of the scores of the various relations is calculated so as to obtain the time-series comprehensive business relation score, that is
$$s_{A,B}(t) = \sum_{X} w(X) \cdot s_{A,B,X}(t)$$
where $w(X)$ is the weight of the respective relation, which may be an empirical value or may be obtained by a statistical method. For example, the probability that a relation occurs in each industry may be counted and used as the weight. Thereafter, the total business relation score is obtained by averaging over all the time units. After the process described above, the time-series comprehensive corporation business relation score calculating unit 224″ outputs the time-series comprehensive corporation business relation score 33″.
Number | Date | Country | Kind |
---|---|---|---|
2007-10169206.7 | Nov 2007 | CN | national |