GRAPH GENERATING METHOD, GRAPH GENERATING PROGRAM AND DATA MINING SYSTEM

Information

  • Patent Application
  • 20070203870
  • Publication Number
    20070203870
  • Date Filed
    July 21, 2006
  • Date Published
    August 30, 2007
Abstract
The object of the invention is to obtain, at a high rate of success, graphs indicating the relationships between variables that represent the states of observed items subject to data mining, and to improve the reliability of the outputted graphs. A method for generating a graph showing the relationships between variables comprises a step S2 of establishing the number of graphs to be generated, a step S5 of randomly establishing an order of the variables X forming the set of all variables V, a step S6 of performing a process of reconstructing a graph showing the relationships between variables, and a step S10 of outputting a comprehensive graph including all edges existing in any of the generated graphs. In the graph reconstruction process, an inverse matrix of the correlation coefficient matrix is calculated, and the operation of determining the conditional independence of the two variables under test is skipped if either of the diagonal elements of the inverse matrix relating to those two variables is greater than a predetermined threshold value.
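The skip condition described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names `invert` and `should_skip_ci_test`, the list-of-lists matrix representation, and the threshold of 10.0 used in the example are all assumptions made for illustration. The sketch builds the correlation submatrix for the two variables under test plus the conditioning set, inverts it, and reports whether either of the two relevant diagonal elements exceeds the threshold.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def invert(matrix: Matrix) -> Matrix:
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment a copy of the matrix with the identity matrix.
    aug = [list(row) + [1.0 if r == c else 0.0 for c in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def should_skip_ci_test(corr: Matrix, x: int, y: int,
                        cond: Sequence[int], threshold: float) -> bool:
    """Return True when the conditional independence test should be skipped.

    The variable sequence is (x, y, *cond), so the diagonal elements for the
    first and second variable are at positions 0 and 1 of the inverse matrix.
    """
    order = [x, y, *cond]
    sub = [[corr[a][b] for b in order] for a in order]
    inv = invert(sub)
    return inv[0][0] > threshold or inv[1][1] > threshold


# Hypothetical correlation matrix: variables 0 and 2 are almost collinear.
corr = [
    [1.00, 0.50, 0.99],
    [0.50, 1.00, 0.50],
    [0.99, 0.50, 1.00],
]
print(should_skip_ci_test(corr, 0, 1, [2], threshold=10.0))
print(should_skip_ci_test(corr, 0, 1, [], threshold=10.0))
```

A diagonal element of the inverse correlation matrix grows without bound as the corresponding variable approaches a linear combination of the others in the sequence, so a large value signals that the test would rest on a nearly singular matrix. How a fully singular matrix should be handled is not stated in this section; the sketch simply raises an error.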
Description

BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a diagram showing an example of an independent directed acyclic graph.

FIG. 2 is a diagram showing an example of an independent directed acyclic graph with partial regression coefficients appended.

FIG. 3 is a diagram showing orientation rules.

FIG. 4 is a diagram showing an example of an undirected graph generated in the process of generating an independent directed acyclic graph.

FIG. 5 is a diagram showing an example of a partially undirected graph generated in the process of generating an independent directed acyclic graph.

FIG. 6 is a flow chart showing the algorithm for a graph generating method according to Embodiment 1.

FIG. 7 is a diagram showing an example of a comprehensive graph with the probability of existence of each edge added.

FIG. 8 is a flow chart showing an algorithm for a relational graph reconstruction process.

FIG. 9 is a flow chart showing an algorithm for an edge elimination process based on conditional independence determination.

FIG. 10 is a flow chart showing an algorithm for an edge elimination process based on conditional independence determination.

FIG. 11 is a diagram showing an example of the structure of a system for performing data mining using the graph generating method of the present invention.


Claims
  • 1. A graph generating method for outputting a relationship between variables, comprising: a step of establishing nodes corresponding to all variables in a given set of all variables and establishing a completely undirected graph formed by connecting all pairs of nodes with an undirected edge; a step of selecting a first variable and a second variable from the set of all variables formed from the variables arranged in a predetermined order, and selecting a partial set given as the null set or a set consisting of at least one variable other than said first variable and said second variable; a step of determining whether said first variable and said second variable are conditionally independent when given said partial set, and if conditionally independent, deleting the undirected edge connecting the node corresponding to said first variable and the node corresponding to said second variable; a step of converting undirected edges to arrows based on a determination relating to V-structures; and a step of converting undirected edges to arrows based on at least one orientation rule; wherein an inverse matrix of a correlation coefficient matrix is calculated for a variable sequence consisting of said first variable and said second variable which are the subject of the conditional independence determination and said partial set used in the conditional independence determination, and the operation of determining the conditional independence of said first variable and said second variable is skipped when the diagonal element relating to said first variable in said inverse matrix is greater than a predetermined threshold value or the diagonal element relating to said second variable in said inverse matrix is greater than the predetermined threshold value.
  • 2. A graph generating method comprising: a step of establishing a number of graphs to be generated; a step of randomly establishing the order of variables forming a given set of all variables each time a graph is generated; a step of establishing nodes corresponding to all variables in the set of all variables and establishing a completely undirected graph formed by connecting all pairs of nodes with an undirected edge; a step of selecting a first variable and a second variable from the set of all variables formed of variables arranged in the established order and selecting a partial set given as the null set or a set consisting of at least one variable other than said first variable and said second variable; a step of determining whether or not said first variable and said second variable are conditionally independent when given said partial set, and if conditionally independent, deleting the undirected edge connecting the node corresponding to said first variable and the node corresponding to said second variable; a step of converting undirected edges to arrows based on a determination relating to V-structures; a step of converting undirected edges to arrows based on at least one orientation rule; and a step of outputting a comprehensive graph including all edges present on any of the graphs generated to express the relationships between variables for each graph generated.
  • 3. A graph generating method in accordance with claim 2, comprising a step of calculating a probability of existence obtained by dividing the cumulative number of times each edge exists in a graph by the predetermined number of times in which the set of graphs are generated; wherein the probability of existence corresponding to each existing edge is shown on the outputted comprehensive graph.
  • 4. A graph generating method in accordance with claim 2, comprising: a step of calculating, for each edge, at least the cumulative number of undirected edges, the cumulative number of arrows pointing in a first direction and the cumulative number of arrows pointing in a second direction opposite to the first direction; and a step of calculating, for each edge, the probability of existence corresponding to each type of edge obtained by dividing the cumulative number of undirected edges, the cumulative number of arrows pointing in the first direction and the cumulative number of arrows pointing in the second direction by the number of graphs generated; wherein the outputted comprehensive graph indicates the type of edge having the highest probability of existence and the probability of existence of that type of edge.
  • 5. A graph generating program for outputting a graph showing the relationships between variables; the program performing: a step of establishing nodes corresponding to all variables in a given set of all variables and establishing a completely undirected graph formed by connecting all pairs of nodes with an undirected edge; a step of selecting a first variable and a second variable from the set of all variables formed from the variables arranged in a predetermined order, and selecting a partial set given as the null set or a set consisting of at least one variable other than said first variable and said second variable; a step of determining whether said first variable and said second variable are conditionally independent when given said partial set, and if conditionally independent, deleting the undirected edge connecting the node corresponding to said first variable and the node corresponding to said second variable; a step of converting undirected edges to arrows based on a determination relating to V-structures; and a step of converting undirected edges to arrows based on at least one orientation rule; wherein an inverse matrix of a correlation coefficient matrix is calculated for a variable sequence consisting of said first variable and said second variable which are the subject of the conditional independence determination and said partial set used in the conditional independence determination, and the operation of determining the conditional independence of said first variable and said second variable is skipped when the diagonal element relating to said first variable in said inverse matrix is greater than a predetermined threshold value or the diagonal element relating to said second variable in said inverse matrix is greater than the predetermined threshold value.
  • 6. A graph generating program performing: a step of establishing a number of graphs to be generated; a step of randomly establishing the order of variables forming a given set of all variables each time a graph is generated; a step of establishing nodes corresponding to all variables in the set of all variables and establishing a completely undirected graph formed by connecting all pairs of nodes with an undirected edge; a step of selecting a first variable and a second variable from the set of all variables formed of variables arranged in the established order and selecting a partial set given as the null set or a set consisting of at least one variable other than said first variable and said second variable; a step of determining whether or not said first variable and said second variable are conditionally independent when given said partial set, and if conditionally independent, deleting the undirected edge connecting the node corresponding to said first variable and the node corresponding to said second variable; a step of converting undirected edges to arrows based on a determination relating to V-structures; a step of converting undirected edges to arrows based on at least one orientation rule; and a step of outputting a comprehensive graph including all edges present on any of the graphs generated to express the relationships between variables for each graph generated.
  • 7. A graph generating program in accordance with claim 6, wherein the program performs a step of calculating a probability of existence obtained by dividing the cumulative number of times each edge exists in a graph by the predetermined number of times in which the set of graphs are generated; wherein the probability of existence corresponding to each existing edge is shown on the outputted comprehensive graph.
  • 8. A graph generating program in accordance with claim 6, wherein the program performs: a step of calculating, for each edge, at least the cumulative number of undirected edges, the cumulative number of arrows pointing in a first direction and the cumulative number of arrows pointing in a second direction opposite to the first direction; and a step of calculating, for each edge, the probability of existence corresponding to each type of edge obtained by dividing the cumulative number of undirected edges, the cumulative number of arrows pointing in the first direction and the cumulative number of arrows pointing in the second direction by the number of graphs generated; wherein the outputted comprehensive graph indicates the type of edge having the highest probability of existence and the probability of existence of that type of edge.
  • 9. A data mining system for generating a graph indicating relationships between variables indicating states of observed items from a group of observed data; comprising: input means for inputting at least observed data and a number of graphs to be generated; operation means for generating a plurality of graphs while randomly establishing the order of variables forming a given set of all variables each time a graph is generated, calculating a probability of existence obtained by dividing the cumulative number of times each edge exists in a graph by the predetermined number of times in which the set of graphs are generated, and outputting data relating to the structure of a graph showing the relationships between variables and probabilities of existence of edges; memory means for storing at least observed data, the number of graphs to be generated, data relating to the structures of the graphs and probabilities of existence of the edges, and offering a workspace for performing numerical operations; and display means for displaying a graph at least based on the outputted data; wherein the edges whose probability of existence is greater than 0 are all displayed on said display means in a comprehensive graph showing the relationships between variables.
  • 10. A data mining system in accordance with claim 9, wherein the probabilities of existence are appended to the edges on said display means.
  • 11. A data mining system in accordance with claim 9, wherein the thicknesses of the edges or the colors of the edges are changed depending on the probabilities of existence on said display means.
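The edge statistics recited in claims 3, 4 and 9 can be sketched as follows. This is an illustrative sketch only: the function name `summarize_edges`, the dictionary-based graph representation, and the edge-type labels "--", "->" and "<-" are assumptions, not part of the claims. Each generated graph maps a node pair to an edge type; the sketch counts each type per edge across all graphs and divides by the number of graphs generated.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str]   # node pair, stored in sorted order (assumed convention)
Graph = Dict[Edge, str]  # "--" undirected, "->" first to second, "<-" second to first


def summarize_edges(graphs: List[Graph]) -> Dict[Edge, dict]:
    """Compute per-edge probabilities of existence over a set of graphs."""
    total = len(graphs)
    counts: Dict[Edge, Counter] = defaultdict(Counter)
    for graph in graphs:
        for edge, kind in graph.items():
            counts[edge][kind] += 1  # cumulative count per edge type

    summary = {}
    for edge, kinds in counts.items():
        # Most frequent edge type; ties go to the type seen first.
        best_kind, best_count = max(kinds.items(), key=lambda kv: kv[1])
        summary[edge] = {
            "existence": sum(kinds.values()) / total,   # claim 3 style
            "best_type": best_kind,                     # claim 4 style
            "best_type_probability": best_count / total,
        }
    return summary


# Four hypothetical graphs, one per random variable ordering.
graphs = [
    {("A", "B"): "->", ("B", "C"): "--"},
    {("A", "B"): "->"},
    {("A", "B"): "--"},
    {("A", "B"): "->"},
]
for edge, stats in summarize_edges(graphs).items():
    print(edge, stats)
```

Every edge that appears in the returned dictionary has a probability of existence greater than 0, which matches the set of edges that claim 9 says is displayed in the comprehensive graph. The per-edge probabilities could then drive the labels, thicknesses or colors mentioned in claims 10 and 11.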
Priority Claims (1)
Number Date Country Kind
2006-027247 Feb 2006 JP national