1. Field of the Invention
Methods for Markov boundary discovery are important recent developments in pattern recognition and applied statistics, primarily because they offer a principled solution to the variable/feature selection problem and provide insight into local causal structure. The present invention is a novel method to discover Markov boundaries from datasets that may contain hidden (i.e., unmeasured or unobserved) variables. In general, the inventive method transforms a dataset with many variables into a minimal reduced dataset in which every remaining variable is needed for optimal prediction of some response variable. For example, medical researchers have been trying to identify the genes responsible for human diseases by analyzing samples from patients and controls with gene expression microarrays. However, they have been frustrated in their attempts to identify the critical elements because of the highly complex pattern of expression results obtained, often with thousands of genes associated with the phenotype. A method has been discovered to transform a gene expression microarray dataset with thousands of genes into a much smaller dataset containing only the genes that are necessary for optimal prediction of the phenotypic response variable. Likewise, the invention described in this patent document can transform a dataset containing the frequencies of thousands of words and terms used in articles into a much smaller dataset with only the words/terms that are necessary for optimal prediction of the subject category of each article.
The power of the invention is first demonstrated on data simulated from Bayesian networks from several problem domains, where the invention identifies Markov boundaries more accurately than the baseline comparison methods. The broad applicability of the invention is subsequently demonstrated on 13 real datasets from a diversity of application domains, where the inventive method identifies Markov boundaries of the response variable with higher median classification performance than the baseline comparison methods.
2. Description of Related Art
Markov boundary discovery can be accomplished by learning a Bayesian network or other causal graph and extracting the Markov boundary from the graph. This is called a "global" approach because it learns a model involving all variables. A more recent and scalable alternative is "local" methods, which learn the Markov boundary directly without first learning a large and complicated model, an operation that is unnecessarily complex in most cases and often intractable as well. There exist two major families of local methods for identification of Markov boundaries from data. The first family contains methods that directly implement the definition of the Markov boundary (Pearl, 1988) by conditioning on an iteratively improved approximation of the Markov boundary and assessing conditional independence of the remaining variables. For example, GS and IAMB-style methods belong to this class (Margaritis and Thrun, 1999; Tsamardinos and Aliferis, 2003; Tsamardinos et al., 2003a). The second family contains compositional Markov boundary methods, which are more sample efficient and thus often more accurate in practical applications. Methods of this class operate by first learning the set of parents and children of the response/target variable using a specially designated sub-method, then using this sub-method to learn the parents and children of the parents and children of the response variable, and finally using another sub-method to eliminate all non-Markov boundary members. An example of such a compositional Markov boundary method is GLL-MB (Aliferis et al., 2009a; Aliferis et al., 2009b; Aliferis et al., 2003; Tsamardinos et al., 2003b). Methods in both classes correctly identify a Markov boundary of the response/target variable under the assumptions of faithfulness and causal sufficiency (Spirtes et al., 2000). The latter assumption implies that every common cause of any two or more observed variables is itself observed in the dataset. However, this assumption is very restrictive and is violated in most real datasets. Closer examination of the assumptions of methods that directly implement the definition of the Markov boundary reveals that these methods can identify a Markov boundary even when the causal sufficiency assumption is violated. This is primarily because these methods require only the composition property, which holds even when some variables are not observed in the data (Peña et al., 2007; Statnikov, 2008). However, in datasets with hidden variables, compositional Markov boundary methods may miss some Markov boundary members. The present invention circumvents this limitation of compositional Markov boundary methods and describes a new method that can discover Markov boundaries from datasets with hidden variables and do so in a much more sample efficient manner than methods that directly implement the definition of the Markov boundary.
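For illustration, the following is a minimal sketch of an IAMB-style method from the first family, assuming caller-supplied routines dep_strength(T, X, Z) (an association measure such as mutual information) and is_independent(T, X, Z) (a conditional independence test such as G2). The sketch follows the published IAMB scheme and is not the inventive method.

    def iamb(target, variables, dep_strength, is_independent):
        # Simplified IAMB-style sketch: grow a tentative Markov boundary by
        # greedy admission, then shrink it by removing false positives.
        mb = []
        changed = True
        while changed:  # growing phase
            changed = False
            candidates = [X for X in variables
                          if X != target and X not in mb
                          and not is_independent(target, X, mb)]
            if candidates:
                best = max(candidates, key=lambda X: dep_strength(target, X, mb))
                mb.append(best)
                changed = True
        for X in list(mb):  # shrinking phase
            rest = [Y for Y in mb if Y != X]
            if is_independent(target, X, rest):
                mb.remove(X)
        return mb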
Table 1 shows the Core method.
Table 2 shows the generative method CIMB1.
Table 3 shows the generative method CIMB2.
Table 4 shows the generative method CIMB3.
Table 5 shows the pseudo-code to implement generative method CIMB1 on a digital computer.
Table 6 shows the method CIMB*. Sub-routines Find-Spouses1 and Find-Spouses2 are described in Tables 7 and 8, respectively.
Table 7 shows the sub-routine Find-Spouses1 that is used in the method CIMB*.
Table 8 shows the sub-routine Find-Spouses2 that is used in the method CIMB*.
Table 9 shows the sensitivity of Markov boundary discovery for evaluation of Markov boundary methods using data from Bayesian networks. The larger this metric, the more accurate the method.
Table 10 shows the specificity of Markov boundary discovery for evaluation of Markov boundary methods using data from Bayesian networks. The larger this metric, the more accurate the method.
Table 11 shows the error of Markov boundary discovery (computed as the distance from the optimal point in ROC space, where sensitivity=1 and specificity=1) for evaluation of Markov boundary methods using data from Bayesian networks. The error is computed as described in (Frey et al., 2003). The smaller the error, the more accurate the method.
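For reference, the following is a minimal sketch of this error metric, assuming the usual Euclidean distance from the ideal operating point (sensitivity=1, specificity=1) in ROC space; the exact formulation used in the experiments is the one given in (Frey et al., 2003).

    import math

    def mb_discovery_error(sensitivity, specificity):
        # Euclidean distance from the optimal point (sensitivity=1, specificity=1)
        # in ROC space; smaller values indicate a more accurate method.
        return math.sqrt((1.0 - sensitivity) ** 2 + (1.0 - specificity) ** 2)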
Table 12 shows classification performance of the invention and baseline comparison methods in 13 real datasets listed in Table S2. The classification performance is measured by area under ROC (AUC) curve metric.
Table 13 shows the proportion of selected features applying the invention and baseline comparison methods in 13 real datasets listed in Table S2.
Table S1 shows a list of 7 Bayesian networks used in experiments to evaluate CIMB*.
Table S2 shows a list of 13 real datasets used in experiments to evaluate CIMB*.
Table S3 shows a method to process graphs of Bayesian networks without hidden variables to generate experiment tuples for evaluation of Markov boundary methods.
This specification teaches a novel method for discovery of a Markov boundary of the response/target variable from datasets with hidden variables (specifically, the method identifies a Markov boundary of the response/target variable in the distribution over observed variables). The novel method relies on the assumption that the distribution over all variables (observed and unobserved) involved in the underlying causal process is faithful to some DAG (Spirtes et al., 2000) (whereas the distribution over a subset consisting of the observed variables may be unfaithful). In general, the inventive method transforms a dataset with many variables into a minimal reduced dataset where all variables are needed for optimal prediction of some response variable. Notation and key definitions are described in the Appendix.
The Core method for finding a Markov boundary of the response/target variable in distributions where possibly not all variables have been observed is described in Table 1. Several ways to apply this methodology are described herein. In particular, three generative methods CIMB1, CIMB2, CIMB3 are described in Tables 2, 3, 4, respectively. The term "generative method" refers to a method that can be instantiated (parameterized) in a plurality of ways such that each instantiation provides a specific process to solve the problem of finding a Markov boundary of T in distributions where possibly not all variables have been observed, under the assumption that the distribution over all (observed and unobserved) variables involved in the causal process is faithful.
The invention consists of:
Implementations of the method CIMB3 can be obtained by instantiating its steps as follows (refer to Table 4 for steps mentioned below):
The method CIMB* described in Table 6 is an instantiation of the Core method and also can be seen as a variant of CIMB1. First, CIMB* uses an efficient strategy to consider only potential members of the Markov boundary. In other words, it does not iterate over all Z ∈ V\(TMB(T)∪{T}), but it iterates only over a subset of V\(TMB(T)∪{T}). Second, the approach used for identification of a collider path to T (that is used in the sub-routine of CIMB1) is based on recursive application of the GLL-PC method (to build regions of the network) and subsequent application of the collider orientation rules that are described in the sub-routines Find-Spouses1 (Table 7) and Find-Spouses2 (Table 8) and in steps 19-29 of the CIMB* method (Table 6).
The examples provided below motivate the reasoning behind the collider orientation rules described in steps 19-29 of the CIMB* method (denoted as Cases A and B in the CIMB* pseudo-code):
The following describes several ways to obtain variants of the method CIMB* by modifying pseudo-code of the method:
Illustration of the Limitations of Compositional Markov Boundary Methods
As mentioned earlier in this patent document, compositional Markov boundary methods may miss some Markov boundary members if the causal sufficiency assumption is violated (Spirtes et al., 2000). The latter assumption implies that every common cause of any two or more observed variables is itself observed in the dataset. Consider the graphical structure shown in
Results of Experiments with Simulated Data from Bayesian Networks
Table S1 shows the list of Bayesian networks used to simulate data. These Bayesian networks were used in prior evaluations of Markov boundary and causal discovery methods (Aliferis et al., 2009a; Aliferis et al., 2009c; Tsamardinos et al., 2006a) and were chosen on the basis of being representative of a wide range of problem domains (emergency medicine, veterinary medicine, weather forecasting, financial modeling, molecular biology, and genomics). For each of these Bayesian networks, data was simulated using the logic sampling method (Russell and Norvig, 2003). Specifically, 5 datasets of 200, 500, 1000, 2000, and 5000 samples were simulated. Notice that none of these datasets contains hidden variables, and thus they cannot be used in their original form to demonstrate the benefits of the invention. That is why the method stated in Table S3 was applied to generate experiment tuples of the form <T, S, MBS(T)>, where each tuple instructs first to run the invention and the baseline comparison methods on a target variable T after removing the variables S from the dataset, and then to compare the output variable set with the correct answer MBS(T).
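For concreteness, the following is a minimal sketch of the evaluation loop implied by such experiment tuples, assuming a hypothetical markov_boundary_method callable and the usual set-recovery definitions of the sensitivity and specificity reported in Tables 9 and 10; the tuple-generation procedure itself is the one given in Table S3.

    def evaluate_experiment_tuples(data, tuples, markov_boundary_method):
        # data maps variable names to their columns; each tuple is (T, S, true_mb).
        results = []
        for T, S, true_mb in tuples:
            # Remove the variables S from the dataset before running the method.
            reduced = {v: column for v, column in data.items() if v not in S}
            found = set(markov_boundary_method(reduced, T))
            truth = set(true_mb)
            others = set(reduced) - truth - {T}
            sensitivity = len(found & truth) / len(truth) if truth else 1.0
            specificity = len(others - found) / len(others) if others else 1.0
            results.append((sensitivity, specificity))
        return results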
The following Markov boundary methods were applied to those datasets with the G2 test of statistical independence (Agresti, 2002): CIMB*, IAMB (Tsamardinos and Aliferis, 2003; Tsamardinos et al., 2003b), BLCD-MB (Mani and Cooper, 2004), FAST-IAMB (Yaramakala and Margaritis, 2005), HITON-PC (Aliferis et al., 2009a; Aliferis et al., 2009b), and HITON-MB (Aliferis et al., 2009a; Aliferis et al., 2009b). In addition, IAMB (Tsamardinos and Aliferis, 2003; Tsamardinos et al., 2003b) with mutual information (Cover et al., 1991) (denoted as "IAMB-MI") was applied. The results for sensitivity, specificity, and error of Markov boundary discovery are shown in Tables 9, 10, and 11, respectively. The results for sensitivity and error of Markov boundary discovery are also plotted in
Results of Experiments with Real Data from Different Application Domains
Table S2 shows the list of real datasets used in experiments. The datasets were used in prior evaluations of Markov boundary methods (Aliferis et al., 2009a; Aliferis et al., 2009c) and were chosen on the basis of being representative of a wide range of problem domains (biology, medicine, economics, ecology, digit recognition, text categorization, and computational biology) in which Markov boundary induction and feature selection are essential. These datasets are challenging because they have a large number of features with small-to-large sample sizes. Several datasets used in prior feature selection and classification challenges were included. All datasets have a single binary response variable. It is also reasonable to assume that these datasets have hidden variables (because these are real-life data from domains where only a subset of all known observables is measured) and that the causal sufficiency assumption is violated with certainty. Thus these datasets can be used to demonstrate the benefits of the inventive method.
The following Markov boundary methods were applied to those datasets with the G2 test of statistical independence (Agresti, 2002): CIMB*, IAMB (Tsamardinos and Aliferis, 2003; Tsamardinos et al., 2003b), BLCD-MB (Mani and Cooper, 2004), FAST-IAMB (Yaramakala and Margaritis, 2005), HITON-PC (Aliferis et al., 2009a; Aliferis et al., 2009b), and HITON-MB (Aliferis et al., 2009a; Aliferis et al., 2009b). In addition, IAMB (Tsamardinos and Aliferis, 2003; Tsamardinos et al., 2003b) with mutual information (Cover et al., 1991) (denoted as "IAMB-MI") was applied, and the set of all variables in the dataset (denoted as "ALL") was also included in the comparison. Once features were selected, SVM classifiers were trained and tested on the selected features according to the cross-validation protocol stated in Table S2 (Vapnik, 1998). The results are shown in Table 12 (classification performance, measured by area under the ROC curve) and Table 13 (proportion of selected features). As can be seen from the row "Median" of Table 12, CIMB* yields higher median classification performance than the other methods, including using all variables in the dataset. Specifically, CIMB* achieves the highest classification performance in the ACPLEtiology, Gisette, Sylva, and HIVA datasets. In terms of mean classification performance, its results are comparable to the best baseline comparison method, HITON-MB (Table 12, row "Mean"). At the same time, according to Table 13, the proportion of features selected by CIMB* is only a few percent larger than for the other Markov boundary methods.
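As an illustration of this evaluation protocol, the following is a minimal sketch assuming scikit-learn's SVM and AUC implementations and a linear kernel; the original experiments used MathWorks Matlab, and the exact SVM settings and cross-validation splits are those stated in Table S2.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.svm import SVC
    from sklearn.metrics import roc_auc_score

    def cross_validated_auc(X, y, selected_features, n_splits=10):
        # Restrict the dataset to the features returned by a Markov boundary
        # method, then estimate classification performance by cross-validated
        # area under the ROC curve.
        X_reduced = X[:, selected_features]
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
        aucs = []
        for train_idx, test_idx in cv.split(X_reduced, y):
            clf = SVC(kernel="linear")
            clf.fit(X_reduced[train_idx], y[train_idx])
            scores = clf.decision_function(X_reduced[test_idx])
            aucs.append(roc_auc_score(y[test_idx], scores))
        return float(np.mean(aucs))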
Software and Hardware Implementation:
Due to the large number of data elements in the datasets that the present invention is designed to analyze, the invention is best practiced by means of a computational device. For example, a general purpose digital computer with a suitable software program (i.e., hardware instruction set) is needed to handle the large datasets and to practice the method in realistic time frames. Based on the complete disclosure of the method in this patent document, software code to implement the invention may be written by those reasonably skilled in the software programming arts in any one of several standard programming languages. The software program may be stored on a computer readable medium and implemented on a single computer system or across a network of parallel or distributed computers linked to work as one. The inventors have used MathWorks Matlab® and a personal computer with an Intel Xeon CPU 2.4 GHz, 4 GB of RAM, and a 160 GB hard disk. In its most basic form, the invention receives as input a dataset and a response variable index corresponding to this dataset, and outputs a Markov boundary (described by the indices of variables in this dataset), which can be stored in a data file, stored in computer memory, or displayed on the computer screen. Likewise, the invention can transform an input dataset into a minimal reduced dataset that contains only the variables needed for optimal prediction of the response variable (i.e., the Markov boundary).
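As a concrete illustration of this input/output behavior, the following is a minimal sketch in Python; the inventors' implementation used Matlab, and the markov_boundary argument below stands in for any implementation of the inventive method (a hypothetical interface).

    import numpy as np

    def reduce_dataset(data, response_index, markov_boundary):
        # data: a samples-by-variables numeric array; response_index: the column
        # of the response/target variable; markov_boundary: a callable returning
        # the indices of the Markov boundary members (hypothetical interface).
        mb_indices = sorted(markov_boundary(data, response_index))
        # Keep only the Markov boundary variables plus the response variable.
        keep = sorted(set(mb_indices) | {response_index})
        reduced = data[:, keep]
        # The discovered indices can be stored in a data file, kept in memory,
        # or displayed on the screen.
        np.savetxt("markov_boundary_indices.txt", mb_indices, fmt="%d")
        return mb_indices, reduced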
In this specification, capital letters in italics denote variables (e.g., A, B, C) and bold letters denote variable sets (e.g., X, Y, Z). The following standard notation for statistical independence relations is adopted: I(T, A) means that T is independent of the variable set A. Similarly, if T is independent of the variable set A given (conditioned on) the variable set B, this is denoted as I(T, A|B). If the negation ¬I( ) is used instead of I( ), this denotes dependence instead of independence.
If a graph contains an edge X→Y, then X is a parent of Y and Y is a child of X. The edge X↔Y means that X and Y are confounded by hidden variable(s) (i.e., they share at least one unobserved common cause). The edge X o→ Y denotes either X→Y or X↔Y. Finally, the edge X o-o Y denotes either X→Y, or X↔Y, or X←Y.
The set of all variables involved in the causal process is denoted by A=V∪H, where V is the set of observed variables (including the response/target variable T) and H is the set of unobserved (hidden) variables.
DEFINITION OF BAYESIAN NETWORK <V, G, J>: Let V be a set of variables and J be a joint probability distribution over all possible instantiations of V. Let G be a directed acyclic graph (DAG) such that all nodes of G correspond one-to-one to members of V. It is required that for every node A ∈ V, A is probabilistically independent of all non-descendants of A, given the parents of A (i.e. Markov Condition holds). Then the triplet <V, G, J> is called a Bayesian network (abbreviated as “BN”), or equivalently a belief network or probabilistic network (Neapolitan, 1990).
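For reference, the Markov Condition stated above is equivalent to the standard factorization of the joint distribution J according to the DAG G (a well-known property of Bayesian networks, given here for context):

    J(V) = \prod_{A \in V} P\big(A \mid \mathrm{Parents}_G(A)\big)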
DEFINITION OF MARKOV BLANKET: A Markov blanket M of the response/target variable T ∈ V in the joint probability distribution P over variables V is a set of variables conditioned on which all other variables are independent of T, i.e., for every X ∈ (V\M\{T}), I(T, X|M).
DEFINITION OF MARKOV BOUNDARY: If M is a Markov blanket of T in the joint probability distribution P over variables V and no proper subset of M satisfies the definition of a Markov blanket of T, then M is called a Markov boundary of T. The Markov boundary of T is denoted as MB(T).
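To make these two definitions operational, the following is a minimal sketch of how they translate into conditional independence checks, assuming a user-supplied test is_independent(T, X, Z) (a hypothetical interface; in practice a test such as G2 or mutual information would be used).

    from itertools import combinations

    def is_markov_blanket(T, M, all_vars, is_independent):
        # M is a Markov blanket of T if every variable outside M and {T} is
        # independent of T conditioned on M.
        others = [X for X in all_vars if X != T and X not in M]
        return all(is_independent(T, X, list(M)) for X in others)

    def is_markov_boundary(T, M, all_vars, is_independent):
        # A Markov boundary is a Markov blanket no proper subset of which is
        # itself a Markov blanket. Enumerating all proper subsets is shown only
        # for fidelity to the definition; it is exponential and is not how the
        # inventive method operates.
        if not is_markov_blanket(T, M, all_vars, is_independent):
            return False
        return not any(
            is_markov_blanket(T, list(subset), all_vars, is_independent)
            for k in range(len(M))
            for subset in combinations(M, k)
        )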
DEFINITION OF THE SET OF PARENTS AND CHILDREN: X belongs to the set of parents and children of T (denoted as PC(T)) if and only if X is adjacent to T in the underlying causal graph G over variables V.
DEFINITION OF PUTATIVE PARENT: X is a putative parent of Y if X is a parent of Y or X and Y are confounded by hidden variable(s), i.e., X→Y or X↔Y. This can also be denoted as X o→ Y.
DEFINITION OF PUTATIVE CHILD: X is a putative child of Y if X is a child of Y or X and Y are confounded by hidden variable(s), i.e., X←Y or X↔Y. This can also be denoted as X ←o Y.
DEFINITION OF COLLIDER PATH: X is connected to Y via a collider path p if the length of p is at least two edges and every variable on the path p (other than the endpoints X and Y) is a collider on p. Here are a few examples of collider paths between X and Y: X → A ← Y, X → A ↔ B ← Y, and X ↔ A ↔ Y.
DEFINITION OF BIDIRECTIONAL PATH: X is connected to Y via a bidirectional path p if every edge on the path is a bidirected edge (↔). Here are a few examples of bidirectional paths between X and Y: X ↔ A ↔ Y and X ↔ A ↔ B ↔ Y.
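As an illustration of these two path definitions, the following is a small sketch that checks them on an explicitly represented path; the edge-mark encoding ('>' for an arrowhead, '-' for a tail at an endpoint) is purely illustrative and not part of the inventive method.

    def is_collider_path(path_marks):
        # path_marks[i] = (mark_at_left_endpoint, mark_at_right_endpoint) of the
        # i-th edge along the path. An intermediate node is a collider if both
        # edges meeting at it have arrowheads pointing into it.
        if len(path_marks) < 2:
            return False  # a collider path has at least two edges
        for left_edge, right_edge in zip(path_marks, path_marks[1:]):
            # The shared intermediate node is the right endpoint of left_edge and
            # the left endpoint of right_edge.
            if not (left_edge[1] == '>' and right_edge[0] == '>'):
                return False
        return True

    def is_bidirectional_path(path_marks):
        # Every edge on the path is bidirected (arrowheads at both endpoints).
        return len(path_marks) >= 1 and all(edge == ('>', '>') for edge in path_marks)

    # Example: X -> A <-> B <- Y is a collider path (A and B are colliders).
    print(is_collider_path([('-', '>'), ('>', '>'), ('>', '-')]))  # True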
Benefit of U.S. Provisional Application No. 61/145,652 filed on Jan. 19, 2009 is hereby claimed.