The present invention relates to techniques for discovering conditional functional dependencies (CFDs) and, more particularly, to CFD discovery techniques that reduce the number of discovered redundant CFDs.
Conditional functional dependencies were introduced for data cleaning. See, e.g., W. Fan et al., “Conditional Functional Dependencies for Capturing Data Inconsistencies,” TODS, Vol. 33, No. 2 (June, 2008), incorporated by reference herein. Generally, conditional functional dependencies extend standard functional dependencies (FDs) by enforcing patterns of semantically related constants. CFDs are generally considered more effective than FDs in detecting and repairing inconsistencies of data (often referred to as dirtiness of data). It is expected that conditional functional dependencies will be adopted by data-cleaning tools that currently employ standard FDs (e.g., M. Arenas et al., “Consistent Query Answers in Inconsistent Databases,” TPLP, Vol. 3, No. 4-5, 393-424 (2003) and J. Chomicki and J. Marcinkowski, “Minimal-Change Integrity Maintenance Using Tuple Deletions,” Information and Computation, Vol. 197, Nos. 1-2, 90-121 (2005)).
For CFD-based cleaning methods to be effective in practice, however, it is necessary to have techniques to automatically discover or learn CFDs from sample data, to be used as data cleaning rules. Indeed, it is often unrealistic to rely solely on human experts to design CFDs via an expensive and long manual process. It has been suggested that cleaning-rule profiling is critical to commercial data quality tools.
This practical concern highlights the need for studying the discovery problem for CFDs: given a sample instance r of a relation schema R, the discovery problem finds a canonical cover of all CFDs that hold on r (i.e., a set of CFDs that is logically equivalent to the set of all CFDs that hold on r). To reduce redundancy, each CFD in the canonical cover should be minimal (i.e., nontrivial and left-reduced). For a more detailed discussion of nontrivial and left-reduced FDs, see, for example, S. Abiteboul et al., “Foundations of Databases,” Addison-Wesley (1995).
The discovery problem is nontrivial. For example, for traditional FDs, a canonical cover of FDs discovered from a relation r is inherently exponential in the arity of the schema of r (i.e., the number of attributes in R). Since CFD discovery subsumes FD discovery, the exponential complexity carries over to CFD discovery. Moreover, CFD discovery requires mining of semantic patterns with constants, a challenge that was not encountered when discovering FDs.
A number of techniques have been proposed or suggested for discovering CFDs. For example, L. Golab et al., “On Generating Near-Optimal Tableaux for Conditional Functional Dependencies,” VLDB (2008), showed that, for a fixed traditional FD, fd, it is NP-complete to find useful patterns that, together with fd, make quality CFDs. L. Golab et al. provide heuristic algorithms for discovering patterns from samples with respect to a fixed FD.
F. Chiang and R. Miller, “Discovering Data Quality Rules,” VLDB (2008), presented an algorithm for discovering CFDs, including both traditional FDs and their associated patterns. The disclosed discovery algorithm, however, does not avoid the redundancy of discovered CFDs.
A need therefore exists for improved methods and apparatus for identifying conditional functional dependencies. A further need exists for CFD discovery techniques that reduce the number of discovered redundant CFDs.
Generally, methods and apparatus are provided for identifying one or more conditional functional dependencies defined over a schema, R, given a sample relation, r, of said schema, R, and a support threshold, k. Minimal CFDs are disclosed based on both the minimality of attributes and the minimality of patterns. Generally, minimal CFDs contain neither redundant attributes nor redundant patterns. Frequent CFDs are addressed that hold on a sample dataset r, namely, CFDs in which the pattern tuples have a support in r above a certain threshold, k.
A CFDMiner algorithm is disclosed for constant CFD discovery. The connection between minimal constant CFDs and closed and free patterns is explored. CFDMiner finds constant CFDs by leveraging a recent mining technique, which mines closed itemsets and free itemsets in parallel following a depth-first search scheme.
A CTANE algorithm extends TANE, a well-known algorithm for mining FDs, to discover general CFDs. CTANE is based on an attribute-set/pattern tuple lattice, and mines CFDs at level k+1 of the lattice (i.e., when each set at the level consists of k+1 attributes) with pruning based on those at level k. CTANE discovers only minimal CFDs, and does not return unnecessarily redundant CFDs.
A FastCFD algorithm discovers general CFDs by employing a depth-first search strategy instead of following the levelwise approach. FastCFD is a nontrivial extension of FastFD, an algorithm for FD profiling, by mining pattern tuples. A pruning technique is employed by FastCFD, by leveraging constant CFDs found by CFDMiner.
A more complete understanding of the present invention, as well as further features and advantages of the present invention, will be obtained by reference to the following detailed description and drawings.
The present invention provides methods and apparatus for identifying CFDs. The present invention recognizes that CFDs support patterns of semantically related constants and can be used as rules for cleaning relational data. According to one aspect of the invention, CFD discovery techniques are disclosed that discover minimal CFDs based on both the minimality of attributes and the minimality of patterns. According to another aspect of the invention, a CFD discovery technique, referred to as CFDMiner, is disclosed that is based on mining closed itemsets. The disclosed CFDMiner algorithm can discover constant CFDs with only constant patterns, without paying the price of discovering all CFDs. It has been found that constant CFD discovery is often several orders of magnitude faster than general CFD discovery. Constant CFDs are important for both data cleaning and data integration.
According to yet another aspect of the invention, general minimal CFDs are discovered using a CTANE algorithm based on the levelwise approach or FastCFD algorithm that employs a depth-first approach (which optionally leverages closed-itemset mining to reduce search space).
As previously indicated, CFD discovery requires mining of semantic patterns with constants, as illustrated by the following example.
Example 1. The following relational schema cust is taken from W. Fan et al., “Conditional Functional Dependencies for Capturing Data Inconsistencies,” TODS, Vol. 33, No. 2 (June, 2008). The relational schema cust specifies a customer in terms of the customer's phone (country code (CC), area code (AC), phone number (PN)), name (NM), and address (street (STR), city (CT), zip code (ZIP)).
Traditional FDs that hold on r0 include the following:
f1: [CC,AC]→CT
f2: [CC,AC,PN]→STR
Here, f1 requires that two customers with the same country and area codes also have the same city (similarly for f2).
In contrast, the CFDs that hold on r0 include not only the FDs f1 and f2, but also the following (and more):
φ0: ([CC,ZIP]→STR, (44, _∥_))
φ1: ([CC,AC]→CT, (01, 908∥MH))
φ2: ([CC,AC]→CT, (44, 131∥EDI))
φ3: ([CC,AC]→CT, (01, 212∥NYC))
In CFD φ0, (44, _∥_) is the pattern tuple that enforces a binding of semantically related constants for attributes (CC, ZIP, STR) in a tuple. CFD φ0 states that for customers in the United Kingdom, the zip code (ZIP) uniquely determines the street (STR). In other words, φ0 is an FD that holds only on the subset of tuples with the pattern “CC=44,” rather than on the entire relation r0. CFD φ1 ensures that for any customer in the United States (country code 01) with area code 908, the city of the customer must be Murray Hill (MH), as enforced by its pattern tuple (01, 908∥MH) (similarly for φ2 and φ3). These conditional functional dependencies cannot be expressed as FDs.
More specifically, a CFD is of the form (X→A,tp), where X→A is an FD and tp is a pattern tuple with attributes in X and A. The pattern tuple consists of constants and an unnamed variable ‘_’ that matches an arbitrary value. To discover a CFD, it is necessary to find not only the traditional FD, X→A, but also its pattern tuple tp. With the same FD, X→A, there are possibly multiple CFDs defined with different pattern tuples (e.g., φ0-φ3). Hence, a canonical cover of CFDs that hold on r0 is typically much larger than its FD counterpart. Indeed, it was recently shown that provided a fixed FD, X→A, is already given, the problem for discovering sensible patterns associated with the FD alone is NP-complete.
It is noted that the pattern tuple in each of φ1-φ3 consists of only constants in both its left-hand-side (LHS) and right-hand-side (RHS). Such CFDs are referred to as constant CFDs. Constant CFDs are instance-level FDs that are particularly useful in object identification, an issue essential to both data quality and data integration.
Three exemplary algorithms are provided for CFD discovery: one algorithm for discovering constant CFDs, and two algorithms for general CFDs.
(1) A notion of minimal CFDs is disclosed based on both the minimality of attributes and the minimality of patterns. Intuitively, minimal CFDs contain neither redundant attributes nor redundant patterns. Frequent CFDs are addressed that hold on a sample dataset r, namely, CFDs in which the pattern tuples have a support in r above a certain threshold. Frequent CFDs accommodate unreliable data with errors and noise. The disclosed algorithms find minimal and frequent CFDs to help users identify quality cleaning rules from a possibly large set of CFDs that hold on the samples.
(2) A first algorithm, referred to as CFDMiner, is for constant CFD discovery. The connection between minimal constant CFDs and closed and free patterns is explored. Based on this, CFDMiner finds constant CFDs by leveraging a recent mining technique proposed in J. Li et al., “Mining Statistically Important Equivalence Classes and Delta-Discriminative Emerging Patterns,” KDD (2007), incorporated by reference herein, which mines closed itemsets and free itemsets in parallel following a depth-first search scheme.
(3) A second algorithm, referred to as CTANE, extends TANE, a well-known algorithm for mining FDs, to discover general CFDs. CTANE is based on an attribute-set/pattern tuple lattice, and mines CFDs at level k+1 of the lattice (i.e., when each set at the level consists of k+1 attributes) with pruning based on those at level k. CTANE discovers minimal CFDs only, and does not return the unnecessarily redundant CFDs found by the TANE-extension of F. Chiang and R. Miller, referenced above.
(4) A third algorithm, referred to as FastCFD, discovers general CFDs by employing a depth-first search strategy instead of following the levelwise approach. FastCFD is a nontrivial extension of FastFD, an algorithm for FD profiling, by mining pattern tuples. A novel pruning technique is introduced by FastCFD, by leveraging constant CFDs found by CFDMiner. As opposed to CTANE, FastCFD does not take exponential time in the arity of sample data when a canonical cover of CFDs is not exponentially large.
It has been found that CFDMiner often outperforms CTANE and FastCFD by three orders of magnitude. It has also been found that FastCFD scales well with the arity: it is up to three orders of magnitude faster than CTANE when the arity is between 10 and 15, and it performs well when the arity is greater than 30; in contrast, CTANE may not run to completion when the arity is above 17. On the other hand, CTANE is more sensitive to the support threshold and outperforms FastCFD when the threshold is large and the arity is of a moderate size. It has also been found that the disclosed pruning technique via itemset mining is effective: it improves the performance of FastCFD by a factor of 5-10 and makes FastCFD scale well with the sample size.
These results provide a guideline for when to use CFDMiner, CTANE or FastCFD in different applications. For example, when only constant CFDs are needed, one can use CFDMiner without paying the price of mining general CFDs. CFDMiner can be multiple orders of magnitude faster than CTANE and FastCFD for constant CFD profiling. CTANE usually works well when the arity of a sample relation is small and the support threshold is high, but it scales poorly when the arity of a relation increases. When the arity of a sample dataset is large, FastCFD can be employed. NaiveFast and FastCFD are more efficient than CTANE when the arity of the relation is large. Thus, when k-frequent CFDs are needed for a large k, one could use CTANE. The disclosed optimization technique based on closed-itemset mining is effective: FastCFD significantly outperforms NaiveFast, especially when the arity is large.
Conditional Functional Dependencies
Consider a relation schema R defined over a fixed set of attributes, denoted by attr(R). For each attribute A ε attr(R), dom(A) denotes its domain.
A conditional functional dependency (CFD) φ over R is a pair (X→A,tp), where (1) X is a set of attributes in attr(R), and A is a single attribute in attr(R), (2) X→A is a standard FD, referred to as the FD embedded in φ; and (3) tp is a pattern tuple with attributes in X and A, where for each B in X ∪ {A}, tp[B] is either a constant ‘a’ in dom(B), or an unnamed variable ‘_’ that draws values from dom(B).
X is denoted as LHS(φ) and A as RHS(φ). If A also occurs in X, AL and AR indicate the occurrence of A in the LHS(φ) and RHS(φ), respectively. The X and A attributes are separated in a pattern tuple with ‘∥’.
Standard FDs are a special case of CFDs. Indeed, an FD X→A can be expressed as a CFD (X→A,tp), where tp[B]=_ for each B in X ∪ {A}.
Example 2. The FD f1 of Example 1 can be expressed as a CFD ([CC, AC]→CT, (_, _∥_)); similarly for f2. All of f1, f2 and φ0-φ3 are CFDs defined over schema cust. For φ0, for example, LHS(φ0) is [CC,ZIP] and RHS(φ0) is STR.
To give the semantics of CFDs, an order ≦ is defined on constants and the unnamed variable ‘_’: η1≦η2 if either η1=η2, or η1 is a constant a and η2 is ‘_’.
The order ≦ naturally extends to tuples, e.g., (44, “EH4 1DT”, “EDI”)≦(44, _, _) but (01, 07974, “Tree Ave.”)≰(44, _, _). A tuple t1 matches t2 if t1≦t2. We write t1<<t2 if t1≦t2 but t2≰t1, i.e., when t2 is “more general” than t1. For instance, (44, “EH4 1DT”, “EDI”)<<(44, _, _).
An instance r of R satisfies the CFD φ (or φ holds on r), denoted by r|=φ, if and only if (iff) for each pair of tuples t1,t2 in r, if t1[X]=t2[X]≦tp[X] then t1[A]=t2[A]≦tp[A]. Intuitively, φ is a constraint defined on the set rφ={t|t ε r,t[X]≦tp[X]} such that for any t1,t2 ε rφ, if t1[X]=t2[X], then (a) t1[A]=t2[A], and (b) t1[A]≦tp[A]. Here (a) enforces the semantics of the embedded FD on the set rφ, and (b) assures the binding between constants in tp[A] and constants in t1[A]. That is, φ constrains the subset rφ of r identified by tp[X], rather than the entire instance r.
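The satisfaction test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a relation is modeled as a list of dictionaries, the names `matches`, `satisfies` and `toy` are hypothetical, the CFD is assumed to be nontrivial (A does not occur in X), and the toy instance is made up for the example and is not the instance r0 of the document.

```python
WILDCARD = "_"  # the unnamed variable '_'

def matches(value, pattern):
    """value ≦ pattern: equal constants, or pattern is the unnamed variable."""
    return pattern == WILDCARD or value == pattern

def satisfies(r, lhs, rhs, tp):
    """True iff relation r (list of dicts) satisfies the CFD (lhs -> rhs, tp).

    tp maps every attribute in lhs and rhs to a constant or WILDCARD.
    """
    seen = {}  # LHS values of matching tuples -> the RHS value they must share
    for t in r:
        if not all(matches(t[b], tp[b]) for b in lhs):
            continue  # t lies outside r_phi, so phi does not constrain it
        if not matches(t[rhs], tp[rhs]):
            return False  # condition (b): binding to the RHS constant violated
        key = tuple(t[b] for b in lhs)
        if seen.setdefault(key, t[rhs]) != t[rhs]:
            return False  # condition (a): embedded FD violated on r_phi
    return True

# Hypothetical toy instance over (CC, AC, CT); not taken from the document.
toy = [
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "212", "CT": "NYC"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
]
```

On this toy instance, `satisfies(toy, ["CC", "AC"], "CT", {"CC": "01", "AC": "908", "CT": "MH"})` holds, whereas the pattern `{"CC": "01", "CT": "_"}` for the embedded FD CC→CT fails because country code 01 occurs with two different cities.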
Example 3. The instance r0 of
An instance r of R satisfies a set Σ of CFDs over R, denoted by r|=Σ, if r|=φ for each CFD φ ε Σ.
For two sets Σ and Σ′ of CFDs defined over the same schema R, Σ is equivalent to Σ′, denoted by Σ≡Σ′, iff for any instance r of R, r|=Σ iff r|=Σ′.
CFDs can also be defined as (X→Y,tp), where Y is a set of attributes and X→Y is an FD. As in the case of FDs, such a CFD is equivalent to a set of CFDs with a single attribute in their RHS.
A CFD (X→A,tp) is called a constant CFD if its pattern tuple tp consists of constants only, i.e., tp[A] is a constant and for all B ε X, tp[B] is a constant. A CFD is called a variable CFD if tp[A]=_, i.e., the RHS of its pattern tuple is the unnamed variable ‘_’.
Example 4. Among the CFDs given in Example 1, f1,f2,φ0 are variable CFDs, while φ1,φ2,φ3 are constant CFDs.
It has been shown that any set Σ of CFDs over a schema R can be represented by a set Σc of constant CFDs and a set Σv of variable CFDs, such that Σ≡Σc ∪ Σv. In particular, for a CFD φ=(X→A,tp), if tp[A] is a constant a, then there is an equivalent CFD φ′=(X′→A, (tp[X′]∥a)), where X′ consists of all attributes B ε X such that tp[B] is a constant. That is, when tp[A] is a constant, all attributes B with tp[B]=‘_’ can be dropped from the LHS of φ.
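The rewriting of a CFD with a constant RHS into a constant CFD is purely syntactic and can be sketched as follows. The function name `drop_wildcard_lhs` and the dictionary representation of pattern tuples are assumptions made for this illustration only.

```python
WILDCARD = "_"  # the unnamed variable '_'

def drop_wildcard_lhs(lhs, rhs, tp):
    """Rewrite (X -> A, tp) with constant tp[A] into (X' -> A, (tp[X'] || a)).

    X' keeps only the LHS attributes whose pattern value is a constant.
    """
    if tp[rhs] == WILDCARD:
        raise ValueError("the rewriting applies only when tp[rhs] is a constant")
    new_lhs = [b for b in lhs if tp[b] != WILDCARD]
    new_tp = {b: tp[b] for b in new_lhs}
    new_tp[rhs] = tp[rhs]
    return new_lhs, rhs, new_tp
```

For instance, the hypothetical CFD ([CC,AC]→CT, (01, _∥MH)) is rewritten to ([CC]→CT, (01∥MH)).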
Lemma 1: For any set Σ of CFDs over a schema R, there exist a set Σc of constant CFDs and a set Σv of variable CFDs over R, such that Σ is equivalent to Σc ∪ Σv.
Discovery of CFDs
Given a sample relation r of a schema R, an algorithm for CFD discovery aims to find CFDs defined over R that hold on r. The set of all CFDs that hold on r should not be returned, since the set contains trivial and redundant CFDs and is unnecessarily large. Thus, a canonical cover is desired, i.e., a non-redundant set consisting of minimal CFDs only, from which all CFDs on r can be derived via implication analysis. Moreover, real-life data is often dirty, containing errors and noise. To exclude CFDs that match errors and noise only, frequent CFDs are considered, which have a pattern tuple with support in r above a threshold.
The notions of minimal CFDs and frequent CFDs are formalized before stating the discovery problem for CFDs.
Minimal CFDs. A CFD φ=(X→A,tp) over R is said to be trivial if A ε X. If φ is trivial, then either it is satisfied by all instances of R (e.g., when tp[AL]=tp[AR]), or it is satisfied by none of the instances in which there is a tuple t such that t[X]≦tp[X] (e.g., if tp[AL] and tp[AR] are distinct constants). A constant CFD (X→A, (tp∥a)) is said to be left-reduced on r if for any proper subset Y ⊊ X, r|≠(Y→A, (tp[Y]∥a)).
A variable CFD (X→A, (tp∥_)) is left-reduced on r if (1) r|≠(Y→A,(tp[Y]∥_)) for any proper subset Y ⊊ X, and (2) r|≠(X→A,(tp′[X]∥_)) for any tp′ with tp<<tp′. Intuitively, these requirements ensure the following: (1) none of its LHS attributes can be removed, i.e., the minimality of attributes, and (2) none of the constants in its LHS pattern can be “upgraded” to ‘_’, i.e., the pattern tp[X] is “most general”, or in other words, the minimality of patterns. A minimal CFD φ on r is a nontrivial, left-reduced CFD such that r|=φ. Intuitively, a minimal CFD is non-redundant.
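The two requirements for a left-reduced variable CFD can be checked by brute force, as in the following sketch. It is exponential in the size of the LHS and is meant only to make the definition concrete; the names `holds`, `is_left_reduced_variable` and `toy` are hypothetical, and the toy instance is invented for this example.

```python
from itertools import combinations

WILDCARD = "_"  # the unnamed variable '_'

def holds(r, lhs, rhs, tp):
    """True iff r satisfies the nontrivial CFD (lhs -> rhs, tp)."""
    seen = {}
    for t in r:
        if all(tp[b] in (WILDCARD, t[b]) for b in lhs):
            if tp[rhs] not in (WILDCARD, t[rhs]):
                return False
            if seen.setdefault(tuple(t[b] for b in lhs), t[rhs]) != t[rhs]:
                return False
    return True

def is_left_reduced_variable(r, lhs, rhs, tp):
    """Check requirements (1) and (2) for a variable CFD (lhs -> rhs, (tp || _))."""
    # (1) Minimality of attributes: no proper subset of the LHS may suffice.
    for size in range(len(lhs)):
        for sub in combinations(lhs, size):
            sub_tp = {b: tp[b] for b in sub}
            sub_tp[rhs] = WILDCARD
            if holds(r, sub, rhs, sub_tp):
                return False
    # (2) Minimality of patterns: no LHS constant may be upgraded to '_'.
    constants = [b for b in lhs if tp[b] != WILDCARD]
    for size in range(1, len(constants) + 1):
        for upgraded in combinations(constants, size):
            general = dict(tp)
            for b in upgraded:
                general[b] = WILDCARD
            if holds(r, lhs, rhs, general):
                return False
    return True

# Hypothetical toy instance; not taken from the document.
toy = [
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "212", "CT": "NYC"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
]
```

On the toy instance, (AC→CT, (_∥_)) is left-reduced, while (AC→CT, (908∥_)) is not, because its constant can be upgraded to ‘_’ without invalidating the dependency.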
Example 5. On the sample r0 of
Consider CFDs f11=(f1,(01,_∥_)), f12=(f1,(44,_∥_)), f13=(f1,(_,908∥_)), f14=(f1,(_,212∥_)), and f15=(f1,(_,311∥_)). While these CFDs hold on r0, they are not minimal CFDs, since they do not satisfy requirement (2) for left-reduced variable CFDs. Indeed, (f1,(_,_∥_)) is a minimal CFD on r0 with a pattern more general than any of f1i for i ε [1,5]; in other words, these f1i's are redundant.
Frequent CFDs. The support of a CFD φ=(X→A,tp) in r, denoted by sup(φ,r), is defined to be the set of tuples t in r such that t[X]≦tp[X] and t[A]≦tp[A], i.e., tuples that match the pattern of φ. For a natural number k≧1, a CFD φ is said to be k-frequent in r if its support contains at least k tuples; with a slight abuse of notation, this is written sup(φ,r)≧k. For instance, φ1,φ2 of Example 1 are 3-frequent and 2-frequent, respectively. Moreover, f1,f2 are 8-frequent.
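As a small illustration of the support definition, the sketch below collects the tuples that match the whole pattern tuple and compares their number with k. The names `support`, `is_k_frequent` and `toy` are hypothetical, and the toy instance is invented, so its counts differ from those quoted for r0.

```python
WILDCARD = "_"  # the unnamed variable '_'

def support(r, lhs, rhs, tp):
    """Tuples of r that match tp on every attribute of lhs and on rhs."""
    attrs = list(lhs) + [rhs]
    return [t for t in r if all(tp[b] in (WILDCARD, t[b]) for b in attrs)]

def is_k_frequent(r, lhs, rhs, tp, k):
    """True iff at least k tuples of r match the pattern tuple of the CFD."""
    return len(support(r, lhs, rhs, tp)) >= k

# Hypothetical toy instance; not taken from the document.
toy = [
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "212", "CT": "NYC"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
]
```

Note that k-frequency only counts witness tuples; whether the CFD actually holds on r is a separate check.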
It is noted that the notion of frequent CFDs is quite different from the notion of approximate FDs. An approximate FD ψ on a relation r is an FD that “almost” holds on r, i.e., there exists a subset r′ ⊂ r such that r′|=ψ and the error |r\r′|/|r| is less than a predefined bound. It is not necessary that r|=ψ. In contrast, a k-frequent CFD φ in r is a CFD that must hold on r, i.e., r|=φ, and moreover, there must be sufficiently many (at least k) witness tuples in r that match the pattern tuple of φ.
A canonical cover of CFDs on r with respect to k is a set Σ of minimal, k-frequent CFDs in r, such that Σ is equivalent to the set of all k-frequent CFDs that hold on r. Given an instance r of a relation schema R and a support threshold k, the discovery problem for CFDs is to find a canonical cover of CFDs on r with respect to k. Intuitively, a canonical cover consists of non-redundant frequent CFDs on r, from which all frequent CFDs that hold on r can be inferred.
Discovering Constant CFDs
According to one aspect of the present invention, a CFDMiner algorithm is provided for constant CFD profiling. Given an instance r of R and a support threshold k, CFDMiner finds a canonical cover of k-frequent minimal constant CFDs of the form (X→A,(tp∥a)).
The exemplary CFDMiner algorithm is based on the connection between left-reduced constant CFDs and free and closed itemsets. A similar relationship was established for so-called non-redundant association rules. In that context, left-reduced constant CFDs coincide with non-redundant association rules that have 100% confidence and have a single attribute in their consequent.
Free and Closed Itemsets. An itemset is a pair (X,tp), where X ⊂ attr(R) and tp is a constant pattern over X. Given an instance r of the schema R, the support of (X,tp) in r, denoted by supp(X,tp,r), is defined as the set of tuples in r that match tp on the X-attributes. (Y,sp) is more general than (X,tp), denoted by (X,tp)≦(Y,sp), if Y ⊆ X and tp[Y]=sp. Furthermore, (Y,sp) is strictly more general than (X,tp), denoted by (X,tp)<(Y,sp), if Y ⊊ X and tp[Y]=sp. Clearly, if (X,tp)≦(Y,sp) then supp(X,tp,r) ⊆ supp(Y,sp,r). For a natural number k≧1, an itemset (X,tp) is k-frequent if |supp(X,tp,r)|≧k.
An itemset (X,tp) is closed in r if there exists no itemset (Y,sp) such that (Y,sp)<(X,tp) for which supp(Y,sp,r)=supp(X,tp,r). Intuitively, a closed itemset (X,tp) cannot be extended without decreasing its support. For an itemset (X,tp), clo(X,tp) denotes the unique closed itemset that extends (X,tp) and has the same support in r as (X,tp).
Similarly, an itemset (X,tp) is called free in r if there exists no itemset (Y,sp) such that (X,tp)<(Y,sp) for which supp(Y,sp,r)=supp(X,tp,r). Intuitively, a free itemset (X,tp) cannot be generalized without increasing its support.
A closed (resp. free) itemset (X,tp) is k-frequent if the itemset (X,tp) is k-frequent and closed (resp. free).
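The notions of support, closure, closed itemset and free itemset can be made concrete with the brute-force sketch below. An itemset is modeled as a dictionary from attributes to constants; the names `supp`, `closure`, `is_closed`, `is_free` and `toy` are hypothetical, and the toy instance is invented. Freeness is tested by dropping one item at a time, which suffices because support can only grow when an itemset is generalized.

```python
def supp(r, itemset):
    """Indices of the tuples of r that match the itemset on its attributes."""
    return frozenset(i for i, t in enumerate(r)
                     if all(t[a] == v for a, v in itemset.items()))

def closure(r, itemset):
    """Largest extension of the itemset that keeps the same support in r."""
    rows = [r[i] for i in supp(r, itemset)]
    if not rows:
        return dict(itemset)  # empty support: nothing to extend with
    return {a: v for a, v in rows[0].items() if all(t[a] == v for t in rows)}

def is_closed(r, itemset):
    return closure(r, itemset) == dict(itemset)

def is_free(r, itemset):
    """No strictly more general itemset has the same support."""
    s = supp(r, itemset)
    return all(
        supp(r, {a: v for a, v in itemset.items() if a != dropped}) != s
        for dropped in itemset
    )

# Hypothetical toy instance; not taken from the document.
toy = [
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "212", "CT": "NYC"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
]
```

On the toy instance, the itemset AC=908 is free, and its closure adds CC=01 and CT=MH, since both tuples with area code 908 agree on those values.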
The connection between k-frequent free and closed itemsets and k-frequent left-reduced constant CFDs is as follows.
Proposition 1. For an instance r of R and any k-frequent left-reduced constant CFD φ=(X→A,(tp∥a)), r|=φ iff (i) the itemset (X,tp) is free, k-frequent and it does not contain (A,a); (ii) clo(X,tp)≦(A,a); and (iii) (X,tp) does not contain a smaller free set (Y,sp) with this property, i.e., there exists no (Y,sp) such that (X,tp)≦(Y,sp), Y ⊊ X, and clo(Y,sp)≦(A,a).
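Proposition 1 suggests a direct, if naive, way to enumerate left-reduced constant CFDs: list the k-frequent free itemsets, take the items that their closures add, and discard any item already implied by a smaller itemset. The sketch below does this by exhaustive enumeration and is not the CFDMiner algorithm itself, which relies on a dedicated closed/free itemset miner; the names `supp`, `closure`, `constant_cfds` and `toy` are hypothetical, the toy instance is invented, and k≧1 and a non-empty relation are assumed.

```python
from itertools import combinations

def supp(r, items):
    """Indices of tuples matching every (attribute, constant) pair in items."""
    return frozenset(i for i, t in enumerate(r)
                     if all(t[a] == v for a, v in items))

def closure(r, items):
    """All (attribute, constant) pairs shared by the tuples supporting items.

    Assumes the support of items is non-empty.
    """
    rows = [r[i] for i in supp(r, items)]
    return frozenset((a, v) for a, v in rows[0].items()
                     if all(t[a] == v for t in rows))

def constant_cfds(r, k):
    """Pairs (X-itemset, (A, a)) read off Proposition 1 by brute force."""
    candidates = {frozenset(c) for t in r for n in range(len(t) + 1)
                  for c in combinations(sorted(t.items()), n)}
    found = set()
    for x in candidates:
        s = supp(r, x)
        if len(s) < k:
            continue  # (i) not k-frequent
        if any(supp(r, x - {i}) == s for i in x):
            continue  # (i) not free
        for item in closure(r, x) - x:  # (ii) item lies in clo(X, tp)
            # (iii) no smaller itemset already determines the item
            if all(item not in closure(r, x - {i}) for i in x):
                found.add((x, item))
    return found

# Hypothetical toy instance; not taken from the document.
toy = [
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "212", "CT": "NYC"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
]
```

Checking only the itemsets obtained by dropping a single item is enough for condition (iii), because an item implied by a smaller itemset is also implied by every itemset between that one and X.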
From proposition 1 and the closed and free itemsets 210, 220 shown in
CFDMiner. Proposition 1 forms the basis for the constant CFD discovery algorithm. Suppose that for a given instance r and a support threshold k, all k-frequent closed sets and their corresponding k-frequent free sets are available. As mentioned above, there have been various algorithms that provide these sets. The exemplary embodiment employs the
Generally,
CTANE: A Levelwise Algorithm
According to another aspect of the invention, a CTANE levelwise algorithm is provided for discovering minimal, k-frequent CFDs. CTANE is an extension of the TANE algorithm for discovering FDs. See, e.g., Y. Huhtala et al., “TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies,” Comput. J. Vol. 42, No. 2, 100-111 (1999), incorporated by reference herein.
CTANE mines CFDs by traversing an attribute-set/pattern lattice L in a levelwise way. More precisely, the lattice L consists of elements of the form (X,tp), where X ⊂ attr(R) and tp is a pattern tuple over X. The patterns now consist of both constants and unnamed variables (_). (Y,sp) is more general than (X,tp) if Y ⊂ X and tp[Y]<<sp. This relationship defines the lattice structure on the attribute-set/pattern pairs.
CTANE for mining 1-frequent minimal CFDs is described first, followed by a discussion of how to modify CTANE to discover k-frequent minimal CFDs for a support threshold k.
CTANE starts from singleton sets (A,α) for A ε attr(R) and α ε dom(A) ∪ {_}. CTANE then proceeds to larger attribute-set/pattern levels in L. When CTANE considers (X,sp), it tests for CFDs (X\{A}→A,(sp[X\{A}]∥sp[A])), where A ε X. This guarantees that only non-trivial CFDs are considered. Furthermore, CTANE maintains for each considered element (X,sp) a set, denoted by C+(X,sp), that is used to determine whether CFD (X\{A}→A,(sp[X\{A}]∥sp[A])) is minimal. The set C+(X,sp), as will be explained in more detail below, can be maintained during the levelwise traversal. Apart from testing for minimality, C+(X,sp) also provides an effective pruning strategy, making the levelwise approach feasible in practice.
Pruning Strategy. TANE's pruning strategy is extended herein. For each element (X,sp) in L, a set C+(X,sp) is provided that consists of elements (A,cA) ε attr(R)×{dom(A) ∪{_}}, satisfying the following conditions: (i) if A ε X, then cA=sp[A]; (ii) for all B ε X, r|≠(X\{A,B}→B,(sp[X\{A,B}]∥sp[B])); and (iii) for all B ε X\{A}, r|≠(X\{A}→A,(spB∥cA)), where spB[C]=sp[C] for all C≠B and spB[B]=_. Intuitively, condition (i) prevents the creation of inconsistent CFDs; condition (ii) ensures that the LHS cannot be reduced; and finally, condition (iii) ensures that the pattern tuple is most general.
Lemma 2: Let X ⊂ attr(R), sp be a pattern over X, A ε X and assume that r|=φ=(X\{A}→A,(sp[X\{A}]∥sp[A])). Then φ is minimal iff for all B ε X, (A,sp[A]) ε C+(X\{B},sp[X\{B}]).
In terms of pruning, Lemma 2 says that any element (X,sp) of L for which C+(X,sp)=∅ need not be considered. Moreover, if C+(X,sp)=∅ then also C+(Y,tp)=∅ for any (Y,tp) that contains (X,sp) in the lattice. Therefore, the emptiness of C+(X,sp) potentially prunes away a large part of the elements in L that would otherwise need to be considered by CTANE.
Algorithm CTANE.
As shown in
1. Computes candidate RHS for minimal CFDs with their LHS in Ll. That is, for each (X,sp) ε Ll compute
2. For each (X,sp) ε Ll look for valid CFDs; i.e. for each A ε X, (A,cA) ε C+(X,sp) do the following:
(a) Check whether
r|=φ=(X\{A}→A,(sp[X\{A}]∥cA));
(b) If r|=φ then output φ. Indeed, if φ holds on r then, by Lemma 2 and Step 1, φ is indeed a minimal CFD;
(c) If r|=φ then for all (X,up) ε Ll such that up[A]=cA and up[X\{A}]<<sp[X\{A}], update C+(X,up) by removing from it (A,cA) and (B,cB), for B ε attr(R)\X;
3. Next, prune Ll. That is, for each (X,sp) ε Ll remove (X,sp) from Ll provided that C+(X,sp)=∅;
4. Finally, generate Ll+1 as follows:
(a) Initially Ll+1=∅;
(b) For each two distinct (X,sp),(Y,tp) ε Ll that agree on the first l−1 attributes:
i. Let Z=X ∪ Y and up=(sp,tp[Yn]); here Yn denotes the last attribute in Y;
ii. If there is a tuple in the projection πZ(r) that matches up then continue with (Z,up);
iii. If for all A ε Z, (Z\{A},up[Z\{A}]) ε Ll, then add (Z,up) to Ll+1;
(c) Set l=l+1.
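Step 4 is an apriori-style join, and the following sketch illustrates only that step; the maintenance of the C+ sets in Steps 1 to 3 is not modeled. It rests on several assumptions made for this illustration: each lattice element is a pair of equal-length tuples (attributes in sorted order, pattern values), two elements are joined when they agree on all but their last attribute and pattern value, and the names `next_level` and `toy` are hypothetical.

```python
WILDCARD = "_"  # the unnamed variable '_'

def next_level(level, r):
    """Generate level l+1 from level l (a set of (attrs, pattern) pairs)."""
    nxt = set()
    for x, sp in level:
        for y, tp in level:
            # Step 4(b): join elements that agree on the first l-1 positions;
            # the strict order on the last attribute avoids duplicates.
            if x[:-1] != y[:-1] or sp[:-1] != tp[:-1] or not x[-1] < y[-1]:
                continue
            z, up = x + (y[-1],), sp + (tp[-1],)  # Step 4(b)i
            # Step 4(b)ii: some tuple of r must match the pattern up on Z.
            if not any(all(p == WILDCARD or t[a] == p for a, p in zip(z, up))
                       for t in r):
                continue
            # Step 4(b)iii: every l-attribute sub-element must be in level l.
            if all((z[:i] + z[i + 1:], up[:i] + up[i + 1:]) in level
                   for i in range(len(z))):
                nxt.add((z, up))
    return nxt

# Hypothetical toy instance; not taken from the document.
toy = [
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "212", "CT": "NYC"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
]
```

Starting from single-attribute elements over AC and CC, the join produces two-attribute elements and discards patterns such as (908, 44) that no tuple of the toy instance matches.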
Lemma 2 ensures that Steps 1 and 2(a) correctly generate minimal CFDs. It is easily verified that Steps 1 and 2(c) correctly update C+(X,sp):
Lemma 3: Suppose that for all (Y,tp) ε Ll, C+(Y,tp) is correctly computed. Then, steps 1 and 2(c) in
CTANE for finding k-frequent CFDs. CTANE can be modified such that it only discovers k-frequent minimal CFDs. First, observe the following: Let φ=(X→A,(tp,cA)) be a CFD that holds on r. (Xc,tpc) denotes the itemset consisting of the constant part of (X,tp). Then φ is k-frequent iff |supp(Xc,tpc,r)|≧k when Xc≠∅, and |r|≧k otherwise. This indicates that for any reasonable choice of k (i.e., smaller than the size of r), the elements (X,sp) ε Ll need only be restricted to those for which (Xc,spc) is a k-frequent itemset. This can be achieved by (1) initializing L1 to L1={(A,_)|A ε attr(R)}∪ {(A,a1)||supp(A,a1,r)|≧k,A ε attr(R)}; and (2) by replacing Step 4.b(ii) in CTANE by a step that only considers (Z,up) if |supp(Zc,upc,r)|≧k. Both modifications increase the amount of pruning, and thus improve the efficiency of CTANE when finding k-frequent CFDs.
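Modification (1), the k-frequent initialization of L1, can be sketched as follows. The representation of lattice elements as pairs of one-element tuples and the names `initial_level` and `toy` are assumptions of this illustration, and the toy instance is invented.

```python
from collections import Counter

WILDCARD = "_"  # the unnamed variable '_'

def initial_level(r, k):
    """L1: every attribute with '_', plus every constant occurring >= k times."""
    attrs = sorted(r[0])
    level = {((a,), (WILDCARD,)) for a in attrs}
    for a in attrs:
        counts = Counter(t[a] for t in r)
        level |= {((a,), (v,)) for v, c in counts.items() if c >= k}
    return level

# Hypothetical toy instance; not taken from the document.
toy = [
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "212", "CT": "NYC"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
]
```

With k=2, only the constants 01, 908 and MH survive on the toy instance, next to the three wildcard elements.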
Generally, there are four primary computational aspects important for an efficient implementation: (i) the maintenance of the sets C+(X,sp) (Step 1); (ii) the validation of the candidate minimal CFDs (Step 2.b); (iii) the generation of Ll+1 (Step 4); and (iv) the checking of support when discovering k-frequent CFDs (Step 4.b(ii)). The technique underlying (i) and (ii) is based on so-called partitions. More specifically, given (X,sp), two tuples u, v ε r are equivalent with respect to (X,sp) if u[X]=v[X]≦sp[X]. Any (X,sp) therefore induces an equivalence relation on a subset of r. If [u](X,s
Consider again the cust relation of
As shown in
(A) Initially Ll consists of all single attribute/value pairs that appear at least k times, and each attribute occurs together with an unnamed variable. Note that k limits the number of values dramatically for, e.g., the STR attribute. At this point, all sets C+(A,cA) contain (A,cA). Since r does not satisfy any CFD with an empty LHS, none of the C+-sets is updated in Step 2. Similarly, none of the sets is removed from L1 in Step 3.
(B) In Step 4, CTANE pairs attributes together and creates consistent patterns. Note that for (CC,AC) the constant 44 does not appear anywhere (while it did at the lower level), because k=3.
(C) For the gray shaded patterns, Step 2 finds valid CFDs: (ZIP→CC,(07974∥_)), (ZIP→CC,(07974∥01)), (ZIP→AC,(07974∥_)), (ZIP→AC,(07974∥908)), and (STR→ZIP,(_∥_)). This implies that, e.g., C+([CC,ZIP],(_,07974)) and C+([AC,ZIP],(_,07974)) are updated in Step 2 by removing (CC,_) and (AC,_), respectively.
(D) Step 4 now creates triples of attributes. Only the patterns for (CC,AC,ZIP) are shown. In Step 2, CTANE finds the CFD([CC,AC]→ZIP,(_,_∥_)).
(E) As a result, CTANE updates the C+-sets in Step 2.c, not only of the current pattern but also of those with a more specific pattern on the LHS-attributes. That is, (ZIP,_) is removed from the C+-set from the first three patterns. This ensures that CFDs to be generated later only have the most general LHS-pattern.
(F) Finally, in Step 1 of CTANE, the C+ set of the pattern tuple (_,_,07974) is computed. However, recall that both C+([CC,ZIP],(_,07974)) and C+([AC,ZIP],(_,07974)) have been updated. As a result, neither (CC,_) nor (AC,_) will be included in the C+-set of (_,_,07974). This illustrates that the only chance of finding a minimal CFD in this case is to test ([AC,CC]→ZIP, (_,_∥07974)), which in this case does not hold on r. However, this shows that the C+-sets indeed reduce the possible RHS for candidate minimal CFDs.
FastCFD: A Depth First Approach
According to another aspect of the invention, a FastCFD algorithm is provided as an alternative algorithm for discovering minimal CFDs. Given an instance r and a support threshold k, FastCFD finds a canonical cover of all minimal CFDs φ such that sup(φ,r)≧k. In contrast to the breadth-first approach of CTANE, FastCFD discovers k-frequent minimal CFDs in a depth-first way. It is inspired by FastFD, a depth-first algorithm for discovering FDs.
Consider X ⊂ attr(R) and an attribute A in attr(R)\X. fixlhs(X,A,r,k) denotes the set of all CFDs φ=(Y→A,tp) such that Y ⊂ X, φ is minimal, and moreover sup(φ,r)≧k. All minimal k-frequent CFDs in r can therefore be found by computing the union, over all A ε attr(R), of fixlhs(attr(R)\{A},A,r,k). Algorithm FastCFD does this: for each A ε attr(R), it calls a procedure FindCover that computes fixlhs(attr(R)\{A},A,r,k). The remainder of this section is devoted to the description of the procedure FindCover.
Difference sets. To compute fixlhs(attr(R)\{A},A,r,k) in a depth-first way, a difference set is defined for a pair of tuples t1,t2 ε r by D(t1,t2;r)={B ε attr(R)|t1[B]≠t2[B]}, i.e., the set of attributes in which t1 and t2 differ. The difference set of r is D(r)={D(t1,t2;r)|t1,t2 ε r}.
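The difference sets can be computed directly from their definition, as in the quadratic sketch below. The names `difference_set`, `difference_sets` and `toy` are hypothetical, and the toy instance is invented. Pairs (t, t) are included, as in the definition of D(r), so the empty set always belongs to the result for a non-empty relation.

```python
from itertools import combinations_with_replacement

def difference_set(t1, t2):
    """Attributes on which the two tuples disagree."""
    return frozenset(b for b in t1 if t1[b] != t2[b])

def difference_sets(r):
    """D(r): the difference sets of all pairs of tuples in r."""
    return {difference_set(t1, t2)
            for t1, t2 in combinations_with_replacement(r, 2)}

# Hypothetical toy instance; not taken from the document.
toy = [
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "212", "CT": "NYC"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
]
```

On the toy instance, the two customers with country code 01 but different area codes differ exactly on {AC, CT}, while any customer paired with the one from country code 44 differs on all three attributes.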
{circumflex over (D)}A(r) denotes the set {Y\{A}|Y ε D(r), A ε Y}, i.e., the set of attribute sets Y\{A} such that there exist tuples in r that disagree on all of the attributes in Y, including A. Furthermore, DA(r)={Y ε {circumflex over (D)}A(r)|(Y′ ε {circumflex over (D)}A(r))̂(Y′ ⊂ Y Y′=Y)} denotes the minimal difference sets of {circumflex over (D)}A(r).
Let Z ⊂ attr(R) and X ⊂ P(attr(R)) (i.e., the power set of attr(R)). Z covers X iff ∀ Y ε X, Y ∩ Z≠Ø. Furthermore, Z is a minimal cover for X in case no Z′ ⊂ Z covers X.
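The notions of cover and minimal cover can be made concrete with the following sketch. It enumerates candidate attribute sets exhaustively by increasing size, which is an assumption made for brevity and is not the depth-first search performed by FindMin; the function names are likewise hypothetical. The family {[PN],[AC,CT]} is the one used for the pattern (CC,01) in the illustration later in this section.

```python
from itertools import combinations

def covers(z, family):
    """Z covers the family iff Z intersects every member of the family."""
    return all(z & y for y in family)

def minimal_covers(attrs, family):
    """All minimal covers of the family, found by exhaustive enumeration."""
    found = []
    for size in range(len(attrs) + 1):
        for cand in combinations(sorted(attrs), size):
            z = frozenset(cand)
            # Smaller minimal covers are already in found, so z is minimal
            # exactly when it covers and contains none of them.
            if covers(z, family) and not any(m <= z for m in found):
                found.append(z)
    return found

family = [frozenset({"PN"}), frozenset({"AC", "CT"})]
result = minimal_covers({"AC", "CT", "PN", "ZIP"}, family)
```

For this family the sketch yields [AC,PN] and [CT,PN], in agreement with the covers reported in the illustration.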
The relationship between difference sets and the validity of CFDs is revealed by Lemma 4. For a pattern tp, rt
Lemma 4: Given a constant CFDφ=(X→A,(tp∥a)), then r|=φ and sup(φ,r)≧k iff |rt
Lemma 4 forms the basis for finding minimal k-frequent CFDs. First, to find a minimal k-frequent constant CFD(X→A,(tp∥a)) a k-frequent itemset (X,tp) in r must be found such that DA(rt
Efficient Pattern Pruning Strategy. In general, all k-frequent itemsets are considered as candidates of constant patterns in CFDs φ=(X→A,(tp∥_)). However, given all k-frequent free and closed itemsets, the following lemma implies that it suffices to consider only k-frequent free itemsets as candidates for constant patterns in the process of discovering minimal variable CFDs. This strategy prunes away a large part of the constant pattern candidates and significantly improves the efficiency of the disclosed technique.
Lemma 5: Let φ=(X→A,(tp∥_)) be a variable CFD that satisfies r|=φ and sup(φ,r)≧k. If φ is minimal then the constant pattern in tp, denoted by (Xc,tpc), is a k-frequent free itemset.
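The pruning test implied by Lemma 5 can be sketched as follows, assuming the standard definition of a free itemset (an itemset none of whose proper sub-itemsets has the same support), which this section relies on but does not restate. The representation of an itemset as a set of (attribute, value) pairs, the function names, and the sample instance are assumptions. By the anti-monotonicity of support, it suffices to compare against the sub-itemsets obtained by removing a single item.

```python
def support(r, itemset):
    """Tuples of r that match every (attribute, value) pair of the itemset."""
    return [t for t in r if all(t[a] == v for a, v in itemset)]

def is_free(r, itemset):
    """True iff removing any single item strictly enlarges the support."""
    size = len(support(r, itemset))
    return all(len(support(r, itemset - {item})) > size for item in itemset)

def is_k_frequent_free(r, itemset, k):
    """Candidate test for the constant pattern of a minimal variable CFD."""
    return len(support(r, itemset)) >= k and is_free(r, itemset)

# Hypothetical sample instance (illustrative values only).
sample_r = [
    {"CC": "01", "AC": "908"},
    {"CC": "01", "AC": "908"},
    {"CC": "01", "AC": "212"},
    {"CC": "44", "AC": "131"},
]
ac_908 = frozenset({("AC", "908")})
cc_01_ac_908 = frozenset({("CC", "01"), ("AC", "908")})
```

In this sample, (AC,908) is a 2-frequent free itemset, whereas ([CC,AC],(01,908)) has the same support as (AC,908) and is therefore not free.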
Depth-First Strategy. Assume an ordering <attr on attr(R). FindCover maintains a list of possible k-frequent free itemsets Patt(R). The reason that only k-frequent free itemsets are considered is given in Lemma 5. For an itemset (Xc,tpc) in Patt(R), rt
Procedure FindCover. Let A be an attribute in attr(R), and Patt(R)={(X,tpc)} the set of k-frequent patterns over attr(R) where X ⊂ attr(R). FindCover invokes Algorithm FindMin, discussed hereinafter in conjunction with
As shown in
If Conditions (i) and (ii) hold, output CFD(XY→A,(tp∥_)).
As shown in
It is noted that (X′,tpc[X′]) in Step 3.b(ii) must be a k-frequent itemset due to the anti-monotonicity property of frequent itemsets. Thus, there exist closed itemsets (Z,sp) such that (Z,sp)≦(X′,tpc[X′]). It is noted that:
|supp(X′,tpc[X′])|=max{|supp(Z,sp)|},
Thus, DA(rt
Step 4.b is an optimization that allows a dynamic reordering of the attributes while doing the depth-first traversal through the subsets of attr(R). The disclosed algorithm supports the use of a cost model as in FastFD to dynamically reorder attributes such that attributes that cover the most difference sets are treated first.
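One possible form of such a reordering is sketched below: the remaining candidate attributes are sorted by the number of not-yet-covered difference sets that each attribute hits. The function name, the lexical tie-breaking rule (standing in for the ordering on attr(R)), and the sample family are assumptions of this sketch; the actual cost model may differ.

```python
def order_by_coverage(candidates, remaining):
    """Sort attributes so that those hitting the most remaining sets come first."""
    counts = {b: sum(1 for y in remaining if b in y) for b in candidates}
    # Descending by coverage count; ties broken lexically (an assumption).
    return sorted(candidates, key=lambda b: (-counts[b], b))

# Hypothetical family of difference sets still to be covered.
remaining = [
    frozenset({"PN"}),
    frozenset({"AC", "CT"}),
    frozenset({"AC", "ZIP"}),
]
ordered = order_by_coverage(["ZIP", "PN", "CT", "AC"], remaining)
```

Here AC hits two of the three remaining sets and is therefore explored first, while the other attributes hit one set each.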
FastCFD Illustration. As noted above, FastCFD invokes FindCover(attr(R)\{A},r,k) for each A ε attr(R). Given a k-frequent itemset (X,tpc) in r, FindCover invokes FindMin(A,(X,tpc),DA(rt
(A) Given a pattern (CC,01), rCC=01={t1,t2,t3,t4,t8}. The algorithm computes its minimal difference sets, i.e.,
DSTR(rCC=01)={[PN],[AC,CT]}.
The corresponding covers Y of DSTR(rCC=01) computed in Step 3 of FindMin 500 are [AC,PN] and [CT,PN]. Those covers Y are computed in a recursive process invoked in Step 4, which is illustrated in the depth-first search tree 610 in
φ′=([CC,AC,PN]→STR,(01,_,_∥_))
in Step 3.b. Although the algorithm verifies that φ′ is minimal for rCC=01 in Step 3.b(i), it still needs to inspect whether [CC,AC,PN] covers DSTR(r) in Step 3.b(ii), where Ø is the only immediate subset of pattern (CC,01). In this case, it finds that [CC,AC,PN] covers DSTR(r), which indicates that r|=([CC,AC,PN]→STR,(_,_,_∥_)). Thus, φ′ is not a minimal CFD.
(B) Given a pattern (CC,44), rCC=44={t5,t6,t7}. The algorithm computes its difference sets, and the corresponding minimal difference sets, respectively.
D̂STR(rCC=44)={[AC,PN,CT,ZIP],[AC,CT,ZIP]}.
DSTR(rCC=44)={[AC,CT,ZIP]}.
The covers of DSTR(rCC=44) are AC, CT, and ZIP. Consider the cover AC. FindMin needs to inspect whether the corresponding CFD
φ=([CC,AC]→STR,(44,_∥_))
is minimal. In Step 3.b(i), it verifies that φ is minimal for rCC=44, but it still needs to inspect whether [CC,AC] covers DSTR(rØ) (i.e., DSTR(r)) in Step 3.b(ii), where again Ø is the only immediate subset of pattern (CC,44). As shown by the cust relation, D(t2,t4)={PN,STR}, and [PN] ε DSTR(r). This implies that [CC,AC] cannot be a cover for DSTR(r). Thus, φ is a minimal CFD.
(C) Given a pattern tpc=([CC, AC],[01,908]), rt
D
STR(rt
The corresponding cover of DSTR(rt
φ″=([CC,AC,PN]→STR,(01,908,_∥_))
in Step 3.b. Although FindMin verifies that φ″ is minimal for rt
Implementation Details and Optimizations. The key differences between FastCFD and its FD-counterpart FastFD are: (1) the more complicated condition for testing the validity of a minimal CFD φ in terms of the minimality of the constant pattern and unnamed variables in LHS(φ); and (2) the fact that k-frequent CFDs are discovered instead of 1-frequent FDs only. Whereas for FDs, the only difference sets needed are DA(r) for A ε attr(R), Lemma 4 states that for CFDs, difference sets DA(rt
NaiveFast. The first approach is inspired by the stripped partition-based approach used by FastFD (C. M. Wyss et al., “FastFDs: A Heuristic-Driven, Depth-First Algorithm for Mining Functional Dependencies from Relation Instances—Extended Abstract,” DaWak (2001)). Here, for a given (X,tp) the stripped partition of rt
FastCFD. The second approach relies on the availability of Closed2(r), that is, all 2-frequent closed itemsets in r. Given (X,tp), it can be inferred for any two tuples in rt
Finally, since CFDMiner produces Closedk(r) as a by-product, CFDMiner can be used for constant CFD discovery and FastCFD can be used for variable CFDs only. For this, Step 3.a is eliminated in FindCover. This combination often leads to a very large overall improvement in efficiency.
Minimal CFDs can be discovered from a dataset r when both its arity and its size are large by sampling r (i.e., to find a subset rs of r by selectively drawing tuples from r such that rs accurately represents r and is small enough to be efficiently processed by FastCFD or CTANE).
System and Article of Manufacture Details
As is known in the art, the methods and apparatus discussed herein may be distributed as an article of manufacture that itself comprises a computer readable medium having computer readable code means embodied thereon. The computer readable program code means is operable, in conjunction with a computer system, to carry out all or some of the steps to perform the methods or create the apparatuses discussed herein. The computer readable medium may be a recordable medium (e.g., floppy disks, hard drives, compact disks, or memory cards) or may be a transmission medium (e.g., a network comprising fiber-optics, the world-wide web, cables, or a wireless channel using time-division multiple access, code-division multiple access, or other radio-frequency channel). Any medium known or developed that can store information suitable for use with a computer system may be used. The computer-readable code means is any mechanism for allowing a computer to read instructions and data, such as magnetic variations on a magnetic medium or height variations on the surface of a compact disk.
The computer systems and servers described herein each contain a memory that will configure associated processors to implement the methods, steps, and functions disclosed herein. The memories could be distributed or local and the processors could be distributed or singular. The memories could be implemented as an electrical, magnetic or optical memory, or any combination of these or other types of storage devices. Moreover, the term “memory” should be construed broadly enough to encompass any information able to be read from or written to an address in the addressable space accessed by an associated processor. With this definition, information on a network is still within a memory because the associated processor can retrieve the information from the network.
It is to be understood that the embodiments and variations shown and described herein are merely illustrative of the principles of this invention and that various modifications may be implemented by those skilled in the art without departing from the scope and spirit of the invention.