This application claims foreign priority from UK Patent Application Serial No. 1203420.3, filed 28 Feb. 2012.
Typically, data quality in a digital repository such as a database for example can be improved using data quality rules to identify records that violate those rules and modifying the data to remove the violations. Both the data being considered and the rules are generally visible to the party that is processing and cleaning the data.
In order to preserve the privacy of the data and the rules that are being used, it is typical to use complex cryptographic techniques or to rely on a trusted third party. Cryptographic techniques are typically computationally expensive, and relying on a trusted third party introduces another party into the system, which may not be desirable in many security settings. At the same time, for a large amount of data, the speed at which data is examined for violations and subsequently repaired is important.
According to an example, there is provided a privacy preserving system and method for detecting inconsistent data in a database.
According to an example, there is provided a computer-implemented method for detecting a set of inconsistent data records in a database including multiple records, comprising selecting a data quality rule representing a functional dependency for the database, transforming the data quality rule into at least one rule vector with hashed components, selecting a set of attributes of the database, transforming at least one record of the database selected on the basis of the selected attributes into a record vector with hashed components, computing a dot product of the rule and record vectors to generate a measure representing violation of the data quality rule by the record. Hashed components of the vectors can be fixed-size hashcodes. The data quality rule can be a conditional functional dependency (CFD) representing a functional dependency of the database which is extended with a pattern tableau specifying conditions under which the functional dependency holds for records of the database. The CFD can be a constant CFD including rule attributes which are constants. The CFD can be a variable CFD including rule attributes which are variable. The measure representing violation of the data quality rule by the record can be provided only to the owner of the database. The record can include an attribute matching a corresponding determinant attribute for the CFD. A violation occurs if there is a disagreement between a dependent attribute of the record and the corresponding attribute of the CFD. A pair of records of the database is transformed into the record vector, the hashed components of the vector determined from a random selection of attribute values from the pair of records. In an example, the privacy of a data quality rule is preserved in the case where it is not violated by the records. Transforming the data quality rule can include generating a pair of vectors for a rule representing components for the left and right hand sides of the rule. 
The pair of vectors for a data quality rule which is a variable conditional functional dependency can be concatenated to form a single vector.
According to an example, there is provided a computer program embedded on a non-transitory tangible computer readable storage medium, the computer program including machine readable instructions that, when executed by a processor, implement a method for detecting a set of inconsistent data records in a database including multiple records, comprising selecting a set of attributes and generating a set of data vectors from hashcodes of the corresponding attribute values of a set of records of the database, selecting a conditional functional dependency rule and generating a rule vector from hashcodes of the rule, computing a secure dot product of the data and rule vectors to determine an inconsistent record in the database. Generating a rule vector can include generating respective vectors for constants of the left and right sides of the rule, and wherein computing a secure dot product includes computing a dot product using each such vector with the data vector for a constant CFD inconsistency detection. Generating a rule vector can include generating respective vectors for constants of the left and right sides of the rule and concatenating the vectors to provide a rules vector for a variable CFD inconsistency detection.
According to an example, there is provided a system suitable for performing a method or for executing machine readable instructions implementing a method as described herein.
An embodiment of the invention will now be described, by way of example only, and with reference to the accompanying drawings, in which:
Typically, the data in a database in which it is desired to detect inconsistencies between records is available to a data quality tool in plain form. Such data may be private. Accordingly, if collaboration with different parties to help assess and improve its quality is desired, it will be important to maintain privacy during the process to detect the inconsistencies.
For example, in the scenario that an organization hires a data quality certifying agent to assess the quality of its data, the organization may have legal and business restrictions that prevent the sharing of some or all of its sensitive data. A specific example can be in a healthcare setting, where access to patients' medical records is typically restricted. Likewise, customer credit card information cannot be revealed to a third party agent. Complying with such legal and business requirements will be challenging as the data quality certifying agent will need to report quality problems on data that cannot be revealed. Moreover, the certifying agent would necessarily have to use a large number of data quality rules since it would be impossible to know which specific rules apply to the data. Conversely, since these rules represent an important asset for the agent due to the time and resources expended in gathering them (including the analysis of several datasets from other sources for example), it will be desirable for the agent to protect these proprietary rules as well.
In another scenario, two or more data owners may wish to collaborate to identify inconsistencies in their respective databases. Each data owner will have to first analyze their own data and generate a set of constraints found in their respective data due to correlations between values. Typically, such a rule discovery process searches for highly supported relationships between attribute values in records, and a relationship that is not supported cannot be discovered. However, there may be instances where a valid constraint that is supported in one database may not be supported in another database. For example, a business may have regional offices around the globe, each managing its own data. These data correspond to the same business domain (and they are likely to share the same schema) which strongly supports the need to collaborate in order to better assess each other's data. Accordingly, rules from any given party may be used to assess the quality of another party's data. However, an office in one country may be prevented from sharing sensitive data with another office in another country due to local legal constraints. Hence, in order to comply with regional policies, the data—even within the same organization—may have to remain private among regional offices. Moreover, it may be desirable that the rules should not be revealed as they may contain semantic relationships and information about the private data. These different owners (regional offices) would therefore need to participate in a collaborative private data cleaning process.
In an example, the scenarios can be reduced to the case of a data owner and a rules owner who is engaged in a protocol to assist the data owner in the identification of records that violate the rules whilst preserving the privacy of the data. While there are legitimate reasons to fully protect the rules as mentioned earlier, analyzing violating records (also referred to as tuples) in a database may reveal information about the violated rules. However, in an example, the privacy of any rule that is not applicable to the data is preserved. That is, rules relating to data in which no violations occur remain private.
According to an example, inconsistency detection is performed using conditional functional dependencies (CFD) which extend standard functional dependencies (FDs) with pattern tableaux that specify conditions under which the FDs hold. Thus, given a database instance D and a set of CFD rules Σ, an inconsistency detection problem in an example is characterised by determining the set of records D′⊂D that violate Σ.
However, it may not be possible to discover the rules r1 and r3 in D due to the above-mentioned inconsistent records. That is, there may not be enough support in D to discover these rules. To detect inconsistencies in D, suitable assistance from another party in possession of such rules may thus be required.
According to an example, a system and method for detecting a set of inconsistent records in a database transforms data records and rules into two vectors respectively according to an arrangement of the values from these records and rules. For example, data records can be embedded in a vector space in which a comparison by way of a dot product is performed. Typically, objects can be embedded in such a metric space using multiple different techniques. For example, a coordinate space can be defined in which each axis corresponds to a reference set which is a subset of the objects to be embedded. An example of a method which can be used to map a set of objects into a metric space is described in “Privacy Preserving Schema and Data Matching”, Scannapieco, Bertino, Figotin, Elmagarmid, SIGMOD'07, Jun. 12-14, 2007, Beijing, China, the contents of which are incorporated herein by reference in their entirety. An example of a secure dot product scheme is presented in M. Yakout, M. J. Atallah, and A. Elmagarmid, “Efficient private record linkage”, In ICDE, 2009, the contents of which are incorporated herein by reference in their entirety.
Computing a dot product of the two vectors yields a measure indicative of whether the data records are inconsistent. To ensure that the content of these two vectors is not leaked to the other party, an efficient secure dot product algorithm which does not rely on cryptographic techniques is used. The result of the secure dot product is that the data records that are inconsistent are delivered to the data owner only. Accordingly, a rules owner learns nothing about the data. Privacy of the rules depends on the level of inconsistency in the data since non-violated rules cannot typically be regenerated. That is, a bulk secure dot product protects the privacy of any non-violated rules.
According to an example, for a relational schema R, a CFD φ is defined as (R: X→Y, Tp) where i) (X∪Y)⊂attr(R), and X→Y is a standard FD; and ii) Tp is a pattern tableau for a CFD φ with attributes A∈(X∪Y), where for each entry tp∈Tp, tp[A] is either a constant or an unspecified value ‘-’ (denoted as wildcard); the constant is assumed to be drawn from the discrete domain of attribute A, or simply dom(A).
A pattern tableau is used for uniform representation of both the data and constraints involved in CFD rules. For example, with reference to
In an example, a set of CFDs, Σ, are accommodated in the same pattern tableau with the same set of attributes to form a merged pattern tableau denoted TΣ.
According to an example, a relation D of a schema R satisfies a constant CFD rule rk∈TΣ (denoted by D⊨rk) when the following holds:
for rk and t∈D, if t[X]=rk[X], then t[Y]=rk[Y].
Similarly, D satisfies a variable CFD rule rk∈TΣ when the following holds: for rk and t1, t2∈D, if t1[X]=t2[X]≍rk[X], then t1[Y]=t2[Y]≍rk[Y].
The notation t1[X]=t2[X]≍rk[X] denotes that for attribute Xl∈X, if rk[Xl] is a constant then t1[Xl], t2[Xl] and rk[Xl] are equal, otherwise (when rk[Xl] is a wildcard) only t1[Xl] and t2[Xl] are equal. If Σ is a set of CFD rules, D⊨Σ iff D⊨rk for each CFD rule rk∈TΣ. If some records do not satisfy, or violate, the CFD rule rk, those records are said to be inconsistent with respect to rk.
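The satisfaction conditions above can be illustrated on plaintext records before any privacy protection is applied. The following is a minimal sketch only: the attribute names (CC, AC, ZIP, STR, CT), the example values and the function names are hypothetical and do not come from the described method.

```python
WILDCARD = "-"  # unspecified value in a pattern tableau entry

def violates_constant(record, rule, lhs, rhs):
    """True if the record equals the rule on every LHS attribute
    but disagrees with it on at least one RHS attribute."""
    lhs_match = all(record[a] == rule[a] for a in lhs)
    rhs_match = all(record[a] == rule[a] for a in rhs)
    return lhs_match and not rhs_match

def violates_variable(t1, t2, rule, lhs, rhs):
    """True if t1 and t2 agree with each other (and with any rule constant)
    on the LHS, but fail to do so on the RHS."""
    def agree(a):
        if t1[a] != t2[a]:
            return False
        return rule[a] == WILDCARD or t1[a] == rule[a]
    return all(agree(a) for a in lhs) and not all(agree(a) for a in rhs)

# Hypothetical constant rule: (CC=44, AC=131) -> (CT=EDI)
const_rule = {"CC": "44", "AC": "131", "CT": "EDI"}
good = {"CC": "44", "AC": "131", "CT": "EDI"}
bad = {"CC": "44", "AC": "131", "CT": "LDN"}

# Hypothetical variable rule: (CC=44, ZIP=-) -> (STR=-)
var_rule = {"CC": "44", "ZIP": WILDCARD, "STR": WILDCARD}
t1 = {"CC": "44", "ZIP": "EH4", "STR": "High St"}
t2 = {"CC": "44", "ZIP": "EH4", "STR": "Main Rd"}
```

In this sketch, `bad` is inconsistent with respect to `const_rule`, and the pair (`t1`, `t2`) is inconsistent with respect to `var_rule` because the two records share the same ZIP but differ in STR.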
An inconsistent records set ϒ is the set of data records in D that violate any CFD rule rk∈TΣ. That is, ϒ⊂D and ∀ti∈ϒ, ti⊭TΣ. The sets ϒk are the inconsistent records sets with respect to a specific rule rk∈TΣ. C represents the inconsistent records set whose records violate constant CFDs. Similarly, V is the set whose records violate variable CFDs.
In
According to an example, given a private database D (owned by a data owner) and a set of private CFD rules Σ (owned by a rules owner), a system and method returns the set of inconsistent records D′⊂D only to the data owner such that D′ violates some rules in Σ, i.e., D′⊭Σ. In an example, inconsistency detection with constant CFDs and inconsistency detection with variable CFDs are performed separately. These can be specified as two different queries in a SQL-based detection technique for example.
If Σc and Σv represent the constant and variable CFD rule sets privately held by the rules owner (Σ=Σc∪Σv) then, given a private database D and a private Σc, inconsistency detection with constant CFDs will return the set of inconsistent records D′⊂D to the data owner such that D′⊭Σc. Similarly, inconsistency detection with variable CFDs will return D′⊂D such that D′⊭Σv.
Inconsistency detection is performed between a record (or a record-pair) and a rule. Therefore, two different sub-problems exist, the solution to which are:
In an example, for a constant CFD task, each record is compared with an individual rule to detect inconsistency, while in a variable CFD task, a pair of records (due to the wildcard attributes in the rule) is compared with an individual rule. That is, inconsistency detection in each task is a combination of two subtasks: (i) identify the record (or the pair of records) that exactly matches the LHS of a rule and (ii) mark the record(s) as inconsistent if there is a mismatch/disagreement among the RHS attribute of the rule and the data record(s).
In each task both A, a data owner, and B, a rules owner, generate appropriate vectors from a data record and a rule, respectively, and perform a secure dot product with the vectors. The content of the data and of the rule is not revealed to the other party; only the result of the dot product is delivered to A. According to an example, a constant CFD task performs an individual inconsistency detection as a two-step process (two secure dot products), and a variable CFD task performs the same as a single-step process (one secure dot product). These two tasks accumulate the sets of inconsistent records C and V, respectively.
In an example, fixed length hash codes of the attribute values are used instead of the actual values. The value h(a) denotes the hash code of the attribute value a. The hash codes are used to achieve communication and storage efficiency, but not as a security measure. Typically, 32 bit hash codes can be used without any collisions.
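A fixed-length hash code h(a) can be sketched as follows. CRC-32 from the Python standard library is used here purely as an illustrative stand-in, since the description does not name a particular hash function; consistent with the text, it serves compactness and not security.

```python
import zlib

def h(value: str) -> int:
    """Return a 32-bit hash code of an attribute value.

    CRC-32 is an assumption made for illustration; any fixed-size
    hash agreed on by both parties could take its place.
    """
    return zlib.crc32(value.encode("utf-8"))

# Hash codes for a few hypothetical attribute values.
codes = [h(v) for v in ("44", "131", "EDI")]
```

Both parties would need to apply the same function to the same textual representation of each value, so that equal attribute values always yield equal hash codes.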
Each entry in a merged pattern tableau is denoted as rφ(X→Y)∈TΣ. X (Y) constitutes the union of LHS (RHS, respectively) attributes of all CFD rules. X∪Y is the set of attributes that A, the data owner, and B, the rules owner, mutually agree on. X′ and Y′ refer to the LHS and RHS attributes with constant values, whilst X″ and Y″ denote the LHS and RHS attributes with wildcards. For example, for rule r1 in
For the constant CFD task, A has a data record ti and B has a constant CFD rule rφ. A wants to know privately whether ti⊭rφ. A process for inconsistency detection using constant CFDs is shown below.
Method for Constant CFDs
Input: The record ti∈D held by A, and the constant CFD rule rφ∈TΣ held by B
Output: Inconsistent records set C
The individual components of the vector VL contain the hash codes of the record's values for all the attributes in X∪Y. B then generates a vector WL from the hash codes of the CFD rule rφ (Step (b)). The k-th component WkL corresponds to
when the attribute xk has a constant on the LHS of the rule, i.e., xk∈X′, and to 0 when xk∉X′. Similarly, B generates WR with the RHS constants of the rule.
In an example, a first dot product is V·WL, which equals 1 if the data record and LHS constants of the rule match exactly, i.e., ti[xk]=rφ[xk], ∀xk∈X′.
This is because
The second dot product is V·WR, which equals 1 if the data record and RHS constants of the rule match exactly. Hence, A learns that ti⊭rφ if V·WL=1 and V·WR≠1, that is, if the record matches the LHS of the rule but disagrees with its RHS.
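The two dot products of the constant CFD task can be sketched as below. The sketch assumes a rule weight of 1/(n·h(c)) for each of the n constant attributes on one side of the rule, chosen only because it makes the dot product equal 1 on an exact match; the weight actually used by the method is not reproduced in this text. The attribute names, the CRC-32 hash and the helper names are likewise hypothetical, and the dot product is computed in the clear rather than through the secure protocol.

```python
from fractions import Fraction
import zlib

ATTRS = ["CC", "AC", "CT"]  # hypothetical agreed attribute order for X ∪ Y

def h(value):
    # 32-bit hash code; a zero code is mapped to 1 so weights can divide by it
    return zlib.crc32(value.encode("utf-8")) or 1

def record_vector(record):
    """Data owner side: hash codes of the record's attribute values."""
    return [h(record[a]) for a in ATTRS]

def rule_vector(rule, side_attrs):
    """Rules owner side: assumed weight 1/(n*h(constant)) on the n constant
    attributes of one side of the rule, 0 on every other attribute."""
    n = len(side_attrs)
    return [Fraction(1, n * h(rule[a])) if a in side_attrs else Fraction(0)
            for a in ATTRS]

def dot(v, w):
    return sum(x * y for x, y in zip(v, w))

def violates_constant_cfd(record, rule, lhs, rhs):
    """Violation: the LHS dot product equals 1 and the RHS dot product does not."""
    v = record_vector(record)
    return dot(v, rule_vector(rule, lhs)) == 1 and dot(v, rule_vector(rule, rhs)) != 1

rule = {"CC": "44", "AC": "131", "CT": "EDI"}         # (CC, AC) -> CT
consistent = {"CC": "44", "AC": "131", "CT": "EDI"}
inconsistent = {"CC": "44", "AC": "131", "CT": "LDN"}
unrelated = {"CC": "01", "AC": "131", "CT": "NYC"}    # LHS does not match
```

Exact rational arithmetic is used so that the comparison with 1 is not affected by floating-point rounding. Under the assumed weights, a non-matching record could in principle still sum to 1 by coincidence, although this is very unlikely with 32-bit hash codes.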
For a variable CFD task, A has a pair of records ti, tj and B has a variable CFD rule rφ. A wants to know privately whether (ti, tj)⊭rφ. This problem is transformed into the computation of a single dot product as described below. Knowing the dot product result, A knows whether (ti, tj)⊭rφ.
Method for Variable CFDs
Input: The records ti, tj∈D held by A, and the variable CFD rule rφ∈TΣ held by B
Accordingly, in Step (a), for a pair of records ti and tj, A generates a record vector V with the hash codes of the attribute values of either ti or tj (randomly chosen by A in an example). Then, A generates a matching vector M, such that the k-th component Mk is set to 1 if the k-th attribute values of ti and tj match exactly, that is ti[xk]=tj[xk], and to 0 otherwise. Then, A generates a vector VM by concatenating V and M. The length of the vector VM is double that of V, i.e., 2|X∪Y|.
In Step (b), B generates a vector W for rule rφ of the same length as VM. W can also be split into LHS (WL) and RHS (WR) parts. WL is the same as the vector WL described in the constant CFD task above, except that each term is multiplied by the term (s2−s1+1), where s1 and s2 are random scalars generated by B. If the k-th component of the vector WR corresponds to a LHS constant or wildcard, it contains (s1−s2)/|X′∪X″|. For a RHS wildcard it contains a random scalar s3, and otherwise 0. In other words, the random value (s1−s2) is equally split among the LHS constants and wildcards of the rule. Finally, the result of the dot product VM·W equals 1 if (ti, tj)⊭rφ; otherwise, VM·W equals a random scalar.
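The single dot product of the variable CFD task can be sketched as follows. As assumptions made only for illustration, the sketch reuses a weight of 1/(n·h(c)) for the n LHS constants, requires at least one LHS constant and a single RHS wildcard attribute, passes s1, s2 and s3 in as fixed values instead of drawing them at random, and computes the dot product in the clear. All attribute and function names are hypothetical.

```python
from fractions import Fraction
import random
import zlib

ATTRS = ["CC", "ZIP", "STR"]  # hypothetical agreed attribute order for X ∪ Y
WILDCARD = "-"

def h(value):
    # 32-bit hash code; a zero code is mapped to 1 so weights can divide by it
    return zlib.crc32(value.encode("utf-8")) or 1

def data_vector(ti, tj):
    """Data owner side: VM = V concatenated with the matching vector M."""
    chosen = random.choice((ti, tj))                  # either record may supply V
    v = [h(chosen[a]) for a in ATTRS]
    m = [1 if ti[a] == tj[a] else 0 for a in ATTRS]
    return v + m

def rule_vector(rule, lhs, rhs, s1, s2, s3):
    """Rules owner side: W = WL concatenated with WR (needs s1 != s2, s3 != 0)."""
    lhs_const = [a for a in lhs if rule[a] != WILDCARD]
    scale = s2 - s1 + 1
    wl = [scale * Fraction(1, len(lhs_const) * h(rule[a])) if a in lhs_const
          else Fraction(0) for a in ATTRS]
    wr = []
    for a in ATTRS:
        if a in lhs:                                  # LHS constant or wildcard
            wr.append(Fraction(s1 - s2, len(lhs)))
        elif a in rhs and rule[a] == WILDCARD:        # RHS wildcard
            wr.append(Fraction(s3))
        else:
            wr.append(Fraction(0))
    return wl + wr

def dot(v, w):
    return sum(x * y for x, y in zip(v, w))

rule = {"CC": "44", "ZIP": WILDCARD, "STR": WILDCARD}  # (CC=44, ZIP=-) -> (STR=-)
w = rule_vector(rule, ["CC", "ZIP"], ["STR"], s1=5, s2=2, s3=7)
t1 = {"CC": "44", "ZIP": "EH4", "STR": "High St"}
t2 = {"CC": "44", "ZIP": "EH4", "STR": "Main Rd"}      # same ZIP, different STR
t3 = {"CC": "44", "ZIP": "W1B", "STR": "High St"}      # different ZIP
```

With these values, the violating pair (t1, t2) gives (s2−s1+1) + (s1−s2) + 0 = 1, a consistent pair such as (t1, t1) gives 1 + s3, and a pair that disagrees on the LHS such as (t1, t3) loses part of the (s1−s2) share and therefore does not sum to 1.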
According to an example, a secure dot product process operates according to a known technique such as the following:
This simplification of U′ is possible since Yj is orthogonal to W(j).
The security parameter k controls the degree to which the original vectors are hidden. Using this parameter, the original vectors are hidden in a k-dimensional hyperplane. Note that A learns only i) a k-dimensional hyperplane that contains W and that is selected (Step 1) by B; and ii) the scalar U (Step 4). For i), the larger the value of k, the higher the privacy guaranteed for B. For ii), since all of the αj's are unknown to A, A cannot learn much from U.
In Step 3, B knows Xj, but not the scalars β, β′, or the vectors Yj, which hide V from B. Indeed, A is effectively adding a random vector of A's choice to V for hiding it. Note that without the β′Xj, B could obtain the direction of V in space, but not its magnitude, by computing the k dot products Zj·W(j)=βV·W(j); their ratios would reveal that direction, as β cancels out.
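The ratio argument can be written out explicitly. The form of the masked vector used here, Z_j = βV + β′X_j + Y_j with Y_j orthogonal to W^(j), is an assumption inferred from the description of Step 3 and is not stated in this text; the derivation shows what would happen if the β′X_j term were dropped.

```latex
% Assumed masked vector without the \beta' X_j term: Z_j = \beta V + Y_j,
% where Y_j \cdot W^{(j)} = 0.
Z_j \cdot W^{(j)} = \beta\, V \cdot W^{(j)} + Y_j \cdot W^{(j)} = \beta\, V \cdot W^{(j)},
\qquad
\frac{Z_i \cdot W^{(i)}}{Z_j \cdot W^{(j)}}
  = \frac{\beta\, V \cdot W^{(i)}}{\beta\, V \cdot W^{(j)}}
  = \frac{V \cdot W^{(i)}}{V \cdot W^{(j)}} .
```

The ratios depend only on V and on vectors already known to the rules owner, which is why the additional β′X_j term is needed to hide the direction of V.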
In an example, values of k equal to 2, 4 and 6 (with d=10 and 20, where d denotes the dimension of the vectors) can be used, but typically k=2 is enough; for example, when d is large a smaller k is possible because enough coordinates exist to make reconstruction harder. Other values of the parameter k are of course possible depending on the nature of the database under consideration.
Accordingly, one party does not learn the vector of the other party during the exchange of the intermediate vectors in the protocol steps. As mentioned earlier, A learns only a k-dimensional hyperplane containing W and the scalar U, which does not reveal W. In other words, the rule is not learnt through the protocol. On the other hand, B learns nothing about V. Therefore, B does not learn the data record from A during the exchange.
As mentioned, a data owner may be able to regenerate violated rules from a group of inconsistent records. Since the processes described above involve an individual dot product between a record (or a record-pair) and a rule, if the dot products of all the records and the rules were to be performed in this way, A may learn most of the inconsistent records without having to wait for all the dot product results. After learning that some of the records are inconsistent, and assuming a semi-honest setting, A may become interested in carefully perturbing some of the original data records to obtain more violations and hence drive an attack on B. Very good guesses may even lead to violations of some rules that the original set of data records did not violate in the first place. In other words, A may be able to learn rules besides the violated ones. One way to prevent this driving attack is to use a bulk version of the secure dot product (SDP).
More specifically, each step of the SDP can involve operations on all the data vectors and the rule vectors at once. Therefore, A now has to wait until Step 5 of the SDP to obtain all the dot product results at once. In addition, if the rules are always paired in the same order with data records during the bulk SDP, A would learn precisely the position of a rule that is matched (in the case of a constant CFD) or violated by the records. A could then group the records that relate only to that specific rule and perform the rule regeneration more easily. Such attacks can be countered in an example by randomizing the order of the rules paired with each record or pair of records.
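The randomization of rule order can be sketched as below. This is an illustration under assumptions only: the function name, the record and rule identifiers and the seeded generator are hypothetical, and the sketch covers just the pairing step, not the bulk secure dot product itself.

```python
import random

def randomized_pairings(record_ids, rule_ids, rng):
    """Rules owner side: draw an independent random rule order for every
    record, so that a result position does not identify a specific rule."""
    pairings = {}
    for rec in record_ids:
        order = list(rule_ids)
        rng.shuffle(order)        # fresh permutation for each record
        pairings[rec] = order
    return pairings

# Seeded only to make the illustration reproducible; an unpredictable
# source such as random.SystemRandom() would be preferable in practice.
pairings = randomized_pairings(["t1", "t2", "t3"], ["r1", "r2", "r3"],
                               random.Random(0))
```

Each record is still checked against every rule, but because the permutation differs from record to record and is kept by the rules owner, the data owner cannot group results by rule position.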
A user can interface with the system 700 with one or more input devices 711, such as a keyboard, a mouse, a stylus, touch-enabled screen or interface and the like in order to provide user input data. The display adaptor 715 interfaces with the communication bus 799 and the display 717 and receives display data from the processor 701 and converts the display data into display commands for the display 717. A network interface 719 is provided for communicating with other systems and devices via a network (not shown). The system can include a wireless interface 721 for communicating with wireless devices in the wireless community.
It will be apparent to one of ordinary skill in the art that one or more of the components of the system 700 may not be included and/or other components may be added as is known in the art. The apparatus 700 shown in
Number | Date | Country | Kind |
---|---|---|---|
1203420.3 | Feb 2012 | GB | national |
Number | Name | Date | Kind |
---|---|---|---|
20070053507 | Smaragdis et al. | Mar 2007 | A1 |
20080021899 | Avidan et al. | Jan 2008 | A1 |
20090287721 | Golab et al. | Nov 2009 | A1 |
20100250596 | Fan et al. | Sep 2010 | A1 |
20110138312 | Yeh et al. | Jun 2011 | A1 |
Number | Date | Country |
---|---|---|
20020648 | Mar 2003 | IE |
Entry |
---|
Mohamed Yakout “Efficient Private Record Linkage”, Mar. 29, 2009, Data Engineering, 2009. ICDE '09. IEEE 25th International Conference, pp. 1283-1286. |
GB1203420.3 Examination Report dated Jun. 20, 2012. |
Number | Date | Country | |
---|---|---|---|
20130226879 A1 | Aug 2013 | US |