1. Field of the Invention
The present invention relates generally to systems and methods for data analysis, and more particularly for determining an element value in private datasets.
2. Discussion of Background Art
Sharing information which by choice or by law is to remain private (i.e. secret) is almost self-contradictory. How can a party share information which is to remain private? One approach to this dilemma is that parties are often willing to share certain statistical information about their own private data. Such statistical data may include an average value, a median value, a lowest value, a highest value, as well as various other data distribution statistics.
Uses for sharing such statistical information while preserving the privacy of the underlying data abound. For example, suppose that multiple hospitals wish to compute the median life expectancy of patients they treated who had a particular medical ailment (e.g. SARS or HIV). Often by law, hospitals are not permitted to share their detailed personal patient data so that such a median life expectancy could be computed, and yet knowing the median life expectancy would likely be of great value to researchers and government entities tracking the success of combating such a disease.
Similarly, suppose multiple universities wish to compute the median salary of their combined faculty populations so as to better compete for and compensate their faculty. Each university would not like to reveal individual salaries, since pay-scales are determined by length of time in the institution (and so, for example, the minimum salary corresponds to the most junior faculty member and the maximum salary to the most senior). However, computation of the median salary is a basic statistic that various employee organizations and magazines routinely publish.
Current solutions attempt to generate such statistical data from private information by perhaps using a trusted third party; however, the computational and resource overhead required to implement such methods is very burdensome. Also due to the private and sensitive nature of their data, parties may intentionally or unintentionally provide inaccurate data about their private information, thereby further complicating efforts to ensure the accuracy of various statistics computed for such private data.
In response to the concerns discussed above, what is needed is a system and method for secure data analysis that overcomes the problems of the prior art.
The present invention is a system and method for determining a value [V] of an element, having a k-th rank [k], where [k] is a pre-selected number. The method includes: calculating a total number of elements [T] in a first dataset managed by a first party, and a second dataset managed by a second party; prohibiting each party access to each other's dataset; ranking the elements within each dataset; computing a total number of elements [Σli] in the datasets each having a value less than a test value [m], without exchanging a number of elements [li] within each dataset each having a value less than the test value; computing a total number of elements [Σgi] in the datasets each having a value greater than the test value [m], without exchanging a number of elements [gi] within each dataset each having a value greater than the test value; and setting the value [V] of the element, having the k-th rank [k], equal to the test value [m], if the total number of elements having values less than the test value [Σli] is ≦ the k-th rank [k] minus one, and the total number of elements having values greater than the test value [Σgi] is ≦ the total number of elements [T] minus the k-th rank [k]. The system includes all means for practicing the method.
These and other aspects of the invention will be recognized by those skilled in the art upon review of the detailed description, drawings, and claims set forth below.
The present invention describes a secure method for multiple parties to compute the value of a predetermined set of ranked elements in a union of their private datasets. The present invention teaches how to compute such values while preserving the privacy of each party's private dataset information. The present invention also enables parties to generate a histogram of the values in the union of their sets without knowing which party contributed which element values to the histogram data, and includes a set of malicious user verification tests for reporting on any inconsistencies in a party's responses whenever one or more parties do not fully trust each other.
In one exemplary application, a set of hospitals can use the present invention to gain more information about the general patient population (median life expectancy, median white blood cell count, etc.) without revealing confidential information about any of their individual patients. In another exemplary application, a set of universities can compute their combined median salary without the cost of a trusted third party, while assuring that nothing beyond the combined median salary is revealed.
The present invention also teaches a different computational technique for securely computing the value of a k-th ranked element while requiring a lower computational overhead than current systems. Such a reduction in overhead increases the calculation's computational speed and thus enables a larger number of parties having larger private datasets to participate in the k-th ranked element computation. So while current methods for finding the k-th ranked element value and handling malicious users are very messy and complicated, the boundary tests and protocol taught by the present invention instead provide a much simpler and cleaner approach.
The system 100 includes a first party ranking system 102, a second party ranking system 104, up through an “s-th” party ranking system 106. The parties 102, 104, and 106 communicate over a network 108. Each of the party ranking systems 102, 104, and 106 preferably includes: a dataset analysis module 110, a private dataset 112, and a secure computation module 114. While the present invention is primarily described with respect to the first party ranking system 102 and its functionality, the present invention is preferably effected in parallel and in a similar manner in all of the other party ranking systems 104 through 106.
The method begins wherein each dataset analysis module 110 in each of the party ranking systems 102, 104, and 106 (1≦i≦s), individually and privately analyzes their respective private datasets 112 for the following statistical information: a number of elements [Ti] in the private dataset 112, a minimum value [αi] of the elements in the dataset 112, and a maximum value [βi] of the elements in the dataset 112. The secure computation module 114 securely sums the number of elements [Ti] in each of the private datasets 112 using a known cryptographic protocol to calculate a combined (i.e. total) number of elements [T] (i.e. T=Σ Ti, 1≦i≦s) without revealing each party's individual [Ti] over the network 108.
Such known cryptographic protocols teach how two or more parties (i.e. P1 . . . Pn, where n≧2) having private inputs (i.e. X1, X2, . . . Xn), can evaluate a given function (i.e. f(X1, X2, . . . Xn)) without divulging more information about their private inputs (i.e. X1 . . . Xn) than is already implicit in the calculated function (i.e. f(X1, . . . ,Xn)) itself. This cryptographic protocol is used in several more computations within the present invention as will be discussed below.
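By way of illustration only, the secure summation of the element counts [Ti] described above can be sketched with additive secret sharing. The source does not name the particular protocol used, so the choice of additive sharing, the modulus, and the names make_shares and secure_sum are all assumptions of this sketch, which simulates the s parties in a single process rather than over the network 108.

```python
import secrets

# Assumed public modulus; it only needs to exceed any possible total T.
MODULUS = 2**61 - 1


def make_shares(value, n_parties, modulus=MODULUS):
    """Split value into n_parties random shares that sum to value mod modulus."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares


def secure_sum(private_values, modulus=MODULUS):
    """Simulate s parties summing private values without revealing any of them."""
    s = len(private_values)
    # Row i holds the shares that party i sends out, one per recipient.
    sent = [make_shares(v, s, modulus) for v in private_values]
    # Party j publishes only the sum of the shares it received.
    partial = [sum(sent[i][j] for i in range(s)) % modulus for j in range(s)]
    # The published partial sums add up to the total, and nothing more.
    return sum(partial) % modulus


T = secure_sum([5, 3, 4])  # three hypothetical dataset sizes T1, T2, T3
```

In this sketch each individual share is uniformly random, so a party learns nothing about another party's [Ti] from the shares it receives; only the combined total [T] is reconstructed.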
Next, the secure computation module 114 uses a cryptographic protocol to identify a combined minimum value [αMIN] from all of the private dataset minimum values [αi] (i.e. αMIN=MIN(αi), where 1≦i≦s). The secure computation module 114 identifies a combined maximum value [βMAX] from all of the private dataset maximum values [βi] (i.e. βMAX=MAX(βi), where 1≦i≦s).
The parties 102, 104, and 106 select an element having a k-th rank [k] (i.e. the “k-th ranked element”) within the total number of elements [T] whose value [V] is to be determined. Preferably the k-th rank [k] is selected based on a prior agreement between each of the system administrators managing the various party ranking systems 102, 104, and 106.
Those skilled in the art recognize that there are many ways to select the k-th rank [k], such as by negotiations between the various parties 102, 104, and 106. In such a negotiation, the parties 102, 104, and 106 may even agree to use the present invention to securely determine values associated with more than one (e.g., where k is the 25th, 50th, and 75th percentile) or even all of the ranked elements in the combined set of private datasets (i.e. k, where k ranges from 1 to T) so that a histogram of the values associated with the various elements in [T] may be generated. Which individual element value came from which private dataset 112, however, would still remain private. In other instances of the present invention a system administrator for one or more of the party ranking systems may only agree to help calculate one of the ranked elements.
The dataset analysis module 110, within each of the party ranking systems 102, 104, and 106, securely sorts the elements within its own private dataset 112 according to the elements' respective values [v] in ascending order (i.e. from [αi] to [βi] respectively). Those skilled in the art however will recognize that there are many ways to “rank” data and not all of them are based on an element's value. Other ranking approaches may be based on some other attribute of the dataset elements, such as a date and time when an element was entered into the private dataset.
The secure computation module 114, within each of the party ranking systems 102, 104, and 106, defines an initial search range 201 from [a1=αMIN] 202 to [b1=βMAX] 204, and sets a test value [m1] 206 for the k-th ranked element value [V] equal to some value within the initial search range 201. The secure computation module 114 within the first party ranking system 102 announces the test value [m1] 206 to all of the parties 104 through 106. However, in an alternate embodiment, [m1] can be implicitly computed by each of the parties.
Each of the parties 102, 104, through 106 individually computes in a privacy-preserving manner (i.e. each party keeps the exact values of its dataset elements secret) a number of elements [li] in its private dataset 112 whose elements have a value [v] that is less than the test value [m1] 206. Each of the parties 102, 104, through 106 also computes in a privacy-preserving manner a number of elements [gi] in its private dataset 112 whose elements have a value [v] that is greater than the test value [m1] 206, where [li] and [gi] are integers.
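As a minimal sketch of the local counting just described, a party holding an ascending-sorted dataset could obtain [li] and [gi] by binary search. The function name count_less_greater and the sample values are hypothetical; the computation runs entirely inside one party's system, so no element value leaves the private dataset 112.

```python
from bisect import bisect_left, bisect_right


def count_less_greater(sorted_values, m):
    """Return (l, g): how many values are strictly less / strictly greater than m."""
    # bisect_left gives the index of the first element >= m, i.e. the count below m.
    l = bisect_left(sorted_values, m)
    # bisect_right gives the index just past the last element <= m.
    g = len(sorted_values) - bisect_right(sorted_values, m)
    return l, g


# Hypothetical private dataset, already sorted in ascending order.
l_i, g_i = count_less_greater([1, 3, 3, 7, 9], 3)
```

Elements equal to the test value are counted in neither [li] nor [gi], which is why li+gi can be smaller than the dataset size [Ti].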
Note that in some applications of the present invention, the system administrator for one or more of the parties 102, 104, and 106 may not completely trust information provided by one or more of the other parties (e.g. the system administrator thinks that some of the other parties may be either intentionally or unintentionally providing inaccurate information about the information in their private dataset 112 or not following the present invention's computational protocol). Those parties with such concerns preferably implement a set of verification tests which ensure that the parties 102, 104, and 106 provide consistent inputs, over the course of the method as the search range shrinks.
While the extra functionality is discussed as if only the first party 102 does not trust the second party [Pi=2], those skilled in the art recognize that any of the parties may selectively implement such extra protective functionality with respect to more than one, or even all, of the other parties (i.e. Pi where 1≦i≦s).
If the first party 102 is concerned that the second party's [Pi=2] 104 information may be inaccurate, the secure computation module 114 defines for the second party 104 a lower verification boundary [(l)i=2] (relabeled as BL in the claims) for checking the validity of the [li=2] values provided by the second party 104, and a greater verification boundary [(g)i=2] (relabeled as BG in the claims) for checking the validity of the [gi=2] values provided by the second party 104. (l)i=2 denotes a number of elements the second party possesses that are strictly smaller than the current search range, and (g)i=2 denotes a number of elements the second party possesses that are strictly larger than the current search range. Since the search range shrinks as the method executes, both (l)i=2 and (g)i=2 increase as the method executes. These first and second boundaries are initially set to zero (i.e. (l)i=2=0 and (g)i=2=0) but are later revised as the k-th element value [V] identification process continues as described below.
If the first party 102 is concerned that the second party's 104 information may be inaccurate, the secure computation module 114 within the first party ranking system 102 uses a cryptographic protocol to verify that the number of elements reported by the second party 104 as less than the test value [m1] (i.e. li=2) plus the number of elements reported by the second party 104 as greater than the test value [m1] (i.e. gi=2), does not exceed the number of elements [Ti=2] the second party reported as being in their private dataset (Di=2) (i.e. li=2+gi=2≦Ti=2).
As introduced above, the cryptographic protocol enables the secure computation module 114 to perform this computation without the second party 104 (or any other party) having to reveal its private li and gi values. In one embodiment, the cryptographic protocol used to perform the verification uses a secure function [h] that preferably outputs a logical “true” if the second party 104 fails the verification test, and “false” if the second party 104 passes the verification test.
Continuing, if the first party 102 is concerned that the second party's 104 information may be inaccurate, the secure computation module 114 for the first party 102 also uses a cryptographic protocol to verify that the number of elements the second party 104 reported as less than the test value [m1] (i.e. li=2), is greater than or equal to the first boundary [(l)i=2] (i.e. li=2≧(l)i=2).
If the first party 102 is concerned that the second party's 104 information may be inaccurate, the secure computation module 114 for the first party 102 uses a cryptographic protocol to verify that the number of elements the second party 104 reported as greater than the test value [m1] (i.e. gi=2), is greater than or equal to the second boundary [(g)i=2] (i.e. gi=2≧(g)i=2).
If any one of the three verification tests fails, the secure computation module 114 announces on the network 108 that the second party 104 has provided inaccurate information and stops all processing within the system 100 (i.e. aborts the k-th element value calculation). In an alternate embodiment, the system 100 can reinitialize the k-th element value calculation, but this time the remaining parties exclude the second party's 104 data from the calculation.
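The three verification tests can be summarized, purely for illustration, as one predicate that mirrors the secure function [h] by returning true when the second party fails. This plaintext sketch omits the cryptographic protocol that keeps li and gi hidden; the function name fails_verification and its parameter names are hypothetical.

```python
def fails_verification(l_i, g_i, T_i, bound_l, bound_g):
    """Return True if a party's reported counts fail any of the three tests.

    l_i, g_i : counts reported as less / greater than the test value
    T_i      : dataset size the party reported earlier
    bound_l  : lower verification boundary, written (l)i in the text
    bound_g  : greater verification boundary, written (g)i in the text
    """
    too_many = l_i + g_i > T_i   # test 1: li + gi must not exceed Ti
    l_shrank = l_i < bound_l     # test 2: li must be at least (l)i
    g_shrank = g_i < bound_g     # test 3: gi must be at least (g)i
    return too_many or l_shrank or g_shrank


# Hypothetical reports from a party that claimed six elements.
honest = fails_verification(2, 3, 6, 0, 0)
inflated = fails_verification(4, 3, 6, 0, 0)
```

Here honest evaluates to false and inflated to true, since 4+3 exceeds the reported dataset size of 6.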
If a sum total number of elements reported by all parties 102, 104, and 106 as less than the test value (i.e. Σ li, where 1≦i≦s) is less than or equal to the k-th rank [k] minus one (i.e. Σ li≦k−1); and, a sum total number of elements reported by all parties as greater than the test value (i.e. Σ gi, where 1≦i≦s) is less than or equal to the total number of elements [T] minus the k-th rank [k] (i.e. Σ gi≦T−k), then the secure computation module 114 sets value [V] of the element having the k-th rank [k] equal to the test value [m1] and sets a control variable equal to “done”. The secure computation module 114 preferably performs this summation (i.e. Σ li and Σ gi), and the summations to follow, using a cryptographic protocol which preserves the privacy of each party's individual li and gi values.
If the sum total number of elements reported by all of the parties 102, 104, 106 as less than the test value (i.e. Σ li) is itself greater than or equal to the k-th rank [k] (i.e. Σ li≧k), then the value [V] of the k-th ranked element has not been found, but is instead equal to something lower than the test value [m1], and so the secure computation module 114 defines a smaller search range 208 from [a2=a1] 202 to [b2=m1] 210 and sets the control variable to “not done”. The smaller search range's 208 upper limit [b2] has been reduced to the test value [m1] since the value [V] of the k-th ranked element is equal to something less than the test value [m1].
If the first party 102 is concerned that information provided by party 104 may be inaccurate, the secure computation module 114 uses a cryptographic protocol to set the second boundary [(g)i=2] equal to the second party's 104 reported private dataset size [Ti=2] minus the total number of elements the second party 104 reported as less than the test value [m1] (i.e. (g)i=2=Ti=2−li=2). Note that as the upper limit [b2] of the search range decreases, [(g)i=2] is non-decreasing. This can be seen by noting that [Ti=2−li=2≧gi=2], which is enforced to be at least as much as the previous value of [(g)i=2], while the lower limit [a2] of the search range remains the same, and thus [(l)i=2] is not increased.
If the sum total number of elements reported by all of the parties 102, 104, 106 as greater than the test value (i.e. Σ gi) is itself greater than or equal to T−k+1 (i.e. Σ gi≧T−k+1), then the value [V] of the element having the k-th rank [k] also has not been found, but is instead equal to something more than the test value [m1], and so the secure computation module 114 increases the lower limit [a2] of the search range to the test value [m1] and sets the control variable to “not done”. The search range's lower limit would be increased since the value [V] of the k-th ranked element is equal to something greater than the test value [m1].
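The three possible outcomes for a given test value can be sketched as a single decision function operating on the securely computed totals Σ li and Σ gi. The function name classify and the outcome labels "done", "lower", and "higher" are hypothetical names chosen for this sketch.

```python
def classify(sum_l, sum_g, T, k):
    """Decide where the k-th ranked value lies relative to the test value.

    sum_l, sum_g : totals reported as less / greater than the test value
    T            : total number of elements across all datasets
    k            : rank being sought, with 1 <= k <= T
    """
    if sum_l <= k - 1 and sum_g <= T - k:
        return "done"    # the test value is the k-th ranked value
    if sum_l >= k:
        return "lower"   # V is below the test value: reduce the upper limit
    # Otherwise sum_g >= T - k + 1 must hold.
    return "higher"      # V is above the test value: raise the lower limit


# Hypothetical totals for T = 12 elements and rank k = 6.
outcome = classify(5, 6, 12, 6)
```

For consistent reports, Σ li+Σ gi cannot exceed [T], so the "lower" and "higher" conditions cannot both hold for the same test value.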
If the first party 102 is concerned that information provided by party 104 may be inaccurate, the secure computation module 114 uses a cryptographic protocol to set the first boundary [(l)i=2] equal to the second party's 104 reported private dataset size [Ti=2] minus the total number of elements [gi=2] the second party 104 reported as greater than the test value [m1] (i.e. (l)i=2=Ti=2−gi=2). Note that as the lower limit [a3] of the search range increases, [(l)i=2] is non-decreasing. This can be seen by noting that [Ti=2−gi=2≧li=2], which is enforced to be at least as much as the previous value of [(l)i=2], while the upper limit [b3] of the search range remains the same, and thus [(g)i=2] is not increased.
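The two boundary revisions can be illustrated together in one small plaintext sketch. The function name update_bounds and the direction labels "lower" and "higher" are hypothetical; in the described method these assignments would be made under a cryptographic protocol rather than on exposed values.

```python
def update_bounds(outcome, l_i, g_i, T_i, bound_l, bound_g):
    """Revise one party's verification boundaries after the search range shrinks.

    outcome is "lower" when the upper limit was reduced to the test value and
    "higher" when the lower limit was increased to it; any other value leaves
    both boundaries unchanged.
    """
    if outcome == "lower":
        # Everything not reported as below m now lies at or above the new upper limit.
        bound_g = T_i - l_i
    elif outcome == "higher":
        # Everything not reported as above m now lies at or below the new lower limit.
        bound_l = T_i - g_i
    return bound_l, bound_g


# Hypothetical report: 2 below, 3 above, out of 6 elements; boundaries start at zero.
after_lower = update_bounds("lower", 2, 3, 6, 0, 0)
after_higher = update_bounds("higher", 2, 3, 6, 0, 0)
```

With these sample numbers the greater boundary becomes 6−2=4 in the first case and the lower boundary becomes 6−3=3 in the second, so later reports that fall below these figures would fail the verification tests.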
If the control variable is set to “done” then the method ends. If the control variable is set to “not done” then element values within one of the smaller search ranges are analyzed again, but this time using a different test value (e.g. [m2] 212, [m3] 218, and so on).
Through this iterative process, eventually the lower and upper limits zero in on the value [V] of the k-th ranked element 220, since Σ li≦k−1, and Σ gi≦T−k, where 1≦i≦s, as described above. This value [V] of the k-th ranked element 220 will be accurate even if multiple elements within the private datasets 112 have this same value [V].
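The iterative process as a whole can be sketched as a plaintext simulation in which the totals are summed openly; in the described method those sums would instead be obtained through the cryptographic protocol. The sketch makes several assumptions that are not taken from the source: element values are integers, the test value is chosen as the midpoint of the search range, and the limits are narrowed to one below or one above the test value (rather than to the test value itself) so that the integer search is guaranteed to terminate. The function name kth_ranked_value and the sample datasets are hypothetical.

```python
from bisect import bisect_left, bisect_right


def kth_ranked_value(datasets, k):
    """Plaintext simulation of the search for the k-th smallest value overall."""
    sorted_sets = [sorted(d) for d in datasets if d]
    T = sum(len(d) for d in sorted_sets)
    if not 1 <= k <= T:
        raise ValueError("k must satisfy 1 <= k <= T")
    a = min(d[0] for d in sorted_sets)    # combined minimum, alpha_MIN
    b = max(d[-1] for d in sorted_sets)   # combined maximum, beta_MAX
    while a <= b:
        m = (a + b) // 2                  # assumed choice of test value
        # Each party counts locally; only the totals are used below.
        sum_l = sum(bisect_left(d, m) for d in sorted_sets)
        sum_g = sum(len(d) - bisect_right(d, m) for d in sorted_sets)
        if sum_l <= k - 1 and sum_g <= T - k:
            return m                      # control variable "done"
        if sum_l >= k:
            b = m - 1                     # V lies below the test value
        else:
            a = m + 1                     # V lies above the test value
    raise RuntimeError("reported counts were inconsistent")


# Three hypothetical private datasets; their union is 1,2,3,5,5,7,8,9.
parties = [[1, 5, 9], [2, 5, 7], [3, 8]]
V = kth_ranked_value(parties, 4)
```

In this example the first midpoint is 5, three elements are below it and three above it, so both conditions hold for k=4 and the search stops immediately even though the value 5 appears in two different datasets.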
In step 406, the secure computation module 114 uses a cryptographic protocol to identify a combined minimum value [αMIN] from all of the private dataset minimum values [αi] (i.e. αMIN=MIN(αi), where 1≦i≦s). In step 408, the secure computation module 114 uses a cryptographic protocol to identify a combined maximum value [βMAX] from all of the private dataset maximum values [βi] (i.e. βMAX=MAX(βi), where 1≦i≦s).
In step 410, the parties 102, 104, and 106 select an element having a k-th rank [k] (i.e. the “k-th ranked element”) within the total number of elements [T] whose value [V] is to be determined. In step 412, the dataset analysis module 110, within each of the party ranking systems 102, 104, and 106, securely sorts the elements within their own private dataset 112 according to the element's respective values [v] in ascending order (i.e. from [αi] to [βi] respectively). In step 414, the secure computation module 114, within each of the party ranking systems 102, 104, and 106, defines an initial search range 201 from [a1=αMIN] 202 to [b1=βMAX] 204. In step 416, the secure computation module 114 sets a test value [m1] 206 for the k-th ranked element value [V] equal to some value within the initial search range 201. In step 418, the secure computation module 114 within the first party ranking system 102 announces the test value [m1] 206 to all of the parties 104 through 106.
In step 420, each of the parties 102, 104, through 106 computes in a privacy-preserving manner a number of elements [li] in its private dataset 112 whose elements have a value [v] that is less than the test value [m1] 206. In step 422, each of the parties 102, 104, through 106 computes in a privacy-preserving manner a number of elements [gi] in its private dataset 112 whose elements have a value [v] that is greater than the test value [m1] 206, where [li] and [gi] are integers.
In step 424, if the first party 102 is concerned that the second party's [Pi=2] 104 information may be inaccurate, the secure computation module 114 defines for the second party 104 a lower verification boundary [(l)i=2] for checking the validity of the [li=2] values provided by the second party 104, and a greater verification boundary [(g)i=2] for checking the validity of the [gi=2] values provided by the second party 104.
In step 426, if the first party 102 is concerned that the second party's 104 information may be inaccurate, the secure computation module 114 within the first party ranking system 102 uses a cryptographic protocol to verify that the number of elements reported by the second party 104 as less than the test value [m1] (i.e. li=2) plus the number of elements reported by the second party 104 as greater than the test value [m1] (i.e. gi=2), does not exceed the number of elements [Ti=2] the second party reported as being in their private dataset (Di=2) (i.e. li=2+gi=2≦Ti=2).
In step 428, if the first party 102 is concerned that the second party's 104 information may be inaccurate, the secure computation module 114 for the first party 102 also uses a cryptographic protocol to verify that the number of elements the second party 104 reported as less than the test value [m1] (i.e. li=2), is greater than or equal to the first boundary [(l)i=2] (i.e. li=2≧(l)i=2).
In step 430, if the first party 102 is concerned that the second party's 104 information may be inaccurate, the secure computation module 114 for the first party 102 uses a cryptographic protocol to verify that the number of elements the second party 104 reported as greater than the test value [m1] (i.e. gi=2), is greater than or equal to the second boundary [(g)i=2] (i.e. gi=2≧(g)i=2).
In step 432, if any one of the three verification tests (i.e. steps 426, 428, and 430) fails, the secure computation module 114 announces on the network 108 that the second party 104 has provided inaccurate information and stops all processing within the system 100, after which the method 400 ends.
In step 434, if a sum total number of elements reported by all parties 102, 104, and 106 as less than the test value [m1] (i.e. Σ li, where 1≦i≦s) is less than or equal to the k-th rank [k] minus one (i.e. Σ li≦k−1, where 1≦i≦s); and, a sum total number of elements reported by all parties as greater than the test value [m1] (i.e. Σ gi, where 1≦i≦s) is less than or equal to the total number of elements [T] minus the k-th rank [k] (i.e. Σ gi≦T−k, where 1≦i≦s), then the secure computation module 114 sets value [V] of the element having the k-th rank [k] equal to the test value [m1] and sets a control variable equal to “done”.
In step 436, if the sum total number of elements reported by all of the parties 102, 104, and 106 as less than the test value [m1] (i.e. Σ li, where 1≦i≦s) is greater than or equal to the k-th rank [k] (i.e. Σ li≧k), then the value [V] of the k-th ranked element has not been found, but is instead equal to something less than the test value [m1], and so the secure computation module 114 defines a revised search range 208 from [a2=a1] 202 to [b2=m1] 210 and sets the control variable to “not done”.
In step 438, if the first party 102 is concerned that information provided in step 436 by party 104 may be inaccurate, the secure computation module 114 uses a cryptographic protocol to set the second boundary [(g)i=2] equal to the second party's 104 reported private dataset size [Ti=2] minus the total number of elements the second party 104 reported as less than the test value [m1] (i.e. (g)i=2=Ti=2−li=2).
In step 440, if the sum total number of elements reported by all of the parties 102, 104, and 106 as greater than the test value [m1] (i.e. Σ gi, where 1≦i≦s) is greater than or equal to T−k+1 (i.e. Σ gi≧T−k+1, where 1≦i≦s), then the value [V] of the element having the k-th rank [k] also has not been found, but is instead equal to something greater than the test value [m1], and so the secure computation module 114 increases the lower limit [a2] of the revised search range 208 to the test value [m1] and sets the control variable to “not done”.
In step 442, if the first party 102 is concerned that information provided in step 440 by party 104 may be inaccurate, the secure computation module 114 uses a cryptographic protocol to set the first boundary [(l)i=2] equal to the second party's 104 reported private dataset size [Ti=2] minus the total number of elements [gi=2] the second party 104 reported as greater than the test value [m1] (i.e. (l)i=2=Ti=2−gi=2).
In step 444, if the control variable is set to “done” then the method ends. In step 446, if the control variable is set to “not done” then steps 414 through 442 are repeated, but this time with a next test value (e.g. [m2] 212, [m3] 218, and so on) and a new search range (e.g. revised search range 208, revised search range 214, and so on).
While one or more embodiments of the present invention have been described, those skilled in the art will recognize that various modifications may be made. Variations upon and modifications to these embodiments are provided by the present invention, which is limited only by the following claims.