Method for classifying private data using secure classifiers

Description

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow diagram of a method for securely classifying samples using k-nn classification according to an embodiment of the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

FIG. 1 shows an embodiment of our invention for a method for privately classifying data using a k-nn classifier or a Parzen window classifier. A first party (Alice, a client) has the private data x 101, and a second party (Bob, a server) has the classifier ƒ and private labeled samples 102. The two parties, Alice and Bob (also denoted A and B), want to privately evaluate the function ƒ on the data x, while each party gives the other party as little information as possible. We give exact definitions of the privacy Alice and Bob expect to obtain from their interaction.

A secure dot product protocol is applied 110 to the private query sample 101 and the private labeled samples 102 to securely determine distances 111 between the private query sample and the plurality of private labeled samples. A secure k-rank protocol is applied 120 to the distances 111 to determine a nearest distance 121 of a k-th nearest labeled sample having a particular label. Then, a secure Parzen window protocol is applied 130 using the nearest distance 121 to label the private query sample according to the particular label 131.

In the above method, n private labeled samples can be partitioned 115 into √n clusters, each cluster including √n private labeled samples, and private shares of the distances between the private query sample and the private labeled samples in one of the clusters can be obtained before applying the protocols 110, 120 and 130.

Density Estimation

We describe density estimation methods. All computations are performed over a prime field F=Z_p for some prime number p. The size p of the prime field is exponential in the cryptographic security parameter. In all cases, we need to evaluate whether an expression is larger or smaller than zero. This cannot be done directly over the prime field F, which has no natural ordering. Therefore, the parameters are integers. We let S be an upper bound, known to both Alice and Bob, on the absolute value of every intermediate result in the computation. Such a value S is implicit from the choice of representation of the inputs over the integers. Then, Alice and Bob select the prime field F with cardinality larger than 2S. Due to wrapping (modulo |F|), negative integers map to field elements larger than |F|−S, and positive integers map to field elements smaller than S.
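As a minimal sketch of this integer representation, the following maps signed integers into the field and back. The prime P, the bound S, and the function names encode, decode and is_negative are illustrative assumptions and are not part of the described embodiment.

```python
# Illustrative sketch: signed integers bounded by S, embedded in Z_p with p > 2S.
P = 2**61 - 1   # a Mersenne prime, chosen only for illustration (assumption)
S = 10**6       # assumed bound on every intermediate absolute value


def encode(v: int) -> int:
    """Map an integer with |v| < S to its representative in Z_p."""
    if abs(v) >= S:
        raise ValueError("value exceeds the agreed bound S")
    return v % P


def is_negative(e: int) -> bool:
    """Field elements above p - S represent negative integers."""
    return e > P - S


def decode(e: int) -> int:
    """Recover the signed integer from its field representative."""
    return e - P if is_negative(e) else e
```

With these definitions, sums computed modulo P decode to the expected signed result as long as the result stays below S in absolute value, which is what allows a sign test on a field element.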

Private Computation

Let ƒ: {0, 1}*×{0, 1}*→{0, 1}*×{0, 1}* be a function. We denote the first item of ƒ(x₁, x₂) by ƒ_A(x₁, x₂), and the second item by ƒ_B(x₁, x₂). Let π be a two-party protocol for evaluating the function ƒ, where we denote the two parties by A for Alice and B for Bob.

If Alice and Bob follow the protocol, then the parties are called semi-honest, or honest but curious. The views of A and B while applying the protocol π(x₁, x₂) are, respectively

view_A^π=(x₁, r_A, m_A,1, . . . , m_A,t) and

view_B^π=(x₂, r_B, m_B,1, . . . , m_B,t),

where, for i=A, B, we denote by r_i the random input of party i, and by m_i,j the j-th message received by party i. The output received by party i at the end of the protocol π(x₁, x₂) is output_i^π(x₁, x₂).

By private computation, we mean that everything that party i learns from view_i^π can be determined from the input and the output of party i alone. Formally, we require that view_i^π can be simulated from the input and output of party i.

We say that the protocol π privately evaluates a deterministic function ƒ if there exist probabilistic processes SIM_A and SIM_B such that, for i=A, B,

SIM_i(x_i, ƒ_i(x₁, x₂))≅view_i^π(x₁, x₂),

where x_A=x₁ is the input of Alice and x_B=x₂ is the input of Bob.

Oblivious Transfer Protocol

The well known oblivious transfer (OT) protocol is a cryptographic process that enables Alice to select one item from Bob's database of items. Alice obtains this item without revealing to Bob which item was selected and without learning anything about the rest of the items in the database. The most common variant of OT is $\binom{2}{1}$ OT, where Bob has items (v₀, v₁) and Alice selects item b ∈ {0, 1}. After the OT, Alice obtains item v_b and nothing else, and Bob learns nothing about which item Alice selected.

The OT process can be constructed from any known enhanced trapdoor permutation.

The Millionaire's Problem

Alice has a number x, and Bob has a number y. Alice and Bob would like to compare the two numbers and determine who has the larger number, without revealing anything else about the numbers themselves. This is a generalization of the well known millionaire's problem where two parties desire to know who has the most money, without revealing the amount each party owns.

Protocol 1—Secure Millionaire Protocol

To compare the two numbers, Alice and Bob define two comparison values 𝒜, ℬ ∈ F, which are used to encode the result of the comparison, typically {𝒜, ℬ}={0, 1}.

Input

Alice has a number x ∈ {0, 1}^m, and Bob has a number y ∈ {0, 1}^m.

Output

Alice obtains a, Bob obtains b, such that

$a + b \pmod{F} = \begin{cases} \mathcal{A} & \text{if } x \geq y \\ \mathcal{B} & \text{if } x < y \end{cases}$

This is the secure millionaire's protocol.
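The input and output behavior of this protocol can be illustrated with the following sketch. It models only the ideal functionality, in which a trusted party sees both numbers; it is not a secure implementation. The modulus P, the function name millionaire_shares, and the parameters val_ge and val_lt (standing for the two comparison values) are assumptions made for illustration.

```python
import secrets

P = 2**61 - 1  # illustrative prime modulus (assumption)


def millionaire_shares(x, y, val_ge=0, val_lt=1, p=P):
    """Ideal functionality only: return additive shares of the comparison value.

    The shares satisfy (a + b) mod p == val_ge if x >= y, else val_lt.
    """
    result = val_ge if x >= y else val_lt
    a = secrets.randbelow(p)   # Alice's share is uniformly random in Z_p
    b = (result - a) % p       # Bob's share completes the sum
    return a, b
```

Each share on its own is uniformly distributed, so neither party learns the outcome of the comparison until the shares are combined.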

Protocol 2—Secure Dot Product Protocol

We use a secure dot product protocol to determine a distance between two samples in F^d, see below. The samples can be fingerprints, biological samples, images, and the like.

Alice has a vector X ∈ F^d, and Bob has a vector Y ∈ F^d. Alice and Bob privately determine shares of the dot (inner) product. The secure dot product protocol is well known.

Input

Alice has X ∈ F^d, and Bob has Y ∈ F^d.

Output

Alice obtains a, Bob obtains b, such that a+b (mod F)=X^TY, where T is the conventional notation for a transpose operator.
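The following sketch illustrates the output of the secure dot product protocol. Again, it models only the ideal functionality computed by a trusted party, and the modulus P and the name dot_product_shares are hypothetical.

```python
import secrets

P = 2**61 - 1  # illustrative prime modulus (assumption)


def dot_product_shares(X, Y, p=P):
    """Ideal functionality only: additive shares a, b with (a + b) mod p == X^T Y."""
    if len(X) != len(Y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(X, Y)) % p
    a = secrets.randbelow(p)   # Alice's random share
    b = (dot - a) % p          # Bob's share completes the sum
    return a, b
```

For example, with X=(1, 2, 3) and Y=(4, 5, 6), the two shares sum to 32 modulo P, while each share alone reveals nothing about the product.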

Secure Rank Protocol

Alice and Bob can use the secure dot product protocol to determine, respectively, first and second private shares of a distance between the private query sample and the labeled samples in the database of the classifier. Now, Alice and Bob need to determine the labeled samples in a small neighborhood of the private query sample in order to label the private query sample.

A size of the neighborhood can be defined either by a radius, as is done in Parzen window classification, or by order, as is done in k-nn classification. We can convert the order to the radius by determining the distance of the k-th nearest sample, and then applying the Parzen window classification using this distance.

Alice and Bob have respectively first and second private shares of the distance d₁ of the nearest neighbor to the private query sample, and now wish to determine first and second private shares of the distance of the second nearest neighbor. Actually, Alice and Bob have random shares of the squared distances. This does not change the ranking of the k-th item, because squaring preserves the order of non-negative distances.

The above can be expressed as determining a smallest distance, in a list of distances, subject to a constraint that this distance should be greater than a threshold d₁. This rules out selecting the nearest neighbor again because the distance d₁ to the nearest neighbor is not greater than the threshold. By repeatedly updating the threshold, and applying the secure rank protocol k−1 times, we determine the k-th nearest neighbor. To preserve privacy, the threshold parameters are given as random shares.

We use two intermediate protocols in the secure rank protocol. The first, called the private shared minimum protocol, is a modification of the millionaire's protocol for the case where both parties have private shares of the numbers to be compared. The output is a random share of the minimum distance. The second intermediate protocol adds a threshold parameter, and requires that the output is greater than this threshold.

Private Share Minimum Protocol

In this well known protocol, Alice and Bob have private shares of x and y, and want to obtain private shares of the minimum of x and y.

Formally, Alice has x_A, y_A and Bob has x_B, y_B such that x_A+x_B=x (mod F), and y_A+y_B=y (mod F). After the secure evaluation, Alice has z_A and Bob has z_B, such that z_A+z_B=z (mod F), where z=min(x, y). Because x is an item of F, after adding x_A+x_B (mod F), we obtain exactly x, and not a value that is congruent to x. The same is true for y. Thus, we can compare x, y as integers, and obtain the random shares z_A, z_B, which sum to the minimum of (x, y) modulo F, and yet are individually random in the prime field F.

We can construct a Boolean circuit for evaluating this function. The circuit sums x_A+x_B and y_A+y_B, and subtracts |F| if the result is |F| or larger. Then, the circuit compares x and y, and outputs random shares of the minimum. We can use Yao's garbled circuit for this private computation.

To extend the private share minimum protocol to more than two numbers, one can either iterate this protocol over the pairs of numbers or construct a circuit that directly compares more than two numbers.
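The function evaluated by such a circuit can be sketched as follows. The sketch runs in the clear, so it illustrates only what the circuit computes and offers no privacy; the names share and shared_minimum and the modulus P are assumptions for illustration.

```python
import secrets

P = 2**61 - 1  # illustrative prime modulus (assumption)


def share(v, p=P):
    """Split v into two additive shares modulo p."""
    a = secrets.randbelow(p)
    return a, (v - a) % p


def shared_minimum(x_shares, y_shares, p=P):
    """Reconstruct x and y from their shares and return fresh shares of min(x, y).

    In the described protocol this logic would run inside a garbled circuit,
    so neither party would see x, y, or the minimum.
    """
    x = (x_shares[0] + x_shares[1]) % p   # add shares, wrapping at |F|
    y = (y_shares[0] + y_shares[1]) % p
    return share(min(x, y), p)
```

Iterating shared_minimum over a list of shared values, two at a time, yields shares of the minimum of the whole list.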

Threshold Private Shared Minimum Protocol

The well known private threshold shared minimum protocol differs from the private shared minimum protocol by adding another input, the threshold, such that on input x, y, t, the output is the minimum of x, y that is also at least t. If x, y<t, then the output is t. The inputs are given as respective first and second private shares of Alice and Bob, and the output is given as random shares of the minimum.

We can construct a circuit for this function. The random shares are added modulo F to obtain x, y, and t, and then x and y are compared to t. If x, y>t, then the output is just the minimum of x, y. If x<t<y, then the output is y. If x, y<t, then the output is t.

Secure k-Rank Protocol

The well known secure k-rank protocol applies the secure rank protocol k−1 times, updating the threshold after every iteration. Formally, Alice and Bob have respectively first private shares a₁, . . . , a_n and second private shares b₁, . . . , b_n of a list of squared distances d₁, . . . , d_n. All distances are unique positive numbers. Then, the distance of the nearest neighbor is determined by setting the threshold to zero.
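A minimal sketch of the iteration is given below. It computes in the clear what the circuits would compute on shares, and it assumes the reading in which each round returns the smallest distance strictly greater than the current threshold, so that one round with a zero threshold followed by k−1 further rounds yields the k-th smallest distance. All function names and the modulus P are hypothetical.

```python
import secrets

P = 2**61 - 1  # illustrative prime modulus (assumption)


def share(v, p=P):
    """Split v into two additive shares modulo p."""
    a = secrets.randbelow(p)
    return a, (v - a) % p


def reconstruct(s, p=P):
    """Add two shares modulo p."""
    return (s[0] + s[1]) % p


def threshold_minimum(dist_shares, t_shares, p=P):
    """Shares of the smallest distance strictly greater than t, or of t if none."""
    t = reconstruct(t_shares, p)
    candidates = [d for d in (reconstruct(s, p) for s in dist_shares) if d > t]
    return share(min(candidates) if candidates else t, p)


def k_rank(dist_shares, k, p=P):
    """Shares of the k-th smallest distance (distances unique and positive)."""
    t_shares = (0, 0)                      # threshold starts at zero
    for _ in range(k):                     # first round finds the nearest distance
        t_shares = threshold_minimum(dist_shares, t_shares, p)
    return t_shares
```

For the shared squared distances 9, 4, 25, 1, 16, the sketch returns shares of 1 for k=1 and shares of 9 for k=3.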

Protocol 3—Secure Parzen Window Protocol

A density of a Parzen-window can be estimated by

$p(X) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V} \rho\left(\frac{X - X_i}{2r}\right), \qquad (1)$

where the n labeled samples X₁, . . . , X_n are independently and identically distributed (i.i.d.), r is a radius of a volume V=(2r)^d, and ρ(u) is a window function defined as

$\rho(u) = \begin{cases} 1 & |u_j| \leq 0.5,\quad j = 1, \dots, d; \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$

That is, the Parzen window estimate p(X) measures a density of labeled samples in a hypercube centered around the private query sample X. Classifiers based on Parzen window estimation estimate the densities of each class and classify the private query sample by the label corresponding to a maximum posterior probability of the labeled samples.
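Equations (1) and (2) can be evaluated directly in the non-secure setting, as in the following sketch. The function names window and parzen_density are illustrative, and the samples are assumed to be given as tuples of numbers.

```python
def window(u):
    """Window function of equation (2): 1 inside the unit hypercube, else 0."""
    return 1 if all(abs(c) <= 0.5 for c in u) else 0


def parzen_density(X, samples, r):
    """Density estimate of equation (1) for query X with hypercube radius r."""
    d = len(X)
    V = (2 * r) ** d                       # volume of the hypercube
    n = len(samples)
    inside = sum(
        window([(x - xi) / (2 * r) for x, xi in zip(X, Xi)]) for Xi in samples
    )
    return inside / (n * V)
```

For the query X=(0, 0), the samples (0.5, 0.5), (2, 2), (−1, 0), (3, 0) and r=1, two samples fall inside the hypercube, so the estimate is 2/(4·4)=0.125.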

Bob has n private labeled samples in a d-dimensional space, {X_i, y_i}, i=1, . . . , n, where y_i ∈ {1, . . . , c} are the corresponding labels of the samples in the database, and c is a maximum number of classes. Alice has the private query sample X. Alice has the values c and n. Alice wants to privately label the private query sample X using the secure Parzen window protocol. Alice and Bob have respectively first and second private shares r_A and r_B of the radius r=r_A+r_B of a Parzen window.

Given the private labeled samples {X_i, y_i}, i=1, . . . , n, in the database, we determine the label y of the majority of labeled samples that are within the radius r from the private query sample X.

The secure Parzen window protocol proceeds as follows.

- 1. In an input step, Alice provides a private query sample X and a first private share r_A of the radius r=r_A+r_B of the Parzen window, and Bob provides a database of samples X₁, . . . , X_n with labels y₁, . . . , y_n ∈ {1, . . . , c} and a second private share r_B of the radius.
- 2. For i=1, . . . , n, Alice and Bob perform the following steps.
  - (a) Alice and Bob determine private shares a_i and b_i using the secure dot product protocol such that the sum of the private shares a_i+b_i is equal to −2X^T X_i, where T is the transpose operator.
  - (b) Alice adds X² to a_i, and Bob adds X_i² to b_i, to obtain private shares of the squared distance a_i+b_i=d_i=∥X−X_i∥²₂ between the private query sample X and the labeled sample X_i.
- 3. For i=1, . . . , c, Alice and Bob perform the following steps.
  - (a) Alice sets her share of the count for class i to a_i=0.
  - (b) Bob sets his share of the count for class i to b_i=0.
  - (c) For j=1, . . . , n, Alice and Bob perform the following steps.
    - i. Alice and Bob use Yao's garbled circuit to evaluate the following function, from which Alice learns p,

$p = \begin{cases} \pi(0) & d_j < r \\ \pi(1) & \text{otherwise} \end{cases}$

      where π(·) denotes a random permutation, and the permutation function π(·) is known to Bob.
    - ii. Bob generates a random number Δb.
    - iii. If y_j≠i, then Bob constructs a two-entry table in which all entries are Δa=−Δb.
    - iv. If y_j=i, then the entries are

$\Delta a = \begin{cases} -\Delta b & p = \pi(0) \\ -\Delta b + 1 & p = \pi(1) \end{cases} \qquad (3)$

    - v. Alice uses OT with p as her index to learn Δa.
    - vi. Alice updates a_i=a_i+Δa, and Bob updates b_i=b_i+Δb.
- 4. Alice and Bob apply the secure k-rank protocol with a₁, . . . , a_c and b₁, . . . , b_c as their private shares to determine arg max_i (a_i+b_i), which is the label of the majority of the labeled samples that are within the radius r of the private query sample X.

If only Bob knows the radius r of the window, then Alice sets r_A=0, and Bob sets r_B=r. The results are private shares for Alice and Bob, and one party sends a private share to the other party to obtain the result.

In step 4, Alice and Bob apply a variant of the secure k-rank protocol that determines the maximum, not the minimum, of a list of shared values.
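The flow of shares through the protocol can be traced with the following simulation. It is a sketch under several assumptions: the garbled circuit and the OT are replaced by plain computation, so nothing here is private; the permutation π is modeled as an XOR with a random bit chosen by Bob; the count for a class is incremented when the sample lies inside the window (d_j<r), which is an assumption about the intended orientation of the two table entries; and r is compared directly with the shared squared distances. The modulus P and all function names are hypothetical.

```python
import secrets

P = 2**61 - 1  # illustrative prime modulus (assumption)


def share(v, p=P):
    """Split v into two additive shares modulo p."""
    a = secrets.randbelow(p)
    return a, (v - a) % p


def parzen_label(X, samples, labels, c, r, p=P):
    """Simulate the distance, counting and arg max steps; inputs are integers."""
    # Shares of -2 X^T X_i, then each party adds its own squared norm.
    dist = []
    for Xi in samples:
        a, b = share((-2 * sum(x * xi for x, xi in zip(X, Xi))) % p, p)
        a = (a + sum(x * x for x in X)) % p        # Alice adds X^2
        b = (b + sum(xi * xi for xi in Xi)) % p    # Bob adds X_i^2
        dist.append((a, b))

    # Build additive shares of one counter per class.
    counts = []
    for i in range(1, c + 1):
        a_i, b_i = 0, 0
        for j, (da, db) in enumerate(dist):
            flip = secrets.randbelow(2)            # Bob's permutation: v XOR flip
            inside = (da + db) % p < r             # stands in for the garbled circuit
            p_bit = (0 if inside else 1) ^ flip    # the permuted bit Alice learns
            delta_b = secrets.randbelow(p)
            table = [(-delta_b) % p, (-delta_b) % p]
            if labels[j] == i:
                # Assumption: the entry reached when d_j < r carries the +1.
                table[0 ^ flip] = (1 - delta_b) % p
            delta_a = table[p_bit]                 # stands in for the OT
            a_i = (a_i + delta_a) % p
            b_i = (b_i + delta_b) % p
        counts.append((a_i, b_i))

    # Reconstruct the counters and return the label with the largest count.
    totals = [(a + b) % p for a, b in counts]
    return 1 + max(range(c), key=totals.__getitem__)
```

In the simulation, every Δa that Alice receives is masked by Bob's random Δb, so her running share a_i is uniformly distributed and reveals no count until the shares are combined in the final step.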

Obtaining the Private Query Sample

Bob desires to search for records in Alice's database. In this case, it is not enough to detect that a particular record exists in the database. Bob is interested in the information associated with the private query sample X. The secure Parzen window protocol only reveals the label of the private query sample X, not its value or associated data record. This problem is solved as follows.

For simplicity of this description, we assume that the classification problem is a binary classification. At the beginning of the last step of the above secure Parzen window protocol, Alice has first private shares a₁, b₁, and Bob has second private shares a₂, b₂. Both parties agree that if a predicate a₁+b₁<a₂+b₂ is true, then the private query sample X is a possible candidate record, and Alice should provide the record to Bob; otherwise, Bob should learn nothing. To solve this, Alice and Bob apply the private shared minimum protocol. At the end of the private shared minimum protocol, Bob has a random permutation p of the predicate. Note that we reverse the roles of Alice and Bob in this case. Alice can now construct a two-entry OT table that Bob can index with p. One entry in the table contains the private query sample, and the other entry contains an empty string.

Protocol 4—Secure k-nn Protocol

A secure k-nn protocol maximizes

P(y_m|X)=max_i P(y_i|X), (4)

where y₁, . . . , y_c are the possible states. The k-nn protocol classifies the query sample X by measuring the density of each class and taking the maximum density. This is very similar to the secure Parzen window protocol, only now the size of the neighborhood is data dependent and not determined ahead of time. Therefore, we determine the distance of the k-th nearest neighbor as a preprocessing step and then apply the secure Parzen window protocol. We determine the distance of the k-th nearest neighbor by applying the threshold shared minimum protocol k−1 times. Alice and Bob start with t_A=t_B=0, and use the output of each iteration as input for the next invocation of the threshold shared minimum protocol. After k−1 rounds, Alice and Bob have private shares of the correct distance, and they can invoke the secure Parzen window protocol to complete the secure k-nn protocol.

Given the labeled samples {X_i, y_i}, i=1, . . . , n, we determine the label of the majority of the labeled samples that are the k nearest neighbors of the private query sample X.

Input

Alice has the private query sample X. Bob has a database of labeled samples X₁, . . . , X_n with labels y₁, . . . , y_n ∈ {1, . . . , c}, and a parameter k.

Output

Alice obtains the label of the majority of the samples within the k nearest neighbors of the private query sample X.

- 1. Alice and Bob compute private shares of the squared distance from the query point to the points in the database, see also the first step in the secure Parzen window protocol.
- 2. Alice and Bob apply the secure k-rank protocol on the list of shared distances to obtain private shares r_A and r_B, respectively, of the distance of the k-th nearest neighbor.
- 3. Alice and Bob apply the secure Parzen window protocol with r_A and r_B.
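For reference, the result that the three steps are meant to produce can be computed in the clear as sketched below. This is a non-secure reference computation, not the protocol itself. The name knn_label is hypothetical, and the inclusive comparison d_j ≤ r is an assumption made so that the k-th neighbor itself is counted.

```python
from collections import Counter


def knn_label(X, samples, labels, k):
    """Non-secure reference: majority label among the k nearest samples."""
    # Squared distances from the query to every labeled sample.
    d = [sum((x - xi) ** 2 for x, xi in zip(X, Xi)) for Xi in samples]
    # Distance of the k-th nearest neighbor, used as the window radius.
    r = sorted(d)[k - 1]
    # Parzen-style vote over all samples inside the radius.
    votes = Counter(y for dj, y in zip(d, labels) if dj <= r)
    # Ties are broken in favor of the smallest label.
    return max(sorted(votes), key=votes.__getitem__)
```

The radius is derived from the data, which is the only difference from a Parzen window classification with a fixed radius.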

Nearest Match Protocol

In some cases, Alice might be interested in querying a database for the labeled sample nearest to her private query sample. For example, if Alice has a fingerprint, she might be interested in querying a database to obtain the nearest match to her private query sample. This can be achieved by using our k-nn protocol with k=1. With a slight modification of the secure k-nn protocol, Alice can obtain all the attributes of the nearest match, not just its label.

Approximate Nearest Neighbors Protocols

Nearest neighbor methods scale linearly with the number of samples in the database. As a result, approximations are often used to speed up the computations. The approximations can be used with any of the protocols described above. We first describe the non-secure approximate nearest neighbor protocol and then describe how to make the approximate protocol secure.

Non-Secure Approximate Nearest Neighbors Protocol

Bob can partition his labeled samples into l=√n clusters, each having l samples at most. There are several ways for Bob to do this. One way is to apply a k-means process to the data and determine l centroids. Then, Bob attaches at most l samples to each centroid. Alternatively, Bob can use a k-d tree to recursively partition the space until each node in the k-d tree contains l samples or less. Then, each node is represented by the centroid of all its samples. Partitioning the space to accelerate retrieval is mainly useful in low-dimensional spaces. In case the data are in high-dimensional spaces, geometric hashing methods are more appropriate.
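The k-d tree alternative can be sketched as a recursive median split, as below. This is one possible construction under simplifying assumptions: the split axis cycles through the dimensions, the leaf size limit defaults to the ceiling of √n, and the number of resulting leaves is not forced to equal l. The names partition and centroid are hypothetical.

```python
import math


def partition(samples, limit=None, depth=0):
    """Split at the median, cycling axes, until each leaf has at most `limit` samples."""
    if limit is None:
        # Ceiling of sqrt(n), computed with integer arithmetic.
        limit = math.isqrt(len(samples) - 1) + 1 if samples else 1
    if len(samples) <= limit:
        return [list(samples)]
    axis = depth % len(samples[0])
    ordered = sorted(samples, key=lambda s: s[axis])
    mid = len(ordered) // 2
    return (partition(ordered[:mid], limit, depth + 1)
            + partition(ordered[mid:], limit, depth + 1))


def centroid(cluster):
    """Coordinate-wise mean of the samples in one leaf."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))
```

For 16 samples, the default limit is 4 and the recursion produces four leaves of four samples each, each of which is then represented by its centroid.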

Secure Approximate Nearest Neighbors Protocol

The secure approximate nearest neighbor protocol includes a number of stages. First, we determine the nearest centroid to the private query sample. Then, we determine private shares of the distances of the labeled samples associated with this centroid to the private query sample. Finally, we apply the secure-APX-nn protocol.

The key observation is that Bob generates his private shares ahead of time, and for each centroid Bob generates Alice's private shares. When Alice and Bob agree on the nearest centroid, Alice obtains her private shares from Bob. Then, Alice and Bob apply the secure k-nn protocol.

Formally, Bob has n labeled samples X₁, . . . , X_n. Denote l=√n, and let C₁, . . . , C_l be the centroids determined by Bob. Let X_ij represent the j-th sample associated with centroid i, and let B_i be the private share of Bob for centroid i. Then, the private share of Alice for the labeled sample X_ij is given by A_ij=X_ij−B_i.

Protocol 5—Secure-APX-nn Protocol

Given samples X₁, . . . , X_n, we determine approximately the k nearest neighbors of the private query sample X.

Input

Alice has the private query sample X, and Bob has a database of labeled samples X₁, . . . , X_n, and a parameter k.

Output

Alice and Bob obtain private shares of approximate k nearest neighbors of the private query sample X.

- 1. Bob determines private shares B₁, . . . , B_l.
- 2. For i=1, . . . , l, Bob performs the following step.
  - (a) For j=1, . . . , l, Bob determines A_ij=X_ij−B_i.
- 3. Bob generates l attributes T(C₁), . . . , T(C_l), where T(C_i)={A_ij}, j=1, . . . , l.
- 4. Alice and Bob apply the secure k-nn protocol with k=1 to determine the nearest centroid C_i to the private query sample X. Instead of obtaining private shares of the nearest centroid, Alice and Bob obtain private shares of T(C_i).
- 5. Alice and Bob now have private shares of the l approximately nearest samples to the private query sample X and can now apply any of the protocols described above.
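Bob's precomputation in the first three steps can be sketched as follows, assuming that each B_i is a random vector in F^d and that the subtraction A_ij=X_ij−B_i is taken coordinate-wise modulo the field size. The modulus P and the name precompute_cluster_shares are hypothetical.

```python
import secrets

P = 2**61 - 1  # illustrative prime modulus (assumption)


def precompute_cluster_shares(clusters, p=P):
    """For each cluster i, draw Bob's mask B_i and derive the attribute T(C_i).

    Returns (B, T), where T[i][j] is Alice's share A_ij of sample X_ij.
    """
    B, T = [], []
    for cluster in clusters:
        dim = len(cluster[0])
        B_i = [secrets.randbelow(p) for _ in range(dim)]   # Bob's private share
        T_i = [[(x - b) % p for x, b in zip(X_ij, B_i)] for X_ij in cluster]
        B.append(B_i)
        T.append(T_i)
    return B, T
```

Because each A_ij is masked by the random B_i, handing T(C_i) to Alice gives her valid shares of the samples in the selected cluster without revealing the samples themselves.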

EFFECT OF THE INVENTION

The embodiments of the invention provide methods for privacy preserving k-nn classification. The methods enable one party to have its data classified by another party, securely. This is done by developing a secure nearest neighbor search protocol that is then used for several secure density estimation methods, including Parzen window classification as well as exact and approximate k-nn classification.

It is to be understood that various other adaptations and modifications may be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.

Claims

1. A computer implemented method for classifying securely a private query sample using exact k-nn classification, comprising the steps of: applying a secure dot product protocol to the private query sample and a plurality of private labeled samples to determine securely distances between the private query sample and the plurality of private labeled samples; applying a secure k-rank protocol to the distances to determine a nearest distance of a kth nearest labeled sample having a particular label; and applying a secure Parzen window protocol using the nearest distance to label the private query sample according to the particular label.
2. The method of claim 1, in which the samples are fingerprints.
3. The method of claim 1, in which the samples are biological samples.
4. The method of claim 1, in which the samples are surveillance images.
5. The method of claim 1, in which the samples are biometric data.
6. The method of claim 1, in which there are n private query samples, and further comprising: partitioning the plurality of n private labeled samples into √n clusters, each cluster including √n private labeled samples; obtaining private shares of the distances between the private query sample and the private labeled samples in one of the clusters; and performing the applying steps of claim 1.
7. The method of claim 1, in which the secure k-rank protocol applies a secure rank protocol k-1 times.
8. The method of claim 1, in which the secure Parzen window protocol measures a density of the private labeled samples in a hypercube centered around the private query sample.
9. The method of claim 1, further comprising: obtaining securely a record associated with the private query sample.
10. A computer implemented method for classifying securely a private query sample, comprising the steps of: (1) providing a private query sample X and a first private share r_A of a radius r=r_A+r_B of a Parzen window by a first party Alice, and providing n labeled samples X₁, . . . , X_n, with labels y₁, . . . , y_n ∈ {1, . . . , c} for a maximum of c classes, and a second private share r_B of the radius of the Parzen window by a second party Bob; (2) performing for i=1, . . . , n the steps of: (a) determining private shares of squared distances d_i between the private query sample X and the labeled samples X_i using a secure dot product protocol such that a_i+b_i=d_i=−2X^T X_i; and (b) adding X² to a_i by Alice and adding X_i² to b_i by Bob to obtain a_i+b_i=d_i=∥X−X_i∥²₂; (3) performing for i=1, . . . , c the steps of: (a) setting a_i=0 by Alice; and (b) setting b_i=0 by Bob; and (c) performing for j=1, . . . , n the steps of: (i) applying a millionaire's protocol to obtain p for Alice, in which p is a random permutation of a first comparison value of the millionaire's protocol if d_j<r, and otherwise p is a random permutation of a second comparison value, and to obtain the random permutation by Bob; (ii) generating a random number Δb by Bob; (iii) constructing a two entry table by Bob having entries Δa=−Δb if y_i≠i, and otherwise if y_i=i the entries are
