Machine learning (ML) methods, and specifically unsupervised learning methods such as k-means clustering and hierarchical clustering, are highly useful in applications such as identifying patterns in transactions, market research, social networks, search, categorizing and typifying observations, etc.
In some cases, disparate entities with separate data sets may wish to cluster the data in order to analyze information, while also keeping the data private. A limited number of methods exist to conduct privacy-preserving learning, but such methods may be constrained by issues such as efficiency, scaling to large datasets, and data leaking.
Embodiments of the disclosure address these and other problems, individually and collectively.
Described herein are systems and techniques for privacy-preserving unsupervised learning. The disclosed system and methods can enable separate computers, operated by separate entities, to perform unsupervised learning jointly based on a pool of their respective data, while preserving privacy. The system improves efficiency and scalability to large datasets while preserving privacy and avoids leaking a cluster identification.
In an embodiment, the system can jointly compute a secure distance via privacy-preserving multiplication of respective data values $x$ and $y$ from the computers based on a 1-out-of-N oblivious transfer (OT). In various embodiments, N may be 2, 4, or some other number of shares. A first computer can express its data value $x$ in base N, with digits $x_i$. A second computer can form an $L \times N$ matrix, where $L$ is the number of base-N digits of $x$, comprising N random numbers $m_{i,0}$ and the remaining elements $m_{i,j} = (y\,j\,N^i - m_{i,0}) \bmod 2^{\ell}$, where $\ell$ is the bit length of the data values. The first computer can receive an output vector from the OT, having components $m_i = m_{i,x_i} = (y\,x_i\,N^i - m_{i,0})$.
In an embodiment, a first computer and a second computer can jointly compute a secure distance, by at least performing privacy-preserving multiplication of a first data value of the first computer and a second data value of the second computer based on a 1-out-of-N oblivious transfer (OT) corresponding to a number N of shares. The privacy-preserving multiplication may further comprise expressing, by the first computer, the first data value as a first vector having a number L of components, wherein a respective component, having an index i, comprises a respective decomposition coefficient of the first data value in a base equal to N. The privacy-preserving multiplication may further comprise forming, by the second computer, a respective N-component vector having the index i of the respective decomposition coefficient and a second index. The first computer can receive an output vector of the 1-out-of-N OT, wherein a component, having an index i, of the output vector comprises a component of the respective N-component vector, the component having the index i and having the second index corresponding to the respective decomposition coefficient of the first data value in the base equal to N. The first and/or second computer may then privately assign data to a respective cluster of a plurality of clusters based on the secure distance.
In an embodiment, the first component of the respective N-component vector, having the second index equal to 0, can comprise a respective pseudo-random number. A respective remaining component, having the second index equal to j, can comprise the second data value multiplied by j and by N raised to a power of i, minus the first component of the respective N-component vector.
In an embodiment, the second computer can obtain a second output vector of the 1-out-of-N OT. A component, having an index i, of the second output vector may comprise a component of the respective N-component vector, the component having the index i and having the second index 0.
In an embodiment, privately assigning the data to the respective cluster of the plurality of clusters further comprises identifying, via a garbled circuit, a best match cluster of the plurality of clusters for a respective element of a plurality of elements of the data. The best match cluster may have a centroid with a minimum distance to the respective element. Privately assigning the data to the respective cluster of the plurality of clusters further comprises representing the best match cluster as a binary vector comprising a cluster flag for the respective element.
In an embodiment, performing the privacy-preserving unsupervised learning further comprises privately updating a centroid of a cluster by at least multiplying, for a respective element of a plurality of elements of the data and via a second OT and a third OT, a combined first share and second share of a cluster flag for the cluster and the respective element by a combined first share and second share of a position vector for the respective element. The first share of the cluster flag and the first share of the position vector may belong to the first computer. The second share of the cluster flag and the second share of the position vector may belong to the second computer. Privately updating the centroid of the cluster may further comprise summing a product of the multiplying over the plurality of elements. Privately updating the centroid of the cluster may further comprise dividing the summed product by a sum over the plurality of elements of the combined first share and second share of the cluster flag. Privately updating the centroid of the cluster may further comprise updating the centroid based on a result of the dividing.
In an embodiment, the first share and second share of the cluster flag are combined by exclusive OR.
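The centroid update described in the preceding embodiments can be summarized in a single expression. The following is only a sketch: the symbols $f_{i,k}$ (cluster flag of element $i$ for cluster $k$), $p_i$ (position vector of element $i$), $c_k$ (centroid of cluster $k$) and $n$ (number of elements) are introduced here for illustration, and additive combination of the position-vector shares is an assumption, since the text above specifies exclusive OR only for the cluster flag.

```latex
c_k \;\leftarrow\; \frac{\sum_{i=1}^{n} f_{i,k}\, p_i}{\sum_{i=1}^{n} f_{i,k}},
\qquad
f_{i,k} = f_{i,k}^{A} \oplus f_{i,k}^{B},
\qquad
p_i = p_i^{A} + p_i^{B}
```

Here the superscripts $A$ and $B$ denote the shares held by the first computer and the second computer, respectively.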
In an embodiment, the privacy-preserving unsupervised learning comprises k-means clustering. The k-means clustering may further comprise selecting a plurality of seed clusters. The k-means clustering may further comprise jointly computing, based on the secure distance, a distance between a respective position vector of a respective element of the data and a respective centroid of a respective seed cluster. The respective position vector may be shared among the first computer and the second computer. The k-means clustering may further comprise identifying a first cluster having a minimum distance to the respective position vector. The k-means clustering may further comprise assigning the respective element to the first cluster. The k-means clustering may further comprise updating a first centroid of the first cluster based on an average of position vectors of elements of the data assigned to the first cluster, including the respective position vector.
In an embodiment, the privacy-preserving unsupervised learning comprises hierarchical clustering.
In an embodiment, N may equal 2 and the 1-out-of-N OT may comprise 1-out-of-2 OT. Alternatively, in an embodiment, N may equal 4 and the 1-out-of-N OT may comprise 1-out-of-4 OT.
In an embodiment, the secure distance comprises a secure Euclidean distance.
In an embodiment, the first computer initially has the first data value and receives a first output share value. The second computer may initially have the second data value and may receive a second output share value. The first output share value and second output share value may sum to a product of the first data value and the second data value.
In an embodiment, the second data value may adaptively change. A later iteration may reuse the 1-out-of-N OT from a first iteration.
These and other embodiments of the disclosure are described in further detail below. For example, other embodiments are directed to systems, computing systems, devices, and computer readable media associated with methods described herein.
A better understanding of the nature and advantages of embodiments of the present disclosure may be gained with reference to the following detailed description and the accompanying drawings.
Prior to discussing the details of some embodiments of the present disclosure, description of some terms may be helpful in understanding the various embodiments.
The term “server computer” may include a powerful computer or cluster of computers. For example, the server computer can be a large mainframe, a minicomputer cluster, or a group of computers functioning as a unit. In one example, the server computer may be a database server coupled to a web server. The server computer may be coupled to a database and may include any hardware, software, other logic, or combination of the preceding for servicing the requests from one or more other computers. The term “computer system” may generally refer to a system including one or more server computers, which may be coupled to one or more databases.
A “machine learning model” can refer to a set of software routines and parameters that can predict an output(s) of a real-world process (e.g., a diagnosis or treatment of a patient, identification of an attacker of a computer network, authentication of a computer, a suitable recommendation based on a user search query, etc.) based on a set of input features. A structure of the software routines (e.g., number of subroutines and relation between them) and/or the values of the parameters can be determined in a training process, which can use actual results of the real-world process that is being modeled.
The term “training computer” can refer to any computer that is used in training the machine learning model. As examples, a training computer can be one of a set of client computers from which the input data is obtained, or a server computer that is separate from the client computers.
The term “secret sharing” can refer to any one of various techniques that can be used to store a data item on a set of training computers such that each training computer cannot determine the value of the data item on its own. As examples, the secret sharing can involve splitting a data item up into shares that require a sufficient number (e.g., all) of training computers to reconstruct and/or encryption mechanisms where decryption requires collusion among the training computers.
Systems and techniques for privacy-preserving unsupervised learning are provided. The disclosed systems and techniques can enable separate computers, operated by separate entities, to perform unsupervised learning jointly based on a pool of their respective data, while preserving privacy. The system improves efficiency and scalability to large datasets while preserving privacy and avoids leaking a cluster identification. That is, the system can avoid revealing any information apart from the final output, in this case the clustering model.
In particular, each respective computer can maintain its own data set, while jointly performing unsupervised learning without revealing the contents of its data set to the other computers. The privacy-preserving joint learning may be based on a secure distance between data points of the pooled data set. The distance, in turn, may be computed based on a privacy-preserving joint multiplication.
The disclosed system and methods enable a pair of computing devices to jointly compute a secure distance via privacy-preserving multiplication of respective data values x and y from the computers based on a 1-out-of-N oblivious transfer (OT). In various embodiments, N may be 2, 4, or some other number of shares. In various embodiments, the system and methods disclosed herein can improve the computational cost of OT-based multiplication by a factor of 1.2 to 1.7. In an adaptive amortized setting, the disclosed system can realize further performance improvements, for example a 200-fold or 500-fold improvement. Moreover, the disclosed protocol may be more efficient than generic secure protocols (e.g., MPC and FHE), and can scale to very large datasets.
I. Unsupervised Learning
Embodiments of the disclosed systems and methods can use privacy-preserving joint multiplications to compute a secure distance (e.g., a Euclidean distance) for use with unsupervised learning. By contrast with supervised learning, unsupervised learning may involve machine learning (ML) when no labeled training dataset is available.
Clustering is a type of unsupervised learning which involves grouping a set of objects into classes of similar objects. In particular, by clustering data elements into groups or clusters with similar data elements, an unsupervised learning system may discover patterns or similarities in the dataset, without needing to be trained on what types of patterns may be present. In some embodiments, the system may perform clustering methods of unsupervised learning, such as k-means clustering, or hierarchical clustering.
A. K-Means Clustering
In various embodiments, the data elements may include any type of information of relevance for an unsupervised learning process, such as account records or transaction histories, etc. In this example, the data elements may be represented as points or vectors in a multi-dimensional space. In particular, the coordinates or locations of a respective data point in the multi-dimensional space may represent the respective point's data values, for example, amounts or scored characteristics associated with historical transactions, etc. The data points' coordinates may quantify characteristics of the data. For example, in some embodiments, the closer the coordinate values of respective data points (or particular components of the points' coordinates), the more similar the data points may be.
In particular, the unsupervised learning process (e.g., k-means clustering 100) can operate based on a distance, such as a Euclidean distance, between the data points and their respective clusters. This distance may be a measure of similarity, i.e. the smaller the distance between two data points, the more similar the data points may be. In some embodiments, the system may use other measures of distance and/or other measures of similarity, and is not limited by the present disclosure.
The unsupervised learning process, such as k-means clustering 100, can optimize, or locally optimize, the clusters such that the data points best match their respective clusters. In some embodiments, the number k of clusters may be set, or user-specified, in advance. Alternatively, k may be optimized by the system, and is not limited by the present disclosure. In some embodiments, the system can optimize the number of clusters, the centroid locations of the clusters, and/or the composition of data points within each cluster, in order to achieve optimal clustering and learning. The computational complexity of k-means clustering may typically scale like O(nmt), where n is the number of data points, m is the number of clusters, and t is the number of iterations.
In some cases, it may be desirable for multiple parties to cluster a pooled dataset held by the parties. Moreover, the parties may wish to do so in a way that preserves privacy of their respective datasets, i.e. without revealing their datasets to each other. In particular, because the parties or entities are separate from each other (for example, separate companies or organizations), they may wish to retain control over their own respective datasets without providing the data to each other.
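For orientation, the non-private baseline that the disclosed protocol emulates can be sketched as follows. This is a minimal plaintext sketch of k-means with a squared Euclidean distance, not the privacy-preserving protocol itself; the function names and the choice to pass the seed centroids explicitly are assumptions made for illustration.

```python
def squared_distance(p, c):
    """Squared Euclidean distance between a point and a centroid."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def kmeans(points, initial_centroids, iterations=10):
    """Plaintext k-means: alternate assignment and centroid update."""
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        for idx, p in enumerate(points):
            assignment[idx] = min(
                range(k), key=lambda j: squared_distance(p, centroids[j])
            )
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, [(0, 0), (10, 10)])
```

Each iteration touches every point and every centroid once, which is where the O(nmt) scaling comes from.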
In some embodiments, instead of having a subset of the data points, each party might have a portion of the data of each respective data point. For example, the parties may have separate shares (such as information associated with separate dimensions in the multi-dimensional space) of the respective data points.
II. Multiple-Server System for Privacy-Preserving Unsupervised Learning
In embodiments of the disclosed system and methods, datasets (such as transaction records) may be broken into shares possessed by two separate entities, such as two companies or organizations. The disclosed system and methods can enable separate computers, operated by separate entities, to perform unsupervised learning jointly based on a pool of their respective dataset shares, while preserving privacy. In particular, the disclosed protocol is more efficient than generic secure protocols, such as secure multi-party computation (MPC) and fully homomorphic encryption (FHE). It can also scale to very large datasets.
Accordingly, Alice may operate a computer, such as a server 206, and Bob may operate a computer, such as server 208. In an embodiment, servers 206 and 208 can communicate directly with each other, in order to engage in a joint two-party computation 210.
Two-party computation 210 may include privacy-preserving joint multiplication, joint secure distance, and/or privacy-preserving joint unsupervised learning computations, according to the methods disclosed herein below. Two-party computation 210 can output model 212, such as optimized, or locally optimized, cluster assignments.
In particular, servers 206 and 208 may communicate directly, because Alice and Bob may not wish to assemble a pooled dataset on any one computer, such as an intermediary computer. As a result, Alice and Bob may instead perform two-party computation 210 in a way such that the output model 212 is determined without the pooled dataset ever being assembled, according to the methods disclosed herein. Moreover, the disclosed methods may have advantages, such as efficiency and scaling advantages, compared with other methods.
For clarity, a certain number of components are shown in
III. Techniques for Privacy-Preserving Computing
Various embodiments can use various secure computation techniques. Such techniques can be used to perform a function on data that is secret-shared across the servers, without exposing the reconstructed data to a server. For example, in various embodiments, the system may use oblivious transfer (OT), garbled circuits, and secret sharing, which are briefly described herein. In some embodiments, the system may use variations of these techniques, as well as other techniques, and is not limited by the present disclosure. How such techniques are combined and used in the overall unsupervised learning process will be described in later sections.
A. Oblivious Transfer
Oblivious transfer (OT) is a fundamental cryptographic primitive that is commonly used as a building block in secure multiparty computation (MPC). In an oblivious transfer protocol, a sender S has two inputs $x_0$ and $x_1$, and a receiver R has a selection bit $b$ and wants to obtain $x_b$ without learning anything else or revealing $b$ to S. The ideal functionality realized by such a protocol can be defined as: on input (SELECT; sid; $b$) from R and (SEND; sid; $x_0$; $x_1$) from S, return (RECV; sid; $x_b$) to R. We use the notation $(\perp; x_b) \leftarrow \mathrm{OT}(x_0, x_1; b)$ to denote a protocol realizing this functionality.
At 306, sender S attempts to deblind and decrypt v by applying m0 and m1 and its key to v to derive two possible values for k, one of which will equal the random value generated by receiver R. Sender S does not know (and hopefully cannot determine) which of m0 and m1 receiver R chose. At 307, x0 and x1 are blinded with the two possible values of k. At 308, the blinded x0 and x1 are sent to receiver R, and each can be identified as corresponding to 0 or 1. At 309, receiver R deblinds the blinded value corresponding to the selected b using k.
Accordingly, oblivious transfer can function by sender S generating two keys, m0 and m1. Receiver R can then encrypt a blinding factor using one of the keys. Sender S then decrypts the blinding factor using both of the keys, where one is the correct blinding factor, which is used to blind both the secret inputs. Receiver R can then deblind the correct input.
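The message flow summarized above can be illustrated with a toy, RSA-based 1-out-of-2 OT. This is only a sketch under stated assumptions: the use of RSA as the sender's key, the small Mersenne-prime modulus, and all function names are chosen here for readability and are not taken from the disclosure; the parameters are not secure.

```python
import secrets

# Toy RSA key held by the sender (assumption: small, well-known primes).
P = 2**31 - 1
Q = 2**61 - 1
MODULUS = P * Q
PUBLIC_EXP = 65537
PRIVATE_EXP = pow(PUBLIC_EXP, -1, (P - 1) * (Q - 1))  # Python 3.8+


def sender_setup():
    """Sender picks two random values m0, m1 and publishes them."""
    return secrets.randbelow(MODULUS), secrets.randbelow(MODULUS)


def receiver_choose(b, m0, m1):
    """Receiver encrypts a random blinding factor k and hides it behind m_b."""
    k = secrets.randbelow(MODULUS)
    v = ((m1 if b else m0) + pow(k, PUBLIC_EXP, MODULUS)) % MODULUS
    return k, v


def sender_respond(v, m0, m1, x0, x1):
    """Sender derives two candidate values of k and blinds both inputs."""
    k0 = pow((v - m0) % MODULUS, PRIVATE_EXP, MODULUS)
    k1 = pow((v - m1) % MODULUS, PRIVATE_EXP, MODULUS)
    return (x0 + k0) % MODULUS, (x1 + k1) % MODULUS


def receiver_recover(b, k, c0, c1):
    """Receiver deblinds only the value matching the selection bit."""
    return ((c1 if b else c0) - k) % MODULUS


def oblivious_transfer(x0, x1, b):
    """Run all four steps in one process, for illustration only."""
    m0, m1 = sender_setup()
    k, v = receiver_choose(b, m0, m1)
    c0, c1 = sender_respond(v, m0, m1, x0, x1)
    return receiver_recover(b, k, c0, c1)
```

Only the candidate derived from the chosen value equals the receiver's k, so the other blinded input stays hidden from the receiver, while v alone does not tell the sender which value was chosen.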
Embodiments can use OTs both as part of an offline protocol for generating multiplication triplets and in an online phase for logistic regression and neural network training in order to securely compute the activation functions. One-round OT can be implemented, but it requires public-key operations by both parties. OT extension minimizes this cost by allowing the sender and receiver to perform m OTs at the cost of λ base OTs (with public-key operations) and O(m) fast symmetric-key ones, where λ is the security parameter. Some implementations can take advantage of OT extension for better efficiency. In one embodiment, a special flavor of OT extension called correlated OT extension is used. In this variant, which we denote as COT, the sender's two inputs to each OT are not independent. Instead, the two inputs to each OT instance are a random value $s_0$ and a value $s_1 = f(s_0)$ for a correlation function $f$ of the sender's choice. The communication for a COT of an $\ell$-bit message, denoted by $\mathrm{COT}_\ell$, is $\lambda + \ell$ bits, and the computation is hashing, e.g., SHA256, SHA3, or other cryptographic hashing.
B. Garbled Circuit 2PC
A garbling scheme consists of a garbling algorithm that takes a random seed σ and a function ƒ and generates a garbled circuit F and a decoding table dec; an encoding algorithm that takes input x and the seed σ and generates the garbled input $\hat{x}$; an evaluation algorithm that takes $\hat{x}$ and F as input and returns the garbled output $\hat{z}$; and finally, a decoding algorithm that takes the decoding table dec and $\hat{z}$, and returns ƒ(x). Some embodiments can have the garbling scheme satisfy standard security properties.
The garbled circuit can be viewed as a Boolean circuit, with inputs in binary of fixed length. A Boolean circuit is a collection of gates connected with three different types of wires: circuit-input wires, circuit-output wires and intermediate wires. Each gate receives two input wires (e.g., one for each party) and it has a single output wire which might be fan-out (i.e. be passed to multiple gates at the next level). Evaluation of the circuit can be done by evaluating each gate in turn. A gate can be represented as a truth table that assigns a unique output bit for each pair of input bits.
The general idea of garbled circuits is that the original circuit of a function is transformed so that the wires only contain random bitstrings. For example, every bit in a truth table is replaced by one of two random numbers (encodings), with the mapping known by the sender. Each gate is encoded so that its output bitstring can be computed from the inputs, and only the random bitstrings of output gates can be mapped back to actual results. The evaluation computes the function, but does not leak information about the values on separate wires. Accordingly, the two parties (sender and receiver) can learn the output of the circuit based on their own input and nothing else, i.e., not learn the other party's input to the circuit. The main drawbacks of the garbled circuit technique are inefficient evaluation and the inability to reuse the circuit.
In some implementations, the sender prepares the garbled circuit by determining a truth table for each gate using the random numbers that replaced the two bits on the input wires. The output values are then encrypted (e.g., using double-key symmetric encryption) with the random numbers from the truth table. Thus, one can decrypt a row of the gate only if one knows the two correct random numbers for the given output value. The four values for a given table can be randomly permuted (garbled), so there is no relation between a row's position and its output value. The sender can send the garbled tables (sets of encrypted values and the relation between them, i.e., outputs from one to be inputs to another) to the receiver, as well as the sender's input of random values corresponding to the input bits. The receiver can obtain the random numbers corresponding to its own input bits from the sender via oblivious transfer, and thus the sender does not know the receiver's input. The receiver can then compute the output, or potentially get an encoding that needs to be sent back to the sender for decoding. The encoding can be sent to the sender if the sender is to learn the output. This may not be done for intermediate values of the computation, and may only be done for a final output, which the parties are supposed to learn anyway. If a party is not supposed to learn the output, the encoding does not need to be sent. In some embodiments, the garbled circuits work on intermediate values (e.g., comparison function), so they may not be decoded.
Given such a garbling scheme, it is possible to design a secure two-party computation protocol as follows: Alice generates a random seed σ and runs the garbling algorithm for function ƒ to obtain a garbled circuit GC. She also encodes her input, using σ and x as inputs to the encoding algorithm, to obtain $\hat{x}$. Alice sends GC and $\hat{x}$ to Bob. Bob obtains his encoded (garbled) input $\hat{y}$ using an oblivious transfer for each bit of y. While an OT-based encoding is not a required property of a garbling scheme, all existing constructions permit such interactive encodings. Bob then runs the evaluation algorithm on GC, $\hat{x}$, $\hat{y}$ to obtain the garbled output $\hat{z}$. We can have Alice, Bob, or both learn an output by communicating the decoding table accordingly. The above protocol securely realizes the ideal functionality $F_f$ that simply takes the parties' inputs and computes ƒ on them. In this disclosure, we denote this garbled circuit 2PC by $(z_a, z_b) \leftarrow \mathrm{GarbledCircuit}(x; y, f)$.
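The gate-level procedure described above can be sketched for a single AND gate. This is an illustrative toy, not the disclosed construction: using SHA-256 of the two input labels as the double-key encryption pad, and an all-zero tag to recognize the correctly decrypted row, are assumptions made here for brevity, as are all names.

```python
import hashlib
import random
import secrets

LABEL_BYTES = 16
TAG = bytes(LABEL_BYTES)  # all-zero tag marks a correctly decrypted row


def _pad(label_a, label_b):
    """Derive a 32-byte one-time pad from the two input-wire labels."""
    return hashlib.sha256(label_a + label_b).digest()


def _xor(left, right):
    return bytes(x ^ y for x, y in zip(left, right))


def garble_and_gate():
    """Sender: pick two random labels per wire and encrypt the truth table."""
    labels = {
        wire: (secrets.token_bytes(LABEL_BYTES), secrets.token_bytes(LABEL_BYTES))
        for wire in ("a", "b", "out")
    }
    table = []
    for bit_a in (0, 1):
        for bit_b in (0, 1):
            out_label = labels["out"][bit_a & bit_b]
            pad = _pad(labels["a"][bit_a], labels["b"][bit_b])
            table.append(_xor(pad, out_label + TAG))
    # Randomly permute the rows so position reveals nothing about the bits.
    random.SystemRandom().shuffle(table)
    return labels, table


def evaluate_gate(table, label_a, label_b):
    """Receiver: only the row matching both held labels decrypts to a tag."""
    pad = _pad(label_a, label_b)
    for row in table:
        plain = _xor(pad, row)
        if plain[LABEL_BYTES:] == TAG:
            return plain[:LABEL_BYTES]
    raise ValueError("no row decrypted with the supplied labels")
```

The evaluator learns one output label but not which bit it encodes; mapping that label back to a bit requires the sender's decoding information.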
C. Secret Sharing and Multiplication Triplets
As described above, values are secret-shared between the two servers. In various embodiments, three different sharing schemes can be employed: Additive or arithmetic sharing, Boolean sharing and Yao sharing. In some embodiments, all intermediate values are secret-shared between the two servers.
To additively share an $\ell$-bit value $a \in \mathbb{Z}_{2^\ell}$, the party with $a$ (say Alice) generates $a^A \in \mathbb{Z}_{2^\ell}$ uniformly at random and sends $a^B = a - a^A \in \mathbb{Z}_{2^\ell}$ to the other party (Bob). We denote Alice's share by $a^A$ and Bob's by $a^B$. For ease of composition, we omit the modular operation in the protocol descriptions, i.e., $a^B = a - a^A$ stands for $a^B = (a - a^A) \bmod 2^\ell$. This disclosure mostly uses the additive sharing in the examples, but other sharing techniques may be used. To reconstruct an additively shared value $(a^A, a^B)$, the party who should learn the value receives the second share from the other party. For example, for Alice to learn $a$, Bob would send $a^B$ to Alice, who computes $a = a^A + a^B \in \mathbb{Z}_{2^\ell}$.
Given two shared values $a$ and $b$, it is easy to non-interactively add the shares by having each party compute $c^A = a^A + b^A$ and $c^B = a^B + b^B$. It is easy to see that $c = a + b = c^A + c^B$.
Boolean sharing of a bit $a \in \{0,1\}$ can be seen as additive sharing in $\mathbb{Z}_2$, and hence all the protocols discussed above carry over. In particular, the addition operation is replaced by the XOR operation (⊕) and multiplication (to be described) is replaced by the AND operation (AND(⋅,⋅)).
In the case that $a \in \mathbb{Z}_{2^\ell}$ is an $\ell$-bit value, the binary sharing of $a$ can be extended by sharing the vector $(a_0, \ldots, a_{\ell-1}) \in \{0,1\}^{\ell}$ such that $a = \sum_{i=0}^{\ell-1} a_i 2^i \in \mathbb{Z}_{2^\ell}$. We will refer to $(a_0, \ldots, a_{\ell-1}) \in \{0,1\}^{\ell}$ as the binary decomposition of $a \in \mathbb{Z}_{2^\ell}$. More generally, an N-ary or base-N decomposition of $a \in \mathbb{Z}_{2^\ell}$ is defined by $(a_0, \ldots, a_{\ell'-1}) \in \mathbb{Z}_N^{\ell'}$ such that $a = \sum_{i=0}^{\ell'-1} a_i N^i \in \mathbb{Z}_{2^\ell}$ and where $\ell' = \lceil \log_N(2^\ell) \rceil$.
Finally, one can also think of a garbled circuit protocol as operating on Yao sharing of inputs to produce Yao sharing of outputs. In particular, in all garbling schemes, for each wire $w$ the garbler (P0) generates two random strings $k_0^w$, $k_1^w$. When using the point-and-permute technique, the garbler also generates a random permutation bit $r_w$ and lets $K_0^w = k_0^w \| r_w$ and $K_1^w = k_1^w \| (1 - r_w)$. The concatenated bits are then used to permute the rows of each garbled truth table. A Yao sharing of $a$ is $a_0^Y = (K_0^w, K_1^w)$ and $a_1^Y = K_a^w$. To reconstruct the shared value, parties exchange their shares. XOR and AND operations can be performed by garbling/evaluating corresponding gates.
To switch from a Yao sharing $a_0^Y = (K_0^w, K_1^w)$ and $a_1^Y = K_a^w$ to a Boolean sharing, P0 lets $a_0^B = K_0^w[0]$ and P1 lets $a_1^B = a_1^Y[0]$. In other words, the permutation bits used in the garbling scheme can be used to switch to Boolean sharing for free. We denote this Yao to Boolean conversion by Y2B(⋅,⋅).
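Additive sharing, reconstruction, and share-wise addition can be sketched in a few lines. This is a minimal sketch assuming the shares live in the integers modulo 2^ELL with ELL = 32; the constant and the function names are illustrative choices, not values fixed by the disclosure.

```python
import secrets

ELL = 32            # assumed bit length of shared values
MOD = 1 << ELL      # arithmetic is carried out modulo 2**ELL


def share(a):
    """Split a into two additive shares (Alice's share, Bob's share)."""
    share_alice = secrets.randbelow(MOD)
    share_bob = (a - share_alice) % MOD
    return share_alice, share_bob


def reconstruct(share_alice, share_bob):
    """Recover the shared value by adding both shares."""
    return (share_alice + share_bob) % MOD


def add_shares(a_shares, b_shares):
    """Each party adds its own shares locally; no communication is needed."""
    c_alice = (a_shares[0] + b_shares[0]) % MOD
    c_bob = (a_shares[1] + b_shares[1]) % MOD
    return c_alice, c_bob


a_shares = share(1234)
b_shares = share(99999)
c_shares = add_shares(a_shares, b_shares)
```

Each individual share is uniformly distributed, so neither party learns anything about the shared value from its own share alone.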
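The base-N decomposition can be made concrete as follows. This is a small sketch in which the function names are hypothetical and the number of digits is taken to be the smallest count whose base-N range covers all ell-bit values, which is an assumption about the garbled definition above.

```python
def num_digits(ell, base):
    """Smallest digit count k such that base**k >= 2**ell."""
    count, power = 0, 1
    while power < (1 << ell):
        power *= base
        count += 1
    return count


def decompose(a, base, ell):
    """Return the base-N digits of a, least significant digit first."""
    digits = []
    for _ in range(num_digits(ell, base)):
        digits.append(a % base)
        a //= base
    return digits


def recompose(digits, base, ell):
    """Rebuild the value from its digits, reduced modulo 2**ell."""
    return sum(d * base**i for i, d in enumerate(digits)) % (1 << ell)
```

With base 2 this reduces to the ordinary binary decomposition; larger bases give fewer, wider digits.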
IV. Privacy-Preserving Unsupervised Learning
A. Euclidean Distance
In a typical embodiment, the distance may be a Euclidean distance.
The Euclidean distance, such as distance 400, may be given by a d-dimensional distance formula, such as in standard Euclidean geometry: $D_{Euc}(p, c) = \sum_{i=1}^{d} (p_i - c_i)^2$, where $D_{Euc}$ is the square of the Euclidean distance, and the data elements $p \in \mathbb{Z}_{2^\ell}^{d}$ and clusters $c \in \mathbb{Z}_{2^\ell}^{d}$ are d-dimensional vectors with elements in $\mathbb{Z}_{2^\ell}$. In this formula, p and c are fixed to a particular data point and cluster, respectively, and the index i indexes a coordinate or component of the d-dimensional vector space.
Note that, in some embodiments, the secure distance may be another distance function or metric, for example some other function of the coordinates of points 402 and 404, and is not limited by the present disclosure.
Furthermore, the location of cluster center 458 may be recomputed to include point 454, which in this example may be newly-added to the cluster. For example, if cluster center 458 is a centroid, such as a mean location of all the data points assigned to the cluster, the centroid location can then be recomputed. For example, centroid location 458 may be recomputed by computing a new mean location of all the data points, including newly-added point 454.
B. Secure Euclidean Distance
However, as described in the examples of
Thus, computing the secure distance may involve secure or privacy-preserving multiplication of Alice's and Bob's respective shares. In particular, we can break down the expression for $D_{Euc}$ into separate parts for Alice and Bob, plus a joint part: $D_{Euc}(p, c) = \sum_{i=1}^{d} (p_i^A + p_i^B - c_i^A - c_i^B)^2 = \sum_{i=1}^{d} ((p_i^A - c_i^A) + (p_i^B - c_i^B))^2$. Alice can locally compute $\sum_{i=1}^{d} (p_i^A - c_i^A)^2$, while Bob can locally compute $\sum_{i=1}^{d} (p_i^B - c_i^B)^2$. Thus, it remains for Alice and Bob to jointly compute the cross term or inner product $\sum_{i=1}^{d} (p_i^A - c_i^A)(p_i^B - c_i^B)$, which enters the expansion with a factor of 2, while preserving privacy. Embodiments of the disclosed system and methods can solve this problem by conducting privacy-preserving joint multiplication more efficiently than existing systems. Specifically, the privacy-preserving joint multiplication disclosed herein below may be used to compute the cross term, $\sum_{i=1}^{d} (p_i^A - c_i^A)(p_i^B - c_i^B)$.
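The split into two local terms and one cross term can be checked numerically. The sketch below assumes additive shares modulo 2^ELL with ELL = 32 and computes the cross term in the clear, purely to verify the algebra; in the protocol that term would be produced by the privacy-preserving multiplication. All names are illustrative.

```python
import secrets

ELL = 32
MOD = 1 << ELL


def share_vector(values):
    """Additively share each coordinate between Alice and Bob."""
    alice = [secrets.randbelow(MOD) for _ in values]
    bob = [(v - a) % MOD for v, a in zip(values, alice)]
    return alice, bob


def local_term(p_share, c_share):
    """Sum of squared share differences; needs only one party's data."""
    return sum((p - c) ** 2 for p, c in zip(p_share, c_share)) % MOD


def cross_term(p_alice, c_alice, p_bob, c_bob):
    """Inner product of the two parties' share differences (in the clear here)."""
    return sum(
        (pa - ca) * (pb - cb)
        for pa, ca, pb, cb in zip(p_alice, c_alice, p_bob, c_bob)
    ) % MOD


def squared_distance_from_shares(p_alice, c_alice, p_bob, c_bob):
    """Combine both local terms with twice the cross term."""
    return (
        local_term(p_alice, c_alice)
        + local_term(p_bob, c_bob)
        + 2 * cross_term(p_alice, c_alice, p_bob, c_bob)
    ) % MOD


p_alice, p_bob = share_vector([3, 7, 10])
c_alice, c_bob = share_vector([1, 2, 12])
distance = squared_distance_from_shares(p_alice, c_alice, p_bob, c_bob)
```

For the point (3, 7, 10) and centroid (1, 2, 12) the plaintext squared distance is 4 + 25 + 4 = 33, and the share-based computation reproduces it regardless of the random shares.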
1. Secure Multiplication with 1-Out-of-2 OT
In this example, the first computer 504 is operated by Alice, and the second computer 506 by Bob. Alice will hold some integer element $x \in \mathbb{Z}_{2^\ell}$ and Bob will hold $y \in \mathbb{Z}_{2^\ell}$. They will compute a secret sharing of $z = xy$ such that Alice holds a uniformly random $z^A \in \mathbb{Z}_{2^\ell}$ and Bob holds $z^B \in \mathbb{Z}_{2^\ell}$ such that $z = z^A + z^B$. First, Alice can express her data value $x$ in binary as a vector 508. In an embodiment, $x$ and $y$ may be Alice's and Bob's respective inputs to a cross term, $x = p_i^A - c_i^A$ and $y = p_i^B - c_i^B$.
Generally, the disclosed method may work by expressing Alice's data value $x$ as its binary decomposition $(x_0, \ldots, x_{\ell-1}) \in \{0,1\}^{\ell}$ and then using the individual bits $x_i$ to select messages from Bob in the OT 502. Intuitively, if $x_i = 0$, then $y x_i = 0$, so it is not necessary to receive information about $y$ from Bob. If $x_i = 1$, Bob's message containing information about $y$ is selected, and Alice receives information about $y x_i = y$. It then holds that $z = \sum_{i=0}^{\ell-1} (x_i y) 2^i$ is the desired value. However, this procedure would reveal Bob's input $y$ to Alice. Privacy is achieved by transmitting a random integer $r_i \in \mathbb{Z}_{2^\ell}$ when $x_i = 0$ (instead of zero) and $y + r_i$ when $x_i = 1$. Therefore Alice can compute $z^A = \sum_{i=0}^{\ell-1} (x_i y + r_i) 2^i \in \mathbb{Z}_{2^\ell}$ and Bob computes $z^B = -\sum_{i=0}^{\ell-1} r_i 2^i \in \mathbb{Z}_{2^\ell}$. As a result, OT 502 is used to perform the equivalent of the multiplication while preserving privacy.
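The bitwise procedure above can be simulated in a single process. In this sketch the OT is replaced by an ideal functionality (a plain selection), so it demonstrates only the arithmetic of the shares and not the cryptographic protection; ELL = 32 and all names are assumptions.

```python
import secrets

ELL = 32
MOD = 1 << ELL


def ideal_ot(messages, choice):
    """Stand-in for a real 1-out-of-2 OT: returns only the chosen message."""
    return messages[choice]


def multiply_1of2(x, y):
    """Return additive shares (z_alice, z_bob) of x * y modulo 2**ELL."""
    z_alice, z_bob = 0, 0
    for i in range(ELL):
        x_i = (x >> i) & 1                     # Alice's selection bit
        r_i = secrets.randbelow(MOD)           # Bob's fresh random mask
        messages = (r_i, (y + r_i) % MOD)      # Bob's two OT inputs
        received = ideal_ot(messages, x_i)     # Alice gets x_i * y + r_i
        z_alice = (z_alice + received * (1 << i)) % MOD
        z_bob = (z_bob - r_i * (1 << i)) % MOD
    return z_alice, z_bob


z_alice, z_bob = multiply_1of2(123456, 654321)
```

Adding the two shares cancels every mask r_i and leaves the sum of x_i * y * 2^i over all bits, which is the product x * y.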
In some embodiments, the system can instead perform privacy-preserving multiplication based on 1-out-of-N OT, for a value of N besides 2. This can be accomplished by modifying Alice's and Bob's messages in the OT compared to the case of 1-out-of-2 OT, as in
2. Secure Multiplication with 1-Out-of-N OT
In this example, the first computer 544 is operated by Alice, and the second computer 546 by Bob. Alice will hold some integer element $x \in \mathbb{Z}_{2^\ell}$ and Bob will hold $y \in \mathbb{Z}_{2^\ell}$. They will compute a secret sharing of $z = xy$ such that Alice holds a uniformly random $z^A \in \mathbb{Z}_{2^\ell}$ and Bob holds $z^B \in \mathbb{Z}_{2^\ell}$ such that $z = z^A + z^B$. The data value $x \in \mathbb{Z}_{2^\ell}$ held by Alice at the first computer 544 is expressed in base N. Specifically, $x$ may be expressed 548 as $(x_0, \ldots, x_{\ell'-1}) \in \mathbb{Z}_N^{\ell'}$, which is the N-ary decomposition of $x$ such that $x = \sum_{i=0}^{\ell'-1} x_i N^i \in \mathbb{Z}_{2^\ell}$.
The second computer 546 operated by Bob has a data value $y \in \mathbb{Z}_{2^\ell}$. The second computer 546 can form an $\ell' \times N$ matrix 550 based on $y$. Let $M$ be matrix 550 where $M_{i,j} = jy + r_i \in \mathbb{Z}_{2^\ell}$ and $r_i \in \mathbb{Z}_{2^\ell}$ is a uniformly random integer for $i \in \{0, 1, \ldots, \ell'-1\}$ and $j \in \{0, 1, \ldots, N-1\}$. In an embodiment, $r_i$ may be sampled by the OT protocol and therefore $M_{i,0} = r_i$ may not need to be explicitly communicated.
The first computer 544 can receive an output vector 552 from the OT. According to the OT, as described in the example of
Based on some embodiments of the OT, each OT will require approximately κ + ℓ(N−1) bits of communication, where κ = 128 is the security parameter. This follows from 1-out-of-N OT for random messages requiring κ bits of communication. Setting the remaining messages M_{i,1}, . . . , M_{i,N−1} to the desired values then requires ℓ(N−1) bits of communication. Given that the multiplication protocol requires ℓ′ = ⌈log_N(2^ℓ)⌉ OTs, the total communication is t = ⌈log_N(2^ℓ)⌉·(κ + ℓ(N−1)). Based on the required ℓ, we choose N ∈ {2, 3, 4, . . . } to minimize t.
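The choice of N can be explored numerically. The sketch below assumes that each of the N−1 non-random messages is an ℓ-bit value, so that one OT costs κ + ℓ(N−1) bits; the function names and the default ℓ = 32 are illustrative assumptions.

```python
def num_ots(n_base, ell):
    """ceil(log_N(2^ell)) computed with exact integer arithmetic."""
    k = 0
    while n_base ** k < 2 ** ell:
        k += 1
    return k


def total_bits(n_base, ell=32, kappa=128):
    """Assumed communication per multiplication: (number of OTs) * (kappa + ell*(N-1))."""
    return num_ots(n_base, ell) * (kappa + ell * (n_base - 1))


def best_base(ell=32, kappa=128, max_base=64):
    """Base N in [2, max_base] that minimizes the assumed communication cost."""
    return min(range(2, max_base + 1), key=lambda n: total_bits(n, ell, kappa))
```

Under these assumptions, ℓ = 32 and κ = 128 give 5120 bits for N = 2 and 3584 bits for N = 4, a ratio of about 1.43.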
In various embodiments, the system and methods disclosed herein can improve the computational cost of OT-based multiplication by a factor of 1.2 to 1.7. In an amortized setting, the disclosed system can realize further performance improvements, as described below.
A straightforward application of secure multiplication, as described in the examples of
3. Amortized Secure Multiplication
In some embodiments, the system may improve efficiency by applying secure multiplication in an amortized setting. Consider some fixed value x ∈ ℤ_{2^ℓ} known to Alice and a series of many values y_1, . . . , y_{m′} ∈ ℤ_{2^ℓ} known to Bob. For each new y_q, the parties desire to compute a sharing of z_q = x·y_q. The y_q values may become known to Bob at different times, e.g., a different subset of them at each iteration of the algorithm. The generic method is to repeat a previous multiplication protocol for each of the m′ multiplications.
Instead, the parties may first perform a random OT where Alice uses the N-ary decomposition of x denoted as (x0, . . . , )∈ to learn the corresponding random OT value/key. Specifically, let R∈ random messages/keys output by the OTs where Alice learns Ri,x
4. Amortized Euclidean Squared Distance
Over the course of the training process, the training points p_1, . . . , p_n ∈ ℤ_{2^ℓ}^d are fixed and secret shared between Alice and Bob. The secret-shared centroids c_1, . . . , c_m ∈ ℤ_{2^ℓ}^d are set to some initial value and then updated at each iteration of the algorithm. Let c_{j,t} ∈ ℤ_{2^ℓ}^d be the value of the jth centroid at iteration t. At each iteration t, the squared Euclidean distance e_{i,j,t} between p_i and c_{j,t} for i ∈ [n], j ∈ [m] is computed. Previously, e_{i,j,t} was expressed as e_{i,j,t} = Σ_{h=1}^{d} (p_{i,h} − c_{j,t,h})^2 = Σ_{h=1}^{d} ((p_{i,h}^A + p_{i,h}^B) − (c_{j,t,h}^A + c_{j,t,h}^B))^2.
Recall that only the mixed terms need to be computed using the secure multiplication protocol, i.e., (p_{i,h}^A − c_{j,t,h}^A)(p_{i,h}^B − c_{j,t,h}^B). In the amortized setting it can be beneficial to rewrite this as p_{i,h}^A(p_{i,h}^B − c_{j,t,h}^B) − p_{i,h}^B c_{j,t,h}^A + c_{j,t,h}^A c_{j,t,h}^B. Observe that p_{i,h}^A in the first term and p_{i,h}^B in the second term are fixed across all t ∈ [T] iterations, and therefore these terms can be computed efficiently using the amortized multiplication protocol; i.e., we define p_{i,h}^A and p_{i,h}^B as the fixed multiplicands, and (p_{i,h}^B − c_{j,t,h}^B) and c_{j,t,h}^A as the changing multiplicands for t ∈ [T].
Finally, the c_{j,t,h}^A c_{j,t,h}^B term is contained in all n Euclidean distance computations at iteration t and can be computed once at each iteration. Note that the number of centroids is typically much smaller than the number of training points, i.e., n >> m. The total overhead is therefore Tmd standard multiplications (to compute c_{j,t,h}^A c_{j,t,h}^B) and 2Tnmd amortized multiplications. By contrast, the generic approach would require Tnmd standard multiplications, which for some parameter choices results in high computational overhead.
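The rewriting of the cross term can be checked with a short plaintext sketch. It only verifies the algebra over an assumed 32-bit ring; in the protocol each product would be computed on shares via secure multiplication, and the function names here are hypothetical.

```python
MOD = 1 << 32  # assumed ring size for illustration


def cross_term_direct(p_a, p_b, c_a, c_b):
    """(p^A - c^A) * (p^B - c^B) mod 2^32."""
    return ((p_a - c_a) * (p_b - c_b)) % MOD


def cross_term_amortized(p_a, p_b, c_a, c_b):
    """Same value, split into the three products used in the amortized setting."""
    term1 = p_a * (p_b - c_b)   # fixed multiplicand p^A, changing multiplicand (p^B - c^B)
    term2 = p_b * c_a           # fixed multiplicand p^B, changing multiplicand c^A
    term3 = c_a * c_b           # independent of the point; once per centroid per iteration
    return (term1 - term2 + term3) % MOD


def squared_distance_from_shares(p_a, p_b, c_a, c_b):
    """Squared Euclidean distance from additive shares of a point and a centroid."""
    total = 0
    for h in range(len(p_a)):
        a = p_a[h] - c_a[h]     # computable locally by Alice
        b = p_b[h] - c_b[h]     # computable locally by Bob
        total += a * a + b * b + 2 * cross_term_amortized(p_a[h], p_b[h], c_a[h], c_b[h])
    return total % MOD
```

The point and centroid are recovered as p = p^A + p^B and c = c^A + c^B, so the result agrees with Σ_h (p_h − c_h)^2 modulo the ring size.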
The disclosed method of
However, later iterations of k-means clustering may reuse the already-performed OT from the first iteration for the first two terms. Therefore, later iterations only need instances of OT to compute the last term of the DEuc cross-term.
Accordingly, a total cost of the k-means clustering process is (2n+mt)d instances of OT, where t is the number of iterations, compared with nmtd for the generic approach. For example, if n=1,000, m=10, and t=50, the disclosed solution shows a 200-fold improvement. In a further example, if n=10,000, m=10, and t=100, the disclosed solution shows an approximately 500-fold improvement.
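The quoted improvement factors can be reproduced with a few lines of arithmetic. The sketch assumes an amortized cost of (2n + mt)d OT instances and a generic cost of nmtd, which is the reading that matches the 200-fold figure; the function names are illustrative.

```python
def generic_ots(n, m, t, d):
    """Assumed OT count when every multiplication is repeated from scratch."""
    return n * m * t * d


def amortized_ots(n, m, t, d):
    """Assumed OT count when the first two cross terms reuse the first iteration's OTs."""
    return (2 * n + m * t) * d


def improvement(n, m, t, d=1):
    """Ratio of generic to amortized OT counts; d cancels out."""
    return generic_ots(n, m, t, d) / amortized_ots(n, m, t, d)
```

For n=10,000, m=10, and t=100 this ratio is about 476, in line with the approximate figure stated above.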
5. Communication Flow Diagram
Schematically, in this example the two computing devices 610 and 620 are shown as sending respective information to OT 630. As described in the example of
In an embodiment, a first ℓ-component vector 640 is transmitted from the first computing device 610 to the 1-out-of-N OT 630. The ℓ-component vector 640 may be a base-N decomposition of the first data value. For example, if N=2, the ℓ-component vector 640 may be a binary decomposition, and the components of vector 640 may be binary bits representing the first data value. In another example, if N=4, the ℓ-component vector 640 may be a base-4 decomposition.
In an embodiment, an ℓ×N matrix 650 is transmitted from the second computing device 620 to the 1-out-of-N OT 630. The ℓ×N matrix 650 may comprise ℓ vectors with N components each. An ith respective N-component vector of the ℓ×N matrix may have the index i of the respective decomposition coefficient and a second index j. In an embodiment, a first component M_{i,0} of the ith respective N-component vector can comprise a respective pseudo-random number r_i. In an embodiment, this first component M_{i,0} may have the first index i and may have the second index equal to zero or one. A respective remaining component M_{i,j} of the respective N-component vector, having the index i and having the second index equal to j, can comprise the second data value multiplied by j and by N raised to a power of i, minus the first component of the respective N-component vector.
Finally, an output vector or output value 660 is transmitted from the 1-out-of-N OT 630 to the first computer 610. According to the OT, as described in the example of
In an embodiment, the first computer can instead receive an output value 660 from the 1-out-of-N OT. The output value 660 may comprise a sum over i of components of the respective N-component vector multiplied by N to the power i. A respective component in this sum may have the index i and have the second index corresponding to the respective decomposition coefficient of the first data value in the base N. That is, the output value may comprise z_A = Σ_i M_{i,x_i}·N^i.
In an embodiment, the second computer may obtain a second output vector or value 670 from the 1-out-of-N OT 630. A component, having an index i, of the second output vector 670 may comprise a negative component −M_{i,0} of the ℓ×N matrix 650 (or the ith respective N-component vector of matrix 650). This negative component −M_{i,0} may have the index i and the second index 0. That is, the ith component of the output vector 670 may comprise the opposite of the first component (having index 0) of the ith respective N-component vector of the ℓ×N matrix 650, i.e., the opposite of the respective pseudo-random number r_i. In an embodiment, the second computer can instead obtain an output value 670 from the 1-out-of-N OT 630. The output value 670 may comprise a sum over i of negative components −M_{i,0} of the respective N-component vector multiplied by N to the power i, a respective negative component −M_{i,0} within the sum having the index i and having the second index 0. That is, the output value may comprise z_B = −Σ_i M_{i,0}·N^i ∈ ℤ_{2^ℓ}.
Accordingly, in an embodiment, the final sharing of z=xy may be computed as a first output value 660 of zA=Mi,x
The first and/or second computer may then privately assign data to a respective cluster of a plurality of clusters based on the jointly computed secure distance.
C. Assigning Data Points to Clusters
As described in the examples of
Accordingly, the system can assign p_i to the cluster c_{j*}. This must be achieved without revealing the values of p_i, c_j, e_{i,j}, and j* to either party, i.e., the computation is performed on secret shares of these values.
In an embodiment, Alice and Bob may have separate shares of the distances ei,j, i.e. Alice holds ei,jA∈ and Bob holds ei,jB∈. In particular, as described in the examples of
the system may apply a garbled circuit. The system may present j* as a binary vector J* ∈ {0,1}^m, where J*_{j*} = 1, and where J*_k = 0 for all k ≠ j*.
In one embodiment, for each i, the parties input the binary decompositions of their shares e_{i,j}^A, e_{i,j}^B into a garbled circuit computation. Proceeding recursively, assume the parties have computed J_0, J_1 ∈ {0,1}^{m/2}, where J_0 is the argmin vector for e_{i,1}, . . . , e_{i,m/2} and J_1 is the argmin vector for e_{i,m/2+1}, . . . , e_{i,m}. Moreover, let e_0 and e_1 be the minimum values corresponding to J_0 and J_1. The final J* can be computed as J* = cJ_0 ∥ (1⊕c)J_1 and e* = c·e_0 + (1⊕c)·e_1, where c = 1 if e_0 < e_1 and 0 otherwise, so that the half containing the smaller value is retained. The comparison may be computed within the garbled circuit using the standard comparison circuit. The multiplication between c and J_b, along with c and e_b, can be performed within the garbled circuit or using the OT-based multiplication protocol. Note that the base case of the recursion is a single e_{i,j}, which is the minimum by definition. Embodiments of the disclosed system may make use of other efficient solutions for this conversion between share types. Furthermore, the disclosed system and methods may be more secure than existing approaches, in particular by preventing leakage of the cluster identification j* with the minimum distance.
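The recursion for the one-hot argmin vector can be sketched in the clear. This plaintext version only illustrates the data flow; in the embodiment the comparison and selections would run inside a garbled circuit on shared values. The function name is hypothetical, and ties are resolved toward the second half because c = 1 only for a strict inequality.

```python
def argmin_onehot(dists):
    """Return (J, e): a one-hot list marking a minimum entry of dists, and that minimum."""
    if len(dists) == 1:
        return [1], dists[0]                 # base case: a single value is the minimum
    mid = len(dists) // 2
    j0, e0 = argmin_onehot(dists[:mid])      # first half
    j1, e1 = argmin_onehot(dists[mid:])      # second half
    c = 1 if e0 < e1 else 0                  # comparison bit
    j_star = [c * b for b in j0] + [(1 ^ c) * b for b in j1]   # c*J0 || (1 xor c)*J1
    e_star = c * e0 + (1 ^ c) * e1
    return j_star, e_star
```

For example, `argmin_onehot([5, 3, 9, 4])` marks the second entry and returns the minimum 3.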
D. Updating Clusters
After assigning data points to the closest clusters, the system can further update the location of the centroid of each cluster: ck=avg(p), pi∈Ck. Table 1 shows an example of a flag Mik for cluster assignment for data points p1 through p4. In particular, Table 1 shows, for cluster index k, the value of the flag Mik, which indicates whether or not point pi is assigned to cluster k. For example, points p1 and p3 both are assigned to the cluster k=3.
In an embodiment, the new cluster centroid can be computed according to: c_k = (Σ_{i=1}^{n} M_{ik}·p_i) / (Σ_{i=1}^{n} M_{ik}).
In this example, M13=0 and M33=0, so p1 and p3 do not contribute to the centroid calculation for cluster k=3.
Note that, in this formula, both the flag Mik and the coordinates of the data point pi are shared. Thus, a possible direct solution to this computation would be to convert the Boolean share (e.g., of Mik) to an arithmetic share, or secure multiplication of Mik by pi. However, in embodiments, the system may instead use two OTs.
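One common way to multiply an XOR-shared bit by an additively shared value with two OTs is sketched below. It is an assumption about how such a step could be realized, and the embodiment may differ in its details; `ot_1_of_2` simulates an ideal OT locally, and the 32-bit ring is chosen for illustration.

```python
import secrets

MOD = 1 << 32  # assumed ring size


def ot_1_of_2(messages, choice_bit):
    """Hypothetical ideal OT: the receiver learns only messages[choice_bit]."""
    return messages[choice_bit]


def flag_times_value(b_a, p_a, b_b, p_b):
    """Return shares (z_a, z_b) of (b_a XOR b_b) * (p_a + p_b) mod 2^32."""
    # OT 1: Alice sends, Bob chooses with his flag share b_b.
    # Bob learns (b_a XOR b_b) * p_a - r; Alice keeps the mask r.
    r = secrets.randbelow(MOD)
    bob_part = ot_1_of_2([((b_a ^ c) * p_a - r) % MOD for c in (0, 1)], b_b)
    # OT 2: Bob sends, Alice chooses with her flag share b_a.
    # Alice learns (b_a XOR b_b) * p_b - s; Bob keeps the mask s.
    s = secrets.randbelow(MOD)
    alice_part = ot_1_of_2([((c ^ b_b) * p_b - s) % MOD for c in (0, 1)], b_a)
    z_a = (r + alice_part) % MOD
    z_b = (bob_part + s) % MOD
    return z_a, z_b
```

The design avoids a separate Boolean-to-arithmetic conversion: each OT handles the product of the combined flag with one party's share of the coordinate.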
V. Privacy-Preserving Clustering
In some embodiments, the privacy-preserving unsupervised learning may comprise k-means clustering, as in the example of
At step 705, the first computer may express the first data value as a first vector having a number ℓ of components. A respective component, having an index i, may comprise a respective decomposition coefficient of the first data value in a base equal to N. In particular, the ℓ-component vector may be a base-N decomposition of the first data value. For example, if N=2, the ℓ-component vector may be a binary decomposition, and the components of the vector may be binary bits representing the first data value. In another example, if N=4, the ℓ-component vector may be a base-4 decomposition.
At step 710, the second computer may form an ℓ×N matrix. A respective N-component vector of the ℓ×N matrix may have the index i of the respective decomposition coefficient and a second index j.
In an embodiment, a first component of the respective N-component vector can comprise a respective pseudo-random number. In an embodiment, this first component may have the index i and may have the second index equal to zero or one. A respective remaining component of the respective N-component vector, having the index i and having the second index equal to j, can comprise the second data value multiplied by j and by N raised to a power of i, minus the first component of the respective N-component vector.
At step 715, the first computer can receive an output vector of the 1-out-of-N OT. A component, having an index i, of the output vector may comprise a component of the respective N-component vector, the component having the index i and having the second index corresponding to the respective decomposition coefficient of the first data value in the base equal to N. In an embodiment, the first computer can instead receive an output value from the 1-out-of-N OT. The output value may comprise a sum over i of components of the respective N-component vector multiplied by N to the power i, a respective component in the sum having the index i and having the second index corresponding to the respective decomposition coefficient of the first data value in the base equal to N. That is, the output value may comprise z_A = Σ_i M_{i,x_i}·N^i.
In an embodiment, the second computer can obtain a second output vector of the 1-out-of-N OT. A component, having an index i, of the second output vector may comprise the negative of a component of the respective N-component vector, the component having the index i and having the second index 0. That is, the component, having index i, of the second output vector may comprise the negative of the first component of the respective N-component vector in step 710, which comprises the respective pseudo-random number. In an embodiment, the second computer can instead obtain an output value from the 1-out-of-N OT. The output value may comprise a negated sum over i of components of the respective N-component vector multiplied by N to the power i, a respective component in the sum having the index i and having the second index 0. That is, the output value may comprise z_B = −Σ_i M_{i,0}·N^i ∈ ℤ_{2^ℓ}. In an embodiment, r_i may be sampled by the OT protocol, and therefore M_{i,0} = r_i may not need to be explicitly communicated.
In an embodiment, N may equal 2 and the 1-out-of-N OT may comprise 1-out-of-2 OT. Alternatively, in an embodiment, N may equal 4 and the 1-out-of-N OT may comprise 1-out-of-4 OT.
In an embodiment, the first computer initially has the first data value and receives a first output share value. The second computer may initially have the second data value and may receive a second output share value. The first output share value and second output share value may sum to a product of the first data value and the second data value. Note that both shares preserve the privacy of the input values x and y. That is, the system can avoid leaking or revealing any information apart from the final multiplication output.
In an embodiment, the second data value may adaptively change. A later iteration may reuse the 1-out-of-N OT from a first iteration.
At step 720, the system may then privately assign data to a respective cluster of a plurality of clusters, based on the jointly-computed secure distance. Privately assigning data to clusters may be based on the methods disclosed above, and in the example of
A. Assigning Data to Clusters
At step 735, the system may identify, via a garbled circuit, a best match cluster of the plurality of clusters for a respective element of a plurality of elements of the data. The best match cluster may have a centroid with a minimum distance to the respective element. As described above, the system may apply the garbled circuit to compute j* = argmin_j e_{i,j},
where e_{i,j} = dist(p_i, c_j) for j = 1, . . . , m and i = 1, . . . , n.
At step 740, the system may represent the best match cluster as a binary vector comprising a cluster flag for the respective element. As described above, the system may present j* as a binary vector J* ∈ {0,1}^m, where J*_{j*} = 1, and where J*_k = 0 for all k ≠ j*.
B. Updating Cluster Centroids
At step 765, the system may multiply, for a respective element of a plurality of elements of the data, a combined first share and second share of a cluster flag for the cluster and the respective element by a combined first share and second share of a position vector for the respective element. The multiplication may include performing a second oblivious transfer (OT) and a third OT. The first share of the cluster flag and the first share of the position vector may belong to the first computer. The second share of the cluster flag and the second share of the position vector may belong to the second computer.
In an embodiment, the first share and second share of the cluster flag are combined by exclusive OR.
At step 770, the system may sum a product of the multiplying over the plurality of elements. As described above, the system may form Σ_{i=1}^{n} M_{ik}·p_i, or equivalently Σ_{i=1}^{n} (M_{ik}^A ⊕ M_{ik}^B)·(p_i^A + p_i^B).
At step 775, the system may divide the summed product by a sum over the plurality of elements of the combined first share and second share of the cluster flag. As described above, the system may form c_k = (Σ_{i=1}^{n} M_{ik}·p_i) / (Σ_{i=1}^{n} M_{ik}).
At step 780, the system may update the centroid based on a result of the dividing. Specifically, the new cluster centroid can be computed according to ck, as in step 775. The system may then update the centroid coordinates for the kth cluster to ck, and may subsequently use the updated coordinates when computing distances of the data points pi from the centroid for each cluster.
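As a correctness reference for steps 765 through 780, the following sketch recombines the shares in the clear and evaluates the centroid formula. Recombining shares is done here only to check the arithmetic and would not occur in a privacy-preserving run; the function name, the 32-bit ring, and the floating-point division are assumptions.

```python
MOD = 1 << 32  # assumed ring size for the additive shares


def centroid_from_shares(flags_a, flags_b, pts_a, pts_b):
    """Centroid of the points whose XOR-combined flag equals 1 (plaintext reference only)."""
    flags = [a ^ b for a, b in zip(flags_a, flags_b)]            # M_ik = M_ik^A xor M_ik^B
    pts = [[(u + v) % MOD for u, v in zip(pa, pb)]               # p_i = p_i^A + p_i^B
           for pa, pb in zip(pts_a, pts_b)]
    count = sum(flags)                                           # denominator: sum of flags
    if count == 0:
        return None                                              # empty cluster: no update
    dim = len(pts[0])
    return [sum(f * p[h] for f, p in zip(flags, pts)) / count for h in range(dim)]
```

Points whose combined flag is 0 contribute nothing to either the numerator or the denominator.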
VI. Example for Unsupervised Learning
At step 810, the system can first set the number of clusters equal to a value k. In some embodiments, k may be specified by a user.
At step 820, the system can then select k initial clusters of the data points. This may be done in various ways, e.g., by selecting the clusters randomly. In some cases, the initial choice of clusters is arbitrary, since the method may in any case eventually converge on optimal, or locally optimal, clusters. However, in some cases, the converged clusters may depend on the initial choice. The initial clusters may also be referred to as seed clusters.
Next, at step 830, the system can calculate the distances between the individual data points and all the cluster centroids. In some embodiments, the distances can be calculated in some other way, for example based on a respective cluster as a whole.
At step 840, the system may then assign each data point to a cluster to which the data point has the minimum distance.
At step 850, the system can then compute new cluster centroids based on the new assignments of data points to clusters.
At step 860, the system can then determine whether to perform another iteration. For example, the system can determine to perform another iteration if any of the cluster centroids have moved. If the system determines to perform another iteration, it can return to calculating the distances. If the system does not determine to perform another iteration, the method can end, resulting in the optimized, or locally optimized, cluster assignments.
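For orientation, steps 810 through 860 correspond to the familiar k-means loop, sketched below in plaintext with no privacy protection. The function name, the random seeding, and the iteration cap are illustrative assumptions.

```python
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Plain k-means: returns (centroids, assignment) for a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]        # step 820: seed clusters
    assignment = []
    for _ in range(max_iters):
        # Steps 830-840: squared distance to every centroid, then pick the closest.
        assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Step 850: recompute each centroid as the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[j])               # keep an empty cluster in place
        # Step 860: stop when no centroid has moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment
```

In the privacy-preserving setting, the distance, assignment, and update steps would each be replaced by the share-based procedures described earlier.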
VII. Computer System and Apparatus
Storage media and computer-readable media for containing code, or portions of code, can include any appropriate media known or used in the art, including storage media and communication media, such as but not limited to volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage and/or transmission of information such as computer-readable instructions, data structures, program modules, or other data, including RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, data signals, data transmissions, or any other medium which can be used to store or transmit the desired information and which can be accessed by the computer. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will appreciate other ways and/or methods to implement the various embodiments.
Embodiments of the disclosure provide for a number of advantages over conventional systems. For example, in various embodiments, the system and methods disclosed herein can improve the computational cost of OT-based multiplication by a factor of 1.2 to 1.7. In an adaptive amortized setting, the disclosed efficient multiplication may have a computational cost of O((n+mt)d), versus O(nmtd), an improvement on the order of nmt/(n+mt), where n is the number of points, m is the number of clusters, t is the number of iterations, and d is the dimensionality of the data points. In an example, if n=10,000, m=10, and t=100, the disclosed system and methods can show an approximately 500-fold improvement. Moreover, the disclosed protocol is more efficient than generic secure protocols (MPC and FHE), and can scale to very large datasets.
In the preceding description, various embodiments have been described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the embodiments. However, it will also be apparent to one skilled in the art that the embodiments may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the embodiment being described.
It should be understood that any of the embodiments of the present disclosure can be implemented in the form of control logic using hardware (e.g. an application specific integrated circuit or field programmable gate array) and/or using computer software with a generally programmable processor in a modular or integrated manner. As used herein, a processor includes a single-core processor, multi-core processor on a same integrated chip, or multiple processing units on a single circuit board or networked. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will know and appreciate other ways and/or methods to implement embodiments of the present disclosure using hardware and a combination of hardware and software.
Any of the software components or functions described in this application may be implemented as software code to be executed by a processor using any suitable computer language such as, for example, Java, C, C++, C#, Objective-C, Swift, or a scripting language such as Perl or Python using, for example, conventional or object-oriented techniques. The software code may be stored as a series of instructions or commands on a computer readable medium for storage and/or transmission. Suitable media include random access memory (RAM), read only memory (ROM), a magnetic medium such as a hard drive or a floppy disk, an optical medium such as a compact disk (CD) or DVD (digital versatile disk), flash memory, and the like. The computer readable medium may be any combination of such storage or transmission devices.
Such programs may also be encoded and transmitted using carrier signals adapted for transmission via wired, optical, and/or wireless networks conforming to a variety of protocols, including the Internet. As such, a computer readable medium according to an embodiment of the present disclosure may be created using a data signal encoded with such programs. Computer readable media encoded with the program code may be packaged with a compatible device or provided separately from other devices (e.g., via Internet download). Any such computer readable medium may reside on or within a single computer product (e.g. a hard drive, a CD, or an entire computer system), and may be present on or within different computer products within a system or network. A computer system may include a monitor, printer, or other suitable display for providing any of the results mentioned herein to a user.
The above description is illustrative and is not restrictive. Many variations of the disclosure will become apparent to those skilled in the art upon review of the disclosure. The scope of the disclosure should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the pending claims along with their full scope or equivalents.
One or more features from any embodiment may be combined with one or more features of any other embodiment without departing from the scope of the disclosure.
A recitation of “a”, “an” or “the” is intended to mean “one or more” unless specifically indicated to the contrary.
All patents, patent applications, publications, and descriptions mentioned above are herein incorporated by reference in their entirety for all purposes. None is admitted to be prior art.
This application is a Continuation of U.S. patent application Ser. No. 16/675,499, filed on Nov. 6, 2019, the disclosure of which is herein incorporated by reference in its entirety for all purposes.
Number | Date | Country |
---|---|---|
20230252358 A1 | Aug 2023 | US |

Relation | Number | Date | Country |
---|---|---|---|
Parent | 16675499 | Nov 2019 | US |
Child | 18302965 | | US |