The present invention relates to a method of masking data making up a user profile associated with a node of a network, and a method for estimating the similarity between a first node and a second node of a network, each node having an associated user profile.
The invention belongs to the field of privacy protection in networks, notably in distributed networks.
Nowadays, communication over the Internet is widespread, and many distributed networks, such as social networks, are built on data and preferences supplied by users. Typically, such a distributed network is formed of nodes, each node being associated with a user and having an associated user profile, consisting of a sub-set of elements present from among a set of possible elements, relating to particularities or to preferences of the user.
Various applications use the user profile, for example in order to generate groups of users having common interests or for recommendation systems able to propose to a user, new products compliant with his/her interests. Such applications require the calculation of a similarity between user profiles in order to determine their proximity according to the data or preferences expressed in the user profiles.
It is known, in a centralized network in which the nodes representing the users are considered as clients, to use a central server for carrying out such a similarity calculation. This poses two types of problems: a confidentiality problem for the data making up the user profile, on the one hand, and a problem of the computing power required by such a central server carrying out similarity calculations, on the other hand.
A goal of the invention is to propose a method of masking data making up a user profile, giving the possibility of obtaining a representation of a user profile which has guarantees in terms of confidentiality while having good usefulness for a similarity calculation between profiles associated with the nodes of a network.
For this purpose, the invention according to an example relates to a method of masking data making up a user profile associated with a node of a network, a user profile consisting of a sub-set of elements present from among a set of possible elements. The method of the invention includes the following steps:
Advantageously, the method of the invention gives the possibility of obtaining a masked data structure representative of the elements present in the user profile, which observes a pre-determined confidentiality level.
The method of masking data according to the invention may have one or several of the features below:
According to another example, the invention relates to a method for estimating similarity between a first node and a second node of a network, each node having an associated user profile, a user profile consisting of a sub-set of elements present among a set of possible elements. The estimation method includes the steps:
Advantageously, the similarity estimation may be carried out on any node of the network insofar as the similarity estimation is made from the obtained masked data structures. In particular, when the similarity estimation is carried out on various nodes of the network, as in a distributed system of the peer-to-peer type, this removes the need for a central server carrying out all the calculations.
The similarity estimation method according to the invention may have one or more of the features below:
Other features and advantages of the invention will become apparent from the description which is made below, as an indication and by no means as a limitation, with reference to the appended figures, wherein:
Of course, a network according to the invention in practice consists of any number of nodes; the number of nodes may change dynamically.
In practice, a node of the network is implemented by a programmable device of the personal computer type, having computing capabilities and the capabilities for connecting to a communications network. For example, the nodes 2, 4, 6 are connected via the Internet network.
The nodes 2, 4, 6 of the network 1 each have an identifier, the identifiers being respectively noted as A, B and C.
Each node has an associated user profile, consisting of a sub-set of elements present among a set of possible elements. The set of possible elements is either finite or not.
For example, the present elements designate addresses of the URL (“Uniform Resource Locator”) type of documents for which the user has expressed preference.
Each node computes and stores in memory an initial representation of the associated user profile, respectively noted as PA for node A, PB for node B, and PC for node C. This initial representation, also called a private profile, is calculated in a deterministic way.
Moreover, each node calculates and stores a masked representation of the associated user profile, also called public profile, respectively noted as P*A for node A, P*B for node B, and P*C for node C, the calculation being carried out according to one of the embodiments of the invention explained in detail hereafter with reference to
Thus, the masked representations may be published, i.e. transmitted to other nodes of the network 1, while guaranteeing confidentiality and a security level with respect to malicious attacks seeking to retrieve the masked or hidden data of the user profile. The initial representations are private representations, which are locally stored on each node and which are not made public.
As illustrated by arrows, illustrated in
Alternatively, each node transmits its public profile to a central server or to another node of the network which carries out the similarity computations between two public profiles, therefore in their masked representation.
In the example of
Each of the nodes A and B is able to compute similarity values between nodes, and to provide these values to client applications for example. These nodes A and B then play the role of a server in the network 1.
For example, node A may estimate the similarities s(A, B) and s(A, C) from PA, P*B and P*C, and node B may estimate the similarities, s(B, A) and s(B,C) from PB, P*A and P*C.
Node C does not receive any masked representation of other nodes so cannot carry out any similarity computation. Node C plays a role of a client.
Each node of a network according to the invention is applied on a programmable device, such as a computer, the main functional blocks of which are schematically illustrated in
Thus, a programmable device 10 able to apply the invention comprises a screen 12, a means 14 for inputting commands from a user, for example a keyboard, which may be integrated into a touch screen, a central processing unit 16, capable of executing control programme instructions when the programmable device is on. The programmable device 10 also includes means for storing information 18, for example registers, capable of storing executable code instructions allowing software application of a method of masking the data and/or of a similarity estimation method according to the invention. Further, the programmable device 10 includes means 20 for communicating with a communications network.
The various functional blocks of the device 10 described above are for example connected via a communications bus 22.
Such a masking method is for example applied at every update of the user profile associated with a node N by the user, for example every time the user expresses a new preference.
Alternatively, such a masking method is applied periodically or on demand.
In a first step 30, the elements present in the user profile of the node N are retrieved. These elements are stored in a memory 18 of the device 10. For example the user profile consists of S elements PU(N)={p1, . . . , ps}, which are present elements, each present element having a unique associated identifier. For example, the identifier is a number associated with a present element.
Subsequently, in the next step 32, an initial representation of the user profile is generated, as a data structure PN={b0, . . . , bM-1} including a predetermined number M of binary elements, i.e. elements which may assume a value from among two possible values {Va, Vb}.
In the preferred embodiment, the possible values are Va=0, Vb=1, therefore each bi is equal to 0 or to 1, and the initial data structure PN is obtained by applying a Bloom filter on the elements pi present in the user profile PU(N).
In a known way, a Bloom filter is a compact probabilistic structure for representation of data, giving the possibility of determining with a certain probability that an element is present in a set of data.
The number M of binary elements of the data structure PN is fixed, regardless of the number S of elements present in the user profile PU(N).
In order to obtain the initial data structure with a Bloom filter, a number K of hash functions is used. Each hash function maps an element present in the user profile PU(N) to an index i comprised between 0 and M−1, and the value of the binary element of index i, bi, is set to 1, or more generally to a first value among the two possible values.
In practice, a hash function h applied to a present element pk carries out a pseudo-random draw seeded with the unique identifier of pk, modulo M, which yields a numerical value: h(pk)=i. The element bi of the initial data structure is then set to the value 1: bi=1.
The taking into account of a new present element ps+1 or the addition of an element of the user profile is accomplished by successively and independently applying each of the K hash functions h(ps+1), and setting to the value of 1 the designated bits of the data structure PN, independently of the values already assumed by the bits bi of the data structure.
Moreover, the data structure PN obtained through the Bloom filter may also be used for checking for the presence or not of an element pk in the user profile associated with a node N. If at least one bit corresponding to a binary position obtained by the applied hash functions with the element pk is equal to 0, the element pk is certainly absent from the user profile.
On the other hand, the fact that all the bits corresponding to a binary position obtained by the K applied hash functions with the element pk are equal to 1, only gives the possibility of inferring the presence of the element pk in the user profile with a certain probability, since it is possible that collisions appear. Thus, a structure obtained by Bloom filtering is representative of the elements present in the user profile with an associated certainty level.
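The Bloom filtering of steps 30 and 32 can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function names `bloom_indexes`, `build_profile` and `maybe_present` are hypothetical, and deriving the K hash functions from SHA-256 with a per-function prefix is an assumption, since the text only requires a pseudo-random draw seeded by the element identifier, modulo M.

```python
import hashlib
from typing import Iterable, List


def bloom_indexes(element_id: str, m: int, k: int) -> List[int]:
    """Return the K bit positions for one element.

    Assumption: the j-th hash function is SHA-256 of "j:identifier",
    reduced modulo M. Any K independent hash functions would do.
    """
    positions = []
    for j in range(k):
        digest = hashlib.sha256(f"{j}:{element_id}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % m)
    return positions


def build_profile(elements: Iterable[str], m: int, k: int) -> List[int]:
    """Build the initial data structure PN = {b0, ..., bM-1}."""
    bits = [0] * m
    for element_id in elements:
        for i in bloom_indexes(element_id, m, k):
            bits[i] = 1  # set regardless of the value already held
    return bits


def maybe_present(bits: List[int], element_id: str, k: int) -> bool:
    """False means certainly absent; True means present with some probability."""
    return all(bits[i] == 1 for i in bloom_indexes(element_id, len(bits), k))
```

A call such as `build_profile(["http://example.org/a"], 64, 3)` returns a list of 64 bits of which at most 3 are set; the URL is a placeholder.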
Next, in step 34, a masked representation of the user profile is obtained, noted as PN*={v0, . . . , vM-1}, by applying a probabilistic flip operation of one or several binary values of the data structure obtained previously.
In an example, for each bit bi of the initial data structure, an inversion of the binary value is either applied or not, depending on the result of a random draw, with a probability p.
A draw of a uniform variable X is carried out in the interval [0, 1].
The principle of the flip operation inverting the value of the binary element bi is the following:
If X≤p then vi=flip(bi)=1−bi
Otherwise, vi=flip(bi)=bi.
More generally, if X≦p, the binary element bi changes value and assumes the other possible value, otherwise it remains unchanged.
According to an alternative, the probabilistic inversion operation is only applied on a sub-set of the binary elements of the initial data structure, for example on one binary element bi out of two, or else on a sub-set of binary elements of the initial data structure also selected in a pseudo-random way.
A data structure P*N, corresponding to a masked representation of the user profile is thus obtained at the end of step 34.
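The probabilistic flip of step 34 can be sketched as below, assuming bits valued 0 and 1 and an independent uniform draw per bit. The name `mask_profile` and the optional `rng` argument are hypothetical and only serve the illustration; a real deployment would likely need a cryptographically secure random source, which this sketch does not provide.

```python
import random
from typing import List, Optional


def mask_profile(bits: List[int], p: float,
                 rng: Optional[random.Random] = None) -> List[int]:
    """Return the masked structure P*N obtained by flipping each bit with probability p."""
    if rng is None:
        rng = random.Random()
    masked = []
    for b in bits:
        x = rng.random()  # uniform draw X
        masked.append(1 - b if x <= p else b)  # flip when X <= p
    return masked
```

With p close to 0 the masked structure stays close to the initial one; with p=1/2 every output bit is uniformly random, whatever the input.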
The probabilistic inversion operation is applied with a probability p. It clearly appears that the value of p has a strong impact on the obtained confidentiality level, and also on the usefulness level for similarity calculations of the obtained masked representation. Indeed, a value of p=½ results in a random result, which totally preserves confidentiality but in this case the masked representation is of no use for a similarity calculation.
It is therefore important to determine a probability value p that achieves a predetermined confidentiality level, according to a selected confidentiality metric, while preserving a sufficient usefulness level.
By using the differential confidentiality metric, or “differential privacy” defined in the article “Differential privacy: a survey of results”, by C. Dwork, published in Proceedings of the 5th International Conference on Theory and Applications of Models of Computation, Xi'an, China, 25-29 Apr. 2008, pages 1-19, it is possible to calculate p according to the confidentiality parameters relating to the confidentiality of each binary element bi by:
exp(−ε)·Pr[flip(1)=bi] ≤ Pr[flip(0)=bi] ≤ exp(ε)·Pr[flip(1)=bi]
wherein exp represents the exponential function and Pr[A] the probability of an event A. A binary element bi has the value 0 or 1.
The confidentiality ε of each binary element bi according to this confidentiality metric is ensured for a probability value p such that:
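The condition on p is not reproduced in this text. As a worked sketch, a threshold consistent with the inequality above can be derived under the assumption that 0 < p ≤ 1/2 and that each bit is flipped independently with probability p:

```latex
% Output probabilities of the flip operation for each input bit.
\[
\Pr[\mathrm{flip}(0)=0] = 1-p, \quad \Pr[\mathrm{flip}(1)=0] = p, \quad
\Pr[\mathrm{flip}(0)=1] = p, \quad \Pr[\mathrm{flip}(1)=1] = 1-p .
\]
% For p <= 1/2 the largest ratio between two such probabilities is (1-p)/p,
% so the two-sided inequality holds for both output values exactly when:
\[
\frac{1-p}{p} \le e^{\varepsilon}
\quad\Longleftrightarrow\quad
p \ge \frac{1}{1+e^{\varepsilon}} .
\]
```

Under these assumptions, the smallest admissible flip probability, which also distorts the profile the least, is p = 1/(1+exp(ε)).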
In an alternative example, the probability value p is calculated in order to ensure confidentiality at the level of the elements pi present in the user profile from which the data structure is calculated. For this, the number K of hash functions is also taken into account.
The following metric is used. Let there be two profiles PU1 and PU2 which only differ by a single element pi, present in PU1 and absent in PU2. The initial data structures obtained by Bloom filtering associated with these profiles are noted as PN1 and PN2, and P*N1 and P*N2 are the masked data structures obtained by applying a probabilistic inversion operation according to the embodiment described above.
A confidentiality parameter ε is defined by the following inequalities:
exp(−ε)·Pr[pi ∈ P*N2] ≤ Pr[pi ∈ P*N1] ≤ exp(ε)·Pr[pi ∈ P*N2]
exp(−ε)·Pr[pi ∈ P*N1] ≤ Pr[pi ∈ P*N2] ≤ exp(ε)·Pr[pi ∈ P*N1]
The confidentiality of each present element pi according to this confidentiality metric is ensured for a probability value p such that:
wherein K is the number of hash functions used in Bloom filtering.
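The two ways of choosing p can be compared with a small helper. The formulas are assumptions, since the conditions themselves are not reproduced in this text: the per-bit value p = 1/(1+exp(ε)) follows from the inequality on flip(0) and flip(1), and the per-element value p = 1/(1+exp(ε/K)) rests on the argument that adding or removing one element changes at most K bits of the Bloom filter, so the budget ε is split over K bits. The function names are hypothetical.

```python
import math


def flip_probability_per_bit(epsilon: float) -> float:
    """Assumed flip probability protecting each bit bi with parameter epsilon."""
    return 1.0 / (1.0 + math.exp(epsilon))


def flip_probability_per_element(epsilon: float, k: int) -> float:
    """Assumed flip probability protecting each element pi, for K hash functions."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 / (1.0 + math.exp(epsilon / k))
```

For a fixed ε, a larger K pushes p toward 1/2, which reflects that protecting a whole element costs more usefulness than protecting a single bit.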
Step 34 is followed by a step 36 for publishing in the network the calculated masked representation P*N.
The example described above includes the application of Bloom filtering for obtaining an initial representation in the form of an initial data structure, and then the application of a probabilistic inversion operation for obtaining a masked representation with an associated confidentiality level.
According to an alternative example, if the set P of possible elements pi in a user profile is countable and finite, equal to a number G of possible elements, in step 32 the initial representation of the user profile is obtained by generating an initial data structure in the form of a vector with size G, indicating the presence or the absence of an element in the user profile: PN=[b1, . . . , bG] wherein bi=1 if the element pi is present in the user profile, and bi=0 if the element pi is absent from the user profile.
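For this finite case, the presence vector can be sketched in a few lines. The name `presence_vector` and the representation of the set of possible elements as an ordered list are assumptions made for the illustration.

```python
from typing import List, Sequence


def presence_vector(present: Sequence[str], universe: Sequence[str]) -> List[int]:
    """Return PN = [b1, ..., bG] with bi = 1 if the i-th possible element is present."""
    present_set = set(present)
    return [1 if element in present_set else 0 for element in universe]
```

Unlike the Bloom filter, this representation has no collisions, but its size G grows with the number of possible elements.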
In this example, the step 34 for applying a probabilistic inversion operation is applied on this vector PN, in the way explained above. The step 34 is followed by an optional step, not shown in
In order to carry out the estimation of similarity, the data structures calculated as detailed above with reference to
The similarity estimation method is applied by a central processing unit 16 of a programmable device 10 associated with a node A.
In a first step 40, the node A receives, from node B, a masked data structure P*B representative of the elements present in the user profile of the node B. As explained earlier, the masked data structure or public profile P*B of the node B has a confidentiality guarantee with a pre-determined confidentiality level.
Next, in step 42, the node A retrieves an initial data structure PA representative of the user profile associated with this node A. It is not necessary to use the masked data structure P*A, since the node A does not need to guarantee confidentiality insofar as the computations are carried out on this node.
In practice, the initial data structure PA is computed as explained above with reference to
In the example, the representations of data are calculated by applying Bloom filtering, and then by applying a probabilistic inversion operation in order to obtain a masked data structure.
In the next step 44, a computation for estimating the similarity between the user profiles of the nodes A and B is carried out. As the node A does not have any private profile of the node B, PB, only an estimation of similarity from both present data structures, PA and P*B respectively, may be computed.
According to the embodiment, the estimator of the similarity between the node A and the node B is computed from the scalar product between PA and P*B. For binary vectors, the scalar product corresponds to a cosine similarity measurement up to normalization by the vector norms.
If it is noted that PA={b0, . . . , bM-1} and P*B={v0, . . . , vM-1}, the scalar product SP is the sum, for i ranging from 0 to M−1, of the products bi·vi:
In order to obtain an unbiased estimator, SP*, i.e. an estimator for which the expectation value is equal to the expectation value of the scalar product between PA and PB, the following formula is applied:
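The formula is not reproduced in this text. As a worked sketch, one estimator with the stated property can be derived under the following assumptions: the bits of the private profile PB are noted c_i (a notation introduced here), each of them is flipped independently with the same probability p ≠ 1/2, and SP denotes the scalar product between PA and P*B.

```latex
% Expected value of a masked bit v_i given the private bit c_i.
\[
\mathbb{E}[v_i] = (1-p)\,c_i + p\,(1-c_i) = p + (1-2p)\,c_i .
\]
% Expected value of the scalar product between P_A and P^*_B.
\[
\mathbb{E}[SP] = \sum_{i=0}^{M-1} b_i\,\mathbb{E}[v_i]
             = p \sum_{i=0}^{M-1} b_i + (1-2p) \sum_{i=0}^{M-1} b_i\,c_i .
\]
% Removing the bias and rescaling gives an unbiased estimator.
\[
SP^{*} = \frac{SP - p \sum_{i=0}^{M-1} b_i}{1-2p},
\qquad
\mathbb{E}[SP^{*}] = \sum_{i=0}^{M-1} b_i\,c_i .
\]
```

The correction only uses quantities available on node A: its own private bits b_i, the received masked bits v_i, and the flip probability p, which is assumed to be public.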
Alternatively, an unbiased estimator is calculated on the basis of a binary sum BS between the data structures representative of the profiles PA and P*B:
Thus, a similarity estimation between nodes A and B is obtained on node A, from the public profile of node B and from the private profile of node A.
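The computation on node A can be sketched as follows. The estimator used here, (SP − p·Σbi)/(1−2p), is an assumption derived from the fact that a bit flipped with probability p has expectation p+(1−2p)·ci for a private bit ci; it is offered as an illustration and not as the formula of the invention. The function names are hypothetical.

```python
from typing import List


def scalar_product(a: List[int], b: List[int]) -> int:
    """Scalar product SP between two bit vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def unbiased_similarity(private_a: List[int], public_b: List[int], p: float) -> float:
    """Estimate the scalar product between PA and PB from PA and P*B.

    Assumption: every bit of PB was flipped independently with probability p.
    """
    if len(private_a) != len(public_b):
        raise ValueError("both data structures must have the same size M")
    if p == 0.5:
        raise ValueError("p = 1/2 carries no information about PB")
    sp = scalar_product(private_a, public_b)
    return (sp - p * sum(private_a)) / (1.0 - 2.0 * p)
```

With p=0 the function returns the exact scalar product, since P*B is then identical to PB.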
Alternatively, each of the nodes A and B sends its public profile, P*A and P*B respectively, to a central server or to a third party node, different from the nodes A and B, which carry out a similarity estimation calculation between the user profiles of the nodes A and B. In this case, it is also possible to obtain an unbiased estimator SP* for the calculation of similarity, wherein: P*A={u0, . . . , uM-1} and P*B={v0, . . . , vM-1}:
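The formula is not reproduced in this text. As a worked sketch, an unbiased estimator for this case can be derived under the following assumptions: the private bits of PA and PB are noted b_i and c_i (the latter notation is introduced here), both profiles are masked with independent draws and the same flip probability p ≠ 1/2.

```latex
% Centering each masked bit removes the bias introduced by the flip.
\[
\mathbb{E}[u_i - p] = (1-2p)\,b_i ,
\qquad
\mathbb{E}[v_i - p] = (1-2p)\,c_i .
\]
% The draws for A and B are independent, so the expectations multiply.
\[
\mathbb{E}\big[(u_i - p)(v_i - p)\big] = (1-2p)^2\, b_i\,c_i .
\]
% Summing over all positions and rescaling gives an unbiased estimator.
\[
SP^{*} = \frac{1}{(1-2p)^2} \sum_{i=0}^{M-1} (u_i - p)(v_i - p),
\qquad
\mathbb{E}[SP^{*}] = \sum_{i=0}^{M-1} b_i\,c_i .
\]
```

The third party only needs the two public profiles and the value of p. The variance of this estimator grows as p approaches 1/2, since the factor 1/(1−2p)² amplifies the noise of both masked profiles.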
Advantageously, the various similarity estimators explained above perform well at identifying nodes similar to a given node, i.e. nodes having a user profile close to the user profile of the given node.
Thus, by means of the invention, a user profile is both masked in order to guarantee its confidentiality and to ensure that it may be distributed to a third party without any risk of disclosing its private data, while remaining useable for similarity calculations between user profiles.
Number | Date | Country | Kind |
---|---|---|---|
12 52716 | Mar 2012 | FR | national |
This application is the National Stage of International Application PCT/EP2013/056133, filed Mar. 22, 2013, which published as WO 2013/144031 on Oct. 3, 2013. The PCT application claims priority to French Patent Application Serial No. 12 52716, filed Mar. 27, 2012. All of the above applications are incorporated by reference in their entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2013/056133 | 3/22/2013 | WO | 00 |