Recommender systems are fast becoming one of the cornerstones of the Internet. In a world of ever-increasing choice, they are among the most effective ways of matching users with items. Today, many websites use some form of such systems. Recommendation engines ask users to rate certain items, e.g. books, movies or music, infer the ratings of other items from this data, and use these inferred ratings to recommend new items to users. However, the collected data, even if anonymized, can be de-anonymized and thus compromise the privacy of the users who submitted ratings.
In order to improve the privacy of users providing ratings, a new approach referred to as differential privacy has been developed. It recognises that, although perfect privacy cannot be had, a user may be put in control of the information provided by deliberately corrupting sensitive data before submitting ratings, and by declaring that this is done.
In the context of recommender systems there are essentially two models of providing differential privacy: centralised and local. Under the centralised model the recommender system must be trusted to collect data from users and to respond to queries by publishing either only aggregate statistics or sensitive data that has been corrupted through a mechanism obeying the differential privacy constraint. Under centralised differential privacy the ratings are stored in the recommending device, and the user of such a system must trust the operator that privacy is guaranteed.
US2011/0064221 discloses a differential privacy recommendation method following the centralised model in which noise is added centrally in the correlation engine only after the correlation has been done.
Further information on the design of recommender systems with differential privacy under the centralised model can be found in F. McSherry and I. Mironov, “Differentially Private Recommender Systems: Building Privacy into the Netflix Prize Contenders”, KDD 2009, pp. 627-636.
WO 2008/124285 suggests a recommendation system using a sample group of users that is a sub-group of a larger group, for rating items. The rated items are suggested to other users if the rating from the sub-group exceeds a preset value.
US 2011/0078775 discloses adjusting a trust value associated with content through a trust server based on locally collected credibility information.
Users increasingly desire control over their private data given the mistrust towards systems with centrally stored and managed data.
The present invention provides improved privacy under the local model, in which users store their data locally and differential privacy is ensured through randomization under the control of the user before data is submitted to the recommender system. The invention provides a user-adjustable degree of privacy while still allowing recommendations to be generated with reasonable accuracy.
In accordance with one aspect of the invention, user ratings of items, generated under privacy at the user's location, are requested by a content provider or a recommendation engine for making recommendations to other users. The request is made as a bulk request, i.e. a list of items, without knowledge of which, or indeed whether any, of the items in the list have been “consumed” or rated by the user. In this context “consumed” includes any activity involving an item that allows for rating the item. The user's device provides an aggregate rating in response. In order to provide a certain degree of privacy the user can add a value indicating the likelihood that the returned rating is exact and reliable, i.e. a “trust value”. The trust value may be provided for individual items in the rated list, or as a global value for the entire list. It is also possible to have a user's device add trust values randomly, either to individual items or to the entire list. It is likewise conceivable that a user presets a range of trust values, from which the user's device randomly picks trust values for privatizing the ratings. This adds a form of noise to the ratings, which in turn enhances the user's privacy.
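The following minimal sketch illustrates how such user-side privatization could look for binary “like/dislike” ratings; the function name and the interpretation of the trust value as a per-item probability of reporting the true rating are illustrative assumptions, not prescribed by the specification.

```python
import random

def privatize_ratings(ratings, trust):
    """Privatize a bulk rating response on the user's device.

    ratings: dict mapping item id -> 0/1 ("dislike"/"like") for items
             the user chose to answer; unanswered items are omitted.
    trust:   probability in (0.5, 1] that a reported rating is true;
             with probability 1 - trust the rating is flipped.
    """
    privatized = {}
    for item, value in ratings.items():
        if random.random() < trust:
            privatized[item] = value       # report the true rating
        else:
            privatized[item] = 1 - value   # report the complement
    return privatized, trust               # the trust value accompanies the list

# The true ratings stay on the device; only the noisy copy is submitted.
response, declared_trust = privatize_ratings({1: 1, 2: 0, 3: 1}, trust=0.8)
```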
The content provider receives the rated item lists and the associated trust values from a plurality of users and clusters items from the received lists into clusters of items similarly “liked” by a number of users, in accordance with their ratings and the “trust value”. The content of each cluster is likely to be liked by the set of users who also liked other contents within this cluster. The content provider or recommendation engine returns the results of the clustering to the users. A user's device locally stores the response it provided to the content provider or to the recommendation engine and uses this reliable information for extracting items contained in the cluster, such that recommendations can be made to each individual user.
Aspects of the invention pertain to improving the accuracy of content recommendation depending on whether the basic data used for clustering originates from an information-scarce or an information-rich environment, while maintaining improved privacy.
Information-rich in this context relates to a situation in which an item is known to have been rated, i.e. the identity of the rated item is public, while the rating itself is privatized.
Information-scarce as used in the present specification relates to an environment in which a user provides ratings only for a small, and possibly varying, fraction of the content items of a set of content items. Also, no information may be provided whether a user has consumed an item at all. In this context [N] is used as a reference to a set {1, 2, . . . , N} of N ordered items, and [U] is used as a reference to a set of U users. The set of users is divided into K clusters labelled C_u = {1, 2, . . . , K}, where cluster i contains α_i U users. Similarly, the set of items is divided into L clusters C_n = {1, 2, . . . , L}, where cluster l contains β_l N items. A denotes the matrix of user/item ratings, each row corresponding to a user and each column corresponding to an item. For simplicity it is assumed that A_ij ∈ {0, 1}. This could, for example, correspond to ‘like/dislike’ ratings.
The following statistical model is used for the ratings: for user u ∈ [U] with user class k, and item n ∈ [N] with item class l, the rating A_un is given by a Bernoulli random variable A_un ∼ Bernoulli(b_kl), where the ratings by users in the same class, and for items in the same class, are independent and identically distributed (i.i.d.). Finally, for modelling limited information, i.e. the fact that users rate only a small fraction of all items, a parameter ω represents the number of items each user has rated. For the purpose of explanation within this specification the parameter ω is assumed to be constant. However, ω could as well be a lower or upper bound on the number of items rated. Likewise, for the purpose of explanation within this specification it is assumed that the set of rated items is picked uniformly at random. The expression ω = Ω(N) represents an information-rich environment and the expression ω = o(N) represents an information-scarce environment.
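As a hedged illustration, the stated model can be simulated as follows; the Dirichlet sampling of the cluster proportions α and β and all concrete sizes are arbitrary choices made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
U, N, K, L = 100, 50, 3, 4                  # users, items, user/item cluster counts

alpha = rng.dirichlet(np.ones(K))           # user-cluster proportions alpha_k
beta = rng.dirichlet(np.ones(L))            # item-cluster proportions beta_l
b = rng.uniform(size=(K, L))                # Bernoulli parameters b_kl

user_class = rng.choice(K, size=U, p=alpha)
item_class = rng.choice(L, size=N, p=beta)

# A_un ~ Bernoulli(b_kl), with k the user's class and l the item's class
A = rng.binomial(1, b[user_class[:, None], item_class[None, :]])
```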
A randomized function ψ: X → Y that maps data x ∈ X to y ∈ Y is said to be ε-differentially private, or ε-DP, if for all values y ∈ Y in the range space of ψ, and for all ‘neighbouring’ data x, x′, the following inequality holds:

Pr[ψ(x) = y] ≤ e^ε · Pr[ψ(x′) = y].
The definition of ‘neighbouring’ data is chosen according to the situation and determines the properties of the data that remain private. Two databases are said to be neighbours if the larger database is constructed by adding a single row to the smaller database; if the rows correspond to the data of a single individual, then differential privacy can be thought of as guaranteeing that the output does not reveal the presence or absence of any single individual in the database. Similarly, in the context of rating matrices, two matrices can be neighbours if they differ in a single row, corresponding to per-user privacy, or if they differ in a single rating, corresponding to per-rating privacy.
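As a worked example of the definition under the per-rating notion of neighbouring data, consider randomized response on a single rating bit, which matches the flip-with-fixed-probability privatization used for the sketches later in this specification:

```latex
% Randomized response on one bit x in {0,1}:
\psi(x) =
\begin{cases}
x   & \text{with probability } \dfrac{e^{\varepsilon}}{1+e^{\varepsilon}},\\[4pt]
1-x & \text{with probability } \dfrac{1}{1+e^{\varepsilon}},
\end{cases}
\qquad
\frac{\Pr[\psi(x)=y]}{\Pr[\psi(x')=y]}
\;\le\; \frac{e^{\varepsilon}/(1+e^{\varepsilon})}{1/(1+e^{\varepsilon})}
\;=\; e^{\varepsilon}.
```

The mechanism is therefore ε-DP, and a declared trust value t = e^ε/(1 + e^ε) corresponds to ε = ln(t/(1 − t)).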
Recommender systems generate predictions of user preference by determining correlations between the limited data, i.e. items and ratings, provided by users, and using this limited data to infer information about unknown ratings. One way to do this is by clustering the items into a set of classes, e.g. by finding and using correlations between the rankings given to each item by a multiplicity of users, and then releasing this classification. The items considered by the present recommender system are thus assumed to have a cluster structure, i.e. items may be clustered in accordance with specific item properties. Using this classification, and using their own personal rankings, users can determine their potential ratings for new content, taking into account algorithms which are used by the recommender system for determining the classification.
The recommendation method in accordance with the invention comprises two main phases: a learning phase and a recommendation phase. The learning phase is performed collaboratively with privacy guaranteed, while the actual recommendation is performed by the users' devices, based upon a general recommendation that is a result of the learning phase, without additional interaction with the system.
In the learning phase the clusters are populated. In the recommendation phase the populated clusters are revealed to the users. Each user can then derive recommendations based upon the populated clusters and locally stored data about the user's ratings of items and respective trust values associated with rated items.
In a hypothetical information-rich environment a minimum number U_LB of users required for learning a concept class C under ε-differential privacy can be calculated as
In the following the quantity log |C| is referred to as the complexity of concept class C.
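As a hedged illustration of the shape such a bound takes, sample-complexity lower bounds for privately learning a finite concept class C typically scale linearly in the complexity log |C| and inversely in the privacy parameter ε; the exact exponents depend on the privacy model (central vs. local) and on the accuracy parameters, so the following is indicative only:

```latex
% Indicative form only; exponents depend on the privacy model and accuracy.
U_{LB} \;=\; \Omega\!\left(\frac{\log\lvert C\rvert}{\varepsilon}\right)
```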
In order to derive recommendations, a low-dimensional clustering of the set of items in dependence of the user rating matrix is required. In this context the concept class is the set of mappings from items [N] to cluster labels [L], L typically being much smaller than N, having a complexity of N log L ≈ O(N). It is further assumed that each user has only rated ω items out of the possible N items. In the information-rich environment, with ω = Ω(N), in which each user has rated a constant fraction of all items, the minimum required number of users U_LB for a local differential privacy algorithm based on spectral clustering techniques, in particular based on the Pairwise-Preference Algorithm, is calculated using the following equation:
It is to be noted that in this environment only the user ratings are private, while the identity of any item rated by the user is not.
However, in most practical scenarios the information-rich assumption may not be valid. For example, in a movie rating system most users usually have seen and rated only a small fraction of the set of movies. Therefore, in another aspect of the present invention, local differential privacy in an information-scarce environment with ω = o(N) is considered. In the information-scarce environment the minimum number of users U_LB required can be determined using the following equation:
For the special case of ω = o(N^(1/3)) the previous equation can be simplified into
In accordance with the invention a user is provided with a list of items to be rated. These items could be, for example, content items as found in audio-visual content. However, the items could as well relate to anything else that can be electronically rated. The user rates at least some of the items in privacy, e.g. by using a device to which the user has exclusive access. The rating of an item list by the users is a form of “bulk” rating, without knowledge of which, or indeed whether any, of the items in the list have been rated, or consumed, by the user. The user then selects a degree of trust that is assigned to individual ratings or assigned globally to all items in the list of items, thereby “privatizing” the ratings. The degree of trust is a value indicating the likelihood that the returned rating is exact and reliable. This adds noise to the ratings, which enhances the user's privacy. The rated and “privatized” list is submitted to a recommender device; the returned rating can be considered an aggregate rating. The recommender device may be operated by the provider of the items to be rated, or by a third party providing a recommendation service. The recommender device collects rated lists of items from a predetermined minimum number of users and performs a clustering on the lists, clustering items of similar content and/or similar rating. The clustering takes into account the degree of trust that has been assigned to the items or to the lists. The content of each cluster is likely to be liked by the set of users who also liked other contents within this cluster. The clustered items are used to generate a list of recommended items, which is returned to the users that had provided their rating on one or more items.
However, not every user whose “bulk” response has been used in creating the cluster has actually “consumed” the content. In accordance with the response from an individual user, and the content items contained in the cluster, recommendations can be made to that individual user. A version of the rated list prior to “privatization” may be stored locally in the user's device. This knowledge may be used to refine the recommendation to each individual user, deleting items from the recommendation that have already been consumed. For example, a movie that a user has watched and rated is not recommended again, even if it is on a recommendation list, based on the non-privatized knowledge in the user's device. Storing true ratings and related data only locally in the user's device, and providing “bulk” ratings, is one aspect that provides the desired degree of privacy. Assigning a trust value at the user's side and providing the ratings only after assigning the trust value is a further aspect that provides the desired degree of privacy. It can be compared to adding noise that “hides” data a user wishes to keep private, and at the same time it reduces the confidence that a user must have in the provider of the recommendation service.
In one aspect of the invention the “consumption” habits of the users are not considered private, while their ratings are, i.e. the information whether a user “consumed” an item from the list of items is public and true, while the information how a user actually rated the item is private. This is applicable in many scenarios, as in order to “consume” an item, e.g. view a movie, the user most likely will leave a trail of information.
For example, if the recommendation engine is connected to a movie content provider system, then the recommendation engine has knowledge of what content the user has watched. In this context, in accordance with an exemplary embodiment of the invention, each user is first presented two movies, both of which the user has watched. These two movies may be picked at random. Next, the user converts the ratings for these two movies to normalized ratings that lie in {0, 1}. This may be done by a scheme which maps each possible rating to {0, 1} according to a randomized function, which can be user-preset in the user's device, or user-controlled for each rating. Then a private ‘sketch’ is determined, which may consist of a single bit. This bit is set to 1 if both movies have the same normalized rating, and to 0 otherwise. The sketch bit may also be referred to as a similitude indicator. More generally, the similitude indicator may indicate if the rating value of each item of the pair of items (A, B) is above a predetermined value. Next, the sketch bit is privatized according to a given procedure, e.g. a trust value is assigned in accordance with which the true sketch bit is provided to the recommendation device with a certain specified probability, or else the complement of the sketch bit is provided. This step ensures differential privacy. The recommendation engine then stores these ‘pairwise preference’ values in a matrix, and performs spectral clustering on this matrix to determine the movie clusters, which are then released. The spectral clustering may include adding the sketch values provided by the users for each identical pair.
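A minimal user-side sketch of this step, assuming binary normalized ratings and the randomized-response privatization described above (the function name is illustrative):

```python
import math
import random

def pairwise_sketch(normalized_ratings, pair, eps):
    """Compute and privatize the pairwise-preference sketch bit.

    normalized_ratings: dict item -> normalized rating in {0, 1}
    pair:               two watched items (a, b) presented to the user
    eps:                privacy parameter epsilon
    """
    a, b = pair
    sketch = 1 if normalized_ratings[a] == normalized_ratings[b] else 0
    trust = math.exp(eps) / (1.0 + math.exp(eps))   # probability of the true bit
    return sketch if random.random() < trust else 1 - sketch

# Both presented movies were 'liked': the true sketch is 1, and the
# released bit equals it with probability e^eps / (1 + e^eps).
released = pairwise_sketch({7: 1, 12: 1}, pair=(7, 12), eps=1.0)
```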
In the following, a more detailed mathematical representation of the preceding embodiment of the invention is presented. It is assumed that N items can be rated by U users. Each user u has a set of ω ratings (W_u, R_u), with W_u the set of rated items and R_u ∈ {0, 1}^ω. Each item i is associated with a cluster C_N(i) from a set of L clusters {1, 2, . . . , L}.
In accordance with the recommendation method, for each user u ∈ [U] a pair of items P_u = {i_u, j_u} is picked. Picking may be at random in the information-rich environment with ω = Ω(N). If W_u is known, a random set of two rated items may be picked. A user's device determines a private sketch that is 1 if the ratings of the two items are identical, or if the normalized ratings thereof are identical, and that otherwise is 0. If an item has no rating, the rating is assigned a value of 0. This private sketch, also referred to as S_u^0, is then privatized as follows: with probability e^ε/(1 + e^ε) the released sketch is S_u = S_u^0, and with probability 1/(1 + e^ε) the released sketch is S_u = 1 − S_u^0. Then a pairwise-preference matrix A is generated in accordance with

A_ij = Σ_{u ∈ [U] : P_u = {i, j}} S_u,

and the top L normalized eigenvectors x_1, x_2, . . . , x_L corresponding to the L largest-magnitude eigenvalues of matrix A are extracted. Each row (node) is projected into the L-dimensional profile space of the eigenvectors, and k-means clustering is performed in the profile space for obtaining the item clusters.
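An engine-side sketch of these steps, assuming one released pair and sketch bit per user; the matrix symmetrization and the use of scikit-learn's KMeans are implementation choices of this example:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_items(pairs, sketches, N, L):
    """Accumulate released sketch bits into the pairwise-preference
    matrix A and spectrally cluster the N items into L clusters."""
    A = np.zeros((N, N))
    for (i, j), s in zip(pairs, sketches):   # one (pair, sketch) per user
        A[i, j] += s
        A[j, i] += s                         # keep A symmetric
    vals, vecs = np.linalg.eigh(A)           # A is symmetric
    top = np.argsort(np.abs(vals))[::-1][:L] # L largest-magnitude eigenvalues
    profile = vecs[:, top]                   # project each item into R^L
    return KMeans(n_clusters=L, n_init=10).fit_predict(profile)

labels = cluster_items([(0, 1), (1, 2), (0, 2)], [1, 1, 0], N=4, L=2)
```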
In another aspect of the invention both the “consumption” habits, i.e. whether or not an item has been consumed, and the ratings of the users are considered private information. In order to produce meaningful recommendations, this information-scarce environment requires collecting information on a much larger set of items from each user than the two items sufficient in the preceding example for the information-rich environment. In this case a user's device is presented with a sensing vector of binary values. Each vector component corresponds to an item from the set of items. For example, in terms of the preceding paragraph, each item corresponds to a movie. The sensing vector determines whether a movie from the set of movies is probed. A movie may be probed, for example, if the corresponding component of the vector is ‘1’. The user's device, in the same manner as described in the previous paragraph, converts all ratings to normalized ratings for the probed movies, and sets the normalized rating for unwatched movies to 0. Next, the user's device calculates the maximum normalized value among the probed movies, i.e. movies for which the corresponding component in the sensing vector was ‘1’, and releases a privatized version of this value. The privatization mechanism is the same as before, i.e. the logical value is flipped or not in accordance with the probability, or trust value, assigned. A sketch bit indicates if a movie has the same normalized rating as the maximum normalized value determined before. A trust value indicates the probability of the sketch bit being true. The recommendation engine then combines these privatized values in the following way: for each movie, the engine calculates the sum of the privatized user responses over all users for whom the corresponding movie was probed, i.e. for whom the corresponding element in the sensing vector was set to ‘1’. Next, the engine uses these ‘movie sums’ to perform k-means clustering in one dimension, and returns the so-obtained movie classifications.
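A user-side sketch of this mechanism, under the same binary-rating and randomized-response assumptions as before:

```python
import math
import random

def sensing_response(watched, normalized_ratings, sensing, eps):
    """Answer one binary sensing vector privately.

    watched:            set of item indices the user has consumed
    normalized_ratings: dict item -> normalized rating in {0, 1}
    sensing:            list of N bits; item i is probed when sensing[i] == 1
    eps:                privacy parameter epsilon
    """
    # Unwatched (and unprobed) items contribute 0; take the maximum
    # normalized rating over the probed items.
    sketch = max(
        (normalized_ratings.get(i, 0) if i in watched else 0) if probe else 0
        for i, probe in enumerate(sensing)
    )
    trust = math.exp(eps) / (1.0 + math.exp(eps))
    return sketch if random.random() < trust else 1 - sketch

# Items 2 and 5 are probed; only item 2 was watched and 'liked'.
s_u = sensing_response({2}, {2: 1}, [0, 0, 1, 0, 0, 1], eps=1.0)
```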
In a variant of the preceding embodiment the sensing vectors may also be randomly generated by the users. In this case the vector must be provided to the recommendation engine in addition to the privatized rating. In either case differential privacy is maintained as long as the sensing vector is independent of the set of items actually rated by the user.
In the following, a more detailed mathematical representation of the preceding embodiment of the invention is presented. As before, it is assumed that N items can be rated by U users. Each user u has a set of ω ratings (W_u, R_u), with W_u the set of rated items and R_u ∈ {0, 1}^ω. Each item i is associated with a cluster C_N(i) from a set of L clusters {1, 2, . . . , L}.
In accordance with the recommendation method, for each user u ∈ [U] a sensing vector H_u ∈ {0, 1}^N is generated, the component H_ui being a ‘probe’ for item i given by

H_ui ∼ Bernoulli(p),

with the probe probability p depending on a chosen constant θ.
Then a user's device determines the maximum rating among the ratings for items probed by the sensing vector in accordance with

S_u^0(W_u, R_u, H_u) = max_{i ∈ [N]} H_ui · R̂_ui,

with R̂_ui = R_ui if i ∈ W_u, and 0 otherwise. The result is a vector indicating all items that have identical maximum ratings. Each user's device provides a privatized version of this vector, the vector including privatized ratings S_u for each item. Privatization is achieved, as discussed before, by assigning a trust value to the rating. It is noted that in a variant of this embodiment the vector may indicate those items that have a rating above a predetermined threshold. Likewise, in this embodiment as well as in the previously discussed embodiments, normalization may not be needed, depending on the items' features and/or properties considered when rating.
In other words: the sensing vector H_u essentially asks for the ratings of those items whose indices are 1 in this vector. The user device constructs a sketch S^0, which is 1 if at least one of these items is rated 1, and 0 otherwise (similarly for normalized ratings). The user device then privatizes this sketch as follows: it outputs the sketch S = S^0 with probability e^ε/(1 + e^ε), ε being the privacy parameter, and with probability 1/(1 + e^ε) it flips the sketch and outputs S = 1 − S^0.
The recommendation engine then determines a count B_i in accordance with

B_i = Σ_{u ∈ [U]} H_ui · S_u

and performs a k-means clustering on the counts B_i.
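An engine-side sketch of the counting and one-dimensional clustering steps (again using scikit-learn's KMeans as an illustrative choice):

```python
import numpy as np
from sklearn.cluster import KMeans

def item_clusters_from_counts(sensing_vectors, responses, L):
    """B_i = sum over users of H_ui * S_u, then 1-D k-means on the counts."""
    H = np.asarray(sensing_vectors)   # shape (U, N), one sensing vector per user
    S = np.asarray(responses)         # shape (U,), one privatized sketch per user
    B = H.T @ S                       # count B_i for every item i
    return KMeans(n_clusters=L, n_init=10).fit_predict(B.reshape(-1, 1))

labels = item_clusters_from_counts([[1, 0, 1], [0, 1, 1]], [1, 0], L=2)
```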
In yet another variant the user's device is presented with a multiplicity of sensing vectors, each sensing vector being determined to cover a randomly chosen subset of the entire set of items. Ratings and privatization are processed as described before. This variant also allows for sensing vectors randomly generated at the user's device, the sensing vectors not being private information and being transmitted to the recommendation engine.
In the following, a more detailed mathematical representation of the preceding variant of the invention is presented. As before, it is assumed that N items can be rated by U users. Each user u has a set of ω ratings (W_u, R_u), with R_u ∈ {0, 1}^ω. The number of sensing vectors presented to a user's device is Q. Each item i is associated with a cluster C_N(i) from a set of L clusters {1, 2, . . . , L}.
First, for each user u ∈ [U], Q sensing vectors H^(u,q) ∈ {0, 1}^N are generated, each vector being generated by choosing Np items uniformly, without replacement and non-overlapping with the other sensing vectors. As before, the probability p of an item being probed in any one sensing vector depends on the chosen constant θ.
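A sketch of one way to generate such non-overlapping vectors, by slicing a single random permutation of the items (an illustrative construction; the specification only requires uniformity and disjointness):

```python
import numpy as np

def disjoint_sensing_vectors(N, Q, p, rng=None):
    """Generate Q non-overlapping sensing vectors, each probing Np items
    chosen uniformly without replacement."""
    rng = rng or np.random.default_rng()
    per_vector = int(N * p)
    assert Q * per_vector <= N, "disjoint vectors cannot cover more than N items"
    order = rng.permutation(N)              # one shared uniform random order
    H = np.zeros((Q, N), dtype=int)
    for q in range(Q):
        H[q, order[q * per_vector:(q + 1) * per_vector]] = 1
    return H

H = disjoint_sensing_vectors(N=20, Q=4, p=0.2)   # four vectors of four items each
```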
For each of the Q sensing vectors, the user's device returns a privatized response S(u, q) to the recommendation engine.
For each item i the recommendation engine determines a count B_i from the privatized responses in accordance with

B_i = Σ_{u ∈ [U]} Σ_{q ∈ [Q]} H^(u,q)_i · S^(u,q)

and performs k-means clustering on the counts B_i.
In the following, the invention will be further elucidated with reference to the drawings.
The recommendations are transmitted to the user devices, which present the recommendations to the user, optionally refined by taking locally stored ‘true’ and trusted information into account.