The present invention is related to protecting privacy information while allowing a recommender to provide relevant personalized recommendations.
Several recent publications study the threat of inferring demographics from user-generated data. Closest to the present invention, Weinsberg et al., BlurMe: inferring and obfuscating user gender based on ratings, Proceedings of the Sixth ACM Conference on Recommender Systems, 2012, shows that gender can be inferred from movie ratings and proposes heuristics for mitigating the resulting privacy risk. However, Weinsberg's proposed obfuscation method specifically targets a logistic regression method for inferring gender. In contrast, the present invention follows a principled approach, allowing strong privacy guarantees to be proven against an arbitrary inference method.
The definition of privacy in the present invention is motivated by, and a limiting case of, the notion of differential privacy. Differential privacy has been applied to fields such as data mining, social recommendations and recommender systems. These works assume a trusted database owner and focus on making the output of the application differentially private. In contrast, in the present invention, a setup is studied where the recommender is curious, and users wish to protect against statistical inference of private information from feedback they submit to the recommender.
Several theoretical frameworks that model privacy against statistical inference under accuracy constraints exist. These approaches assume a general probabilistic model linking private and non-private variables, and ensure privacy by distorting the non-private variables prior to their release. Although general, the application of these frameworks requires knowledge of the joint distribution between private data and data to be released, which may be difficult to obtain in a practical setting. The assumption of a linear model in the present invention, which is strongly supported by empirical evidence, renders the problem tractable. Most importantly, it allows the method of the present invention to characterize the extent of data disclosure necessary on the recommender's side to achieve an optimal privacy-accuracy trade-off, an aspect that is absent from all of the aforementioned works.
Recommender systems can infer demographic information such as gender, age or political affiliation from user feedback. The present invention proposes a framework for data exchange protocols (steps, acts) between recommenders and users, capturing the tradeoff between the accuracy of recommendations, user privacy and the information disclosed by the recommender.
The present invention allows a user to communicate a distorted version of his/her ratings to a recommender system, in such a way that the recommender has no way of inferring some demographic information the user wishes to hide, while allowing the recommender to still provide relevant, personalized recommendations to the user.
Users of online services are routinely asked to provide feedback about their experience and preferences. This feedback can be either implicit or explicit, and can take numerous forms, from a full review to a five-star rating, to choices from a menu. Such information is routinely used by recommender systems to provide targeted recommendations and personalize the content that is provided to the user. Often, the statistical methods used to generate recommendations produce a user ‘profile’ or feature vector. Such a profile can expose personal information that the user might consider private, such as their age, gender, and political orientation. This possibility has been extensively documented on public datasets. Such a possibility calls for mechanisms that allow privacy-conscious users to benefit from recommender systems, while also ensuring that information they wish to protect is not inadvertently disclosed or leaked through their feedback, thereby incentivizing user participation in the service.
A common approach to reducing such disclosure or leakage is by distorting the feedback reported to the recommender. There is a natural tradeoff between recommendation quality and user privacy. Greater distortion may lead to better obfuscation but also less accurate profiles. A contribution of the present invention is to identify that there is a third term in this tradeoff, which is the data the recommender discloses to the users in order to obscure their private values. To illustrate this, notice that absolute privacy could be achieved if the recommender discloses to the user all of the data and algorithms used to produce a user profile. The user may then be able to run a local copy of the recommendation system without ever sending any feedback to the recommender. This is clearly private. However, it is also untenable from the recommender's perspective, both for practical reasons (efficiency and code maintenance) and crucially, for commercial reasons since the recommender may be charging a fee, monetizing both the data that it has collected and the algorithms that it has developed. Disclosing the data and algorithms to the user or possible competitors is clearly a disadvantage.
On the other hand, some data disclosure is also necessary. If a user wishes to hide his/her political affiliation prior to releasing his/her feedback, knowledge of any bias that political affiliation introduces into the feedback can be used by the user to negate this effect. The recommender, having detected such bias from collected data, can reveal it to privacy-conscious users.
This state of affairs raises several questions. What is the minimal amount and nature of information the recommender needs to disclose to privacy-conscious users to incentivize their participation? How can this information be used to distort one's feedback, to protect one's private features (such as gender, age, political affiliation, etc.) while allowing the recommender to estimate the remaining non-private features? What estimation method yields the highest accuracy when applied to distorted feedback? The present invention proposes a formal mathematical framework for addressing the above questions, encompassing three protocols:
(a) Data disclosure in which the recommender engages
(b) The obfuscation method applied to the user's ratings, and
(c) The estimation method applied to infer the non-private user features.
The specific implementation of the above three protocols provides perfect protection to the user's private information, while also ensuring that the recommender estimates non-private information with the best possible accuracy. Crucially, the data disclosure of the recommender is minimal: no smaller disclosure can lead to an accuracy equal to or better than that of the proposed implementation.
The proposed protocols were evaluated on real datasets establishing that they indeed provide excellent privacy guarantees in practice, without significantly affecting the recommendation accuracy.
A method and apparatus for protecting user privacy in a recommender system are described including determining what information to release to a user for a movie, transmitting the information to the user, accepting obfuscated input from the user and estimating the user's non-private feature vector. Also described are a method and apparatus for protecting user privacy in a recommender system including receiving movie information, accepting a user's movie feedback, accepting the user's private information, calculating an obfuscation value and transmitting the obfuscation value.
The present invention is best understood from the following detailed description when read in conjunction with the accompanying drawings. The drawings include the following figures briefly described below:
FIGS. 1(a) and 1(b) show the distribution of inference probabilities for males and females before obfuscation and after the standard obfuscation scheme with selection, using the MovieLens dataset and logistic inference.
FIG. 1(c) shows the RMSE-AUC tradeoff.
The setup considered in the present invention comprises a recommender and a user. The recommender solicits user feedback on items which, for the sake of concreteness, are referred to as 'movies'. The user's feedback (e.g., 1-5 star ratings) for each item is sampled independently from a probability distribution parameterized by two vectors: a movie profile vi and a user profile x. The user profile x is of the form (x0, x), where x0 is a binary feature that the user wishes to keep private (e.g., his/her gender), and x is a non-private component. It should be noted that though the user knows x0, he/she is unaware of x: this would be the case if, e.g., the features used by the recommender are unknown to the user, or are computed through a process such as matrix factorization and are therefore latent.
The recommender knows the movie profiles vi and wishes to learn the user's profile x. The recommender's purpose is to predict the user's feedback for other movies and make recommendations. The user wishes to benefit from recommendations, but is privacy-conscious with respect to his/her variable x0, and does not wish to release this to the recommender. To incentivize the user's participation, the goal of the present invention is to design a protocol for exchanging information between the recommender and the user that has three salient properties. Informally, the three salient properties are:
(a) At the conclusion of the protocol, the recommender estimates x, the non-private component of x, as accurately as possible.
(b) The recommender learns nothing about x0, the user's private variable.
(c) The user learns as little as possible about the movie profile vi of each item i.
The first property ensures that, at the conclusion of the protocol, the recommender learns the non-private component of a user's profile, and can use it to suggest new movies to the user, which enables the main functionality of the recommender. The second property ensures that a privacy-conscious user benefits from recommendations without disclosing his/her private variable, thereby incentivizing participation. Finally, the third property ensures that movie profiles are not made publicly available in their entirety. This ensures that the recommender's competitors cannot use profiles, whose computation requires resources and which are monetized through recommendations.
To highlight the interplay between these three properties, three "non-solutions" are discussed. First, consider the protocol in which the user discloses his/her feedback to the recommender "in the clear": this satisfies (a) and (c) but not (b), as it would allow the recommender to estimate both x and x0 through appropriate inference methods. In the second protocol, the recommender first reveals all movie profiles vi to the user; the user then estimates x locally, again through inference, and subsequently sends this estimate to the recommender. This satisfies (a) and (b), but not (c). Finally, the "empty" protocol (no information exchange) satisfies (b) and (c), but not (a).
More specifically, it is assumed that the user is characterized by a feature vector x ∈ ℝ^{d+1}. This feature vector has one component that corresponds to a characteristic that the user wants to keep private. It is assumed that this feature is binary, the generalization to multiple binary features being straightforward. Formally, x=(x0, x), where x=(x1, . . . , xd) ∈ ℝ^d and x0 ∈ {+1,−1} is the private feature. As a running example, it can be assumed that the user wants to keep private his/her gender, which is encoded as x0 ∈ {+1,−1}.
The recommender solicits feedback for M movies, whose set is denoted by [M]≡{1, . . . , M}. In particular, each movie is characterized by a feature vector vi=(vi0, vi) ∈ ℝ^{d+1}, where vi=(vi1, . . . , vid) ∈ ℝ^d. Attention is restricted to vectors vi whose non-private component (vi1, . . . , vid) is non-zero. The set of all such vectors is denoted by ℝ^{d+1}_{−0} ≡ {(v0, v) ∈ ℝ^{d+1} : v ≠ 0}, and the set of feature vectors of movies for which feedback is solicited by ν ≡ {vi, i ∈ [M]} ⊂ ℝ^{d+1}_{−0}.
It is assumed that the recommender maintains the feature vectors in a database. Constructing such a database is routinely done by recommender algorithms. Features are typically computed through a combination of matrix factorization techniques (and are, hence, latent), as well as explicit functions of the movie descriptors (such as, e.g., genres, plot summaries, or the popularity of cast members). In both cases, these vectors (or even the features identified as relevant) can be used by a competitor, and are, hence, subject to non-disclosure.
The user feedback for movie i ∈ [M] is denoted by r_i ∈ ℝ. r_i is restricted to a specific bi-linear model, whose form is known to both the recommender and the user. In particular, let ⟨a,b⟩ ≡ Σ_{i=1}^{k} a_i b_i denote the usual scalar product in ℝ^k. It is assumed that there exists a probability distribution Q on ℝ such that, for all i ∈ [M]:

$$r_i \;=\; \langle v_i, x\rangle + z_i \;=\; \textstyle\sum_{k=1}^{d} v_{ik}\,x_k \;+\; x_0\,v_{i0} \;+\; z_i,\qquad z_i \sim Q,\qquad (1)$$

where the z_i are independent "noise" variables, with E(z_i)=0, E(z_i^2)=σ^2<∞.
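As an illustration, feedback generated according to the bi-linear model (1) can be simulated as follows (a minimal sketch; the dimensions, the Gaussian choice for Q, and all variable names are illustrative assumptions, not part of the description above):

```python
# Minimal sketch of the feedback model (1) with Gaussian noise standing in for Q.
import numpy as np

rng = np.random.default_rng(0)
d, M = 20, 100                      # latent dimension and number of movies (illustrative)
x0 = rng.choice([+1, -1])           # private binary feature (e.g., gender)
x = rng.normal(size=d)              # non-private user profile
V = rng.normal(size=(M, d))         # non-private movie features v_i
v0 = rng.normal(scale=0.2, size=M)  # per-movie bias v_{i0} of the private feature
sigma = 0.5                         # noise standard deviation

z = rng.normal(scale=sigma, size=M)
r = V @ x + x0 * v0 + z             # r_i = <v_i, x> + x0 * v_{i0} + z_i
```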
Despite its simplicity, this model is strongly supported by empirical evidence. Indeed, it is the underlying model for numerous prediction methods based on low-rank approximation, such as matrix factorization, singular value decomposition, etc. It should be noted that the restriction to movie vectors in ℝ^{d+1}_{−0} makes sense under (1): if the purpose of the recommender is to retrieve x, the feedback for a movie whose non-private component is zero is clearly uninformative.
The user does not have access to this database, and does not know the values of these feature vectors a priori. In addition, the user knows his/her private variable x0 and either knows or can easily generate his/her feedback r_i for each movie i ∈ [M]. Nevertheless, the user does not know a priori the remaining feature values x ∈ ℝ^d, as the "features" corresponding to each coordinate of vi are either latent or not disclosed.
The privacy preserving recommendation method and system of the present invention includes the following protocol between the user and the recommender, comprising three steps:
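(a) Data disclosure: the recommender computes l=L(ν) and transmits it to the user, where L is the data disclosure protocol applied to the movie database ν.
(b) Obfuscation: the user computes the obfuscated feedback y=Y(r, x0, l) from his/her ratings r, his/her private feature x0 and the disclosed information l, and transmits y to the recommender, where Y is the obfuscation protocol.
(c) Estimation: the recommender computes p(y, ν), an estimate of the non-private profile x, where p is the estimation protocol.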
The triplet R=(L,Y,p) is referred to as a recommendation system. Note that the functional forms of all three of these components are known to both parties; e.g., the recommender knows the obfuscation protocol Y. Both parties are honest but curious: both the recommender and the user follow the protocol, but if at any step either party can extract more information than what is intentionally revealed, they do so. Both protocols L and Y can be randomized. In the following, the probability and expectation with respect to the feedback model as well as the protocol randomization, given x and ν, are denoted by P_{x,ν} and E_{x,ν}, respectively.
Next, the basic quality metrics for a privacy-preserving recommendation system are defined: the accuracy of the recommendation system, the privacy of the user, and the extent of data disclosure, corresponding to the properties (a)-(c) discussed above.
Formalization of privacy for the obfuscated feedback Y is motivated by differential privacy. The context of the present invention differs from the prior art in that Y(r,x0,l) depends on x (through the feedback r), on l and on x0, but the present invention is only concerned with privacy with respect to the private information x0.
A recommendation system is ε-differentially private if the following holds for any x ∈ X and any ν ⊂ ℝ^{d+1}_{−0}. Let l=(l1, . . . , lM) denote the information disclosed from the database ν, and r ∈ ℝ^M the user feedback; then, for any event A in the output space of the obfuscation Y, the distribution of the obfuscated output may change by at most a factor of e^ε when the private feature x0 is flipped.
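In standard differential-privacy notation, with the obfuscated output Y(r, x0, l) and any event A in its range, this condition can be written as

$$P_{(+1,x),\nu}\big(Y(r,+1,l)\in A\big)\;\le\;e^{\varepsilon}\,P_{(-1,x),\nu}\big(Y(r,-1,l)\in A\big),$$

together with the symmetric inequality obtained by exchanging the roles of +1 and −1.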
It can be said that the system is privacy preserving, or private, if it is ε-differentially private with ε=0.
The focus of the present invention is on privacy preserving recommendation systems, i.e., systems for which ε=0. Intuitively, in a privacy preserving system the obfuscation Y is a random variable that does not depend on x0: the distribution of Y is the same irrespective of the user's gender. The second definition states that an estimator p has optimal accuracy if it reconstructs the user's non-private features with minimum l2 loss. This choice is natural; nevertheless, reasons for quantifying accuracy through the l2 loss are discussed in the supplement.
It can be said that a recommendation system R=(L,Y,p) is more accurate than R′=(L′,Y′,p′) if, for all ν ⊂ ℝ^{d+1}_{−0},

$$\sup_{x_0\in\{\pm 1\},\,x\in X} E_{(x_0,x),\nu}\big\{\|p(y,\nu)-x\|_2^2\big\}\;\le\;\sup_{x_0\in\{\pm 1\},\,x\in X} E_{(x_0,x),\nu}\big\{\|p'(y',\nu)-x\|_2^2\big\},$$

where y=Y(r,x0,L(ν)) and y′=Y′(r,x0,L′(ν)). Further, it can be said that R is strictly more accurate than R′ if the above inequality holds strictly for some ν ⊂ ℝ^{d+1}_{−0}.
Finally, an ordering between data disclosure protocols can be defined. Intuitively, a protocol L discloses as much information as L′ if L′ can be retrieved from L.
It can be said that the recommendation system R=(L,Y,p) discloses as much information as the system R′=(L′,Y′,p′) if there exists a measurable mapping φ from the range of L to the range of L′ such that L′=φ∘L (i.e., L′(v)=φ(L(v)) for each v ∈ ℝ^{d+1}_{−0}). It can be said that R=(L,Y,p) and R′=(L′,Y′,p′) disclose the same amount of information if L=φ∘L′ and L′=φ′∘L for some φ, φ′. Finally, it can be said that R=(L,Y,p) discloses strictly more information than R′=(L′,Y′,p′) if L′=φ∘L for some φ but there exists no φ′ such that L=φ′∘L′.
Below, it is shown that, under the linear model, the following recommendation system, referred to as the 'standard scheme', has optimality properties. In the standard scheme, the recommender discloses to the user the biases l_i = v_{i0}, i ∈ [M]; the user reports the obfuscated feedback y_i = r_i − x_0 v_{i0}, i ∈ [M]; and the recommender estimates the non-private profile through

$$p(y,\nu)\;\equiv\;\arg\min_{x\in\mathbb{R}^{d}}\;\sum_{i\in[M]}\Big(y_i-\textstyle\sum_{k=1}^{d}v_{ik}\,x_k\Big)^{2}.\qquad(3)$$
The estimator in (3) is referred to as the least squares estimator, and is denoted by pLS. It is noted that, under (1), the accuracy of the standard scheme is given by the following l2 loss: for all x ∈ ℝ^d,

$$E_{(x_0,x),\nu}\big\{\|p_{LS}(y,\nu)-x\|_2^2\big\}\;=\;\sigma^2\,\mathrm{tr}\Big[\Big(\textstyle\sum_{i\in[M]}v_i\,v_i^{T}\Big)^{-1}\Big],\qquad(4)$$

where σ^2 is the noise variance in (1), tr(·) is the trace, and the sum is over the non-private components (vi1, . . . , vid) of the movie profiles.
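The following sketch illustrates the standard scheme end to end on data simulated from (1): the user subtracts the disclosed bias, the recommender solves the least squares problem (3), and the empirical loss is compared with expression (4). All dimensions, the Gaussian noise, and variable names are illustrative assumptions.

```python
# Minimal sketch of the standard scheme on synthetic data from model (1).
import numpy as np

rng = np.random.default_rng(1)
d, M, sigma = 10, 500, 0.5
x0 = rng.choice([+1, -1])                  # private feature
x = rng.normal(size=d)                     # non-private profile
V = rng.normal(size=(M, d))                # non-private movie features
v0 = rng.normal(scale=0.2, size=M)         # disclosed biases v_{i0}

# Feedback generated according to (1)
r = V @ x + x0 * v0 + rng.normal(scale=sigma, size=M)

# Obfuscation: the user subtracts the disclosed bias times the private bit
y = r - x0 * v0

# Least squares estimator (3), computed by the recommender from (y, V)
p_ls, *_ = np.linalg.lstsq(V, y, rcond=None)

empirical_loss = np.sum((p_ls - x) ** 2)
expected_loss = sigma ** 2 * np.trace(np.linalg.inv(V.T @ V))   # expression (4)
print(empirical_loss, expected_loss)
```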
The following theorem summarizes the standard scheme's properties:
Theorem 1.
Under the linear model:
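1. The standard scheme is privacy preserving, and its l2 loss is given by (4).
2. If the noise variables z_i in (1) are Gaussian, no privacy preserving recommendation system that discloses as much information as, or strictly more information than, the standard scheme is strictly more accurate than the standard scheme.
3. Any privacy preserving recommendation system that does not disclose as much information as the standard scheme cannot achieve the same accuracy; for some movie sets ν, an l2 loss that is finite under the standard scheme becomes unbounded.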
The theorem is proved below. The second and third statements establish formally the optimality of the standard scheme. Under Gaussian noise, no privacy preserving system achieves better accuracy. Surprisingly, this is true even among schemes that disclose strictly more information than the standard scheme. There is no reason to disclose more than vi0 for each movie. The third statement implies that, to achieve the same accuracy, the recommender system must disclose at least vi0. In fact, the proof establishes that, in such a scenario, an l2 loss that was finite under the standard scheme can become unbounded.
Proof of Theorem 1:
$$Y'\big(p+\tilde v_0,\,+1,\,l'\big)\;\stackrel{d}{=}\;Y'\big(p-\tilde v_0,\,-1,\,l'\big),\qquad(5)$$
i.e., the two random outputs are equal in distribution.
$$Z^{+}\big(s+\xi e_1,\,l^{*}\big)\;\stackrel{d}{=}\;Z^{+}\big(s,\,l^{*}\big)\qquad(6)$$
Taking x′_k = x_k + Kξ/v_k, and x′_l = x_l for all other l in {0, 1, . . . , d}, yields an x′ that satisfies the desired properties.
Several aspects of the model of the present invention call for a more detailed discussion.
Leakage (Disclosure, Divulgation) Interpretation.
In the standard scheme, the disclosed (divulged, leaked) information vi0 is the parameter that gauges the impact of the private feature on the user's feedback. In the running example, it is the impact of the gender on the user's appreciation of movie i. For the linear model (1), this parameter has a simple interpretation in a population of users for which the other features x are distributed independently of the gender. Indeed, assume a prior distribution on (x0, x) such that x is independent of x0. Then: E{r_i | x0=+1} − E{r_i | x0=−1} = ⟨v_i, E{x | x0=+1} − E{x | x0=−1}⟩ + 2v_{i0} = 2v_{i0}. Hence, given access to a dataset of user feedback in which users are not privacy-conscious and have disclosed their gender, the recommender need only compute the average rating of each movie per gender. Disclosing v_{i0} amounts to releasing half the distance between these two values.
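For example, given non-obfuscated ratings from users who have disclosed their gender, the quantities vi0 can be computed as follows (a sketch; the flat array layout and the toy values are hypothetical):

```python
# Sketch: estimate v_{i0} as half the distance between per-gender average
# ratings of each movie, from users who have disclosed their gender.
import numpy as np

# Hypothetical flat arrays: one entry per (movie, gender, rating) observation.
movie_ids = np.array([0, 0, 1, 1, 1, 2, 2])
genders   = np.array([+1, -1, +1, -1, -1, +1, -1])
ratings   = np.array([4.0, 3.0, 5.0, 4.0, 3.0, 2.0, 4.0])

num_movies = movie_ids.max() + 1
v0 = np.zeros(num_movies)
for i in range(num_movies):
    sel = movie_ids == i
    mean_plus = ratings[sel & (genders == +1)].mean()
    mean_minus = ratings[sel & (genders == -1)].mean()
    v0[i] = 0.5 * (mean_plus - mean_minus)   # half distance between the two means
print(v0)
```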
Inference from Movie Selection.
In practice, generating ratings for all movies in [M] may correspond to a high cost in time. It thus makes sense to consider the following constraint: there exists a set S0 (e.g., the movies the user has viewed) such that the obfuscated set of ratings must satisfy S ⊆ S0. In this case, S0 itself might reveal the user's gender.
A solution is presented when viewing events are independent, i.e., P_{x0}(S0=A) = ∏_{i∈A} p_i^{x0} ∏_{i∉A} (1 − p_i^{x0}), where p_i^{x0} is the probability that the user has viewed movie i, conditioned on the value of his/her gender x0. Consider the following obfuscation protocol. First, given S0, the user decides independently for each movie i ∈ S0 whether to generate and disclose feedback for it, thus constructing a set S, whereby

$$P(i\in S\mid i\in S_0)\;=\;\min\big(1,\;p_i^{-x_0}/p_i^{x_0}\big),\qquad(7)$$

where −x0 denotes the complement of x0 (i.e., the opposite gender). Ratings for i ∈ S are revealed after applying the standard scheme.
This obfuscation has the following desirable properties. First, S ⊆ S0. Second, it is privacy preserving. To see this, note that P_{x0}(i∈S) = min(1, p_i^{−x0}/p_i^{x0}) × p_i^{x0} = min(p_i^{x0}, p_i^{−x0}), i.e., it does not depend on x0. Finally, the set S is maximal: there is no privacy preserving method for generating a set S′ ⊆ S0 such that E{|S′|} > E{|S|}. To see this, note that for any scheme with E{|S′|} > E{|S|}, there exists an i such that P_{x0}(i∈S′) > P_{x0}(i∈S) = min(p_i^{+}, p_i^{−}). If the scheme is privacy preserving, this must be true for both values of x0; however, as S′ ⊆ S0, it must be that P_{x0}(i∈S′) ≤ p_i^{x0} for both x0, a contradiction. Motivated by the maximality of this obfuscation scheme, it is used below as a means to select only a subset of the movies rated by a user.
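A sketch of this selection step is given below, assuming the per-gender viewing probabilities are known to the user; the probability values and variable names are illustrative.

```python
# Sketch of the privacy preserving selection scheme (7): each viewed movie i is
# kept with probability min(1, p_i^{other gender} / p_i^{own gender}), so that
# P(i in S) = min(p_i^+, p_i^-) regardless of the gender x0.
import numpy as np

rng = np.random.default_rng(2)
M = 6
p_plus = np.array([0.9, 0.2, 0.5, 0.7, 0.1, 0.4])    # viewing prob. if x0 = +1
p_minus = np.array([0.3, 0.6, 0.5, 0.2, 0.4, 0.4])   # viewing prob. if x0 = -1

def select(S0, x0):
    """Return the subset S of the viewed set S0 to reveal, per equation (7)."""
    p_own = p_plus if x0 == +1 else p_minus
    p_other = p_minus if x0 == +1 else p_plus
    keep_prob = np.minimum(1.0, p_other / p_own)
    return [i for i in S0 if rng.random() < keep_prob[i]]

x0 = +1
S0 = [i for i in range(M) if rng.random() < p_plus[i]]   # movies actually viewed
print(S0, select(S0, x0))
```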
The standard scheme of the present invention is evaluated on a movie recommender system. Users of the system provide an integer rating between 1 and 5 for the movies they have watched, and in turn expect the system to provide useful recommendations. Gender is defined as the private value that users do not want to reveal to the recommender, which is known to be inferable from movie ratings with high accuracy. Datasets from two movie rating services are used: MovieLens and Flixster. Both contain the gender of every user. The datasets are restricted to users that rated at least 20 movies and movies that were rated by at least 20 users. As a result, the MovieLens dataset has 6K users (4319 males, 1703 females), 3043 movies, and 995K ratings. The Flixster dataset has 26K users (9604 males, 16433 females), 9921 movies, and 5.6M ratings.
To assess the success of obfuscation in practice, several standard methods are applied to infer gender from ratings, including Naïve Bayes (NB), Logistic Regression (LR) and Support Vector Machines (SVM), and a new method similar to Linear Discriminant Analysis (LDA) is proposed. The latter method is based on the linear model (1), and assumes a Gaussian prior on x and a Bernoulli prior on the gender x0. Under these priors, ratings are normally distributed with a mean determined by x0, and the maximum likelihood estimator of x0 is precisely LDA in a space whose dimension is the number of movies viewed. Each inference method is evaluated in terms of the area under the curve (AUC). The input to the LR, NB and SVM methods comprises the ratings of all movies given by the user, as well as zeros for movies not rated. LDA, on the other hand, operates only on the ratings that the user provided.
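As an illustration, gender inference by logistic regression with AUC evaluation can be sketched as follows; the use of scikit-learn and the toy data are assumptions, not part of the description above.

```python
# Sketch: infer gender from a dense user-by-movie rating matrix (zeros for
# unrated movies) with logistic regression, and evaluate by AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n_users, n_movies = 200, 50
R = rng.integers(0, 6, size=(n_users, n_movies)).astype(float)  # toy ratings, 0 = unrated
gender = rng.choice([+1, -1], size=n_users)                     # toy labels

R_tr, R_te, g_tr, g_te = train_test_split(R, gender, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(R_tr, g_tr)
print(roc_auc_score(g_te, clf.decision_function(R_te)))
```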
The standard obfuscation scheme is studied both with and without the selection scheme, which is performed using the maximal scheme (7) discussed above. The movie vectors are constructed as follows. For each movie, the gender bias v0 is computed as half the distance between the average movie ratings per gender. Using these values, the remaining features v were computed through matrix factorization with d=20. These are computed from the non-obfuscated ratings. Matrix factorization was performed using gradient descent with 20 iterations and a regularization parameter of 0.02, selected through cross validation.
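A sketch of this construction is given below. It assumes, as one reading of the description above, that matrix factorization is run on ratings from which the gender component x0·vi0 has been removed, and it uses plain stochastic gradient descent with the stated iteration count and regularization; the data layout and helper names are hypothetical.

```python
# Sketch: build movie vectors as described above. v0 is the per-movie half
# distance between gender means; the remaining d features come from a simple
# SGD matrix factorization (20 iterations, regularization 0.02).
import numpy as np

def factorize(triples, genders, v0, n_users, n_movies, d=20, reg=0.02,
              lr=0.01, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_users, d))   # user profiles x
    V = rng.normal(scale=0.1, size=(n_movies, d))  # movie features v
    for _ in range(n_iter):
        for u, i, r in triples:
            target = r - genders[u] * v0[i]        # remove the gender component (assumption)
            err = target - U[u] @ V[i]
            u_old = U[u].copy()
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V

# Toy usage with hypothetical data: 3 users, 4 movies.
triples = [(0, 1, 4.0), (1, 2, 3.0), (2, 0, 5.0)]
genders = np.array([+1, -1, +1])
v0 = np.zeros(4)
U, V = factorize(triples, genders, v0, n_users=3, n_movies=4)
```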
When using the standard scheme, the new rating may not be an integer value, and may even be outside of the range of rating values expected by the recommender system. To that end, a variation that rounds the rating value to an integer in the range [1,5] is considered. Ratings higher than 5 or lower than 1 are first truncated to 5 or 1, respectively. Given a non-integer obfuscated rating r between two integers k=⌊r⌋ and k+1, rounding is performed by assigning the rating k+1 with probability r−k and the rating k with probability 1−(r−k), which in expectation gives the desired rating r. For brevity, this entire process is referred to as "Rounding". Two baselines for obfuscation are also considered. The movie average scheme replaces a user's rating with the average rating of the movie. The gender average scheme replaces the user's rating with the average rating provided by males or females, each with probability 0.5.
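A sketch of the Rounding step described above (truncation to [1,5] followed by randomized rounding whose expectation equals the truncated rating); names are illustrative.

```python
# Sketch of "Rounding": clip the obfuscated rating to [1, 5], then round
# randomly so that the expectation equals the clipped value.
import numpy as np

rng = np.random.default_rng(3)

def round_rating(r):
    r = min(5.0, max(1.0, r))           # truncate to the valid rating range
    k = int(np.floor(r))
    if k == 5:                          # already at the top of the range
        return 5
    # report k+1 with probability r - k, and k with probability 1 - (r - k)
    return k + 1 if rng.random() < (r - k) else k

print([round_rating(3.7) for _ in range(10)])
```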
The accuracy of the recommendations is measured in terms of the root mean square error (RMSE) of the ratings. To this end, the user's ratings are split into training and evaluation sets. First the obfuscation method is applied to the training set, and then x is estimated through ridge regression over the obfuscated ratings with a regularization parameter of 0.1. Ratings of the movies in the evaluation set are predicted using the linear model (1), where x0 is provided by the LDA inference method. Experiments with the other inference methods were conducted with similar results.
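A sketch of the estimation and prediction step described above; the ridge regularization of 0.1 is as stated, while the data shapes, variable names, and the externally supplied x0 are illustrative assumptions.

```python
# Sketch: estimate the non-private profile x by ridge regression on the
# obfuscated training ratings, then predict held-out ratings with model (1).
import numpy as np

def estimate_profile(V_train, y_train, reg=0.1):
    """Ridge regression: argmin_x ||y - V x||^2 + reg * ||x||^2."""
    d = V_train.shape[1]
    A = V_train.T @ V_train + reg * np.eye(d)
    return np.linalg.solve(A, V_train.T @ y_train)

def predict(V_eval, v0_eval, x_hat, x0_inferred):
    """Predicted ratings r_i = <v_i, x_hat> + x0 * v_{i0} under model (1)."""
    return V_eval @ x_hat + x0_inferred * v0_eval

# Toy usage with random data (shapes only; values are illustrative).
rng = np.random.default_rng(6)
x_hat = estimate_profile(rng.normal(size=(50, 20)), rng.normal(size=50))
preds = predict(rng.normal(size=(10, 20)), rng.normal(size=10), x_hat, x0_inferred=+1)
```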
The proposed obfuscation and inference methods were run on both datasets. A 10-fold cross validation on the users was used, and the mean AUC and RMSE were computed across the folds. A summary of all the evaluations is shown in Table 1. The table provides the AUC obtained by the different inference methods under the various obfuscation methods detailed above, as well as the RMSE for each obfuscation method.
Several observations are consistent across the two datasets. First, inference methods are affected differently by the obfuscation methods, with LR, NB and SVM being mostly affected by the selection scheme whereas LDA is mostly affected by the standard obfuscation scheme of the present invention. However, when both selection and the standard obfuscation scheme are used, the AUC of all methods reduces to roughly 0.5. Furthermore, the impact of the obfuscation methods on the RMSE is not high, with a maximum increase of 1.5%. This indicates that although the obfuscation schemes manage to hide the gender, rating prediction is almost unaffected. The standard obfuscation scheme of the present invention performs almost exactly the same when rounding is introduced. Compared to the standard scheme (SS), baseline schemes result in a similar AUC but higher RMSE, indicating that aggressive obfuscation comes at a cost of losing the recommendation accuracy without considerable benefits in AUC.
To illustrate how obfuscation affects the inference accuracy, FIGS. 1(a) and 1(b) plot the distribution of inference probabilities for males and females before and after obfuscation.
The privacy-accuracy tradeoff is studied by applying an obfuscation scheme with probability α and releasing the real rating with probability 1−α; the resulting RMSE-AUC tradeoff is shown in FIG. 1(c).
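A sketch of this mixing mechanism; the obfuscated values would come from one of the schemes above and are purely illustrative here.

```python
# Sketch: release the obfuscated rating with probability alpha and the true
# rating with probability 1 - alpha; alpha controls the privacy-accuracy tradeoff.
import numpy as np

rng = np.random.default_rng(4)

def mix(ratings, obfuscated, alpha):
    """Element-wise: pick the obfuscated value with probability alpha."""
    use_obf = rng.random(len(ratings)) < alpha
    return np.where(use_obf, obfuscated, ratings)

ratings = np.array([4.0, 2.0, 5.0])
obfuscated = np.array([3.5, 2.5, 4.0])      # e.g., output of the standard scheme
print(mix(ratings, obfuscated, alpha=0.8))
```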
It is natural to extend the questions addressed by the present invention to more general inference settings beyond the linear model studied here. In particular, quantifying the amount of information whose release is necessary to ensure privacy and accuracy under more general parametric models remains an interesting open question. In addition, the focus here was on privacy-preserving recommendation systems. There are several ways of relaxing the privacy constraint, including the use of ε-differential privacy with ε>0.
It is to be understood that the present invention may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof. Special purpose processors may include application specific integrated circuits (ASICs), reduced instruction set computers (RISCs) and/or field programmable gate arrays (FPGAs). Preferably, the present invention is implemented as a combination of hardware and software. Moreover, the software is preferably implemented as an application program tangibly embodied on a program storage device. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (CPU), a random access memory (RAM), and input/output (I/O) interface(s). The computer platform also includes an operating system and microinstruction code. The various processes and functions described herein may either be part of the microinstruction code or part of the application program (or a combination thereof), which is executed via the operating system. In addition, various other peripheral devices may be connected to the computer platform such as an additional data storage device and a printing device.
It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying figures are preferably implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the present invention is programmed. Given the teachings herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present invention.
This application claims priority to U.S. provisional application Ser. No. 61/761,330 filed on Feb. 6, 2013, entitled “PRIVACY PROTECTION AGAINST CURIOUS RECOMMENDERS”, incorporated herein by reference.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US13/53984 | 8/7/2013 | WO | 00

Number | Date | Country
---|---|---
61/761,330 | Feb. 6, 2013 | US