The present invention relates generally to data mining. More specifically, the invention relates to determining the number of users contributing to a set of ratings.
Online commerce services such as Netflix provide personalized recommendations by collecting user ratings about a universe of items, referred to here as ‘movies’. Typically, multiple people within a single household (family members, roommates, etc.) may share the same account for both viewing and rating movies. Service providers are reluctant to deploy multiple accounts as log-in screens are often perceived as a nuisance and a barrier to using the service. This is especially true on devices lacking a keyboard, such as televisions or gaming platforms. Account sharing persists even when providers offer the option of registering secondary accounts, as the latter typically have access to a subset of the services enjoyed by the primary account holder. Finally, sharing might be regarded as a partial (if unconscious) privacy protection mechanism, as users might not want to release the household composition and demographics.
The use of a single account by multiple individuals poses a challenge in providing accurate personalized recommendations. Informally, the recommendations provided to a “composite” account, comprising the ratings of two dissimilar users, may not match the interests of either of these users. Moreover, recommendation methods relying on low-rank assumptions (such as matrix factorization) may fail on data including composite users. This is because “mixing” entries from different rows of a low-rank matrix results in a matrix that need not be low-rank. Beyond personalized recommendations, the ability to identify the individual users behind a composite account is useful, as it can aid in determining the household's demographics. Such information can be subsequently monetized, e.g., through targeted advertising.
The present invention addresses the challenges of identifying separate users in a composite account, and discovering information related to their profiles.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. The Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The present invention includes a method and apparatus to detect a number of individual users corresponding to movie ratings in a common, composite set of movie ratings. The method includes accessing the composite set of movie ratings and a set of movie profiles by loading both sets into a rating analysis engine. A number of partitions of movie ratings present in the composite set are calculated using the composite set and the movie profiles. The number of partitions is determined iteratively using subspace clustering of ratings from the composite set. The number of determined partitions corresponds to the number of individual users.
Additional features and advantages of the invention will be made apparent from the following detailed description of illustrative embodiments which proceeds with reference to the accompanying figures.
The foregoing summary of the invention, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the accompanying drawings, which are included by way of example, and not by way of limitation with regard to the claimed invention.
a illustrates a functional diagram of a rating analysis engine according to aspects of the invention;
b illustrates a functional diagram of a partition detector according to aspects of the invention;
a depicts an example web-based rating analysis engine according to aspects of the invention;
b depicts an example set-top-box based rating analysis engine according to aspects of the invention;
In the following description of various illustrative embodiments, reference is made to the accompanying drawings, which form a part thereof, and in which are shown, by way of illustration, various embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural and functional modifications may be made without departing from the scope of the present invention.
Initially, a statistical model is developed to help frame an analysis. Consider a dataset of ratings on M movies provided by N accounts, each corresponding to a different household. Ratings are available for a subset of all N×M possible pairs: denote by ℳ_H⊂[M], where m_H≡|ℳ_H|, the set of movies rated by account/household H, and by r_Hj∈ℝ the rating of movie j∈ℳ_H.
Each movie j∈[M] is associated with a feature vector v_j∈ℝ^d, where d≪N,M. Matrix factorization is used to extract the latent features for each movie, as further described below. If explicit information (e.g., genres or tags) is available, it can easily be incorporated in the model by extending the vectors v_j.
Each household H may comprise one or more users who actually rated the movies in ℳ_H. Denote by H the set of users in this household, and by n_H=|H| the household size. For each i∈H, denote by A*_i⊂ℳ_H the set of movies rated by i, and by I*(j)∈H the user that rated j∈ℳ_H. Note that neither the household size n_H nor the mapping I*: ℳ_H→H is known a priori. With this starting point, model selection can be used to determine the household size n_H. A closely related problem is that of determining whether the account is composite (i.e., |H|>1) or not. Also, user identification can be performed to identify movies that have been viewed by the same user, i.e., to recover the partition I*, up to a permutation, and to use this knowledge to profile the individual users.
A linear model is used to help frame an analysis. Focusing now on a single household, and omitting the index H hereafter, denote by n the household size, by ℳ and m the set of movies rated by this household and its size, respectively, and by r_j the rating given to movie j∈ℳ. One main modeling assumption is that the rating r_j generated by a user i∈H for a movie j∈ℳ is determined by a linear model over the feature vector v_j. That is, for each i∈H there exists a vector u*_i∈ℝ^d and a real number z*_i∈ℝ (the bias), such that
\[ r_j = \langle u_i^*, v_j \rangle + z_i^* + \varepsilon_j, \quad \text{for all } j \in A_i^*,\ i \in H, \tag{1} \]
where ε_j∈ℝ are independent, identically distributed (i.i.d.) Gaussian random variables with mean zero and variance σ². Such linear models are used extensively by rating prediction methods that rely on matrix factorization, and are known to perform very well in practice.
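As an illustration of the linear model of EQ (1), the following minimal sketch draws synthetic ratings for one household. The function name `simulate_household`, the Gaussian distributions chosen for the movie and user profiles, and all default values are assumptions made for this example only; they are not taken from the description above.

```python
import numpy as np

def simulate_household(n_users=2, m=200, d=10, sigma=0.1, seed=0):
    """Draw ratings r_j = <u_i, v_j> + z_i + eps_j for a synthetic household."""
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(m, d))            # movie feature vectors v_j (one per row)
    U = rng.normal(size=(n_users, d))      # user vectors u*_i (assumed Gaussian)
    z = rng.normal(size=n_users)           # user biases z*_i (assumed Gaussian)
    I = rng.integers(0, n_users, size=m)   # ground-truth mapping I*(j)
    eps = rng.normal(scale=sigma, size=m)  # i.i.d. zero-mean Gaussian noise
    r = np.einsum("jd,jd->j", U[I], V) + z[I] + eps
    return V, r, I, U, z

V, r, I, U, z = simulate_household()
```

Only `V` and `r` would be observable in practice; `I`, `U` and `z` play the role of the unknown ground truth that the identification methods try to recover.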
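Assuming that the household size is known, the model parameters of (1) are (a) the user profiles Θ*={θ*_i}_{i∈H}∈ℝ^(n×(d+1)), where θ*_i=(u*_i,z*_i)∈ℝ^(d+1), i∈H, as well as (b) the mapping I*: ℳ→H. Given two estimators Θ, I of Θ*, I*, the log-likelihood of the observed sequence of pairs {(v_j,r_j)}_{j∈ℳ} is given by
\[ L(\Theta, I) = -\frac{1}{2\sigma^2} \sum_{j \in \mathcal{M}} \left( r_j - \langle u_{I(j)}, v_j \rangle - z_{I(j)} \right)^2 - m \log\left(\sigma\sqrt{2\pi}\right). \tag{2} \]
Estimating the maximum likelihood model parameters thus amounts to minimizing the mean square error:
\[ \min_{\Theta,\, I} \ \mathrm{MSE}(\Theta, I) = \frac{1}{m} \sum_{j \in \mathcal{M}} \left( r_j - \langle u_{I(j)}, v_j \rangle - z_{I(j)} \right)^2, \tag{3} \]
where Θ∈ℝ^(n×(d+1)), I∈J, the set of all mappings from ℳ to H. Note that (3) is not convex. Nevertheless, fixing I results in a quadratic program, while fixing Θ results in a combinatorial problem solvable in O(nm) time.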
Subspace arrangements are now discussed. An insightful geometric interpretation of the minimization (3) is obtained by studying the points x_j=(v_j,1,r_j)∈ℝ^(d+2), i.e., the (d+2)-dimensional vectors resulting from appending (1,r_j) to the movie profiles. Eq. (1) implies that although the points x_j exist in an ambient space of dimension d+2, they actually lie on a lower-dimensional manifold: the union of n hyperplanes, i.e., (d+1)-dimensional linear subspaces of ℝ^(d+2).
To see this, let n*_i=(u*_i,z*_i,−1)∈ℝ^(d+2) be the vector obtained by appending the bias z*_i and −1 to u*_i. Then, |<n*_i,x_j>|=|<u*_i,v_j>+z*_i−r_j|=|ε_j|, for every j∈A*_i. Hence, provided that the variance σ² is small, the points x_j lie very close to the hyperplane with normal n*_i that crosses the origin. In other words, the points in ℝ^(d+2) lie slightly off a hyperplane whose normal is (u*_i,z*_i,−1)∈ℝ^(d+2).
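The identity |<n*_i,x_j>|=|ε_j| can be checked numerically. The sketch below builds the points x_j=(v_j,1,r_j) for a single user and compares their inner products with the normal against the noise terms. The dimensions, the noise level and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, sigma = 5, 100, 0.05                 # assumed sizes and noise level
u, zb = rng.normal(size=d), 0.3            # one user's profile (u*_i, z*_i)
V = rng.normal(size=(m, d))                # movie profiles v_j
eps = rng.normal(scale=sigma, size=m)
r = V @ u + zb + eps                       # EQ (1) for a single user

X = np.column_stack([V, np.ones(m), r])    # points x_j = (v_j, 1, r_j)
normal = np.concatenate([u, [zb, -1.0]])   # n*_i = (u*_i, z*_i, -1)

offsets = np.abs(X @ normal)               # |<n*_i, x_j>|, equal to |eps_j|
```

Because `offsets` equals `np.abs(eps)` up to rounding, the points sit on the hyperplane through the origin with normal `normal` whenever the noise vanishes.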
A union of such affine subspaces is called a subspace arrangement. Given that the data x_j, j∈ℳ, “almost” lie on such a manifold, minimizing the MSE has the following appealing geometric interpretation. First, mapping a movie j to a user amounts to identifying the hyperplane to which x_j is closest. Second, once movies are thus mapped to users, profiling a user amounts to computing the normal to its corresponding hyperplane. Finally, identifying the number of users in a household amounts to determining the number of hyperplanes in the arrangement.
These tasks are known collectively as the subspace estimation or subspace clustering problem, which has numerous applications in computer vision and image processing. This connection is exploited herein to apply algorithms for subspace clustering on user identification; namely the Expectation Maximization (EM) algorithm and the Generalized Principal Components Analysis (GPCA) algorithm.
Example data sets were applied to the algorithms of the current invention to test its use. One data set is the CAMRa2011 dataset. The CAMRa2011 dataset was released at the Context-Aware Movie Recommendation (CAMRa) challenge at the 5th ACM International Conference on Recommender Systems (RecSys) 2011. This dataset consists of 4 536 891 5-star ratings provided by N=171 670 users on M=23 974 movies, as well as additional information about household membership for a subset of 602 users. The 290 households comprise 272, 14 and 4 households of size 2, 3 and 4 users, respectively. The entire dataset was used to compute the movie profiles vj through matrix factorization, using d=10 (found to be optimal through cross validation). In the sequel, attention is restricted to the 544 users belonging to households of size 2. To simulate a composite account, the ratings provided by users belonging to the same household were merged. The original mapping of ratings to household members serves as the ground truth.
A second dataset used was the Netflix dataset. The second dataset contains 5-star ratings given by N=480 189 users for M=17 770 movies. The movie profiles v_j were obtained through matrix factorization on the entire dataset, with d=30. Attention is restricted to the subset of 54 404 users who rated at least 500 movies. Also, 300 ‘synthetic’ households of size 2 were generated by pairing the ratings of 600 randomly selected users. Matrix factorization is likely to be unreliable for extracting account feature vectors, as accounts may be composite. On the other hand, it appears to perform well for movies. The OPTSPACE algorithm is used in both datasets for matrix factorization, which is not further discussed.
The Expectation Maximization (EM) and Generalized Principal Components Analysis (GPCA) algorithms are discussed herein. The EM algorithm identifies the parameters of mixtures of distributions. It naturally applies to subspace clustering; technically, the variant used here is “hard” or “Viterbi” EM. The algorithm proceeds over multiple iterations, alternately minimizing the MSE with respect to the user profiles Θ and the movie-to-user mapping I. Initially, a mapping I^0∈J is selected uniformly at random; at step k≥1, the profiles and the mapping are computed as follows:
\[ \Theta^k = \arg\min_{\Theta \in \mathbb{R}^{n \times (d+1)}} \mathrm{MSE}(\Theta, I^{k-1}), \tag{4} \]
\[ I^k = \arg\min_{I \in J} \mathrm{MSE}(\Theta^k, I). \tag{5} \]
The minimization in EQ. (4) can be solved through linear regression. For example, obtain a mapping I: ℳ→[n]=H by clustering the rating events (v_j,r_j)∈ℝ^(d+1), j∈ℳ, into n clusters. Then, given I, estimate θ_i=(u_i,z_i), i∈[n], by solving the quadratic program min_Θ MSE(Θ,I), where MSE is given by EQ (3). EQ. (5) amounts to identifying the profile that best predicts each rating, i.e.,
\[ I^k(j) = \arg\min_{i \in H} \left( r_j - \langle u_i^k, v_j \rangle - z_i^k \right)^2, \quad j \in \mathcal{M}, \tag{6} \]
which can be computed in O(nm) time.
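The alternation described above can be sketched in a few lines. This is a minimal illustration of hard EM under stated assumptions, not the implementation evaluated by the inventors: the function names, the iteration cap `iters`, the stopping rule (stop when the mapping no longer changes) and the handling of empty clusters are all choices made for the example.

```python
import numpy as np

def fit_profiles(V, r, I, n):
    """Per-user least squares for theta_i = (u_i, z_i), given a fixed mapping I."""
    m, d = V.shape
    A = np.column_stack([V, np.ones(m)])       # rows (v_j, 1)
    Theta = np.zeros((n, d + 1))
    for i in range(n):
        mask = I == i
        if mask.any():                         # empty clusters keep a zero profile
            Theta[i] = np.linalg.lstsq(A[mask], r[mask], rcond=None)[0]
    return Theta

def assign(V, r, Theta):
    """Map each rating to the profile with the smallest squared prediction error."""
    A = np.column_stack([V, np.ones(len(r))])
    err = (r[None, :] - Theta @ A.T) ** 2      # squared errors, shape (n, m)
    return err.argmin(axis=0), err.min(axis=0).mean()

def hard_em(V, r, n, iters=50, seed=0):
    """Alternate profile fitting and reassignment; returns (Theta, I, MSE)."""
    rng = np.random.default_rng(seed)
    I = rng.integers(0, n, size=len(r))        # random initial mapping I^0
    for _ in range(iters):
        Theta = fit_profiles(V, r, I, n)
        I_new, mse = assign(V, r, Theta)
        if np.array_equal(I_new, I):           # mapping is stable: stop
            break
        I = I_new
    return Theta, I, mse
```

Each of the two steps can only lower the MSE, so the procedure settles at a local minimum; different random seeds may therefore yield different partitions.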
The Generalized Principal Components Analysis (GPCA) algorithm is an algebraic-geometric algorithm for solving the general subspace clustering problem, as defined herein above. To give some insight on how GPCA works, consider first an idealized case where the noise ε_j in the linear model (1) is zero. Then, the points x_j=(v_j,1,r_j), j∈A*_i, lie exactly on a hyperplane with normal n*_i=(u*_i,z*_i,−1). Thus, every x_j, j∈ℳ, is a root of the following homogeneous polynomial of degree n:
\[ P_c(x) = \prod_{i \in H} \langle n_i^*, x \rangle = \sum_{k} c_k x^k, \tag{7} \]
where the sum ranges over the monomials x^k of degree n in the d+2 coordinates of x. Denote by c the vector of the monomial coefficients c_k. Provided that |ℳ|≥K(n,d)=O(min(n^d,d^n)), c can be computed by solving the system of linear equations P_c(x_j)=0, j∈ℳ.
Knowledge of c can be used to exactly recover I*, up to a permutation. This is because, by EQ (7), for any j∈A*_i, the gradient ∇P_c(x_j) is proportional to the normal n*_i. Hence, the partition {A*_i} of the points can be recovered by grouping together points with co-linear gradients.
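The noise-free procedure can be illustrated for the special case n=2. The sketch below is a simplified illustration and not the full GPCA algorithm: it assumes exactly two hyperplanes and zero noise, finds c as the null vector of the matrix of degree-2 monomials, and groups points by whether their gradient is co-linear with that of the first point. The function name and the tolerance `1e-6` are assumptions.

```python
import numpy as np
from itertools import combinations_with_replacement

def gpca_two_hyperplanes(X):
    """Noise-free sketch for n = 2: fit a degree-2 polynomial, group by gradient."""
    m, D = X.shape
    pairs = list(combinations_with_replacement(range(D), 2))
    # One column per degree-2 monomial x_a * x_b with a <= b.
    Vn = np.column_stack([X[:, a] * X[:, b] for a, b in pairs])
    # c spans the null space of Vn, since P_c(x_j) = 0 for every point.
    c = np.linalg.svd(Vn)[2][-1]
    # Gradient of P_c at every point; squares (a == b) receive both terms.
    G = np.zeros_like(X)
    for ck, (a, b) in zip(c, pairs):
        G[:, a] += ck * X[:, b]
        G[:, b] += ck * X[:, a]
    G /= np.linalg.norm(G, axis=1, keepdims=True)
    # Label 0: gradient co-linear with that of point 0; label 1: otherwise.
    return (np.abs(G @ G[0]) < 1 - 1e-6).astype(int)
```

With noisy ratings the gradients are only approximately co-linear, so a fixed tolerance of this kind would no longer be reliable.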
Unfortunately, this result does not readily generalize in the presence of noise. In this case, one approach is to estimate c by solving a (non-convex) optimization problem, referred to as EQ (8).
Solving EQ (8) is accomplished through a first-order approximation of P_c, with the gradients then clustered using a “voting” method, as known to those of skill in the art.
Evaluation by the inventors of the EM and GPCA algorithms provided statistically significant accuracy results. The user identification algorithmic methods presented above assume a priori knowledge of the number of users sharing a composite account. However, this information may not be readily available. Discussed below is a model selection algorithm for this task.
The problem of estimating the number of unknown parameters in a model is known as model selection. Denoting by Θ_n∈ℝ^(n×(d+1)), I_n∈J the estimators of the parameters Θ*, I* of the linear model EQ (1) for size n, the general method for model selection amounts to determining the n that minimizes
\[ -L(\Theta_n, I_n) + C(n), \]
where L(Θ_n, I_n) is the log-likelihood of the data, given by EQ (2), and C is a metric capturing the model complexity, usually as a function of the number of parameters n. Several different approaches for defining C exist. The inventors have found that the well known Bayesian Information Criterion (BIC) performed best over the datasets used.
The BIC for a household H of size |H|=n is given by
\[ \mathrm{BIC}_n = \frac{m}{2\sigma^2}\, \mathrm{MSE}(\Theta_n, I_n) + n(d+2) \log m, \tag{9} \]
where σ² is the variance of the Gaussian noise in EQ (1). Note that different methods for obtaining the estimators Θ_n, I_n lead to different values for BIC_n.
BIC was tested on the two datasets as follows. For the CAMRa2011 (Netflix) dataset, a combined dataset was created comprising the 272 (300) composite accounts of size n=2 as well as the 544 (600) individuals of size n=1 that are included in these households, yielding a total of 816 (900) accounts. For each of these accounts, the MSE is first computed under the assumption that n=1; this amounted to solving a regression for a single profile θ_1=[u_1,z_1] under I(j)=1, for all j∈ℳ, obtaining an MSE denoted by MSE1. Subsequently, the identification methods (EM, GPCA) were used to obtain a mapping I: ℳ→H, and vectors θ_i=(u_i,z_i), i∈{1,2}: each of these yielded an MSE for n=2, denoted by MSE2.
Using these values, the following classifier was constructed. An account may be labeled as composite when
(MSE1−MSE2)−τ log m/m>0 (10)
By varying τ, the classifier can be made more or less conservative towards declaring accounts as composite. For τ=2σ²(d+2), this classifier coincides with BIC.
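The decision rule of EQ (10) translates directly into code. In the sketch below, the function names `is_composite` and `bic_tau` are hypothetical; the two MSE values are assumed to have been computed beforehand, for instance by a single regression (n=1) and by one of the identification methods (n=2).

```python
import math

def is_composite(mse1: float, mse2: float, m: int, tau: float) -> bool:
    """EQ (10): composite when (MSE1 - MSE2) - tau * log(m) / m > 0."""
    return (mse1 - mse2) - tau * math.log(m) / m > 0

def bic_tau(sigma2: float, d: int) -> float:
    """Threshold tau = 2 * sigma^2 * (d + 2), for which the rule coincides with BIC."""
    return 2.0 * sigma2 * (d + 2)
```

A larger `tau` demands a larger drop in MSE before an account is declared composite, which makes the classifier more conservative.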
Knowledge of household composition can be used to improve recommendations. In a typical setup, a user accesses the account and the recommender system suggests a small set of movies from a catalog, recommending movies that are likely to be rated highly. However, even if the recommender system knows the household composition and the user profiles, it still does not know who might be accessing the account at a given moment. In the absence of side information, the present invention can circumvent this problem as follows. Assume the recommender has a budget of K movies to be displayed; it can then recommend the union of the K/n movies that are most likely to be rated highly by each of the n users. This exploits household composition, without requiring knowledge of who is presently accessing the account.
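The union-based recommendation strategy can be sketched as follows. The function name `recommend_union`, the use of the linear model to score every catalog movie, and the rounding of K/n down to an integer (with a minimum of one movie per user) are assumptions made for the example.

```python
import numpy as np

def recommend_union(Theta, V_catalog, K):
    """Return the union of the top K // n predicted movies of each user profile."""
    n = Theta.shape[0]
    A = np.column_stack([V_catalog, np.ones(len(V_catalog))])
    scores = Theta @ A.T                   # predicted rating per (user, movie)
    per_user = max(K // n, 1)
    picks = []
    for i in range(n):
        top = np.argsort(-scores[i])[:per_user]
        # Keep first occurrences only, so the result is a union without repeats.
        picks.extend(int(j) for j in top if int(j) not in picks)
    return picks
```

When two users share a favorite movie the union holds fewer than K entries; how to fill the remaining slots is left open here.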
Having developed the algorithmic background for a technique of user identification based solely on the ratings provided by users, using subspace clustering, application of the now-developed principles is discussed.
a depicts a functional diagram 200 of a rating analysis engine 210. The rating analysis engine accesses account ratings 205 and movie profile vectors 215, processes those inputs, and produces multiple outputs 225. The outputs include the number n of partitions corresponding to the number of users that are present in the account ratings, the number of partitions and ratings associated with those partitions present in the account ratings, and profiles associated with the identified users. The rating analysis engine can be used as a core device to provide an identification of separate users in a composite ratings set.
In one utilization of the ratings analysis engine 210, once the individual users are separated from the composite account ratings (that is, identified as separate users within the composite accounts rating set), then the individual user's ratings information can be used to perform data analysis on the separated composite ratings list. In one embodiment, the separate user ratings can be processed to determine demographic information about the individual user. Once demographic information is determined, then targeted advertisements can be given to those identified users based on their determined demographic information.
b depicts a function block diagram 250 of a partition and profile detector 260 used within the rating analysis engine 200 of
Processor 320 provides computation functions for the rating analysis engine 300, which corresponds to functional diagram 200. The processor can be any form of CPU or controller that utilizes communications between elements of the rating analysis engine to control communication and computation processes for the engine. Those of skill in the art recognize that bus 315 provides a communication path between the various elements of engine 300 and that other point to point interconnection options instead of a bus architecture are also feasible.
Memory 330 can provide a repository for data related to the method that incorporates the functionality of the ratings analysis engine. Memory 330 can provide the repository for storage of information such as program memory, downloads, uploads, or scratchpad calculations. Those of skill in the art will recognize that memory 330 may be incorporated in whole or in part into processor 320. Processor 320 utilizes program memory instructions to execute a method, such as method 500 of
Partition and profile detector 340 acts to implement the functions of the partition detector of
The rating and analysis engine 300 of may be integrated as a functional element in a device, such as a web-based analysis engine or a set top box, as discussed herein below with respect to
User device 402 may be a digital television, a smart phone, PDA, tablet or conventional laptop computer, or a fixed location personal computer (PC). Users of device 402 view digital content, such as movies and other video, and provide ratings of the viewed content via link 403 to a network interface 404. Network interface 404 may be part of user device 402. The composite rating information is transferred via link 405, through network 406 and link 407 to the network interface 409 of the engine 408. Network interfaces 404 and 409 each contain receivers and transmitters (transceivers) for two-way communication to and from network 406.
The composite rating information received by engine 408 may include ratings from multiple users of device 402. Engine 408 uses the rating analysis engine 470 to separate out the individual users, determine which user is associated with which rating, and can profile the user sufficiently to provide movie and video recommendations back to a user. In one application of the invention, engine 408 may also use the determined ratings to infer demographic information of each separate user and utilize that newly determined demographic information to target advertisements to a user. The inference of demographic information using ratings is discussed in U.S. Provisional Application No. 61/662,609 entitled “Method and Apparatus For Inferring User Demographics Based on Ratings”, which has inventors in common with the invention discussed herein.
Information regarding advertisements can be obtained via web-based database 413 or via local database 471 which may be accessed by engine 408 via a rating and analysis engine input/output interface, such as interface 350. The placement of advertisements may involve the engine 408 utilizing the processing capability of the rating analysis engine 470 to also perform processing on ratings determined via the use of the rating analysis engine 470 and to select advertisements from a database of advertisements such as database 413 or database 471. Once selected, the advertisement can be sent to the user via transceivers of the network interfaces 409 and 404 to be received by user device 402.
b depicts an example configuration 450 of a set top box (STB) based analysis engine 410 according to elements of the invention. In
STB. The composite rating information is provided to the rating analysis engine 460. In the configuration of
The composite rating information received by the rating analysis engine 460 may include ratings from multiple users of device 420. STB based engine 410 uses the rating analysis engine 460 to separate out the individual users, determine which user is associated with which rating, and can profile the user sufficiently to provide movie and video recommendations back to a user. Such recommendations may be provided via communications from content provider 414 after STB analysis engine 410 provides content provider 414 rating and user profile information determined from the ratings and analysis engine 460.
As discussed above with respect to
Process 500 starts at step 501 and moves to access movie ratings in a composite account at step 505. As discussed above, such movie ratings can contain multiple users and the number of users may not be known a priori. Accessing movie ratings in a composite account includes loading the composite set of movie ratings into a rating analysis engine, such as that of
At step 515, the partition and profile detector is used to determine a number of partitions of the composite ratings that were input at step 505. User profiles are also generated at step 515 via the partition and profile detector, such as the one described in conjunction with
Step 520 further uses the results of step 515 by using profile information from the individual users in the composite ratings to determine recommendations for each user. In the instance of a web-based analysis engine, such as shown in
Returning to the flow diagram of
Step 530 utilizes the determined demographic information to target advertisements to an individual user determined from the composite ratings. Selection of such a targeted advertisement can be determined from a database of advertisements which can be available on a network connection, such as that of networked database items 413 and 412 of
The process 600 starts at step 601 and moves to step 605 to set the number of partitions (users) to 1. Access to the composite movie ratings and movie profiles is provided in step 610. As previously described, the provided movie ratings are a composite set of movie ratings which may represent a single account for a service such as Netflix™ or Hulu™ where multiple individuals have access to the one account. The movie profiles include feature vectors as described herein above.
Partition and profile information for each user is determined at step 615. Partition and profile information is determined using the partition and profile detector 260 of
At step 625, the value of BIC for a value of n and the value of BIC for a value of n−1 are compared. Generally, the correct number of partitions, and hence of users, is the value of n at which BIC attains its minimum. If the value of BIC using n starts to rise and is greater than the value of BIC previously calculated using n−1, then the determination of step 625 is affirmative and the process 600 terminates by providing the correct number of partitions or users. At the affirmative conclusion of step 625, the correct number of partitions, and hence of users, is n−1.
If the determination at step 625 is negative, that is, if BIC(n) is less than or equal to BIC(n−1), then the value of BIC is not yet increasing and a minimum value of BIC may not have been reached. As a result, if the determination at step 625 is negative, the number n is increased by 1 at step 630. The process then continues iteratively to step 615, where the number of partitions and profiles for the partitions are determined. As previously described, once the number of partitions of the composite ratings is determined using method 600, the number of users is equivalent to the number of partitions in the composite ratings because each partition corresponds to a hyperplane mapping of the composite rating set.
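The iteration of process 600 can be summarized by the following sketch. The callable `bic` is a hypothetical stand-in for fitting n partitions with the partition and profile detector and scoring the result; the upper bound `n_max` and the choice to begin the comparisons at n=2 are assumptions added so that the example terminates and never needs a BIC value for n=0.

```python
from typing import Callable

def estimate_household_size(bic: Callable[[int], float], n_max: int = 10) -> int:
    """Increase n until BIC(n) > BIC(n - 1), then report n - 1 users."""
    prev = bic(1)                 # step 605: start with a single partition
    for n in range(2, n_max + 1):
        cur = bic(n)              # step 615: fit n partitions (via the callable)
        if cur > prev:            # step 625: BIC has started to rise
            return n - 1
        prev = cur                # step 630: try one more partition
    return n_max                  # assumed fallback when BIC never rises
```

For instance, BIC values of 10, 7, 8 for n = 1, 2, 3 stop the loop at n = 3 and yield a household size of 2.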
Although specific architectures are shown for the implementation of an analysis engine such as that of example embodiments of
Such options are equivalent to the functionality and structure of the depicted and described arrangements.
This application claims priority to U.S. Provisional Application No. 61/662,637 entitled “User Identification Through Subspace Clustering”, filed on 21 Jun. 2012, which is hereby incorporated by reference in its entirety for all purposes.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IB2013/001543 | 6/20/2013 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
61662637 | Jun 2012 | US |