This work was partially funded by the French Government under the grant ANR-13-CORD-0020 (ALICIA Project).
The exemplary embodiment relates to collaborative filtering and finds particular application in recommender systems where item perception and user tastes vary over time.
Recommender systems are designed to provide automatic recommendations to a user by attempting to predict the preferences or choices of the user. Recommender systems are employed in numerous retail and service applications. For example, an online retailer may provide a website through which a user (i.e., a customer) browses the retailer's catalog of products or services. To promote purchases, the retailer would like to identify and present to the customer specific products or services that the customer is likely to want to purchase. The recommender system, in this application, identifies products or services that are likely to be of interest to the customer, and these products are recommended to the customer.
Collaborative filtering is often used in such systems to provide automatic predictions about the interests of a user by collecting rating information from many users. The ratings can be explicit (e.g., a score given by a user) or implicit (e.g., based on user purchases). The method is based on the expectation that if two users have similar opinions on one item (or a set of items), then they are more likely to have similar opinions on another item than a randomly chosen person is. For example, collaborative filtering-based recommendations may be made to a user for television shows, movies, books, and the like, given a partial list of that user's tastes. These recommendations are specific to the user, but use information obtained from many users.
In many collaborative filtering applications, the available data is represented in a matrix data structure. For example, product ratings can be represented as a two-dimensional matrix in which the rows correspond to customers (users) and the columns correspond to products (items), or vice versa. The data structure is typically very sparse, as most users have not purchased or reviewed many of the items.
One problem with existing recommender systems is that they often fail to provide the level of reactivity to changes that users expect, i.e., the ability to detect and to integrate changes in needs, preferences, popularity, and so forth. User preferences and needs change over time, either gradually or sharply, e.g., depending on particular events and on social influences. Similarly, item perception may evolve in time, due to a natural slow decrease in popularity or a sudden gain in interest, e.g., after winning an award or receiving positive reviews from influential commentators.
Another problem with many existing methods is that they often lack efficiency and scalability to meet the demands of very large recommendation platforms.
Additionally, existing systems generally do not address the “cold start” case, such as when a user or item is added to the system without any historical information, or when abrupt changes occur.
One approach for addressing temporal effects in recommender systems is known as the timeSVD++ algorithm (Yehuda Koren, “Collaborative filtering with temporal dynamics,” Communications of the ACM, 53(4):89-97, 2010). This approach explicitly models the temporal patterns on historical rating data, in order to remove “temporal drift” biases. The time dependencies are modeled parametrically as time-series, typically in the form of linear trends, with a large number of parameters to be identified. Such a system is unable to extrapolate rating behavior into the future, as it involves the discretization of the timestamps into a finite set of “bins” and the identification of bin-specific parameters. It is thus impossible to predict ratings for future, unobserved bins.
Other approaches rely on a Bayesian framework and on probabilistic matrix factorization, where a state-space model is introduced to model the temporal dynamics (see, for example, Deepak Agarwal, et al., “Fast online learning through offline initialization for time-sensitive recommendation,” Proc. 16th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD), pp. 703-712, 2010; Zhengdong Lu, et al., “A spatio-temporal approach to collaborative filtering,” Proc. 3rd ACM Conf. on Recommender Systems (RecSys), pp. 13-20, 2009; and David H Stern, et al., “Matchbox: large scale online Bayesian recommendations,” Proc. 18th Int'l Conf. on World Wide Web (WWW), pp. 111-120, ACM, 2009).
Tensor factorization approaches have also been adopted to model the temporal effects of the dynamic rating behavior (Liang Xiong, et al., “Temporal collaborative Filtering with Bayesian probabilistic tensor factorization,” Proc. SIAM Int'l Conf. on Data Mining (SDM), vol. 10, pp. 211-222, 2010). In this method, user, item and time constitute the three dimensions of the tensors. Variants of this general framework have been proposed which introduce second-order interaction terms and a different definition of the time scale (user- or item-specific time scales, by considering the time interval since the user or item first entered into the system). (L. Xiang, et al., “Time-dependent models in collaborative filtering based recommender system,” IEEE/WIC/ACM Int'l Joint Conf. on Web Intelligence and Intelligent Agent Technologies, 2009 (WI-IAT'09), vol. 1, pp. 450-457, 2009; L. Yu, et al., “Multi-linear interactive matrix factorization,” Knowledge-Based Systems, Vol. 85, Issue C, pp. 307-315, 2015). Tensor factorization is useful for analyzing the temporal evolution of user and item-related factors, but it does not extrapolate rating behavior into the future.
Other approaches propose to incrementally update the item- and user-related factors corresponding to a new observation by performing a stochastic gradient step of a quadratic loss function, but only allowing one factor to be updated. The updating decision is taken based on the current number of observations associated to a user or to an item. Thus, for example, a user with a high number of ratings will no longer be updated. (P. Ott, “Incremental matrix factorization for collaborative filtering,” Science, Technology and Design 01/2008, Anhalt University of Applied Sciences, 2008; S. Rendle, et al., “Online-updating regularized kernel matrix factorization models for large-scale recommender systems,” Proc. 2008 ACM Conf. on Recommender Systems, pp. 251-258, 2008). A similar approach has been extended to a non-negative matrix completion setting by assuming that the item-related factors are constant over time. (S. Han, et al., “Incremental learning for dynamic collaborative filtering,” J. Software, 6(6):969-976, 2011).
The use of Kalman Filters for collaborative filtering has also been proposed. Some methods rely on a Bayesian framework and on probabilistic matrix factorization, where a state-space model is introduced to model the temporal dynamics. (Z. Lu, et al., “A spatio-temporal approach to collaborative filtering,” Proc. 2009 ACM Conf. on Recommender Systems, pp. 13-20, 2009; D. Agarwal, et al., “Fast online learning through offline initialization for time-sensitive recommendation,” Proc. KDD 2010, pp. 703-712, 2010; D. Stern, et al., “Matchbox: large scale online Bayesian recommendations,” Proc. Int'l Conf. on World Wide Web (WWW '09), pp. 111-120, 2009). In one approach, an Expectation-Maximization-like method based on Kalman smoothers (the non-causal extension of Kalman filters) is used to estimate the value of the hyperparameters (J. Sun, et al., “Collaborative Kalman filtering for dynamic matrix factorization,” IEEE Transactions on Signal Processing, 62(14):3499-3509, 2014; J. Sun, et al., “Dynamic matrix factorization: A state space approach,” IEEE Int'l Conf. on Acoustics, Speech and Signal Processing (ICASSP 2012), pp. 1897-1900, 2012). Another approach models the continuous-time evolution of the latent factors through Brownian motion (S. Gultekin et al., “A collaborative Kalman filter for time-evolving dyadic processes,” IEEE Int'l Conf. on Data Mining (ICDM 2014), pp. 140-149, 2014). While such methods could, in theory, be extended to include additional user- or item-related features to address the cold-start problem, in order to remain computationally tractable and to avoid having to tackle non-linearities, they update only either the user factors or the item factors, but never both simultaneously. This amounts to considering only linear state-space models, for which standard (linear) Kalman Filters provide an efficient and adequate solution.
Recently, an incremental matrix completion method has been proposed that automatically allows the latent factors related to both users and items to adapt “on-line” based on a temporal regularization criterion, ensuring smoothness and consistency over time, while leading to very efficient computations (U.S. application Ser. No. 14/669,153; J. Gaillard et al., “Time-sensitive collaborative filtering through adaptive matrix completion,” Adv. in Information Retrieval—Proc. 37th European Conf. on IR Research (ECIR 2015), pp. 327-332, 2015). The method allows updating of both item and user latent factors, but does not address the cold start problem explicitly.
Multi-Armed Bandits have been used for item recommendation (L. Li, et al., “A contextual-bandit approach to personalized news article recommendation,” Proc. 19th Int'l Conf. on World Wide Web (WWW 2010), pp. 661-670, 2010; O. Chapelle et al., “An empirical evaluation of Thompson sampling,” Proc. Adv. in Neural Information Processing Systems (NIPS 2011) vol. 24, pp. 2249-2257, 2011; D. Mahajan, et al., “Log UCB: an explore-exploit algorithm for comments recommendation,” 21st ACM Int'l Conf. on Information and Knowledge Management (CIKM 2012), pp. 6-15, 2012). Some approaches use linear contextual bandits, where a context is typically a user calling the system at time t and an associated feature vector. The reward (i.e., the rating) is assumed to be a linear function of this feature vector. Other approaches consider binary ratings, with a logistic regression model for each item and then use Thompson Sampling or UCB sampling to select the best item following an exploration/exploitation trade-off perspective. Another approach combines Probabilistic Matrix Factorization and linear contextual bandits (X. Zhao, et al., “Interactive collaborative filtering,” 22nd ACM Int'l Conf. on Information and Knowledge Management (CIKM'2013), pp. 1411-1420, 2013). None of these approaches, however, allows an adaptive behavior: the features associated to a user are assumed to be constant and known accurately.
A system and method are provided which allow dynamic tracking of both user and item latent factors while facilitating control of the exploration/exploitation trade-off in an on-line learning setting.
The following references, the disclosures of which are incorporated herein by reference, are mentioned:
Recommender systems and collaborative filtering methods are described, for example, in U.S. Pat. No. 7,756,753 and U.S. Pub. Nos. 20130218914, 20130226839, 20140180760, and 20140258027.
An adaptive collaborative filtering method is described in U.S. application Ser. No. 14/669,153, filed Mar. 26, 2015, entitled TIME-SENSITIVE COLLABORATIVE FILTERING THROUGH ADAPTIVE MATRIX COMPLETION, by Jean-Michel Renders.
In accordance with one aspect of the exemplary embodiment, a method for updating a predicted ratings matrix includes receiving an observation, the observation identifying a user, an item, an observed rating of the user for the item, and a time of the observation. User and item latent factor matrices and user and item biases are updated using extended Kalman filters based on the observation. The user latent factor matrix includes latent factors for each of a set of users. The item latent factor matrix includes latent factors for each of a set of items. A predicted ratings matrix is updated as a function of the updated user latent factor matrix and the updated item latent factor matrix.
One or more steps of the method may be performed with a processor.
In accordance with another aspect of the exemplary embodiment, a system for updating a predicted ratings matrix includes an adaptive matrix completion component which updates user and item latent factor matrices and user and item biases using extended Kalman filters based on a received observation, the observation identifying a user, an item, an observed rating of the user for the item, and a time of the observation, the user latent factor matrix including latent factors for each of a set of users, the item latent factor matrix including latent factors for each of a set of items. The component also updates a predicted ratings matrix as a function of the updated user latent factor matrix and the updated item latent factor matrix. A processor device implements the adaptive matrix completion component.
In accordance with one aspect of the exemplary embodiment, a method for making a recommendation includes, for a plurality of iterations, receiving an observation, each observation identifying a user, an item, an observed rating of the user for the item, and a time of the observation, updating user and item latent factor matrices and user and item biases using extended Kalman filters, based on the observations, the user latent factor matrix including latent factors for each of a set of users, the item latent factor matrix including latent factors for each of a set of items, and updating a predicted ratings matrix as a function of the user latent factor matrix and the item latent factor matrix. A request for an item to be recommended to a user is received. An item to recommend to the user is identified using multi-arm bandit sampling and the identified item is output.
One or more steps of the method may be performed with a processor.
Aspects of the exemplary embodiment relate to a method and system for adaptive collaborative filtering which enable both adaptivity and “cold start” challenges to be addressed through the same framework, namely Extended Kalman Filters (EKF) coupled with contextual Multi-Armed Bandits (MAB). Advantages of the system and method include scalability and tractability, which are useful in systems dealing with large numbers of users and items. Rather than relying on complex inference methods derived from fully Bayesian approaches, a more simplified and efficient method is employed.
The system and method rely on a matrix completion approach to collaborative filtering, which is extended to the adaptive, dynamic case, while controlling the exploitation/exploration trade-off (especially in the “cold start” situations). EKF provide a useful framework for modelling smooth non-linear, dynamic systems with time-varying latent factors (sometimes referred to as “states”). In contrast to conventional filters, which only allow one of the user and item latent factor matrices to be updated, EKF allow both matrices to vary over time, as well as user and item biases, which is more realistic. The EKF maintain, in particular, covariance estimates over the states (which are covariance matrices for latent factors and scalar values for user and item biases) or, equivalently, a posterior distribution over the user/item biases and latent factors, which are exploited by the MAB mechanism to guide its sampling strategy. In the exemplary embodiment, Iterative Extended Kalman Filters (IEKF) are employed. Two different MAB approaches are described herein by way of example: Thompson sampling, which is based on the probability matching principle, and UCB (Upper Confidence Bound) sampling, which is based on the principle of optimism in the face of uncertainty.
The exemplary system and method can be employed in a recommender system as described, for example, in above-mentioned application Ser. No. 14/669,153, incorporated herein by reference. Briefly, when a user u calls the system at time t for a recommendation, an item i is chosen that will simultaneously satisfy the user and improve the quality estimate of the parameters related to both the user u and the proposed item i. When the system then receives a new feedback in the form of a user, item, rating, time tuple (<u,i,r,t> tuple), it updates corresponding entries of user and item latent factor matrices and user and item biases. The present system also updates posterior covariance matrices over the factor estimates. The present system and method provide an algorithm entailing fairly basic algebraic computations, which allows updates to be made without the need for matrix inversion or singular value decomposition. This enables the exemplary algorithm to update the parameters of the model and make recommendations, even with a high arrival rate of, for example, several thousand ratings per second.
The term “user” as used herein encompasses any person acting alone or as a unit (e.g., a customer, corporation, non-profit organization, or the like) that rates items. An “item” as used herein is a product, service, or other subject of ratings assigned by users.
With reference to
The online retail store website 10 enables users to rate items, such as products or services, which may be available on the website 10, and may be identified by a unique identifier. In an illustrative example, the user ratings (i.e., observed ratings) are on a one-to-five star integer scale. However, any other rating scale can be employed. The ratings are suitably organized in an n×m user ratings matrix 32.
During the course of a user's session on the online retail store website 10, it is advantageous to direct recommendations to the user. For this, the website 10 utilizes a recommender system 40, which may be hosted by the website server computer 12 or by a separate computing device 42, which may be communicatively connected with the computer 12, as illustrated. In the illustrated embodiment, the website server computer 12 and recommender server computer 42 both have access to a database 30, which stores data previously collected from the users, although other arrangements are contemplated. For example, the website 10 may send raw data to the recommender system 40 which generates and stores the matrix 32 locally.
The recommender system 40 includes memory 44 which stores instructions 46 for performing the method described with reference to.
The instructions 46 include an adaptive matrix completion (AMC) component 60 and a recommendation component 62.
The AMC component 60 decomposes the user ratings matrix 32 into user and item latent factor matrices 72, 74 denoted L and R, to minimize errors between the current user matrix 32 and a reconstructed predicted ratings matrix 76 (denoted {tilde over (X)}, where {tilde over (X)}=LRT). The predicted ratings matrix, unlike the sparse user ratings matrix 32, includes a value in each of the cells for the user, item pairs. When a new observation 78 is received in the form of a tuple: user ID, item ID, rating, time-stamp indicating the time at which the rating was submitted (<u,i,r,t> tuple), the AMC component 60 updates the user and item latent factor matrices 72, 74 and hence the reconstructed matrix 76 as well as user and item biases for the respective user and item. For each rating, the AMC component 60 updates at least one latent factor vector (or row) of the user and item factor matrices 72, 74 and user and item biases. In the exemplary embodiment, this is achieved with Kalman filters 79. The Kalman filter keeps track of the estimated state of the system and the variance (or uncertainty) of that estimate.
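The decomposition performed by the AMC component 60 can be sketched as follows. The shapes follow the description (one row per user in L, one row per item in R, and a dense reconstruction {tilde over (X)}=LR^T), while the matrix sizes and the random initialization are illustrative assumptions, not values prescribed by the method:

```python
import numpy as np

def reconstruct(L, R):
    """Predicted ratings matrix: X~ = L R^T, one value per (user, item) pair."""
    return L @ R.T

rng = np.random.default_rng(0)
n_users, n_items, K = 4, 6, 2           # illustrative sizes; K latent factors
L = rng.normal(size=(n_users, K))       # user latent factor matrix (row per user)
R = rng.normal(size=(n_items, K))       # item latent factor matrix (row per item)

X_pred = reconstruct(L, R)              # dense, unlike the sparse ratings matrix
```

Unlike the sparse user ratings matrix, every (user, item) cell of `X_pred` holds a prediction, including pairs never observed.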
The recommendation component 62 receives as input a query 80 from the website 10 and outputs a recommendation 82 based on the updated predicted ratings matrix 76. Various types of query are contemplated. For example, the query 80 may include a user ID and seek a recommendation of an item (or set of n items) to be proposed to the corresponding user 14. To do this, the recommendation component 62 uses a multi-arm bandit approach to Kalman filtering which takes into account the predicted ratings for the items (from the row of the matrix 76 for that user), and also the uncertainty of those predictions. The aim is to satisfy the user 14 by providing a recommendation with a high predicted rating, while at the same time, recommending an item which is expected to reduce the uncertainty in the predictions. Thus, for example, if the user is looking for a movie recommendation and has not yet rated any horror movies, the recommender system may recommend a horror movie to the user, provided that the predicted ratings matrix does not predict very low user ratings for all the horror movies in the collection of items. The rating of the user for this item is thus expected to be informative and lead to better predictions in the future than if the system recommended a Western movie to a user who has already rated a number of Westerns. As will be appreciated, the system may also be used to identify users to make a recommendation for a particular item, based on the predicted ratings that the users would give and the uncertainty in the predictions.
The recommendation 82 is output by the system to the website 10, which may then display one or more of the recommended items to the user on the display 20 of the client device.
The sequence of generating the query 80, receiving the recommendation 82, and displaying the recommendation 82 on the display 20 of the client device 16 can occur in various settings. For example, when a user selects an item for viewing, the online retail store website 10 may generate the query 80 as the (user, item) pair, and then display the store's prediction for the user's rating of the item in the view of the item. To increase user purchases, in another variation, when the user views an item in a given department, the query 80 is generated as the user alone (not paired with the item) but optionally with the query constrained based on the viewed item (e.g., limited to the department of the store to which the viewed item belongs). The recommended items are then displayed in the view of the selected item, along with suitable language such as “You may also be interested in these other available items:” or some similar explanatory language. The displayed recommended items may be associated with hyperlinks such that user selection of one of the recommended items causes the online retail store website 10 to generate a view of the selected recommended item. In another embodiment, the website may pay for click-through advertisements to be displayed next to content on another website. Given the user ID, the items recommended for that user may be displayed in the click-through advertisement (which, when clicked on by the user, takes the user to a page of the store website). Or the website may be configured for recommending items to users on request, such as a request for a recommendation of a movie currently being shown in the user's neighborhood or of a local restaurant.
The computer system 40 may include one or more computing devices 42, such as a PC, such as a desktop, a laptop, palmtop computer, portable digital assistant (PDA), server computer, cellular telephone, tablet computer, pager, combination thereof, or other computing device capable of executing instructions for performing the exemplary method.
The memory 44 may represent any type of non-transitory computer readable medium such as random access memory (RAM), read only memory (ROM), magnetic disk or tape, optical disk, flash memory, or holographic memory. In one embodiment, the memory 44 comprises a combination of random access memory and read only memory. In some embodiments, the processor 48 and memory 44 may be combined in a single chip. Memory 44 stores instructions for performing the exemplary method as well as the processed data.
The network interface 50, 52 allows the computer to communicate with other devices via a computer network, such as a local area network (LAN) or wide area network (WAN), or the internet, and may comprise a modulator/demodulator (MODEM), a router, a cable, and/or an Ethernet port.
The digital processor device 48 can be variously embodied, such as by a single-core processor, a dual-core processor (or more generally by a multiple-core processor), a digital processor and cooperating math coprocessor, a digital controller, or the like. The digital processor 48, in addition to executing instructions 46 may also control the operation of the computer 42. Computer 12 may be configured similarly to computer 42, with respect to its hardware.
The term “software,” as used herein, is intended to encompass any collection or set of instructions executable by a computer or other digital system so as to configure the computer or other digital system to perform the task that is the intent of the software. The term “software” as used herein is intended to encompass such instructions stored in storage medium such as RAM, a hard disk, optical disk, or so forth, and is also intended to encompass so-called “firmware” that is software stored on a ROM or so forth. Such software may be organized in various ways, and may include software components organized as libraries, Internet-based programs stored on a remote server or so forth, source code, interpretive code, object code, directly executable code, and so forth. It is contemplated that the software may invoke system-level code or calls to other software residing on a server or other location to perform certain functions.
At S102, a sparse user ratings matrix 32 is decomposed into a user latent factor matrix 72 which, for each of a set of users, includes a value for each of a set of K latent factors, and an item latent factor matrix 74 which, for each of a set of items, includes a value for each of a set of K latent factors. The product of the latent factor matrices is a reconstructed ratings matrix which includes values for each user, item pair. User and item biases are also computed.
At S104, at least one observation <u,i,r,t> is received and may be stored in memory 44.
At S106, adaptive matrix completion is performed using Extended Kalman filters for each of the plurality of times t. This step may include the following substeps, as illustrated in
At each time t, when an observation is received it is used to update corresponding rows of the latent factor matrices 72, 74 for the corresponding user and item. The updating includes a predictor step (S202) and an iterative corrector step (S204). In the predictor step, covariance estimates are computed for a user bias, an item bias, and for the respective rows of the user and item latent factor matrices (corresponding to the user and item) as a function of the respective covariance estimates at a respective prior time t−1 when an observation concerning the respective user or item was made and a (weighted) respective difference in time between the current time t and the prior time t−1.
In the corrector step, rows of the user and item latent factor matrices are initialized with their prior values at time t−1 (S206A). Then, for at least one iteration, the following are computed:
S206B. An update factor is computed as a function of the prior covariance estimates and the standard deviation of a noise probability distribution.
S206C. A user latent factor filter gain matrix is then computed for the row of the user latent factor matrix corresponding to the user as a function of the update factor, the user bias covariance estimate at time t, and the respective row of the user latent factor matrix.
S206D. An item latent factor filter gain matrix is computed for the row of the item latent factor matrix corresponding to the item as a function of the update factor, the item bias covariance estimate at time t, and the respective row of the item latent factor matrix.
S206E. The respective rows of the user latent factor matrix and item latent factor matrix are each updated as a function of the respective filter gain matrix, the rating of the user for the item, the user bias at time t−1, the item bias at time t−1, and the respective user, item element of the reconstructed matrix at time t−1, and a fixed weight. After one or more iterations, at S206F, user and item bias filter gain values are computed as a function of the current update factor and the respective prior user bias and item bias covariance estimates, the user and item biases at time t are each computed as a function (e.g., sum) of their prior values and a function of respective computed user or item bias filter gain value, the rating of the user for the item, the user bias at time t−1, the item bias at time t−1, and the respective user, item element of the reconstructed matrix at time t−1, and the fixed weight. The covariance estimates for the user bias, the item bias, and for the respective rows of the user and item latent factor matrices are also updated.
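The predictor and corrector steps above can be sketched as follows. This is a simplified, diagonal-covariance illustration of one update for a single observation; the inflation rate q, noise level sigma, and iteration count are assumed hyperparameters rather than values prescribed by the method, and the step labels in the comments are approximate correspondences:

```python
import numpy as np

def ekf_update(r, t, state, q=0.01, sigma=0.5, n_iter=2):
    """One EKF predictor/corrector pass for a single <u,i,r,t> observation.

    state maps each of "a" (user bias), "b" (item bias), "L" (user latent
    factor row), "R" (item latent factor row) to a tuple
    (value, covariance, last_update_time); covariances are kept diagonal
    here for simplicity.
    """
    a, Pa, ta = state["a"]
    b, Pb, tb = state["b"]
    L, PL, tL = state["L"]
    R, PR, tR = state["R"]

    # Predictor step (S202): inflate each covariance estimate by the
    # (weighted) time elapsed since that user/item was last observed.
    Pa = Pa + q * (t - ta)
    Pb = Pb + q * (t - tb)
    PL = PL + q * (t - tL)
    PR = PR + q * (t - tR)

    # Iterative corrector step (S204): re-linearize around the current
    # factor estimates on each pass, starting from the prior values.
    L_new, R_new = L.copy(), R.copy()
    for _ in range(n_iter):
        innov = r - (a + b + float(np.dot(L_new, R_new)))  # innovation
        # Innovation variance of the linearized observation model.
        S = (sigma**2 + Pa + Pb
             + float(np.dot(R_new**2, PL)) + float(np.dot(L_new**2, PR)))
        KL = PL * R_new / S    # gain for the user latent factor row
        KR = PR * L_new / S    # gain for the item latent factor row
        L_new = L + KL * innov
        R_new = R + KR * innov

    # Bias gains and corrections after the factor iterations (S206F).
    Ka, Kb = Pa / S, Pb / S
    state["a"] = (a + Ka * innov, (1 - Ka) * Pa, t)
    state["b"] = (b + Kb * innov, (1 - Kb) * Pb, t)
    state["L"] = (L_new, (1 - KL * R_new) * PL, t)
    state["R"] = (R_new, (1 - KR * L_new) * PR, t)
    return state

# Illustrative single update: priors set at time 0, observation at time 1.
state = {
    "a": (0.0, 1.0, 0.0), "b": (0.0, 1.0, 0.0),
    "L": (np.array([0.5, 0.5]), np.ones(2), 0.0),
    "R": (np.array([0.5, 0.5]), np.ones(2), 0.0),
}
state = ekf_update(r=2.0, t=1.0, state=state)
```

After the update, the predicted rating moves toward the observed rating and the posterior covariances shrink below their inflated prior values, mirroring the gain-weighted corrections described above.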
In the case when no training data has been observed for a particular user/item, i.e., the cold start case, the covariance estimates for the user bias, item bias, and the latent factor matrices can be set to relatively large values (corresponding to the large variance of the prior distributions over the model parameters), which are expected to reduce when more observations of these items/users are made.
At S108, a request is received for an item for proposing to a selected user.
At S110, the current latent factor matrices and user biases are used to select an item to sample, using one of the MAB sampling methods to provide a tradeoff between exploration and exploitation.
At S112, the recommendation is output.
If at S114, the user provides a rating of the proposed item, or another user provides a rating of an item, the method returns to S102. Otherwise the method ends at S116.
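The item selection at S110 can be sketched with the two MAB samplers described earlier. Thompson sampling draws one rating per item from its posterior; UCB scores each item by its predicted mean plus a confidence width. The means, variances, and the scale parameter beta below are illustrative placeholders for the quantities maintained by the Kalman filters:

```python
import numpy as np

def thompson_select(means, variances, rng):
    """Probability matching: sample one rating per item from its posterior
    and recommend the item with the best draw."""
    return int(np.argmax(rng.normal(means, np.sqrt(variances))))

def ucb_select(means, variances, beta=2.0):
    """Optimism in the face of uncertainty: score = mean + beta * std."""
    return int(np.argmax(means + beta * np.sqrt(variances)))

rng = np.random.default_rng(0)
means = np.array([3.9, 3.5, 2.0])        # predicted ratings for three items
variances = np.array([0.01, 1.5, 0.02])  # posterior variances of the predictions

i_ucb = ucb_select(means, variances)           # item 1: uncertain enough to explore
i_ts = thompson_select(means, variances, rng)  # stochastic; varies with the draw
```

With these values, UCB prefers item 1 despite its lower mean, because its large posterior variance makes the observation informative, which is the exploration/exploitation trade-off described above.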
Further details of the system and method will now be provided.
In the following, the terms “optimization,” “minimization,” and similar phraseology are to be broadly construed as one of ordinary skill in the art would understand these terms. For example, these terms are not to be construed as being limited to the absolute global optimum value, absolute global minimum, and so forth. For example, minimization of a function may employ an iterative minimization algorithm that terminates at a stopping criterion before an absolute minimum is reached. It is also contemplated for the optimum or minimum value to be a local optimum or local minimum value.
The training data (observations 78) comprises or consists of a sequence of tuples <user,item,rating,time-stamp>. Since the training data is very sparse (users only rate a small proportion of the items), a matrix factorization approach to collaborative filtering is used in which a user ratings matrix 32 is decomposed into user and item latent factor matrices 72, 74 and user and item biases, which minimize the errors between the original user ratings matrix and a reconstructed user ratings matrix 76. The number K of latent factors can be manually selected or determined automatically. In general, K<<n, and K<<m, e.g., K is at least 5, or at least 10, or at least 20, or at least 50. In this setting, each observed rating can be modeled as:
r_u,i = μ + a_u + b_i + L_u·R_i^T + ε
where μ represents a fixed weight; a_u, b_i, L_u and R_i are latent variables, respectively a user bias, an item popularity (bias), the user latent factors, and the item latent factors. T represents the transpose operator. L_u and R_i are row vectors of the respective latent factor matrices 72, 74, each with K components, K being the dimensionality of the latent space. The noise ε is assumed to be i.i.d. (independent and identically distributed) Gaussian noise, with mean equal to 0 and variance equal to σ², the square of the standard deviation σ. The completion of the reconstructed user ratings matrix 76 may involve minimization of a loss function which combines the reconstruction error over the training set and regularization terms, e.g., as follows:
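A single rating under this model is the sum of the fixed weight, the two biases, and the inner product of the latent factor rows. A minimal sketch, in which all numeric values are illustrative:

```python
import numpy as np

def predict_rating(mu, a_u, b_i, L_u, R_i):
    """r_u,i = mu + a_u + b_i + L_u . R_i^T (the zero-mean noise term
    is omitted when predicting)."""
    return mu + a_u + b_i + float(np.dot(L_u, R_i))

mu = 3.0                     # fixed weight (e.g., a global mean rating)
a_u, b_i = 0.2, -0.5         # user bias and item popularity (bias)
L_u = np.array([0.5, 1.0])   # user latent factor row, K = 2
R_i = np.array([1.0, 0.4])   # item latent factor row

r = predict_rating(mu, a_u, b_i, L_u, R_i)  # 3.0 + 0.2 - 0.5 + 0.9 = 3.6
```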
ℒ(a,b,L,R) = Σ_(u,i)∈Ω (r_u,i−μ−a_u−b_i−L_u·R_i^T)² + λ_a‖a‖² + λ_b‖b‖² + λ_L‖L‖_F² + λ_R‖R‖_F²   (1)
where Ω denotes the set of observed (user, item) pairs in the training set, ‖·‖_F denotes the Frobenius norm, and λ_a, λ_b, λ_L and λ_R are regularization weights.
The minimization of the loss function can be solved by gradient descent (optionally accelerated by the Limited Memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) quasi-Newton method), as described in Malouf, R., “A comparison of algorithms for maximum entropy parameter estimation,” Proc. Sixth Conf. on Natural Language Learning (CoNLL), pp. 49-55 (2002), or by Alternating Least Squares (for Lu and Ri).
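By way of a non-limiting illustration, the minimization of loss (1) by plain stochastic gradient descent can be sketched as follows (Python; the function name, learning rate and regularization values are illustrative assumptions, and the L-BFGS or ALS machinery described above is not reproduced):

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_items, K=20, lr=0.01,
                  lam=0.1, n_epochs=20, seed=0):
    """Minimize loss (1) by stochastic gradient descent.

    ratings: iterable of (user, item, rating) tuples.
    Returns the fixed weight mu, biases a, b and latent factors L, R.
    """
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])       # fixed global weight
    a = np.zeros(n_users)                          # user biases
    b = np.zeros(n_items)                          # item popularities
    L = 0.1 * rng.standard_normal((n_users, K))    # user latent factors
    R = 0.1 * rng.standard_normal((n_items, K))    # item latent factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            e = r - (mu + a[u] + b[i] + L[u] @ R[i])  # reconstruction error
            a[u] += lr * (e - lam * a[u])
            b[i] += lr * (e - lam * b[i])
            L[u], R[i] = (L[u] + lr * (e * R[i] - lam * L[u]),
                          R[i] + lr * (e * L[u] - lam * R[i]))
    return mu, a, b, L, R
```

On a toy set of a few ratings, a few hundred epochs suffice to drive the reconstruction error close to zero.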
It may be noted that this loss could alternatively be interpreted in a Bayesian setting as the Maximum A Posteriori (MAP) estimate, provided that all the latent parameters (au, bi, Lu and Ri) have independent Gaussian priors, with diagonal covariance matrices. In this case,
λ_L = σ²/σ_L², where σ_L² is the variance of the diagonal Gaussian prior on L_u, with λ_a, λ_b and λ_R being computed correspondingly from the variances of the priors on a_u, b_i and R_i.
The above describes a static setting. It is assumed, however, that the model parameters evolve over time, and have their own dynamics. The evolution is dependent on the time between observations. Since some of the latent variables result from rare observations, they tend to have a high variability, while others that result from more frequent observations show less variability between observations. As an example, if a user has recently been observed, it can be assumed that the latent factors for that user have not changed much in the intervening period, whereas for a user who was last observed a long time ago, the latent factors can be assumed to have a much higher variability.
One approach to reconstructing the evolution of these parameters (considered as latent variables) from the sequence of observations relies on the use of Extended Kalman Filters (EKF), which are a non-linear version of Kalman Filters. As used herein, EKF are a generalization of recursive least squares for dynamic systems with “smooth” non-linearities and aim at estimating the current (hidden) state of the system:
x_t = f(x_{t−1}) + w_t
y_t = h(x_t) + z_t
x_0 ~ N(x*_0, Λ)
w_t ~ N(0, Q_t)
z_t ~ N(0, σ_t²)
where N(m,C) denotes a multi-variate Gaussian distribution with mean m and covariance matrix C (the mean m being x*_0 for x_0 and 0 for w_t and z_t, and the covariance matrix C being Λ for x_0, Q_t for w_t and σ_t² for z_t).
The function f can be used to generate a prediction of the latent state x_t, denoted x̂_t, from the previous estimate and, similarly, the function h can be used to compute the predicted observation ŷ_t from the predicted state.
The Kalman Filters follow a general predictor-corrector (update) scheme, in which a predictor step is performed to predict the state based on the prior state of the system, and a corrector step updates the state estimate based on an observation. A filter gain matrix K_t, which is applied to the prediction error at time t, and a covariance matrix P_{t|t} of the posterior distribution of the state estimates are maintained and updated.
In the predictor step, the state at time t (x̂_{t|t−1}) is estimated as a time-dependent function of the estimate of the state at a previous time t−1. The predicted covariance estimate at time t (P_{t|t−1}) is then computed as a function of the estimated state at time t, the updated covariance estimate at time t−1 (P_{t−1|t−1}), and the covariance matrix Q_t of the process noise w_t:
Predictor Step:
Predicted state estimate:
x̂_{t|t−1} = f(x̂_{t−1|t−1})
Predicted covariance estimate:
P_{t|t−1} = J_f(x̂_{t|t−1}) P_{t−1|t−1} J_f(x̂_{t|t−1})^T + Q_t
where J_f is the Jacobian matrix of the function f (J_f is defined by [J_f]_{jk} = ∂f_j/∂x_k, evaluated at the current state estimate).
In the corrector step, the filter gain matrix K_t is computed as a function of the predicted covariance estimate P_{t|t−1} and the estimate of the state at time t (x̂_{t|t−1}), obtained from the predictor step, and the observation noise variance σ_t².
The state estimate is then updated as a function of {circumflex over (x)}t|t-1 computed in the predictor step, the filter gain matrix Kt and the difference between the actual observation yt and the predicted observation at time t, computed as h({circumflex over (x)}t|t-1). The covariance estimate is also updated as a function of the filter gain matrix Kt, and the predicted covariance estimate at time t (Pt|t-1).
Corrector Step:
Filter gain matrix:
K_t = P_{t|t−1} J_h(x̂_{t|t−1})^T (J_h(x̂_{t|t−1}) P_{t|t−1} J_h(x̂_{t|t−1})^T + σ_t²)^{−1}
Updated state estimate:
x̂_{t|t} = x̂_{t|t−1} + K_t·[y_t − h(x̂_{t|t−1})]
Updated covariance estimate:
P_{t|t} = [I − K_t J_h(x̂_{t|t−1})] P_{t|t−1}
where J_h is the Jacobian matrix of the function h (J_h is defined by [J_h]_{jk} = ∂h_j/∂x_k, evaluated at the current state estimate).
In practice, an Iterated Extended Kalman Filter (IEKF) can be used, where the first two equations of the corrector step are iterated until the iterate x̂_{t|t}^{(i)} is stabilized, gradually offering a better approximation of the non-linearity through the Jacobian matrices.
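By way of illustration, one predictor-corrector step of such a filter, for a scalar observation, may be sketched as follows (Python with NumPy; the function name and the explicit relinearization term inside the loop, which is the standard IEKF refinement, are illustrative assumptions):

```python
import numpy as np

def ekf_step(x_est, P, y, f, h, Jf, Jh, Q, sigma2, n_iter=3):
    """One predictor-corrector step of an (Iterated) Extended Kalman
    Filter for a scalar observation y.

    f, h: transition and observation functions; Jf, Jh return their
    Jacobians evaluated at a given state estimate.
    """
    # Predictor step
    x_pred = f(x_est)                               # predicted state estimate
    Fj = np.atleast_2d(Jf(x_pred))
    P_pred = Fj @ P @ Fj.T + Q                      # predicted covariance estimate
    # Corrector step, iterated until the state estimate stabilizes
    x = x_pred.copy()
    for _ in range(n_iter):
        H = np.atleast_2d(Jh(x))                    # 1 x n observation Jacobian
        S = (H @ P_pred @ H.T).item() + sigma2      # innovation variance
        K = (P_pred @ H.T) / S                      # n x 1 filter gain matrix
        innov = y - h(x) - (H @ (x_pred - x)).item()  # relinearized residual
        x = x_pred + (K * innov).ravel()
    P_new = (np.eye(len(x)) - K @ H) @ P_pred       # updated covariance estimate
    return x, P_new
```

In the linear case this reduces to the standard Kalman update: with f and h the identity on a one-dimensional state, a prior N(0,1) and one observation y=1 with σ²=1, the posterior is N(0.5, 0.5).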
In order to apply these filters, the adaptive collaborative filtering can be expressed as a continuous-time dynamic system with the following equations, assuming that the tuple <u,i,r_{u,i}> is observed at time t:
a_{u,t} = a_{u,t−1} + w_a(u,t−1,t)
b_{i,t} = b_{i,t−1} + w_b(i,t−1,t)
L_{u,t} = L_{u,t−1} + W_L(u,t−1,t)
R_{i,t} = R_{i,t−1} + W_R(i,t−1,t)
r_{u,i,t} = μ + a_{u,t} + b_{i,t} + L_{u,t}·R_{i,t}^T + ε_t
where a_{u,0} ~ N(0,λ_a), b_{i,0} ~ N(0,λ_b), L_{u,0} ~ N(0,Λ_L), R_{i,0} ~ N(0,Λ_R) and ε_t ~ N(0,σ²),
a_{u,t−1} denotes the value of the bias of user u when that user last appeared in the system before time t. Similarly, b_{i,t−1} denotes the value of the popularity of item i when it last appeared in the system before time t. The abbreviated notation (t−1), as used herein, is thus contextual to an item and to a user. The symbol ~ denotes “drawn from the distribution.”
w_a(u,t−1,t), w_b(i,t−1,t), W_L(u,t−1,t), and W_R(i,t−1,t) are noises whose variance depends on the time elapsed since the last occurrence of the user (for w_a and W_L) or of the item (for w_b and W_R). This defines a Brownian motion for the temporal evolution of the parameters.
ε_t represents the noise in the observation (rating) at time t and is assumed to be i.i.d. (independent and identically distributed) Gaussian noise, with mean 0 and variance σ².
The parameters a_u, b_i, L_u and R_i are thus all assumed to follow some kind of Brownian motion (the continuous counterpart of a discrete random walk) with Gaussian noises whose respective variances are proportional to the time interval since the user/item last appeared in the system before the current time, denoted respectively Δ_u(t−1,t) and Δ_i(t−1,t): w_a(u,t−1,t) ~ N(0, Δ_u(t−1,t)·γ_a), w_b(i,t−1,t) ~ N(0, Δ_i(t−1,t)·γ_b), W_L(u,t−1,t) ~ N(0, Δ_u(t−1,t)·Γ_L) and W_R(i,t−1,t) ~ N(0, Δ_i(t−1,t)·Γ_R).
γ_a, γ_b, Γ_L and Γ_R are referred to as volatility hyper-parameters. It can be assumed that the hyper-parameters λ_a, γ_a and the diagonal covariance matrices Λ_L, Γ_L are identical for all users, and independent from each other. The same can be assumed for the hyper-parameters related to items.
With these assumptions, the application of the Iterated Extended Kalman filter equations gives:
Predictor Step:
P^a_{t|t−1} = P^a_{t−1|t−1} + Δ_u(t−1,t)·γ_a
P^b_{t|t−1} = P^b_{t−1|t−1} + Δ_i(t−1,t)·γ_b
P^L_{t|t−1} = P^L_{t−1|t−1} + Δ_u(t−1,t)·Γ_L
P^R_{t|t−1} = P^R_{t−1|t−1} + Δ_i(t−1,t)·Γ_R
In the predictor step, therefore, the covariance estimates P^a_{t|t−1}, P^b_{t|t−1}, P^L_{t|t−1} and P^R_{t|t−1} are each inflated in proportion to the time elapsed since the corresponding user or item was last observed.
Corrector Step:
Initialize L̂_u ← L̂_{u,t|t−1} and R̂_i ← R̂_{i,t|t−1}
(the accent ^ denotes an estimate of the respective value)
Iterate until convergence:
e_t = r_{u,i,t} − μ − â_{u,t|t−1} − b̂_{i,t|t−1} − L̂_u·R̂_i^T
ω = (σ² + P^a_{t|t−1} + P^b_{t|t−1} + R̂_i P^L_{t|t−1} R̂_i^T + L̂_u P^R_{t|t−1} L̂_u^T)^{−1}
K^L_t = ω P^L_{t|t−1} R̂_i^T
K^R_t = ω P^R_{t|t−1} L̂_u^T
L̂_u = L̂_{u,t|t−1} + e_t (K^L_t)^T
R̂_i = R̂_{i,t|t−1} + e_t (K^R_t)^T
Then:
K^a_t = ω P^a_{t|t−1}
K^b_t = ω P^b_{t|t−1}
â_u = â_{u,t|t−1} + K^a_t e_t
b̂_i = b̂_{i,t|t−1} + K^b_t e_t
P^a_{t|t} = P^a_{t|t−1}(1 − K^a_t)
P^b_{t|t} = P^b_{t|t−1}(1 − K^b_t)
P^L_{t|t} = (I − K^L_t R̂_i) P^L_{t|t−1}
P^R_{t|t} = (I − K^R_t L̂_u) P^R_{t|t−1}
where e_t denotes the prediction error for the observed rating,
with P^a_{0|0} = λ_a, P^b_{0|0} = λ_b, P^L_{0|0} = Λ_L and P^R_{0|0} = Λ_R, lower case (e.g., λ) generally being used to denote scalars and upper case (e.g., Λ) for denoting matrices. ω, K^a_t and K^b_t are scalars, while K^L_t and K^R_t are vectors of dimension K and P^L, P^R are K×K matrices.
In practice, it has been found that the iterative part of the corrector step may converge in only a few iterations (such as 2 or 3). It should be noted that, if a user is not well known (high covariances P^a and P^L), the filter gains are larger and the corresponding parameters are corrected more strongly by a new observation, whereas the parameters of well-observed users and items are adjusted only slightly.
The net effect is generally that if more observations are received in a relatively short period of time, the uncertainty (covariance estimate) decreases. The aim, however, is not to have the parameters converge to zero, since it is assumed that these parameters vary over time.
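By way of a non-limiting illustration, the predictor and corrector steps for a single observed rating may be sketched as follows (Python with NumPy; the function and argument names are illustrative, and the per-user/per-item time-stamp bookkeeping is omitted):

```python
import numpy as np

def adaptive_cf_update(r, mu, a_u, b_i, L_u, R_i, Pa, Pb, PL, PR,
                       dt_u, dt_i, gamma_a, gamma_b, Gamma_L, Gamma_R,
                       sigma2=1.0, n_iter=3):
    """One predictor-corrector update for an observed rating r
    of item i by user u."""
    # Predictor step: inflate covariances in proportion to elapsed time
    Pa = Pa + dt_u * gamma_a
    Pb = Pb + dt_i * gamma_b
    PL = PL + dt_u * Gamma_L
    PR = PR + dt_i * Gamma_R
    # Corrector step: iterate the latent-factor updates until stable
    L_hat, R_hat = L_u.copy(), R_i.copy()
    for _ in range(n_iter):
        s = sigma2 + Pa + Pb + R_hat @ PL @ R_hat + L_hat @ PR @ L_hat
        omega = 1.0 / s                       # inverse innovation variance
        KL = omega * (PL @ R_hat)             # gain for the user factors
        KR = omega * (PR @ L_hat)             # gain for the item factors
        e = r - (mu + a_u + b_i + L_hat @ R_hat)   # prediction error
        L_hat = L_u + KL * e
        R_hat = R_i + KR * e
    # Bias updates and covariance shrinkage
    Ka, Kb = omega * Pa, omega * Pb
    a_new, b_new = a_u + Ka * e, b_i + Kb * e
    Pa, Pb = Pa * (1.0 - Ka), Pb * (1.0 - Kb)
    I = np.eye(len(L_u))
    PL = (I - np.outer(KL, R_hat)) @ PL
    PR = (I - np.outer(KR, L_hat)) @ PR
    return a_new, b_new, L_hat, R_hat, Pa, Pb, PL, PR
```

A rating above the current prediction increases the user bias and item popularity, and each update shrinks the (previously inflated) covariances, as described above.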
The independence and Gaussian assumptions make it simple to compute the posterior distribution of the rating of a new pair <u,i> at time t: it is a Gaussian with mean μ + â_u + b̂_i + L̂_u·R̂_i^T and variance σ² + P^a_{t|t} + P^b_{t|t} + R̂_i P^L_{t|t} R̂_i^T + L̂_u P^R_{t|t} L̂_u^T.
In one embodiment, the IEKF method may be extended to introduce any smooth non-linear link function (e.g., r_{u,i,t} = g(μ + a_{u,t} + b_{i,t} + L_{u,t}·R_{i,t}^T + ε_t)), with g(x) being a sigmoid between the minimum and maximum rating values. This extension entails pre-multiplying each occurrence of P^{.}_{t|t−1} by the derivative of the sigmoid g at the current point (μ + â_u + b̂_i + L̂_u·R̂_i^T) in the equations of the corrector step.
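By way of illustration only, such a link function and its derivative may take the following form (Python; the rating bounds 1 and 5 are illustrative assumptions):

```python
import numpy as np

def g(x, r_min=1.0, r_max=5.0):
    """Sigmoid link squashing a raw score into [r_min, r_max]."""
    return r_min + (r_max - r_min) / (1.0 + np.exp(-x))

def g_prime(x, r_min=1.0, r_max=5.0):
    """Derivative of g, used to scale the predicted covariances
    in the corrector step."""
    s = 1.0 / (1.0 + np.exp(-x))
    return (r_max - r_min) * s * (1.0 - s)
```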
The hyper-parameters may be learned from training data through a procedure similar to the EM algorithm using Extended Kalman smoothers (a forward-backward version of the Extended Kalman Filters) as described, for example, in J. Sun, et al., “Collaborative Kalman filtering for dynamic matrix factorization,” IEEE Transactions on Signal Processing, 62(14):3499-3509, 2014, or by tuning them on a development set, whose time interval is later than the training set.
One property of the exemplary method is that it is easily parallelizable. This is because a tuple <u,i,r_{u,i}> will only modify the estimated parameters â_u, b̂_i, L̂_u, R̂_i and the variances/covariances P^a, P^b, P^L, P^R associated with user u and item i, so that observations involving disjoint sets of users and items can be processed in parallel.
Let θ denote the set of all parameters (biases a_u, b_i and latent factors L_u, R_i, for all u and i). If the true parameters θ* were known, for a given context (user u at time t), the system should recommend an item i* such that i* = argmax_i E(r|u,i,θ*), with P(r|u,i,θ*) = N(μ + a*_u + b*_i + L*_u·R*_i^T, σ²). If θ* is not known, it would be possible to marginalize over all possible θ through the use of the posterior p(θ|D), with D being the training data. This amounts to choosing i* = argmax_i (μ + â_u + b̂_i + L̂_u·R̂_i^T) if a Maximum A Posteriori (MAP) solution is adopted. However, this is a “one-shot” approach, considered as pure exploitation. As the setting is a multi-shot one, it is desirable to balance exploitation and exploration, which can be expressed by the concept of “regret” (the difference in expected rewards, or ratings, between a strategy that knows the true θ* and one based on a current estimate θ_t).
One or both of the following two different sampling strategies may be used to control this trade-off: Thompson sampling, based on the “probability matching” principle, and UCB (Upper Confidence Bounds) sampling, based on the principle of optimism in the face of uncertainty. See, for example, O. Chapelle, et al., “An empirical evaluation of Thompson sampling,” Proc. Adv. in Neural Information Processing Systems (NIPS), vol. 24, pp. 2249-2257 (2011) (hereinafter, Chapelle 2011) for an introduction to these strategies in the context of recommendation. A “contextual bandit” setting is assumed. This means that at each time step t, a context given by a single user u is observed, characterized by an imperfect estimate of her bias a_u and latent factors L_u (some kind of noisy context), and the system should then recommend an arm (i.e., an item) such that the choice of this arm will simultaneously satisfy the user and improve the quality of the estimates of the parameters related to both the user u and the proposed item i.
1. Thompson Sampling
The Thompson Sampling strategy can be expressed by Algorithm 1:
Algorithm 1 considers a set of times from t=1 to t=T. At each time, the identifier of a particular user, i.e., one of the set of users, is received. The parameters θ̃_t = (ã_{u,t}, b̃_{i,t}, L̃_{u,t}, R̃_{i,t}) are drawn from probability distributions generated from the training data, such as the observations obtained to date. An item to propose to the user is then selected which maximizes, over all items i, the expected reward (or score) E(r|u,i,θ̃_t), computed as the mean μ of the predicted rating distribution plus the product of the sampled L̃_{u,t} and R̃_{i,t}, to which the sampled ã_{u,t} and b̃_{i,t} are added. The item i* is proposed to the user and, assuming that the user provides a rating r_t for that item, the rating is used to update the training data (D = D ∪ {<u_t, i*, r_t>}). The parameters and their variances/covariances are then updated by the iterative Extended Kalman Filtering method, e.g., IEKF.
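By way of a non-limiting illustration, the selection step of Algorithm 1 may be sketched as follows (Python with NumPy; the function name and the per-item array layout are illustrative assumptions):

```python
import numpy as np

def thompson_select(mu, a_hat, Pa, b_hat, Pb, L_hat, PL, R_hat, PR, rng):
    """Thompson-sample an item for one user: draw each parameter from
    its Gaussian posterior and return the item with the highest
    sampled expected rating. b_hat, Pb, R_hat, PR are per-item arrays."""
    a = rng.normal(a_hat, np.sqrt(Pa))               # sampled user bias
    L = rng.multivariate_normal(L_hat, PL)           # sampled user factors
    scores = np.empty(len(b_hat))
    for i in range(len(b_hat)):
        b = rng.normal(b_hat[i], np.sqrt(Pb[i]))     # sampled item popularity
        R = rng.multivariate_normal(R_hat[i], PR[i]) # sampled item factors
        scores[i] = mu + a + b + L @ R               # sampled expected rating
    return int(np.argmax(scores)), scores
```

An item with a wide posterior can thus occasionally be sampled above a better-known item, which is the exploration mechanism.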
In one embodiment, the “Optimistic Thompson sampling” variant may be used (see Chapelle 2011). This results in the score never being smaller than the mean score. More precisely, in Algorithm 1, E(r|u,i,θ̃_t) is replaced by max(μ + ã_{u,t} + b̃_{i,t} + L̃_{u,t}·R̃_{i,t}^T, μ + â_u + b̂_i + L̂_u·R̂_i^T). Additionally or alternatively, the variance/covariance values/matrices may be pre-multiplied by a factor, such as 0.5, to favor exploitation.
2. UCB-Like Sampling
The Upper-Confidence-Bounds (UCB) algorithm can be represented in the pseudo-code shown in Algorithm 2:
Here, in the selection step, the quantity α·√(σ² + P^a_{t|t} + P^b_{t|t} + R̂_i P^L_{t|t} R̂_i^T + L̂_u P^R_{t|t} L̂_u^T), i.e., a multiple α of the posterior standard deviation of the predicted rating, is added to the mean predicted rating before taking the argmax over the items.
The Thompson sampling method provides a sampling from the set of items as a function of the respective distribution of predicted ratings and the uncertainty in the predicted ratings. For uncertain items the bell-shaped distribution is wide, whereas for items with lower uncertainty, the bell-shaped distribution is more peaked around the mean predicted rating, and the sampled predicted rating is likely to be closer to the mean. In the case of UCB sampling, an optimistic approach is taken, with only the part of the distribution of predicted ratings for an item that is above the mean (up to an upper cutoff, of around two times the standard deviation, for example) being considered.
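By way of illustration, UCB-like scores of this kind may be computed as follows (Python with NumPy; the function name and per-item array layout are illustrative assumptions):

```python
import numpy as np

def ucb_scores(mu, a_hat, Pa, b_hat, Pb, L_hat, PL, R_hat, PR,
               sigma2=1.0, alpha=2.0):
    """Mean predicted rating plus alpha posterior standard deviations
    of the prediction; b_hat, Pb, R_hat, PR are per-item arrays."""
    scores = np.empty(len(b_hat))
    for i in range(len(b_hat)):
        mean = mu + a_hat + b_hat[i] + L_hat @ R_hat[i]
        var = (sigma2 + Pa + Pb[i]
               + R_hat[i] @ PL @ R_hat[i]     # uncertainty via user factors
               + L_hat @ PR[i] @ L_hat)      # uncertainty via item factors
        scores[i] = mean + alpha * np.sqrt(var)
    return scores
```

The item maximizing these scores is proposed; an uncertain item can thus outrank an item with a slightly higher mean prediction.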
Whichever method of MAB sampling is used, the item recommended to the user may be one whose mean predicted rating is not as high as that of another item, but for which the uncertainty in the prediction is relatively high; recommending such an item will lead to improvements in the system in the future, assuming that the user rates it.
Other approaches for selecting an item to recommend to a user may be used instead of, or in combination with, one or both of the MAB approaches described herein. For example, the method described in application Ser. No. 14/669,153 could be employed.
Item recommendation is used in many applications. Items may be products in an online store, services such as restaurants or movies, workers/jobs in an HR scenario, locations or paths in transport applications, or advertising tags and creatives in on-line display advertising problems. In these recommendation scenarios, the environment can be highly dynamic and non-stationary, which precludes the use of standard static recommendation algorithms. Moreover, the arrival rate of new users or new items is, in general, very high, so that solving the cold-start problem is very useful.
Without intending to limit the scope of the exemplary embodiment, the following examples illustrate the application of the system and method.
Experiments have been performed on two datasets: MovieLens 10M and Vodkaster (Vodkaster (http://www.vodkaster.com) is a French movie recommendation website dedicated to rather movie-educated people). Each dataset is divided into three temporal, chronologically ordered splits: Train (90%), Development (5%), and Test (5%). Before using the data, the ratings corresponding to the early, non-representative (transient) time period are removed from both datasets (2.5M ratings for MovieLens, 0.3M ratings for Vodkaster). The Development set is used to tune the different hyper-parameters of the algorithm. These two datasets show very different characteristics, as illustrated in TABLE 1, especially in the arrival rate of new users.
The experiments are divided into two parts: one assessing separately the adaptive capacities of the exemplary method, the other evaluating the gain of coupling these adaptive capacities with Multi-Armed Bandits.
The experimental protocol is the following: the Extended Kalman Filters are run from the beginning of the training dataset, initializing the item and user biases to 0 and the latent factors to small random values drawn from the Gaussian prior distributions with covariance matrices Λ_L and Λ_R for all users and all items, respectively. The number of latent factors K is set to 20, without any tuning. Each user is associated with her own time origin (t=0 when the user enters the system for the first time), and similarly for the items. The values of the hyper-parameters to be tuned are the four variances of the priors and the four volatility values. It can be shown that all values of the hyper-parameters can be divided by σ² without changing the predicted value, so σ² can be set to 1. The parameter values chosen are the ones that optimize the Root-Mean-Squared-Error (RMSE) on the Development set. The Test set is then used to evaluate the RMSE of the predictions, as well as the Mean Absolute Error (MAE) and the average Kendall correlation coefficient (for users with at least two ratings in the Test set). Alternative methods considered are:
(1) the static setting, where matrix factorization is derived from the ratings of the Training and Development sets (hyper-parameters tuned on the Development set) and the extracted models are then applied to the Test set;
(2) Stochastic Gradient Descent applied to the biases and latent factors, with constant learning rates (four different learning rates: one for au, one for bi, one for Lu and one for Ri);
(3) the on-line Passive-Aggressive algorithm to incrementally update the biases and latent factors as described in K. Crammer, et al., “Online passive-aggressive algorithms,” J. Machine Learning Res., (7) pp. 551-585, 2006;
(4) Linear Kalman Filters applied to update only the user biases and latent factors.
The statistical significance of the differences in performance of the exemplary method with respect to the alternative approaches is evaluated, through paired t-tests on the paired sequences of measures (squared residuals for RMSE, absolute residuals for MAE, and Kendall's tau for each user).
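For reference, the three evaluation metrics may be computed as follows (a minimal Python sketch; the Kendall coefficient ignores tie corrections):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-Mean-Squared-Error of the predictions."""
    d = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(d ** 2)))

def mae(y_true, y_pred):
    """Mean Absolute Error of the predictions."""
    d = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.abs(d)))

def kendall_tau(x, y):
    """Kendall rank correlation between predicted and actual ratings
    for one user (at least two ratings; no tie correction)."""
    n, s = len(x), 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return 2.0 * s / (n * (n - 1))
```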
In TABLE 2, numbers in bold indicate that the p-value of the corresponding test is smaller than 1% (hypothesis H0: populations with equal means). The results show that the proposed method significantly improves performance according to all of the RMSE, MAE and Kendall's tau metrics (TABLE 2). Trends are very similar for both the MovieLens and Vodkaster datasets, despite their different characteristics. One particular advantage of the present method is the ability to maintain, without extra cost, a posterior distribution over the parameters and the prediction itself, which is a constituent of the sampling strategies of the MAB mechanism.
Extended Kalman Filters Coupled with MAB
This second set of experiments is performed on the MovieLens dataset only. The experimental protocol is aimed at the evaluation of the MAB strategies. It is assumed that users enter the system exactly as in the initial dataset (so the t and u values are retained from the original sequence of tuples <u,i,r,t>), but the system is allowed to propose an item other than the one that was chosen in the original sequence. It is also assumed that all items are available from the beginning (this is, of course, a simplified approximation of reality). Each time the system proposes an item, it receives a “reward,” or relevance feedback, which is 1 if the item was rated at least 4 and 0 otherwise. To be able to determine a reward value during the item selection process, the items that the user never rated are excluded. Different selection strategies are compared:
(1) GREEDY: A “pure exploitation” (or “one-shot”) strategy that greedily chooses the item not yet seen by the user that has the maximum predicted value, as given by the Extended Kalman Filters;
(2) UCB: A UCB-sampling strategy (with α=2);
(3) THOMPSON: A Thompson sampling strategy (optimistic variant; pre-multiplying the variances/covariances by 0.5).
The metrics used are the average precision (or equivalently the average reward) and the average recall after the system has presented n items to a user (n=10, 50 and 100). The average is computed over all users who have rated at least 100 items. The Extended Kalman Filters derived from the first set of experiments (Adaptive Matrix Completion) are used, applied from the beginning of the dataset. The results are given in TABLE 3.
Paired t-tests indicate that both the UCB and Thompson sampling strategies significantly outperform the greedy one at n=50 and n=100. Thompson sampling gives slightly better performance than UCB at n=100, but at the limit of significance (p-value=0.051).
A single framework that combines the adaptive tracking of user/item latent factors through Extended Non-linear Kalman filters and the exploration/exploitation trade-off used by the exemplary on-line learning setting (including cold-start) through Multi-Armed Bandits strategies has been described. Experimental results show that, at least for the datasets and settings that were considered, this framework provides a useful alternative to other approaches without being computationally expensive.
It will be appreciated that variants of the above-disclosed and other features and functions, or alternatives thereof, may be combined into many other different systems or applications. Various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims.