The following finds application in online retail, social media network recommender systems, and so forth.
In various applications, it is desired to model relationships between entities of different types in order to predict values for such relationships between specific entities. For example, in online retail systems, it is desirable to provide a shopper with recommendations. Such recommendations can be based on the shopper's previous purchase history, but this approach is of limited value if the shopper has a short (or non-existent) purchase history on the retail site, or if the shopper is browsing a different area than usual. Another approach, known as collaborative filtering, utilizes purchase histories of other shoppers, product recommendations or reviews provided by other shoppers, and so forth in order to generate recommendations. Qualitatively, it can be seen that if other shoppers with similar profiles to the current shopper (e.g., similar age, gender, past purchase history, et cetera) have tended to buy a particular item, then that item may be something that should be recommended.
In the online shopping example, entities may include users (i.e. shoppers), items for sale, item features, user features, and so forth. Another illustrative application that can utilize such relational data between entities is social media networks, where it is desired to recommend tags for uploaded items (e.g., images or video clips), to recommend other users as potential user-user links, or so forth. The goal in such systems is to generate predictions, e.g. to predict which items for sale are likely to be purchased by the shopper (so as to present those predicted items to the shopper), or to predict item tags the user is likely to choose to assign to an uploaded image, or so forth.
Such systems can operate by modeling single relations. However, it is recognized that modeling multiple relations involving different entities can leverage a larger body of relational data and improve the predictions. Some known approaches in the multi-relational context are based on matrix co-factorization.
Disclosed herein are improved probabilistic relational data analysis techniques and recommender systems employing same.
In some illustrative embodiments disclosed as illustrative examples herein, a non-transitory storage medium stores instructions executable by an electronic data processing device to perform a method including: representing a multi-relational data set by a probabilistic multi-relational data model in which each entity of the multi-relational data set is represented by a D-dimensional latent feature vector; training the probabilistic multi-relational data model using a collection of observations of relations between entities of the multi-relational data set wherein the collection of observations include observations of at least two different relation types, the training generating optimized D-dimensional latent feature vectors representing the entities of the multi-relational data set; and generating a prediction for an observation of a relation between two or more entities of the multi-relational data set based on a dot product of the optimized D-dimensional latent feature vectors representing the two or more entities.
In some illustrative embodiments disclosed as illustrative examples herein, an apparatus comprises a non-transitory storage medium as set forth in the immediately preceding paragraph, and an electronic data processing device configured to execute the instructions stored on the non-transitory storage medium.
In some illustrative embodiments disclosed as illustrative examples herein, a method is performed in conjunction with a multi-relational data set represented by a probabilistic multi-relational data model in which each entity of the multi-relational data set is represented by a latent feature vector. The method comprises: optimizing the latent feature vectors representing the entities of the multi-relational data set to maximize likelihood of a collection of observations of relations between entities of the multi-relational data set wherein the collection of observations include observations of at least two different relation types, the optimizing generating optimized latent feature vectors representing the entities of the multi-relational data set; and generating a prediction for an observation of a relation between two or more entities of the multi-relational data set based on the optimized latent feature vectors representing the two or more entities. The optimizing and the generating are suitably performed by an electronic data processing device.
With reference to
With continuing reference to
With continuing reference to
Having provided an overview of the relational data analysis techniques disclosed herein with reference to
Relation R1 on triples (user,tag,item) is a relation of order 3 and can be represented by a tensor, whereas Relations R2 and R3 are of order 2 and are on pairs (user,user) and (item, feature) respectively. Note that unlike item features, tags are typically user-dependent.
In the context of tag recommendations, a significant difficulty arises if there are new entities (e.g. new users or new items) that were not present in the collection of training observations. This is known as the cold start problem, and arises when the relational data analysis system does not have enough data to make “intelligent” predictions. It has been estimated that in social network tagging tasks, over 90% of items are new to the service system in the sense that they do not exist in the training data. See Yin et al., “A probabilistic model for personalized tag prediction”, in KDD, 2010. The cold-start problem is not specific to the social tagging system, but rather is a common problem facing general recommender systems including recommendations of users and movies. See Yin et al., “Structural link analysis and prediction in microblogs”, in Proceedings of 20th ACM Conference on Information and Knowledge Management (CIKM 2011), October 2011.
Moreover, relational data analysis is in some applications applied to perform tasks other than tag recommendations, such as document network analysis, social network analysis, and so forth. For instance, in social networks it is useful to recommend other users for a user to be connected with in order to promote social interaction. These tasks (e.g., tag and user recommendations) are co-related, and are advantageously addressed simultaneously rather than separately.
To address such correlated tasks simultaneously, it is disclosed herein to employ a probabilistic generative model for multi-relational data that consist of relations or views with different orders (2 for matrices, 3 for tensors and more), with the aim of inferring missing relation instances. The disclosed approach models all views of a social service system (or other data source undergoing relational data analysis), and can yield multitask recommendations (e.g., recommendations of tags and users) simultaneously. The intuition of the disclosed model is the following: a multitude of observed relation instances are informative of each other when predicting unobserved instances; and by representing each entity using a hidden (i.e. latent) feature vector, this entity is involved in the generative process of other relations or views.
The relational data analysis techniques disclosed herein are described by way of further illustrative examples below.
Consider a multi-relational dataset with K types of entities, with Nk entities of the kth entity type where k∈{1, . . . , K}. The number of relations (also sometimes referred to as “views” in certain applications) is V, and each relation (i.e. view) v∈{1, . . . , V} is associated with a list Sv of entity types involved in the relation v, that is Sv=(Sv1, . . . , Sv|S
With reference back to
is a list of entity indices identifying the observation with value $r_m \in \mathbb{R}$. The disclosed probabilistic multi-relational data model represents each entity as a latent (i.e. unobserved) continuous feature vector in the vector space $\mathbb{R}^D$, where D is the number of latent features for each entity and is typically relatively small (e.g. of order 10 or 100). The complete set of low-dimensional latent features 12 for all entities of all K entity types is denoted by Θ=(Θ1, . . . , ΘK) where the latent features for the Nk entities of type k are
A summary of the foregoing notation is set forth in Table 1. The variance $\alpha_v^{-1}$ given in Table 1 assumes that the distribution of observations is modeled as a Gaussian distribution with relation-dependent variances $\alpha_v^{-1}$. This is merely an illustrative example, and other types of generalized linear models, such as those based on Poisson or Bernoulli distributions, can be employed.
To facilitate understanding of the illustrative notation, the previously described example is considered. This example has K=4 entity types: users (u), items (i), item features (f), and tags (t); and three relations R1 (user u tagged item i with tag t), R2 (user u1 linked with user u2) and R3 (item i having feature f). Relation R1 links three different entity types and forms a three-dimensional array, while relations R2 and R3 are each encoded as two-dimensional arrays. To this end, the lists S corresponding to relations R1, R2, and R3 can be defined as {S1,S2,S3}, where S1={u, i, t}, S2={u, u}, and S3={i, f}, respectively.
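The notation of this example can be sketched as a small data structure. The following is a minimal illustration only: the entity counts, the `Observation` class, and the `is_valid` helper are hypothetical names and values introduced here, not part of the disclosed system.

```python
from dataclasses import dataclass

# N_k for each of the K=4 entity types (counts are illustrative assumptions).
ENTITY_COUNTS = {"u": 3, "i": 4, "f": 2, "t": 5}

# S_v: the ordered list of entity types involved in each relation v.
RELATIONS = {
    "R1": ("u", "i", "t"),  # user u tagged item i with tag t (order 3)
    "R2": ("u", "u"),       # user u1 linked with user u2 (order 2)
    "R3": ("i", "f"),       # item i has feature f (order 2)
}


@dataclass(frozen=True)
class Observation:
    relation: str   # v_m, the relation the observation belongs to
    indices: tuple  # i_m, one entity index per entry of S_v
    value: float    # r_m, the observed value


def is_valid(obs: Observation) -> bool:
    """Check that the index list matches the relation's order and entity ranges."""
    types = RELATIONS[obs.relation]
    if len(obs.indices) != len(types):
        return False
    return all(0 <= n < ENTITY_COUNTS[k] for k, n in zip(types, obs.indices))


observations = [
    Observation("R1", (0, 2, 4), 1.0),  # a cell of the three-dimensional array
    Observation("R2", (0, 1), 1.0),     # a cell of the user-user matrix
    Observation("R3", (2, 1), 0.5),     # a cell of the item-feature matrix
]
```

Storing every relation as a flat list of (relation, indices, value) records keeps matrices and tensors in one uniform container, which is convenient for observation-by-observation training.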
With reference to
where θki
The model training performed by the parameters estimation module 8 of
is the quadratic loss. It is also assumed that the prior distributions over Θ1, . . . , ΘK are independent isotropic Gaussian distributions with type-dependent variances $\sigma_1^2, \ldots, \sigma_K^2$, that is:
Given the foregoing model, the latent variables Θ are to be inferred given the observations D. In an illustrative approach, a maximum a posteriori (MAP) estimator of Θ is employed. A suitable smooth and differentiable objective function corresponding to a negative log-likelihood may be employed, e.g.:
The objective function O may be minimized respective to Θ by various approaches. It is to be appreciated that as used herein the term “minimized” and similar phraseology encompasses iterative techniques that iteratively reduce the value of the objective function O and terminate in response to a suitable termination criterion (e.g., iteration-to-iteration improvement becoming less than a termination threshold, largest percent change in a parameter being less than a termination threshold, et cetera) where the solution at termination is not the absolute global minimum. It is also to be appreciated that as used herein the term “minimized” and similar phraseology encompasses iterative techniques that may settle upon a local minimum that is not the global minimum.
One suitable iterative minimization approach is alternating least squares (ALS). ALS is a block-coordinate descent algorithm which minimizes objective function O with respect to the latent variables of one of the entity types, say Θk, by fixing the latent variables of the other entity types. This procedure is repeated for the latent variables of each entity type sequentially, ensuring that each step decreases the objective function O. The procedure is repeated until a suitable convergence criterion is met. The inner optimization problems are ordinary least squares which can be solved optimally.
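As a minimal sketch of the block-coordinate idea, the following code applies ALS to a single order-2 relation (one partially observed matrix) with a quadratic loss and an isotropic Gaussian penalty. The function name `als_matrix_factorization`, the penalty weight `lam`, and the synthetic data are assumptions made for illustration; the multi-relational case would repeat the same kind of update over every entity type.

```python
import numpy as np


def als_matrix_factorization(R, mask, D=2, lam=0.1, n_iters=20, seed=0):
    """Alternating least squares for R ~ U @ V.T using observed entries only."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = R.shape
    U = 0.1 * rng.standard_normal((n_rows, D))
    V = 0.1 * rng.standard_normal((n_cols, D))
    reg = lam * np.eye(D)
    history = []
    for _ in range(n_iters):
        # Block 1: solve for every row factor with the column factors fixed.
        for a in range(n_rows):
            obs = mask[a]
            Vo = V[obs]
            U[a] = np.linalg.solve(Vo.T @ Vo + reg, Vo.T @ R[a, obs])
        # Block 2: solve for every column factor with the row factors fixed.
        for b in range(n_cols):
            obs = mask[:, b]
            Uo = U[obs]
            V[b] = np.linalg.solve(Uo.T @ Uo + reg, Uo.T @ R[obs, b])
        # Penalized quadratic objective evaluated after the full sweep.
        resid = (R - U @ V.T)[mask]
        history.append(np.sum(resid**2) + lam * (np.sum(U**2) + np.sum(V**2)))
    return U, V, history


# Synthetic rank-2 matrix with roughly 80% of its entries observed.
rng = np.random.default_rng(1)
R = rng.standard_normal((6, 2)) @ rng.standard_normal((5, 2)).T
mask = rng.random(R.shape) < 0.8
U, V, history = als_matrix_factorization(R, mask)
```

Because each inner problem is a regularized least-squares problem solved exactly, the recorded objective values in `history` do not increase from one sweep to the next.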
Another suitable iterative minimization approach is stochastic gradient descent (SGD), which is advantageous in applications targeting large data sets for which even one pass through the data can be computationally intensive. In contrast, ALS performs K constituent passes per iteration, and other batch-type techniques can suffer similar difficulties. In view of such considerations, the illustrative parameters estimation module 14 (see
SGD algorithms minimize sum functions of the form $O=\sum_{m=1}^{M} O_m$, where M is typically large. At each iteration of the SGD algorithm, only the gradient of a single element of the sum, say $O_m$, is utilized. To apply this algorithm to the objective function O of Equation (5), the objective is decomposed in terms of a sum. The negative log-likelihood term $-\log p(D \mid \Theta,\alpha) = -\sum_{m=1}^{M} \log p(r_m \mid v_m, i_m, \Theta, \alpha)$ has a suitable form, but further consideration is called for when dealing with the penalization term $-\log p(\Theta \mid \sigma)$. Denoting by $\nu_{kn}$ the number of observations for the nth entity of type k, it can be shown that:
holds. Hence, the penalization term can be combined with the individual likelihood terms to obtain the following expressions:
which has the form O=Σm=1MOm=Σm=1MO(v
The gradient with respect to θkn can be computed for every observation (v,i,r) according to:
if k∈Sv and n∈i, and 0 otherwise. The function l′ denotes the first derivative of the loss with respect to the first parameter, i.e. l′(
If observations are chosen at random irrespective of the relation, the exact gradient of the full objective function O of Equation (5) is recovered on average (up to a factor of $M^{-1}$), i.e.:
Pseudocode for a suitable implementation of the SGD minimization is presented in Algorithm 1. Given a suitably chosen step size sequence $\eta=(\eta_l)_{l \geq 1}$, the SGD algorithm updates the latent features for which the gradient is non-zero at every step l. Each update can be interpreted as follows. For the mth training observation, SGD predicts the rating
The time complexity of updating all the latent parameters θkn per observation is of order O(KND), where $N=\sum_k N_k$ is the total number of entities. A single pass on the data is of order O(KNDM). Assuming the maximum number of sweeps is L, the total time complexity is O(KNDML). Hence, since K, N, and D are constants, the time complexity is linear in the number of observations, O(ML). In experiments reported herein, early stopping was used to decide when to stop the SGD algorithm, for example after approximately 100 sweeps in some experiments.
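The SGD procedure can be sketched as follows. This is an illustrative sketch rather than the disclosed Algorithm 1: it assumes a quadratic loss, a prediction formed as the sum over latent dimensions of the element-wise product of the involved feature vectors, a constant step size, and a penalty on each entity divided by its observation count. All function names, parameter values, and the synthetic data are hypothetical.

```python
import numpy as np


def predict(theta, types, idx):
    """Generalized dot product of the latent vectors of the involved entities."""
    vecs = [theta[k][n] for k, n in zip(types, idx)]
    return float(np.sum(np.prod(vecs, axis=0)))


def rmse(theta, obs, relations):
    errs = [predict(theta, relations[v], idx) - r for v, idx, r in obs]
    return float(np.sqrt(np.mean(np.square(errs))))


def sgd_train(obs, relations, counts, D=2, alpha=1.0, sigma2=1.0,
              step=0.02, sweeps=300, seed=0):
    rng = np.random.default_rng(seed)
    theta = {k: 0.1 * rng.standard_normal((n, D)) for k, n in counts.items()}
    # nu[k][n]: number of observations involving entity n of type k.
    nu = {k: np.zeros(n) for k, n in counts.items()}
    for v, idx, _ in obs:
        for k, n in zip(relations[v], idx):
            nu[k][n] += 1
    for _ in range(sweeps):
        for m in rng.permutation(len(obs)):
            v, idx, r = obs[m]
            types = relations[v]
            # Snapshot of the involved vectors so every gradient uses old values.
            vecs = [theta[k][n].copy() for k, n in zip(types, idx)]
            err = np.sum(np.prod(vecs, axis=0)) - r
            for j, (k, n) in enumerate(zip(types, idx)):
                others = np.prod([x for p, x in enumerate(vecs) if p != j], axis=0)
                grad = alpha * err * others + theta[k][n] / (sigma2 * nu[k][n])
                theta[k][n] -= step * grad
    return theta


# Synthetic data for the three relations of the running example (assumed sizes).
COUNTS = {"u": 5, "i": 6, "t": 4, "f": 3}
RELATIONS = {"R1": ("u", "i", "t"), "R2": ("u", "u"), "R3": ("i", "f")}
rng = np.random.default_rng(1)
truth = {k: rng.uniform(0.5, 1.5, size=(n, 2)) for k, n in COUNTS.items()}
obs = []
for v, types in RELATIONS.items():
    for _ in range(40):
        idx = tuple(int(rng.integers(COUNTS[k])) for k in types)
        obs.append((v, idx, predict(truth, types, idx)))

theta_init = sgd_train(obs, RELATIONS, COUNTS, sweeps=0)  # initial vectors only
theta_fit = sgd_train(obs, RELATIONS, COUNTS, sigma2=10.0)
rmse_before = rmse(theta_init, obs, RELATIONS)
rmse_after = rmse(theta_fit, obs, RELATIONS)
```

Only the latent vectors of the entities appearing in the sampled observation are touched at each step, so in this sketch the per-observation cost depends on the order of the relation and on D rather than on the total number of entities.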
Under the Gaussian distribution assumption the hyperparameters α1, . . . , αV correspond to a weighting of the different relations. The values of αv can be manually set, or the residue can be used to estimate them as follows:
In the following, some illustrative experimental results are presented.
The first data set considered is the MovieLens data set (available at http://www.grouplens.org/node/73). This data set consists of one million ratings from 6,000 users on 4,000 movies with time stamps between April 2000 and February 2003. The ratings are integer scores ranging from 1 to 5. To these ratings is associated user demographic information (e.g., age, gender, occupation, etc.), as well as movie information (e.g., movie title, release date, genres, etc.). In the experiments reported herein, the user features were restricted to the age, the gender, and the occupation, and only the genre feature was used to describe movies. (Note that the illustrative example of
User, user feature, movie, and movie genre are denoted herein respectively by u, f, m, and g. In the experiments, the relations considered were S1=(u,m), S2=(u, f), and S3=(m, g). Based on the time stamp of the ratings (Jan. 1, 2001), we split the ratings data into a training set of 904,796 instances and a test set of 95,413 instances. For the two other relations, that is S2 and S3, one out of every ten entries was randomly selected for testing and the rest were used as training data. The objective of the experiments was not only to predict ratings, but also to predict unobserved user features and unobserved movie genres.
The second data set was downloaded from Flickr® with the help of social-network connectors (Flickr API). This gave access to entities and relations. The data set includes 2,866 users, 60,339 tags, 46,733 items (e.g., images). Three relations were considered. Relation S1=(u, t, i) indicates that user u assigned tag t to item i. Relation S2=(i, f) characterizes item i by a feature vector f of 1024 dimensions. 1024-dimensional visual feature vectors were extracted in accord with Perronnin et al, “Fisher kernels on visual vocabularies for image categorization”, in CVPR, 2007. Relation S3=(u1,u2) encodes a partially observed adjacency matrix representing the explicitly expressed friendship relations among users. For instance, if user u1 and u2 are friends, then the value at (u1,u2) and (u2,u1) are both equal to 1, otherwise they are equal to 0.
The first task performed was the tag prediction task, that is predicting tags that users will assign to items. To this end relation S1 is modeled, for which the Flickr data set has a total of 373,125 records with time stamps. The data is partitioned into training and test sets by the time stamp of Apr. 1, 2010. In total, there are 2,613,388 observations for training and 205,880 observations for test. Note that the Flickr data set contains only positive samples; accordingly, 50 tags were sampled at random as negative samples for training the model. The second task that was performed is predicting image features. This corresponds to relation S2, for which 50,000×1,024 feature values were available. One tenth of the values were randomly selected as test observations. The third task performed was to recommend new friends to users. This corresponds to relation S3. Again, the data were split randomly, resulting in 1,377,548 training observations and 342,576 test observations.
Finally, the third data set is Bibsonomy, which is a bookmark data set employed in the ECML-PKDD'09 Challenge Workshop (see http://www.kde.cs.uni-kassel.de/ws/dc09/). The data set includes 2,679 users, 263,004 items, 56,424 tags, 262,336 posts and 1,401,104 records. All of the posts are associated with time stamps. Each item contains textual content. A simple language model (bag-of-words) was first used to describe the items and then latent Dirichlet allocation (Blei et al., “Latent Dirichlet allocation”, Journal of Machine Learning Research, 3:993-1022, January 2003) was used to produce a latent representation of 100 dimensions. Two relations were considered for this data set: S1=(u, t, i), where user u applied tag t to item i, and S2=(i, f), where each item i is described by a 100-dimensional feature f. To model S1, a time stamp of Aug. 1, 2008 was used to produce a training and test set of respectively 7,214,426 and 1,585,179 observations. To model S2, there are 263,004×1024 feature values. Ten percent of them were selected as test data.
The disclosed relational data analysis was compared with conventional tensor/matrix factorization. On each data set, the problem was solved first by using individual factorizations, in which the whole recommendation is split into several independent tasks. Then the disclosed joint factorization model was used to solve these tasks simultaneously. Parameters α and σ² were set equal to 1 for all three data sets. For the MovieLens data set, the results are shown in Table 2. (Note that the disclosed relational data analysis is referenced as “PRF”, i.e. probabilistic relation factorization, in the tables herein).
The first three rows of Table 2 indicate the results when the three relations are factorized and compared with the independent factorization. Results show that the performance of PRF is better than the traditional probabilistic matrix factorization for the tasks of predicting ratings and item features. In this system, the main task is to predict ratings, that is to predict the relation S1=(u,m), which is co-related to the other two relations. To decide which relation contributes more to the rating prediction, the following two experiments were conducted. In the first experiment the PRF only processed S1=(u,m) and S2=(u,f). In the second experiment the PRF only processed S1=(u,m) and S3=(m,g). It is seen that the view S2 contributes more than S3. In addition, it is noteworthy that the combination of the three relations yields the best performance.
For the Flickr® data set, the results are shown in Table 3. The individual tensor factorization is used to predict tags. Similar experiments were conducted as for the MovieLens data set. The first three lines of Table 3 report PRF applied to factorize all three relations. Results for two PRF models are reported: PRF1, where the residue was used to estimate and update the α parameters at every iterative step, and PRF2, where α=1 throughout, which is the same setting as in the MovieLens experiment. The PRF is seen to lead to a noticeable improvement. The Root Mean Square Error (RMSE) associated with S2=(i, f) is also better than for the independent factorization. We also notice that for S3=(u1, u2), the independent factorization leads to slightly better results than PRF. However, the main task in this service is tag recommendation, and the relation S1 is co-related with S2 and S3. To decide which relation helps the tag prediction task most, two experiments were conducted as with the MovieLens experiments. As expected, the item feature is more critical to the tag prediction task. This result agrees with previous estimates that over 90% of cases are cold start problems in realistic social media data sets. Note that for the cold start problem, the content features of the items are essential for tag prediction because the items do not exist in the training data.
The Bibsonomy data is similar to the Flickr® data, but here there is no equivalent to the S3=(u1,u2) relation. The results for Bibsonomy are shown in Table 4. There are again two versions, PRF1 and PRF2, which are the same as used in the experiment with Flickr®. The results are consistent with the Flickr® data: PRF models noticeably decrease the RMSE for the tag prediction task. In the PRF1 version, performance improves significantly in both views: in (user, tag, item), the RMSE decreases from 1.04 to 0.3484 and in (item, feature), the RMSE decreases from 1.7387 to 1.0088.
From the experiments on the three data sets, improvement is obtained by considering the multiple relations in parallel. For Flickr® and MovieLens, where three relations were considered, modeling all three together leads to a better performance than modeling any two out of the three relations. For three tasks, the performance of the disclosed model in two out of three tasks is better than the performance of independent factorization models. In particular, for the main tasks, rating prediction and tag prediction, the disclosed model performs significantly better.
Reported empirical evaluation sometimes neglects the temporal information and generates test data randomly. Such a setting often leads to better reported performance; however, it does not reflect practical settings where future data cannot be used to predict past data. Accordingly, two experiments were conducted to analyze the effect of these data. Methods in which the test data is randomly sampled are referred to herein as “Static”. Methods where the test data is generated by splitting the whole data set by a time stamp are referred to herein as “Dynamic”. For the two methods, three pairs of training and test data sets were generated: 50% of the data for training and 50% for testing; 80% for training and 20% for testing; and 90% for training and 10% for testing. The results for the MovieLens data are presented in Table 5, where “PRF” is the disclosed model with all variances set to 1 and “Indep.” is shorthand for the independent factorization model. It is seen that the Static test achieves much better performance than the Dynamic test when the same amount of data is used for training. As the training data decreases, the performance of the Dynamic test decreases dramatically, which suggests that temporal analysis is advantageous. The disclosed model (PRF) outperforms the independent matrix factorization under both evaluation methods. The improvement is particularly noticeable in the Dynamic test, which is more realistic and practical. In the following experiments, the Dynamic test is employed as the main test method on the three data sets.
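The difference between the two evaluation settings can be illustrated with a short sketch. The record layout (a dictionary with a `timestamp` key), the function names, and the toy records below are assumptions made for illustration only.

```python
import random


def static_split(records, test_fraction, seed=0):
    """"Static" setting: test records are sampled at random, ignoring time."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def dynamic_split(records, test_fraction):
    """"Dynamic" setting: the most recent records form the test set."""
    ordered = sorted(records, key=lambda rec: rec["timestamp"])
    n_train = len(ordered) - round(len(ordered) * test_fraction)
    return ordered[:n_train], ordered[n_train:]


# Toy rating records with distinct, shuffled time stamps.
records = [
    {"user": m % 3, "item": m % 5, "rating": 1 + m % 5,
     "timestamp": 1000 + 7 * ((m * 13) % 20)}
    for m in range(20)
]

static_train, static_test = static_split(records, 0.2)
dynamic_train, dynamic_test = dynamic_split(records, 0.2)
```

In the Dynamic split every training record precedes every test record in time, so a model is never trained on data from the future of its test set; the Static split offers no such guarantee.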
Intuitively, the strength of the disclosed model resides in the fact that it can exploit multiple relations jointly. The improvement is expected to be the most apparent when the amount of data is small. To verify this conjecture, several test sets were generated by using different time stamps to split the MovieLens data.
Next, the frequencies of the specific item in the training set for each item of the test set were computed. For test items with the same frequencies in training and test, the average RMSE is calculated. The same calculations were done for users. The results are shown in
Next, the fraction of cold start cases is analyzed for the three data sets. For the MovieLens data, for the rating view, the ratings matrix was split into training and test data at Jan. 1, 2001. In the test data, we have 95,413 records, among which there are 1,026 records with new users (cold start) and 90 records with new items. For these cases, the independent matrix factorization does not apply. The PRF model, which makes use of the information about the whole system, performs better than independent matrix factorization on these cases, as already shown. It is also noteworthy that the number of test records with cold start users is far greater than the number of test records with cold start items. This explains why the (user, profile) view contributes more in the experiments than the (item, feature) view for the rating prediction task.
For Bibsonomy and Flickr® data, the tagging system is a triple-element relationship, so the data is even sparser than MovieLens data. For example, in Bibsonomy, the data is split into training data and test data by Aug. 1, 2008. In test data, 167,040 records are available, among which there are 105,185 records with new users (cold start) and 155,140 records with new items. For the Flickr® data, the results of the (item features) view can contribute more than the social network view, and it is consistent with the analysis that the number of the test records with new items is more than the number of test records with new users.
Finally, experiments were conducted with latent features of different dimensions D∈{5,10,20,30,40}; results for MovieLens, Flickr® and Bibsonomy are shown in Tables 6, 7 and 8 respectively. From these tables, it is seen that for MovieLens the test error increases with growing D, which indicates a trend toward overfitting. For the Flickr® data, the optimal D should lie between 10 and 20. However, for the Bibsonomy data it is less clear: there seem to be multiple minima, one for small D and one for a larger D.
Another aspect to consider is the convergence of the SGD algorithm, which is used to learn the PRF model. On all data sets, the algorithms show convergence empirically. However, in some cases the RMSE as a function of epochs (i.e., number of passes over the training data) may exhibit overfitting. The training and test RMSEs for Flickr and Bibsonomy are respectively shown in
One contemplated variant is to use more advanced factorization techniques. In the illustrative embodiments presented herein, the Parafac-Decomp formulation of tensor factorization was used, but other approaches such as Tucker decomposition (see L. Tucker, “Some mathematical notes on three-mode factor analysis”, Psychometrika, 31:279-311, 1966) are also contemplated. Tucker decomposition enables a more flexible parameterization of the problem thanks to the use of relation-specific core tensors, and would also enable entity specific latent dimensions D1, . . . , Dk instead of the constant dimension D used for all the entities in the illustrative examples set forth herein.
The performance of the probabilistic model is tied to careful tuning of the hyper-parameters when the latent feature vectors Θ=(Θ1, . . . , ΘK) are estimated by maximum a posteriori (MAP) estimation. When the hyper-parameters are not well tuned, a point estimate such as MAP can be vulnerable to overfitting, especially when the data set is large and sparse.
Instead of using MAP estimation 16 to minimize the objective function O, an alternative estimation scheme 18 is a fully Bayesian treatment, which integrates out all model parameters and hyper-parameters, arriving at a predictive distribution of future observations given observed data. Because this predictive distribution is obtained by averaging all models in the model space specified by the priors, it is less likely to over-fit a given set of observations.
With reference to
$r \mid \Theta_i \sim \mathcal{N}(\langle \Theta_i \rangle, \alpha_v^{-1})$, where $(v,i,r) \in D$ (12)
The prior distribution for the hidden features Θ is also assumed to be Gaussian, but the mean and the precision matrix (the inverse of the covariance matrix) may take arbitrary values:
$\theta_{kj} \sim \mathcal{N}(\mu_k, \Lambda_k^{-1}), \quad j = 1, \ldots, N_k$ (13)
An aspect of the illustrative fully Bayesian treatment is to view the hyper-parameters $\Phi_k \equiv \{\mu_k, \Lambda_k\}$ also as random variables, leading to a predictive distribution for an unobserved rating $(v,i,\hat{r})$ of:
$p(\hat{r} \mid D) = \iint p(\hat{r} \mid \Theta_i, \alpha_v)\, p(\Theta_i, \alpha, \Phi_i \mid D)\, d\{\Theta_i, \alpha_v\}\, d\{\Phi_i\}$ (14)
For convenience, we also define
We then choose the prior distributions for the hyper-parameters. For the Gaussian parameters, we choose conjugate priors, which facilitate subsequent computation:
$p(\alpha_v) = \mathcal{W}(\alpha_v \mid W'_0, \nu'_0)$
$p(\Phi_k) = p(\mu_k \mid \Lambda_k)\, p(\Lambda_k) = \mathcal{N}(\mu_k \mid \mu_0, (\beta_0 \Lambda_k)^{-1})\, \mathcal{W}(\Lambda_k \mid W_0, \nu_0)$ (15)
Here $\mathcal{W}$ is the Wishart distribution of a D×D random matrix Λ with $\nu_0$ degrees of freedom and a D×D scale matrix $W_0$:
$\mathcal{W}(\Lambda \mid W_0, \nu_0) = \frac{1}{C}\, |\Lambda|^{(\nu_0 - D - 1)/2} \exp\left(-\tfrac{1}{2} \mathrm{Tr}(W_0^{-1} \Lambda)\right)$
where C is a normalizing constant. There are several parameters in the hyper-priors: $\mu_0, \rho_0, \beta_0, W_0, \nu_0, W'_0, \nu'_0$, which reflect prior knowledge about the specific problem and can be treated as constants during training. In fact, Bayesian learning is able to adjust them according to the training data, and varying their values (within a reasonably large range) has little impact on the final prediction, as often observed in Bayesian estimation procedures. See Xiong et al., “Temporal collaborative filtering with Bayesian probabilistic tensor factorization”, In SIAM Data Mining, 2010.
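To make the Gaussian-Wishart hyper-prior concrete, the following sketch draws one sample of the hyper-parameters (μk, Λk). It assumes an integer number of degrees of freedom that is at least D, so that a Wishart draw can be formed as a sum of outer products of Gaussian vectors; the function names and the hyper-prior values used in the example are illustrative assumptions.

```python
import numpy as np


def sample_wishart(W0, nu0, rng):
    """Draw Lambda ~ Wishart(W0, nu0) for integer nu0 >= D via outer products."""
    D = W0.shape[0]
    X = rng.multivariate_normal(np.zeros(D), W0, size=nu0)  # nu0 draws of N(0, W0)
    return X.T @ X


def sample_gaussian_wishart(mu0, beta0, W0, nu0, rng):
    """Draw Lambda ~ W(W0, nu0), then mu ~ N(mu0, (beta0 * Lambda)^-1)."""
    Lam = sample_wishart(W0, nu0, rng)
    cov = np.linalg.inv(beta0 * Lam)
    cov = (cov + cov.T) / 2.0  # remove tiny numerical asymmetry
    mu = rng.multivariate_normal(mu0, cov)
    return mu, Lam


rng = np.random.default_rng(0)
D = 3
mu, Lam = sample_gaussian_wishart(np.zeros(D), 1.0, np.eye(D), D + 2, rng)
```

The precision matrix is drawn first and the mean second because the prior on the mean is conditioned on the precision, mirroring the factorization p(μk|Λk)p(Λk).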
One can represent the predictive distribution of the relation value r given observation (v,i,r)∈D by marginalizing over model parameters:
$p(\hat{r} \mid D) = \iint p(\hat{r} \mid \Theta_i, \alpha_v)\, p(\Theta_i, \alpha, \Phi_i \mid D)\, d\{\Theta_i, \alpha_v\}\, d\{\Phi_i\}$ (16)
In performing the Bayesian inference, the exact predictive distribution is often intractable, so one relies on approximate inference such as sampling methods based on Markov Chain Monte Carlo (MCMC). See Metropolis et al., “The Monte Carlo methods”, JASA, 44 (247):335-341, 1949; Radford M. Neal, Probabilistic inference using Markov Chain Monte Carlo methods, 1993. For instance, MCMC can be used to approximate the predictive distribution by the following equation:
$p(\hat{r} \mid D) \approx \frac{1}{L} \sum_{l=1}^{L} p(\hat{r} \mid \Theta_i^{(l)}, \alpha_v^{(l)})$
where the sample $\Theta_i^{(l)}$ is generated by running a Markov chain whose stationary distribution is the posterior distribution over the model parameters and hyper-parameters Θ, Φ.
One type of MCMC algorithm is Gibbs sampling. See Geman et al., “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images”, PAMI, 6(6):721-741, 1984. Gibbs sampling cycles through the latent variables, sampling each one from the conditional distribution given the current values of all other variables. Gibbs sampling is typically used when these conditional distributions can be sampled from easily. In the following, a derivation is given for the conditional distributions of model parameters and hyper-parameters, which are required for implementing Gibbs sampling. Note that with our model assumptions, the joint posterior distribution can be factorized as:
We start with the derivation of the conditional distributions of the model hyper-parameters. For each v, αv follows the Wishart distribution. By using the conjugate prior to αv, we have the conditional distribution of αv given Rv, Θ following the Wishart distribution:
Next, we derive the conditional probability for Φk. The graphical model (
The remaining conditional distributions are for model parameters Θk, and we describe the derivation of these distributions in this section. According to the graphical model (
Given these conditional probabilities for the model parameters Θ and the hyperparameters Φ and α, the Gibbs sampler algorithms for the Bayesian probabilistic multi-relational data factorization graphical model (BPRA) are shown in Algorithm 2.
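As a generic illustration of the Gibbs sampling cycle (and not of the BPRA sampler itself), the following Python sketch samples from a zero-mean, unit-variance bivariate Gaussian with correlation rho. The target distribution, the function name `gibbs_bivariate_normal`, and its parameters are assumptions chosen because each conditional is a one-dimensional Gaussian that is easy to sample from.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, burn_in=1000, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each variable is drawn in turn from its conditional given the current
    value of the other: x | y ~ N(rho * y, 1 - rho**2), and symmetrically.
    """
    rng = np.random.default_rng(seed)
    cond_std = np.sqrt(1.0 - rho ** 2)
    x, y = 0.0, 0.0
    draws = np.empty((n_samples, 2))
    for step in range(burn_in + n_samples):
        x = rng.normal(rho * y, cond_std)  # sample x given current y
        y = rng.normal(rho * x, cond_std)  # sample y given the new x
        if step >= burn_in:                # keep draws only after burn-in
            draws[step - burn_in] = (x, y)
    return draws
```

The same cycle applies when the variables are vectors or matrices; what changes is the form of each conditional distribution.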
Social media are by nature incremental processes, and much of the data is temporally related. In the following, the temporal aspect of the data is considered; a temporal factor is treated slightly differently from a regular factor. Because we desire to capture the evolution of global trends by using the temporal factors, a reasonable prior belief is that they change smoothly over time. Therefore we further assume that each time feature vector depends only on its immediate predecessor. Here, we assume t is the type of the temporal factor among all K types, t∈{1, . . . , K}, then:
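$$\theta_{tj} \sim \mathcal{N}(\theta_{t,j-1}, \Lambda_t^{-1}), \quad j = 2, \ldots, N_t$$
$$\theta_{t1} \sim \mathcal{N}(\mu_t, \Lambda_t^{-1}) \qquad (26)$$
and
$$p(\Phi_t) = p(\mu_t \mid \Lambda_t)\, p(\Lambda_t) = \mathcal{N}(\mu_t \mid \rho_0, (\beta_0 \Lambda_t)^{-1})\, \mathcal{W}(\Lambda_t \mid W_0, v_0) \qquad (27)$$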
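The smoothness assumption of Eq. 26 can be illustrated by drawing a chain of time feature vectors from the prior. The sketch below is a hedged illustration only: the function name `sample_temporal_factors` and its arguments are hypothetical, and the mean and precision are passed in as fixed values rather than being drawn from the Gaussian-Wishart prior of Eq. 27.

```python
import numpy as np

def sample_temporal_factors(mu, precision, n_steps, rng):
    """Draw a chain of time feature vectors following Eq. 26.

    mu:        mean vector for the first time step, shape (D,)
    precision: D x D precision matrix (its inverse is the covariance)
    n_steps:   number of time steps N_t
    rng:       a numpy.random.Generator
    """
    mu = np.asarray(mu, dtype=float)
    cov = np.linalg.inv(precision)
    theta = np.empty((n_steps, mu.shape[0]))
    # The first vector is centred on mu.
    theta[0] = rng.multivariate_normal(mu, cov)
    # Each later vector is centred on its immediate predecessor.
    for j in range(1, n_steps):
        theta[j] = rng.multivariate_normal(theta[j - 1], cov)
    return theta
```

A larger precision gives smaller steps between consecutive vectors, which is how the prior encodes the belief that trends change smoothly over time.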
Next, we would like to derive the conditional probability for Φt, i.e.:
The graphical model assumption suggests that it is conditionally independent of all the other parameters given Θt. We thus integrate out all the random variables in Eq. 4, except Θt. We obtain the Gaussian-Wishart distribution:
We consider the temporal features Θt. According to the graphical model, its conditional distribution factorizes with respect to individual entities:
where for j=1:
and for j=2, . . . , K−1:
and for j=K:
Given these conditional probabilities for the model parameters Θ and the hyper-parameters Φ and α, the Gibbs sampler algorithms for the Bayesian Probabilistic (multi-)Relational-data Analysis graphical model (BPRA) are shown in Algorithm 2.
In the following, empirical evaluations of the BPRA are reported on the already-described MovieLens, Flickr®, and Bibsonomy data sets. In addition to BPRA, the following methods were employed for comparison: (1) Probabilistic Matrix Factorization (PMF), i.e., independent collaborative filtering using probabilistic matrix factorization (see Salakhutdinov et al., “Probabilistic matrix factorization”, in NIPS, pages 1257-1264, 2008), which treats activities independently; (2) Bayesian Probabilistic Matrix Factorization (BPMF) (see Salakhutdinov et al., “Bayesian probabilistic matrix factorization using Markov chain Monte Carlo”, in ICML, pages 880-887, 2008); (3) Tensor Factorization (TF), which independently handles high-order relational data, e.g., tag prediction (see Rendle et al., “Learning optimal ranking with tensor factorization for tag recommendation”, in SIGKDD, pages 727-736, 2009; Rendle et al., “Pairwise interaction tensor factorization for personalized tag recommendation”, in WSDM, pages 81-90, 2010); (4) Bayesian Probabilistic Tensor Factorization (BPTF) (see Xiong et al., “Temporal collaborative filtering with Bayesian probabilistic tensor factorization”, in SIAM Data Mining, 2010), which models temporal collaborative filtering; and (5) Collective Matrix Factorization (CMF) (see Singh et al., “Relational learning via collective matrix factorization”, in SIGKDD, pages 650-658, 2008), which handles the 2-order problem with multiple matrix factorization tasks. The performance of the different methods is evaluated in terms of Root Mean Square Error (RMSE) on all three data sets, with the parameters μ0, ρ0, β0, W0, v0, W0′, v0′ all set to one or to the identity (as appropriate) and with D=20 for the dimension of the latent factors. To this end, we study how the system performs when the model takes into account coupled relations.
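For reference, the RMSE metric used throughout the evaluation is the square root of the mean squared difference between predicted and observed relation values. A minimal Python sketch follows; the function name `rmse` and its argument names are illustrative choices rather than part of the disclosed system.

```python
import numpy as np

def rmse(predicted, observed):
    """Root Mean Square Error between predicted and observed values."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    if predicted.shape != observed.shape:
        raise ValueError("predicted and observed must have the same shape")
    # Square the errors, average them, then take the square root.
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))
```

Lower values indicate better predictions, and a value of zero means every held-out entry was predicted exactly.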
With reference to
With reference to Table 9, prediction tests on the Flickr® data set are shown by tabulating the Root Mean Square Error (RMSE) for relations predicted using various methods. In Table 9, “PRA” employs the probabilistic multi-relational data model with MAP estimation, and “BPRA” with Bayesian estimation. The four relations are: C1=users tagging items (user, tag, item); C2=item content (item, feature); C3=social interaction (user, user); and C4=user's comments on item (user, item, comments). It is seen in Table 9 that the Bayesian method outperforms the MAP version. This is attributable to high data sparsity.
With reference to
With returning reference to Table 9, the question of whether coupled relations actually lead to mutually improved prediction performance is investigated. Experiments were conducted on modeling several combinations of relations to study this question. The first four rows of Table 9 present results when all four relations C1, C2, C3, C4 are modeled together. Rows 5-7 present results when three relations C1, C2, C4 are modeled together. Rows 8-10 present results when three relations C1, C3, C4 are modeled together. Rows 11-12 present results when two relations C1, C4 are modeled together. Note, however, that “modeled together” is not actually meaningful for a technique such as Probabilistic Matrix Factorization (PMF) or Tensor Factorization (TF), which does not leverage dependencies between relations, and so, for example, the PMF and TF data for a given relation are the same for all four experiments reported in rows 1-4, rows 5-7, rows 8-10, and rows 11-12 respectively.
When using BPRA or PRA which do leverage interdependencies, the results shown in Table 9 indicate that best performances are achieved for all four relations when modeling them together. Take the prediction of relation C1:(user, tag, item) using BPRA as an example. The best performance is 0.3073 in modeling all four relations. The performance is poorer at 0.3177 when modeling the three relations C1:(user, tag, item), C4:(user, comment, item) and C3:(user, user), and further degrades to 0.3465 when only modeling the two relations C1:(user, tag, item) and C4:(user, comment, item).
With continuing reference to Table 9, the evaluation of the disclosed BPRA and PRA models together with comparisons to other techniques (PMF, BPMF, TF, BPTF, and CMF) is discussed. Note that CMF can only model 2-order data, so the user factor is dropped in the reported CMF experiments. For instance, CMF only predicts the tags on item i, rather than the tags on item i for user u. Similarly, in the relations (user-comment-tag) and (user-comment-item), CMF predicts comment terms only for items rather than for user and item pairs. For CMF, we also conduct experiments on the same combinations as in the previous section. For PMF, BPMF, PTF, and BPTF, performance remains unchanged across the different combinations, since each models a single relation. As seen in Table 9, PMF and PTF perform the worst in tag prediction, because they only model a single relation without encoding external information about items. Intuitively, the external information about items (e.g., comments, features) is critical to the tag prediction task. This result agrees with estimates that over 90% of real-world cases are cold start problems. See Yin et al., “A probabilistic model for personalized tag prediction”, in KDD, 2010. For the cold start problem, the external information about items is essential for tag prediction because the items do not exist in the training data. CMF performs better than PMF and PTF in the tag prediction problem by incorporating item-comment and item-feature information. However, CMF cannot take into account higher-order data, so it does not achieve better performance in the comment context and tag context. Overall, it is seen that for all methods, the Bayesian version outperforms the corresponding MAP version, due to the sparsity of the data. The modeling disclosed herein outperforms all of PMF, PTF, BPMF, BPTF and CMF in the comments context, social network context and tag context.
In the item feature relation, the modeling disclosed herein is slightly worse than BPMF. That is because the modeling disclosed herein attempts to maximize the likelihood for all relations. Checking the total RMSE for all four relations, it is found that the modeling disclosed herein outperforms the BPMF methods.
Results for the MovieLens data set are next considered. As previously noted, the time stamps for this data set range from April 2000 to February 2003. The temporal factor is based on the month, so the ratings fall into 35 months, starting with April 2000 and ending with February 2003. There are three relations. The first relation is movie rating prediction for the relation (user, movie, time), the second relation is (user, profile), and the third relation is (movie, genre). For the first relation, we randomly selected 100,000 ratings as training data and used the rest as test data. For the second and third relations, we randomly selected one out of every ten entries for testing and used the rest for training. The objective was not only to predict ratings, but also to predict unobserved user features and unobserved movie genres.
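The month-based temporal factor can be made concrete with a small helper that maps a rating date to a zero-based month index. The sketch below is an assumption about how such binning might be done; the function name `month_index` and the zero-based convention are hypothetical, with only the April 2000 start taken from the description above.

```python
from datetime import date

def month_index(rating_date, start=date(2000, 4, 1)):
    """Zero-based index of the calendar month containing rating_date.

    Index 0 corresponds to the month of `start` (April 2000 by default).
    """
    return (rating_date.year - start.year) * 12 + (rating_date.month - start.month)
```

With this convention April 2000 maps to index 0 and February 2003 maps to index 34, which gives the 35 monthly bins mentioned above.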
The results are shown in Table 10. The relations are: C1:(user, movie, time); C2:(user, profile); and C3:(movie, genre). Table 10 shows that the disclosed modeling clearly outperforms all other methods on all three relations. We also conduct a temporal analysis: for each month's data, we test the RMSE and the results are shown in
With reference to
With reference to Table 11 and
BPTF almost fails to solve the task specified by the C1:(user, tag, item) relation without external item information. In contrast, CMF achieves better results on both relations because it combines the two relations. The results are consistent with the results for the Flickr® data in that the disclosed modeling noticeably decreases the RMSE for the tag prediction task. Note that both relations show significant improvements: 0.3097 in the (user, tag, item) relation and 1.0088 in the (item, feature) relation respectively. BPRA noticeably outperforms all other methods.
It will be appreciated that various of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. Also, various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art, which are also intended to be encompassed by the following claims.