Matrix analysis techniques, e.g., singular value decomposition (SVD), have been widely used in various data analysis applications. An important class of applications is to predict missing elements given a partially observed random matrix. For example, putting ratings of users into a matrix form, the goal of collaborative filtering is to predict those unseen ratings in the matrix.
Various probabilistic models have been used to analyze such matrices. Let Y be a p×m observational matrix and T be the underlying p×m noise-free random matrix. The system assumes that Y_{i,j} = T_{i,j} + ε_{i,j}, with ε_{i,j} ~ N(0, σ²), where Y_{i,j} denotes the (i,j)-th element of Y. If Y is partially observed, then Y_I denotes the set of observed elements and I is the corresponding index set.
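As a concrete illustration of this observation model, the following minimal numpy sketch (all names and sizes are illustrative, not from the source) builds Y = T + ε and represents the index set I of observed entries with a boolean mask:

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, sigma = 4, 3, 0.1
T = rng.standard_normal((p, m))              # noise-free matrix T
Y = T + sigma * rng.standard_normal((p, m))  # Y_ij = T_ij + eps_ij

# Partial observation: a boolean mask stands in for the index set I,
# and Y[mask] collects the observed elements Y_I in row-major order.
mask = rng.random((p, m)) < 0.5
Y_I = Y[mask]
I = np.argwhere(mask)                        # (i, j) pairs of observed entries
```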
One model called Probabilistic Principal Component Analysis (PPCA) assumes that y_j, the j-th column vector of Y, can be generated from a latent vector v_j in a k-dimensional linear space (k<p). The model is defined as
y_j = W v_j + μ + ε_j,   v_j ~ N_k(v_j; 0, I_k),
where ε_j ~ N_p(ε_j; 0, σ²I_p), and W is a p×k loading matrix. By integrating out v_j, the system obtains the marginal distribution y_j ~ N_p(y_j; μ, WW^T + σ²I_p). Since the columns of Y are conditionally independent, PPCA is equivalent to
Y_{i,j} = T_{i,j} + ε_{i,j},   T ~ N_{p,m}(T; 0, S, I_m),
where S = WW^T, and N_{p,m}(·; 0, S, I_m) is a matrix-variate normal prior with zero mean, covariance S between rows, and identity covariance I_m between columns. PPCA aims to estimate the parameter W in order to obtain the covariance matrix S. There is no further prior on W.
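The PPCA marginal covariance WW^T + σ²I_p can be checked empirically. The sketch below (a hedged numpy illustration; the sizes and seed are arbitrary) draws many columns y_j = Wv_j + ε_j and compares their sample covariance with S + σ²I_p:

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, sigma = 3, 2, 0.5
W = rng.standard_normal((p, k))
S = W @ W.T                                  # row covariance S = W W^T

# Draw many columns y_j = W v_j + eps_j (mu = 0 for simplicity) and check
# that their sample covariance matches the marginal WW^T + sigma^2 I_p.
n = 200_000
V = rng.standard_normal((k, n))              # v_j ~ N_k(0, I_k)
E = sigma * rng.standard_normal((p, n))      # eps_j ~ N_p(0, sigma^2 I_p)
Y = W @ V + E
sample_cov = Y @ Y.T / n
target = S + sigma**2 * np.eye(p)
```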
Another model called the Gaussian Process Latent-Variable Model (GPLVM) formulates a latent-variable model in a slightly unconventional way. It considers the same linear relationship from the latent representation v_j to the observations y_j. Instead of treating v_j as random variables, GPLVM assigns a prior on W and treats {v_j} as parameters:
y_j = W v_j + ε_j,   W ~ N_{p,k}(W; 0, I_p, I_k),
where the elements of W are independent Gaussian random variables. By marginalizing out W, the system obtains a distribution in which each row of Y is an i.i.d. sample from a Gaussian process prior with covariance R = VV^T, where V = [v_1, . . . , v_m]^T. The model has the form
Y_{i,j} = T_{i,j} + ε_{i,j},   T ~ N_{p,m}(T; 0, I_p, R).
From a matrix modeling point of view, GPLVM estimates the covariance between the columns and assumes the rows to be conditionally independent.
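The dual structure can be verified the same way. In this hedged numpy sketch, rows of WV^T, with W having i.i.d. standard normal entries as in the GPLVM prior, are stacked and their sample covariance is compared with R = VV^T:

```python
import numpy as np

rng = np.random.default_rng(2)
m, k = 3, 2
V = rng.standard_normal((m, k))
R = V @ V.T                                  # column covariance R = V V^T

# With W ~ N_{p,k}(0, I_p, I_k), each row of W V^T is w_i^T V^T with
# w_i ~ N_k(0, I_k); stacking many such independent rows, their sample
# covariance should match the GP prior covariance R.
n = 200_000
Wn = rng.standard_normal((n, k))
rows = Wn @ V.T                              # each row ~ N_m(0, R)
sample_cov = rows.T @ rows / n
```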
A third model called Hierarchical Gaussian Process (HGP) is a multi-task learning model where each column of Y is a predictive function of one task, sampled from a Gaussian process prior,
y_j = t_j + ε_j,   t_j ~ N_p(0, S),
where ε_j ~ N_p(0, σ²I_p). It introduces a hierarchical model where an inverse-Wishart prior is added for the covariance,
Y_{i,j} = T_{i,j} + ε_{i,j},   T ~ N_{p,m}(0, S, I_m),   S ~ IW_p(ν, Σ).
HGP utilizes the inverse-Wishart prior as the regularization and obtains a maximum a posteriori (MAP) estimate of S.
To predict unobserved elements in matrices, the structures of the matrices play an important role, for example, the similarity between columns and between rows. Such structures imply that elements in a random matrix are no longer independent and identically-distributed (i.i.d.). Without the i.i.d. assumption, many machine learning models are not applicable.
In one aspect, the system assumes that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggests a matrix-variate t model (MVTM) to predict those missing elements. MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models.
In another aspect, systems and methods predict missing elements from a partially-observed matrix by receiving one or more user item ratings; generating a model parameterized by matrices U, S, V; and outputting the model.
Advantages of the system may include one or more of the following. The system learns from a partially-observed random matrix and predicts its missing elements. The system provides an optimization method that sequentially minimizes a convex upper bound of the negative log-likelihood, which is highly efficient and scalable. Predictions can be made without computing the mode or mean of the posterior distribution. These capabilities are achieved with good predictive accuracy. An MVTM does not require the independence assumption over elements. The implicit model selection of the MVTM encourages sparse models with lower ranks. To minimize the negative log-likelihood, which contains log-determinant terms, the system proposes an optimization method that sequentially minimizes a convex quadratic upper bound. The experiments show that the approach is accurate, efficient and scalable.
Useful properties of the model may include one or more of the following. First, MVTM continues the line of gradual generalizations across several known probabilistic models on random matrices, namely, from probabilistic principal component analysis (PPCA), to Gaussian process latent-variable models (GPLVMs), and to hierarchical Gaussian processes (HGPs). MVTMs can be further derived by analytically marginalizing out the hyper-parameters of these models. From a Bayesian modeling point of view, the marginalization of hyper-parameters amounts to an automatic model selection and usually leads to better generalization performance. Second, the model selection by MVTMs explicitly encourages simpler predictive models that have lower ranks. Unlike direct rank minimization, the log-determinant terms in the matrix-variate t prior offer a continuous (though non-convex) optimization surface for the rank constraint. Third, like multivariate Gaussian distributions, the matrix-variate t prior is consistent under marginalization; that is, if a matrix follows a matrix-variate t distribution, any of its sub-matrices follows a matrix-variate t distribution as well. This property makes it possible to generalize distributions for finite matrices to infinite stochastic processes. Under a Gaussian noise model, the matrix-variate t distribution is not a conjugate prior. It is thus difficult to make predictions by computing the mode or mean of the posterior distribution. The system suggests an optimization method that sequentially minimizes a convex upper bound of the negative log-likelihood, which is highly efficient and scalable. The system shows very good efficiency and excellent prediction accuracy.
(1a)-(1d) show various exemplary models for matrix prediction.
(a) shows an exemplary test matrix.
The system models the random matrix of interest as a sample drawn from a matrix-variate t distribution, which is a generalization of the Student-t distribution. The predictive model under such a prior is called the matrix-variate t model (MVTM).
The PPCA models the row covariance of Y, the GPLVM models the column covariance, and the HGP assigns a hyper-prior to prevent over-fitting when estimating the (row) covariance. From a matrix modeling point of view, capturing the dependence structure of Y by its row or column covariance is a matter of choice. The MVTMs provide a unified solution. From a Bayesian modeling viewpoint, as many variables as possible should be marginalized. The system thus extends the HGP model in two directions: (1) assume T ~ N_{p,m}(0, S, Ω), which has covariances on both sides of the matrix; (2) marginalize the covariance S on one side:
Pr(T) = ∫ N_{p,m}(T; 0, S, Ω) IW_p(S; ν, Σ) dS = t_{p,m}(T; ν−2p, 0, Σ, Ω),
which is a matrix-variate t distribution.
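The hierarchical construction above suggests a simple two-stage sampler. The following numpy sketch is illustrative only (inverse-Wishart degree-of-freedom conventions vary between texts): it draws S from an inverse-Wishart via a Wishart draw, then T from the matrix-variate normal, so that T marginally follows a matrix-variate t distribution:

```python
import numpy as np

rng = np.random.default_rng(3)
p, m, nu = 4, 3, 10                          # nu > p - 1 for a proper IW draw
Sigma, Omega = np.eye(p), np.eye(m)

# Step 1: S ~ IW_p(nu, Sigma), sampled by drawing a Wishart matrix
# W_p(nu, Sigma^{-1}) as a sum of outer products and inverting it.
X = rng.standard_normal((nu, p)) @ np.linalg.cholesky(np.linalg.inv(Sigma)).T
S = np.linalg.inv(X.T @ X)

# Step 2: T | S ~ N_{p,m}(0, S, Omega), by colouring iid Gaussian noise
# with the Cholesky factors of the row and column covariances.
Z = rng.standard_normal((p, m))
T = np.linalg.cholesky(S) @ Z @ np.linalg.cholesky(Omega).T
```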
The matrix-variate t distribution of a p×m matrix T is given (up to its normalizing constant, which involves the multivariate gamma function Γ_p(·)) by
t_{p,m}(T; ν, M, Σ, Ω) ∝ |I_p + Σ^{−1}(T−M) Ω^{−1}(T−M)^T|^{−(ν+m+p−1)/2},
where ν is the number of degrees of freedom; M is a p×m mean matrix; Σ and Ω are positive definite matrices of size p×p and m×m, respectively; and |·| stands for the determinant.
The model is depicted in
The same matrix-variate t distribution can be equivalently derived by putting another hierarchical generative process on the covariance R, as described in
The matrix-variate t distribution involves a determinant term in T, which becomes a log-determinant term in the log-likelihood or KL-divergence. The log-determinant term encourages sparsity, i.e., a matrix T with lower rank. This property has been used as a heuristic for minimizing the rank of a matrix, and Student-t priors have been applied to enforce sparsity in kernel machines.
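The low-rank preference of the log-determinant term can be seen numerically. In the sketch below (with Σ = I_p and Ω = I_m for simplicity), two matrices with identical Frobenius norm but different ranks receive different penalties:

```python
import numpy as np

def logdet_penalty(T):
    """ln|I_p + T T^T| = sum_i ln(1 + s_i^2) over singular values s_i of T
    (the log-determinant term with Sigma = I_p, Omega = I_m)."""
    p = T.shape[0]
    return np.linalg.slogdet(np.eye(p) + T @ T.T)[1]

# Two 4x4 matrices with identical Frobenius norm 2: all energy in one
# singular value (rank 1) versus spread over four (full rank).
p = 4
T_low = np.zeros((p, p))
T_low[0, 0] = 2.0                            # singular values (2, 0, 0, 0)
T_full = np.eye(p)                           # singular values (1, 1, 1, 1)
```

Since ln(1+x) is concave, spreading the same energy over more singular values increases Σ_i ln(1 + s_i²): the rank-1 matrix incurs ln 5 ≈ 1.61 versus 4 ln 2 ≈ 2.77 for the full-rank one, so the prior favors the low-rank matrix.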
Though the system can use the evidence framework or other methods to estimate ν, the results are not good in many cases; usually the system simply sets ν to a small number. Similarly, the estimated σ² does not give good results either, but cross-validation is a good choice. For the mean matrix M, the system simply uses the sample average of all observed elements. For many tasks, the system does not have any information about Σ or Ω. Here, the system just sets them to √ν I_p and √ν I_m respectively, so that the variance of each element of T under the prior distribution is about 1.
When the evaluation of the prediction is the sum of individual losses, the optimal prediction is to find the individual mode of the marginal posterior distribution, i.e., arg max_{T_{i,j}} Pr(T_{i,j} | Y_I).
An alternative is to use the individual means of the posterior distribution to approximate the individual modes. Since the collection of individual means happens to be the mean of the joint distribution, the system only needs to compute the joint posterior distribution. The problem of prediction by means is written as
However, it is usually difficult to compute the exact mean. One estimation method is the Monte Carlo method, which is computationally intensive. Discussion of an approximation to compute the mean is in Section 3.4. From the system's experiments, the prediction by means usually outperforms the prediction by modes.
The MVTM has a rich set of properties. Several are listed in the following Theorem.
then

Pr(T) = t_{p,m}(T; ν, 0, Σ, Ω), (5)
Pr(T | θ, Φ, Ψ) = t_{p,m}(T; ν+q+n, M, (Σ + ΨBΨ^T), (Ω + Φ^T AΦ)), (6)
Pr(θ) = t_{q,n}(θ; ν, 0, Σ_0, Ω_0), (7)
Pr(Φ | θ) = t_{q,m}(Φ; ν+n, 0, A^{−1}, Ω), (8)
Pr(Ψ | θ, Φ) = t_{p,n}(Ψ; ν+q, 0, Σ, B^{−1}) = Pr(Ψ | θ), (9)
E(T | θ, Φ, Ψ) = M, (10)
Cov(vec(T^T) | θ, Φ, Ψ) = (ν+q+n−2)^{−1} (Σ + ΨBΨ^T) ⊗ (Ω + Φ^T AΦ), (11)

where

A ≜ (θΩ_0^{−1}θ^T + Σ_0)^{−1},
B ≜ (θ^T Σ_0^{−1}θ + Ω_0)^{−1},
M ≜ ΨΩ_0^{−1}θ^T AΦ = ΨBθ^T Σ_0^{−1}Φ.
This theorem provides some insights into MVTMs. The marginal distribution in Eq. (5) has the same form as the joint distribution; therefore the matrix-variate t distribution is extensible to an infinite-dimensional stochastic process. As the conditional distribution in Eq. (6) is still a matrix-variate t distribution, the system can use it to approximate the posterior distribution.
The system encounters log-determinant terms in the computation of the mode or mean estimation. The following theorem provides a quadratic upper bound for the log-determinant terms, which makes it possible to apply the optimization method discussed below.
ln|Σ + TΩ^{−1}T^T| ≤ h(T; T_0, Σ, Ω) + h_0(T_0, Σ, Ω),
where
h(T; T_0, Σ, Ω) ≜ tr((Σ + T_0Ω^{−1}T_0^T)^{−1} TΩ^{−1}T^T),
h_0(T_0, Σ, Ω) ≜ ln|Σ + T_0Ω^{−1}T_0^T| + tr((Σ + T_0Ω^{−1}T_0^T)^{−1}Σ) − p.
The equality holds when T = T_0. Moreover, the first-order derivatives of the two sides agree at T = T_0.
Applying Lemma 2 with X = (Σ + T_0Ω^{−1}T_0^T)^{−1}(Σ + TΩ^{−1}T^T), the system obtains the inequality. The equality of the first-order derivatives follows by direct calculus. Moreover, h(·) is a quadratic convex function with respect to T, since (Σ + T_0Ω^{−1}T_0^T)^{−1} and Ω^{−1} are positive definite matrices.
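The bound can be checked numerically. The sketch below reads the (garbled) constant term as h_0 = ln|Σ + T_0Ω^{-1}T_0^T| + tr((Σ + T_0Ω^{-1}T_0^T)^{-1}Σ) − p, which is the form that makes the bound tight at T = T_0; it evaluates both sides at a random T and verifies equality at T_0:

```python
import numpy as np

rng = np.random.default_rng(4)
p, m = 3, 4
Sigma, Omega = np.eye(p), np.eye(m)
Om_inv = np.linalg.inv(Omega)
T0 = rng.standard_normal((p, m))             # expansion point
T = rng.standard_normal((p, m))              # arbitrary test point

def logdet_term(T):
    # left-hand side: ln|Sigma + T Omega^{-1} T^T|
    return np.linalg.slogdet(Sigma + T @ Om_inv @ T.T)[1]

C0_inv = np.linalg.inv(Sigma + T0 @ Om_inv @ T0.T)

def h(T):
    # quadratic part of the bound: tr(C0^{-1} T Omega^{-1} T^T)
    return np.trace(C0_inv @ T @ Om_inv @ T.T)

h0 = logdet_term(T0) + np.trace(C0_inv @ Sigma) - p
```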
An exemplary optimization method is discussed next. Once the objective is given, the prediction becomes an optimization problem. The system uses an EM-style optimization method to make the prediction. Let J(T) be the objective function to be minimized. If the system can find an auxiliary function Q(T; T′) with the following properties, the system can apply this method.
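A minimal one-dimensional analogue of this EM-style scheme uses a tangent-line upper bound on a log term; the objective and bound here are illustrative stand-ins, not the system's actual Ĵ and Q:

```python
import numpy as np

def J(t):
    """Toy 1-D objective: a log-det-style concave term plus a quadratic loss."""
    return np.log(1.0 + t**2) + (t - 3.0)**2

def argmin_Q(t0):
    """Minimizer of the auxiliary bound Q(t; t0). The tangent-line bound
    ln(1 + t^2) <= ln(1 + t0^2) + (t^2 - t0^2) / (1 + t0^2) makes Q a
    convex quadratic touching J at t0, with a closed-form minimizer."""
    a = 1.0 / (1.0 + t0**2)
    return 3.0 / (1.0 + a)

t = 0.0
vals = [J(t)]
for _ in range(30):                          # EM-style: minimize Q, repeat
    t = argmin_Q(t)
    vals.append(J(t))
```

Each step satisfies J(t_{k+1}) ≤ Q(t_{k+1}; t_k) ≤ Q(t_k; t_k) = J(t_k), so the objective never increases and the iterate converges to a stationary point of J.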
Next, exemplary predictions by modes are discussed. Following Eq. (2), the goal is to minimize the objective function
where c denotes equivalence by ignoring irrelevant terms (or constants).
To minimize Ĵ, the system introduces an auxiliary function,
Q(T; T′) ≜ l(T) + h(T; T′, Σ, Ω) + h_0(T′, Σ, Ω). (13)
By Corollary 3, the system has Ĵ(T) ≤ Q(T; T′), Ĵ(T′) = Q(T′; T′), and Q(T; T′) has the same first-order derivative as Ĵ(T) at T′. Because l and h are quadratic and convex, Q is quadratic and convex as well. Therefore, the system can apply the optimization method in Section 3.2 to minimize Ĵ.
However, when the size of T is large, finding T̂ is still time-consuming and requires a very large space. In many tasks, the system only needs to infer a small portion of T̂. Therefore, the system considers a low-rank approximation, using UV^T to approximate T, where U is a p×k matrix and V is an m×k matrix. The problem of Eq. (2) is approximated by argmin_{U,V} Ĵ(UV^T). The system can minimize Ĵ by alternately optimizing U and V. The system can put the final result in a canonical format as T̂ ≅ USV^T, where U and V are semi-orthonormal and S is a k×k diagonal matrix. This result can be considered as the SVD of an incomplete matrix under matrix-variate t regularization. The details are omitted owing to space limitations.
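An illustrative alternating scheme in this spirit is sketched below. It is hedged: a plain ridge penalty λ stands in for the matrix-variate t regularization, but each update solves the same kind of per-row linear system over the observed entries only:

```python
import numpy as np

rng = np.random.default_rng(5)
p, m, k = 30, 20, 3
T_true = rng.standard_normal((p, k)) @ rng.standard_normal((k, m))
mask = rng.random((p, m)) < 0.7              # observed index set
Y = T_true + 0.01 * rng.standard_normal((p, m))

# Alternating ridge updates for T ≈ U V^T fitted to observed entries only.
U = rng.standard_normal((p, k))
V = rng.standard_normal((m, k))
lam = 0.1
for _ in range(50):
    for i in range(p):                       # refit row u_i of U
        obs = mask[i]
        A = V[obs]
        U[i] = np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ Y[i, obs])
    for j in range(m):                       # refit row v_j of V
        obs = mask[:, j]
        A = U[obs]
        V[j] = np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ Y[obs, j])

# held-out error on the unobserved entries
rmse = np.sqrt(np.mean((U @ V.T - T_true)[~mask] ** 2))
```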
The prediction by means is discussed next. Owing to the difficulty of explicitly computing the posterior distribution of T, the system takes a variational approach and approximates the posterior by a matrix-variate t distribution via an expanded model. The system expands the model by adding θ, Φ and Ψ with distributions as in Eq. (4), with Σ_0 = I_q and Ω_0 = I_n. Since the marginal distribution, Eq. (5), is the same as the prior of T, the system can recover the original model by marginalizing out θ, Φ and Ψ. However, instead of integrating out θ, Φ and Ψ, the system uses them as parameters to approximate T's posterior distribution. Therefore, the estimation of the parameters is to minimize
−ln Pr(θ, Φ, Ψ) − ln ∫ Pr(T | θ, Φ, Ψ) Pr(Y_I | T) dT (14)
over θ, Φ and Ψ. The first term of Eq. (14) can be written as
Because of the convexity of negative logarithm, the second term of Eq. (14) is bounded by
Because −ln Pr(Y_I | T) is quadratic with respect to T, the system only needs to integrate using the mean and variance of T_{i,j} under Pr(T | θ, Φ, Ψ), which are given by Eqs. (10) and (11). The parameter estimation not only reduces the loss (the l(·) term), but also reduces the variance. Because of this, the prediction by means usually outperforms the prediction by modes.
Let
The foregoing analysis is performed on a computer using exemplary processes shown in
In one embodiment, Maximum Margin Matrix Factorization (MMMF) can be used to perform mode estimation. Using the trace norm of the matrix as regularization, MMMF overcomes the over-fitting problem in factorizing matrices with missing values. From the regularization viewpoint, the prediction by mode of MVTM uses log-determinants as the regularization term in Eq. (12). The log-determinants encourage sparse predictive models.
In another embodiment, Stochastic Relational Models (SRMs) can be used. SRMs extend HGPs by estimating the covariance matrices for each side. The covariance functions are estimated from observations. By maximizing the marginalized likelihood, the estimated S and R reflect the dependency structure, and the relationship can then be predicted with S and R. During the estimation of S and R, inverse-Wishart priors with parameters Σ and Ω are imposed on S and R respectively. MVTM differs from SRM in that it integrates out the hyper-parameters rather than maximizing over them.
In yet another embodiment, Robust Probabilistic Projections (RPP) can be used, which applies the Student-t distribution to extend PPCA by scaling each feature vector by an independent random variable. Written in matrix format, RPP is
where IG is the inverse-Gamma distribution. Though RPP unties the scale factors between feature vectors, which can make the estimation more robust, it does not integrate out the covariance matrix, as the system does in MVTM. Moreover, inherited from PPCA, RPP implicitly uses an independence assumption over feature vectors. RPP also yields different models depending on which side is assumed to be independent; it is therefore not well suited for matrix prediction.
In a test, the system generates a 30×20 matrix.
MVTM favors sparse predictive models. To verify this, the system depicts the singular values of the MMMF method and the two MVTM prediction methods in
The singular values of the mode estimation decrease faster than the MMMF ones at the beginning, but decrease more slowly after a threshold. This confirms that the log-determinants automatically determine the intrinsic rank of the matrices.
In another test, the system evaluates the algorithms on the EachMovie dataset from Breese, et al., "Empirical analysis of predictive algorithms for collaborative filtering," in UAI-98, pp. 43-52 (1998). The dataset contains 74,424 users' 2,811,718 ratings on 1,648 movies, i.e., about 2.29% of the entries are rated, by zero-to-five stars. The system puts all ratings into a matrix, and randomly selects 80% as observed data to predict the remaining ratings. The random selection was carried out 10 times independently. The system compares its approach with four other approaches: 1) USER MEAN, predicting a rating by the sample mean of the same user's ratings; 2) MOVIE MEAN, predicting a rating by the sample mean of users' ratings of the same movie; 3) MMMF [8]; 4) PPCA. The system does not have a scalable implementation of the other approaches compared in the previous experiment. The number of dimensions is 10. The results are shown in Table 1, which shows that the two MVTM prediction methods outperform the other methods.
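The USER MEAN baseline from the comparison can be sketched as follows; the data here is a small synthetic stand-in, not the EachMovie matrix, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
# Tiny synthetic ratings matrix (users x movies, zero-to-five stars).
Y = rng.integers(0, 6, size=(50, 40)).astype(float)
train = rng.random(Y.shape) < 0.8            # 80% observed, as in the test
test = ~train

def user_mean_predict(Y, train):
    """USER MEAN baseline: predict every entry of a row by the mean of
    that user's observed (training) ratings."""
    counts = np.maximum(train.sum(axis=1, keepdims=True), 1)
    sums = np.where(train, Y, 0.0).sum(axis=1, keepdims=True)
    return np.broadcast_to(sums / counts, Y.shape)

pred = user_mean_predict(Y, train)
rmse = np.sqrt(np.mean((pred - Y)[test] ** 2))
```

The MOVIE MEAN baseline is the same computation along the other axis (per-column means over training entries).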
In summary, the matrix-variate t model for matrix prediction treats the entire matrix as a single sample drawn from a matrix-variate t distribution. An MVTM does not require the independence assumption over elements, and its implicit model selection encourages sparse models with lower ranks. MVTM generalizes a range of known probabilistic models and automatically performs model selection to encourage sparse predictive models. Because its prior is not conjugate under a Gaussian noise model, it is difficult to make predictions by computing the mode or mean of the posterior distribution. To minimize the negative log-likelihood, which contains log-determinant terms, the system proposes an optimization method that sequentially minimizes a convex quadratic upper bound, which is very efficient and scalable. The experiments on toy data and the EachMovie dataset show that the approach is accurate, efficient and scalable in learning from a partially-observed random matrix and predicting its missing elements.
Exemplary applications of the above system include one or more of the following. One such application is personalized recommendation (e.g., book recommendation at a web retailer such as Amazon.com). The method recommends items according to the values of a score matrix whose axes are customers and items, unlike a conventional recommendation system based only on items previously purchased by the customer. Another application is link detection in a social network. In this exemplary application, known links are entered into a matrix where both axes are persons. The method can compute the score of an unknown pair of persons and, from the score, predict the possibility of unknown links.
The invention may be implemented in hardware, firmware or software, or a combination of the three. Preferably the invention is implemented in a computer program executed on a programmable computer having a processor, a data storage system, volatile and non-volatile memory and/or storage elements, at least one input device and at least one output device.
By way of example, a block diagram of a computer to support the system is discussed next. The computer preferably includes a processor, random access memory (RAM), a program memory (preferably a writable read-only memory (ROM) such as a flash ROM) and an input/output (I/O) controller coupled by a CPU bus. The computer may optionally include a hard drive controller which is coupled to a hard disk and CPU bus. Hard disk may be used for storing application programs, such as the present invention, and data. Alternatively, application programs may be stored in RAM or ROM. I/O controller is coupled by means of an I/O bus to an I/O interface. I/O interface receives and transmits data in analog or digital form over communication links such as a serial link, local area network, wireless link, and parallel link. Optionally, a display, a keyboard and a pointing device (mouse) may also be connected to I/O bus. Alternatively, separate connections (separate buses) may be used for I/O interface, display, keyboard and pointing device. Programmable processing system may be preprogrammed or it may be programmed (and reprogrammed) by downloading a program from another source (e.g., a floppy disk, CD-ROM, or another computer).
Each computer program is tangibly stored in a machine-readable storage media or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage media or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
The invention has been described herein in considerable detail in order to comply with the patent Statutes and to provide those skilled in the art with the information needed to apply the novel principles and to construct and use such specialized components as are required. However, it is to be understood that the invention can be carried out by specifically different equipment and devices, and that various modifications, both as to the equipment details and operating procedures, can be accomplished without departing from the scope of the invention itself.