The present application claims priority to Singapore Application No. SG 10201803291Q filed with the Intellectual Property Office of Singapore on Apr. 19, 2018, which is incorporated herein by reference in its entirety for all purposes.
The present disclosure relates to predictive analysis using machine learning and more specifically to an embedding model used in predictive analysis.
Personalized recommendation is at the core of many online customer-oriented services, such as e-commerce, social media, and content-sharing websites. Technically speaking, the recommendation problem is usually tackled as a matching problem, which aims to estimate the relevance score between a user and an item based on their available profiles. Regardless of the application domain, a user's profile usually consists of an ID (to identify which specific user) and some additional information like age, gender, and income level. Similarly, an item's profile typically contains an ID and some attributes like category, tags, and price.
Collaborative filtering (CF) is the most prevalent technique for building a personalized recommendation system. CF leverages users' interaction histories on items to select the relevant items for a user. From the matching view, CF uses only the ID information as the profile for a user and an item, and forgoes other additional information. As such, CF can serve as a generic solution for recommendation without requiring any domain knowledge. However, the downside is that it lacks the necessary reasoning or explanations for a recommendation. Specifically, the explanation mechanisms are either "because your friends also like it" (i.e., user-based CF) or "because the item is similar to what you liked before" (i.e., item-based CF), which are too coarse-grained and may be insufficient to convince users of a recommendation.
To persuade users to perform actions on a recommendation, we believe it is crucial to provide more concrete reasons in addition to similar users or items. For example, we recommend iPhone 7 Rose Gold to user Emine, because we find that females aged 20-25 with a monthly income over $10,000 (which matches Emine's demographics) generally prefer pink Apple products. To supercharge a recommender system with such informative reasons, the underlying recommender shall be able to (i) explicitly discover effective cross features from the rich side information of users and items, and (ii) estimate the user-item matching score in an explainable way. In addition, we expect the use of side information to help improve the performance of recommendation.
Nevertheless, none of the existing recommendation methods satisfies these two conditions together. In the literature, embedding-based methods such as matrix factorization are the most popular CF approaches, owing to the strong power of embeddings in generalizing from sparse user-item relations. Many variants have been proposed to incorporate side information, such as factorization machines (FM), Neural FM, Wide&Deep, and Deep Crossing. While these methods can learn feature interactions from raw data, cross feature effects are only captured in a rather implicit way during the learning process; most importantly, the cross features cannot be explicitly presented. Moreover, existing work on using side information has mainly focused on the cold-start issue, leaving the explanation of recommendations relatively untouched.
According to a first aspect of the present disclosure, a predictive analysis method comprises: receiving input data comprising an indication of a user, an indication of an item, a user feature vector indicating features of the user, and an item feature vector indicating features of the item; constructing a cross feature vector indicating values for cross features between features of the user and features of the item; projecting each cross feature of the cross feature vector onto an embedding vector to obtain a set of cross feature embedding vectors; projecting the user feature vector onto the embedding vector to obtain a user feature embedding vector and projecting the item feature vector onto the embedding vector to obtain an item feature embedding vector; inputting the cross feature embedding vectors, the user feature embedding vector and the item feature embedding vector into an attention network to determine a set of attentive weights, the set of attentive weights comprising an attentive weight for each cross feature of the cross feature vector; performing a pooling operation over the set of attentive weights to obtain a unified representation of cross features; concatenating an elementwise product of the user feature embedding vector and the item feature embedding vector with the unified representation of cross features to obtain a concatenated vector; projecting the concatenated vector to obtain a prediction of a user item preference; and outputting an indication of the user item preference.
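By way of non-limiting illustration, the steps of the first aspect may be traced end-to-end as follows. This is a minimal NumPy sketch with randomly initialized parameters; all dimensions and variable names are illustrative assumptions rather than part of the disclosed method.

# Minimal NumPy sketch of the predictive analysis method of the first aspect.
# All dimensions, parameters, and names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
k, n_cross = 20, 3                    # embedding size; number of active cross features

p_u = rng.normal(size=k)              # user feature embedding vector
q_i = rng.normal(size=k)              # item feature embedding vector
V = rng.normal(size=(n_cross, k))     # cross feature embedding vectors

# Attention network: one hidden layer producing a weight per cross feature.
a = 20                                # attention size
W, b, h = rng.normal(size=(a, 2 * k)), np.zeros(a), rng.normal(size=a)
ui = p_u * q_i                        # elementwise product of user and item embeddings
raw = np.array([h @ np.maximum(W @ np.concatenate([ui, v]) + b, 0.0) for v in V])
w = np.exp(raw) / np.exp(raw).sum()   # softmax-normalized attentive weights

e = (w[:, None] * V).sum(axis=0)      # pooling -> unified representation of cross features

r = rng.normal(size=2 * k)            # final projection weights
y_hat = r @ np.concatenate([ui, e])   # prediction of the user item preference
print("predicted preference:", y_hat)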
In an embodiment, the method further comprises outputting an indication of at least one attentive weight of the set of attentive weights.
In an embodiment, the method further comprises receiving an input indicating an adjustment to the set of attentive weights and adjusting attentive weights of the set of attentive weights in accordance with the adjustment.
In an embodiment, constructing a cross feature vector comprises using a gradient boosting decision tree.
In an embodiment, the cross feature vector is a sparse vector.
In an embodiment, the pooling operation is an average pooling operation.
In an embodiment, the pooling operation is a max pooling operation.
In an embodiment, the attention network is a multilayer perceptron.
According to a second aspect of the present disclosure, a data processing system comprises a processor and a data storage device. The data storage device stores computer executable instructions operable by the processor to: receive input data comprising an indication of a user, an indication of an item, a user feature vector indicating features of the user and an item feature vector indicating features of the item; construct a cross feature vector indicating values for cross features between features of the user and features of the item; project each cross feature of the cross feature vector onto an embedding vector to obtain a set of cross feature embedding vectors; project the user feature vector onto the embedding vector to obtain a user feature embedding vector and project the item feature vector onto the embedding vector to obtain an item feature embedding vector; input the cross feature embedding vectors, the user feature embedding vector and the item feature embedding vector into an attention network to determine a set of attentive weights, the set of attentive weights comprising an attentive weight for each cross feature of the cross feature vector; perform a pooling operation over the set of attentive weights to obtain a unified representation of cross features; concatenate an elementwise product of the user feature embedding vector and the item feature embedding vector with the unified representation of cross features to obtain a concatenated vector; project the concatenated vector to obtain a prediction of a user item preference; and output an indication of the user item preference.
According to a yet further aspect, there is provided a non-transitory computer-readable medium. The computer-readable medium has stored thereon program instructions for causing at least one processor to perform operations of a method disclosed above.
In the following, embodiments of the present invention will be described as non-limiting examples with reference to the accompanying drawings in which:
In the present disclosure, a recommendation solution that is both accurate and explainable is described. By accurate, we expect our method to achieve the same level of performance as existing embedding-based approaches. By explainable, we would like our method to be transparent in generating a recommendation and capable of identifying the key cross features for a prediction. Towards this end, we propose a novel solution named the Tree-enhanced Embedding Model (TEM), which combines embedding-based methods with decision tree-based approaches. First, we build gradient boosting decision trees (GBDTs) on the side information of users and items to derive effective cross features. We then feed the cross features into an embedding-based model, which is a carefully designed neural attention network that reweights the cross features according to the current prediction. Owing to the explicit cross features extracted by the GBDTs and the easy-to-interpret attention network, the overall prediction process is fully transparent and self-explainable. In particular, to generate reasons for a recommendation, we just need to select the most predictive cross features based on their attention scores.
As its main technical contribution, this disclosure presents a new scheme that unifies the strengths of embedding-based and tree-based methods for recommendation. Embedding-based methods are known to have strong generalization ability, especially in predicting the unseen crosses of user ID and item ID (i.e., capturing the CF effect). However, when operating on rich side information, embedding-based methods lose the important property of explainability: the cross features that contribute most to the prediction cannot be revealed. On the other hand, tree-based methods predict by generating explicit decision rules, making the resultant cross features directly interpretable. While this approach is highly suitable for learning from side information, it fails to generalize to unseen cross features and is thus unsuitable for incorporating user ID and item ID. To build an explainable recommendation solution, we combine the strengths of embedding-based and tree-based methods in a natural and effective manner, which to our knowledge has not been studied before.
In this disclosure, we demonstrate the effectiveness and explainability of TEM in recommendation scenarios. However, TEM, as an easy-to-interpret model, can be used in a wide range of applications such as recommender systems (e.g., e-commerce recommendation), social networking services (e.g., friend recommendation or word-of-mouth marketing), and advertising services (e.g., audience detection, click-through rate prediction, and targeted advertisement). Taking click-through rate prediction as an example, we can feed features including user attributes (e.g., age, gender, and occupation) and advertisement features (e.g., position, brand, device type, and duration) into TEM. We can then profile which groups of users click the target advertisement and why.
The input 110 comprises an indication of a user u, an item i, and their feature vectors [x_u, x_i] = x ∈ ℝ^n. The feature vectors [x_u, x_i] indicate attributes of the user u and the item i, respectively. The feature vectors [x_u, x_i] are input into a gradient boosting decision tree (GBDT) model 120 to identify cross features which affect the user item preference. The gradient boosting decision tree (GBDT) model 120 is described in more detail below.
Following the gradient boosting decision tree (GBDT) model 120 there is an attentive embedding layer 130. The gradient boosting decision tree (GBDT) model 120 outputs indications of cross feature vectors which are relevant to the user item preference. The cross features are projected onto an embedding vector to obtain a set of cross feature embedding vectors v_2, v_4, v_7. A user embedding vector p_u and an item embedding vector q_i are also formed, and the embedding vectors are input into an attention network 132. The attention network 132 is described in more detail below.
The prediction of the user item preference takes the form:

ŷ_ui = b_0 + Σ_t b_t x_t + f_Θ(u, i, x),

where the first two terms model the feature biases similar to those of FM: b_0 is the global bias, b_t denotes the weight of the t-th feature, and f_Θ(u, i, x) is the core component of TEM with parameters Θ that models the cross-feature effect. The output 150 may also comprise an indication of one or more of the attentive weights or one or more attentive scores derived from the attentive weights. The attentive weights and the attentive scores indicate the importance of particular cross features in determining the user item preference.
The technical architecture 200 includes a processor 222 (which may be referred to as a central processor unit or CPU) that is in communication with memory devices including secondary storage 224 (such as disk drives), read only memory (ROM) 226, and random access memory (RAM) 228. The processor 222 may be implemented as one or more CPU chips. The technical architecture 200 may further comprise input/output (I/O) devices 230 and network connectivity devices 232. The technical architecture 200 further comprises activity table storage which may be implemented as a hard disk drive or other type of storage device.
The secondary storage 224 is typically comprised of one or more disk drives or tape drives and is used for non-volatile storage of data and as an over-flow data storage device if RAM 228 is not large enough to hold all working data. Secondary storage 224 may be used to store programs which are loaded into RAM 228 when such programs are selected for execution. In this embodiment, the secondary storage 224 has an input/output module 224a, a cross feature vector module 224b, an embedding vector module 224c, an attention network module 224d, a pooling module 224e, a prediction module 224f and an optimization module 224g comprising non-transitory instructions operative by the processor 222 to perform various operations of the methods of the present disclosure.
The I/O devices 230 may include printers, video monitors, liquid crystal displays (LCDs), plasma displays, touch screen displays, keyboards, keypads, switches, dials, mice, track balls, voice recognizers, card readers, paper tape readers, or other well-known input devices.
The network connectivity devices 232 may take the form of modems, modem banks, Ethernet cards, universal serial bus (USB) interface cards, serial interfaces, token ring cards, fiber distributed data interface (FDDI) cards, wireless local area network (WLAN) cards, radio transceiver cards that promote radio communications using protocols such as code division multiple access (CDMA), global system for mobile communications (GSM), long-term evolution (LTE), worldwide interoperability for microwave access (WiMAX), near field communications (NFC), radio frequency identity (RFID), and/or other air interface protocol radio transceiver cards, and other well-known network devices. These network connectivity devices 232 may enable the processor 222 to communicate with the Internet or one or more intranets. With such a network connection, it is contemplated that the processor 222 might receive information from the network, or might output information to the network in the course of performing the method operations described herein. Such information, which is often represented as a sequence of instructions to be executed using processor 222, may be received from and outputted to the network, for example, in the form of a computer data signal embodied in a carrier wave.
The processor 222 executes instructions, codes, computer programs, and scripts that it accesses from hard disk, floppy disk, optical disk (these various disk based systems may all be considered secondary storage 224), flash drive, ROM 226, RAM 228, or the network connectivity devices 232. While only one processor 222 is shown, multiple processors may be present. Thus, while instructions may be discussed as executed by a processor, the instructions may be executed simultaneously, serially, or otherwise executed by one or multiple processors.
It is understood that by programming and/or loading executable instructions onto the technical architecture 200, at least one of the CPU 222, the RAM 228, and the ROM 226 are changed, transforming the technical architecture 200 in part into a specific purpose machine or apparatus having the novel functionality taught by the present disclosure. It is fundamental to the electrical engineering and software engineering arts that functionality that can be implemented by loading executable software into a computer can be converted to a hardware implementation by well-known design rules.
Although the technical architecture 200 is described with reference to a computer, it should be appreciated that the technical architecture may be formed by two or more computers in communication with each other that collaborate to perform a task. For example, but not by way of limitation, an application may be partitioned in such a way as to permit concurrent and/or parallel processing of the instructions of the application. Alternatively, the data processed by the application may be partitioned in such a way as to permit concurrent and/or parallel processing of different portions of a data set by the two or more computers. In an embodiment, virtualization software may be employed by the technical architecture 200 to provide the functionality of a number of servers that is not directly bound to the number of computers in the technical architecture 200. In an embodiment, the functionality disclosed above may be provided by executing the application and/or applications in a cloud computing environment. Cloud computing may comprise providing computing services via a network connection using dynamically scalable computing resources. A cloud computing environment may be established by an enterprise and/or may be hired on an as-needed basis from a third party provider.
In step 302, the input/output module 224a of the data processing system 200 receives the input 110 comprising an indication of a user u, an item i, and their feature vectors [x_u, x_i] = x ∈ ℝ^n. The feature vectors [x_u, x_i] indicate attributes of the user u and the item i, respectively.
In step 304, the cross feature vector module 224b of the data processing system 200 constructs a cross feature vector q. In constructing the cross feature vector, a primary consideration is to make the cross features explicit and explainable.
In the example GBDT, each tree maps the raw feature vector x to a single activated leaf node, and the decision rule along the path from the root to that leaf constitutes an explicit, interpretable cross feature.
We represent the cross features as a multi-hot vector q, which is a concatenation of multiple one-hot vectors (where a one-hot vector encodes the activated leaf node of a tree):
q = GBDT(x|Q) = [Q_1(x), . . . , Q_S(x)].
Here q is a sparse vector, where an element of value 1 indicates an activated leaf node, and the number of nonzero elements in q is S. Let the size of q be L = Σ_{s=1}^{S} L_s, where L_s denotes the number of leaf nodes of the s-th tree.
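By way of illustration, the multi-hot vector q may in practice be read off a trained GBDT by recording the activated leaf of each tree. The sketch below uses XGBoost's pred_leaf output on synthetic data; the data and variable names are assumptions for illustration only.

# Sketch: deriving the multi-hot cross feature vector q from a trained GBDT.
# Synthetic data; in practice x would hold the user/item side information.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                 # raw feature vectors x
y = (X[:, 0] * X[:, 1] > 0).astype(int)        # toy target with a cross effect

dtrain = xgb.DMatrix(X, label=y)
S = 4                                          # number of trees
bst = xgb.train({"max_depth": 3, "objective": "binary:logistic"},
                dtrain, num_boost_round=S)

# pred_leaf=True returns, per instance, the index of the activated leaf in each tree.
leaves = bst.predict(dtrain, pred_leaf=True).astype(int)   # shape (1000, S)

# One-hot encode each tree's leaf and concatenate into the sparse multi-hot q.
L_s = leaves.max(axis=0) + 1                   # per-tree leaf-index range (upper bound)
offsets = np.concatenate([[0], np.cumsum(L_s)[:-1]])
q = np.zeros((len(X), L_s.sum()))
q[np.arange(len(X))[:, None], leaves + offsets] = 1.0
assert (q.sum(axis=1) == S).all()              # exactly S nonzero elements per instance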
In step 306, the embedding vector module 224c of the data processing system 200 projects each cross feature of the cross feature vector q onto an embedding vector v_l ∈ ℝ^k, where k is the embedding size. After the operation, we obtain a set of embedding vectors V = {q_1 v_1, . . . , q_L v_L}. Since q is a sparse vector with only a few nonzero elements, we only need to include the embeddings of the nonzero features for a prediction, i.e., V = {v_l} where q_l ≠ 0.
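Because q is sparse, the projection amounts in practice to an embedding-table lookup over the nonzero indices. An illustrative sketch, with assumed sizes and the activated leaves v_2, v_4, v_7 from the example above:

import numpy as np

rng = np.random.default_rng(0)
L, k = 50, 20                               # total number of leaves; embedding size
embedding_table = rng.normal(size=(L, k))   # one trainable vector v_l per leaf

q = np.zeros(L)
q[[2, 4, 7]] = 1.0                    # activated leaf nodes

active = np.flatnonzero(q)            # only nonzero features matter for a prediction
V = embedding_table[active]           # V = {v_l : q_l != 0}, here of shape (3, k)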
In step 308, the embedding vector module 224c of the data processing system 200 projects the user feature vector and the item feature vector onto the embedding vector to obtain a user feature embedding vector and an item feature embedding vector. We use p_u and q_i to denote the user feature embedding vector and the item feature embedding vector, respectively.
In step 310, the attention network module 224d of the data processing system 200 inputs the embedding vectors into the attention network 132 to determine attentive weights for each cross feature. Here w_uil is a trainable parameter denoting the attentive weight of the l-th cross feature in constituting the unified representation and, importantly, it is personalized to be dependent on (u, i).
We model w_uil as a function dependent on the embeddings of u, i, and l, rather than learning w_uil freely from the data. We use a multilayer perceptron (MLP) as the attention network 132 to parameterize w_uil, which is defined as:

ŵ_uil = h^T ReLU(W [p_u ⊙ q_i, v_l] + b), w_uil = softmax_l(ŵ_uil),

where W ∈ ℝ^{a×2k} and b ∈ ℝ^a denote the weight matrix and bias vector of the hidden layer, respectively, and a controls the size of the hidden layer. The vector h ∈ ℝ^a projects the hidden layer output into the attentive weight. We use the rectifier (ReLU) as the activation function and normalize the attentive weights using the softmax function. We term a the attention size.
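Restated as executable NumPy under the dimensions just defined (a minimal sketch, not a reference implementation):

import numpy as np

def attentive_weights(p_u, q_i, V, W, b, h):
    """Softmax-normalized attentive weights w_uil, one per cross feature in V.

    W has shape (a, 2k); b and h have shape (a,), matching the text above.
    """
    ui = p_u * q_i
    raw = np.array([h @ np.maximum(W @ np.concatenate([ui, v]) + b, 0.0) for v in V])
    exp = np.exp(raw - raw.max())     # numerically stable softmax
    return exp / exp.sum()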
In step 312, the pooling module 224e of the data processing system 200 aggregates the embeddings of the cross features. Here we consider two ways to aggregate the embeddings of the cross features, average pooling and max pooling, to obtain a unified representation e(u, i, V) of the cross features:

e_avg(u, i, V) = Σ_{l: q_l ≠ 0} w_uil v_l, e_max(u, i, V) = max_{l: q_l ≠ 0} w_uil v_l (elementwise maximum).

The result of the pooling operation is a unified representation of the cross features.
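The two pooling variants may be sketched as follows, assuming w holds the attentive weights and V the stacked cross feature embeddings:

import numpy as np

def e_avg(w, V):
    # Attentive average pooling: weighted sum of the cross feature embeddings.
    return (w[:, None] * V).sum(axis=0)

def e_max(w, V):
    # Attentive max pooling: elementwise maximum over the weighted embeddings.
    return (w[:, None] * V).max(axis=0)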
In step 314, the prediction module 224f of the data processing system 200 concatenates an elementwise product of the embedding vectors p_u and q_i with the unified representation of the cross features to obtain a concatenated vector. To incorporate the collaborative filtering (CF) modeling, we concatenate e(u, i, V) with p_u ⊙ q_i, which resembles matrix factorization (MF) in modeling the interaction between the user ID and the item ID.
In step 316, the prediction module 224f of the data processing system 200 projects the concatenated vector to obtain a prediction of the user item preference. We apply a linear regression to project the concatenated vector to the final prediction. This leads to the predictive model of our TEM as:

ŷ_ui = b_0 + Σ_t b_t x_t + r_1^T (p_u ⊙ q_i) + r_2^T e(u, i, V),

where r_1 ∈ ℝ^k and r_2 ∈ ℝ^k are the weights of the final linear regression layer. As can be seen, our TEM is a shallow and additive model. To interpret a prediction, we can easily evaluate the contribution of each component. We use TEM-avg and TEM-max to denote the TEM that uses e_avg(⋅) and e_max(⋅), respectively.
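Assembling the components, an illustrative scoring function (with assumed, already-trained parameters) is:

import numpy as np

def tem_predict(b0, bias, x, p_u, q_i, e, r1, r2):
    """Shallow additive TEM score: global bias + feature biases
    + CF term r1^T (p_u * q_i) + cross feature term r2^T e(u, i, V)."""
    return b0 + bias @ x + r1 @ (p_u * q_i) + r2 @ e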
In step 318, the input/output module 224a of the data processing system 200 outputs an indication of the user item preference and an indication of at least one of the attentive weights.
In step 602, the input/output module 224a of the data processing system 200 receives observed user-item interaction data.
In step 604, the optimization module 224g of the data processing system optimizes the predictive model. Similar to the recent work on neural collaborative filtering (Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In WWW, 173-182), we solve the item recommendation task as a binary classification problem. Specifically, an observed user-item interaction is assigned a target value of 1, and 0 otherwise. We optimize the pointwise log loss, which forces the prediction score ŷ_ui to be close to the target y_ui:

L = −Σ_{(u,i)} [y_ui log σ(ŷ_ui) + (1 − y_ui) log(1 − σ(ŷ_ui))],
where σ is the activation function that restricts the prediction to be in (0, 1), set as the sigmoid σ(x) = 1/(1 + e^{−x}) in this disclosure. The regularization terms are omitted here for clarity (we tuned the L2 regularization in experiments when overfitting was observed). It will be appreciated that other objective functions, such as the pointwise regression loss and ranking loss, may also be used in the optimization process. In this example, we use the log loss as a demonstration of our TEM.
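An illustrative NumPy rendering of this objective (regularization omitted, as in the text):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_hat, y):
    """Pointwise log loss over interactions with targets y in {0, 1};
    y_hat are the raw TEM prediction scores."""
    p = sigmoid(np.asarray(y_hat))
    eps = 1e-12                        # numerical guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))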
The tree-enhanced embedding model described above can be used as a generic solution for prediction. We now discuss how to apply TEM to build an e-commerce recommendation system.
In practical e-commerce systems, we typically have three types of data with which to build a recommendation service: 1) users' interaction histories on products, such as purchasing, rating, and clicking histories; 2) user profiles, such as demographics like age, gender, hometown, and income level; and 3) product properties, such as categories, prices, descriptive tags, and product images. We convert each interaction to a training instance whose basic features include the user ID and product ID; this provides the basic collaborative filtering system. To incorporate the side information of user profiles and product properties, we need to do feature engineering based on the types of side information. For categorical variables like gender (male or female) and hometown (Shanghai, Beijing, or other cities), we can append them to the feature vector via one-hot encoding.
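By way of example, such feature engineering may be sketched with pandas; the column names and values below are purely illustrative:

# Illustrative feature engineering: IDs plus one-hot encoded side information.
import pandas as pd

interactions = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "item_id": ["i9", "i3"],
    "gender": ["female", "male"],
    "hometown": ["Shanghai", "Beijing"],
    "category": ["electronics", "books"],
})

# One-hot encode every categorical column and append to the feature vector.
features = pd.get_dummies(
    interactions, columns=["user_id", "item_id", "gender", "hometown", "category"]
)
print(features.columns.tolist())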
In the following subsection, we show how to deploy TEM to two recommendation scenarios: tourist attraction recommendation and restaurant recommendation.
We collect data from two populous cities on TripAdvisor: London (LON) and New York City (NYC), and perform experiments on tourist attraction and restaurant recommendation, respectively.
For each dataset, we hold out the latest 20% of each user's interaction history to construct the test set, and randomly split the remaining data into training (70%) and validation (10%) sets. The validation set is used to tune hyper-parameters, and the final performance comparison is conducted on the test set.
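An illustrative splitting routine under these proportions, assuming each interaction row carries user_id and timestamp columns:

import pandas as pd

def split(df, test_frac=0.2, val_frac=0.1, seed=0):
    """Per-user temporal holdout: latest 20% per user -> test;
    remaining data shuffled into validation (10%) and training sets."""
    df = df.sort_values("timestamp")
    rank = df.groupby("user_id").cumcount()
    size = df.groupby("user_id")["user_id"].transform("size")
    test = df[rank >= (1.0 - test_frac) * size]
    rest = df.drop(test.index).sample(frac=1.0, random_state=seed)
    n_val = int(len(df) * val_frac)
    return rest[n_val:], rest[:n_val], test   # train, validation, test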
Given one positive user-item interaction in the test set, we pair it with 50 negative instances that the user has not consumed before. Each method then outputs prediction scores for these 51 instances. To evaluate the prediction scores, we adopt two metrics: the error-based log loss and the ranking-aware ndcg@K.
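With a single relevant item among the 51 scored instances, ndcg@K reduces to a simple closed form; an illustrative implementation:

import numpy as np

def ndcg_at_k(scores, k=5):
    """ndcg@K for one test case: scores[0] belongs to the positive item,
    scores[1:] to the 50 sampled negatives."""
    rank = int((np.asarray(scores) > scores[0]).sum())   # 0-based rank of the positive
    return 1.0 / np.log2(rank + 2) if rank < k else 0.0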
The TEM described in the present disclosure is compared with the following methods:
For a fair comparison, we optimize all the methods with the same objective function. We implement our proposed TEM using TensorFlow. We use XGBoost to implement the tree-based components of all methods, where the number of trees and the maximum depth of trees are searched in {100, 200, 300, 400, 500} and {3, 4, 5, 6}, respectively. For all embedding-based components, we test embedding sizes of {5, 10, 20, 40}, and empirically set the attention size equal to the embedding size. All embedding-based methods are optimized using mini-batch Adagrad for a fair comparison, where the learning rate is searched in {0.005, 0.01, 0.05, 0.1, 0.5}. Moreover, an early stopping strategy is applied, where we stop training if the log loss on the validation set increases for four successive epochs. Unless otherwise stated, we show the results for tree number 500, maximum depth 6, and embedding size 20.
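For reference, the search space described above can be written as a configuration sketch (the dictionary keys are illustrative and not tied to any particular training script):

# Hyper-parameter search space used in the experiments (illustrative keys).
tree_grid = {
    "num_trees": [100, 200, 300, 400, 500],
    "max_depth": [3, 4, 5, 6],
}
embedding_sizes = [5, 10, 20, 40]                # attention size set equal to embedding size
learning_rates = [0.005, 0.01, 0.05, 0.1, 0.5]   # mini-batch Adagrad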
We have the following observations:
XGBoost achieves poor performance since it treats sparse IDs as ordinary features and can hardly derive useful cross features from such sparse data. It hence fails to capture the collaborative filtering effect. Moreover, it cannot generalize to unseen feature dependencies. GBDT+LR slightly outperforms XGBoost, verifying the feasibility of treating cross features as the input of a classifier that revises the weight of each cross feature.
The performance of GB-CENT indicates that such boosting may be insufficient to fully facilitate information propagation between two models. Note that to reduce the computational complexity, the modified GB-CENT only conducts GBDT over all the instances, rather than performing GBDT over the supporting instances of each categorical feature. Such modification may contribute to the unsatisfactory performance.
When performing our recommendation tasks, FM and NFM outperform XGBoost, GBDT+LR, and GB-CENT. This is reasonable since they are good at modeling the sparse interactions and the underlying second-order cross features. NFM benefits from higher-order and nonlinear feature correlations by leveraging neural networks, thus leading to better performance than FM.
TEM achieves the best performance, substantially outperforming NFM w.r.t. log loss and obtaining a comparable ndcg@5. By integrating the embeddings of cross features, TEM achieves expressiveness comparable to NFM. Whereas NFM treats all feature interactions equally, TEM can employ the attention network to identify the personalized importance of each cross feature. We further conduct one-sample t-tests to verify that all improvements are statistically significant with p-value < 0.05.
To analyze the effect of cross features, we consider variants that remove the cross feature modeling, termed FM-c, NFM-c, TEM-avg-c, and TEM-max-c. For FM and NFM, one user-item interaction is represented only by the sum of the user and item ID embeddings and their attribute embeddings, without any interactions among features. For TEM, we skip the cross feature extraction and directly feed in the raw features.
Lastly, while exhibiting the lowest log loss, TEM achieves only comparable performance w.r.t. ndcg@5 to that of NFM.
To demonstrate the explainability of TEM, we focus on a sampled user whose profile is {age: 35-49, gender: female, country: the United Kingdom, city: London, expert level: 4, traveler styles: Art and Architecture Lover, Peace and Quiet Seeker, Family Vacationer, Urban Explorer}; meanwhile, we randomly select five attractions, {i31: National Theatre, i45: The View from The Shard, i49: The London Eye, i93: Camden Street Art Tours, i100: Royal Opera House}, from the user's holdout test set.
We first focus on the heat map of the attention scores.
In addition to making the recommendation process transparent, the TEM can further allow a user to correct the process, so as to refresh the recommendation as she desires.
This property of adjusting a recommendation is known as scrutability. For TEM, the attention scores of the cross features serve as a gateway through which to exert control over the recommendation process.
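By way of illustration, such control may be realized by rescaling the attentive weight of a chosen cross feature and recomputing the unified representation; the function below is a sketch under assumed names, reusing the pooling of step 312:

import numpy as np

def adjust_and_repool(w, V, index, factor=2.0):
    """Scrutability sketch: emphasize (factor > 1) or suppress (factor < 1)
    one cross feature, renormalize, and repool the representation."""
    w = w.copy()
    w[index] *= factor
    w /= w.sum()                          # keep the weights a distribution
    return (w[:, None] * V).sum(axis=0)   # refreshed unified representation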
In this disclosure, a tree-enhanced embedding method (TEM), which seamlessly combines the generalization ability of embedding-based models with the explainability of tree-based models, was described. Owing to the explicit cross features extracted by the tree-based part and the easy-to-interpret attention network, the whole prediction process of our solution is fully transparent and self-explainable. Meanwhile, TEM achieves performance comparable to state-of-the-art recommendation methods.
Whilst the foregoing description has described exemplary embodiments, it will be understood by those skilled in the art that many variations of the embodiments can be made within the scope and spirit of the present invention.