This disclosure relates generally to the processing of graph based data using machine learning techniques, particularly in the context of recommendation systems.
An information filtering system is a system that removes redundant or unwanted information from an information stream that is provided to a human user in order to manage information overload. A recommendation system (RS) is a subclass of an information filtering system that seeks to predict the rating or preference a user would give to an item. RSs are often used in commercial applications to guide users to find their true interests out of a substantial number of potential candidates.
Personalized RSs play an important role in many online services. The task of personalized RS is to provide a ranked list of items for each individual user. Accurate personalized RSs can benefit users as well as content publishers and platform providers. RSs are utilized in a variety of commercial areas to provide personalized recommendations to users, including for example: providing video or music suggestions for streaming and download content provider platforms; providing product suggestions for online retailer platforms; providing application suggestions for app store platforms; providing content suggestions for social media platforms; and suggesting news articles for mobile news applications or online news websites.
RSs usually employ one or both of collaborative filtering (CF) and content-based filtering. Both of these filtering methodologies apply a personality-based approach that recommends personalized products or services for different users based on their historical behaviors.
CF methodologies typically build a predictive model or function that is based on a target or active user's past behavior (e.g., items previously purchased or selected and/or a numerical rating given to those items) as well as on the past behavior of other users who have behavioral histories similar to that of the active user. By contrast, content-based filtering methodologies utilize a series of discrete, pre-tagged characteristics of an item (item attributes) in order to recommend additional items with similar properties. However, content-based filtering methodologies can be impeded by the fact that a large number of items have a very limited number of associated item attributes, due at least in part to the volume of items that are continually being added.
Some RSs integrate content-based filtering methodologies into CF methodologies to create a hybrid system. However, the lack of suitable item attributes for the exploding number of items that are available through online platforms requires most RSs to still heavily rely on only CF methods that give recommendations based on users' historical behaviors.
CF methodologies can typically be summarized as: Step 1) Look for users who share the same interaction patterns with the active user (the user for whom the prediction is to be made); and Step 2) Use the ratings/interactions from those like-minded users found in Step 1 to calculate a prediction for the active user. Finding users who share the same interaction patterns requires identification of similar users or similar items. The process of deriving similar users and similar items includes embedding each user and each item into a low-dimensional space created such that similar users are nearby and similar items are nearby. In this regard, an embedding is a mapping of discrete, categorical variables to a vector of continuous numbers. In the context of neural networks, embeddings are low-dimensional, learned continuous vector representations of discrete variables. Embeddings in personalized RS are useful because they can meaningfully represent users and items in a transformed vector space as low-dimensional vectors.
Existing CF approaches attempt to generate representative and distinct embeddings for each user and item. Such representative embeddings can capture complex relations between users and items. The closer that an item and a user are in a vector space, the more likely that the user will interact with or rate the item highly.
A classic and successful method for CF is matrix factorization (MF). MF algorithms characterize both items and users by vectors in the same space, inferred from observed entries of user-item historical interaction. MF algorithms work by decomposing a user-item interaction matrix into the product of two lower dimensionality rectangular matrices with the goal of representing users and items in a lower dimensional latent space (also known as embedding representation in the context of deep learning algorithms). Early work in MF mainly applied the mathematical discipline of linear algebra of matrix decomposition, such as SVD (singular value decomposition) and its variants. In recent years, artificial neural network (ANN) and deep-learning (DL) techniques have been proposed, some of which generalize traditional MF algorithms via a non-linear neural architecture parameterized by neural networks and learnable weights. In the case of both linear algebra and DL-based MF models, the goal of MF is to find the right representation of each user and each item as vector representations.
In RS, various relationships exist that can be represented as graphs, such as social networks (user-user graph), commodity similarity (item-item graph), and user-item interaction (which can be modeled as a user-item bipartite graph). Graph convolution neural networks (GCNNs) have been demonstrated to be powerful tools for learning embeddings. GCNNs have been applied for recommendation by modeling the user-item interaction history as a bipartite graph. GCNNs are trained to learn user and item representations of user and item nodes in a graph structure and model user-item interaction history as connecting edges between the nodes. The vector representation of a node is learned by iteratively combining the embedding (i.e., mapping of a discrete variable to a vector of continuous numbers) of the node itself with the embeddings of the nodes in its local neighborhood. In the context of neural networks, embeddings are low-dimensional, learned continuous vector representations of discrete variables. Neural network embeddings are useful because they can reduce the dimensionality of categorical variables and meaningfully represent categories in the transformed space.
Most existing methods split the process of learning a vector representation (i.e., embedding) of a node (which can be an item node or a user node) into two steps: neighborhood aggregation, in which an aggregation function operates over sets of vectors to aggregate the embeddings of neighbors, and center-neighbor combination that combines the aggregated neighborhood vector with the central node embedding. GCNN-based CF models learn user and item node embeddings on graphs in a convolution manner by representing a node as a function of its surrounding neighborhood.
In some GCNN based bipartite graph RSs, the aggregation function operates over local neighborhoods of a central node (e.g., an item node or a user node), where a local neighborhood refers to the direct connections of that node in the given topology (graph). For example, the item nodes that interact with a central user node will form the local neighborhood of that user node. In the case of an ANN, the aggregation function can be implemented using a multi-layer perceptron (MLP) that transforms the input using a learnable non-linear transformation function to learn weights on every single dimension of the input vector. The output of the MLP layer is the input vector weighted by neural network parameters, and these parameters will be updated by gradient descent of the neural network.
Existing GCNN based bipartite graph RSs treat observed graphs as a ground-truth depiction of relationships and thus treat the observed graph as very strong prior knowledge. However, because of data sparsity, the bipartite user-item interaction graphs are in fact often missing many edges, reflecting very limited information.
Learning on fixed and incomplete graphs omits all the potential preferences of users, and thus falls short in terms of diversity and efficacy in RS applications. This can lead to deterioration in recommendation performance when learning on graphs.
Existing RSs empirically take one fixed threshold value for choosing similar users and items, which is hard to generalize across different datasets. Also, existing RSs typically share one common threshold for all users and items, which does not take personalization into account. Furthermore, existing RSs typically adopt a two-step training procedure by first searching for the best threshold value followed by prediction model training. Such a method can lead to a sub-optimal RS.
Accordingly, there is a need for an RS that is able to compensate for the data sparsity that is inherently present in an environment of rapidly expanding numbers of users and volume of content.
According to a first aspect of the present disclosure, there is provided a computer implemented method for a recommendation system (RS) for processing an input dataset that identifies a set of users, a set of items, and user-item interaction data about historic interactions between users in the set of users and items in the set of items. The computer implemented method includes generating, based on the user-item interaction data, a user-user similarity dataset that indicates user-user similarity scores for pairs of users in the set of users; generating, based on the user-item interaction data, an item-item similarity dataset that indicates item-item similarity scores for pairs of items in the set of items; and filtering the user-user similarity dataset based on a user similarity threshold vector to generate a filtered user-user similarity dataset, the user similarity threshold vector including a respective user similarity threshold value for each user in the set of users. The computer implemented method also includes generating a set of user neighbour embeddings based on the filtered user-user similarity dataset and a set of user embeddings, the set of user embeddings including a respective user embedding for each user in the set of users; filtering the item-item similarity dataset based on an item similarity threshold vector to generate a filtered item-item similarity dataset, the item similarity threshold vector including a respective item similarity threshold value for each item in the set of items; generating a set of item neighbour embeddings based on the filtered item-item similarity dataset and a set of item embeddings, the set of item embeddings including a respective item embedding for each item in the set of items; and generating a set of relevance scores based on the user neighbour embeddings and the item neighbour embeddings, the set of relevance scores including, for each user in the set of users, respective relevance scores for the items in the set of items.
The computer implemented method further includes generating a list of one or more recommended items for each user based on the set of relevance scores.
The use of personalized thresholds for each user and each item may, in some applications, enable more accurate personalized rankings to be generated by an RS. This may enable operation of an RS to be optimized such that a user is not presented with irrelevant or misleading item options. In at least some aspects of the computer-implemented method of the present disclosure, optimization can improve RS efficiency as the consumption of one or more of computing resources, communications bandwidth and power may be reduced by not presenting users with irrelevant options and minimizing exploration of irrelevant options by users.
The computer implemented method may include learning the user similarity threshold vector, the set of user embeddings, the item similarity threshold vector, and the set of item embeddings.
Thus, in some aspects of the computer implemented method of the present disclosure, threshold vectors and embeddings are learned personally and adaptively for each user and item, which may improve system accuracy and enhance the advantages noted above.
Learning the user similarity threshold vector, the set of user embeddings, the item similarity threshold vector, and the set of item embeddings may include performing a bilevel optimization process that includes an inner optimization stage for learning the user embeddings and item embeddings based on a lower-level objective function and an outer optimization stage for learning the user similarity threshold vector and item similarity threshold vector based on an upper level objective function.
The computer implemented method may include performing the bilevel optimization process by computing proxy embeddings for the user embeddings and the item embeddings and using the proxy embeddings during the outer optimization stage.
The inner optimization stage for learning the user embeddings and item embeddings may include: (a) filtering the user-user similarity dataset based on an interim user similarity threshold vector to generate an interim filtered user-user similarity dataset; (b) filtering the item-item similarity dataset based on an interim item similarity threshold vector to generate an interim filtered item-item similarity dataset; (c) generating an interim set of user neighbour embeddings based on the interim filtered user-user similarity dataset and an interim set of user embeddings; (d) generating an interim set of item neighbour embeddings based on the interim filtered item-item similarity dataset and an interim set of item embeddings; (e) generating a set of interim relevance scores based on the interim user neighbour embeddings and the interim item neighbour embeddings; (f) determining a loss based on the generated set of interim relevance scores; (g) updating the interim set of user embeddings and the interim set of item embeddings to minimize the loss; and repeating (c) to (g) until the interim set of user embeddings and the interim set of item embeddings are optimized in respect of the interim user similarity threshold vector and the interim item similarity threshold vector.
The outer optimization stage for learning the user similarity threshold vector and the item similarity threshold vector may include: (h) filtering the user-user similarity dataset based on an interim user similarity threshold vector to generate an interim filtered user-user similarity dataset; (i) filtering the item-item similarity dataset based on an interim item similarity threshold vector to generate an interim filtered item-item similarity dataset; (j) generating an interim set of user neighbour embeddings based on the interim filtered user-user similarity dataset and a proxy set of user embeddings; (k) generating an interim set of item neighbour embeddings based on the interim filtered item-item similarity dataset and a proxy set of item embeddings; (l) generating a set of interim relevance scores based on the interim user neighbour embeddings and the interim item neighbour embeddings; (m) determining the loss based on the generated set of interim relevance scores; (n) updating the interim user similarity threshold vector and the interim item similarity threshold vector to minimize the loss; and repeating (h) to (n) until the interim user similarity threshold vector and the interim item similarity threshold vector are optimized in respect of the proxy set of user embeddings and the proxy set of item embeddings. The inner optimization stage and the outer optimization stage are successively repeated during a plurality of training iterations.
Learning the user similarity threshold vector, the set of user embeddings, the item similarity threshold vector, and the set of item embeddings may include determining a plurality of triplets based on the input dataset, wherein each triplet identifies: (i) a respective user from the set of users; (ii) a positive item from the set of items that is deemed to be positive with respect to the respective user based on the user-item interaction data; and (iii) a negative item from the set of items that is deemed to be negative with respect to the respective user based on the user-item interaction data; and learning the system parameters to optimize an objective that maximizes, for the plurality of triplets, a difference between relevance scores computed for positive items with respect to users and relevance scores computed for negative items with respect to users.
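By way of illustration only, the triplet-based objective described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the disclosed implementation: the function names make_triplets, softplus and bpr_loss are hypothetical, every positive item is paired with every negative item (practical systems typically sample negatives), and the loss takes the common pairwise form -log(sigmoid(y_pos - y_neg)), which decreases as the difference between positive and negative relevance scores grows.

```python
import math

def make_triplets(interactions):
    """Build (user, positive_item, negative_item) triplets from a binary matrix.

    Assumption: every interacted item is positive and every non-interacted
    item is negative; all positive/negative pairs are enumerated.
    """
    triplets = []
    for u, row in enumerate(interactions):
        positives = [v for v, x in enumerate(row) if x > 0]
        negatives = [v for v, x in enumerate(row) if x == 0]
        triplets.extend((u, p, n) for p in positives for n in negatives)
    return triplets

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def bpr_loss(scores, triplets):
    """Mean of -log(sigmoid(y_pos - y_neg)) over all triplets."""
    total = sum(softplus(-(scores[u][p] - scores[u][n])) for u, p, n in triplets)
    return total / len(triplets)

# Hypothetical data: 2 users, 3 items.
interactions = [[1, 0, 0], [0, 1, 1]]
triplets = make_triplets(interactions)

# Scores that rank positives above negatives give a lower loss than the reverse.
good_scores = [[3.0, 0.0, 0.0], [0.0, 3.0, 3.0]]
bad_scores = [[0.0, 3.0, 3.0], [3.0, 0.0, 0.0]]
print(bpr_loss(good_scores, triplets), bpr_loss(bad_scores, triplets))
```

Minimizing this loss is equivalent to pushing the relevance score of each positive item above that of each paired negative item for the same user.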
The user-user similarity scores for the pairs of users and the item-item similarity scores for the pairs of items may be determined using a cosine similarity algorithm.
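As a non-limiting illustration, cosine similarity scores of this kind could be computed from a binary user-item interaction matrix as sketched below. The helper names cosine_similarity_matrix and transpose and the sample data are assumptions made for this sketch; plain Python lists are used only to keep the example self-contained.

```python
import math

def cosine_similarity_matrix(rows):
    """Return the pairwise cosine similarity between the row vectors in `rows`."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in rows]
    n = len(rows)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] == 0.0 or norms[j] == 0.0:
                continue  # an all-zero row is treated as similar to nothing
            dot = sum(a * b for a, b in zip(rows[i], rows[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim

def transpose(matrix):
    """Swap rows and columns so that items become the row vectors."""
    return [list(col) for col in zip(*matrix)]

# Hypothetical 3-user x 4-item binary interaction matrix.
interactions = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]

user_sim = cosine_similarity_matrix(interactions)             # 3 x 3
item_sim = cosine_similarity_matrix(transpose(interactions))  # 4 x 4
```

Users are compared by their rows of the interaction matrix and items by their columns, so both similarity datasets derive from the same interaction data.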
Filtering the user-user similarity dataset may include, for each user: replicating in the filtered user-user similarity dataset any of the user-user similarity scores for the user from the user-user similarity dataset that exceed the respective user similarity threshold value for the user, and setting to zero in the filtered user-user similarity dataset any of the user-user similarity scores for the user from the user-user similarity dataset that do not exceed the respective user similarity threshold value for the user. Filtering the item-item similarity dataset comprises, for each item: replicating in the filtered item-item similarity dataset any of the item-item similarity scores for the item from the item-item similarity dataset that exceed the respective item similarity threshold value for the item, and setting to zero in the filtered item-item similarity dataset any of the item-item similarity scores for the item from the item-item similarity dataset that do not exceed the respective item similarity threshold value for the item.
Generating the set of user neighbour embeddings may include determining a dot product of a matrix representation of the filtered user-user similarity dataset and a matrix representation of the set of user embeddings; and generating the set of item neighbour embeddings comprises determining a dot product of a matrix representation of the filtered item-item similarity dataset and a matrix representation of the set of item embeddings.
Generating the set of relevance scores may include determining a dot product of a matrix representation of the set of user neighbour embeddings and a matrix representation of the set of item neighbour embeddings.
According to a further aspect of the present disclosure, there is provided a recommendation system for processing an input dataset that identifies a set of users, a set of items, and user-item interaction data about historic interactions between users in the set of users and items in the set of items. The recommendation system includes: a processing device; a non-transitory storage device coupled to the processing device and storing software instructions which, when executed by the processing device, cause the recommendation system to perform the following operations: generate, based on the user-item interaction data, a user-user similarity dataset that indicates user-user similarity scores for pairs of users in the set of users; generate, based on the user-item interaction data, an item-item similarity dataset that indicates item-item similarity scores for pairs of items in the set of items; filter the user-user similarity dataset based on a user similarity threshold vector to generate a filtered user-user similarity dataset, the user similarity threshold vector including a respective user similarity threshold value for each user in the set of users; generate a set of user neighbour embeddings based on the filtered user-user similarity dataset and a set of user embeddings, the set of user embeddings including a respective user embedding for each user in the set of users; filter the item-item similarity dataset based on an item similarity threshold vector to generate a filtered item-item similarity dataset, the item similarity threshold vector including a respective item similarity threshold value for each item in the set of items; generate a set of item neighbour embeddings based on the filtered item-item similarity dataset and a set of item embeddings, the set of item embeddings including a respective item embedding for each item in the set of items; generate a set of relevance scores based on the user neighbour embeddings and the item neighbour embeddings, the 
set of relevance scores including, for each user in the set of users, respective relevance scores for the items in the set of items; and generate a list of one or more recommended items for each user based on the set of relevance scores.
The RS may be a GCNN based bipartite graph RS.
According to a further aspect of the present disclosure, there is provided a non-transitory computer-readable medium that stores software instructions which, when executed by a processing device, cause the processing device to: receive an input dataset that identifies a set of users, a set of items, and user-item interaction data about historic interactions between users in the set of users and items in the set of items; generate, based on the user-item interaction data, a user-user similarity dataset that indicates user-user similarity scores for pairs of users in the set of users; generate, based on the user-item interaction data, an item-item similarity dataset that indicates item-item similarity scores for pairs of items in the set of items; filter the user-user similarity dataset based on a user similarity threshold vector to generate a filtered user-user similarity dataset, the user similarity threshold vector including a respective user similarity threshold value for each user in the set of users; generate a set of user neighbour embeddings based on the filtered user-user similarity dataset and a set of user embeddings, the set of user embeddings including a respective user embedding for each user in the set of users; filter the item-item similarity dataset based on an item similarity threshold vector to generate a filtered item-item similarity dataset, the item similarity threshold vector including a respective item similarity threshold value for each item in the set of items; generate a set of item neighbour embeddings based on the filtered item-item similarity dataset and a set of item embeddings, the set of item embeddings including a respective item embedding for each item in the set of items; generate a set of relevance scores based on the user neighbour embeddings and the item neighbour embeddings, the set of relevance scores including, for each user in the set of users, respective relevance scores for the items in the set of items; and generate a list of 
one or more recommended items for each user based on the set of relevance scores.
Reference will now be made, by way of example, to the accompanying drawings which show example embodiments of the present application, and in which:
Similar reference numerals may have been used in different figures to denote similar components.
According to example embodiments, bilevel optimization is incorporated into a machine learning (ML) based recommendation system (RS). In particular, instead of using an RS training procedure in which neighborhood threshold values are first determined and then used as a hyperparameter for generating item and user node embeddings, bilevel optimization is used to collectively and adaptively learn both neighborhood threshold values and node embeddings during an end-to-end training process.
Bilevel optimization can be considered as an optimization problem that contains another optimization problem as a constraint, for example an outer optimization task (commonly referred to as the upper-level optimization task), and an inner optimization task (commonly referred to as the lower-level optimization task). Bilevel optimization can be implemented using a computer program to model hierarchical decision processes and engineering design problems. A simple form of the bilevel optimization problem is defined below:
Here, x and y are sets of upper-level variables and lower-level variables, respectively. Similarly, the functions F and f are the upper-level and lower-level objective functions, respectively, while the vector-valued functions G and g are called the upper-level and lower-level constraints, respectively. Upper-level constraints G involve variables from both levels and play a very specific role. The application of bilevel optimization in an RS will be discussed in greater detail below.
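For reference, a general bilevel formulation that is consistent with the symbols described above can be written as follows. The domains X and Y, the auxiliary variable z, and the constraint counts p and q are notational assumptions introduced for this sketch.

```latex
\min_{x \in X,\; y \in Y} \; F(x, y)
\quad \text{subject to} \quad
G_i(x, y) \le 0, \quad i \in \{1, \dots, p\},
\qquad
y \in \arg\min_{z \in Y} \bigl\{\, f(x, z) \;:\; g_j(x, z) \le 0,\; j \in \{1, \dots, q\} \,\bigr\}
```

In this form, the lower-level problem appears as a constraint on the upper-level problem: the upper-level objective F is evaluated only at values of y that are optimal for the lower-level objective f given x.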
As known in the art, a graph is a data structure that comprises nodes and edges. Each node represents an instance or data point. Each edge represents a relationship that connects two nodes. A bipartite graph is a form of graph structure in which each node belongs to one of two different node types and direct relationships (e.g., 1-hop neighbors) only exist between nodes of different types.
In example embodiments, users uAlice to uDavid and items v1 to v5 are represented in graph 101 as unattributed user nodes and item nodes respectively, meaning that each node has a type (item or user) and a unique identity (e.g., identity is indicated by the subscripts of v1 and uAlice), but no additional known attributes. In some examples, item identity could map to a specific class of item (e.g., movie). In alternative embodiments, the nodes may each be further defined by a respective set of node features (e.g., age, gender, geographic location, etc. in the case of a user, and genre, year of production, actors, movie distributor, etc. in the case of an item that is a movie).
The edges 102 that connect user nodes u to respective item nodes v indicate relationships between the nodes and collectively the edges 102 define the observed graph topology Gobs. For example, the presence or absence of an edge 102 between nodes represents the existence or absence of a predefined type of interaction between the user represented by the user node and the item represented by the item node. For example, the presence or absence of an edge 102 can indicate an interaction history such as whether or not a user u has previously selected the item v for consumption (e.g., purchase, order, download, or stream an item), or submitted a scaled (e.g., 1 to 5 star) or binary (e.g., “like”) rating in respect of the item v, or interacted with the item v in some other trackable manner.
In some example embodiments, edges 102 convey binary relationship information such that the presence of an edge indicates the presence of a positive interaction (e.g., user uAlice has previously “clicked” or rated/liked or consumed an item v1) and the absence of an edge indicates an absence of a positive interaction (e.g., the lack of an edge between the user node representing user uAlice and the item node representing item v2 indicates that user uAlice has never interacted with the particular item v2, such that item v2 is a negative item with respect to user uAlice). In some embodiments, edges 102 may be associated with further attributes that indicate a relationship strength (for example, a number of “clicks” by a user in respect of a specific item, or the level of a rating given by a user).
Thus, bipartite graph 101 includes information about users (e.g., user node set U), information about items (e.g., item node set V) and information about the historical interactions between users and items (e.g., graph topology Gobs, which can be represented as U-I interaction matrix 204).
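As an illustrative sketch only, a bipartite user-item graph with binary edges could be converted into a user-item interaction matrix as follows. The specific edges, the intermediate user names "Bob" and "Carol", and the function name build_interaction_matrix are hypothetical and are not taken from graph 101.

```python
def build_interaction_matrix(users, items, edges):
    """Return a len(users) x len(items) binary matrix; 1 marks an observed edge."""
    user_index = {u: i for i, u in enumerate(users)}
    item_index = {v: j for j, v in enumerate(items)}
    matrix = [[0] * len(items) for _ in users]
    for u, v in edges:
        matrix[user_index[u]][item_index[v]] = 1
    return matrix

# Hypothetical node sets and edges (the actual topology of graph 101 is not assumed).
users = ["Alice", "Bob", "Carol", "David"]
items = ["v1", "v2", "v3", "v4", "v5"]
edges = [
    ("Alice", "v1"), ("Alice", "v3"),
    ("Bob", "v2"),
    ("Carol", "v3"), ("Carol", "v4"),
    ("David", "v5"),
]

ui_matrix = build_interaction_matrix(users, items, edges)
# Row 0 (Alice) is [1, 0, 1, 0, 0]: the absence of an edge to v2 leaves a 0,
# which marks v2 as a negative item with respect to Alice.
```

Each row of the resulting matrix corresponds to a user node and each column to an item node, so sparsity in the graph appears directly as rows and columns that contain few non-zero entries.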
In many real-life cases, the information present in an observed bipartite graph 101 has inherent data sparsity problems in that the historical interaction data present in graph 101 will often be quite limited, especially in the case of new users and items that have few interaction records. Thus, many user nodes and many item nodes may have very few connecting edges.
Accordingly, as will be described in greater detail below, example embodiments are described that may in some applications address one or more of the issues noted above that confront existing RSs.
In this regard,
Although the RS 200 shown in
The U-I interaction dataset represented by bipartite graph 101 can be provided as input to RS 200 in the form of a nuser×nitem, user-item (U-I) interaction matrix 204 (
As indicated in
Referring to
Filtering operation 214 is configured to filter user-user pairs from U-U similarity matrix (SU) 208 and item-item pairs from I-I similarity matrix (SI) 210 that fall below threshold values. Filtering of U-U similarity matrix (SU) 208 results in a filtered U-U similarity dataset FU, as represented by the function:
In Equation 1, SU is the matrix of cosine similarity scores included in U-U similarity matrix (SU) 208, and KU is a personalized threshold vector that includes nuser threshold values (i.e., a personalized threshold value for each respective user u).
Similarly, filtering of I-I similarity matrix (SI) 210 results in a filtered I-I similarity dataset FI, as represented by the function:
In Equation 2, SI is the matrix of cosine similarity scores included in I-I similarity matrix (SI) 210, and KI is a personalized threshold vector that includes nitem threshold values (i.e., a personalized threshold value for each respective item v).
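The filtering described for Equations 1 and 2 can be sketched as follows, assuming the rule stated elsewhere in this disclosure: a similarity score is replicated when it exceeds the threshold of the respective user or item, and is set to zero otherwise. The function name filter_by_personal_threshold, the variable names S_I, K_I and F_I, and the numeric values are assumptions made for illustration. How self-similarity entries on the diagonal are treated is not specified here, so the sketch simply applies the same rule to them.

```python
def filter_by_personal_threshold(sim, thresholds):
    """Keep sim[i][j] only where it exceeds the threshold of row entity i."""
    return [
        [score if score > thresholds[i] else 0.0 for score in row]
        for i, row in enumerate(sim)
    ]

# Hypothetical symmetric item-item similarity scores for three items.
S_I = [
    [1.0, 0.6, 0.2],
    [0.6, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]
K_I = [0.5, 0.7, 0.1]  # one illustrative threshold per item

F_I = filter_by_personal_threshold(S_I, K_I)
# F_I[0][1] is 0.6 but F_I[1][0] is 0.0: the same pair score passes the
# threshold of item 0 and fails the threshold of item 1.
```

Because each row is compared against its own threshold, a symmetric similarity matrix can produce an asymmetric filtered matrix, which is why the filtered data can be viewed as a directed graph.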
For example, in
Among other things, the use of personalized thresholds for each item enables the resulting filtered similarity data to be directional, meaning that, for a pair consisting of a first item and a second item, the similarity threshold that is applied can be different for the first item with respect to the second item than for the second item with respect to the first item. For example, the pair similarity score for the first item, second item pair may meet the similarity threshold k of the first item, but the same pair similarity score may fail to meet the similarity threshold of the second item. An example of this directionality is illustrated in
The filtered I-I similarity dataset FI that is included in I-I Filtered Similarity Matrix 402 can also be represented as an I-I directed graph 404 as shown in
As will be explained in greater detail below, threshold vectors KU and KI (collectively denoted as threshold vector K ∈ R|U|+|I|) are adaptively learned over a set of training iterations during a training phase, such that a respective, unique filtering threshold value k is learned for each user u and item v. Prior to training, initialized threshold vectors KinitU, KinitI can be generated by random sampling from a range or pre-defined distribution of candidate threshold values.
Filtering of U-U pairs and I-I pairs has previously been performed by using a single threshold value for all users and a single threshold value for all items. The use of personalized thresholds that are learned respectively for each user and each item may, in some applications, enable more accurate personalized rankings to be generated by an RS. This may enable operation of an RS to be optimized such that a user is not presented with irrelevant or misleading item options. In at least some examples, optimization of operation of an RS can improve efficiency of the RS as the consumption of one or more of computing resources, communications bandwidth and power may be reduced by not presenting users with irrelevant options and minimizing exploration of irrelevant options by users.
Referring again to
$N_U = F_U \cdot \Theta_U$ (Eq. 3)
In Equation 3, ΘU ∈ R|U|×d is a set of user embeddings that are learned during iterative training of RS 200, and d is the dimensionality of each embedding.
Accordingly, in example embodiments, the neighbor embeddings NU is a matrix that is the dot product of the filtered U-U similarity dataset FU and the user embeddings ΘU.
In example embodiments, generation of neighbor embeddings NI for items can be represented by the function:
$N_I = F_I \cdot \Theta_I$ (Eq. 4)
In Equation 4, ΘI ∈ R|I|×d is a set of item embeddings that are learned during iterative training of RS 200.
Accordingly, in example embodiments, the neighbor embeddings NI is a matrix that is the dot product of the filtered I-I similarity dataset FI and the item embeddings ΘI.
As will be explained in greater detail below, the sets of personalized user embeddings ΘU and item embeddings ΘI (collectively denoted as model embeddings Θ ∈ R(|U|+|I|)×d) are adaptively learned over a set of training iterations performed during a training phase, such that a respective, unique embedding is learned for each user u and item v. Prior to performing the set of training iterations during the training phase, initialized user embeddings ΘinitU and item embeddings ΘinitI can be generated by random sampling from a range or pre-defined distribution of candidate embedding values.
Thus, the function performed by filter and aggregate module 212 in respect of each of the U-U similarity matrix (SU) 208 and I-I similarity matrix (SI) 210 can be represented by the equation:
In example embodiments, a relevance score generation module 218 is configured to generate a respective relevance score ŷuv for each user-item pair included in the input U-I interaction matrix. In example embodiments, a U-I relevance score matrix ŶUV can be generated as a dot product of the user neighbor embeddings NU and the item neighbor embeddings NI using the function:
$$\hat{Y}_{UV} = (N_U \cdot N_I) \qquad \text{(Eq. 6)}$$
In Equation 6, each user-item relevance score ŷuv indicates a relevance score for a respective item v with respect to a respective user u.
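A minimal sketch of Equation 6 follows. Because NU has one row of length d per user and NI has one row of length d per item, the sketch pairs each user row with each item row, which corresponds to multiplying NU by the transpose of NI. That transposition is an assumption inferred from the stated dimensions, and the function name and example values are hypothetical.

```python
def relevance_scores(n_u, n_i):
    """Return a |U| x |I| matrix whose entry [u][v] is the dot product of
    user u's neighbor embedding and item v's neighbor embedding."""
    return [[sum(a * b for a, b in zip(u_row, i_row)) for i_row in n_i]
            for u_row in n_u]


# Hypothetical neighbor embeddings: 2 users, 3 items, d = 2.
N_U = [[1.0, 0.5],
       [0.0, 2.0]]
N_I = [[1.0, 0.0],
       [0.5, 0.5],
       [0.0, 1.0]]

Y_UV = relevance_scores(N_U, N_I)
```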
As will be explained in greater detail below, the training phase of RS 200 is performed until the system parameters (in particular, model embeddings Θ and threshold vector K) have been adaptively learned to optimize a defined objective. When the training phase is complete and the defined objective is optimized, a final set ŶUV of relevance scores is generated by relevance score generation module 218 during an inference phase, and this final set ŶUV of relevance scores can be used by a generate ranking lists module 230 to generate a personalized recommendation list xuv of items that are most relevant for each individual user u. In some examples, the inference phase may be a final iteration of the training phase.
Training of RS 200 will now be described in greater detail according to example embodiments. In example embodiments, a bilevel optimization objective, adapted from the Bayesian Personalized Ranking (BPR) loss, is used to train RS 200. In particular, values for the system parameters, namely model embeddings Θ and threshold vector K, are learned to optimize a training objective. In example embodiments, the training objective is a bilevel optimization objective, with the model embeddings Θ being learned during a model embeddings update stage to optimize an inner or lower level optimization task and the threshold vector K being learned during a threshold vector update stage to optimize an outer or higher level training task. In this regard, the recommendation task that is performed by RS 200 is treated as a ranking problem in which the input is user implicit feedback and the output is an ordered set of recommended items Xu with respect to each user u.
Referring to
During training, the relevance scores ŷuv generated by relevance score generation module 218 can be separated, based on user and item identity, by the loss computation module 220, into relevance scores ŷui that correspond to user-item pairs in which the item is positive with respect to the user and relevance scores ŷuj that correspond to user-item pairs in which the item is negative with respect to the user. During the training phase, the objective is a joint optimization objective to learn system parameters (model embeddings Θ and threshold vector K) that will maximize the difference between the relevance scores ŷui and ŷuj that correspond to the user, positive item and negative item identified in a ground truth (u, i, j) triplet.
In this regard, a joint optimization objective can be represented as:
With the loss L in Equation 7 being denoted as:
$$L(u, i, j; \Theta, K) = -\ln\Big(\sigma\big(\hat{y}_{ui}(f(\Theta, K)) - \hat{y}_{uj}(f(\Theta, K))\big)\Big) + \Omega(\Theta) \qquad \text{(Eq. 8)}$$
In Equation 8, Ω(·) is a regularization term.
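The per-triplet loss of Equation 8 can be sketched as below. The disclosure only states that Ω is a regularization term; the L2 penalty, the coefficient `reg`, and the function name used here are assumptions. The negative log-sigmoid is computed in a numerically stable form.

```python
import math


def bpr_loss(y_ui, y_uj, params, reg=0.01):
    """-ln(sigmoid(y_ui - y_uj)) plus an assumed L2 penalty on `params`."""
    diff = y_ui - y_uj
    # -ln(sigmoid(x)) equals ln(1 + exp(-x)); branch to avoid overflow.
    if diff >= 0:
        nll = math.log1p(math.exp(-diff))
    else:
        nll = -diff + math.log1p(math.exp(diff))
    return nll + reg * sum(p * p for p in params)


well_ranked = bpr_loss(2.0, 0.0, [])    # positive item scored higher
badly_ranked = bpr_loss(0.0, 2.0, [])   # negative item scored higher
```

The loss shrinks as the positive item's score exceeds the negative item's score by a wider margin, which is the behavior the training objective relies on.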
The joint optimization objective of Equation 7 can be difficult to achieve because the threshold values in threshold vector K can be very small (or zero), and no clear constraints or guidance are provided for determining threshold vector K, which can result in long search times and difficulty converging. To address this issue, in example embodiments the joint optimization is treated as a bilevel optimization problem in which the threshold vector K is a set of upper-level (e.g., outer) variables and the model embeddings Θ are a set of lower-level (e.g., inner) variables. The upper level and lower level objective functions can be respectively represented as:
Where:
As indicated in
Accordingly, during the training stage for RS 200, the system parameters are learned through a two stage interactive training process. In particular, an inner optimization/model embedding Θ update stage is performed during which the threshold vector K is fixed and model embeddings Θ are updated using gradient descent. An outer optimization/threshold vector update stage is then performed, during which the model embeddings Θ are fixed and threshold vector K is updated using gradient descent. The inner and outer update stages can be iteratively repeated until convergence is achieved. As noted above, in the case of bilevel optimization the outer optimization constraints must be enforced indirectly. Accordingly, in example embodiments a proxy function is used to connect the gradient on threshold vector K with the outer objective. The proxy function is defined below:
$$\tilde{\Theta}_{t+1} := \Theta_t - \alpha \nabla_{\Theta}$$
The proxy model embeddings $\tilde{\Theta}_{t+1}$ are the model embeddings $\Theta_t$ from the previous training iteration adjusted by the gradient descent value determined by the current training iteration as scaled by a hyperparameter scaling value α.
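The alternating inner and outer updates, with the outer gradient routed through the proxy embeddings, can be illustrated with the toy sketch below. It is not the training procedure of RS 200: the quadratic inner and outer objectives, the outer step size `beta`, the finite-difference gradients, and all function names are assumptions chosen so that the example is small and runs with the standard library only.

```python
def grad(f, x, eps=1e-6):
    """Central-difference gradient of scalar function f at list x."""
    g = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        g.append((f(hi) - f(lo)) / (2 * eps))
    return g


def proxy_embeddings(theta, k, inner_loss, alpha):
    """One gradient step on the inner loss: theta - alpha * grad_theta."""
    g = grad(lambda t: inner_loss(t, k), theta)
    return [t - alpha * gi for t, gi in zip(theta, g)]


def train(theta, k, inner_loss, outer_loss, alpha=0.1, beta=0.5, iters=200):
    for _ in range(iters):
        # Inner stage: thresholds fixed, embeddings take a gradient step.
        theta = proxy_embeddings(theta, k, inner_loss, alpha)

        # Outer stage: embeddings fixed; the outer loss depends on k
        # through the proxy embeddings, which gives k a usable gradient.
        def through_proxy(k_):
            return outer_loss(proxy_embeddings(theta, k_, inner_loss, alpha), k_)

        gk = grad(through_proxy, k)
        k = [ki - beta * gi for ki, gi in zip(k, gk)]
    return theta, k


# Toy objectives (assumed): the inner loss pulls theta toward k, while the
# outer loss only looks at theta and prefers theta close to 1.
inner = lambda theta, k: sum((t - ki) ** 2 for t, ki in zip(theta, k))
outer = lambda theta, k: sum((t - 1.0) ** 2 for t in theta)

theta, k = train([0.0], [0.0], inner, outer)
```

In this toy setup the outer objective never references k directly, so k can only be improved through its effect on the proxy embeddings. Both variables end up close to 1 after the alternating updates.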
A pseudocode representation of the bilevel optimization process for training RS 200 to learn optimized system parameters for the filter and aggregate module 212 is represented in
The present disclosure provides a novel bilevel optimization framework to achieve personalized neighborhood selection in recommendation systems such as RS 200. The similarity threshold values included in threshold vector K are treated as learnable system parameters that are learned in an end-to-end way, rather than as hyperparameters as in existing RSs. Further, instead of searching for a global optimal threshold value by using Bayesian search algorithms as is done in existing RSs, the disclosed solution uses bilevel optimization to jointly learn the item and user embeddings and the threshold vector K adaptively during the training phase. The threshold values are not fixed and shared for all users and items, but rather a personalized threshold value is learned for each individual user and item for choosing neighbors.
In example embodiments, the filter and aggregate module 212, including filter operation 214 and aggregate operation 216, can be embedded into a variety of different ML models. For example, personalized RSs commonly use deep learning/graph neural network (GNN) models that are configured to learn user and item embeddings as the ultimate goal. Accordingly, one or more of the operations of filter and aggregate module 212 and relevance score generation module 218 may be embedded in a GNN model.
During an inference phase, the following operations are performed to process the user-user and item-item similarity data:
As indicated at block 704, the user-user similarity dataset is filtered based on a user similarity threshold vector to generate a filtered user-user similarity dataset, and the item-item similarity dataset is filtered based on an item similarity threshold vector to generate a filtered item-item similarity dataset. The user similarity threshold vector includes a respective user similarity threshold value for each user in the set of users, and the item similarity threshold vector includes a respective item similarity threshold value for each item in the set of items.
As indicated at block 706, a set of user neighbor embeddings is generated based on the filtered user-user similarity dataset and a set of user embeddings, the set of user embeddings including a respective user embedding for each user in the set of users. Similarly, a set of item neighbor embeddings is generated based on the filtered item-item similarity dataset and a set of item embeddings, the set of item embeddings including a respective item embedding for each item in the set of items.
As indicated at block 708, a set of relevance scores is generated based on the user neighbor embeddings and the item neighbor embeddings, the set of relevance scores including, for each user in the set of users, respective relevance scores for the items in the set of items.
As indicated at block 710, a list of one or more recommended items is then generated for each user based on the set of relevance scores.
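The final ranking step at block 710 can be sketched as a top-k selection over each user's row of relevance scores. The function name, the `top_k` parameter, and the example scores are hypothetical; the disclosure does not specify the list length or a tie-breaking rule.

```python
def rank_items(scores, top_k):
    """For each user row, return the indices of the top_k highest-scoring
    items, ordered from most to least relevant."""
    ranked = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda v: row[v], reverse=True)
        ranked.append(order[:top_k])
    return ranked


# Hypothetical relevance scores for 2 users over 3 items.
scores = [[0.2, 0.9, 0.5],
          [0.7, 0.1, 0.3]]
recommended = rank_items(scores, top_k=2)
```

A deployed system might additionally exclude items the user has already interacted with before ranking. That filtering is omitted here because the disclosure does not describe it at this step.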
In example embodiments, the user similarity threshold vector, the set of user embeddings, the item similarity threshold vector, and the set of item embeddings collectively comprise system parameters that are learned during a training phase that precedes the inference phase. As described above, during the training phase a bilevel optimization process is performed that includes an inner optimization stage for learning the user embeddings and item embeddings based on a lower-level objective function and an outer optimization stage for learning the user similarity threshold vector and item similarity threshold vector based on an upper level objective function.
In example embodiments, the inner optimization stage for learning the user embeddings and item embeddings includes: (a) filtering the user-user similarity dataset based on an interim user similarity threshold vector to generate an interim filtered user-user similarity dataset; (b) filtering the item-item similarity dataset based on an interim item similarity threshold vector to generate an interim filtered item-item similarity dataset; (c) generating an interim set of user neighbor embeddings based on the interim filtered user-user similarity dataset and an interim set of user embeddings; (d) generating an interim set of item neighbor embeddings based on the interim filtered item-item similarity dataset and an interim set of item embeddings; (e) generating a set of interim relevance scores based on the interim user neighbor embeddings and the interim item neighbor embeddings; (f) determining a loss based on the generated set of interim relevance scores; (g) updating the interim set of user embeddings and the interim set of item embeddings to minimize the loss; and repeating (c) to (g) until the interim set of user embeddings and the interim set of item embeddings are optimized in respect of the interim user similarity threshold vector and the interim item similarity threshold vector.
In example embodiments, the outer optimization stage for learning the user similarity threshold vector and the item similarity threshold vector includes: (h) filtering the user-user similarity dataset based on an interim user similarity threshold vector to generate an interim filtered user-user similarity dataset; (i) filtering the item-item similarity dataset based on an interim item similarity threshold vector to generate an interim filtered item-item similarity dataset; (j) generating an interim set of user neighbor embeddings based on the interim filtered user-user similarity dataset and a proxy set of user embeddings; (k) generating an interim set of item neighbor embeddings based on the interim filtered item-item similarity dataset and a proxy set of item embeddings; (l) generating a set of interim relevance scores based on the interim user neighbor embeddings and the interim item neighbor embeddings; (m) determining the loss based on the generated set of interim relevance scores; (n) updating the interim user similarity threshold vector and the interim item similarity threshold vector to minimize the loss; and repeating (i) to (n) until the interim user similarity threshold vector and the interim item similarity threshold vector are optimized in respect of the proxy set of user embeddings and the proxy set of item embeddings. The inner optimization stage and the outer optimization stage are successively repeated during a plurality of training iterations.
In some examples, performing the training phase includes determining a plurality of triplets based on the input dataset, wherein each triplet identifies: (i) a respective user from the set of users; (ii) a positive item from the set of items that is deemed to be positive with respect to the respective user based on the user-item interaction data; and (iii) a negative item from the set of items that is deemed to be negative with respect to the respective user based on the user-item interaction data. Learning of the system parameters is performed to optimize an objective that maximizes, for the plurality of triplets, a difference between relevance scores computed for positive items with respect to users and relevance scores computed for negative items with respect to users.
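One way to build the (user, positive item, negative item) triplets from implicit feedback is sketched below. Treating every non-interacted item as a candidate negative, drawing one negative uniformly at random per positive, and the function name are all assumptions; the disclosure only requires that each triplet identify a user, a positive item and a negative item.

```python
import random


def build_triplets(interactions, rng):
    """Return (u, i, j) triplets from a binary user-item interaction matrix:
    i is an item user u interacted with, j is one the user did not."""
    triplets = []
    for u, row in enumerate(interactions):
        positives = [v for v, x in enumerate(row) if x]
        negatives = [v for v, x in enumerate(row) if not x]
        if not negatives:
            continue  # user interacted with everything; no negative exists
        for i in positives:
            triplets.append((u, i, rng.choice(negatives)))
    return triplets


# Hypothetical interaction matrix: 2 users, 3 items (1 = interacted).
interactions = [[1, 0, 1],
                [0, 1, 0]]
triplets = build_triplets(interactions, random.Random(0))
```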
Processing System
In example embodiments, the operations performed by RS 200 are computer implemented using one or more physical or virtual computing devices. In an example, the operations performed by the RS 200 may be implemented as software that forms part of a “software-as-a-service” offering of a cloud computing service provider.
The processing system 170 may include a processing device 172 that comprises one or more processing elements, such as a processor, a microprocessor, a graphics processing unit (GPU), an artificial intelligence processor, a tensor processing unit, a neural processing unit, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), dedicated logic circuitry, accelerator logic, or combinations thereof. The processing system 170 may also include one or more input/output (I/O) interfaces 174, which may enable interfacing with one or more appropriate input devices 184 and/or output devices 186. The processing system 170 may include one or more network interfaces 176 for wired or wireless communication with a network.
The processing system 170 may also include one or more storage devices 178, which may include a mass storage device such as a solid state drive, a hard disk drive, a magnetic disk drive and/or an optical disk drive. The processing system 170 may include one or more memories 180, which may include a volatile or non-volatile memory (e.g., a flash memory, a random access memory (RAM), and/or a read-only memory (ROM)). The memory(ies) 180 may store instructions for execution by the processing device(s) 172, such as instructions that configure the processing system 170 to implement the operations of RS 200 and carry out examples described in the present disclosure. The memory(ies) 180 may include other software instructions, such as for implementing an operating system and other applications/functions.
There may be a bus 182 providing communication among components of the processing system 170, including the processing device(s) 172, I/O interface(s) 174, network interface(s) 176, storage device(s) 178 and/or memory(ies) 180. The bus 182 may be any suitable bus architecture including, for example, a memory bus, a peripheral bus or a video bus.
Although the present disclosure describes methods and processes with steps in a certain order, one or more steps of the methods and processes may be omitted or altered as appropriate. One or more steps may take place in an order other than that in which they are described, as appropriate. In the present disclosure, use of the terms “a,” “an,” or “the” is intended to include the plural forms as well, unless the context clearly indicates otherwise. Also, the terms “includes,” “including,” “comprises,” “comprising,” “have,” or “having” when used in this disclosure specify the presence of the stated elements, but do not preclude the presence or addition of other elements.
Although the present disclosure is described, at least in part, in terms of methods, a person of ordinary skill in the art will understand that the present disclosure is also directed to the various components for performing at least some of the aspects and features of the described methods, be it by way of hardware components, software or any combination of the two. Accordingly, the technical solution of the present disclosure may be embodied in the form of a software product. A suitable software product may be stored in a pre-recorded storage device or other similar non-volatile or non-transitory computer readable medium, including DVDs, CD-ROMs, USB flash disk, a removable hard disk, or other storage media, for example. The software product includes instructions tangibly stored thereon that enable a processing device (e.g., a personal computer, a server, or a network device) to execute examples of the methods disclosed herein.
The present disclosure may be embodied in other specific forms without departing from the subject matter of the claims. The described example embodiments are to be considered in all respects as being only illustrative and not restrictive. Selected features from one or more of the above-described embodiments may be combined to create alternative embodiments not explicitly described, features suitable for such combinations being understood within the scope of this disclosure.
All values and sub-ranges within disclosed ranges are also disclosed. Also, although the systems, devices and processes disclosed and shown herein may comprise a specific number of elements/components, the systems, devices and assemblies could be modified to include additional or fewer of such elements/components. For example, although any of the elements/components disclosed may be referenced as being singular, the embodiments disclosed herein could be modified to include a plurality of such elements/components. The subject matter described herein intends to cover and embrace all suitable changes in technology.
The contents of all published papers identified in this disclosure are incorporated herein by reference.