This disclosure relates generally to generating recommendations using adversarial counterfactual learning and evaluation.
Item recommendations can assist a user when selecting items online. For example, when a user views an anchor item, one or more recommended items can be displayed, which can be items that are similar and/or complementary to the anchor item. Many recommendation models are used conventionally. These recommendation models typically do not account for the underlying exposure mechanism, which can result in suboptimal recommendations.
To facilitate further description of the embodiments, the following drawings are provided in which:
For simplicity and clarity of illustration, the drawing figures illustrate the general manner of construction, and descriptions and details of well-known features and techniques may be omitted to avoid unnecessarily obscuring the present disclosure. Additionally, elements in the drawing figures are not necessarily drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help improve understanding of embodiments of the present disclosure. The same reference numerals in different figures denote the same elements.
The terms “first,” “second,” “third,” “fourth,” and the like in the description and in the claims, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms “include,” and “have,” and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, device, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, system, article, device, or apparatus.
The terms “left,” “right,” “front,” “back,” “top,” “bottom,” “over,” “under,” and the like in the description and in the claims, if any, are used for descriptive purposes and not necessarily for describing permanent relative positions. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the apparatus, methods, and/or articles of manufacture described herein are, for example, capable of operation in other orientations than those illustrated or otherwise described herein.
The terms “couple,” “coupled,” “couples,” “coupling,” and the like should be broadly understood and refer to connecting two or more elements mechanically and/or otherwise. Two or more electrical elements may be electrically coupled together, but not be mechanically or otherwise coupled together. Coupling may be for any length of time, e.g., permanent or semi-permanent or only for an instant. “Electrical coupling” and the like should be broadly understood and include electrical coupling of all types. The absence of the word “removably,” “removable,” and the like near the word “coupled,” and the like does not mean that the coupling, etc. in question is or is not removable.
As defined herein, two or more elements are “integral” if they are comprised of the same piece of material. As defined herein, two or more elements are “non-integral” if each is comprised of a different piece of material.
As defined herein, “approximately” can, in some embodiments, mean within plus or minus ten percent of the stated value. In other embodiments, “approximately” can mean within plus or minus five percent of the stated value. In further embodiments, “approximately” can mean within plus or minus three percent of the stated value. In yet other embodiments, “approximately” can mean within plus or minus one percent of the stated value.
As defined herein, “real-time” can, in some embodiments, be defined with respect to operations carried out as soon as practically possible upon occurrence of a triggering event. A triggering event can include receipt of data necessary to execute a task or to otherwise process information. Because of delays inherent in transmission and/or in computing speeds, the term “real-time” encompasses operations that occur in “near” real-time or somewhat delayed from a triggering event. In a number of embodiments, “real-time” can mean real-time less a time delay for processing (e.g., determining) and/or transmitting data. The particular time delay can vary depending on the type and/or amount of the data, the processing speeds of the hardware, the transmission capability of the communication hardware, the transmission distance, etc. However, in many embodiments, the time delay can be less than approximately 0.1 second, 0.5 second, one second, two seconds, five seconds, or ten seconds.
Turning to the drawings,
Continuing with
As used herein, “processor” and/or “processing module” means any type of computational circuit, such as but not limited to a microprocessor, a microcontroller, a controller, a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a graphics processor, a digital signal processor, or any other type of processor or processing circuit capable of performing the desired functions. In some examples, the one or more processors of the various embodiments disclosed herein can comprise CPU 210.
In the depicted embodiment of
In some embodiments, network adapter 220 can comprise and/or be implemented as a WNIC (wireless network interface controller) card (not shown) plugged or coupled to an expansion port (not shown) in computer system 100 (
Although many other components of computer system 100 (
When computer system 100 in
Although computer system 100 is illustrated as a desktop computer in
Turning ahead in the drawings,
Generally, therefore, system 300 can be implemented with hardware and/or software, as described herein. In some embodiments, part or all of the hardware and/or software can be conventional, while in these or other embodiments, part or all of the hardware and/or software can be customized (e.g., optimized) for implementing part or all of the functionality of system 300 described herein.
Recommendation system 310 and/or web server 320 can each be a computer system, such as computer system 100 (
In some embodiments, web server 320 can be in data communication through a network 330 with one or more user devices, such as a user device 340. User device 340 can be part of system 300 or external to system 300. Network 330 can be the Internet or another suitable network. In some embodiments, user device 340 can be used by users, such as a user 350. In many embodiments, web server 320 can host one or more websites and/or mobile application servers. For example, web server 320 can host a website, or provide a server that interfaces with an application (e.g., a mobile application), on user device 340, which can allow users (e.g., 350) to browse and/or search for items (e.g., products, grocery items), to add items to an electronic cart, and/or to purchase items, in addition to other suitable activities. In a number of embodiments, web server 320 can interface with recommendation system 310 when a user (e.g., 350) is viewing items in order to recommend items to the user.
In some embodiments, an internal network that is not open to the public can be used for communications between recommendation system 310 and web server 320 within system 300. Accordingly, in some embodiments, recommendation system 310 (and/or the software used by such systems) can refer to a back end of system 300 operated by an operator and/or administrator of system 300, and web server 320 (and/or the software used by such systems) can refer to a front end of system 300, as it can be accessed and/or used by one or more users, such as user 350, using user device 340. In these or other embodiments, the operator and/or administrator of system 300 can manage system 300, the processor(s) of system 300, and/or the memory storage unit(s) of system 300 using the input device(s) and/or display device(s) of system 300.
In certain embodiments, the user devices (e.g., user device 340) can be desktop computers, laptop computers, mobile devices, and/or other endpoint devices used by one or more users (e.g., user 350). A mobile device can refer to a portable electronic device (e.g., an electronic device easily conveyable by hand by a person of average size) with the capability to present audio and/or visual data (e.g., text, images, videos, music, etc.). For example, a mobile device can include at least one of a digital media player, a cellular telephone (e.g., a smartphone), a personal digital assistant, a handheld digital computer device (e.g., a tablet personal computer device), a laptop computer device (e.g., a notebook computer device, a netbook computer device), a wearable user computer device, or another portable computer device with the capability to present audio and/or visual data (e.g., images, videos, music, etc.). Thus, in many examples, a mobile device can include a volume and/or weight sufficiently small as to permit the mobile device to be easily conveyable by hand. For example, in some embodiments, a mobile device can occupy a volume of less than or equal to approximately 1790 cubic centimeters, 2434 cubic centimeters, 2876 cubic centimeters, 4056 cubic centimeters, and/or 5752 cubic centimeters. Further, in these embodiments, a mobile device can weigh less than or equal to 15.6 Newtons, 17.8 Newtons, 22.3 Newtons, 31.2 Newtons, and/or 44.5 Newtons.
Exemplary mobile devices can include (i) an iPod®, iPhone®, iTouch®, iPad®, MacBook® or similar product by Apple Inc. of Cupertino, Calif., United States of America, (ii) a Blackberry® or similar product by Research in Motion (RIM) of Waterloo, Ontario, Canada, (iii) a Lumia® or similar product by the Nokia Corporation of Keilaniemi, Espoo, Finland, and/or (iv) a Galaxy™ or similar product by the Samsung Group of Samsung Town, Seoul, South Korea. Further, in the same or different embodiments, a mobile device can include an electronic device configured to implement one or more of (i) the iPhone® operating system by Apple Inc. of Cupertino, Calif., United States of America, (ii) the Blackberry® operating system by Research In Motion (RIM) of Waterloo, Ontario, Canada, (iii) the Android™ operating system developed by the Open Handset Alliance, or (iv) the Windows Mobile™ operating system by Microsoft Corp. of Redmond, Wash., United States of America.
In many embodiments, recommendation system 310 and/or web server 320 can each include one or more input devices (e.g., one or more keyboards, one or more keypads, one or more pointing devices such as a computer mouse or computer mice, one or more touchscreen displays, a microphone, etc.), and/or can each comprise one or more display devices (e.g., one or more monitors, one or more touch screen displays, projectors, etc.). In these or other embodiments, one or more of the input device(s) can be similar or identical to keyboard 104 (
Meanwhile, in many embodiments, recommendation system 310 and/or web server 320 also can be configured to communicate with one or more databases, such as a database system 315. The one or more databases can include a product database that contains information about products, items, or SKUs (stock keeping units), for example, among other information, as described below in further detail. The one or more databases can be stored on one or more memory storage units (e.g., non-transitory computer readable media), which can be similar or identical to the one or more memory storage units (e.g., non-transitory computer readable media) described above with respect to computer system 100 (
The one or more databases can each include a structured (e.g., indexed) collection of data and can be managed by any suitable database management systems configured to define, create, query, organize, update, and manage database(s). Exemplary database management systems can include MySQL (Structured Query Language) Database, PostgreSQL Database, Microsoft SQL Server Database, Oracle Database, SAP (Systems, Applications, & Products) Database, and IBM DB2 Database.
Meanwhile, recommendation system 310, web server 320, and/or the one or more databases can be implemented using any suitable manner of wired and/or wireless communication. Accordingly, system 300 can include any software and/or hardware components configured to implement the wired and/or wireless communication. Further, the wired and/or wireless communication can be implemented using any one or any combination of wired and/or wireless communication network topologies (e.g., ring, line, tree, bus, mesh, star, daisy chain, hybrid, etc.) and/or protocols (e.g., personal area network (PAN) protocol(s), local area network (LAN) protocol(s), wide area network (WAN) protocol(s), cellular network protocol(s), powerline network protocol(s), etc.). Exemplary PAN protocol(s) can include Bluetooth, Zigbee, Wireless Universal Serial Bus (USB), Z-Wave, etc.; exemplary LAN and/or WAN protocol(s) can include Institute of Electrical and Electronic Engineers (IEEE) 802.3 (also known as Ethernet), IEEE 802.11 (also known as WiFi), etc.; and exemplary wireless cellular network protocol(s) can include Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Evolution-Data Optimized (EV-DO), Enhanced Data Rates for GSM Evolution (EDGE), Universal Mobile Telecommunications System (UMTS), Digital Enhanced Cordless Telecommunications (DECT), Digital AMPS (IS-136/Time Division Multiple Access (TDMA)), Integrated Digital Enhanced Network (iDEN), Evolved High-Speed Packet Access (HSPA+), Long-Term Evolution (LTE), WiMAX, etc. The specific communication software and/or hardware implemented can depend on the network topologies and/or protocols implemented, and vice versa. In many embodiments, exemplary communication hardware can include wired communication hardware including, for example, one or more data buses, such as, for example, universal serial bus(es), one or more networking cables, such as, for example, coaxial cable(s), optical fiber cable(s), and/or twisted pair cable(s), any other suitable data cable, etc. Further exemplary communication hardware can include wireless communication hardware including, for example, one or more radio transceivers, one or more infrared transceivers, etc. Additional exemplary communication hardware can include one or more networking components (e.g., modulator-demodulator components, gateway components, etc.).
In many embodiments, recommendation system 310 can include a communication system 311, a training system 312, an evaluation system 313, a real-time serving system 314, and/or database system 315. In many embodiments, the systems of recommendation system 310 can be modules of computing instructions (e.g., software modules) stored at non-transitory computer readable media that operate on one or more processors. In other embodiments, the systems of recommendation system 310 can be implemented in hardware. Recommendation system 310 and/or web server 320 each can be a computer system, such as computer system 100 (
In many embodiments, system 300 can provide item recommendations to a user (e.g., a customer) based on an anchor item that the user has selected to view, is about to view, and/or is viewing. For example, when a user selects an item (e.g., a product) to view from a list of products, an item page for the item can display information about the item. This item can be considered the anchor item. The item page also can display information about other items related to the anchor item. These other items can be the item recommendations, and they can be items that are similar and/or complementary to the anchor item. If a user is interested in an anchor item, but it is not exactly what the user wants, a similar item might be what the user wants (e.g., butter and margarine, or butter and shortening). By contrast, complementary items are items that are different (often in different categories), but are often purchased together (e.g., hot dogs and hot dog buns, or hot dogs and ketchup).
Conventionally, item recommendations are shown to users, but a user is exposed to a very small subset of the total number of items available. Moreover, a user's interest can be affected by the items that the user has been shown. For example, if a user has been shown three items and is interested in a first item of the three items, the user's interest in the first item can be based on having seen those three items. For a fourth item that has not been shown to the user, it can be difficult to know if the user will be interested in the fourth item. Moreover, the extent of a user's exposure to items is often unknown, as a user can be exposed to items outside of web server 320, such as through television advertisements, seeing items at a physical store, seeing items on different websites, or talking to other people about items. As such, it can be difficult to know if a user is interested in an item or if the user was exposed to the item elsewhere and is looking for more information about the item. A click model is often used to address this issue, but recommendations from such models often assume the user's interest, which can be an incorrect assumption.
The feedback data of recommender systems are often subject to what was exposed to the users; however, most learning and evaluation methods do not account for the underlying exposure mechanism. Applying supervised learning to detect user preferences can end up with inconsistent results in the absence of exposure information. The counterfactual propensity-weighting approach from causal inference can account for the exposure mechanism; nevertheless, the partial-observation nature of the feedback data can cause identifiability issues. In a number of embodiments, system 300 can use a minimax empirical risk formulation. The relaxation of the dual problem can be converted to an adversarial game between two recommendation models, in which the opponent of the candidate model characterizes the underlying exposure mechanism. Learning bounds can be provided, and simulation studies illustrate and justify the techniques described herein over a broad range of recommendation settings, which can shed insights on the various benefits of the techniques described herein.
In the offline learning and evaluation of recommender systems, the dependency of feedback data on the underlying exposure mechanism is often overlooked. When the users express their preferences on the products explicitly (such as providing ratings) or implicitly (such as clicking), the feedback is conditioned on the products to which they are exposed. In most cases, the previous exposures are decided by some underlying mechanism, such as the historical recommender system. The dependency causes two dilemmas for machine learning in recommender systems, and satisfactory solutions have yet to be found. Firstly, the majority of supervised learning models handle merely the dependency between label (user feedback) and features, yet in the actual feedback data, the exposure mechanism can alter the dependency pathways, as shown in
From a theoretical perspective, directly applying supervised learning on feedback data can result in inconsistent detection of the user preferences. Secondly, an unbiased model evaluation requires that the product exposure be determined by the candidate recommendation model, which is almost never satisfied when merely using the feedback data. The second dilemma also reveals a gap between evaluating models by online experiments and using history data, because the offline evaluations can be more likely to bias toward the history exposure mechanism, as it decided the products on which the users might express their preferences. The disagreement between the online and offline evaluations may partly explain the controversial observations made in several recent papers, in which deep recommendation models are outperformed by classical collaborative filtering approaches in offline evaluations, despite their many successful deployments in real-world applications.
In a number of embodiments, to address the above dilemmas for recommender systems, the idea of counterfactual modeling can be used to redesign the learning and evaluation methods. Counterfactual modeling can answer questions related to "what if", e.g., what would the feedback data be if the candidate model were deployed. The counterfactual methods can take account of the dependency between the feedback data and exposure. Conventional attempts have relied on excessive data or model assumptions, such as the missing-data model described below, which may not be satisfied in practice. Many of the assumptions can be essentially unavoidable due to a fundamental discrepancy between the recommender system and observational studies. In observational studies, the exposure (treatment) status can be fully observed, and the exposure mechanism can be completely decided by the covariates (features). For recommender systems, the exposure can be only partially captured by the feedback data. The complete exposure status can be retrieved from the system's backend log, to which access can be highly restricted, and such access rarely exists for the public datasets. Also, the exposure mechanism can depend on intractable randomness, e.g., burst events, special offers, interference with other modules such as advertisements, as well as the relevant features that are not attainable from feedback data.
Turning ahead in the drawings,
In a number of embodiments, the techniques described herein can acknowledge the uncertainty brought by the identifiability issue and treat it as an adversarial component. A minimax setting can be used in which the candidate model can be optimized over the worst-case exposure mechanism. By applying duality arguments and relaxations, the minimax problem can be converted to an adversarial game between two recommendation models. This approach is novel and principled, and can advantageously provide a theoretical analysis to show an inconsistency issue of supervised learning on recommender systems, which is caused by the unknown exposure mechanism. A minimax setting for counterfactual recommendation can beneficially be used and converted to a tractable two-model adversarial game. The generalization bounds for the adversarial learning described herein are shown, with analysis for the minimax optimization. Simulation and real data experiments demonstrate performance benefits of the techniques described herein.
Bold-faced letters are used to denote vectors and matrices, upper-case letters to denote random variables, and the corresponding lower-case letters to denote observations. Distributions are denoted by P and Q. Let xu be the user feature vector for user u ∈ {1, . . . , n}, zi be the item feature vector for item i ∈ {1, . . . , m}, Ou,i ∈ {0,1} be the exposure status, Yu,i be the feedback, and D be the collected user-item pairs, where non-positive interactions may come from negative sampling. The feature vectors can be one-hot encodings or embeddings, so this approach can be fully compatible with deep learning models that leverage representation learning and are trained under negative sampling. Recommendation models are denoted by, e.g., fθ and gψ. They take xu, zi (and the exposure Ou,i if available) as input. The shorthand fθ(u, i) is used to denote the output score, and the loss with respect to the feedback yu,i is given by δ(yu,i, fθ(u, i)). These notations also apply to the sequential recommendation by encoding the previously-interacted items into the user feature vector xu.
Pg(Ou,i|xu, zi) is used to denote the exposure mechanism that depends on the underlying model g. Also, p(Yu,i|Ou,i, xu, zi) gives the user response, which is independent from the exposure mechanism whenever Ou,i is observed. The stochasticity in the exposure can also be induced by exogenous factors (unobserved confounders) that bring extra random perturbations. Explicit and implicit feedback settings are not explicitly differentiated unless specified.
Let Yu,i ∈ {−1, 1} be the implicit feedback. Setting aside the exposure for a moment, the goal of supervised learning is to determine the optimal recommendation function that minimizes the surrogate loss:
where φ induces the widely-adopted margin-based loss. Now take account of the (unobserved) exposure status by first letting:
p(1)(o) = p(Yu,i = 1, Ou,i = o, xu, zi), p(−1)(o) = p(Yu,i = −1, Ou,i = o, xu, zi), o ∈ {0,1},
to denote the joint distribution for positive and negative feedback under either exposure status. The surrogate loss, which now depends on p(1) and p(−1) due to including the exposure, is denoted by Lφ(fθ, {p(1), p(−1)}). In Assertion 1, it is shown that if the exposure mechanism is fixed and fθ is optimized, the optimal loss and the corresponding f*θ depend merely on p(1) and p(−1).
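For illustration, the following is a minimal sketch of such a margin-based surrogate under a logistic choice of φ (the logistic form and the function names are illustrative assumptions, not part of the disclosure):

    import numpy as np

    def logistic_phi(margin):
        # phi(t) = log(1 + exp(-t)); computed stably via logaddexp
        return np.logaddexp(0.0, -margin)

    def surrogate_loss(scores, feedback):
        # scores: f_theta(u, i) for each collected (u, i) pair in D
        # feedback: Y_{u,i} in {-1, +1}
        return logistic_phi(feedback * scores).mean()

    # hypothetical toy data for three user-item pairs
    print(surrogate_loss(np.array([2.1, -0.3, 0.8]), np.array([1, -1, 1])))

Any other convex margin-based φ (e.g., hinge or exponential) fits the same formulation.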
Assertion 1. When the exposure mechanism p(Ou,i|Xu, Zi) is given and fixed, the optimal loss is:
where P(1) and P(−1) are the corresponding distributions for p(1) and p(−1), and Dc(P(1)∥P(−1)) is the f-divergence induced by the convex, lower-semicontinuous function c. Also, the optimal f*θ that achieves the infimum is given by
for some function αφ* that depends on φ.
The proof of Assertion 1 is provided as follows:
Proof. When taking the exposure mechanism into account, minimizing the loss over fθ is implicitly solving inffθ Lφ(fθ, {p(1), p(−1)}).
For any fixed exposure mechanism p(O|x, z), there is
For each o ∈ {0, 1}, let μ(o) = p(−1)(o)/p(1)(o) and Δ(μ) = −infα(φ(α) + φ(−αμ)).
Notice that Δ(μ) is a convex function of μ since the supremum (negative of the infimum) over a set of affine functions is convex. Since Δ is convex and continuous:
which is exactly the f-divergence DΔ(P(1)∥P(−1)) induced by Δ.
Also, upon achieving the infimum in (A.1), the optimal fθ is given by solving αφ*(μ) = arg minα(φ(α) + φ(−αμ)).
Notice that the joint distribution can be factorized into: p(Yu,i, ou,i|xu, zi) ∝ p(Yu,i|ou,i, xu, zi)·Pg(ou,i|xu, zi), so Assertion 1 implies that: fθ*(xu, zi; ou,i) = αφ*(p(Yu,i = 1|ou,i, xu, zi)/p(Yu,i = −1|ou,i, xu, zi)).
In conclusion: (1) when the exposure mechanism is given, the optimal loss −Dc(P(1)∥P(−1)) is a function of both the user preference and the exposure mechanism; (2) the optimal model f*θ depends merely on the user preference, because f*θ is a function of p(Y|o, x, z), which does not depend on the exposure mechanism (as mentioned at the beginning of this section). Both conclusions are practically reasonable, as the optimal recommendation model should detect user preference regardless of the exposure mechanisms. The optimal loss, on the other hand, depends on the joint distribution in which the underlying exposure mechanism plays a part.
However, when p(Ou,i|Xu, Zi) is unknown, the conclusions from Assertion 1 no longer hold, and the optimal f*θ will depend on the exposure mechanism. As a consequence, if the same feedback data were collected under different exposure mechanisms, the recommendation model may find the user preference differently. The inconsistency is caused by not accounting for the unknown exposure mechanism from the supervised learning.
In causal inference, the probability of exposure given the observed features (covariates) is referred to as the propensity score. The propensity-weighting approach uses weights based on the propensity score to create a synthetic sample in which the distribution of observed features is independent of exposure. This approach can be beneficial to make the feedback data independent of the exposure mechanism. The propensity-weighted loss is constructed via:
and by taking the expectation with respect to the exposure (whose distribution is denoted by Q), the ordinary loss is recovered:
where the second expectation is taken with respect to the empirical distribution Pn. Let Qo be the distribution for the underlying exposure mechanism. The propensity-weighted empirical distribution is then given by Pn/Qo (after scaling), which can be thought of as the synthetic sample distribution after eliminating the influence from the underlying exposure mechanism. It is straightforward to verify that after scaling, the expected propensity-weighted loss is exactly the loss under Pn/Qo.
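As a concrete illustration of the propensity-weighted loss, the following sketch assumes a hypothetical per-observation propensity estimate and a logistic surrogate (both are assumptions for concreteness):

    import numpy as np

    def ips_loss(scores, feedback, propensity, floor=0.05):
        # propensity: estimated q(O_{u,i} = 1 | x_u, z_i) per observed pair;
        # flooring keeps propensities bounded away from zero (cf. Assertion 2)
        w = 1.0 / np.clip(propensity, floor, 1.0)
        losses = np.logaddexp(0.0, -feedback * scores)  # logistic surrogate
        return (w * losses).sum() / w.sum()             # self-normalized (scaled) loss

The self-normalization implements the scaling mentioned above, so the weighted sample behaves like a draw from Pn/Qo.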
Conventional approaches known as the “click model” deal with the unidentifiable exposure mechanism by assuming a missing-data model:
p(click=1|x)=p(expose=1|x)·p(relevance=1|x). (3)
While the click model greatly simplifies the problem because the exposure mechanism can now be characterized explicitly, it relies on a hidden assumption that is rarely satisfied in practice. Use R to denote the relevance and Y to denote the click. The fact that Y=1 ⇔O=1 and R=1 implies:
which suggests that being relevant is independent of getting exposed given the features. This is rarely true (or at least cannot be examined) in many real-world problems, unless x contains every single factor that may affect the exposure and user preference. By contrast, in many embodiments, the techniques described herein can provide a robust solution when the hidden assumption of the missing-data (click) model is dubious or violated.
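As a small numeric illustration of the click model in (3), with hypothetical probabilities chosen only for illustration:

    p_expose = 0.2       # p(expose = 1 | x)
    p_relevance = 0.7    # p(relevance = 1 | x)
    p_click = p_expose * p_relevance   # = 0.14 under the factorization in (3)
    # The factorization holds only if relevance and exposure are conditionally
    # independent given x; if exposure preferentially targets relevant items,
    # the product misstates the true click probability.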
Let P* be the ideal exposure-eliminated sample distribution corresponding to P/Qo, according to the underlying exposure mechanism Qo and data distribution P. For notational simplicity, without overloading the original meaning by too much, from this point P, Pn, Qo and P* can be treated as distributions on the sample space X, which includes all the observed data (xu, zi, yu,i) with (u, i) ∈ D. Since there are no data or model assumptions made to allow for accurately recovering P*, a minimax formulation is introduced to characterize the uncertainty and optimize fθ against the worst possible choice of (a hypothetical) {circumflex over (P)}, whose discrepancy from the ideal P* is restricted by the data to a neighborhood: Dist(P*, {circumflex over (P)}) < ρ. Among the divergence and distribution distance measures, the Wasserstein distance can be chosen for this problem, which is defined as:
where c: X×X→[0, +∞) is the convex, lower semicontinuous transportation cost function with c(t, t)=0, and Π({circumflex over (P)}, P*) is the set of all distributions whose marginals are given by {circumflex over (P)} and P*. Intuitively, the Wasserstein distance can be interpreted as the minimum cost associated with transporting mass between probability measures. The Wasserstein distance is chosen instead of others in order to understand how to transport from the empirical data distribution to an ideal synthetic data distribution in which the observations were independent of the exposure mechanism. Hence, the local minimax empirical risk minimization (ERM) problem can be considered:
which can directly account for the uncertainty induced by the lack of identifiability in the exposure mechanism, and can optimize fθ under the worst possible setting. However, the formulation in (5) is first of all a constrained optimization problem. Secondly, the constraint is expressed in terms of the hypothetical P*. After applying a duality argument, the dual problem can be expressed via the exposure mechanism in the following Assertion 2. {circumflex over (Q)} is used to denote some estimation of Qo.
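For intuition on the neighborhood constraint Dist(P*, {circumflex over (P)}) < ρ, the following one-dimensional sketch uses SciPy's empirical Wasserstein distance (the library choice, the distributions, and the radius are assumptions of this sketch; the formulation above is general):

    import numpy as np
    from scipy.stats import wasserstein_distance

    rng = np.random.default_rng(0)
    samples_ideal = rng.normal(0.0, 1.0, size=1000)  # stand-in for P*
    samples_hat = rng.normal(0.3, 1.0, size=1000)    # stand-in for a hypothetical P-hat

    rho = 0.5
    dist = wasserstein_distance(samples_ideal, samples_hat)
    print(dist, dist < rho)  # True indicates P-hat lies inside the rho-neighborhood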
Assertion 2. Suppose that the transportation cost c is continuous and the propensity scores are all bounded away from zero, i.e., p(Ou,i = 1|xu, zi) ≥ μ. Let P = {P: Wc(P*, P) < ρ}; then
where co is a positive constant and {circumflex over (q)} is the density function associated with {circumflex over (Q)}.
The proof of Assertion 2 is provided as follows, which first proves the dual formulation for the minimax ERM stated in Assertion 2, and then discusses the relaxation for the dual problem:
Proof. For the estimation {circumflex over (P)} = P/{circumflex over (Q)} of the ideal exposure-eliminated sample, Wc({circumflex over (P)}, P*) ≤ ρ is equivalent to Wc(P/{circumflex over (Q)}, P/Qo) ≤ ρ.
Observe that when P is given by the empirical distribution that assigns uniform weights to all samples, the Wasserstein distance Wc(P/{circumflex over (Q)}, P/Qo) is convex in {circumflex over (Q)}(−1) (since c is convex), and {circumflex over (Q)} = Qo gives Wc(P/{circumflex over (Q)}, P/Qo) = 0.
Since the propensity scores are assumed to be bounded away from zero, P/{circumflex over (Q)} and P/Qo exist and are well-behaved. The duality results can thus be established, since Slater's condition holds. Let h = (x, z, y) ∈ X and let X′ be a copy of X. Thus:
where in the last line the shorthand notation δfθ(h) is used for the loss δ(y, fθ(x, z)),
and it is then shown that the opposite direction also holds, so equality always holds. Let 𝒦 be the space of measurable conditional distributions (Markov kernels) from X to X′; then
In the next step, consider the space of all measurable mappings h′ ↦ h(h′) from X′ to X. Since all the mappings are measurable, the underlying spaces are regular, and δfθ remains measurable, so:
where the h(·) on the LHS represents the mapping, and the h on the RHS still denotes elements from the sample space X. Now let the support of the conditional distribution K(h, h′) be given by h(h′). So according to (A.5):
Combining (A.6), (A.4) and (A.3), it can be seen that
Finally, notice that
so according to (A.2), the final result is reached:
To reach the relaxation given in (5), use the alternate expression for the Wasserstein distance obtained from the Kantorovich-Rubinstein duality. Denote the Lipschitz continuity for a function f by ∥f∥L ≤ l. When the cost function c is l-Lipschitz continuous, Wc(P1, P2) is also referred to as the Wasserstein-l distance. Without loss of generality, consider ∥c∥L ≤ 1, such as the ℓ2 norm, and with that the Wasserstein distance is equivalent to: Wc(P1, P2) = sup∥f∥L≤1 {EP1[f(h)] − EP2[f(h)]}, where f: X → ℝ. In practice, when P is the empirical distribution that assigns uniform weights to all the samples:
where the above ai are all constants induced by using the change-of-measure with importance-weighting estimators, and the induced cost function {tilde over (c)} on the last line satisfies ∥{tilde over (c)}∥L ≤ max{a5, a6}. Therefore, it can be seen that the Wasserstein distance between Pn/{circumflex over (Q)} and Pn/Qo can be bounded by W{tilde over (c)}({circumflex over (Q)}, Qo). Hence, for each α ≥ 0 in (A.8),
is a relaxation of the result in Assertion 2. In practice, the specific forms of the cost functions c or {tilde over (c)} do not matter, because the Wasserstein distance is intractable and the data-dependent surrogates discussed below in connection with practical implementations can be used.
Considering the relaxation for each fixed α (see the appendix), the minimax objective has a desirable formulation where α becomes a tuning parameter:
To make sense of (6), observe that while {circumflex over (Q)} is acting adversarially against fθ as the inverse weights in the first term, it cannot arbitrarily increase the objective function, since the second term acts as a regularizer that keeps {circumflex over (Q)} close to the true exposure mechanism Q0. Compared with the primal problem in (5), the relaxed dual formulation in (6) gives the desired unconstrained optimization problem. Also, note that the exposure mechanism is often given by the recommender system that was operating during the data collection, which can be leveraged as domain knowledge to further convert (6) to a more tractable formulation. Let g* be the recommendation model that underlies Q0. Assume for now that Pg(O = 1|X, Z) is given by G(g(X, Z)) ∈ (μ, 1), μ > 0, for some transformation function G. The inclusion and manipulation of the unobserved factors is discussed below in connection with practical implementations. The objective in (6) can then be converted to a two-model adversarial game:
Before discussing the implications of (7), its practical implementations, and the minimax optimization, the theoretical guarantees for the generalization error are shown and discussed, in comparison to the standard ERM setting, after introducing the adversarial component.
Before stating results, the loss function corresponding to the adversarial objective can be characterized, as well as the complexity of the hypothesis space. For the first purpose, the cost-regulated loss is introduced, which is defined as:
For the second purpose, consider the entropy integral J(ℱ) = ∫0∞{square root over (log N(ϵ; ℱ, ∥·∥∞))}dϵ, where ℱ = {δ(fθ, ·)|fθ ∈ F} is the hypothesis class and N(ϵ; ℱ, ∥·∥∞) gives the covering number for the ϵ-cover of ℱ in terms of the ∥·∥∞ norm. Suppose that |δ(y, fθ(x, z))| ≤ M holds uniformly. The following theorem gives the worst-case generalization bound under the minimax setting.
Theorem 1. Suppose the mapping G from gψ to q(o = 1|x, z) is one-to-one and surjective, with gψ ∈ 𝒢. Let 𝒢(ρ) = {g ∈ 𝒢 | Wc(G(g), G(g*)) ≤ ρ}. Then under the conditions specified in Assertion 2, for all γ ≥ 0 and ρ > 0, the following inequality holds with probability at least 1−ϵ:
where c1 is a positive constant and c2 is a simple linear function with positive weights.
The proof of Theorem 1 is provided as follows:
Proof. Following the same arguments from the proof of Assertion 2, a result similar to that stated in (A.8) is
then notice that
Since |δfθ| is uniformly bounded by M,
Then let ϵ1, . . . , ϵN be i.i.d. Rademacher random variables independent of H, and let H′i be an i.i.d. copy of Hi for i = 1, . . . , N.
Applying the symmetrization argument, it can be seen that
It is clear that each ϵiΔγ(fθ; Hi) is zero-mean, and it is now shown that it is sub-Gaussian as well.
For any two fθ, f′θ, the bounded difference is shown:
Hence, it can be seen that
is sub-Gaussian with respect to
Therefore, Wγ can be bounded by using the standard technique for Rademacher complexity and Dudley's entropy integral:
Combining all the above bounds in (A.11), (A.12) and (A.15) yields the desired result.
The generalization bound in Theorem 1 holds for all ρ and γ, and it is shown that when they are decided by some data-dependent quantities, the result can be converted to simplified forms that reveal more direct connections with the propensity-weighted loss and the standard ERM results:
Corollary 1. Following the statements in Theorem 1, there exist some data-dependent γn and ρn(fθ), such that when γ ≥ γn, for all ρ > 0:
and when ρ=ρn(fθ), for all γ≥0:
as suggested by Theorem 1.
The proof of Corollary 1 is provided as follows:
Proof. To obtain the first result, let the data-dependent γn be given by
Then according to the definition of Δγ:
It is straightforward to verify that
as well as
which also equals
Therefore, when γ=γn, then
Similarly, it can be shown that when γ=γn, the above equality also holds. Hence, replace P
in Theorem 1 and obtain the first result.
To obtain the second result, define the transportation map:
Then according to (A.8), the empirical maximizer for sup{circumflex over (P)}:W
where Ih assigns point mass at h, since it maximizes
Then let ρn(fθ)=Wc({circumflex over (P)}(fθ), Pn), which equals P
for some {tilde over (ρ)} that absorbs the excessive constant terms, which can be plugged into Theorem 1 to obtain the second result of Corollary 1.
Corollary 1 shows that the approach described herein has the same 1/{square root over (n)} rate as the standard ERM. Also, the first result reveals an extra δρ bias term induced by the adversarial setting, and the second result characterizes how the additional uncertainty is reflected in the propensity-weighted empirical loss.
Directly optimizing the minimax objective in (7) can be infeasible because g* is unknown and the Wasserstein distance is hard to compute when gφ is a complicated model such as a neural network. Nevertheless, understanding the comparative roles of fθ and gφ can help in constructing practical solutions.
Recall that the goal is optimizing fθ. The auxiliary gφ is introduced to characterize the adversarial exposure mechanism, so there is less interest in recovering the true g*. With that being said, the term Wc(G(gφ), G(g*)) serves to establish certain regularizations on gφ such that it is constrained by the underlying exposure mechanism. Relaxing or tightening the regularization term should not significantly impact the solution, because the regularization parameter α can be adjusted. Hence, tractable regularizers can be designed to approximate or even replace Wc(G(gφ), G(g*)), as long as the constraint on gφ is established under the same principle. Similar ideas have also been applied to train the generative adversarial network (GAN): the optimal classifier depends on the unknown data distribution, so in practice, people use alternative tractable classifiers that fit the problem. Several alternative regularizers for gφ are listed below.
The focus here is on the third example because it applies to almost all cases without requiring excessive assumptions. Therefore, the practical adversarial objective is now given by:
In the next step, it is considered how to handle the unobserved factors that also play a part in the exposure mechanism. As mentioned above, having unobserved factors is practically inevitable. In particular, Tukey's factorization, used in missing-data approaches, can be leveraged. In the presence of unobserved factors, Tukey's factorization suggests additionally characterizing the relationship between the exposure mechanism and the outcome.
For clarity, a simple logistic regression can be employed to model G as:
Gβ(gφ(x, z), y) = σ(β0 + β1gφ(x, z) + β2y),
where σ(·) is the sigmoid function. The final form of the adversarial game can be expressed as follows:
β can be placed in the minimization problem for the following reason. By design, Gβ merely characterizes the potential impact of unobserved factors, which is not considered to act adversarially. Otherwise, the adversarial model can be too strong for fθ to learn anything useful.
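A minimal PyTorch sketch of the logistic model Gβ above follows (the module and variable names are illustrative, not part of the disclosure):

    import torch
    import torch.nn as nn

    class GBeta(nn.Module):
        # G_beta(g_phi(x, z), y) = sigmoid(beta0 + beta1 * g_phi(x, z) + beta2 * y)
        def __init__(self):
            super().__init__()
            self.beta = nn.Parameter(torch.zeros(3))  # (beta0, beta1, beta2)

        def forward(self, g_score, y):
            # g_score: output of g_phi(x, z); y: observed feedback in {-1, +1}
            logits = self.beta[0] + self.beta[1] * g_score + self.beta[2] * y
            return torch.sigmoid(logits)

Consistent with the reasoning above, the β parameters would be updated on the minimization side together with fθ rather than with the adversary.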
Tukey's factorization has implications for handling unobserved factors in exposure. It can be used by the Gβ model to handle the unobserved factors in the recommender system.
The following notation is used for the counterfactual outcome: Yu,i(o), o ∈ {0,1}, which represents what the user feedback would be if the exposure Ou,i were given by o ∈ {0,1}. In the factual world, observation is limited to Yu,i for either Ou,i = 1 or Ou,i = 0, and the tuple (Yu,i(1), Yu,i(0)) is not jointly observed at the same time.
In the absence of unobserved factors, the joint distribution of (Yu,i(1), Yu,i(0)) has a straightforward formulation and can be estimated effectively from data using tools from causal inference. However, when unobserved factors exist, there is confounding between (Yu,i(1), Yu,i(0)), which violates a fundamental assumption of many causal inference solutions.
Tukey's factorization, on the other hand, characterizes the missing data distribution regardless of the unobserved factors as:
where
captures the unknown mechanism in the missing data distribution.
To see how the counterfactual outcome is reflected in the above formulation, when O=õ:=1−o and o=1:
which gives the joint distribution of the outcome if the item was not exposed and the observed data where the item is exposed. Notice that both p(Y(o)|O=o, X, Z) and p(O=o|X, Z) can be estimated from the data, since Y(o) is observed under O=o. So the unknown mechanism in the missing data distribution is:
pβ(O|Y(o), X, Z)/pβ(O = o|Y(o), X, Z).
Hence, the counterfactual outcome distribution can be given by:
pβ(Y(o)|O = 1−o, X, Z) ∝ pobs(Y(o)|O = o, X, Z)/Gβ(Y(o), X, Z), o ∈ {0, 1}, (A.17)
where pobs denotes the observable distribution and Gβ(Y(o), X, Z) characterizes the exposure mechanism even when unobserved factors exist.
The unknown Gβ(Y(o), X, Z) can be treated as a learnable objective in this setting. As discussed herein, gψ can be used to characterize the role of X and Z in the exposure mechanism Gβ, hence the formulation of Gβ(Y, gψ(X, Z)) in (9).
Including Y in modeling the exposure mechanism can cause the so-called self-selection problem in causal inference. This setting does not fall into that category, since the objective is to learn fθ, rather than to make inference on its treatment effect.
It is shown in the ablation studies that if the user feedback Y is not included, i.e., Gβ(Y, gψ(X, Z)) := σ(gψ(X, Z)), the improvements over the original models will be less significant.
In a number of embodiments, to handle the adversarial training, the sequential optimization setup can be adopted in which the players take turns updating their models. Without loss of generality, the objective in (8) can be treated as a function of the two models, minfθ maxgφ ℓ(fθ, gφ).
Consequently, the stationary points in Algorithm 1 may not attain a local Nash equilibrium. Nevertheless, when the timescales of the two models differ significantly (by adjusting the initial learning rates and discounts), it has been shown that the stationary points belong to the local minimax solution up to some degenerate cases. The local minimaxity captures the optimal strategies in the sequential game if both models are allowed to merely change their strategies locally. Hence, Algorithm 1 leads to solutions that are locally optimal. Finally, the role of Gβ is less relevant in the sequential game, and no significant differences are observed from updating it before or after fθ and gφ.
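A schematic PyTorch step for the sequential two-timescale updates is sketched below; weighted_loss, regularizer, and alpha are hypothetical stand-ins for the propensity-weighted loss and its regularization term, so this illustrates the update order rather than a definitive statement of Algorithm 1:

    import torch

    def adversarial_step(f_model, g_model, g_beta, batch,
                         weighted_loss, regularizer, alpha, opt_f, opt_g):
        # Adversary step (timescale r_psi): g ascends the propensity-weighted
        # loss while the regularizer keeps it near the underlying exposure model.
        opt_g.zero_grad()
        g_obj = -(weighted_loss(f_model, g_model, g_beta, batch)
                  - alpha * regularizer(g_model, batch))
        g_obj.backward()
        opt_g.step()

        # Candidate step (timescale r_theta): f_theta and the non-adversarial
        # beta parameters of G_beta descend the same loss.
        opt_f.zero_grad()
        f_obj = weighted_loss(f_model, g_model, g_beta, batch)
        f_obj.backward()
        opt_f.step()

    # Hypothetical two-timescale optimizers; beta joins the minimization side:
    # opt_f = torch.optim.Adam(list(f_model.parameters()) + list(g_beta.parameters()), lr=1e-3)
    # opt_g = torch.optim.Adam(g_model.parameters(), lr=1e-4)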
Recommenders are often evaluated by the mean square error (MSE) on explicit feedback, and by information retrieval metrics such as DCG and NDCG on implicit feedback. After the training, the candidate model fθ can be obtained, as well as the Gβ(gφ) that gives the worst-case propensity score function specialized for fθ. Therefore, instead of pursuing unbiased evaluation, a robust evaluation using Gβ(gφ) can be considered. It frees the offline evaluation from the potential impact of the exposure mechanism, and thus provides a robust view of the true performance. For instance, the robust NDCG can be computed via:
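The display formula for the robust NDCG is not reproduced here; the following hedged sketch shows one plausible propensity-weighted form, where the inverse-propensity weighting scheme and the ideal-ordering normalization are assumptions of this sketch:

    import numpy as np

    def robust_dcg(ranked_relevance, propensity, floor=0.05):
        # ranked_relevance: relevance labels in the candidate model's ranked order
        # propensity: worst-case exposure probabilities from G_beta(g_phi)
        w = 1.0 / np.clip(propensity, floor, 1.0)
        gains = (2.0 ** ranked_relevance - 1.0) * w
        discounts = np.log2(np.arange(2, len(ranked_relevance) + 2))
        return float(np.sum(gains / discounts))

    def robust_ndcg(ranked_relevance, propensity):
        order = np.argsort(-ranked_relevance)  # hypothetical ideal ordering
        ideal = robust_dcg(ranked_relevance[order], propensity[order])
        return robust_dcg(ranked_relevance, propensity) / max(ideal, 1e-12)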
Simulation studies, real-data analysis, and online experiments were conducted to demonstrate the various benefits of the adversarial counterfactual learning and evaluation approach described herein. In the simulation study, the synthetic data was generated using a real-world explicit feedback dataset so that there is access to the oracle exposure mechanism. It is then shown that models trained by the techniques described herein achieve superior unbiased offline evaluation performance. In the real-world data analysis, it is demonstrated that the models trained by the techniques described herein also achieve more improvements even under the standard offline evaluation. Online experiments also were conducted, which verify that the robust evaluation described herein is more accurate than the standard offline evaluation when compared with the actual online evaluations.
The techniques described herein involve a high-level learning and evaluation approach that is compatible with most of the existing recommendation models, so these well-known baseline models were used to demonstrate the effectiveness of the techniques described herein. Specifically, the popularity-based recommendation (Pop), matrix factorization collaborative filtering (CF), the multi-layer perceptron-based CF model (MLP), neural CF (NCF), and the generalized matrix factorization (GMF) are employed as representatives of content-based recommendation. The prevailing attention-based model (Attn) also is considered as a representative of sequential recommendation. Also, fθ and gφ are chosen among the above baseline models for this adversarial counterfactual learning. To fully demonstrate the effectiveness of the adversarial training described herein, experiments were also conducted with the non-adversarially trained propensity-score method (PS), in which gφ is first optimized merely on the regularization term until convergence and then kept fixed, and fθ is then trained in the regular propensity-weighted ERM setting. For the sake of notation, the learning approach described herein is listed as ACL (adversarial counterfactual learning).
The various methods were examined with the widely-adopted next-item recommendation task. In particular, all but the last two user-item interactions are used for training, the second-to-last interaction is used for validation, and the last interaction is used for testing. The data descriptions, preprocessing steps, train-validation-test split, simulation settings, detailed model configuration, as well as the implementation procedure are now described. The training process is visualized in a way that reveals the adversarial nature of the approach described herein. A complete set of ablation study and sensitivity analysis results is provided to demonstrate the robustness of this approach. The implementation and datasets have been made available at https://github.com/StatsDLMathsRecomSys/Adversarial-Counterfactual-Learning-and-Evaluation-for-Recommender-System.
Three real-world datasets are considered, which cover movie, book and music recommendations:
The Movielens-1M dataset has been filtered before being made available, such that each user in the dataset has rated at least 20 movies. For the LastFM and Goodreads datasets, infrequent items (books/artists) are first eliminated, as well as users that have fewer than 20 records. After examination, a small proportion of users are found to have an abnormal amount of interactions. Therefore, the users who have more than 1,000 interactions are treated as spam users and not included in the analysis.
The train-validation-test split is carried out based on the order of the user-item interactions. The standard setting is adopted, where for each user interaction sequence, all items but the last two are used in training, the second-to-last interaction is used in validation, and the last interaction is used in testing.
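A compact sketch of this leave-last-out split, assuming each user's interactions are already in chronological order:

    def split_user_sequence(items):
        # items: one user's chronologically ordered item interactions
        assert len(items) >= 3
        return items[:-2], items[-2], items[-1]  # train, validation, test

    train, valid, test = split_user_sequence([10, 42, 7, 99, 3])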
In a modern real-world recommender system, the exposure mechanism is determined by the underlying recommender model as well as various other factors. In an attempt to mimic the real-world recommender systems, a two-stage simulation approach is designed to generate the semi-synthetic data that remains truthful to the signal in the original dataset.
The purpose of the first stage is to learn the characteristics from the data, such as the user relevance (rating) model and the partial exposure model (which may be inaccurate due to the partial observation of exposure status). In the second stage, the working method of a real-world recommender system is simulated, and the user response is generated accordingly. In order to recover the user-item relevance as accurately as possible, the explicit feedback datasets are used for the simulation, i.e., the Movielens-1M and Goodreads datasets.
In the first stage, given a true rating matrix, two hidden-factor matrix factorization models are trained. The first model tries to recover the rating matrix by minimizing the mean-squared loss. This model is referred to as the relevance model. Since for the explicit feedback data the rated items have all been exposed, given the output [Ru,i|Ou,i=1], the relevance probability is defined as
psim1(Yu,i = 1|Ou,i = 1) := σ([Ru,i|Ou,i = 1] + ϵ1),
where σ(.) is the sigmoid function, and the Gaussian noise ϵ1 reflects the perturbations brought by unobserved factors. The second model is an implicit-feedback model trained to predict the occurrence of the rating event {circumflex over (p)}(Ou,i=1), where instead of using the original ratings, the non-zero entries in the rating matrix are all converted to one.
After obtaining the {circumflex over (p)}(Ou,i=1), the simulation exposure probability is defined as log psim1(Ou,i=1)=log {circumflex over (p)}(Ou,i=1)+ϵ2, where ϵ2 also gives the extra randomness due to the unobserved factors.
Now, after obtaining the simulated psim1(Yu,i = 1|Ou,i = 1) and psim1(Ou,i = 1), which reflect both the relevance and the exposure that underlie the real data-generating mechanism while taking account of the effects from unobserved factors, the first-stage click data is generated by:
psim1(Yu,i = 1) = psim1(Yu,i = 1|Ou,i = 1)psim1(Ou,i = 1).
So far, in the first stage, an implicit feedback dataset has been generated that remains truthful to the original real dataset. Now the self-defined components can be added, which gives more control over the exposure mechanism. Specifically, the new user and item hidden factors x, z are obtained by training another implicit matrix factorization model using the generated click data. The extra self-defined exposure function e(x, z) is generated and added to the first-stage psim1, to obtain the second-stage exposure mechanism:
log psim2(Ou,i=1)=log psim1(Ou,i=1)+e(x, z).
The final click data is then generated via:
psim2(Yu,i = 1) = psim1(Yu,i = 1|Ou,i = 1)psim2(Ou,i = 1).
Having the second stage in the simulation is beneficial, because the focus of the first stage is to mimic the generating mechanism of the real-world dataset. The second stage allows control of the exposure mechanism via the extra e(x, z). Also, retraining the implicit matrix factorization model at the beginning of the second stage is not required, though it can help to better characterize the data generated in the first stage.
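A condensed NumPy sketch of the two-stage generation described above (the noise scales, the clamp on the exposure probability, and the form of e(x, z) are illustrative placeholders):

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def simulate_clicks(r_hat, p_hat, e, noise1=0.1, noise2=0.1):
        # r_hat: relevance-model output [R_ui | O_ui = 1]; p_hat: p-hat(O_ui = 1)
        p_rel = sigmoid(r_hat + rng.normal(0, noise1, r_hat.shape))     # stage 1 relevance
        log_p_exp = np.log(p_hat) + rng.normal(0, noise2, p_hat.shape)  # stage 1 exposure
        log_p_exp = log_p_exp + e                                       # stage 2: add e(x, z)
        p_click = p_rel * np.minimum(np.exp(log_p_exp), 1.0)            # clamp to a probability
        return rng.binomial(1, p_click)                                 # final click data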
For the baseline models considered, other than Pop, the dimension of the user and item hidden factors, the initial learning rate, and the ℓ2 regularization strength are the basic hyperparameters. The initial learning rate is selected from {0.001, 0.005, 0.01, 0.05, 0.1}, and the ℓ2 regularization strength from {0, 0.01, 0.05, 0.1, 0.2, 0.3}. The tuning parameters are selected separately to avoid excessive computations. The hidden dimension is fixed at 32 for the models in order to achieve fair comparisons in the experiments. Also, notice that this approach has approximately twice the number of parameters of the corresponding baseline model. In practice, the hidden dimension can be treated as a hyperparameter as well. Sensitivity analysis on the hidden dimension is provided later in this section, using the Hit@10 on validation data as the metric for selecting hyperparameters.
To check that the superior performance of this approach is not a consequence of higher model complexity, the hidden factor dimension of the baseline models is doubled to 64 when suitable.
Among the baseline models, the Pop, CF, GMF and Neural CF are all conventional approaches in recommender systems that have relatively simple structures, so the default settings are adopted without describing their details. More description is provided for the attention-based sequential recommendation model Attn and the propensity-score method PS. For Attn, the model setting adopted has the self-attention mechanism added on top of an item embedding layer. The hidden dimension of the key, query and value matrices, and the number of dot-product attention heads are treated as additional tuning parameters. For the PS method, there are two stages:
as a propensity-weighted ERM.
The tuning parameters for gψ and fθ are selected in each stage separately.
The configuration for the approach includes two parts: the usual model configuration for fθ and gψ, and the two-timescale training schema. Firstly, the tuning parameters selected for fθ and gψ when being trained alone also give near-optimal performance in the adversarial counterfactual training setting. Therefore, the hyperparameters (other than the learning rate) selected in the individual training of fθ and gψ are directly adopted. Experiments were run on several settings for the two-timescale update to understand the impact of the relative magnitude of the initial learning rates rθ and rψ. In practice, the learning rate discount is less relevant when using the Adam optimizer, since the learning rate is automatically adjusted. Intuitively speaking, the smaller rψ is (relative to rθ), the less gψ is subject to the regularization in the beginning stage, and the less its adversarial behavior is restricted. As a consequence, fθ may not learn anything useful. Empirical evidence to support the above point is provided in
The models, including the matrix factorization models, are implemented with PyTorch on an Nvidia V100 GPU machine. The sparse Adam optimizer, described at https://agi.io/2019/02/28/optimization-using-adam-on-sparse-tensors/, is used to update the hidden factors, and the usual Adam optimizer is used to update the remaining parameters. Sparse Adam is used for the hidden factors because both the user and item factors are relatively sparse in recommendation datasets. The Adam algorithm leverages the momentum of the gradients from the previous training batch, which may not be accurate for the item and user factors in the current training batch. The sparse Adam optimizer is designed to solve the above issue for sparse tensors.
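A minimal sketch of the optimizer split described above (the module shapes are illustrative):

    import torch
    import torch.nn as nn

    user_emb = nn.Embedding(1000, 32, sparse=True)  # sparse user hidden factors
    item_emb = nn.Embedding(5000, 32, sparse=True)  # sparse item hidden factors
    head = nn.Linear(32, 1)                         # dense remaining parameters

    opt_sparse = torch.optim.SparseAdam(
        list(user_emb.parameters()) + list(item_emb.parameters()), lr=1e-3)
    opt_dense = torch.optim.Adam(head.parameters(), lr=1e-3)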
Early stopping is used for the baseline models, such that the training process is terminated when the validation metric stops improving for 10 consecutive epochs. For this approach, the minimax objective value is monitored instead, and the training process is terminated if it changes by less than ϵ=0.001 over ten consecutive epochs.
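A non-limiting sketch of the minimax-objective stopping rule, with the baseline (best-metric) variant noted in the docstring:

```python
def should_stop(history, patience=10, eps=1e-3):
    """Stopping rule for the minimax objective: stop when the monitored value
    has not changed by more than eps over the last `patience` epochs.

    For the baseline criterion, track the best validation metric instead and
    stop after `patience` consecutive epochs without improvement.
    """
    if len(history) <= patience:
        return False
    recent = history[-(patience + 1):]
    return max(recent) - min(recent) <= eps
```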
It is straightforward to see that, in a single update step, the space and time complexity of this adversarial counterfactual training is exactly the sum of those of fθ and gψ (the complexity induced by Gβ being almost negligible). In general, this approach may take more training epochs to converge, depending on the ratio rθ/rψ in the two-timescale training schema.
To demonstrate the underlying adversarial training process of the adversarial counterfactual training method described herein, the training progress is plotted under several settings in
From
The adversarial training on the real-world dataset using the sequential recommendation model ACL (Attn/Attn) in
Further, a set of experiments was conducted in which the outcome is not included in modeling the exposure mechanism Gβ, as shown in plots 620 and 640. First of all, it is observed that the same adversarial training patterns still hold whether or not the outcome is included in modeling Gβ. Secondly, the performance, in terms of both the loss value and the evaluation metric, is less ideal when Y is not included in Gβ.
A complete ablation study was performed. Firstly, the standard evaluations on the real-world data using the propensity score model are shown in table 1210 of
Sensitivity analysis is provided for the adversarial counterfactual approach, focusing mostly on the user/item hidden factor dimension size and the regularization parameter α. The results on the real-world datasets are shown in
The sensitivity analysis on the regularization parameter is provided in
Additionally, the online experiments provide valuable evaluation results that reveal the appeal of the ACL approach for real-world applications. All the online experiments were conducted for a content-based item page recommendation module, under the implicit feedback setting where users either click or do not click the recommendations. A list of ten items is shown to the customer on each item page, e.g., items that are similar or complementary to the anchor item on that page. The recommendations are personalized, so the user identification (ID) and user features are included in the model as well.
In each iteration of model deployment, new item features and user features are added to the previous model. The architecture of the recommendation model generally remains unchanged across the iterations, which makes it favorable for examining the ACL approach. Four online experiments (A/B tests) were conducted for a total of eight models that were trained offline using the adversarial counterfactual training described herein and then evaluated using the historical implicit feedback data. Unobserved factors, such as real-time user features, page layout, and same-page advertisements, are continually changing and are thus not included in the analysis. The metric used to compare the different offline evaluation methods with the online evaluation is the click-through rate.
For the synthetic data analysis, the explicit feedback data from the MovieLens-1M and Goodreads datasets were used. A baseline CF model was trained, and the optimized hidden factors were used to generate a synthetic exposure mechanism, which was treated as the oracle exposure. The implicit feedback data were then generated according to the oracle exposure as well as the optimized hidden factors. Unbiased offline evaluation was possible because of access to the exposure mechanism. Also, to set a reasonable benchmark under the simulation setting, additional experiments were provided in which gφ is given by the oracle exposure model. The results are provided in
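A non-limiting sketch of this semi-synthetic generation, where the sigmoid link and Bernoulli sampling are illustrative assumptions rather than the exact simulation protocol:

```python
import torch

def generate_implicit_feedback(user_factors, item_factors,
                               expo_user, expo_item):
    """Semi-synthetic implicit feedback from optimized CF hidden factors.

    user_factors/item_factors drive relevance; expo_user/expo_item define the
    oracle exposure. The sigmoid link and Bernoulli sampling are illustrative
    assumptions, not the exact simulation protocol.
    """
    relevance = torch.sigmoid(user_factors @ item_factors.T)
    exposure_prob = torch.sigmoid(expo_user @ expo_item.T)  # oracle exposure
    exposed = torch.bernoulli(exposure_prob)
    clicked = torch.bernoulli(relevance)
    # A click is observed only when the item was actually exposed.
    return exposed * clicked
```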
For the real data analysis, in addition to the MovieLens-1M and Goodreads data in the implicit feedback setting, the LastFM music recommendation (implicit feedback) dataset is further included. The results in table 1000 of
For online experiment analysis, to examine the practical benefits of the robust learning and evaluation approach described herein in real-world experiments, several online A/B testing scenarios were carried out on Walmart.com, a major e-commerce platform in the U.S., in a content-based item recommendation setting, with access to the actual online testing and evaluation results. All the candidate models were trained offline using the approach described herein. The standard offline evaluation, popularity-debiased offline evaluation (where the item popularity is used as the propensity score), the propensity-score model approach, and the robust evaluation described herein were compared with respect to the actual online evaluations. Table 1100 of
In many embodiments, the techniques described herein can remedy the drawbacks of supervised learning for recommender systems by using a theoretically grounded adversarial counterfactual learning and evaluation framework. The theoretical and empirical results illustrate the benefits of the techniques described herein.
Turning ahead in the drawings,
Generally, therefore, system 1300 can be implemented with hardware and/or software, as described herein. In some embodiments, part or all of the hardware and/or software can be conventional, while in these or other embodiments, part or all of the hardware and/or software can be customized (e.g., optimized) for implementing part or all of the functionality of system 1300 described herein. System 1300 can be a computer system, such as computer system 100 (
In some embodiments, system 1300 can include offline components 1310, such as a data component 1311, a training component 1312, and/or an evaluation component 1313, which can be in data communication with a database 1316 that includes logging data, and can use an algorithm 1315 and/or candidates 1317, such as fθ and gφ. System 1300 also can include an online ranking and serving component 1314, which can receive requests from a front-end component 1320 for recommendations, and can return recommendations to front-end component 1320. Data component 1311 and/or training component 1312 can be similar or identical to training system 312 (
In many embodiments, data component 1311 can receive raw logging data, such as historical user session data, from database 1316 to prepare training and evaluation data, such as personalized recommendation data and/or item recommendation data. In some embodiments, personalized recommendation data can include, for each data record, a user feature, a view sequence, an item feature, a target purchase, a label, and/or other suitable information. In various embodiments, item recommendation data can include, for each data record, an anchor item, a candidate item, an item feature, a label, and/or other suitable information. In many embodiments, the data can be used by training component 1312 and/or evaluation component 1313. In many embodiments, the label can be positive or negative, indicating whether or not the customer clicked on the item. The personalized recommendation data can be personalized to each user, while the item recommendation data can be generalized and not personalized to each user. The training data can specify the type of recommendation to be provided, such as recommendations for similar items or complementary items.
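As a non-limiting illustration, the two record types can be represented as follows, with field names and types chosen for illustration only:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PersonalizedRecord:
    """One personalized-recommendation data record (field types illustrative)."""
    user_feature: List[float]
    view_sequence: List[int]   # IDs of items the user viewed
    item_feature: List[float]
    target_purchase: int       # ID of the target purchased item
    label: int                 # 1 if the customer clicked, else 0

@dataclass
class ItemRecord:
    """One item-recommendation data record (not personalized)."""
    anchor_item: int
    candidate_item: int
    item_feature: List[float]
    label: int
```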
In a number of embodiments, training component 1312 can receive candidates 1317, which can include any candidate recommendation algorithms fθ and an adversarial exposure model gφ. In many embodiments, multiple candidate recommendation algorithms fθ can be received, such as from multiple different families of machine-learning models (e.g., linear regression, neural network, matrix factorization, etc.). In several embodiments, training component 1312 can train the candidate recommendation algorithms fθ and the adversarial exposure model gφ using data obtained from data component 1311 and algorithm 1315 to generate optimized candidate recommendation algorithms {circumflex over (f)}θ and the “most adversarial” exposure model ĝφ. In many embodiments, algorithm 1315 can be a gradient ascent descent, as described above in Algorithm 1, used to optimize the minimax objective, such as the objective function described above in Equation 9.
In a number of embodiments, evaluation component 1313 can perform a robust offline evaluation using the most adversarial exposure model ĝφ to evaluate the optimized candidate recommendation algorithms {circumflex over (f)}θ on the evaluation data obtained from data component 1311, and select the best of the optimized candidate recommendation algorithms {circumflex over (f)}θ, which can be denoted the optimal {circumflex over (f)}θ.
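A non-limiting sketch of this robust model selection, where evaluate is a hypothetical helper returning the robust offline metric of one optimized candidate under the most adversarial exposure model:

```python
def select_best_candidate(candidates, g_hat, evaluate, eval_data):
    """Robust offline selection: score every optimized candidate under the
    most adversarial exposure model g_hat and keep the best one.

    `evaluate` is a hypothetical helper returning the robust offline metric
    for a single candidate on the evaluation data.
    """
    scores = {name: evaluate(f_hat, g_hat, eval_data)
              for name, f_hat in candidates.items()}
    best_name = max(scores, key=scores.get)
    return candidates[best_name], scores
```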
In several embodiments, the optimal {circumflex over (f)}θ can be fed to online ranking and serving component 1314. In several embodiments, a recall set can be constructed, based on the logging data in database 1316, to find a subset of candidate item pairs that are more likely to contain the optimal choice, as the full set of candidate item pairs can be too large for applying the model to all candidate item pairs. In several embodiments, the optimal {circumflex over (f)}θ can be used to rank the candidate items in the recall set, and the top-K recommendations can be fed to front-end component 1320, such as upon request from front-end component 1320.
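A non-limiting sketch of the recall-set ranking and top-K selection, assuming the selected model scores (anchor, candidate) item pairs:

```python
import torch

def top_k_recommendations(f_hat, anchor_item, recall_set, k=10):
    """Rank only the recall set for one anchor item and keep the top-K.

    f_hat is the selected model, assumed to score (anchor, candidate) pairs;
    scoring the recall set rather than the full catalog keeps serving
    tractable, as described above.
    """
    candidates = torch.tensor(recall_set)
    anchors = torch.full_like(candidates, anchor_item)
    with torch.no_grad():
        scores = f_hat(anchors, candidates)
    top = torch.topk(scores, k=min(k, len(recall_set)))
    return candidates[top.indices].tolist()
```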
Turning ahead in the drawings,
In many embodiments, system 300 (
In some embodiments, the activities in method 1400 can include using a distributed network including a distributed memory architecture to perform the associated activities. This distributed architecture can reduce the impact on network and system resources, reducing congestion in bottlenecks while still allowing data to be accessible from a central location.
Referring to
In a number of embodiments, method 1400 also can include an activity 1410 of training candidate recommendation models and an adversarial exposure model using the training data. The candidate recommendation models can be similar or identical to candidate recommendation algorithms fθ described above.
The adversarial exposure model can be similar or identical to adversarial exposure model gφ described above. In a number of embodiments, the candidate recommendation models can include a linear regression recommendation model, a neural network recommendation model, and/or a matrix factorization recommendation model, or other suitable recommendation models. In several embodiments, activity 1410 can include performing a gradient ascent descent to optimize a minimax objective for each of the candidate recommendation models and the adversarial exposure model. The gradient ascent descent can be similar or identical to Algorithm 1 described above. The minimax objective can be similar or identical to the objective functions described above, such as Equation 9. In many embodiments, activity 1410 can train the candidate recommendation algorithms fθ to generate optimized candidate recommendation algorithms {circumflex over (f)}θ, and/or can train the adversarial exposure model gφ to generate the “most adversarial” exposure model ĝφ. In a number of embodiments, activity 1410 can be performed at least in part by training system 312 (
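As a non-limiting illustration of activity 1410 across multiple candidate families, where adversarial_train is a hypothetical stand-in for the gradient ascent descent of Algorithm 1 on the minimax objective:

```python
def train_all_candidates(candidates, make_exposure_model, adversarial_train,
                         train_data):
    """Activity 1410 across candidate families: each candidate f_theta is
    trained against its own freshly initialized adversarial exposure model.

    `adversarial_train` is a hypothetical stand-in for the gradient ascent
    descent of Algorithm 1 on the minimax objective.
    """
    trained = {}
    for name, f_theta in candidates.items():
        g_phi = make_exposure_model()
        f_hat, g_hat = adversarial_train(f_theta, g_phi, train_data)
        trained[name] = (f_hat, g_hat)
    return trained
```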
In several embodiments, method 1400 additionally and optionally can include an activity 1415 of performing an evaluation of the candidate recommendation models, as trained, using the adversarial exposure model, as trained. For example, the “most adversarial” exposure model ĝφ can be used to evaluate the optimized candidate recommendation algorithms {circumflex over (f)}θ. In a number of embodiments, activity 1415 can be performed at least in part by evaluation system 313 (
In a number of embodiments, method 1400 further and optionally can include an activity 1420 of selecting the selected recommendation model from among the candidate recommendation models based on the evaluation. For example, the evaluation can be used to determine the optimal {circumflex over (f)}θ, which can be the best performing model of the optimized candidate recommendation algorithms {circumflex over (f)}θ. In a number of embodiments, activity 1420 can be performed at least in part by evaluation system 313 (
In several embodiments, method 1400 additionally can include an activity 1425 of generating recommendations based on a selected recommendation model of the candidate recommendation models. In a number of embodiments, activity 1425 can be performed at least in part by real-time serving system 314 (
In a number of embodiments, activity 1425 can include an activity 1430 of constructing a recall set of candidate recommendation pairs.
In several embodiments, activity 1425 also can include an activity 1435 of generating a ranking of the candidate recommendation pairs in the recall set using the selected recommendation model. For example, the optimal {circumflex over (f)}θ can be used to rank the candidate items in the recall set.
In a number of embodiments, activity 1425 additionally can include an activity 1440 of determining the recommendations from the ranking. For example, the top-K rankings can be used as the recommendations.
In several embodiments, method 1400 additionally and optionally can include an activity 1445 of, when a user requests to view an anchor item, sending one or more of the recommendations associated with the anchor item to be displayed to the user. In many embodiments, the one or more recommendations can be one or more of the recommendations in the top-K rankings determined in activity 1440. In a number of embodiments, activity 1445 can be performed at least in part by communication system 311 (
In many embodiments, the techniques described herein can provide a practical application and several technological improvements. In some embodiments, the techniques described herein can provide for generating recommendations using adversarial counterfactual learning and evaluation. The techniques described herein can provide a significant improvement over conventional approaches that fail to account for the underlying exposure mechanism.
In a number of embodiments, the techniques described herein can solve a technical problem that arises only within the realm of computer networks, as online ordering is a concept that does not exist outside the realm of computer networks. Moreover, the techniques described herein can solve a technical problem that cannot be solved outside the context of computer networks. Specifically, the techniques described herein cannot be used outside the context of computer networks, in view of a lack of data and the inability to train the machine-learning recommendation models without a computer.
Various embodiments can include a system including one or more processors and one or more non-transitory computer-readable media storing computing instructions that, when executed on the one or more processors, perform certain acts. The acts can include obtaining training data. The acts also can include training candidate recommendation models and an adversarial exposure model using the training data. The acts additionally can include generating recommendations based on a selected recommendation model of the candidate recommendation models.
A number of embodiments can include a method being implemented via execution of computing instructions configured to run at one or more processors. The method can include obtaining training data. The method also can include training candidate recommendation models and an adversarial exposure model using the training data. The method additionally can include generating recommendations based on a selected recommendation model of the candidate recommendation models.
Although the methods described above are with reference to the illustrated flowcharts, it will be appreciated that many other ways of performing the acts associated with the methods can be used. For example, the order of some operations may be changed, and some of the operations described may be optional.
In addition, the methods and system described herein can be at least partially embodied in the form of computer-implemented processes and apparatus for practicing those processes. The disclosed methods may also be at least partially embodied in the form of tangible, non-transitory machine-readable storage media encoded with computer program code. For example, the steps of the methods can be embodied in hardware, in executable instructions executed by a processor (e.g., software), or a combination of the two. The media may include, for example, RAMs, ROMs, CD-ROMs, DVD-ROMs, BD-ROMs, hard disk drives, flash memories, or any other non-transitory machine-readable storage medium. When the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing the method. The methods may also be at least partially embodied in the form of a computer into which computer program code is loaded or executed, such that, the computer becomes a special purpose computer for practicing the methods. When implemented on a general-purpose processor, the computer program code segments configure the processor to create specific logic circuits. The methods may alternatively be at least partially embodied in application specific integrated circuits for performing the methods.
The foregoing is provided for purposes of illustrating, explaining, and describing embodiments of these disclosures. Modifications and adaptations to these embodiments will be apparent to those skilled in the art and may be made without departing from the scope or spirit of these disclosures.
Although generating recommendations using adversarial counterfactual learning and evaluation has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes may be made without departing from the spirit or scope of the disclosure. Accordingly, the disclosure of embodiments is intended to be illustrative of the scope of the disclosure and is not intended to be limiting. It is intended that the scope of the disclosure shall be limited only to the extent required by the appended claims. For example, to one of ordinary skill in the art, it will be readily apparent that any element of
Replacement of one or more claimed elements constitutes reconstruction and not repair. Additionally, benefits, other advantages, and solutions to problems have been described with regard to specific embodiments. The benefits, advantages, solutions to problems, and any element or elements that may cause any benefit, advantage, or solution to occur or become more pronounced, however, are not to be construed as critical, required, or essential features or elements of any or all of the claims, unless such benefits, advantages, solutions, or elements are stated in such claim.
Moreover, embodiments and limitations disclosed herein are not dedicated to the public under the doctrine of dedication if the embodiments and/or limitations: (1) are not expressly claimed in the claims; and (2) are or are potentially equivalents of express elements and/or limitations in the claims under the doctrine of equivalents.