In some embodiments, the behavior of a model, or learner, can be optimized, where the learner's task is to maximize the efficiency of the promotions and ads it generates relative to a defined reward function. In the context of promotions, the reward function can be defined, for example, as the revenue net of the costs associated with the promotion. The learner can try to learn the probability distribution underlying the customer's purchasing behavior, and the effects of promotions on this distribution, in order to maximize expected long-term rewards and therefore profit.
The method described herein can allow for ultra-personalization instead of relying on rough segmentation. It can be applied to optimize both discounts and non-discount promotions (e.g., recommendations, ads). A time dimension can be utilized so that the offer and the timing can be optimized. In addition, highly complex behavior can be learned by the reinforcement learning module. In some embodiments, training data and/or knowledge of the customer's historical buying activity is not required because the reinforcement learning model learns as it goes.
Example Framework
Consider a single customer facing a market with n products p∈{1, 2, . . . , n}.
In some embodiments, we can define the following variables:
At the beginning of any given period, the likelihood for the customer to purchase product p can follow a Bernoulli distribution with parameter:
The above Bernoulli parameter can go to 1 as tp increases towards Tp. The exponent can be <1 when the model chooses ϕp>0 (e.g., generate a promotion for product p to a customer with a non-zero sensitivity θ>0). As soon as the exponent is <1, the overall value of the Bernoulli parameter can increase, making a purchase more likely.
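As a non-limiting illustration of these properties, the sketch below uses an assumed parameter of the form (tp/Tp) raised to the power (1−θϕp); the functional form, the function name, and the example values are assumptions chosen only to match the behavior described above, not the claimed formula.

```python
def purchase_probability(t_p, T_p, theta, phi_p):
    """Assumed Bernoulli parameter: approaches 1 as t_p grows towards T_p, and a
    promotion (phi_p > 0) offered to a sensitive customer (theta > 0) pushes the
    exponent below 1, which raises the value for a base in (0, 1)."""
    base = min(t_p, T_p) / T_p                 # fraction of the replenishment cycle elapsed
    exponent = max(1.0 - theta * phi_p, 0.0)   # < 1 whenever theta * phi_p > 0
    return base ** exponent

# A customer halfway through a 10-day cycle, with and without a 10% discount:
print(purchase_probability(5, 10, theta=2.0, phi_p=0.0))   # 0.5
print(purchase_probability(5, 10, theta=2.0, phi_p=0.1))   # ~0.57, so a purchase is more likely
```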
Based on the customer's action, a state s=[t1, t2, . . . , tn] can be updated to s′=[t1′, t2′, . . . , tn′] as follows:
For all products p∈{1, 2, . . . , n}:
This distribution can (1) correspond to a consumption pattern of regularly consumed products (e.g., groceries, personal hygiene), and/or (2) make the impact of a promotion (e.g., the spread between the two curves) vary as tp varies, so we can anticipate a ground-truth “optimal” promotion strategy in some embodiments. An optimal promotion strategy can comprise an optimal time and/or optimal promotion details for a particular person.
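As a non-limiting sketch of one period of this framework, the code below samples purchases from the assumed Bernoulli model and then updates each timer tp; the reset-on-purchase, increment-otherwise rule and all names and values are illustrative assumptions.

```python
import random

def step_state(state, action, T, theta):
    """One illustrative period: sample purchases from the assumed Bernoulli model,
    then update each timer t_p -- reset to 0 on a purchase, otherwise advance by
    one day (capped at T_p)."""
    new_state, purchases = [], []
    for t_p, phi_p, T_p in zip(state, action, T):
        prob = (min(t_p, T_p) / T_p) ** max(1.0 - theta * phi_p, 0.0)
        bought = random.random() < prob
        purchases.append(bought)
        new_state.append(0 if bought else min(t_p + 1, T_p))
    return new_state, purchases

# Two products, one of which is promoted at a 10% discount this period:
state, purchases = step_state([3, 7], action=[0.1, 0.0], T=[5, 10], theta=2.0)
```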
Learning Via Penalties and Rewards
A Q-matrix can map each state s∈S and each action a∈A to a value representing the expected long-term reward of choosing a given action in a given state. Taking action a∈A (e.g., offer discount for a single product but none for others) given a state s∈S can push the system to a new state s′∈S (e.g., the customer does not purchase promoted product) and the learner can collect a reward that is function of a, s and s′. Based on this reward (positive vs. negative) the learner can update its Q matrix.
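For example, with a learning rate α and a discounting factor γ (both described further below), a standard tabular Q-learning update of this kind takes the form Q(s, a) ← (1−α)·Q(s, a) + α·[R(s, s′, a) + γ·max over a′∈A of Q(s′, a′)], so that the entry for the visited state-action pair moves towards the observed reward plus the discounted value of the best action available in the new state.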
The reward function can penalize the learner for making a bad or ineffective promotion, and can reward it in case of success. We can introduce a new parameter, π, representing the profit made per sale. During testing, the reward function can be defined as follows:
where 1 is the indicator function, i.e., 1A equals 1 if condition A holds and 0 otherwise;
in other words, 1tp′=0 indicates that the customer purchased product p during the period.
During training, we can modify the above reward function with additional sticks and carrots in order to further incentivize the learner:
The first two terms above can be similar to those of Rtest(s, s′, a*), but the third term can be >0 (e.g., a bonus) if the products that were promoted were also purchased, whereas the fourth term can penalize the learner (e.g., a malus) if the products that were promoted were not purchased.
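As a non-limiting sketch only (the exact reward formulas are those defined above; the per-product prices, the bonus/malus magnitudes, and the function names below are illustrative assumptions), the test-time and training-time rewards could be computed along these lines:

```python
def reward_test(purchases, action, pi, price):
    """Illustrative test reward: profit pi per purchased product, minus the
    discount cost phi_p * price_p on promoted products that were purchased."""
    revenue = pi * sum(purchases)
    discount_cost = sum(phi * pr for phi, pr, bought in zip(action, price, purchases) if bought)
    return revenue - discount_cost

def reward_train(purchases, action, pi, price, bonus=0.5, malus=0.5):
    """Illustrative training reward: the same two terms, plus a bonus for promoted
    products that were purchased and a malus for promoted products that were not."""
    r = reward_test(purchases, action, pi, price)
    r += bonus * sum(1 for phi, bought in zip(action, purchases) if phi > 0 and bought)
    r -= malus * sum(1 for phi, bought in zip(action, purchases) if phi > 0 and not bought)
    return r
```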
Example Learning Algorithm
We can call the basic time unit corresponding to a single iteration of the algorithm a “day”. If there are n products, each product can be offered one of several promotion levels (including no promotion) on each day. Thus, there are a total of 3^n possible actions (i.e., 3^n possible promotion vectors a=[ϕ1, ϕ2, . . . , ϕn], since in our specific implementation ϕp takes values in {0, 0.1, 0.2}), and a total of N=(T1+1)(T2+1) . . . (Tn+1) possible states.
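As a non-limiting sketch (the three Tp values below are arbitrary examples), action and state spaces of this size can be enumerated as follows:

```python
from itertools import product

discounts = (0.0, 0.1, 0.2)     # possible values of phi_p for each product
T = [5, 10, 7]                  # example T_p values for n = 3 products

actions = list(product(discounts, repeat=len(T)))   # all 3**n promotion vectors
num_states = 1
for T_p in T:
    num_states *= T_p + 1                           # N = (T_1+1)(T_2+1)...(T_n+1)

print(len(actions), num_states)                     # 27 actions, 528 states
```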
The learning algorithm (also set forth in the accompanying figures) can proceed as follows:
Requires:
State space S={s1, s2, . . . , sN}
Action space A={Φ1, Φ2, . . . , Φ3^n}
Reward function R: S×S×A→ℝ
Stochastic transition function T: S×A→S (dictated by the customer behavior described in the framework above)
Learning rate α∈[0,1]
Discounting factor γ∈[0,1]
Exploration factor ϵ∈[0,1]
Procedure QLearning(S, A, R, T, α, γ, ϵ)
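As a non-limiting sketch only, a tabular, ϵ-greedy Q-learning procedure matching this signature could look as follows; the helper names, default hyper-parameter values, and the single-product toy environment at the end are assumptions for illustration.

```python
import random

def q_learning(states, actions, transition, reward, alpha=0.1, gamma=0.9,
               epsilon=0.1, iterations=100_000, start=None):
    """Tabular Q-learning sketch: epsilon-greedy action selection, then a
    standard update of Q towards reward + discounted best next value."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    s = start if start is not None else random.choice(states)
    for _ in range(iterations):
        if random.random() < epsilon:                       # explore
            a = random.choice(actions)
        else:                                               # exploit
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next = transition(s, a)                           # stochastic customer response
        r = reward(s, s_next, a)
        best_next = max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next
    return Q

# Toy single-product demo (T_1 = 3, theta = 2, profit of 1 per sale):
states = list(range(4))                                     # t_1 in {0, 1, 2, 3}
actions = [0.0, 0.1, 0.2]
def transition(t, phi):
    p = (min(t, 3) / 3) ** max(1.0 - 2.0 * phi, 0.0)
    return 0 if random.random() < p else min(t + 1, 3)
def reward(t, t_next, phi):
    return 1.0 - phi if t_next == 0 else 0.0
Q = q_learning(states, actions, transition, reward, iterations=20_000)
```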
Simulation
As an example, here are the results of a simulation of the above model using the following set of parameters. Some parameter values below were picked via grid search to optimize the learning process and the results.
The average reward per period during training can be plotted over 200,000 iterations, as shown in the corresponding figure.
We can then use the trained model to run a test of 50,000 iterations. Following the model's recommendations yields an average reward per period of $1.747. In some embodiments, this can compare to $1.141 for a random promotion policy, and $0.786 for a policy of not running any promotions.
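As a non-limiting sketch of how such a comparison could be run (the dollar figures above come from the simulation described in this section; the function below is an illustrative harness only), the average per-period reward of any policy can be estimated over a simulated test run:

```python
def average_reward(policy, transition, reward, periods=50_000, start=0):
    """Estimate the average per-period reward of a policy, where policy maps a
    state to an action and transition/reward simulate the customer's response."""
    s, total = start, 0.0
    for _ in range(periods):
        a = policy(s)
        s_next = transition(s, a)
        total += reward(s, s_next, a)
        s = s_next
    return total / periods
```

This harness can be called, for example, with a greedy policy derived from a trained Q-matrix, a random promotion policy, and an always-zero (no-promotion) policy, reusing transition and reward functions such as those sketched above.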
Real Life Implementation
In some embodiments, the model is not plug-and-play, and implementing a reinforcement learning agent in a client context can require substantial adaptation to that context. For example, the algorithm implementation can be adapted to a client (e.g., a retailer) that seeks to optimize promotions targeted at the customer. In addition, the algorithm can adapt to customer behavior.
For example, in some embodiments, a potential client could be a retailer (e.g., WHOLE FOODS) with a large number of heterogeneous customers (e.g., with general customer profiles such as mid-30s with 3 kids) and a promotion channel allowing quick iteration and feedback (e.g., email promotions for online shopping). In some embodiments, a customized reward function for the client can be based on data such as the actual cost structure, profit margins per product, etc. The appropriate customer state space for the client's customers and the relevant action space can be defined, while making sure the dimensionality of the learning problem does not blow up (e.g., too many states and/or actions). For example, if STARBUCKS was a potential client, with a customer being a person using the STARBUCKS iPHONE app, lots of data about the customer (helping with customization) could be provided. However, in some embodiments, too much customization may be avoided so that too many states and/or actions are not utilized, in order to reduce the state space and make the algorithm more efficient (e.g., if we know that the customer very likely won't be going back in 10 minutes, some states/actions accounting for short periods of time between purchases do not need to be used).
In some embodiments, a different reinforcement learning algorithm can be selected for the learning algorithm above based on the state/action space and the reward. For example, in the learning algorithm set forth above, a Q-learning algorithm was used. In other embodiments, other algorithms, such as a deep Q-learning algorithm, may be used. In still other embodiments, double Q-learning, delayed Q-learning, greedy GQ learning, or any combination of Q-learning variants may be used.
Q-learning can be a reinforcement learning technique used in machine learning. A goal of Q-Learning can be to learn a policy, which tells an agent what action to take under what circumstances. It may not require a model of the environment and may be able to handle problems with stochastic transitions and rewards, without requiring adaptations. For any finite Markov decision process (FMDP), Q-learning may be able to find a policy that is optimal in the sense that it maximizes the expected value of the total reward over all successive steps, starting from the current state. Q-learning may identify an optimal action-selection policy for any given FMDP, given infinite exploration time and a partly-random policy. “Q” can name the function that returns the reward used to provide the reinforcement and can be said to stand for the “quality” of an action taken in a given state.
Deep Q-learning may use experience replay, which uses a random sample of prior actions instead of the most recent action to proceed. This can remove correlations in the observation sequence and smooth changes in the data distribution. An iterative update can adjust Q towards target values that are only periodically updated, further reducing correlations with the target.
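As a non-limiting sketch of the experience-replay idea described above (the capacity and batch size are illustrative assumptions, and a full deep Q-learning agent would additionally maintain a neural Q-network and a periodically updated target network):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience-replay buffer: store (s, a, r, s') transitions and
    sample random mini-batches to break correlations in the observation sequence."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```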
More information on the various Q-learning algorithms can be found at the https web site en.wikipedia.org/wiki/Q-learning, which is herein incorporated by reference.
In some embodiments, scaling issues can be accounted for. In addition, in some embodiments, hyper-parameters of the reinforcement learning model can be tuned. For example, a model for one client (e.g., STARBUCKS) may not work well for another client (e.g., DUNKIN' DONUTS).
While various embodiments have been described above, it should be understood that they have been presented by way of example and not limitation. It will be apparent to persons skilled in the relevant art(s) that various changes in form and detail can be made therein without departing from the spirit and scope. In fact, after reading the above description, it will be apparent to one skilled in the relevant art(s) how to implement alternative embodiments. For example, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other implementations are within the scope of the following claims.
In addition, it should be understood that any figures which highlight the functionality and advantages are presented for example purposes only. The disclosed methodology and system are each sufficiently flexible and configurable such that they may be utilized in ways other than that shown.
Although the term “at least one” may often be used in the specification, claims and drawings, the terms “a”, “an”, “the”, “said”, etc. also signify “at least one” or “the at least one” in the specification, claims and drawings.
Finally, it is the applicant's intent that only claims that include the express language “means for” or “step for” be interpreted under 35 U.S.C. 112(f). Claims that do not expressly include the phrase “means for” or “step for” are not to be interpreted under 35 U.S.C. 112(f).
This application claims the benefit of U.S. Provisional Application No. 62/744,508 filed Oct. 11, 2018, which is incorporated by reference in its entirety.