The disclosure relates generally to reinforcement learning, and in particular, to hierarchical adaptive contextual bandits for resource-constrained recommendation.
Contextual multi-armed bandit (MAB) achieves cutting-edge performance on a variety of problems. When it comes to real-world scenarios such as recommendation systems, however, it is important to consider the resource consumption of exploration. In practice, there is typically a non-zero cost associated with executing a recommendation (arm) in the environment, and hence, the policy should be learned with a fixed exploration cost constraint. It is challenging to learn a globally optimal policy directly, since doing so is an NP-hard problem and significantly complicates the exploration-exploitation trade-off of bandit algorithms. Existing approaches typically adopt a greedy policy: they estimate the expected rewards and costs from historical observations and greedily select the arm with the highest expected reward/cost ratio until the exploration resource is exhausted. However, existing methods are difficult to extend to an infinite time horizon, since the learning process terminates when there is no more resource. Therefore, it is desirable to improve the reinforcement learning process in the context of MAB.
Further, MAB may find its application in areas such as online ride-hailing platforms, which are rapidly becoming essential components of the modern transit infrastructure. Online ride-hailing platforms connect vehicles or vehicle drivers offering transportation services with users looking for rides. These platforms may need to allocate limited resources to their users, the effect of which may be optimized through MAB.
Various embodiments of the specification include, but are not limited to, cloud-based systems, methods, and non-transitory computer-readable media for resource-constrained recommendation through a ride-hailing platform.
In some embodiments, a computer-implemented method comprises obtaining, by one or more computing devices, a model comprising an environment module, a resource allocation module, and a personal recommendation module. The environment module is configured to: cluster a plurality of users of a platform into a plurality of classes based on user contextual data of each individual user in the plurality of users, determine centric contextual information of each of the classes, output the centric contextual information of each of the classes to the resource allocation module, and output user contextual data of each individual user to the personal recommendation module. The resource allocation module comprises one or more first parameters of each of the classes and is configured to: determine probabilities of the platform making resource allocations to users in the respective classes, based on the one or more first parameters of each of the classes and the centric contextual information of each of the classes, and output the probability to the personal recommendation module. The personal recommendation module comprises one or more second parameters of each of the classes and is configured to: determine, based on user contextual data of an individual user, a corresponding class of the individual user among the classes, and the probabilities, a corresponding probability of the platform making a resource allocation to the individual user, determine, based on the one or more second parameters, different expected rewards corresponding to the platform executing different actions of making different resource allocations to the individual user in the corresponding class, and select an action from the different actions according to the different expected rewards, wherein a probability of executing the selected action is the corresponding probability. The computer-implemented method further comprises receiving, by the one or more computing devices, a real-time online signal of visiting the platform from a computing device of a visiting user; determining, by the one or more computing devices, a resource allocation action by feeding user contextual data of the visiting user to the model as the individual user and obtaining the selected action as the resource allocation action; and based on the determined resource allocation action, transmitting, by the one or more computing devices, a return signal to the computing device to present the resource allocation action.
In some embodiments, for a training of the model, the environment module is configured to receive the selected action and update the one or more first parameters and the one or more second parameters based at least on the selected action by feeding back a reward to the resource allocation module and the personal recommendation module; and the reward is based at least on the selected action and the probability of executing the selected action.
In some embodiments, the platform is a ride-hailing platform; the real-time online signal of visiting the platform corresponds to a bubbling of a transportation order at the ride-hailing platform; the user contextual data of the visiting user comprises a plurality of bubbling features of a transportation plan of the visiting user; and the plurality of bubbling features comprise (i) a bubble signal comprising a timestamp, an origin location of the transportation plan of the visiting user, a destination location of the transportation plan, a route departing from the origin location and arriving at the destination location, a vehicle travel duration along the route, and a price quote corresponding to the transportation plan, (ii) a supply and demand signal comprising a number of passenger-seeking vehicles around the origin location, and a number of vehicle-seeking transportation orders departing from the origin location, and (iii) a transportation order history signal of the visiting user.
In some embodiments, the origin location of the transportation plan of the visiting user comprises a geographical positioning signal of the computing device of the visiting user; and the geographical positioning signal comprises a Global Positioning System (GPS) signal.
In some embodiments, the transportation order history signal of the visiting user comprises one or more of the following: a frequency of transportation order bubbling by the visiting user; a frequency of transportation order completion by the visiting user; a history of discount offers provided to the visiting user in response to the transportation order bubbling; and a history of responses of the visiting user to the discount offers.
In some embodiments, the determined resource allocation action corresponds to the selected action and comprises offering a price discount for the transportation plan; and the return signal comprises a display signal of the route, the price quote, and the price discount for the transportation plan.
In some embodiments, the method further comprises: receiving, by the one or more computing devices, from the computing device of the visiting user, an acceptance signal comprising an acceptance of the transportation plan of the visiting user, the price quote, and the price discount; and transmitting, by the one or more computing devices, the transportation plan to a computing device of a vehicle driver for fulfilling the transportation order.
In some embodiments, the model is based on contextual multi-armed bandits; and the resource allocation module and the personal recommendation module correspond to hierarchical adaptive contextual bandits.
In some embodiments, the action comprises making no resource distribution or making one of a plurality of different amounts of resource distribution; and each of the actions corresponds to a respective cost to the platform.
In some embodiments, the model is configured to dynamically allocate resources to individual users; and the personal recommendation module is configured to select the action from the different actions by maximizing a total reward to the platform, subject to a limit of a total cost over a time period, the total cost corresponding to a total amount of distributed resources.
In some embodiments, the method further comprises training, by the one or more computing devices, the model by feeding historical data to the model, wherein each of the different actions is subject to a total cost over a time period, wherein: the total cost corresponds to a total amount of distributed resource; and the personal recommendation module is configured to determine, based on the one or more second parameters and previous training sessions based on the historical data, the different expected rewards corresponding to the platform executing the different actions of making the different resource allocations to the individual user.
In some embodiments, the resource allocation module is configured to maximize a cumulative sum of pjØjuj; pj represents the probability of the platform making a resource allocation to users in a corresponding class j of the classes; Øj represents a probability distribution of the corresponding class j among the classes; uj represents an expected reward of the corresponding class j; and a cumulative sum of pjØj is no larger than a ratio of a total cost budget of the platform over a time period T.
In some embodiments, the one or more first parameters comprise the pj and uj.
In some embodiments, the resource allocation module is configured to determine the expected reward of the corresponding class j based on centric contextual information of the corresponding class j, historical observations of the corresponding class j, and historical rewards of the corresponding class j.
In some embodiments, the model is configured to maximize a total reward to the platform over a time period T; and the model corresponds to a regret bound of O(√T).
In some embodiments, if the corresponding class and the selected action exist in historical data used to train the model, the environment module is configured to identify a corresponding historical reward from the historical data as the reward; and if the corresponding class or the selected action does not exist in the historical data, the environment module is configured to use an approximation function to approximate the reward.
In some embodiments, the platform is an information presentation platform; the user contextual data of the visiting user comprises a plurality of visitor features of the visiting user; the plurality of visitor features comprise one or more of the following: a timestamp of the real-time online signal of visiting the platform, a geographical location of the visiting user, biographical information of the visiting user, a browsing history of the visiting user, and a history of click response to different categories of online information; the determined resource allocation action comprises one or more categories of information for display at the computing device of the visiting user; and the return signal comprises a display signal of the one or more categories of information.
In some embodiments, one or more non-transitory computer-readable storage media stores instructions executable by one or more processors, wherein execution of the instructions causes the one or more processors to perform operations comprising obtaining a model comprising an environment module, a resource allocation module, and a personal recommendation module. The environment module is configured to: cluster a plurality of users of a platform into a plurality of classes based on user contextual data of each individual user in the plurality of users, determine centric contextual information of each of the classes, output the centric contextual information of each of the classes to the resource allocation module, and output user contextual data of each individual user to the personal recommendation module. The resource allocation module comprises one or more first parameters of each of the classes and is configured to: determine probabilities of the platform making resource allocations to users in the respective classes, based on the one or more first parameters of each of the classes and the centric contextual information of each of the classes, and output the probability to the personal recommendation module. The personal recommendation module comprises one or more second parameters of each of the classes and is configured to: determine, based on user contextual data of an individual user, a corresponding class of the individual user among the classes, and the probabilities, a corresponding probability of the platform making a resource allocation to the individual user, determine, based on the one or more second parameters, different expected rewards corresponding to the platform executing different actions of making different resource allocations to the individual user in the corresponding class, and select an action from the different actions according to the different expected rewards, wherein a probability of executing the selected action is the corresponding probability. The operations further comprise receiving a real-time online signal of visiting the platform from a computing device of a visiting user; determining a resource allocation action by feeding user contextual data of the visiting user to the model as the individual user and obtaining the selected action as the resource allocation action; and based on the determined resource allocation action, transmitting a return signal to the computing device to present the resource allocation action.
In some embodiments, a system comprises one or more processors and one or more non-transitory computer-readable memories coupled to the one or more processors and configured with instructions executable by the one or more processors to cause the system to perform operations comprising: obtaining a model comprising an environment module, a resource allocation module, and a personal recommendation module. The environment module is configured to: cluster a plurality of users of a platform into a plurality of classes based on user contextual data of each individual user in the plurality of users, determine centric contextual information of each of the classes, output the centric contextual information of each of the classes to the resource allocation module, and output user contextual data of each individual user to the personal recommendation module. The resource allocation module comprises one or more first parameters of each of the classes and is configured to: determine probabilities of the platform making resource allocations to users in the respective classes, based on the one or more first parameters of each of the classes and the centric contextual information of each of the classes, and output the probability to the personal recommendation module. The personal recommendation module comprises one or more second parameters of each of the classes and is configured to: determine, based on user contextual data of an individual user, a corresponding class of the individual user among the classes, and the probabilities, a corresponding probability of the platform making a resource allocation to the individual user, determine, based on the one or more second parameters, different expected rewards corresponding to the platform executing different actions of making different resource allocations to the individual user in the corresponding class, and select an action from the different actions according to the different expected rewards, wherein a probability of executing the selected action is the corresponding probability. The operations further comprise receiving a real-time online signal of visiting the platform from a computing device of a visiting user; determining a resource allocation action by feeding user contextual data of the visiting user to the model as the individual user and obtaining the selected action as the resource allocation action; and based on the determined resource allocation action, transmitting a return signal to the computing device to present the resource allocation action.
In some embodiments, a computer system includes an obtaining module configured to obtain a model comprising an environment module, a resource allocation module, and a personal recommendation module. The environment module is configured to: cluster a plurality of users of a platform into a plurality of classes based on user contextual data of each individual user in the plurality of users, determine centric contextual information of each of the classes, output the centric contextual information of each of the classes to the resource allocation module, and output user contextual data of each individual user to the personal recommendation module. The resource allocation module comprises one or more first parameters of each of the classes and is configured to: determine probabilities of the platform making resource allocations to users in the respective classes, based on the one or more first parameters of each of the classes and the centric contextual information of each of the classes, and output the probability to the personal recommendation module. The personal recommendation module comprises one or more second parameters of each of the classes and is configured to: determine, based on user contextual data of an individual user, a corresponding class of the individual user among the classes, and the probabilities, a corresponding probability of the platform making a resource allocation to the individual user, determine, based on the one or more second parameters, different expected rewards corresponding to the platform executing different actions of making different resource allocations to the individual user in the corresponding class, and select an action from the different actions according to the different expected rewards, wherein a probability of executing the selected action is the corresponding probability. The computer system further includes a receiving module configured to receive a real-time online signal of visiting the platform from a computing device of a visiting user; a determining module configured to determine a resource allocation action by feeding user contextual data of the visiting user to the model as the individual user and obtaining the selected action as the resource allocation action; and a transmitting module configured to, based on the determined resource allocation action, transmit a return signal to the computing device to present the resource allocation action.
These and other features of the systems, methods, and non-transitory computer-readable media disclosed herein, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description and the appended claims with reference to the accompanying drawings, all of which form a part of this specification, wherein like reference numerals designate corresponding parts in the various figures. It is to be expressly understood, however, that the drawings are for purposes of illustration and description only and are not intended as a definition of the limits of the specification. It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and are not restrictive of the specification, as claimed.
Non-limiting embodiments of the specification may be more readily understood by referring to the accompanying drawings in which:
Non-limiting embodiments of the present specification will now be described with reference to the drawings. Particular features and aspects of any embodiment disclosed herein may be used and/or combined with particular features and aspects of any other embodiment disclosed herein. Such embodiments are by way of example and are merely illustrative of a small number of embodiments within the scope of the present specification. Various changes and modifications obvious to one skilled in the art to which the present specification pertains are deemed to be within the spirit, scope, and contemplation of the present specification as further defined in the appended claims.
In some embodiments, the multi-armed bandit (MAB) may be a sequential decision problem, in which an agent receives a random reward by playing one of K arms at each round and tries to maximize its cumulative reward. Various real-world applications can be modeled as MAB problems, such as incentive distribution, news recommendation, etc. Models that make full use of the observed d-dimensional features associated with the bandit learning may be referred to as contextual multi-armed bandits.
In some embodiments, the MAB may be applied in user recommendations under resource constraints. For example, when recommending items-for-purchase to Internet users through user devices, MAB-based methods not only focus on improving the number of orders and clicks but also balance the exploration-exploitation trade-off within a limit of exploration resource, so that CTR (Click Through Rate, which may be computed as clicks/impressions) and purchase rate are sought to be improved. Since the impressions of users are almost fixed within a certain scope (e.g., budget), the application can be formulated as a model of increasing the number of clicks under a budget scope. Thus, it is necessary to conduct policy learning under constrained resources, which indicates that cumulative displays of all items (arms) cannot exceed a fixed budget within a given time horizon. Each action may be treated as one recommendation, and the total number of impressions may be treated as the budget. To enhance CTR, every recommendation may be treated equally and formulated as a unit cost for each arm. Recommendations may be decided by dynamic pricing.
In some embodiments, the policy may be learned to maximize an expected reward such as CTR or benefit to the platform under exploration constraints. The task may be formulated as a constrained bandit problem. In such settings, a model recommends an item (arm) for an incoming context in each round, and observes a reward. Meanwhile, the execution of the action incurs a cost (e.g., a unit cost). This indicates that the exploration performed during policy learning consumes resources.
In some embodiments, a hierarchical adaptive learning structure is provided to, within a time period, dynamically allocate a limited resource among different user contexts, as well as to conduct the policy learning by making full use of the user contextual features. In one embodiment, the scale of resource allocation is considered both at the global level and for the remaining time horizon of the time period. The hierarchical learning structure may include two levels: at the higher level is a resource allocation level where the disclosed method dynamically allocates the resource according to the estimation of the user context value, and at the lower level is a personalized recommendation level where the disclosed method makes full use of contextual information to conduct the policy learning alternatively.
The technical effects of the disclosed systems and methods include at least the following. In some embodiments, adaptive resource allocation is provided to balance the efficiency of policy learning and exploration resource consumption under the remaining time horizon. Dynamic resource allocation is applied in the contextual multi-armed bandit problems. Thus, computing efficiencies of computer systems are enhanced, while conserving computing resources. In some embodiments, in order to utilize the contextual information for users, a hierarchical adaptive contextual bandits method (HATCH) is used to conduct the policy learning of contextual bandits with a budget constraint. HATCH may include simulating the reward distribution of user contexts to allocate the resources dynamically and employ user contextual features for personalized recommendation. HATCH may adopt an adaptive method to allocate the exploration resource based on the remaining resource/time and the estimation of reward distribution among different user contexts. In some embodiments, various types of contextual feature information may be used to find the optimal personalized recommendation. Thus, the accuracy of the model is improved. In some embodiments, HATCH has a regret bound as low as O(√T). The regret bound represents the convergence rate of the algorithm to the optimal solution, which measures the performance of a model relative to the performance of others. The experimental results demonstrate the effectiveness and efficiency of the disclosed method on both synthetic data sets and real-world applications.
The disclosed systems and methods may be applied in resource or incentive distribution to online-platform users. In some embodiments, a user may log into a mobile phone APP or a website of an online ride-hailing platform and submit a request for transportation service—which can be referred to as bubbling. For example, a user may enter the starting and ending locations of a transportation trip and view the estimated price through bubbling. Bubbling takes place before the submission of an order of the transportation service. For example, after receiving the estimated price (with or without a discount), the user may accept the order or reject the order. If the order is accepted, the online ride-hailing platform may match a vehicle with the submitted order. Further, the disclosed systems and methods may be applied to other platforms such as news platforms, e-commerce platforms, etc.
Before the user gets to accept or reject the order, the computing system of the online ride-hailing platform may offer incentives such as discounts to encourage acceptance. For example, the computing system of the online ride-hailing platform may return a quoted price and a discount offer to display at the user's device for the user to accept the order. With a limited amount of resources such as the incentives, it is desirable for the platform to strategize the distribution of the incentive to maximize the return to the platform. This improves computer functionality. For example, the computing efficiency of the platform computing system is improved because HATCH simulation estimates the overall long-term return to the platform based on individual user resource allocation decisions, such that the platform may simply call a trained model in real-time to generate resource allocation decisions. Further, the effectiveness and accuracy of the resource allocation decisions are improved.
The system 100 may include one or more data stores (e.g., a data store 108) and one or more computing devices (e.g., a computing device 109) that are accessible to the system 102. In some embodiments, the system 102 may be configured to obtain data (e.g., historical ride-hailing data such as location, time, and fees for multiple historical vehicle transportation trips) from the data store 108 (e.g., a database or dataset of historical transportation trips) and/or the computing device 109 (e.g., a computer, a server, or a mobile phone used by a driver or passenger that captures transportation trip information such as time, location, and fees). The system 102 may use the obtained data to train a model for resource-constrained recommendation. The location may be transmitted in the form of GPS (Global Positioning System) coordinates or other types of positioning signals. For example, a computing device with GPS capability and installed on or otherwise disposed in a vehicle may transmit such location signal to another computing device (e.g., a computing device of the system 102).
The system 100 may further include one or more computing devices (e.g., computing devices 110 and 111) coupled to the system 102. The computing devices 110 and 111 may include devices such as cellphones, tablets, in-vehicle computers, wearable devices (smartwatches), etc. The computing devices 110 and 111 may transmit or receive signals (e.g., data signals) to or from the system 102.
In some embodiments, the system 102 may implement an online information or service platform. The service may be associated with vehicles (e.g., cars, bikes, boats, airplanes, etc.), and the platform may be referred to as a vehicle platform (alternatively as service hailing, ride-hailing, or ride order dispatching platform). The platform may accept requests for transportation service, identify vehicles to fulfill the requests, arrange for passenger pick-ups, and process transactions. For example, a user may use the computing device 110 (e.g., a mobile phone installed with a software application associated with the platform) to request a transportation trip arranged by the platform. The system 102 may receive the request and relay it to one or more computing device 111 (e.g., by posting the request to a software application installed on mobile phones carried by vehicle drivers or installed on in-vehicle computers). Each vehicle driver may use the computing device 111 to accept the posted transportation request and obtain pick-up location information. Fees (e.g., transportation fees) may be transacted among the system 102 and the computing devices 110 and 111 to collect trip payment and disburse driver income. Some platform data may be stored in the memory 106 or retrievable from the data store 108 and/or the computing devices 109, 110, and 111. For example, for each trip, the location of the origin and destination (e.g., transmitted by the computing device 110), the fee, and the time may be collected by the system 102.
In some embodiments, the system 102 and the one or more of the computing devices (e.g., the computing device 109) may be integrated in a single device or system. Alternatively, the system 102 and the one or more computing devices may operate as separate devices. The data store(s) may be anywhere accessible to the system 102, for example, in the memory 106, in the computing device 109, in another device (e.g., network storage device) coupled to the system 102, or another storage location (e.g., cloud-based storage system, network file system, etc.), etc. Although the system 102 and the computing device 109 are shown as single components in this figure, it is appreciated that the system 102 and the computing device 109 can be implemented as single devices or multiple devices coupled together. The system 102 may be implemented as a single system or multiple systems coupled to each other. In general, the system 102, the computing device 109, the data store 108, and the computing device 110 and 111 may be able to communicate with one another through one or more wired or wireless networks (e.g., the Internet) through which data can be communicated.
In some embodiments, the computing device 110 may transmit a signal (e.g., query signal 124) to the system 102. The query signal 124 may be a real-time online signal of visiting the platform from a visiting user (e.g., a passenger). The computing device 110 may be associated with a passenger seeking transportation service. The query signal 124 may correspond to a bubble signal comprising information such as a current location of the vehicle, a current time, an origin of a planned transportation, a destination of the planned transportation, etc. In the meanwhile, the system 102 may have been collecting data (e.g., data signal 126) from each of a plurality of computing devices such as the computing device 111. The computing device 111 may be associated with a driver of a vehicle described herein (e.g., taxi, a service-hailing vehicle). The data signal 126 may correspond to a supply signal of a vehicle available for providing transportation service.
In some embodiments, the system 102 may obtain a plurality of bubbling features of a transportation plan of a user. For example, bubbling features of a user bubble may include (i) a bubble signal comprising a timestamp, an origin location of the transportation plan of the user, a destination location of the transportation plan, a route departing from the origin location and arriving at the destination location, a vehicle travel duration along the route, and/or a price quote corresponding to the transportation plan, (ii) a supply and demand signal comprising a number of passenger-seeking vehicles around the origin location, and a number of vehicle-seeking transportation orders departing from the origin location, and (iii) a transportation order history signal of the user. The bubble signal may be collected from the query signal 124 and/or other sources such as the data stores 108 and the computing device 109 (e.g., the timestamp may be obtained from the computing device 109) and/or generated by itself (e.g., the route may be generated at the system 102). The supply and demand signal may be collected from the query signal of a computing device of each of multiple users and the data signal of a computing device of each of multiple vehicles. The transportation order history signal may be collected from the computing device 110 and/or the data store 108. In one embodiment, the vehicle may be an autonomous vehicle, and the data signal 126 may be collected from the computing device 111 implemented as an in-vehicle computer.
In some embodiments, when making the assignment, the system 102 may send a plan (e.g., plan signal 128) to the computing device 110 or one or more other devices. The plan signal 128 may include a price quote, a discount signal, the route departing from the origin location and arriving at the destination location, an estimated time of arrival at the destination location, etc. The plan signal 128 may be presented on the computing device 110 for the user to accept or reject.
In some embodiments, the computing device 111 may transmit a query (e.g., query signal 142) to the system 102. The query signal 142 may be a real-time online signal of visiting the platform from a visiting user (e.g., a driver). The query signal 142 may include a GPS signal of a vehicle driven by the driver, a message indicating that the driver is available for providing transportation service, a timestamp or time period corresponding to the transportation service, etc. The system 102 may send a plan (e.g., plan signal 144) to the computing device 111 or one or more other devices. The plan signal 144 may include an incentive (e.g., receiving a bonus after completing 10 orders by today). The plan signal 144 may be presented on the computing device 111 for the driver to accept or reject.
In some embodiments, at step 221, the environment module 211 may cluster a plurality of users of a platform (e.g., a ride-hailing platform, a news platform, an e-commerce platform) into a plurality of classes j with a probability distribution Øj, based on user contextual data of each individual user in the plurality of users. Further details of step 221 are described below.
In some embodiments, at step 231, step 241, and step 251, the environment module 211 may determine centric contextual information, denoted as {tilde over (x)}t, of each of the classes j, and output (i) the centric contextual information (e.g., common bubbling feature of the user class, common topics of news articles clicked by the user class) of each of the classes, denoted as {tilde over (x)}t, to the resource allocation module 212, and (ii) user contextual data (e.g., bubbling history, historically clicked news articles) of each individual user, denoted as xt, to the personal recommendation module 213. Further details of step 231, step 241, and step 251 are described below.
In some embodiments, at step 222 and step 232, the resource allocation module 212 may obtain one or more first policy parameters (e.g., a discount policy), denoted as {tilde over (θ)}t, of each of the classes j, and determine a probability, denoted as {tilde over (p)}t, of the platform making a resource allocation to users in each of the classes j, based on the one or more first policy parameters {tilde over (θ)}t of each of the classes with the probability distribution Øj, and the centric contextual information of each of the classes {tilde over (x)}t. Further details of step 222 and step 232 are described below.
In some embodiments, at step 242, the resource allocation module 212 may output the probability {tilde over (p)}t of the platform making a resource allocation to users in each of the classes to the personal recommendation module 213. Further details of step 242 are described below.
In some embodiments, at step 223, the personal recommendation module 213 may obtain one or more second policy parameters (e.g., discount policy), denoted as θt,i, of each individual user within each of the classes. Further details of step 223 are described below.
In some embodiments, at step 243, the personal recommendation module 213 may determine, based on the one or more second policy parameters θt,i, different expected rewards (e.g., sending a ride request, clicking on a recommended article), denoted as uj, corresponding to the platform executing different actions of making different resource allocations (e.g., offering a discount, recommending a news article) to the individual user. Further details of step 243 are described below.
In some embodiments, at step 263 and step 273, the personal recommendation module 213 may select an action (e.g., the action of making an offer and/or a recommendation), denoted as at, from the different actions according to the different expected rewards, and output the selected action. Further details of step 263 and step 273 are described below.
In some embodiments, at step 261, for a training, the environment module 211 may obtain the selected action, and update the one or more first policy parameters {tilde over (θ)}t and the one or more second policy parameters θt,i based at least on the selected action by feeding back a reward (e.g., profit from a ride, total clicks of a news article), denoted by rt, to the resource allocation module 212 and the personal recommendation module 213. Further details of step 261 are described below.
In some embodiments, at step 221, the environment module 211 may cluster a plurality of users of a platform into a plurality of classes j with the probability distribution Øj. For example, at step 221, the environment module 211 may cluster the plurality of users of the platform into three classes with the probability distributions Ø1, Ø2, and Ø3.
In some embodiments, at step 231, the environment module 211 may determine centric contextual information {tilde over (x)}t of each of the classes j. For example, for a first class (j=1), at step 231, the environment module 211 may determine its centric contextual information {tilde over (x)}1, such as users within the first class sharing similar bubbling features and/or providing similar responses to certain recommendations. Similarly, at step 231, the environment module 211 may determine centric contextual information {tilde over (x)}2 of a second class (j=2) and centric contextual information {tilde over (x)}3 of a third class (j=3).
In some embodiments, at step 241, the environment module 211 may output the centric contextual information {tilde over (x)}t of class j into the resource allocation module 212. For example, for the first, second, and third classes, at step 241 the environment module 211 may output contextual information {tilde over (x)}1, {tilde over (x)}2 and {tilde over (x)}3 of each of the respective classes into the resource allocation module 212.
In some embodiments, at step 251, the environment module 211 may output user contextual data xt (e.g., personal bubbling feature, preferred topics on news article) of each individual user. For example, for the plurality of users, at step 251, the environment module 211 may output user contextual data x1 of a first user to the personal recommendation module 213. The user contextual data xt may include information related to a user's interaction with the platform. For example, the user contextual data xt may include a plurality of bubbling features of a user.
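As a non-limiting illustration of steps 221 and 231, the clustering performed by the environment module 211 may be sketched in Python as follows, assuming the user contextual data are numeric feature vectors and using k-means purely as an example clustering method; the function name cluster_users and the variable names are illustrative assumptions rather than elements of any claimed implementation.

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_users(user_contexts, n_classes):
        # Cluster user contextual data into classes j, returning the centric
        # contextual information (cluster centers) of each class and the
        # empirical class probability distribution Øj.
        X = np.asarray(user_contexts)                     # shape (n_users, d)
        km = KMeans(n_clusters=n_classes, n_init=10).fit(X)
        centers = km.cluster_centers_                     # centric contextual information per class
        counts = np.bincount(km.labels_, minlength=n_classes)
        phi = counts / counts.sum()                       # probability distribution Øj
        return centers, phi, km.labels_

For example, calling cluster_users with the contextual vectors of 1,000 users and n_classes=3 would yield three centers and the distribution Ø1, Ø2, Ø3 referenced above.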
In some embodiments, at step 222, the resource allocation module 212 may obtain one or more first policy parameters {tilde over (θ)}t (e.g., discount policy) of each of the classes (e.g., user classes determined by the environment module 211) with the probability distribution Øj. The one or more first policy parameters {tilde over (θ)}t may be trained through the disclosed algorithm until the objective function is maximized. For instance, for the first class with the probability distribution Ø1, the resource allocation module 212 may obtain a first learning set of one or more first policy parameters {tilde over (θ)}1 at step 222.
In some embodiments, at step 232, the resource allocation module 212 may determine a probability {tilde over (p)}t of the platform making a resource allocation (e.g., offering a discount, recommending a news article) to users in each of the classes. For instance, for users in the first class, at step 232, the resource allocation module 212 may determine a probability {tilde over (p)}1 that the platform will recommend resources to users in class 1 based on the first set of one or more first policy parameters {tilde over (θ)}1. The resource may include, for example, a discount, news, and the like that the platform seeks to recommend to the respective plurality of classes j. The probability {tilde over (p)}t may be any number between 0% and 100% and be determined by the resource allocation module 212.
In some embodiments, at step 242, the resource allocation module 212 may output the probability {tilde over (p)}t determined in step 232 into the personal recommendation module 213.
In some embodiments, at step 223, the personal recommendation module 213 may obtain one or more second policy parameters (e.g., discount policies) θt,i of each of the classes (t stands for the t-th round of training iteration, and i stands for the i-th user). For instance, for the first round, at step 223, the personal recommendation module 213 may obtain a first learning set of one or more second policy parameters θ1,i.
In some embodiments, at step 233, the personal recommendation module 213 may determine one or more second policy parameters θt,i for a corresponding user within each of the classes. For instance, for a first corresponding user within the first class with the probability distribution Ø1, at step 233, the personal recommendation module 213 may obtain a first learning set of one or more second policy parameters θ1,1. Similarly, for a second corresponding user within the first class with the probability distribution Ø1, at step 233, the personal recommendation module 213 may obtain a first learning set of one or more second policy parameters θ1,2. The one or more second policy parameters θt,i may be trained through the disclosed algorithm until the objective function is maximized.
In some embodiments, at step 243, the personal recommendation module 213 may determine a corresponding probability of the platform making a resource allocation (e.g., offering discounts, recommending news articles) to the individual user.
In some embodiments, at step 253, the personal recommendation module 213 may determine different expected rewards uj corresponding to the platform executing different actions (e.g., the action of making an offer/recommendation) of making resource allocations to the individual user. Each expected reward reflects the total reward (e.g., profit from ordered rides, a number of clicks of news articles) that the platform may obtain from the corresponding class of users based on the different actions that the platform may take. The expected rewards may each depend on whether the user accepts a recommendation of the ride-hailing platform to complete a bubbled order, whether the user clicks on a news article recommended by the news platform, etc.
In some embodiments, at step 263, the personal recommendation module 213 may select an action at (e.g., the action of offering/recommending) from the different actions according to the different expected rewards uj (e.g., clicking a recommended news hyperlink, and bubbling activities on a ride-hailing platform). For example, for users in the first class, at step 263, the personal recommendation module 213 may select an action a1 that maximizes the expected reward. The action may include: recommending information (e.g., discount policy, news article), and proposing a discount to a user of the platform.
In some embodiments, at step 273, the personal recommendation module 213 may output the selected action at (e.g., actually offer the discount/recommend the news article). For example, during training, the selected action may be outputted to the environmental module 211. For another example, in a real application, the platform may execute the action to make a resource distribution decision.
In some embodiments, at step 261, for each training cycle, the environment module 211 may update the one or more first and second policy parameters by feeding back a total reward rt (e.g., total clicks on a recommended news hyperlink, and gross bubbling activities on a ride-hailing platform) to the resource allocation module 212 and personal recommendation module 213, respectively. For example, for the first class with the probability distribution Ø1, after the first training cycle, at step 261, the environment module 211 may update the first one or more first policy parameters {tilde over (θ)}1 to a second one or more first policy parameters {tilde over (θ)}2 in the resource allocation module 212, and update the first one or more second parameters θ1,i to a second one or more second parameters θ2,i in the personal recommendation module 213 by feeding back a total reward r1 to the resource allocation module 212 and personal recommendation module 213, respectively.
The model 200 may be used in various applications. In some embodiments, the MAB may be applied in a sequential decision problem and/or an online decision making problem. In some embodiments, the bandit algorithm updates the parameters based on feedback from the environment, and a cumulative regret measures the effect of policy learning. The model may be applied in various real-world scenarios, such as online recommendation system (e.g., news recommendation), incentive distribution (e.g., online advertising, discount allocation on a ride-hailing platform), etc.
In some embodiments, the MAB may be applied in recommending resources to users under contextual constraints, and contextual feature information may be utilized to make the choice of the optimal arm (e.g., a recommended action) to play in the current round. For example, when recommending news to Internet users through news websites, MAB-based methods may enhance their performance by making recommendations based on relevant contextual information (e.g., a user's news reviewing history, topic preferences).
In some embodiments, the MAB may observe a d-dimensional feature vector, which includes contextual information, before making a recommendation in round t to maximize the total reward of the recommendation. Thus, in some embodiments, the MAB agent may learn the relationship between the contexts and the cumulative rewards. In some embodiments, the HATCH method is based on the assumption of a linear payoff function between the contexts and the cumulative rewards. In some embodiments, for a K-armed stochastic bandit system, in each round t, the MAB agent may observe an action set A_t independent of the user feature context x_t. In some embodiments, based on observed payoffs in previous trials, the MAB agent may determine the expectation of the total reward for each action a, denoted as E[r_{t,a}|x_t, a], and select the action a_t ∈ A_t with the maximum expectation of the total reward r_{t,a}.
In some embodiments, the MAB may be applied in user recommendation under resource constraints (e.g., the resource is limited), which indicates that cumulative displays of all resources cannot exceed a fixed budget within a given time horizon T. In some embodiments, the resource constraints may relate to real-world scenarios, in which the budget is limited and a cost may be incurred with each chosen action at. For example, on a news platform, a cost may be incurred after a news article is recommended at a display location, because the platform may bear a cost to bring Internet user traffic to the display location, and the recommendation of a news article precludes recommendations of other news articles at the display location. Thus, a non-optimal action (arm) may dramatically reduce the total rewards of the MAB. Thus, to maximize rewards under a budgeted MAB, it may be necessary to conduct policy learning under constrained resources. In some embodiments, the MAB may be required to consider an infinite amount of user contextual data (e.g., a user's historical interactions with the platform, personal preference, etc.) in a limited feature space.
In some embodiments, a hierarchical adaptive framework may balance the efficiency between policy learning and exploration of resources. In some embodiments, the budget constraint may be set in the following manner in the contextual bandits problem: given a total amount of resource B and a total time horizon T, the total T-trial payoff may be defined as Σ_{t=1}^{T} r_{t,a_t}, and the performance of the policy may be measured against the optimal expected payoff E[Σ_{t=1}^{T} r_{t,a*_t}], where a*_t denotes the optimal action in round t.
In some embodiments, an associated cost, denoted as c_{x_t,a_t}, may be incurred when the selected action a_t is executed for the user context x_t in round t.
In some embodiments, as shown above, a hierarchical structure may be constructed to reasonably allocate the limited resources, and to efficiently optimize policy learning. In some embodiments, the HATCH may include an upper level (e.g., the resource allocation module 212) in which the HATCH may allocate resources by considering users' centric contextual information, remaining resources (e.g., time, budget), and the total reward. In some embodiments, the HATCH may include a lower level (e.g., the personal recommendation module 213) in which the HATCH may utilize the user contextual data of each individual user to determine an expected reward and to recommend an action to maximize the expected reward with the constraint of allocated exploration resource.
In some embodiments, the resource allocation process may be divided into two steps to simplify the problems of direct resource allocation and conducting policy learning. First, in some embodiments, the resource is dynamically allocated according to the centric contextual information of each user class. Second, in some embodiments, a historical logging dataset may be employed to evaluate the user contextual data. In some embodiments, an adaptive linear programming is adopted to solve the resource allocation problem, and to estimate the expectation of the reward.
In some embodiments, Linear Programming (LP) may be applied to address the setting in which the exploration resource and the time horizon may grow infinitely while maintaining the proportion ρ=B/T. In some embodiments, when the average resource constraint is fixed as ρ=B/T, the LP function may provide a policy on whether to choose or skip actions recommended by the MAB.
In some embodiments, the remaining resource bt may be constantly changing during the remaining time τ. Thus, the averaged resource constraint may be replaced as ρ=bt/τ, and a Dynamic resource Allocation method (DRA) may be applied to address the dynamic average resource constraint. In some embodiments, the centric contextual information and the user contextual data may be indefinite and may not be represented numerically.
In some embodiments, a finite plurality of users may be clustered into a plurality of classes based on user contextual data of each individual user in the plurality of users. In some embodiments, in round t, when the environmental module 211 executes the selected action at, a cost may occur in the environmental module 211. For example, in some embodiments, when a selected action at is recommended, the recommendation may consume resources. Thus, in some embodiments, if the selected action is not a dummy (e.g., at=0), the cost in the environmental module 211 may be assigned as 1.
In some embodiments, a class, denoted as j, which includes users with similar user contextual data, may expect a reward, denoted as uj for each recommended action. In some embodiments, the expected rewards of a class may be constants, and may be ranked in descending order (e.g., u1>u2> . . . >uJ). In some embodiments, the expected reward for the class j, denoted as ûj, may be estimated by a linear function.
In some embodiments, a MAB agent may find a user class corresponding to some user contextual data. In some embodiments, a historical user dataset may be mapped to finite classes j with the probability distribution Øj(x), which reflects a probability that a user class can be found corresponding to the user contextual data.
In some embodiments, since the user context data of each user is influenced only by user preference rather than a policy parameter, it may be assumed that in rounds t in a total time-horizon T (for t∈T), the probability distribution Øj(x) of a class may not drift from the round t to the round t+1 (e.g., Øj,t(x)˜Øj,t+1(x)). Thus, in some embodiments, in order to maximize the expected reward, the DRA may decide whether the algorithm should recommend the selected action (arm) in the round t by determining a probability pj of the platform making a resource allocation to users in the user class j. In some embodiments, the probability pj may be any number between 0-1 (e.g., pj∈[0,1]). Thus, in some embodiments, the probability vectors for the user classes can be collectively denoted as p=(p1, p2, . . . , pJ). In some embodiments, for the total amount of resource B and time-horizon T, the DRA may be formulated as:
maximize over p: Σ_{j=1}^{J} pjØjuj, subject to Σ_{j=1}^{J} pjØj ≤ ρ = B/T, with pj∈[0,1] for j=1, . . . , J   (1)
In some embodiments, the solution of equation (1) may be denoted as pj(ρ), and the maximum expected reward in a single round within averaged resource may be denoted as ν(ρ).
In some embodiments, the averaged resource constraint may be set as ρ=B/T, where B may represent a total amount of resource and T may represent a total time horizon. Thus, with the classes ranked in descending order of expected reward (u1>u2> . . . >uJ), a threshold of the averaged budget, denoted as {tilde over (J)}(ρ), may be determined as the largest class index j such that Ø1+Ø2+ . . . +Øj≤ρ. Thus, in some embodiments, the optimal solution of DRA may be summarized as follows: the first {tilde over (J)}(ρ) classes receive probability pj(ρ)=1; the class {tilde over (J)}(ρ)+1 receives the fractional probability (ρ−Ø1− . . . −Ø{tilde over (J)}(ρ))/Ø{tilde over (J)}(ρ)+1, which exactly uses up the remaining averaged budget; and the remaining classes receive pj(ρ)=0.
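As a non-limiting illustration of the threshold rule summarized above, the following Python sketch computes the allocation probabilities under the stated assumptions (unit cost per executed action, classes compared by their estimated expected rewards); the function name dra_probabilities and the example numbers are illustrative assumptions, not a definitive implementation.

    import numpy as np

    def dra_probabilities(u, phi, rho):
        # Solve the linear program of equation (1): give probability 1 to the most
        # valuable classes until the averaged budget rho is used up, a fractional
        # probability to the boundary class, and 0 to the remaining classes.
        order = np.argsort(-np.asarray(u, dtype=float))   # classes by descending expected reward
        p = np.zeros(len(u))
        remaining = float(rho)
        for j in order:
            if phi[j] <= remaining:                       # the whole class fits within the budget
                p[j] = 1.0
                remaining -= phi[j]
            else:                                         # boundary class receives a fractional share
                p[j] = remaining / phi[j] if phi[j] > 0 else 0.0
                break
        return p

For instance, with u=[0.9, 0.5, 0.2], phi=[0.2, 0.5, 0.3], and rho=0.45, the sketch returns p=[1.0, 0.5, 0.0], i.e., the first class is always served and the second class is served half of the time.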
In some embodiments, the static ratio of a total amount of resource B and a total time-horizon T may not be guaranteed. Thus, in some embodiments, the static ratio ρ may be replaced with bτ/τ, where bτ may represent the remaining resources, and τ may represent the remaining time at round t.
In some embodiments, the expected reward uj may be hard to obtain in real-world scenarios, so it may be simulated. In some embodiments, the plurality of users of a platform may be clustered into a plurality of classes based on user contextual data of each individual user in the plurality of users. In some embodiments, each clustered class may include centric contextual information, which is represented by a representation center point, denoted as {tilde over (x)}. In some embodiments, for the j-th cluster, centric contextual information {tilde over (x)}t may be observed in round t, and automatically mapped. In some embodiments, the relationship between the centric contextual information {tilde over (x)} and the total reward r may be evaluated using a linear function E[r|{tilde over (x)}]={tilde over (x)}^T{tilde over (θ)}j, wherein {tilde over (θ)}j is the one or more first policy parameters. In some embodiments, the parameters may be normalized as ∥x∥≤1 and ∥{tilde over (θ)}∥≤1.
In some embodiments, all historical centric contextual information of the user class j may be set collectively in a matrix {tilde over (X)}j=[{tilde over (x)}1, {tilde over (x)}2 . . . {tilde over (x)}t], where ∥{tilde over (x)}∥≤1, and every vector in {tilde over (X)}j may be equal to {tilde over (x)}j. In some embodiments, the reward of each user class may be evaluated as a ridge regression, and the one or more first policy parameters of the class j may be formulated as:
{tilde over (θ)}t,j=Ãt,j^{−1}{tilde over (X)}t,jYt,j^T   (2)
where {tilde over (θ)}t,j may be the one or more first policy parameters of the class j, Yt,j may be the historical rewards of the class j (e.g., Yt,j=[r1, r2 . . . rt]), and Ãt,j may be a first transformation matrix determined as Ãt,j=(I+{tilde over (X)}t,j{tilde over (X)}t,j^T).
In some embodiments, the estimated expected reward for the user class j at round t may be ût,j={tilde over (x)}j^T{tilde over (θ)}t,j, where {tilde over (θ)}t,j is the one or more first policy parameters for the user class j at round t. In some embodiments, the estimated expected reward ût,j may be used to solve the DRA and to determine the probability {circumflex over (p)}j that the platform makes a resource allocation to users in each of the classes.
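Equation (2) and the estimate ût,j may be sketched as the ridge-regression estimator below, in which the matrix Ãt,j and the vector {tilde over (X)}t,jYt,j^T are accumulated incrementally; the class name ClassRewardEstimator and its methods are illustrative assumptions rather than a claimed implementation.

    import numpy as np

    class ClassRewardEstimator:
        # Ridge-regression estimate of the first policy parameters of one user class j,
        # maintaining A = I + sum of outer products of the class center (i.e., Ã of
        # equation (2)) and b = the accumulated reward-weighted centers (i.e., X~ Y^T).
        def __init__(self, dim):
            self.A = np.eye(dim)
            self.b = np.zeros(dim)

        def update(self, x_center, reward):
            self.A += np.outer(x_center, x_center)
            self.b += reward * x_center

        def theta(self):
            return np.linalg.solve(self.A, self.b)        # first policy parameters of the class

        def expected_reward(self, x_center):
            return float(x_center @ self.theta())         # estimated expected reward of the class

As a usage example, est = ClassRewardEstimator(dim=8) may be updated after each round with est.update(center_j, r_t); the value est.expected_reward(center_j) then plays the role of ût,j when solving the DRA.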
In some embodiments, the user contextual data xt of each individual user may be utilized to conduct the policy learning and to determine the optimal action. In some embodiments, a linear function may be established to fit the reward r and the user contextual data xt: E[r|xt]=xtTθt,j,a, where θt,j,a is the one or more second policy parameters for a user in the user class j at round t with the action a.
In some embodiments, the user contextual data matrix for an individual user in the class j after an action a may be set as: Xt,j,a=[x1, x2 . . . xt], where x1, x2 . . . xt are the user contextual data for the user from the first round to the t-th round.
In some embodiments, the one or more second policy parameters for a user in the user class j at round t with the action a may be determined as θt,j,a=At,j,a−1Xt,j,aYt,j,a, where Yt,j,a may be the historical rewards of a user in the class j with the action a, and At,j,a may be a transformation matrix determined as At,j,a=(λI+Xt,j,aTXt,j,a).
In some embodiments, the total reward r may be set as r=xtTθ*j,a+ϵ, where θ*j,a may be the expected value of the one or more second policy parameters θ, and ϵ may be a 1-sub-Gaussian independent zero-mean random variable, where E[ϵ]=0.
In some embodiments, an action (arm) which maximizes the expected reward for the individual user may be chosen from the set of recommended actions A through the following formula:
a*t=argmaxa∈A(xtTθt,j,a+α√(xtTAt,j,a−1xt))
where δ may be a hyperparameter, λ>0 may be a regularized parameter, and α may be a constant exploration parameter (e.g., determined by δ).
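By way of illustration only, the per-arm estimation and selection described above may be sketched as follows; the UCB-style exploration bonus and the default α=1 are assumptions in the spirit of LinUCB rather than the exact bound of the disclosure, and the names select_arm and arm_histories are hypothetical.

```python
import numpy as np

def select_arm(x, arm_histories, lam=1.0, alpha=1.0):
    """Sketch of the personal recommendation step: for each arm, estimate the
    second policy parameters theta_{t,j,a} by ridge regression and score the
    arm by x^T theta plus a UCB-style exploration bonus (an assumed form)."""
    best_arm, best_score = None, -np.inf
    d = x.shape[0]
    for a, (X_a, y_a) in arm_histories.items():
        A = lam * np.eye(d) + X_a.T @ X_a                 # second transformation matrix A_{t,j,a}
        A_inv = np.linalg.inv(A)
        theta = A_inv @ X_a.T @ y_a                       # second policy parameters theta_{t,j,a}
        score = x @ theta + alpha * np.sqrt(x @ A_inv @ x)
        if score > best_score:
            best_arm, best_score = a, score
    return best_arm

# Toy usage: two arms with one logged observation each.
arm_histories = {
    0: (np.array([[1.0, 0.0]]), np.array([1.0])),
    1: (np.array([[0.0, 1.0]]), np.array([0.0])),
}
print(select_arm(np.array([0.9, 0.1]), arm_histories))    # -> 0
```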
In some embodiments, whether to output the selected action a*t to the environment module may be determined by the probability pj of the platform making a resource allocation to users in the user class j.
In some embodiments, HATCH may be implemented according to the following algorithm, denoted as Algorithm 1, for which a regret bound may be guaranteed as analyzed below. In Algorithm 1, both {tilde over (α)} and α are constant parameters. [Algorithm 1: pseudocode of the HATCH learning process]
In some embodiments, Algorithm 1 may execute the following actions: (i) line 2 may cluster a plurality of users into a plurality of classes j with a probability distribution Øj based on user contextual data of the plurality of users; (ii) line 7 may select an action a from the different actions according to the different expected rewards; (iii) line 9 may determine a probability {circumflex over (p)}j(b/τ) of the platform making a resource allocation to users in each of the classes; and (iv) line 10 may output the selected action a based on the probability {circumflex over (p)}j(b/τ). In some embodiments, lines 13-22 may update the following parameters: the time τ in round t, the remaining resource b, the user contextual data Xt,j,a for an individual user in the class j after an action a, the historical rewards Yt,j,a of a user in the class j after the action a, the historical centric contextual information {tilde over (X)}t,j of the user class j, the historical rewards Yt,j of the class j, the first transformation matrix Ãt,j, the one or more first policy parameters {tilde over (θ)}t,j, the expected reward ût,j for the user class j at round t, the second transformation matrix At,j,a, and the one or more second policy parameters θt,j,a.
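By way of illustration only, the control flow of one round of an Algorithm 1 style loop may be sketched as follows; the sketch reuses the hypothetical solve_dra, estimate_class_reward, and select_arm helpers from the earlier sketches and illustrates the two-level structure rather than the exact Algorithm 1 of the disclosure.

```python
import numpy as np

def hatch_round(x_user, class_id, class_centers, class_histories,
                arm_histories, b_remaining, tau_remaining, rng):
    """One round of a HATCH-style loop (control-flow sketch): estimate the
    class-level rewards, solve the DRA with the adaptive ratio b/tau, pick an
    arm for the user's class, and execute it only with probability p_j."""
    # Resource allocation level: expected reward and allocation probability per class.
    u_hat = np.array([estimate_class_reward(X, y, c)
                      for (X, y), c in zip(class_histories, class_centers)])
    counts = np.array([len(y) for _, y in class_histories], dtype=float)
    phi = counts / counts.sum()                    # empirical class distribution
    p_hat = solve_dra(phi, u_hat, rho=b_remaining / tau_remaining)
    # Personal recommendation level: best arm for this user within the class.
    arm = select_arm(x_user, arm_histories[class_id])
    # Execute the arm (and consume one unit of resource) with probability p_j.
    executed = b_remaining > 0 and rng.random() < p_hat[class_id]
    return arm, executed
```

After each round, Algorithm 1 would additionally update b, τ, and the per-class and per-arm histories listed above.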
In some embodiments, Algorithm 1 may output a correct ordering of the expected rewards uj when the algorithm is executed for a large number of iterations until the model converges. In some embodiments, for two user classes j and j′, the j-th class may appear Nj(t−1) times until round t−1. In some embodiments, when the expected reward for the user class j is smaller than the expected reward for the user class j′ (e.g., uj<uj′), then at any round t≤T, the estimated expected rewards for the user classes j and j′ and their appearance times may satisfy the following condition:
P(ûj,t≥ûj′,t|Nj(t−1)≥lj)≤2t−1 (3)
where P(a|b) denotes the probability of the condition a given the condition b, and lj is a defined threshold parameter for the user class j.
In some embodiments, the proposed HATCH may be evaluated through a theoretical analysis on the regret (e.g., the difference in value between the decision that was made and the optimal decision). In some embodiments, the upper bound of the regret (maximum regret), denoted as vt(ρ), may be summarized as:
where u*j,t may be the optimal expected reward for an individual user in round t, which may be determined as u*j,t=xt,j,aTθ*j,a.
In some embodiments, the regret for HATCH, denoted as R(T, B), for the total amount of resource B and the total time-horizon T may be defined as
R(T,B)=U*(T,B)−U(T,B) (5)
where U*(T, B) may be the total optimal rewards, and U(T, B) may be the total rewards based on recommended actions by HATCH.
In some embodiments, Theorem 1 may be defined as follows: given a user class j, an expected reward uj, and a fixed parameter ρ∈(0, 1), let Δj=inf{|uj′−uj|}, where j′∈{1, 2, . . . J} and j′≠j. In some embodiments, let qj=Σj′=1jØj′, and for any class j∈{1, 2, . . . J}, the regret of HATCH R(T, B) with a total amount of resource B and a total time-horizon T may satisfy the following relationships:
(i) in non-boundary cases, where ρ≠qj for any j∈{1, 2 . . . J},
R(T,B)=O(Jβ√(Φ log T log(Φ log T)+J log T))
(ii) in boundary cases, where ρ=qj for some j∈{1, 2 . . . J},
R(T,B)=O(√T+Jβ√(Φ log T log(Φ log T)+J log T))
where λ is the regularized parameter, O(·) denotes the order of the regret bound, δ is a hyperparameter, Δ is the vector of the gaps Δj, and β and Φ are constant factors appearing in the bound.
As shown, in order to utilize the contextual information for users, HATCH may be used to conduct the policy learning of contextual bandits with a budget constraint, thereby training the model 200. In various embodiments, the effectiveness of the proposed HATCH method is illustrated below with respect to: (i) a synthetic evaluation that compares the HATCH method with three other state-of-the-art algorithms, and (ii) a real-world news article recommendation on a news platform.
In some embodiments, a synthetic data set may be generated to evaluate the HATCH method. In some embodiments, each generated context in the synthetic data set may contain 5 dimensions (dim=5), and each dimension may have a value between 0 and 1. In some embodiments, the algorithm may be evaluated on 10 user classes (J=10), with 10 arms executed for each user class to generate rewards. In some embodiments, the distribution of the 10 user classes may be set collectively as [0.025, 0.05, 0.075, 0.15, 0.2, 0.2, 0.15, 0.075, 0.05, 0.025], and the expected reward uj may be any random number between 0 and 1. In some embodiments, each arm may generate an optimal expected reward uj,a, which is the sum of the expected reward uj of the user class and a variable σj,a that measures the difference between the optimal expected reward uj,a and the expected reward uj (e.g., uj,a=uj+σj,a). In some embodiments, each dimension may have a weight wj,a, which may be a random number between 0 and 1, and thus ∥wj,a∥≤1. In some embodiments, 30000 users with contextual data information may be generated and clustered into the 10 classes, and the centric contextual information {tilde over (x)}j of each class may be determined. In some embodiments, for each class with the probability distribution Øj, rewards for each of the 10 arms may be generated from a normal distribution with a mean of uj,a+{tilde over (x)}jσj,a and a variance of 1. In some embodiments, the generated rewards may be normalized to 0 or 1.
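By way of illustration only, one way such a synthetic data set might be generated is sketched below; the uniform context sampling, the scale of the offsets σj,a, and the thresholding used to normalize rewards to 0 or 1 are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, J, N_ARMS, N_USERS = 5, 10, 10, 30000
class_probs = np.array([0.025, 0.05, 0.075, 0.15, 0.2,
                        0.2, 0.15, 0.075, 0.05, 0.025])

u = rng.random(J)                                  # expected reward u_j of each class
sigma = rng.random((J, N_ARMS)) * 0.1              # per-arm offsets sigma_{j,a} (scale assumed)
u_arm = u[:, None] + sigma                         # optimal expected reward u_{j,a} = u_j + sigma_{j,a}

classes = rng.choice(J, size=N_USERS, p=class_probs)
contexts = rng.random((N_USERS, DIM))              # 5-dimensional contexts in [0, 1]
centers = np.array([contexts[classes == j].mean(axis=0)
                    for j in range(J)])            # centric contextual information per class

def sample_reward(j, a):
    """Draw a reward around u_{j,a} with unit variance, then normalize it to
    0 or 1 by thresholding (the threshold rule is an assumed normalization)."""
    return float(rng.normal(u_arm[j, a], 1.0) > 0.5)
```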
In some embodiments, the disclosed algorithm is compared with three state-of-the-art algorithms: greedy-LinUCB, random-LinUCB, and cluster-UCB-ALP. Greedy-LinUCB adopts the LinUCB strategy and chooses the optimal arm in each turn; when the choice is executed, one unit of resource is consumed. Random-LinUCB likewise adopts the LinUCB strategy to choose the optimal arm in each turn, but allocates the exploration resource at random rather than greedily. Cluster-UCB-ALP applies an adaptive linear programming (ALP) method to the UCB setting (e.g., it only counts the reward and the number of occurrences for each user class and does not use class features due to the UCB setting).
In some embodiments, since the regret definitions may not be identical for all compared algorithms, the accumulated regret of each algorithm, defined as the optimal reward minus the reward of the executed actions, may be compared instead. In some embodiments, four different scenarios with time and budget constraint ratios ρ of 0.125, 0.25, 0.375, and 0.5 may be set for each algorithm, and each algorithm may be executed for 10000, 20000, and 30000 rounds.
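By way of illustration only, the accumulated regret used for this comparison may be computed as follows, assuming the per-round optimal rewards and the rewards of the executed actions have been logged for each algorithm.

```python
import numpy as np

def accumulated_regret(optimal_rewards, obtained_rewards):
    """Accumulated regret after each round: the running sum of the optimal
    reward minus the reward of the executed action."""
    return np.cumsum(np.asarray(optimal_rewards) - np.asarray(obtained_rewards))

print(accumulated_regret([1.0, 0.9, 1.0], [0.5, 0.9, 0.0]))   # -> [0.5 0.5 1.5]
```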
In some embodiments, a news article recommendation task on a news platform may be used to evaluate HATCH. In some embodiments, real-world data may be collected from the news platform front page for two days. In some embodiments, when users visit the news platform front page, it may recommend and display high-quality news articles from a candidate article list. In some embodiments, approximately 4.68 million users may be observed. In some embodiments, each user interaction may be represented by three parameters: user contextual data x, which may include user and article selection features; a recommended action a, which may include recommended candidate articles; and a reward r, which may be a binary value (e.g., 0 if the user did not click the recommended candidate article, and 1 if the user clicked the recommended article). Thus, for each user, user features may be represented in the form of triples (e.g., (x, a, r)), and the user contextual dataset may collectively include user features for all users. In some embodiments, user features for 1.28 million users who were recommended the top 6 candidate articles may be randomly selected and fully shuffled to form the user contextual dataset for HATCH's learning process.
In some embodiments, half of the user contextual dataset may be applied to a predefined Gaussian Mixture Model (GMM) to obtain the distributions of all clustered classes. In some embodiments, the user contextual dataset may be clustered into 10 classes, denoted as Ø1 to Ø10, based on user contextual data of the plurality of users.
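By way of illustration only, the GMM clustering step may be realized as sketched below; the use of scikit-learn's GaussianMixture, the feature dimension, and the random placeholder data are assumptions for the sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.random((200000, 6))                        # user contextual data (dimension assumed)

half = len(X) // 2
gmm = GaussianMixture(n_components=10, random_state=0).fit(X[:half])   # fit GMM on half of the data
class_ids = gmm.predict(X)                         # assign every user to one of the 10 classes
class_probs = np.bincount(class_ids, minlength=10) / len(X)            # empirical distributions of the classes
centers = gmm.means_                               # centric contextual information of each class
```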
In some embodiments, an algorithm, denoted as Algorithm 2, may be used for clustering the plurality of users into the plurality of classes and conducting the policy learning, in order to avoid early drifting in the class distribution (e.g., an unstable class assignment in the early stage of the clustering process may lead to the abandonment of some contextual data, so that the choice of arms would concentrate on only a few arms). In some embodiments, Algorithm 2 may include the following steps: [Algorithm 2: pseudocode of the clustering and policy learning process]
In some embodiments, Algorithm 2 may execute the following actions: (i) line 4 may create J empty buckets; (ii) lines 5-7 may assign users with user contextual data xj into the bucket bucketj (e.g., users with user contextual data x1 into the bucket bucket1); (iii) lines 8-9 may cluster a plurality of users into a plurality of classes Øj; (iv) lines 10-12 may sample data randomly from the bucket bucketj and select a recommended action at through the current bandit algorithm; (v) line 13 may put user features of a selected user, denoted as (xt, at, rt), into a historical dataset ht; and (vi) lines 14-15 may conduct a policy learning.
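By way of illustration only, the bucketed procedure of Algorithm 2 may be sketched as follows; the policy object with select and update methods is hypothetical, and the rule of learning only from rounds where the recommendation matches the logged action is an assumption borrowed from standard offline bandit evaluation rather than an explicit step of Algorithm 2.

```python
import random
from collections import defaultdict

def bucketed_replay(triples, class_ids, policy, n_rounds, seed=0):
    """Sketch of an Algorithm 2 style evaluation: place each logged (x, a, r)
    triple into its class bucket, repeatedly sample a logged event, ask the
    current bandit for a recommendation, and learn from matching rounds."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for (x, a, r), j in zip(triples, class_ids):   # lines 5-7: fill the per-class buckets
        buckets[j].append((x, a, r))
    history, total_reward = [], 0.0
    for _ in range(n_rounds):
        j = rng.choice(list(buckets))              # pick a class bucket (uniformly, for simplicity)
        x, a_logged, r = rng.choice(buckets[j])    # lines 10-12: sample a logged event
        a_chosen = policy.select(x, j)             # recommendation from the current bandit
        if a_chosen == a_logged:
            history.append((x, a_chosen, r))       # line 13: store (x_t, a_t, r_t)
            total_reward += r
            policy.update(x, j, a_chosen, r)       # lines 14-15: conduct the policy learning
    return total_reward / max(len(history), 1)     # averaged reward (CTR)
```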
In some embodiments, Algorithm 2 may be applied to HATCH and three other baseline methods, namely random-LinUCB, greedy-LinUCB, and cluster-UCB-ALP, to obtain averaged rewards (click-through rates, or CTRs) for each method and to evaluate the performance of HATCH. In some embodiments, Algorithm 2 may be run 50000 times for each method. In some embodiments, for random-LinUCB, greedy-LinUCB, and HATCH, a constant parameter α may be set as 1 (α=1). In some embodiments, the parameter α may be kept consistent between the resource allocation level and the personal recommendation level.
Table 1 illustrates exemplary average rewards (CTR) for HATCH and three other baseline methods after Algorithm 2 is executed for 50000 rounds, in accordance with various embodiments. Random-LinUCB generates the lowest rewards for all time and budget constraints ρ, and thus has the worst performance among all evaluated methods. HATCH significantly outperforms the other methods, as its average rewards are much higher than those of the three baseline methods for all time and budget constraints ρ.
Table 2 illustrates exemplary normalized occupancy rates of different user classes, in accordance with various embodiments. In some embodiments, the occupancy rates may be decided by the allocation rate and the total number of users in each class. Classes 5, 6, and 8 have the highest occupancy rates for all time and budget constraints ρ, whereas classes 1, 2, 9, and 10 have the lowest occupancy rates for all time and budget constraints ρ. Thus, HATCH tends to allocate more resources to classes with higher average rewards and fewer resources to classes with lower average rewards under all conditions.
HATCH described above may be applied in news recommendations. In some embodiments, the platform is an information presentation platform. The information may include, for example, news article, e-commerce item, etc. The user contextual data of the visiting user includes a plurality of visitor features of the visiting user. The plurality of visitor features may include one or more of the following: a timestamp of the real-time online signal of visiting the platform, a geographical location of the visiting user (e.g., a GPS location of the computing device of the visiting user), biographical information of the visiting user, a browsing history of the visiting user, and a history of click response to different categories of online information (e.g., whether the user is more receptive to a certain category of information). By executing HATCH at the system 102, one or more computing devices may determine the resource allocation action, which includes one or more categories of information for display at the computing device of the visiting user. Once determined, the system 102 may transmit a return signal comprising a display signal of the one or more categories of information to the computing device of the visiting user, such that personalized information (e.g., differentially positioned news articles on the webpage 301) is displayed at the computing device.
Block 412 includes obtaining, by one or more computing devices, a model comprising an environment module, a resource allocation module, and a personal recommendation module. The environment module is configured to: cluster a plurality of users of a platform into a plurality of classes based on user contextual data of each individual user in the plurality of users, determine centric contextual information of each of the classes, output the centric contextual information of each of the classes to the resource allocation module, and output user contextual data of each individual user to the personal recommendation module. The resource allocation module comprises one or more first parameters of each of the classes and is configured to: determine probabilities of the platform making resource allocations to users in the respective classes, based on the one or more first parameters of each of the classes and the centric contextual information of each of the classes, and output the probability to the personal recommendation module. The personal recommendation module comprises one or more second parameters of each of the classes and is configured to: determine, based on user contextual data of an individual user, a corresponding class of the individual user among the classes, and the probabilities, a corresponding probability of the platform making a resource allocation to the individual user, determine, based on the one or more second parameters, different expected rewards corresponding to the platform executing different actions of making different resource allocations to the individual user in the corresponding class, select an action from the different actions according to the different expected rewards, wherein a probability of executing the selected action is the corresponding probability, and output the selected action. For example, if the resource allocation module determines probabilities P1 for class 1 and P2 for class 2, for an individual user (e.g., a visiting user of the platform in real-time, a virtual user used in training), the personal recommendation module may determine that the individual user falls under class 1 based on her user contextual data, and then determine the probability P1 for the individual user based on the determined class 1.
In some embodiments, for a training of the model, the environment module is configured to receive the selected action and update the one or more first parameters and the one or more second parameters based at least on the selected action by feedbacking a reward to the resource allocation module and the personal recommendation module; and the reward is based at least on the selected action and the probability of executing the selected action.
Block 414 includes receiving, by the one or more computing devices, a real-time online signal of visiting the platform from a computing device of a visiting user;
Block 416 includes determining, by the one or more computing devices, a resource allocation action by feeding user contextual data of the visiting user to the model as the individual user and obtaining the selected action as the resource allocation action. For example, the visiting user may be fed to the model as the individual user, and the model may determine her user contextual data, her corresponding class, and a recommended action for her.
Block 418 includes, based on the determined resource allocation action, transmitting, by the one or more computing devices, a return signal to the computing device to present the resource allocation action.
In some embodiments, the platform is a ride-hailing platform; the real-time online signal of visiting the platform corresponds to a bubbling of a transportation order at the ride-hailing platform; the user contextual data of the visiting user comprises a plurality of bubbling features of a transportation plan of the visiting user; and the plurality of bubbling features comprise (i) a bubble signal comprising a timestamp, an origin location of the transportation plan of the visiting user, a destination location of the transportation plan, a route departing from the origin location and arriving at the destination location, a vehicle travel duration along the route, and a price quote corresponding to the transportation plan, (ii) a supply and demand signal comprising a number of passenger-seeking vehicles around the origin location, and a number of vehicle-seeking transportation orders departing from the origin location, and (iii) a transportation order history signal of the visiting user. In various embodiments, a user of the ride-hailing platform may log into a mobile phone APP or a website of an online ride-hailing platform and submit a request for transportation service—which can be referred to as bubbling. For example, a user may enter the starting and ending locations of a transportation trip and view the estimated price through bubbling. Bubbling takes place before acceptance and submission of an order of the transportation service. For example, after receiving the estimated price (with or without a discount), the user may accept the order to submit it or reject the order. If the order is accepted, the online ride-hailing platform may match a vehicle with the submitted order.
In some embodiments, the origin location of the transportation plan of the visiting user comprises a geographical positioning signal of the computing device of the visiting user; and the geographical positioning signal comprises a Global Positioning System (GPS) signal.
In some embodiments, the transportation order history signal of the visiting user comprises one or more of the following: a frequency of transportation order bubbling by the visiting user; a frequency of transportation order completion by the visiting user; a history of discount offers provided to the visiting user in response to the transportation order bubbling; and a history of responses of the visiting user to the discount offers.
In some embodiments, the determined resource allocation action corresponds to the selected action and comprises offering a price discount (e.g., 10%, 20%, etc.) for the transportation plan; and the return signal comprises a display signal of the route, the price quote, and the price discount for the transportation plan. In some embodiments, the method further comprises: receiving, by the one or more computing devices, from the computing device of the visiting user, an acceptance signal comprising an acceptance of the transportation plan of the visiting user, the price quote, and the price discount; and transmitting, by the one or more computing devices, the transportation plan to a computing device of a vehicle driver for fulfilling the transportation order.
In some embodiments, the model is based on contextual multi-armed bandits; and the resource allocation module and the personal recommendation module correspond to hierarchical adaptive contextual bandits.
In some embodiments, the action comprises making no resource distribution or making one of a plurality of different amounts of resource distribution; and each of the actions corresponds to a respective cost to the platform.
In some embodiments, the model is configured to dynamically allocate resources to individual users; and the personal recommendation module is configured to select the action from the different actions by maximizing a total reward to the platform, subject to a limit of a total cost over a time period, the total cost corresponding to a total amount of distributed resources.
In some embodiments, the method further comprises training, by the one or more computing devices, the model by feeding historical data to the model, wherein each of the different actions is subject to a total cost over a time period, wherein: the total cost corresponds to a total amount of distributed resource; and the personal recommendation module is configured to determine, based on the one or more second parameters and previous training sessions based on the historical data, the different expected rewards corresponding to the platform executing the different actions of making the different resource allocations to the individual user.
In some embodiments, the resource allocation module is configured to maximize a cumulative sum of pjØjuj; pj represents the probability of the platform making a resource allocation to users in a corresponding class j of the classes; Øj represents a probability distribution of the corresponding class j among the classes; uj represents an expected reward of the corresponding class j; and a cumulative sum of pjØj is no larger than a ratio of a total cost budget of the platform over a time period T. In some embodiments, the one or more first parameters comprise the pj and uj, and the one or more second parameters comprise θj. In some embodiments, the resource allocation module is configured to determine the expected reward of the corresponding class j based on centric contextual information of the corresponding class j, historical observations of the corresponding class j, and historical rewards of the corresponding class j.
In some embodiments, the model is configured to maximize a total reward to the platform over a time period T; and the model corresponds to a regret bound of O(√T).
In some embodiments, if the corresponding class and the selected action exist in historical data used to train the model, the environment module is configured to identify a corresponding historical reward from the historical data as the reward; and if the corresponding class or the selected action does not exist in the historical data, the environment module is configured to use an approximation function to approximate the reward.
In some embodiments, the platform is an information presentation platform; the user contextual data of the visiting user comprises a plurality of visitor features of the visiting user; the plurality of visitor features comprise one or more of the following: a timestamp of the real-time online signal of visiting the platform, a geographical location of the visiting user, biographical information of the visiting user, a browsing history of the visiting user, and a history of click response to different categories of online information; the determined resource allocation action comprises one or more categories of information for display at the computing device of the visiting user; and the return signal comprises a display signal of the one or more categories of information.
In some embodiments, the computer system 510 may include an obtaining module 512 configured to obtain a model comprising an environment module, a resource allocation module, and a personal recommendation module. The environment module, the resource allocation module, and the personal recommendation module may correspond to instructions (e.g., software instructions) of the model. The environment module is configured to: cluster a plurality of users of a platform into a plurality of classes based on user contextual data of each user in the plurality of users, determine centric contextual information of each of the classes, output the centric contextual information of each of the classes to the resource allocation module, and output user contextual data of each individual user to the personal recommendation module. The resource allocation module comprises one or more first parameters of each of the classes and is configured to: determine probabilities of the platform making resource allocations to users in the respective classes, based on the one or more first parameters of each of the classes and the centric contextual information of each of the classes, and output the probability to the personal recommendation module. The personal recommendation module comprises one or more second parameters of each of the classes and is configured to: determine, based on user contextual data of an individual user, a corresponding class of the individual user among the classes, and the probabilities, a corresponding probability of the platform making a resource allocation to the individual user, determine, based on the one or more second parameters, different expected rewards corresponding to the platform executing different actions of making different resource allocations to the individual user in the corresponding class, select an action from the different actions according to the different expected rewards, wherein a probability of the platform executing the action is the corresponding probability, and output the selected action. The computer system 510 may further include a receiving module 514 configured to receive a real-time online signal of visiting the platform from a computing device of a visiting user; a determining module 516 configured to determine a resource allocation action by feeding user contextual data of the visiting user to the model as the individual user and obtaining the selected action as the resource allocation action; and a transmitting module 518 configured to, based on the determined resource allocation action, transmit a return signal to the computing device to present the resource allocation action.
The computer system 600 also includes a main memory 606, such as a random access memory (RAM), cache, and/or other dynamic storage devices, coupled to bus 602 for storing information and instructions to be executed by processor 604. Main memory 606 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 604. Such instructions, when stored in storage media accessible to processor 604, render computer system 600 into a special-purpose machine that is customized to perform the operations specified in the instructions. The computer system 600 further includes a read-only memory (ROM) 608 or other static storage device coupled to bus 602 for storing static information and instructions for processor 604. A storage device 610, such as a magnetic disk, optical disk, or USB thumb drive (Flash drive), etc., is provided and coupled to bus 602 for storing information and instructions.
The computer system 600 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware, and/or program logic which in combination with the computer system causes or programs computer system 600 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 600 in response to processor(s) 604 executing one or more sequences of one or more instructions contained in main memory 606. Such instructions may be read into main memory 606 from another storage medium, such as storage device 610. Execution of the sequences of instructions contained in main memory 606 causes processor(s) 604 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
The main memory 606, the ROM 608, and/or the storage 610 may include non-transitory storage media. The term “non-transitory media,” and similar terms, as used herein refers to a media that stores data and/or instructions that cause a machine to operate in a specific fashion. The media excludes transitory signals. Such non-transitory media may include non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 610. Volatile media includes dynamic memory, such as main memory 606. Common forms of non-transitory media may include, for example, a floppy disk, a flexible disk, hard disk, solid-state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge, and networked versions of the same.
The computer system 600 also includes a network interface 618 coupled to bus 602. Network interface 618 provides a two-way data communication coupling to one or more network links that are connected to one or more local networks. For example, network interface 618 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, network interface 618 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN (or a WAN component to communicate with a WAN). Wireless links may also be implemented. In any such implementation, network interface 618 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.
The computer system 600 can send messages and receive data, including program code, through the network(s), network link, and network interface 618. In the Internet example, a server might transmit a requested code for an application program through the Internet, the ISP, the local network, and the network interface 618.
The received code may be executed by processor 604 as it is received, and/or stored in storage device 610, or other non-volatile storage for later execution.
Each of the processes, methods, and algorithms described in the preceding sections may be embodied in, and fully or partially automated by, code modules executed by one or more computer systems or computer processors including computer hardware. The processes and algorithms may be implemented partially or wholly in application-specific circuitry.
The various features and processes described above may be used independently of one another, or may be combined in various ways. All possible combinations and sub-combinations are intended to fall within the scope of this disclosure. In addition, certain method or process blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate. For example, described blocks or states may be performed in an order other than that specifically disclosed, or multiple blocks or states may be combined in a single block or state. The exemplary blocks or states may be performed in serial, in parallel, or in some other manner. Blocks or states may be added to or removed from the disclosed exemplary embodiments. The exemplary systems and components described herein may be configured differently than described. For example, elements may be added to, removed from, or rearranged compared to the disclosed exemplary embodiments.
The various operations of exemplary methods described herein may be performed, at least partially, by an algorithm. The algorithm may be included in program codes or instructions stored in a memory (e.g., a non-transitory computer-readable storage medium described above). Such an algorithm may include a machine learning algorithm. In some embodiments, a machine learning algorithm may not explicitly program computers to perform a function, but can learn from training data to build a prediction model that performs the function.
The various operations of exemplary methods described herein may be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors may constitute processor-implemented engines that operate to perform one or more operations or functions described herein.
Similarly, the methods described herein may be at least partially processor-implemented, with a particular processor or processors being an example of hardware. For example, at least some of the operations of a method may be performed by one or more processors or processor-implemented engines. Moreover, the one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS).
Any process descriptions, elements, or blocks in the flow diagrams described herein and/or depicted in the attached figures should be understood as potentially representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps in the process. Alternate implementations are included within the scope of the embodiments described herein in which elements or functions may be deleted, executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art.
As used herein, the term “or” may be construed in either an inclusive or exclusive sense. Moreover, plural instances may be provided for resources, operations, or structures described herein as a single instance. Additionally, boundaries between various resources, operations, engines, and data stores are somewhat arbitrary, and particular operations are illustrated in a context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within a scope of various embodiments of the present disclosure. In general, structures and functionality presented as separate resources in the exemplary configurations may be implemented as a combined structure or resource. Similarly, structures and functionality presented as a single resource may be implemented as separate resources. These and other variations, modifications, additions, and improvements fall within a scope of embodiments of the present disclosure as represented by the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Although an overview of the subject matter has been described with reference to specific exemplary embodiments, various modifications and changes may be made to these embodiments without departing from the broader scope of embodiments of the present disclosure. Such embodiments of the subject matter may be referred to herein, individually or collectively, by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single disclosure or concept if more than one is, in fact, disclosed.
The embodiments illustrated herein are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed. Other embodiments may be used and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. The Detailed Description, therefore, is not to be taken in a limiting sense, and the scope of various embodiments is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.