AUTOMATED POLICY FUNCTION ADJUSTMENT USING REINFORCEMENT LEARNING ALGORITHM

Information

  • Patent Application
  • Publication Number
    20230298080
  • Date Filed
    February 13, 2023
  • Date Published
    September 21, 2023
Abstract
An online system may receive, from a content provider, a content presentation campaign that includes one or more objectives. The online system may define a set of one or more policy functions that automatically controls the content presentation campaign. A policy function may control one or more criteria in bidding content slots. The online system may monitor a realized outcome of the content presentation campaign. The online system may apply a reinforcement learning algorithm in adjusting the set of policy functions. The reinforcement learning algorithm adjusts one or more parameters in the set of policy functions to reduce a difference between the realized outcome and the desired outcome set by the content provider. The online system generates an adjusted set of policy functions and uses the adjusted set of policy functions in bidding content slots to present one or more content items provided by the content provider.
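The monitor-and-adjust loop described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the application: the `BidPolicy` class, the single `multiplier` parameter, the proportional update rule, and the toy linear `simulate_outcome` environment are assumptions chosen only to show how a parameter might be nudged to shrink the gap between a realized and a desired outcome. It is a simple heuristic feedback rule, not a full reinforcement learning algorithm.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BidPolicy:
    """Hypothetical policy function: bid = multiplier * estimated slot value."""
    multiplier: float = 1.0

    def bid(self, estimated_value: float) -> float:
        return self.multiplier * estimated_value


def adjust(policy: BidPolicy, desired: float, realized: float,
           step: float = 0.5) -> BidPolicy:
    """Return a new policy whose multiplier is nudged toward the desired outcome.

    Assumed proportional rule: raise the multiplier when the campaign
    under-delivers, lower it when the campaign over-delivers.
    """
    relative_gap = (desired - realized) / desired
    new_multiplier = max(0.0, policy.multiplier * (1.0 + step * relative_gap))
    return BidPolicy(multiplier=new_multiplier)


def simulate_outcome(policy: BidPolicy) -> float:
    """Toy environment (assumption): realized spend is linear in the multiplier."""
    return 80.0 * policy.multiplier


def run_campaign(desired: float, rounds: int = 20) -> Tuple[BidPolicy, float]:
    """Alternate between monitoring the outcome and adjusting the policy."""
    policy = BidPolicy()
    realized = simulate_outcome(policy)
    for _ in range(rounds):
        policy = adjust(policy, desired, realized)
        realized = simulate_outcome(policy)
    return policy, realized


if __name__ == "__main__":
    final_policy, final_outcome = run_campaign(desired=100.0)
    print(final_policy.multiplier, final_outcome)
```

In this toy setting the multiplier settles near 1.25, the value at which the simulated outcome matches the target of 100. A real campaign would see noisy, delayed outcomes, so the step size and update rule would need far more care.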
Claims
  • 1. A method comprising, at an online concierge system comprising a processor and a computer-readable medium:
    receiving, by the online concierge system from a computer system associated with a content provider, a content presentation campaign that includes one or more objectives set by the content provider, at least one of the objectives defining a desired outcome;
    defining a set of one or more policy functions that automatically controls the content presentation campaign, at least one of the policy functions controlling one or more criteria in bidding content slots of the online concierge system;
    monitoring a realized outcome of the content presentation campaign that is controlled by the set of policy functions;
    applying a reinforcement learning algorithm to adjust the set of policy functions, the reinforcement learning algorithm adjusting one or more parameters in the set of policy functions to reduce a difference between the realized outcome of the content presentation campaign and the desired outcome set by the content provider;
    generating an adjusted set of policy functions by the reinforcement learning algorithm; and
    using the adjusted set of policy functions in bidding content slots associated with the online concierge system to present one or more content items provided by the content provider.
  • 2. The method of claim 1, wherein using the adjusted set of policy functions in bidding content slots associated with the online concierge system comprises inputting estimates of a content slot to a first policy function of the policy functions to generate a bid value.
  • 3. The method of claim 2, wherein inputting estimates of a content slot to the first policy function to generate the bid value comprises:
    generating a plurality of features related to the content slot;
    inputting the plurality of features to a machine learning model to predict the estimates; and
    using the estimates generated by the machine learning model as inputs to the first policy function.
  • 4. The method of claim 1, wherein the set of policy functions comprises a plurality of policy functions, each policy function being defined based on an objective provided by the content provider.
  • 5. The method of claim 1, wherein at least one of the policy functions comprises a plurality of states that record past actions and outcomes associated with the policy function.
  • 6. The method of claim 1, wherein the reinforcement learning algorithm updates the set of policy functions through a counterfactual policy estimation.
  • 7. The method of claim 1, wherein the reinforcement learning algorithm is heuristic based, and wherein applying the heuristic based reinforcement learning algorithm comprises:
    defining a rule in adjusting a parameter in a policy function;
    examining the policy function at a previous state that has a known realized outcome; and
    adjusting the parameter based on the rule and the known realized outcome of the previous state.
  • 8. The method of claim 1, wherein the reinforcement learning algorithm is machine learning based.
  • 9. The method of claim 1, wherein one of the content items in the content presentation campaign is a sponsored item offered on one or more interfaces hosted by the online concierge system.
  • 10. A non-transitory computer readable medium configured to store code comprising instructions, the instructions, when executed by one or more processors, cause the one or more processors to:
    receive, by an online concierge system from a computer system associated with a content provider, a content presentation campaign that includes one or more objectives set by the content provider, at least one of the objectives defining a desired outcome;
    define a set of one or more policy functions that automatically controls the content presentation campaign, at least one of the policy functions controlling one or more criteria in bidding content slots of the online concierge system;
    monitor a realized outcome of the content presentation campaign that is controlled by the set of policy functions;
    apply a reinforcement learning algorithm to adjust the set of policy functions, the reinforcement learning algorithm adjusting one or more parameters in the set of policy functions to reduce a difference between the realized outcome of the content presentation campaign and the desired outcome set by the content provider;
    generate an adjusted set of policy functions by the reinforcement learning algorithm; and
    use the adjusted set of policy functions in bidding content slots associated with the online concierge system to present one or more content items provided by the content provider.
  • 11. The non-transitory computer readable medium of claim 10, wherein using the adjusted set of policy functions in bidding content slots associated with the online concierge system comprises inputting estimates of a content slot to a first policy function of the policy functions to generate a bid value.
  • 12. The non-transitory computer readable medium of claim 11, wherein inputting estimates of a content slot to the first policy function to generate the bid value comprises:
    generating a plurality of features related to the content slot;
    inputting the plurality of features to a machine learning model to predict the estimates; and
    using the estimates generated by the machine learning model as inputs to the first policy function.
  • 13. The non-transitory computer readable medium of claim 10, wherein the set of policy functions comprises a plurality of policy functions, each policy function being defined based on an objective provided by the content provider.
  • 14. The non-transitory computer readable medium of claim 10, wherein at least one of the policy functions comprises a plurality of states that record past actions and outcomes associated with the policy function.
  • 15. The non-transitory computer readable medium of claim 10, wherein the reinforcement learning algorithm updates the set of policy functions through a counterfactual policy estimation.
  • 16. The non-transitory computer readable medium of claim 10, wherein the reinforcement learning algorithm is heuristic based, and wherein applying the heuristic based reinforcement learning algorithm comprises:
    defining a rule in adjusting a parameter in a policy function;
    examining the policy function at a previous state that has a known realized outcome; and
    adjusting the parameter based on the rule and the known realized outcome of the previous state.
  • 17. The non-transitory computer readable medium of claim 10, wherein the reinforcement learning algorithm is machine learning based.
  • 18. The non-transitory computer readable medium of claim 10, wherein one of the content items in the content presentation campaign is a sponsored item offered on one or more interfaces hosted by the online concierge system.
  • 19. An online concierge system comprising:
    one or more processors; and
    memory configured to store code comprising instructions, the instructions, when executed by the one or more processors, cause the one or more processors to:
    receive, by an online concierge system from a computer system associated with a content provider, a content presentation campaign that includes one or more objectives set by the content provider, at least one of the objectives defining a desired outcome;
    define a set of one or more policy functions that automatically controls the content presentation campaign, at least one of the policy functions controlling one or more criteria in bidding content slots of the online concierge system;
    monitor a realized outcome of the content presentation campaign that is controlled by the set of policy functions;
    apply a reinforcement learning algorithm to adjust the set of policy functions, the reinforcement learning algorithm adjusting one or more parameters in the set of policy functions to reduce a difference between the realized outcome of the content presentation campaign and the desired outcome set by the content provider;
    generate an adjusted set of policy functions by the reinforcement learning algorithm; and
    use the adjusted set of policy functions in bidding content slots associated with the online concierge system to present one or more content items provided by the content provider.
  • 20. The online concierge system of claim 19, wherein using the adjusted set of policy functions in bidding content slots associated with the online concierge system comprises inputting estimates of a content slot to a first policy function of the policy functions to generate a bid value.
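The bid-generation path recited in claims 2, 3, 11, 12, and 20 (features of a content slot, then a model-produced estimate, then a policy function that outputs a bid value) might be sketched as follows. This is only an illustration under stated assumptions: the feature names, the fixed logistic-regression weights standing in for a trained machine learning model, and the `multiplier` and `max_bid` parameters of the policy function are all invented for the example and do not come from the application.

```python
import math
from typing import Dict


def slot_features(slot: Dict[str, float]) -> Dict[str, float]:
    """Generate features for a content slot (feature names are hypothetical)."""
    return {
        "position_score": 1.0 / (1.0 + slot["position"]),
        "relevance": slot["relevance"],
    }


def predict_click_probability(features: Dict[str, float]) -> float:
    """Stand-in for a trained model: logistic regression with made-up weights."""
    z = -2.0 + 1.5 * features["position_score"] + 2.0 * features["relevance"]
    return 1.0 / (1.0 + math.exp(-z))


def policy_bid(click_probability: float, value_per_click: float,
               multiplier: float, max_bid: float) -> float:
    """Assumed policy function: scaled expected value, capped at max_bid.

    `multiplier` and `max_bid` play the role of the adjustable parameters.
    """
    return min(max_bid, multiplier * click_probability * value_per_click)


def bid_for_slot(slot: Dict[str, float], value_per_click: float = 2.0,
                 multiplier: float = 1.0, max_bid: float = 5.0) -> float:
    """Features -> model estimate -> policy function -> bid value."""
    features = slot_features(slot)
    estimate = predict_click_probability(features)
    return policy_bid(estimate, value_per_click, multiplier, max_bid)


if __name__ == "__main__":
    top_slot = {"position": 0, "relevance": 0.25}
    print(bid_for_slot(top_slot))
```

Keeping the estimate and the policy function separate means the adjustment step only has to change `multiplier` or `max_bid`; the model that produces the estimate can be retrained independently.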
Provisional Applications (1)
Number Date Country
63310022 Feb 2022 US