The invention relates generally to computer systems, and more particularly to an improved system and method for matching objects belonging to hierarchies.
Content match is a common procedure performed for placing appropriate ads on web pages. An objective of placing appropriate ads on web pages is to maximize total revenue from user clicks. In general, there may be many applications like content match where random elements of a set S arrive sequentially and are matched to elements in another set A. Every match may receive a stochastic reward with an unknown probability, and the goal is to maximize expected reward accumulated through time. Such applications include product recommendations for users visiting an e-commerce website like amazon.com based on visitors' demographics, previous purchase history, etc. In this case, set S may consist of unique visitors who are matched to a set A of products with an objective of maximizing total sales revenue.
When placing ads on pages in the context of content match, information that may be useful includes page attributes (e.g., page topic, content, etc.), ad attributes (e.g., theme of the ad, anchor text, landing page, etc.), and other contextual information (user demographics, recent user behavior, etc.). Assuming both pages and ads have been mapped to high dimensional feature spaces and each click on an ad earns some revenue, an online advertising service would want to be able to map points in a feature space of page attributes to another feature space of ad attributes to maximize total expected revenue. This may involve exploring different ads to find good ones more effectively and exploiting the ads that are currently known to have good click rates. However, designing effective policies for matching ads to web pages in this context is a daunting task for several reasons. First, the feature spaces are extremely large (billions of pages and millions of ads, with considerable diversity and heterogeneity in both pages and ads), and the data are extremely sparse since only a few interactions may be observed for a majority of page-ad feature pairs. Second, the click-through rate (hereafter CTR), defined as the number of clicks per impression (number of showings), is small for a majority of page-ad feature pairs, leading to increased learning time. Third, exploration for effective ads needs to be accomplished with good short-term performance. Business considerations often constrain how CTR values may be learned: a policy should learn CTR values in an online setting for a large majority of page-ad feature pairs, which is important since the available inventory is finite. For instance, the best ads may run out for certain pages, and there may be an opportunity to increase overall revenue by understanding alternative matchings. Accordingly, CTR values need to be learned within a reasonable time horizon and without incurring large drops in revenue, even in the short run. A policy for matching ads to web pages that does excessive exploration may provide gradual but slow revenue growth before it converges to the optimal matching. On the other hand, a policy that merely tries to achieve optimality quickly may incur an unnecessarily large revenue loss during the learning period. An ideal policy would converge rapidly to the optimal matching while maintaining a smooth revenue profile.
To deal with these difficulties, existing content match techniques may reduce the dimensionality of both web page and ad features by assuming CTRs are simple functions of the web page and ad features. Although convenient, the assumption of linearity and additivity of page and ad features is often violated in content matching and leads to CTR estimates that are biased. In fact, interactions among features typically occur and are extremely important for learning CTRs. What is needed is a way to match objects in one set arriving sequentially with objects in another set by using features of the objects. Such a system and method should be able to match objects in order to maximize expected reward accumulated through time where the sets are large and sparse.
Briefly, the present invention may provide a system and method for matching objects belonging to hierarchies. In various embodiments, a server may include an operably coupled matching engine that may provide services for matching objects classified in one taxonomy with objects classified in another taxonomy by running multi-armed bandits for multiple levels of the taxonomies in order to maximize an overall payoff. The matching engine may include an operably coupled index generator for generating indexes for accessing multiple taxonomies and payoff probabilities, a multi-armed bandit engine for running bandits to determine payoff probabilities for matching an object from a taxonomy with objects from another taxonomy, and a shrinkage estimator for performing shrinkage estimation of the payoff probabilities for matched objects from the taxonomies.
The present invention may provide a framework for learning an optimal matching between two feature spaces that may be organized as taxonomies using multi-armed bandits. In an embodiment, a content match application may use the present invention for placing advertisements on web pages to maximize total revenue from user clicks. In general, an allocation step may be performed when a page class arrives by matching it to an appropriate ad class based on the current estimates of the CTR values. Then an estimation step may be performed to estimate CTR values after taking into account the outcomes of previous allocations.
In particular, a taxonomy of web page classes may be partitioned into web page class groups and a taxonomy of ad classes may be partitioned into ad class groups. A multi-level policy may run bandits at two levels of the taxonomies: first, a bandit may be run on the ad class groups corresponding to a page class group to select an ad class group, and then a bandit may be run on the ad classes of the selected ad class group to select an ad class. After the arriving page class may be allocated to an ad class resulting in a click or no-click, CTR values may be estimated for page-ad pairs of the group. The CTR estimates may be derived from a beta-binomial model if the beta-binomial model is a good fit for the page-ad group; if not, maximum likelihood estimates may be used instead.
Accordingly, the present invention may be used to learn an optimal matching between two feature spaces that may be organized as taxonomies. The matching may be performed through a multi-level exploration of the hierarchical feature spaces by using multi-armed bandits where the arms of the bandit may be dependent due to the structure induced by the taxonomies. Advantageously, the present invention may use the taxonomy structures and may perform shrinkage estimation in a Bayesian framework to exploit dependencies among the arms, thereby enhancing exploration without losing efficiency on short term exploitation. Other advantages will become apparent from the following detailed description when taken in conjunction with the drawings, in which:
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer system 100 may include a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer system 100 and includes both volatile and nonvolatile media. For example, computer-readable media may include volatile and nonvolatile computer storage media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer system 100. Communication media may include computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. For instance, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
The system memory 104 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 106 and random access memory (RAM) 110. A basic input/output system 108 (BIOS), containing the basic routines that help to transfer information between elements within computer system 100, such as during start-up, is typically stored in ROM 106. Additionally, RAM 110 may contain operating system 112, application programs 114, other executable code 116 and program data 118. RAM 110 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by CPU 102.
The computer system 100 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, discussed above and illustrated in
The computer system 100 may operate in a networked environment using a network 136 to connect to one or more remote computers, such as a remote computer 146. The remote computer 146 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer system 100. The network 136 depicted in
The present invention is generally directed towards a system and method for matching objects belonging to hierarchies and may be used to learn an optimal matching between two feature spaces that may be organized as taxonomies. The matching may be performed by using multi-armed bandits where the arms of the bandit may be dependent due to the structure induced by the taxonomies. A multi-stage hierarchical allocation may then be employed that may improve exploration of the feature spaces using multi-armed bandits. More particularly, the present invention may use the taxonomy structures and may perform shrinkage estimation in a Bayesian framework to exploit dependencies among the arms, thereby enhancing exploration without losing efficiency on short term exploitation.
As will be seen, the framework of the present invention may be used for many online applications including content match applications for placing advertisements on web pages to maximize total revenue from user clicks. As will be understood, the various block diagrams, flow charts and scenarios described herein are only examples, and there are many other scenarios to which the present invention will apply.
Turning to
In various embodiments, a computer 202, such as computer system 100 of
In general, the matching engine 204 may provide services for matching objects classified in one taxonomy with objects classified in another taxonomy by running multi-armed bandits for multiple levels of the taxonomies in order to maximize an overall payoff. The matching engine 204 may include an index generator 206 for generating one or more indexes 222 for accessing multiple taxonomies 214 and payoff probabilities 224, a multi-armed bandit engine for running bandits to determine payoff probabilities for matching an object from a taxonomy with objects from another taxonomy, and a shrinkage estimator 210 for performing shrinkage estimation of the payoff probabilities for matched objects from the taxonomies. Each of these modules may also be any type of executable software code such as a kernel component, an application program, a linked library, an object with methods, or other type of executable software code.
There are many applications which may use the present invention for matching objects classified in one taxonomy with objects classified in another taxonomy. For example, applications like product recommendation or content match for placing appropriate ads on web pages may use the present invention. In the case of an application for product recommendation, unique visitors arriving sequentially to a website may be classified in a taxonomy of users and may be matched to products classified in a taxonomy of products with the objective of maximizing total sales revenue. Similarly, for an application like content match, web pages arriving sequentially may be classified in a taxonomy of web pages and may be matched to ads classified in a taxonomy of ads with the objective of maximizing total revenue from user clicks. Those skilled in the art will appreciate that the techniques of the present invention are quite general, and will also apply for other applications where random objects of a set may arrive sequentially, may be classified in a taxonomy, and may be matched to other objects classified in another taxonomy.
Returning to
In an embodiment, an optimal matching of web page classes to ad classes may be learned using multi-armed bandits. For each page class, a v-armed bandit may be created, where there may be an arm for each of the ad classes so that v=|A| and the payoff probabilities may be derived from the CTR values. Thus, u bandits may run simultaneously, where u=|S|. In general, those skilled in the art may appreciate that a multi-armed bandit derives its name from an imagined slot machine with k≧2 arms. The ith arm may have a payoff probability pi which may be unknown. When arm i may be pulled, a player may win a unit reward with payoff probability pi. The objective is to allocate N successive pulls of the slot machine so as to maximize the total expected reward. This gives rise to a dilemma between exploring arms with unknown payoff probabilities in order to gather information and exploiting arms with the best payoff probabilities empirically estimated so far. A bandit policy or allocation rule may provide an adaptive sampling process that provides a mechanism to select an arm at any given time instant based on all previous pulls and their outcomes. A popular metric to measure performance of a policy is called regret, which is the difference between the expected reward obtained by playing the best arm and the expected reward obtained by the policy under consideration. A large body of bandit literature has considered the problem of constructing policies that achieve tight upper bounds on regret as a function of the time horizon N (total number of pulls) for all possible values of the payoff probabilities. The seminal work of T. Lai and H. Robbins, Asymptotically Efficient Adaptive Allocation Rules, Advances in Applied Mathematics, 6:4-22, 1985, showed how to construct policies for which the regret is of O(log N) asymptotically for all values of the payoff probabilities. They further proved an asymptotic lower bound of order log N for the regret and constructed policies that achieve it. Subsequent work has constructed policies that are simpler and achieve the logarithmic bound uniformly rather than asymptotically. (See for example, P. Auer, N. Cesa-Bianchi, and P. Fischer, Finite-time Analysis of the Multiarmed Bandit Problem, Machine Learning, 47:235-256, 2002 and the references therein.) The main idea in all these policies is to associate with each arm a priority function consisting of the current empirical payoff probability estimate plus a factor that depends on the estimated variability. By sampling the arm with the highest priority at any point in time, arms with little information may be explored and arms which are known to be good based on accumulated empirical evidence may be exploited. As N increases, the sampling variability may be reduced, resulting in convergence to an optimal arm.
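By way of illustration only, the following is a minimal, self-contained sketch of such a priority-based policy, using the UCB1 scheme of Auer et al. cited above; the payoff probabilities, variable names, and horizon in the example are hypothetical and are not part of the described system.

```python
import math
import random

def ucb1_priority(successes, pulls, total_pulls):
    """UCB1 priority: empirical payoff estimate plus an exploration bonus."""
    if pulls == 0:
        return float("inf")  # unexplored arms get the highest priority
    return successes / pulls + math.sqrt(2.0 * math.log(total_pulls) / pulls)

def run_ucb1(payoff_probabilities, horizon=10000, seed=0):
    """Play a k-armed bandit for `horizon` pulls and return total reward and pull counts."""
    rng = random.Random(seed)
    k = len(payoff_probabilities)
    successes = [0] * k
    pulls = [0] * k
    total_reward = 0
    for t in range(1, horizon + 1):
        # Pull the arm with the highest priority at this time instant.
        arm = max(range(k), key=lambda i: ucb1_priority(successes[i], pulls[i], t))
        reward = 1 if rng.random() < payoff_probabilities[arm] else 0
        successes[arm] += reward
        pulls[arm] += 1
        total_reward += reward
    return total_reward, pulls

if __name__ == "__main__":
    # Example: three arms with payoff probabilities unknown to the policy.
    reward, pulls = run_ucb1([0.02, 0.05, 0.11])
    print("total reward:", reward, "pulls per arm:", pulls)
```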
It is important to note that in constructing multi-armed bandits for learning the optimal matching of web page classes to ad classes, the v arms of each bandit created for a page class, and the bandits themselves, may not be independent of each other since S and A may be partitioned into page-class groups and ad-class groups. In particular, the arms in the same group may be likely to have similar payoff probabilities. By exploiting this structure, bandit policies may be constructed that may be optimal asymptotically and yet may achieve better performance in the short run.
Consider, for instance, the suffix ij to denote a pair corresponding to a page class si and an ad class aj. Also consider πi and kj to denote group IDs of a page-class group and an ad-class group respectively. Bπikj may then denote the group or block of all pairs whose page class belongs to the page-class group πi and whose ad class belongs to the ad-class group kj.
For any set U of pairs corresponding to a page class si and an ad class aj, consider pU, SU and NU to denote the true CTR, the number of clicks and the sample size (number of impressions or pulls) after the nth allocation may have been made. Also, consider {circumflex over (p)}U=SU/NU to denote the maximum likelihood estimate of pU and CVU=√((1−{circumflex over (p)}U)/(NU{circumflex over (p)}U)) to denote an estimated coefficient of variation for U (assuming a binomial distribution with uniform CTR for pairs of U). Also consider CVπikj to denote the estimated coefficient of variation for the block Bπikj.
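As a small illustration of the statistic just defined, the following sketch computes the estimated coefficient of variation for a set of pairs under the stated binomial assumption; the function name and the threshold used in the example are arbitrary assumptions.

```python
import math

def estimated_cv(clicks, impressions):
    """Estimated coefficient of variation of the CTR estimate for a set of pairs,
    assuming a binomial model with a common CTR across the pairs."""
    if impressions == 0 or clicks == 0:
        return float("inf")  # no information yet: maximal uncertainty
    p_hat = clicks / impressions
    return math.sqrt((1.0 - p_hat) / (impressions * p_hat))

# Example: a block with 12 clicks over 4000 impressions.
cv = estimated_cv(12, 4000)
print("estimated CV:", round(cv, 3), "reliable:", cv < 0.5)  # 0.5 is an arbitrary threshold
```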
The feature spaces for matching web pages to ads may be extremely large. For instance, there may be billions of pages and millions of ads. In practice, the data for CTRs may be extremely sparse since only a few interactions may be observed for a majority of page-ad feature pairs. However, a small fraction of page-ad pairs may have relatively higher CTRs. This may provide an ideal situation for improving overall estimation accuracy by using Bayesian smoothing or shrinkage estimation. The method assumes that the CTR values pij may be drawn from a prior distribution F({pij};θ) that depends on a parameter vector θ to be estimated from data. The posterior distribution of pij values may provide “smooth” estimates with better mean squared error compared to a simple scheme like maximum likelihood estimation under the assumption of independence. However, the degree of smoothing may depend on the choice of F. Advantageously, the presence of groups or blocks BIJ derived from the taxonomies enables a separate prior distribution to be estimated for each group or block. In an embodiment, smoothing across groups or blocks may be introduced through hyperpriors on group or block priors.
Since better estimation may depend on being able to estimate prior distributions for each group or block, a multi-stage allocation strategy may be employed that runs a bandit at the group level on the k2 distinct sets Bπikj.
Given an arriving page class si, the multi-level policy may run bandits at multiple levels of the taxonomies during the allocation step. For example, in an embodiment the multi-level policy may run bandits at two levels of the taxonomies: first, a bandit may be run over the groups or blocks Bπikj corresponding to the page-class group πi to select an ad class group, and then a bandit may be run on the ad classes of the selected ad class group to select an ad class.
At step 602, the web page may be mapped to a web page class group. In an embodiment, the node of the first taxonomy assigned the web page may be mapped to a group of nodes of the first taxonomy representing a web page class group that includes the web page class assigned the web page. At step 604, it may be determined whether a policy criterion may be less than a threshold. In an embodiment, a statistical criterion based on CVπikj may be used as the policy criterion.
A bandit may then be run on the ad classes of the selected ad class group at step 610 to select an ad class. In an embodiment, an ad class may be selected at step 610 using a multi-level policy for the second stage in which
{circumflex over (p)}ik may be the estimated CTR for the pair of page class si and ad class ak based on the model fit in the block Bπikj.
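By way of illustration only, the following is a minimal, self-contained sketch of the two-stage allocation just described, using a UCB-style priority at both levels; the data layout, names, and numbers are assumptions for illustration and not the claimed implementation.

```python
import math

def priority(clicks, pulls, total_pulls):
    """UCB-style priority: estimated CTR plus an exploration bonus."""
    if pulls == 0:
        return float("inf")
    return clicks / pulls + math.sqrt(2.0 * math.log(total_pulls) / pulls)

def allocate(page_group, group_stats, class_stats, total_pulls):
    """Two-stage allocation for an arriving page class.

    group_stats[page_group][ad_group]      -> (clicks, pulls) aggregated per block
    class_stats[page_group][ad_group][ad]  -> (clicks, pulls) per page-ad class pair
    """
    # Stage 1: bandit over the ad-class groups (blocks) for this page-class group.
    blocks = group_stats[page_group]
    ad_group = max(blocks, key=lambda g: priority(*blocks[g], total_pulls))
    # Stage 2: bandit over the ad classes within the selected ad-class group.
    classes = class_stats[page_group][ad_group]
    ad_class = max(classes, key=lambda a: priority(*classes[a], total_pulls))
    return ad_group, ad_class

# Example with one page-class group, two ad-class groups, and two ad classes each.
group_stats = {"P1": {"A1": (30, 2000), "A2": (5, 1500)}}
class_stats = {"P1": {"A1": {"a1": (18, 900), "a2": (12, 1100)},
                      "A2": {"a3": (4, 700), "a4": (1, 800)}}}
print(allocate("P1", group_stats, class_stats, total_pulls=3500))
```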
The multi-level policy may use any multi-armed bandit as a subroutine. For instance, the UCB1 scheme described by P. Auer, N. Cesa-Bianchi, and P. Fischer (see Finite-time Analysis of the Multiarmed Bandit Problem, Machine Learning, 47:235-256, 2002) may be used in an embodiment. The optimal ad class k* corresponding to a page class si may then be determined by the function k*=arg maxk {{circumflex over (p)}ik+√(2 ln n/Nik)}, where n may denote the total number of pulls made so far and Nik may denote the number of pulls of the pair corresponding to page class si and ad class ak.
The priorities of the arms may be obtained by superimposing estimated CTRs with a component that denotes the size of an upper one-sided confidence interval containing the true CTR with overwhelming probability. The first component may help in exploiting good ad classes while the second component supports exploration. This policy may have a logarithmic regret uniformly in the number of pulls.
After the nth arriving page class may be allocated to an ad class aj resulting in a click or no-click, then the multi-stage allocation policy may perform an estimation step to estimate CTR values for pairs of the group or block Bπikj.
In performing the estimation step, it may be assumed that the number of clicks Sij is binomially distributed such that Sij|pij˜Bin(Nij,pij), where X|Y may denote the conditional distribution of X given Y, Nij may represent the total number of observations (henceforth, sample size) of pair siaj, and pij may represent the true CTR of pair siaj. Further assume that all Sij are conditionally independent given the pij. If the Nij are large, the true CTRs may be estimated for pairs using maximum likelihood estimators (MLE) {circumflex over (p)}ij=Sij/Nij. Although some pairs of page class and ad class may have higher CTRs (e.g., ski ads may have higher CTRs on pages about winter sports), a majority of pairs of page class and ad class may have low CTRs and hence may receive relatively fewer pulls by the bandit policy, leading to small sample sizes Nij for estimating the CTRs of those pairs. Because a large sample size may imply better information about the CTR of a pair of page class and ad class, a shrinkage estimator may be applied in which the estimate for a particular pair is a convex combination of a global estimator and an estimator (usually the MLE) derived exclusively from that pair's own data. If the MLE is based on a large sample size, more weight may be given to the MLE; if the MLE is based on a small sample size, more weight may be given to the global estimator.
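By way of illustration only, the following sketch shows the convex-combination form of such a shrinkage estimate; the particular weight N/(N+γ), where γ plays the role of an effective prior sample size, is the standard empirical-Bayes form for a beta prior and is an assumption here, as are the function names and the numbers in the example.

```python
def shrinkage_estimate(clicks, impressions, global_ctr, effective_prior_size):
    """Convex combination of the pair's MLE and a global (block-level) estimate.
    The weight on the MLE grows with the pair's own sample size."""
    if impressions + effective_prior_size == 0:
        return global_ctr
    weight = impressions / (impressions + effective_prior_size)
    mle = clicks / impressions if impressions > 0 else 0.0
    return weight * mle + (1.0 - weight) * global_ctr

# A pair with little data is pulled strongly toward the block-level CTR ...
print(shrinkage_estimate(1, 20, global_ctr=0.03, effective_prior_size=200))
# ... while a pair with plenty of data keeps an estimate close to its own MLE.
print(shrinkage_estimate(700, 10000, global_ctr=0.03, effective_prior_size=200))
```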
An empirical Bayes approach based on a beta-binomial model may provide an attractive way to accomplish shrinkage estimation. In particular, {pij:ijεBIJ} may be drawn from a beta distribution with parameters αBIJ and γBIJ, where αBIJ may denote the prior mean CTR of block BIJ and γBIJ may denote the effective sample size of the prior; for brevity, the block subscripts are dropped below and the parameters are written as α and γ.
It may be instructive to look at the mean and variance of Sk after marginalizing over pk. The mean may be represented by E(Sk)=Nkα and the variance of Sk may be represented by Var(Sk)=Nkα(1−α)[1+(Nk−1)/(γ+1)]. When compared to the variance of a binomial model with parameters Nk and α, the variance Var(Sk)=Nkα(1−α)[1+(Nk−1)/(γ+1)] involves an additional factor which is a function of γ. This may account for the extra-binomial variation, or over-dispersion, which may be present in the CTR data. For additional details of a beta-binomial distribution, see M. J. Kahn and A. E. Raftery, Discharge Rates of Medicare Stroke Patients To Skilled Nursing Facilities: Bayesian Logistic Regression With Unobserved Heterogeneity, Journal of the American Statistical Association, 91:29-41, 1996.
Thus, the CTR estimates used at the second stage of the multi-level policy may be derived from a beta-binomial model if the beta-binomial model is a good fit for the group or block. In particular, the CTR estimates may be taken to be the posterior mean, and sample sizes may be adjusted by adding the effective sample size parameter from the beta prior distribution. The prior distributions may be estimated quickly during the first stage, especially in the beginning when sample sizes may be small. This may provide better estimates of the individual pair CTRs by incorporating the taxonomies in the estimation through a hierarchical Bayesian model. If the beta-binomial model is not a good fit, maximum likelihood estimates may be used.
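The block-level estimation step may be sketched as follows, assuming a rough method-of-moments fit of the beta prior for each block and a fallback to plain maximum likelihood estimates when no over-dispersion is detected; the moment estimator, function names, and data are illustrative assumptions rather than the claimed procedure.

```python
def fit_block_prior(pairs):
    """Rough method-of-moments fit of a beta prior (mean alpha, effective size gamma)
    from per-pair (clicks, impressions) data of one block. Illustrative only."""
    pairs = [(s, n) for s, n in pairs if n > 0]
    total_clicks = sum(s for s, n in pairs)
    total_impr = sum(n for s, n in pairs)
    alpha = total_clicks / total_impr
    p_hats = [s / n for s, n in pairs]
    mean_p = sum(p_hats) / len(p_hats)
    var_p = sum((p - mean_p) ** 2 for p in p_hats) / len(p_hats)
    # Remove the average binomial sampling variance to isolate between-pair variation.
    sampling_var = sum(p * (1 - p) / n for (_, n), p in zip(pairs, p_hats)) / len(pairs)
    between_var = var_p - sampling_var
    if between_var <= 0:
        return alpha, None  # no detectable over-dispersion: fall back to MLEs
    gamma = max(alpha * (1 - alpha) / between_var - 1.0, 1.0)
    return alpha, gamma

def posterior_mean(clicks, impressions, alpha, gamma):
    """Posterior mean CTR under the fitted beta prior; gamma acts as an effective
    sample size added to the pair's own impressions."""
    if gamma is None:  # prior not usable: plain maximum likelihood estimate
        return clicks / impressions if impressions > 0 else alpha
    return (clicks + gamma * alpha) / (impressions + gamma)

# Example block: four page-ad class pairs with (clicks, impressions) counts.
block = [(2, 4000), (40, 5000), (9, 3000), (1, 2000)]
alpha, gamma = fit_block_prior(block)
print([round(posterior_mean(s, n, alpha, gamma), 5) for s, n in block])
```

In this sketch, pairs with few impressions are pulled toward the block-level mean CTR while pairs with many impressions remain close to their own maximum likelihood estimates, mirroring the adjustment of sample sizes by the effective sample size parameter described above.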
Thus, the present invention may match objects belonging to hierarchies by using a multi-level bandit policy to learn an optimal matching between two feature spaces that may be organized as taxonomies. The taxonomies induce dependencies among arms of the bandit which the multi-level policy may exploit in two ways. First, it may enhance exploration with a multistage allocation scheme that matches parents followed by a match among their children. Second, it may improve estimation of rewards through shrinkage estimation in a Bayesian framework. Consequently, the multi-level bandit policy described may perform better than existing bandit policies designed for flat feature spaces.
As can be seen from the foregoing detailed description, the present invention provides an improved system and method for matching objects belonging to hierarchies. Such a system and method may efficiently be used for many online applications including content match applications for placing advertisements on web pages to maximize total revenue from user clicks. The methods described are general and may apply broadly to any learning problems with a hierarchical reward structure. For instance, in reinforcement learning, arms of a bandit may correspond to actions and payoff probabilities may correspond to reward distribution. As a result, the system and method provide significant advantages and benefits needed in contemporary computing and in online applications.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.