The present teaching generally relates to providing electronic content. More specifically, the present teaching relates to content recommendation.
With the development of the Internet and ubiquitous network connections, more and more commercial and social activities are conducted online. Online content is served to millions, some requested and some recommended. For example, a user can request certain content using queries via, e.g., keywords to facilitate searches. In interacting with users, online content serving engines may also, while delivering what a user asked for, provide recommended content to the user via a recommender system. Content recommended by such a recommender system includes general content and/or advertisements (ads). A recommended advertisement that is successfully auctioned may be presented to the user on a webpage having content displayed therein and with an appropriate display arrangement.
A typical recommender system includes a backend server and a front-end serving engine. In the context of ad serving, the backend server is typically used for selecting an ad to be recommended in response to a request from the frontend that intends to participate in an auction to win an impression opportunity on a webpage. An example recommender system 100 is illustrated in
Upon receiving a request for a recommended ad, the ad selection backend server 140 may operate to select one or more ads from an ad storage 160, based on, e.g., user information and the contextual information associated with the ad display opportunity in accordance with some previously trained ad selection models 150. The selected ad(s) may then be sent from the backend server 140 to the frontend serving engine 130 for auction. If the auction for a recommended ad is successful, the recommended ad is displayed to the user 110 on the device 120.
The selection of a recommended ad may be based on, e.g., information surrounding the ad displaying opportunity, including, e.g., information about the user, information characterizing the platform on which the advertisement is to be displayed, or information about the content on the webpage with which an ad is to be displayed. The selection may be performed by an ad selection model that may be trained or optimized with respect to some objective such as maximizing the click through rate (CTR) of the ads. The recommended ad is then sent to the front-end to be used for auction. If the auction is successful, the ad is served to the user. In some situations, a recommended ad may also be displayed in a manner specified by the backend server. In this process, the frontend serving engine 130 serves what is provided by the backend.
There is a need for a solution that can enhance the performance of the traditional approaches in maximizing the revenue of advertising in the recommender systems.
The teachings disclosed herein relate to methods, systems, and programming for information management. More particularly, the present teaching relates to methods, systems, and programming related to hash table and storage management using the same.
In one example, a method, implemented on a machine having at least one processor, storage, and a communication platform capable of connecting to a network, is disclosed for generating combination distributions for ads. Features are computed based on training data associated with ads, each of which has a plurality of attributes. The training data include asset combinations with past performance thereof for each of the ads. Each combination includes multiple assets representing respective attributes of an ad. The features are used in machine learning to obtain an auxiliary model, which is used to generate combination distributions for each ad based on predicted performance for each combination associated with the ad. Such generated combination distributions are sent to an explore/exploit layer (EEL) for a frontend ad serving engine to draw a combination therefrom for an auction winning ad for rendering on a webpage viewed by a user on a user device.
In a different example, a system is disclosed for generating combination distributions for ads. The system includes an asset combination processor, a machine learning engine, an asset combination generator, and an asset combination transmitter. The asset combination processor is configured to extract features based on training data associated with ads, each of which has a plurality of attributes. The training data include asset combinations with past performance thereof for each of the ads. Each combination includes multiple assets representing respective attributes of an ad. The machine learning engine is configured to obtain, via learning, an auxiliary model, which is used by the asset combination generator to generate combination distributions for each ad based on predicted performance for each combination associated with the ad. The asset combination transmitter is configured to send such generated combination distributions to an explore/exploit layer (EEL) for a frontend ad serving engine to draw a combination therefrom for an auction winning ad for rendering on a webpage viewed by a user on a user device.
Other concepts relate to software for implementing the present teaching. A software product, in accordance with this concept, includes at least one machine-readable non-transitory medium and information carried by the medium. The information carried by the medium may be executable program code data, parameters in association with the executable program code, and/or information related to a user, a request, content, or other additional information.
Another example is a machine-readable, non-transitory and tangible medium having information recorded thereon for generating combination distributions for ads. The information, when read by the machine, causes the machine to perform various steps. Features are computed based on training data associated with ads, each of which has a plurality of attributes. The training data include asset combinations with past performance thereof for each of the ads. Each combination includes multiple assets representing respective attributes of an ad. The features are used in machine learning to obtain an auxiliary model, which is used to generate combination distributions for each ad based on predicted performance for each combination associated with the ad. Such generated combination distributions are sent to an explore/exploit layer (EEL) for a frontend ad serving engine to draw a combination therefrom for an auction winning ad for rendering on a webpage viewed by a user on a user device.
Additional advantages and novel features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The advantages of the present teachings may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities and combinations set forth in the detailed examples discussed below.
The methods, systems and/or programming described herein are further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:
In the following detailed description, numerous specific details are set forth by way of examples in order to facilitate a thorough understanding of the relevant teachings. However, it should be apparent to those skilled in the art that the present teachings may be practiced without such details. In other instances, well known methods, procedures, components, and/or systems have been described at a relatively high level, without detail, in order to avoid unnecessarily obscuring aspects of the present teachings.
The present teaching discloses solutions of adding a thin explore/exploit layer (EEL) at a frontend ad serving engine of a recommender system to provide an additional degree of freedom (DOF) to the operation. Each ad may have different attributes such as a title, an image, and a description. Although traditionally each ad may be displayed in a fixed manner, recently for each dynamic-creative optimization (DCO) ad with multiple attributes, advertisers may provide multiple assets for each attribute. For example, if three assets are provided for each of the three attributes of an advertisement, there are 3×3×3=27 possible combinations that the system can choose from. These alternative combinations correspond to different ways to render the DCO ad. Combination distributions may be created at the backend server according to different objectives to maximize the return of displaying ads and transmitted, on a regular basis, to the EEL at the frontend serving engine so that, for each successfully auctioned DCO ad, a specific combination may be drawn from available combination distributions for rendering to a user at an ad slot. Each alternative combination corresponds to a particular mix of assets of the attributes of the advertisement.
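As an illustration of the 3×3×3=27 example above, the following minimal sketch enumerates every combination of a DCO ad from its per-attribute assets. The asset values and the function name enumerate_combinations are hypothetical and are used only for illustration; they are not part of the present teaching.

```python
from itertools import product

# Hypothetical assets an advertiser might supply for one DCO ad (illustrative only).
assets = {
    "title": ["Title A", "Title B", "Title C"],
    "image": ["img_1.png", "img_2.png", "img_3.png"],
    "description": ["Desc X", "Desc Y", "Desc Z"],
}


def enumerate_combinations(assets_by_attribute):
    """Return every combination as a dict mapping attribute -> chosen asset."""
    attributes = list(assets_by_attribute)
    return [
        dict(zip(attributes, choice))
        for choice in product(*(assets_by_attribute[a] for a in attributes))
    ]


combinations = enumerate_combinations(assets)
print(len(combinations))  # 27
```

Each entry of the resulting list is one way to render the DCO ad, i.e., one particular mix of a title, an image, and a description.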
The combination distributions may be created based on predicted performance scores with respect to different situations. This is to facilitate evaluation of a combination according to each specific situation because each combination may yield different performances under different conditions. For instance, a combination of a DCO ad rendered to a user in a female young professional social group may yield a different expected performance than that for a user in a male young professional social group. Thus, to support the real-time draw of a combination at the frontend serving engine, combination distributions of DCO ads may be generated with probabilities associated therewith determined based on predicted performances with respect to some criteria (e.g., CTR, conversion rate CVR, etc.). The predicted performances may be provided by some prediction model that is trained via machine learning and optimized with respect to ads, contexts, as well as different classes of users. The present teaching discloses different embodiments of generating combination distributions using different prediction models trained, via machine learning, based on different optimization criteria.
The ad recommendation backend server 230 herein is also provided to obtain asset combination prediction models 260 via machine learning based on training data (stored in the asset combinations archive 270) with combinations used for displaying ads and additional information such as click or conversion activities from users or lack thereof. Such obtained asset combination prediction models 260 may be used by the ad recommendation backend server 230 to generate combination distributions according to predicted performances. With training data continually collected, the asset combination prediction models 260 may be updated via retraining and the combination distributions may also be accordingly updated over time. Such generated combination distributions may be periodically sent to the frontend ad serving engine 210 to create the EEL 220 for on-the-fly drawing of a most appropriate combination given a DCO ad and other information related to each ad display opportunity. When the recommended DCO ad wins the auction, the ad serving engine 210 draws a specific combination from EEL 220 based on, e.g., the DCO ad identifier and the segment key associated with the user 110. The drawn combination of the DCO ad is then used to render the DCO ad on the device 120 to the user. The subsequent action of the user on the rendered DCO ad may then be collected and provided to the ad recommendation backend server 230 as continuously collected training data for updating both the ad selection models 240 and the asset combination prediction models 260 (not shown in
A more general situation is shown in
Each of the combinations in 360 may be used for rendering when the ad 300 is to be displayed. As discussed herein, the determination of the combination for rendering may be made in such a manner to achieve some objective, such as maximizing the return on displaying the ad. To facilitate that, each combination may be assessed with respect to its predicted return by a prediction model trained based on past historic performance data related to different combinations. In some embodiments, the predicted performance may be provided with some indication on, e.g., the confidence in the predicted performance. In some situations, the predicted performance of a combination may vary with a change of the display environment, i.e., a combination that had yielded a good return from a particular user in a specific setting may not lead to the same performance for other users or in other settings. For example, if a combination of a DCO ad is used to render the DCO ad to users in the age group of 15-30 and achieves a good return, it does not follow that the same combination, when used to render the DCO ad to a user in the age group of 55-70, will achieve the same return. This is due to the fact that users with different characteristics may appreciate different ways an ad is rendered in different display settings. That is, the performance of a combination may change with different variables or parameters associated with a display opportunity.
As discussed herein, characteristics of users may impact the expected return or performance of a combination of an ad. Other factors may also impact the return of a displayed ad. For instance, people in different age groups and/or different geographical regions may have different preferences as to preferred rendering styles.
A traffic segment can be defined based on users. For instance, a user segment may be a cohort according to some criteria and may include a plurality of users who meet the criteria and may share some common characteristics or traits. For example, some user segments may be formed based on age, some may be based on profession, some may be based on interest, such as hobbies, etc. User segments may also be organized as hierarchies with some user segments having sub user segments. This is illustrated in
To facilitate a determination of a best combination of a DCO ad given a particular rendering environment, the performance of each combination with respect to different rendering environments may be predicted by, e.g., prediction models trained based on past performance data. Through such past performance data, the prediction models may learn what parameters or information in the rendering environment may influence the performance of a combination. Such knowledge may be captured in the prediction models so they can then be used to predict the performance of a combination given the parameters associated with the rendering environment. Parameters in a rendering environment that may influence expected return may include, e.g., user information, a locale of the user, or a platform on which a combination is to be rendered, etc. For instance, information about the user may indicate a cohort or a user segment to which the user belongs. The predicted performance may be used to generate a combination distribution for each DCO ad. For instance, assume there are 4 combinations associated with a DCO ad with a distribution (0.1, 0.1, 0.2, 0.6) for a particular traffic segment. With this distribution, if the DCO ad wins an auction, then each of the four combinations will be drawn according to its probability. That is, on average, each of combinations 1 and 2 is to be selected 10% of the time; combination 3 is to be selected 20% of the time, and combination 4 will be selected 60% of the time. That is, a distribution is provided with each combination having a probability indicative of a likelihood of achieving the predicted performance given certain contextual information.
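The four-combination example above (10%, 10%, 20%, and 60%) can be sketched as a weighted random draw. The following is a minimal sketch only; the function name draw_combination and the dictionary layout are assumptions made for illustration, and the seeded generator is used solely to make the simulation repeatable.

```python
import random
from collections import Counter


def draw_combination(distribution, rng=random):
    """Draw one combination id with probability equal to its weight."""
    ids = list(distribution)
    weights = [distribution[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


# Distribution for one DCO ad and one traffic segment (values from the text above).
distribution = {1: 0.1, 2: 0.1, 3: 0.2, 4: 0.6}

# Simulate many auction wins to check the long-run selection shares.
rng = random.Random(0)
n_draws = 50_000
counts = Counter(draw_combination(distribution, rng) for _ in range(n_draws))
shares = {cid: counts[cid] / n_draws for cid in distribution}
print(shares)
```

Over many draws the observed shares approach the probabilities in the distribution, while any single draw may still return a low-probability combination, which is what allows exploration alongside exploitation.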
On the other hand, the asset combination prediction models 260 are used for predicting the performance of each combination of assets for a DCO ad. To train the asset combination prediction models 260, training data may indicate specific combinations of assets for displayed ads as well as information about the user activities that occurred in response to the displayed combinations. The collected combination data may first be processed by an asset combination processor 420 and may then be filtered by the training data filter 430 prior to machine learning. As discussed herein, the filtering may be based on a minimum time period for the underlying data or according to the activeness of the ads, as data related to inactive ads should not influence how to optimize the ad recommendation.
Once the training data is processed and filtered in accordance with, e.g., application needs, the machine learning engine 440 learns, based on the training data, the ad selection models 240 and asset combination prediction models 260. As discussed herein, using the ad selection models 240, the ad recommendation backend server 230 selects, in response to a request for a DCO ad for participating in a bid for an ad display opportunity, an ad from the ad storage 250, that is optimized with respect to certain targeted performance, such as maximizing the click through rate (CTR). The selection is made based on information, which may be received with the request, including, e.g., user information, a webpage for the ad, or information about the content on the webpage, etc. The selected advertisement is then recommended for auction.
The asset combination prediction models 260 are obtained, via learning, based on past data on combinations used to display ads to different users under different conditions as well as the user activities in response to the rendered combinations of such display ads (clicks, conversions, etc.). Such learned asset combination prediction models 260 may then be used for predicting the performance of each combination for each ad (from the ad storage) with respect to, e.g., each of the traffic segments. Based on the predicted performances for each combination with respect to each traffic segment, the combination distributions are generated by calculating a probability for each of the combinations. Such combination distributions based on predicted performances may be regularly sent to the EEL 220 at the frontend so that the ad serving engine 210 may draw, at the serving time after a DCO ad wins the auction, a best combination of the DCO ad that optimizes against some criterion. For example, the criteria may be CTR, CVR, or any other performance related criteria for maximizing the revenue from display ads. These criteria may also be used during training as objective functions so that the trained models capture how different parameters related to displaying ads impact the return.
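The text above states that a probability is calculated for each combination from its predicted performance but does not fix a formula at this point. The sketch below assumes the simplest possible rule, proportional normalization of predicted scores per (ad, traffic segment) pair; this rule, the function name to_distribution, and the sample values are illustrative assumptions, not the method of the present teaching.

```python
def to_distribution(predicted):
    """Normalize non-negative predicted scores into probabilities summing to 1."""
    total = sum(predicted.values())
    if total <= 0:
        # No usable signal: fall back to a uniform distribution (an assumption).
        return {cid: 1.0 / len(predicted) for cid in predicted}
    return {cid: score / total for cid, score in predicted.items()}


# Hypothetical predicted CTRs, keyed by (ad id, traffic segment), then combination id.
predicted_ctr = {
    ("ad_42", "age_18_30"): {1: 0.01, 2: 0.01, 3: 0.02, 4: 0.06},
    ("ad_42", "age_55_70"): {1: 0.03, 2: 0.01, 3: 0.01, 4: 0.05},
}

distributions = {key: to_distribution(scores) for key, scores in predicted_ctr.items()}
print(distributions[("ad_42", "age_18_30")])
```

Keying the result by ad and traffic segment mirrors the description that the same combination may perform differently for different segments, so each segment receives its own distribution.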
At the backend, the models 240 and 260 may be regularly updated by the machine learning engine 440 by continued learning using continuously arriving training data. That is, the ad selection models 240 are regularly kept up to date so that the selection of an ad for an auction can be made in a manner that is consistent with the current online dynamics. Similarly, the asset combination prediction models 260 may also be regularly learned and updated with the most recent data collected from real operations. Whenever the asset combination prediction models 260 are updated, they may then be used to update the predicted performances of the combinations and to generate the updated combination distributions. The updated combination distributions may then be transmitted to the EEL 220 at the frontend so that each ad may be served based on a combination drawn from the most recently updated combination distributions.
In some embodiments, the frontend ad serving engine 210 may be distributed, i.e., there may be multiple frontend serving engines distributed in, e.g., different jurisdictions, each of which may be for serving ads in a designated jurisdiction with ads allocated to that jurisdiction. For example, ads related to snow tires may be allocated in winter season only to states that get snow. In this case, the backend server may transmit, to each frontend serving engine serving ads in a designated jurisdiction, only a subset of the combination distributions related to the ads allocated to that jurisdiction. As another example, different frontend serving engines may be responsible for ad display on different designated online platforms, respectively. In this case, the combination distributions to be transmitted to each frontend ad serving engine may also be a subset of combination distributions related to the ads to be displayed on certain online platforms. Selective transmission of combination distributions may be achieved by the asset combination transmitter 480 based on selection criteria configurations stored in 470, which may specify which frontend serving engine is to receive combination distributions associated with which DCO ads.
The selection criteria configurations stored in 470 may specify in what situations, which subsets of combination distributions are to be transmitted to which frontend. Definitions of different traffic segments such as user segments may be provided so that each situation may be mapped to a specific traffic segment according to the definitions. In addition, allocation of ads to each jurisdiction may also be specified in 470 to facilitate the selection of a subset of combination distributions to be transmitted to a frontend ad serving engine. Furthermore, criteria related to other considerations may also be specified. For instance, each frontend ad serving engine may be provided to optimize performance in accordance with some objective, e.g., maximizing CTR or CVR. The asset combination prediction models 260 may be trained to operate with respect to different objectives so that they may be used to create different combination distributions, each directed to certain objectives. Some combination distributions may be provided with the predicted performance with respect to, e.g., CTR and some with respect to CVR. At the time of selecting a subset of combination distributions for, e.g., transmission, the selection may be made based on the criteria to be maximized.
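The selective transmission described above can be sketched as a filter over the full set of combination distributions. The sketch assumes a hypothetical configuration layout (a set of allocated ad ids and one objective per frontend) and a hypothetical key layout of (ad id, objective, traffic segment); the actual contents of the selection criteria configurations are not specified at this level of detail.

```python
def select_for_frontend(distributions, config):
    """Keep only distributions for ads allocated to a frontend and its objective."""
    return {
        key: dist
        for key, dist in distributions.items()
        if key[0] in config["ads"] and key[1] == config["objective"]
    }


# Hypothetical distributions keyed by (ad id, objective, traffic segment).
all_distributions = {
    ("snow_tires", "CTR", "age_18_30"): {1: 0.3, 2: 0.7},
    ("snow_tires", "CVR", "age_18_30"): {1: 0.6, 2: 0.4},
    ("sunscreen", "CTR", "age_18_30"): {1: 0.5, 2: 0.5},
}

# Hypothetical selection criteria for two frontend serving engines.
frontend_configs = {
    "frontend_north": {"ads": {"snow_tires"}, "objective": "CTR"},
    "frontend_south": {"ads": {"sunscreen"}, "objective": "CTR"},
}

to_send = {
    name: select_for_frontend(all_distributions, cfg)
    for name, cfg in frontend_configs.items()
}
print(sorted(to_send["frontend_north"]))
```

Under these assumptions each frontend receives only the distributions it can actually use, which keeps the payload sent to each EEL small.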
In some embodiments, once the ad recommendation backend server 230 receives a confirmation from the frontend, its operation may end there until the next request for a recommended DCO ad is received. In this case, to render a DCO ad successfully auctioned, the frontend ad serving engine 210 may then proceed to draw a combination of the recommended DCO ad from the EEL 220 based on, e.g., a user segment identifier from the combination distributions sent to the frontend previously. In some embodiments, the ad recommendation backend server 230 may operate in a different mode of operation, in which it is also involved in recommending a combination of a DCO ad to the frontend serving engine 210. At 545, it is determined whether real-time DCO combination is needed. If it is not needed, the processing returns to step 515 to wait for the next request for a recommended DCO ad.
In a mode of operation where a real time DCO combination is to be identified by the backend server 230, e.g., in the event that the frontend EEL 220 is not operational or could not identify a combination (e.g., the combinations of the DCO ad are not accessible), the frontend ad serving engine 210 may request the backend server 230 for a combination associated with a winning DCO ad. In this case, the ad recommendation backend server 230 may draw, at 555, a combination from the combination distributions that it created previously based on, e.g., the information associated with the DCO ad, including a user segment associated with the user or the webpage on which a combination for the DCO ad is to be displayed. The drawn combination is then sent, at 565, to the frontend serving engine 210 for rendering the DCO ad to the user. Then the processing returns to step 515.
In some embodiments, to facilitate effective learning, training data may be filtered to ensure, e.g., that only appropriate training data is used in machine learning.
Accordingly, the exemplary system diagram of the training data filter 430 comprises a time-based filter 610, a click-based filter 630, and an active state based filter 650. Each of these filters is provided to remove those training data that do not satisfy the requirements associated with minimum time, volume, and activity level. The time-based filter 610 is provided to filter out those training data that do not meet the minimum time period criterion, which is defined by time window filtering criteria 620. The click-based filter 630 is provided to filter out those training data that do not have an adequate level of volume (e.g., the number of clicks) for being used in machine learning. The criterion defining the required volume is stored in click-based filtering criteria 640. The active state based filter 650 is provided to remove training data associated with those DCO ads that are not active based on definitions specified in activeness filtering criteria 660.
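The three filters described above can be sketched as a single pass that keeps a training record only if it satisfies the minimum time, minimum click volume, and activeness requirements. The record fields, the function name filter_training_data, and the threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CombinationRecord:
    """Hypothetical training record for one combination of one DCO ad."""
    ad_id: str
    combination_id: int
    days_observed: int
    clicks: int
    ad_active: bool


def filter_training_data(records, min_days, min_clicks):
    """Apply the time-based, click-based, and active state based filters in turn."""
    kept = []
    for r in records:
        if r.days_observed < min_days:   # time-based filter
            continue
        if r.clicks < min_clicks:        # click-based (volume) filter
            continue
        if not r.ad_active:              # active state based filter
            continue
        kept.append(r)
    return kept


records = [
    CombinationRecord("ad_1", 1, days_observed=14, clicks=120, ad_active=True),
    CombinationRecord("ad_1", 2, days_observed=2, clicks=300, ad_active=True),
    CombinationRecord("ad_2", 1, days_observed=30, clicks=5, ad_active=True),
    CombinationRecord("ad_3", 1, days_observed=30, clicks=500, ad_active=False),
]

kept = filter_training_data(records, min_days=7, min_clicks=50)
print([(r.ad_id, r.combination_id) for r in kept])  # [('ad_1', 1)]
```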
In some embodiments, the training data filter 430 may also filter out combinations with predicted performance below some predefined level of confidence and, to that end, includes a confidence-based filter 670 that removes combinations that do not satisfy the predefined confidence level specified in CS based filtering criteria 680. Details on confidence-based filtering are provided later in this disclosure.
As discussed herein, to determine a combination for a winning DCO ad, there may be alternative ways to achieve that. One is to draw a combination from combination distributions in the EEL 220. If the combination distributions are not available (e.g., either the EEL 220 is not up to date or not yet replenished), the frontend ad serving engine 210 may also request the ad recommendation backend server 230 to draw a combination and provide it to the frontend. Upon a successful auction on a DCO ad, the DCO ad combination selector 760 is invoked to obtain a combination for the DCO ad. It is first determined, at 755, whether combination distributions are available in the EEL 220. If combination distributions are available in the EEL 220, the DCO ad combination selector 760 draws, at 785, a DCO ad combination based on the combination distributions stored in EEL 220. According to the present teaching, given a combination distribution associated with a DCO ad, each of the combinations in the distribution may be associated with a respective probability, and such information may be used to draw one combination from an appropriate distribution. As discussed herein, assume that, for a DCO ad, there are four combinations in a combination distribution with probabilities 0.1, 0.1, 0.2, and 0.6, respectively. Such probabilities may be determined based on the predicted performances as discussed earlier. When this DCO ad wins an auction, a combination may be drawn statistically based on this probability distribution. That is, on average, combination 1 will be selected 10% of the time, combination 2 will be selected 10% of the time, combination 3 will be selected 20% of the time, and combination 4 will be selected 60% of the time.
If no combination distribution is available in EEL 220, the DCO ad combination determiner 760 sends a request to the ad recommendation backend server 230 to request, at 765, a combination for the DCO ad. Then the DCO ad combination determiner 760 receives, at 775, from the backend server 230 a combination for the DCO ad drawn from the combination distribution associated with a traffic segment in accordance with the probabilities associated with the combinations in the distribution. With the combination determined (either drawn from EEL or received from the backend server 230), the DCO ad generator 770 generates the rendering instructions based on the specific combination for the DCO ad and such instructions are then used by the DCO ad renderer 780 to render, at 795, the DCO ad to the user.
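The flow described above (draw from the EEL when a distribution is available, otherwise request a combination from the backend server) can be sketched as follows. The function name choose_combination, the dictionary used to stand in for the EEL, and the request_backend callback are all hypothetical stand-ins introduced for illustration.

```python
import random


def choose_combination(eel, ad_id, segment, request_backend, rng=random):
    """Return a combination id for a winning DCO ad.

    eel: dict mapping (ad_id, segment) -> {combination_id: probability}.
    request_backend: callable(ad_id, segment) -> combination_id, used as fallback.
    """
    distribution = eel.get((ad_id, segment))
    if distribution:
        # A distribution is available locally: draw according to its probabilities.
        ids = list(distribution)
        weights = [distribution[i] for i in ids]
        return rng.choices(ids, weights=weights, k=1)[0]
    # No distribution in the EEL: ask the backend to draw a combination instead.
    return request_backend(ad_id, segment)


eel = {("ad_42", "age_18_30"): {1: 0.1, 2: 0.1, 3: 0.2, 4: 0.6}}


def fake_backend(ad_id, segment):
    """Stand-in for a backend call; always returns combination 1 here."""
    return 1


rng = random.Random(0)
local_draw = choose_combination(eel, "ad_42", "age_18_30", fake_backend, rng)
fallback_draw = choose_combination(eel, "ad_99", "age_18_30", fake_backend, rng)
print(local_draw, fallback_draw)
```

In either branch the caller receives a single combination, which is then used to generate the rendering instructions for the DCO ad.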
As discussed herein, the present teaching allows advertisers to provide several assets for each native ad attribute, creating a plurality of combinations for each DCO ad. Since different combinations may appeal to different crowds, it is recognized that it may be beneficial to present certain combinations preferred by certain users to maximize revenue. In addition, in optimizing the selection of a combination to maximize some objective, different criteria may be used, including CTR or CVR or other optimization criteria. The above discussions provide a general description of the system construct and operational flows at a high level. Depending on the criteria (CTR or CVR) to be optimized to facilitate combination selection for rendering DCO ads at serving time, the detailed operations in how to generate the combination distributions may differ. Below, details related to CTR and CVR based approaches in generating combination distributions are disclosed. It is noted that both CTR and CVR based approaches as disclosed herein adopt the two-stage operation (i.e., auction first and rendering a winning DCO ad second based on a combination selected by maximizing the underlying criteria). Different aspects of the present teaching as disclosed are related to the second stage directed to a winning DCO ad.
First, the CTR based approach is discussed, which employs a post-auction successive elimination based algorithm for ranking DCO combinations according to their measured CTRs. The CTR based approach builds on an ad click prediction method known as One-pass Factorization of Feature Sets (OFFSET). Because the CTR of a combination changes over time, a combination successive elimination (CSE) based solution is employed for DCO ads combination optimization. The CSE solution is to update, in every training period, a distribution over the combinations (based on the combinations' CTR measurements) for each DCO ad and each traffic segment. The combination distributions are periodically transmitted to the EEL 220 at the frontend ad serving engine 210, enabling the frontend to draw the selected combination according to the relevant distribution before rendering the winning DCO ad. In some embodiments, the combination distributions may be packed in a DCO model file when being sent to the frontend. Although deterministic lists of all combinations would suffice in allowing the frontend to select a combination, to ensure a lightweight EEL, the lists are optimized via successive elimination in accordance with the present teaching.
Using the OFFSET ad click prediction algorithm, the predicted click-probability or predicted click-through-rate (pCTR) of a given user u on an ad a is given by
$$\mathrm{pCTR}(u,a)=\sigma\left(b+v_u^{T}v_a\right),\qquad \sigma(x)=\frac{1}{1+e^{-x}}$$
where $v_u, v_a\in\mathbb{R}^{D}$ denote the user and ad latent factor vectors respectively, and $b\in\mathbb{R}$ denotes the model bias. The product $v_u^{T}v_a$ denotes the tendency score of user u towards ad a, where a higher score corresponds to a higher pCTR. Note that $\Theta=\{v_u, v_a, b\}$ are model parameters learned from the training data.
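The prediction described above can be sketched in a few lines, assuming the pCTR takes the logistic form σ(b + v_u^T v_a), i.e., a sigmoid applied to the bias plus the inner product of the user and ad latent factor vectors. This form is an assumption consistent with the definitions given here; the function names and the sample vectors are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pctr(v_u, v_a, b):
    """Predicted CTR from user vector v_u, ad vector v_a, and model bias b."""
    tendency = sum(x * y for x, y in zip(v_u, v_a))  # v_u^T v_a
    return sigmoid(b + tendency)


# Hypothetical D=3 latent vectors and bias, for illustration only.
v_user = [0.2, -0.1, 0.4]
v_ad_close = [0.3, -0.2, 0.5]    # points in a similar direction to the user vector
v_ad_far = [-0.3, 0.2, -0.5]     # points in the opposite direction
bias = -3.0

print(pctr(v_user, v_ad_close, bias) > pctr(v_user, v_ad_far, bias))  # True
```

A larger inner product raises the predicted click probability, which matches the statement that a higher tendency score corresponds to a higher pCTR.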
Both ad and user vectors are constructed using their features, which helps address data sparsity issues. For an ad, a simple summation of the D dimensional vectors of the unique creative identifier (id), campaign id, and advertiser id may be used. The combination of the different user and user context feature vectors may be more elaborate and may allow non-linear dependencies between feature pairs. Context feature vectors may include features such as age, gender, geo, site id, device type, some category taxonomy, time, day, etc. Using such features to represent users allows a model to include only a few thousand latent factor vectors instead of hundreds of millions of unique user latent factor vectors.
To learn model parameters Θ, OFFSET minimizes the logistic loss (or LogLoss) of the training data set (i.e., past impressions and clicks) using one-pass stochastic gradient descent (SGD). OFFSET uses an incremental approach where it continuously updates its model parameters with each new batch of training events (e.g., every 15 minutes for the click model). OFFSET may also include an automatic hyper-parameter online tuning mechanism, which may take advantage of OFFSET's parallel map-reduce architecture and strive to tune its hyper-parameters to match the varying temporal and trending effects of the marketplace.
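The one-pass training idea can be illustrated with a plain stochastic gradient step on the logistic loss, in which each event is visited exactly once in arrival order. The sketch below is not OFFSET itself: it assumes a logistic prediction σ(b + v_u^T v_a), a fixed learning rate, no regularization, and no feature-based vector construction or hyper-parameter tuning. All names and values are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict(v_u, v_a, b):
    """Predicted click probability for one (user, ad) pair."""
    return sigmoid(b + sum(x * y for x, y in zip(v_u, v_a)))


def sgd_step(v_u, v_a, b, clicked, lr=0.1):
    """One SGD update on the logistic loss of a single impression."""
    p = predict(v_u, v_a, b)
    g = p - (1.0 if clicked else 0.0)  # derivative of LogLoss w.r.t. the score
    new_v_u = [x - lr * g * y for x, y in zip(v_u, v_a)]
    new_v_a = [y - lr * g * x for x, y in zip(v_u, v_a)]
    return new_v_u, new_v_a, b - lr * g


# Hypothetical starting parameters and a short stream of click labels.
v_u, v_a, b = [0.1, 0.2], [0.3, -0.1], 0.0
before = predict(v_u, v_a, b)
for clicked in [True, True, True]:  # one pass: every event is used once
    v_u, v_a, b = sgd_step(v_u, v_a, b, clicked)
after = predict(v_u, v_a, b)
print(before < after)  # True: observed clicks push the prediction upward
```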
The disclosed CTR based approach is used to identify and serve the best DCO combinations for optimizing CTR and revenue. As discussed herein, a two-stage process is used. Although the present teaching is mainly directed to the operation related to the second stage, some background information is provided herein. When a user arrives at a website and a native slot on a webpage is to be populated by an ad, an auction takes place. To accommodate the auction, the ad recommendation backend server 230 generates a list of eligible active ads for the user as well as each ad's score. The list of eligible ads generated for a certain user in a certain context is generally related to targeting. The score for each ad is a measure that attempts to rank the ads according to their expected revenue with respect to the arriving user and the context (e.g., user's features, such as age, gender, geographical features, day, time, site, device type, etc.). In general, an ad's score is defined as
Score(u,a)=bid(a)·pCTR(u,a)
where pCTR(u, a) represents a predicted click through rate and is provided by the OFFSET algorithm (see above) and bid(a) (in USD) is the amount of money the advertiser is willing to pay for a click on ad a.
To encourage advertiser truthfulness, the cost incurred by the winner of the auction is according to a generalized second price (GSP), which is defined as
$gsp = \frac{Score(u,b)}{Score(u,a)} \cdot bid(a) = \frac{pCTR(u,b)}{pCTR(u,a)} \cdot bid(b)$
where a and b correspond to the winner of the auction and the runner-up, respectively. Note that by definition $gsp \le bid(a)$, which means the winner will pay no more than its bid. Moreover, the winner will pay the minimal price required for winning the auction. In particular, if both ads have the same pCTR, the winner will pay the bid of the runner-up (i.e., bid(b)).
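The scoring and pricing rules can be illustrated with a short sketch. It assumes the GSP cost equals the runner-up's bid scaled by the ratio of the two pCTRs, which is consistent with the properties stated above (the winner pays at most its bid, and pays bid(b) when the pCTRs are equal). The `Candidate` class and `run_auction` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    ad_id: str
    bid: float   # USD the advertiser is willing to pay per click
    pctr: float  # predicted click-through rate for the arriving user

def score(c: Candidate) -> float:
    """Score(u, a) = bid(a) * pCTR(u, a)."""
    return c.bid * c.pctr

def run_auction(candidates):
    """Return the winning candidate and its GSP cost per click."""
    ranked = sorted(candidates, key=score, reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    # Minimal bid that still keeps the winner's score at the runner-up's score.
    gsp = runner_up.bid * runner_up.pctr / winner.pctr
    return winner, gsp

ads = [
    Candidate("a", bid=2.0, pctr=0.02),
    Candidate("b", bid=1.5, pctr=0.02),
    Candidate("c", bid=3.0, pctr=0.005),
]
winner, gsp = run_auction(ads)
```

In this example ads "a" and "b" have the same pCTR, so the winner "a" pays the runner-up's bid.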
As discussed with respect to
Assume that a DCO ad has M attributes and $m_i$ assets for the i-th attribute, i=1, ..., M. Hence, there are $N=\prod_{i=1}^{M} m_i$ virtual native ad combinations. When the probability of each remaining combination is the same, each combination has an equal chance to be selected from these native ad combinations. As native ad combinations that resemble the surrounding page items are considered less intrusive to the users, such native ad combinations may provide a better user experience in general and likely improve the return. As such, the goal of the present teaching is to use the additional degree-of-freedom of N combinations to improve the CTR of the DCO ads by rendering each DCO ad based on a combination that provides the biggest potential in user experience and revenue.
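As a small illustration of the combination count, the sketch below enumerates the combinations of a hypothetical DCO ad with three attributes. The attribute names and asset identifiers are made up for the example.

```python
from itertools import product
from math import prod

# Hypothetical DCO ad: M = 3 attributes with m_i = 2, 3, and 2 assets.
assets = {
    "title": ["Ti1", "Ti2"],
    "image": ["Im1", "Im2", "Im3"],
    "description": ["De1", "De2"],
}

# Each combination picks exactly one asset per attribute.
combinations = list(product(*assets.values()))

# N is the product of the per-attribute asset counts.
n_combinations = prod(len(v) for v in assets.values())

# Before any elimination, every combination is equally likely to be drawn.
uniform = {c: 1.0 / n_combinations for c in combinations}
```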
As discussed herein, the best combination may vary in accordance with traffic segments, which may include user segments, other types of segments, or some combination thereof. In some embodiments, traffic segments may be defined by configurable segment keys that relate to user features (such as age, gender, geo, etc.) and/or context (such as device, vertical, OS, etc.). For example, if gender (i.e., male, female, and unknown) and device (i.e., mobile, desktop, tablet, and unknown) are used as segment keys, then for each DCO ad, there are 3×4=12 segments (e.g., (female, mobile), (female, tablet), (male, mobile), etc.). For each incoming impression, the segment key values may be extracted, and a corresponding traffic segment may be identified. A segment name may be used to update its corresponding distribution during training, and the corresponding distribution may then be used for drawing a combination at serving time. For instance, assume that an auction winning ad is an active DCO ad A with N combinations $C_A=\{C_n\}_{n=1}^{N}$. The frontend ad serving engine may then extract the segment keys of the impression to determine an impression segment S and locate the respective combination distribution $P_{A,S}=\{p(C_n)\}_{n=1}^{N}$ in the model file. Then, it simply draws one combination according to $P_{A,S}$ for rendering DCO ad A.
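The serving-time lookup and draw can be sketched as below. The in-memory `model` dictionary stands in for the DCO model file, whose actual layout is not specified here; `SEGMENT_KEYS`, `segment_of`, and `draw_combination` are hypothetical names.

```python
import random

SEGMENT_KEYS = ("gender", "device")  # configurable segment keys (illustrative)

def segment_of(impression: dict) -> tuple:
    """Map an impression to its traffic segment; missing keys become 'unknown'."""
    return tuple(impression.get(k, "unknown") for k in SEGMENT_KEYS)

def draw_combination(model: dict, ad_id: str, impression: dict, rng: random.Random):
    """Draw one combination according to the distribution P_{A,S}."""
    dist = model[(ad_id, segment_of(impression))]
    combos = list(dist)
    weights = [dist[c] for c in combos]
    return rng.choices(combos, weights=weights, k=1)[0]

# Stand-in for the model file: one distribution per (DCO ad, segment).
model = {("A", ("female", "mobile")): {"C1": 0.7, "C2": 0.2, "C3": 0.1}}

rng = random.Random(0)
picked = draw_combination(model, "A", {"gender": "female", "device": "mobile"}, rng)
```

The frontend needs nothing beyond this lookup and a weighted draw, which is what keeps the serving-side logic light.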
As discussed herein, the CSE algorithm may be applied to each active DCO ad and traffic segment independently. Successive elimination means that combinations with lower CTRs than that of the “best” combination are gradually eliminated, while the “surviving” combinations are continually explored with a uniform distribution until the surviving set shrinks to a predefined size. When that predefined size is larger than one, the probability mass is divided among the remaining combinations according to their CTRs using a soft-max distribution. In some embodiments, it is also possible to continue the process until only the one best combination remains.
In some applications, it may be assumed that while the CTRs of the combinations vary over time, their relative ranking generally remains the same over the entire lifespan of the DCO ad. This assumption may not hold, however, in certain situations where combinations may strike a temporal trend and combination ranking may change. In this situation, it is reasonable to assume that CTR ranking changes slowly over time so that periodical retraining with respect to each DCO ad may operate to capture or adapt to the new trend.
In each training period of the CSE approach, the following CSE operational steps are performed, as illustrated in
To generate a combination distribution for each DCO ad via successive elimination, relevant statistics may be updated first at 830, which include a number of impressions and/or a number of clicks with respect to each of the combinations. Based on the updated statistics, a “best” combination is identified, at 840, in terms of CTR among all “surviving” combinations. Then the “surviving” combination set $S_A$ is updated, at 850, by eliminating those combinations having CTRs lower than that of the “best” combination with, e.g., a certain level of confidence. With the elimination, the combination distribution $P_A$ is updated, at 860, according to the updated $S_A$. It is then checked, at 865, whether the updated “surviving” set has reached a predefined size $N_{eg}$. If $N_{eg}$ is not yet reached, the process proceeds to the next iteration by going back to step 840. If $N_{eg}$ is reached, the operation for the specific DCO ad is complete and the combination distribution for the DCO ad is derived, at step 870, by dividing the probability mass among the surviving combinations according to their CTRs using a soft-max distribution. After that, it is checked, at 875, whether there are any remaining DCO ads for which combination distributions need to be generated. If combination distributions have been generated for all DCO ads, such combination distributions for DCO ads are packed at 880 so that they can be transmitted to the EEL 220 at the frontend. Otherwise, the process continues to generate the combination distribution for the next DCO ad by going back to step 830.
As discussed with reference to
With regard to eliminating combinations with a predefined level of confidence, a statistical test may be used because the elimination is based on CTR comparison. In some embodiments, the Z-test score may be considered as the basis to derive a reduced test $Z_W(C_n)$ (see step 16 in Algorithm 1). Hence, the Z-test based elimination criterion used herein provides a confidence level larger than $P_{rc}$ (an input parameter of CSE; see Algorithm 1 below) in case
$Z_W(C_n) > Z = Q^{-1}(1 - P_{rc})$
where $Q^{-1}(\cdot)$ is the inverse Q function of a standard normal distribution N(0,1).
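The elimination criterion can be sketched in a few lines. Since the exact reduced test is not reproduced in this text, the sketch substitutes a standard pooled two-proportion Z score as an assumption; the threshold follows the relation above, using the fact that the inverse Q function at 1 − P_rc equals the standard normal inverse CDF at P_rc. The names `z_threshold`, `z_score`, and `eliminate` are hypothetical, and every combination is assumed to have at least one impression.

```python
from math import sqrt
from statistics import NormalDist

def z_threshold(p_rc: float) -> float:
    """Z = Q^{-1}(1 - P_rc), i.e. the standard normal inverse CDF at P_rc."""
    return NormalDist().inv_cdf(p_rc)

def z_score(best, other) -> float:
    """Pooled two-proportion Z score (assumed test); args are (clicks, impressions)."""
    (cb, nb), (co, no) = best, other
    pooled = (cb + co) / (nb + no)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / nb + 1.0 / no))
    return 0.0 if se == 0.0 else (cb / nb - co / no) / se

def eliminate(stats: dict, p_rc: float) -> set:
    """Keep the best combination and all combinations not beaten with confidence P_rc."""
    best = max(stats, key=lambda c: stats[c][0] / stats[c][1])
    z = z_threshold(p_rc)
    return {c for c in stats if c == best or z_score(stats[best], stats[c]) <= z}

# (clicks, impressions) per surviving combination of one DCO ad and segment.
stats = {"C1": (300, 10000), "C2": (290, 10000), "C3": (150, 10000)}
survivors = eliminate(stats, p_rc=0.95)
```

Here "C3" is clearly worse than "C1" and is dropped, while "C2" is too close to call and keeps being explored.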
As discussed herein, the elimination process ends when the “surviving” set $S_A$ includes a predetermined number, $N_{eg}$, of combinations (a system parameter). Then, the probability mass is divided among the endgame set combinations according to the soft-max distribution:
$p(C_n) = \frac{\exp\left(-\alpha\left(1 - CTR(C_n)/CTR(C_b)\right)\right)}{\sum_{C_j \in S_A} \exp\left(-\alpha\left(1 - CTR(C_j)/CTR(C_b)\right)\right)}, \quad C_n \in S_A$
where α>0 is the soft-max factor (a system parameter), and $C_b$ is the “best” combination in terms of CTR. It can be verified that the ratio between the probabilities of two endgame combinations $C_n, C_i \in S_A$ with $CTR(C_n) > CTR(C_i)$ is
$\frac{p(C_n)}{p(C_i)} = \exp\left(\alpha \cdot \frac{CTR(C_n) - CTR(C_i)}{CTR(C_b)}\right)$
which is an increasing function of α. Therefore, combinations with higher CTRs get higher probabilities than those with lower CTRs, and the ratio between these probabilities increases with α.
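A short sketch of the endgame division of probability mass follows. It assumes the soft-max is taken over CTRs normalized by the CTR of the best combination and scaled by the soft-max factor α; `endgame_distribution` is a hypothetical name.

```python
from math import exp

def endgame_distribution(ctr: dict, alpha: float) -> dict:
    """Soft-max over CTRs relative to the best surviving combination (assumed form)."""
    best = max(ctr.values())
    weights = {c: exp(-alpha * (1.0 - v / best)) for c, v in ctr.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

ctr = {"C1": 0.030, "C2": 0.027}  # measured CTRs of the endgame set
low = endgame_distribution(ctr, alpha=1.0)
high = endgame_distribution(ctr, alpha=10.0)

ratio_low = low["C1"] / low["C2"]     # mild preference for the better combination
ratio_high = high["C1"] / high["C2"]  # stronger preference as alpha grows
```

A larger α concentrates more traffic on the better combination, while a smaller α keeps the endgame set closer to uniform exploration.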
The above described CSE approach addresses the issue associated with the varying combination CTRs and impression rates. At any stage, all “surviving” combinations may have the same probability so that each has the same chance to be drawn and used for rendering a DCO ad. In addition, because CSE works incrementally, it can tolerate feedback delays. This approach is also less sensitive to regrets (or revenue loss due to the time it takes for a DCO ad to be resolved) and more focused on identifying the top-K combinations in terms of CTR.
Below is an exemplary implementation of the CSE algorithm in pseudo code.
This two-stage post-auction based CTR approach is provided for its reduced system complexity. As discussed earlier, it does not increase the OFFSET click model size because the combination selection is performed after the auction, and no further index queries are needed to rank the combinations. Instead, only a simple draw-based selection is needed according to the combination distribution.
In addition, to combat the time varying combination CTRs and impression rate, the successive elimination based approach is used and thus, at any stage, all “surviving” combinations have the same probability and have the same chance to be presented. This makes combination CTR measurements as described in Algorithm 1 comparable. The only interface between the DCO system and the serving frontend is via combination distributions. As the CSE algorithm works incrementally, it can tolerate feedback delays.
The CSE approach is also robust and practical. Since it is less sensitive to regrets (or revenue losses due to the time it takes for a DCO ad to be resolved) and more focused on identifying the top-K combinations in terms of CTR, the elimination of combinations does not start until there are $N_{click}^{min}$ clicks, after which combinations that have a lower CTR than the “best” combination with a predefined confidence level are removed. A statistical test is adopted to eliminate combinations because the expected combination CTR differences are not known in advance in real life situations, which is another justification for adopting the successive elimination based approach.
Below, the alternative CVR based approach is described for generating combination distributions to facilitate the selection of the best combinations for DCO ads that maximize CVR. The two-stage solution is also employed in the CVR based approach. To calculate the combination distributions, an auxiliary combination CVR prediction model is employed, which is trained via machine learning and used to estimate predicted CVRs (pCVR) for combinations with respect to different traffic segments. The predictions are then turned into combination distributions where higher pCVR combinations are assigned with higher probabilities. When more conversions are accumulated over time and certain DCO ad combinations are predicted to have higher CVRs, the distributions may be adapted to the changing situation via retraining so that the CVR based approach will impress such combinations more frequently than others.
As discussed above, OFFSET is used to drive a click model for predicting a click event (pCTR). OFFSET may also be used to drive other types of native models, such as conversion models, as in the CVR based approach. The CVR based approach developed herein is based on an event prediction method via OFFSET event prediction algorithm. Using the OFFSET event prediction algorithm, the predicted event probability (pEVENT) of a user u and ad a is provided by:
$pEVENT_{u,a} = \sigma(S_{u,a}) \in [0,1]$,
where $\sigma(x) = (1 + e^{-x})^{-1}$ is the logistic sigmoid function, and
$S_{u,a} = b + v_u^T v_a$
where $v_u, v_a \in \mathbb{R}^D$ denote the user and ad latent factor (LF) vectors respectively, and $b \in \mathbb{R}$ denotes the model bias. The product $v_u^T v_a$ indicates the tendency of user u towards ad a, where a higher score translates into a higher pEVENT. It is noted that $\Theta = \{v_u, v_a, b\}$ are the model parameters, which are learned from past events that are logged.
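The event prediction formula can be written out directly. The vectors and bias below are made-up values used only to exercise the function.

```python
from math import exp

def sigmoid(x: float) -> float:
    """Logistic sigmoid (1 + e^{-x})^{-1}."""
    return 1.0 / (1.0 + exp(-x))

def p_event(v_u, v_a, b: float) -> float:
    """pEVENT_{u,a} = sigmoid(b + v_u . v_a) for equal-length LF vectors."""
    s = b + sum(x * y for x, y in zip(v_u, v_a))
    return sigmoid(s)

# Illustrative D = 3 latent factor vectors and a negative bias (events are rare).
p = p_event([0.5, -1.0, 0.25], [1.0, 0.5, 2.0], b=-3.0)
```

With these numbers the score is −3.0 + 0.5 = −2.5, so the predicted probability is well below one half, as expected for a rare event.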
The OFFSET event prediction algorithm may drive, for example, a conversion model for predicting a conversion given a click event (pCONV) or an ad close model for predicting an ad close event (pCLOSE). The ad and user LF vectors may be constructed using their respective features to overcome data sparsity issues (ad events such as a click, a conversion, or a close are quite rare). For ads, a summation of their feature LF vectors (e.g., ad id, campaign id, advertiser id, ad categories, etc., all of dimension D) may be used. Producing the D-dimensional user LF vector is more complicated because the d-dimensional user feature LF vectors are made to interact with one another in order to support non-linear dependencies between feature pairs.
User vectors may be constructed using their K feature learned vectors $v_k \in \mathbb{R}^d$, $k \in \{1, \ldots, K\}$ (e.g., gender values, age values, device types, geo values, etc.). In particular, o entries may be allocated to each pair of user features, and s entries are devoted to each feature vector alone. The dimension of a single feature value vector is therefore $d=(K-1)\cdot o+s$, whereas the dimension of the combined user vector is $D=\binom{K}{2}\cdot o+K\cdot s$. An illustration of this construction is depicted in
The model includes only O(K) LF vectors, one for each feature value (e.g., three for gender: female, male, and unknown) rather than hundreds of millions of unique user LF vectors. To learn the model parameters Θ, OFFSET minimizes the logistic loss (or LogLoss) of the training data set $\mathcal{T}$ (i.e., past negative and positive events) using one-pass stochastic gradient descent (SGD):
$\operatorname{argmin}_{\Theta} \sum_{(u,a,y) \in \mathcal{T}} \left[ -(1-y)\log(1 - pEVENT_{u,a}) - y \log(pEVENT_{u,a}) + \frac{\lambda}{2} \sum_{\theta \in \Theta} \theta^2 \right]$
where $y \in \{0,1\}$ is the indicator (or label) for the event involving user u and ad a, and λ denotes the L2 regularization parameter.
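Returning to the user vector construction described above (o entries per feature pair and s entries per feature alone), the dimension bookkeeping can be checked with a small sketch. How the o pairwise entries are combined is an assumption here: the sketch uses an element-wise product of the two features' dedicated blocks. The block layout and the names `dims`, `build_user_vector`, and `pair_block` are likewise hypothetical.

```python
from itertools import combinations
from math import comb

def dims(K: int, o: int, s: int):
    """Return (d, D): per-feature vector size and combined user vector size."""
    d = (K - 1) * o + s
    D = comb(K, 2) * o + K * s
    return d, D

def build_user_vector(features, o: int, s: int):
    """Combine K feature vectors of length d into one user vector of length D.

    Assumed layout of each feature vector: K-1 blocks of o entries (one per
    other feature, in index order) followed by s standalone entries.
    """
    K = len(features)

    def pair_block(k, partner):
        # Position of `partner` among the other K-1 features of feature k.
        idx = partner if partner < k else partner - 1
        return features[k][idx * o:(idx + 1) * o]

    user = []
    for i, j in combinations(range(K), 2):
        # Assumption: pairwise interaction is the element-wise product.
        user.extend(x * y for x, y in zip(pair_block(i, j), pair_block(j, i)))
    for v in features:
        user.extend(v[(K - 1) * o:])  # the s standalone entries
    return user

K, o, s = 3, 2, 1
d, D = dims(K, o, s)
features = [[float(k + 1)] * d for k in range(K)]  # toy feature value vectors
user_vector = build_user_vector(features, o, s)
```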
According to the present teaching, the OFFSET algorithm may be applied in an incremental mode for training, where it continuously updates its model parameters with each batch of new training events (e.g., every 15 minutes for the click model, or 4 hours for the conversion model). The OFFSET algorithm may also include an adaptive online hyper-parameter tuning mechanism, which utilizes a parallel map-reduce architecture of a native backend platform to attempt to tune OFFSET hyper-parameters to match the varying marketplace conditions (changed by trend and temporal effects).
OFFSET may utilize different types of features. For instance, it may use weighted multi-value feature type, with which the model includes a d-dimension LFV for each of the m feature values seen so far. In this case, the d-dimension vector of this weighted multi-value feature for user u is:
where {li} and {wl
In serving ads, when a native slot needs to be populated by an ad, an auction takes place, with ads selected for the auction based on scores as described above with respect to the CTR based approach. After the auction, the best combination is selected for rendering the winning conversion DCO ad in the current impression. Assume the auction winning ad is an eligible DCO ad A with N combinations $C_A=\{C_n\}_{n=1}^{N}$. The frontend ad serving engine 210 extracts the segment key(s) of the incoming user, determines the traffic segment S, and locates the corresponding combination distribution QA,S={QC
Since CVR is in general much lower than CTR, predictions are used herein instead of event counting when generating combination distributions using the CVR based approach. With this approach, an auxiliary combination CVR prediction model is trained via machine learning and at the end of each training period, the auxiliary combination CVR prediction model is used for predictions in order to generate combination distributions per DCO ad with respect to different traffic segments (see
To predict the CVRs of DCO ads combinations, a variant of the OFFSET event prediction model is used. With this variant, impressions are used as negative events while conversions (both post-click and post-view) are used as positive events. As conversions may be reported with a long delay (up to 30 days after the actual view or click), conversions and their corresponding impressions may not be paired before training the models. Given that, each positive event may also be used in training as a negative event. As a result, the predicted CVRs may be slightly under-predicted and such predictions are to be “corrected” before being turned into the actual combination distributions (described in detail below). In addition, to reduce the backend resources required to train the auxiliary prediction model, the impressions may be down sampled prior to training, which is another reason that the CVR predictions are to be “corrected” prior to turning the predictions into actual distributions. The auxiliary model described herein is trained with respect to all conversion ad traffic instead of only DCO ad traffic for triggering collaborative filtering patterns that help in “filling” the gaps due to conversion data sparsity.
In order to be able to predict CVR for each combination of each DCO ad, an additional ad feature may also be used that specifies the actual assets included in the combination for the DCO ad associated with each event (impression or conversion). For example, assume that a certain DCO ad was impressed using a certain combination that includes an asset for the description attribute with a description ID, e.g., De123, an image asset for the image attribute with an image ID, e.g., Im456, and a title asset for the title attribute with a title ID, e.g., Ti789. Such information may be used to form a multi-value feature {(De123,1),(Im456,1),(Ti789,1)} indicative of the specific assets used to render this impression/event. Such an asset multi-value representation (instead of representing the combination number as an ad feature) is capable of creating dependencies among the DCO ad's combinations that share assets (e.g., share the same title and description but include different images) and triggers collaborative patterns that help in “filling” the gaps caused by data sparsity. For events involving non-DCO ads, this feature may be assigned the “NONDCO” value and a unit weight, i.e., {(NONDCO,1)}. Such constructed features are used as training data to train the auxiliary model.
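The asset multi-value feature can be built with a few lines of code. The event dictionaries and the `asset_feature` helper are hypothetical; only the feature values themselves (asset IDs with unit weights, and the NONDCO value) follow the description above.

```python
def asset_feature(event: dict) -> list:
    """Return the weighted multi-value asset feature for an impression or conversion."""
    assets = event.get("assets")
    if not assets:
        # Events of non-DCO ads carry a single placeholder value.
        return [("NONDCO", 1.0)]
    return [(asset_id, 1.0) for asset_id in sorted(assets.values())]

combo_1 = {"ad_id": "A", "assets": {"description": "De123", "image": "Im456", "title": "Ti789"}}
combo_2 = {"ad_id": "A", "assets": {"description": "De123", "image": "Im999", "title": "Ti789"}}
plain = {"ad_id": "B"}

feature_1 = asset_feature(combo_1)
# Combinations that share assets share feature values, and hence learned vectors.
shared = {a for a, _ in feature_1} & {a for a, _ in asset_feature(combo_2)}
```

Because the two combinations share the description and title values, evidence gathered for one also informs the prediction for the other.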
Such extracted user, ad, and combination asset features are then used to train, at 1010, the auxiliary model via machine learning. As discussed herein, at the end of every training period (e.g., 4 hours), the trained auxiliary model is used to predict the CVRs for the combinations of the DCO ads involved in learning (whose event data are used for training). Each of such DCO ads is identified at 1020 so that the auxiliary model can be used to estimate the CVR predictions for the combinations of the DCO ad with respect to each traffic segment. To do that, the feature values for each particular DCO ad are determined at 1030 and then used to query, at 1040, the auxiliary model using the ad features and other features such as segment keys. Based on the query results, the combination latent factor (LF) vectors are constructed, at 1050, for all combinations of the DCO ad. In some embodiments, the system is designed so that the traffic segment keys correspond to the user features of the auxiliary prediction model; to obtain the user LF vector, the segment key values that define the user segment (e.g., female and desktop for gender and device segment keys) are used to query the model, and the user vector is constructed based on the query result. Upon obtaining the LF vectors for the user and the combinations, the model bias is determined, at 1060, and the equation for $pEVENT_{u,a}$ is used to predict, at 1070, the CVRs of the combinations of the DCO ad. As discussed herein, the predictions are then “corrected,” at 1080, to compensate for the effect of the non-join and impression down sampling operations. To generate the distribution, a SoftMax operation is performed that translates, at 1090, the predictions into distributions. In some embodiments, a uniform component may be added to produce the final distribution. The process repeats until the CVR predictions are computed for all combinations of all DCO ads, as determined at 1095.
After the CVR predictions are completed, all distributions may be integrated into a DCO distributions file (or table) and sent to the EEL 220 so that the frontend ad serving engine 210 may draw a combination for each winning DCO ad involving a particular user at the serving time. As discussed herein, the auxiliary model is to be retrained regularly to adapt to the dynamics of the marketplace. It is determined at 1097 when the retraining is to start. When it is time to restart the next round of the training, the process proceeds to step 1000.
Below, an exemplary pseudo code implementation of an Algorithm 2 for turning predictions into distributions is illustrated.
In implementing the CVR based algorithm, certain operations may be performed to ensure the quality of the learning and derive an effective auxiliary model. For example, DCO ads with fewer than a predefined number of conversions $N_c$ (a system parameter, e.g., $N_c=1$) may be assigned uniform distributions for all traffic segments. In addition, traffic segments with a low maximal prediction $P_M$, i.e., traffic segments with $P_M<P_{min}$ (a system parameter, e.g., $P_{min}$=1e−9), may also be assigned a uniform distribution. Furthermore, inactive ads may be removed from the training data to reduce the auxiliary prediction model size and training time. For example, ads, including both DCO and non-DCO ads, that had neither impressions nor conversions for a predefined time period (e.g., a week) may be classified as inactive ads. If they become active again, they may be treated as new ads when they reappear in traffic.
According to the present teaching, SoftMax may be used to facilitate a controlled mechanism to provide an explore-exploit trade-off and let the system follow trends for presenting “better” combinations more frequently. This choice is merely for illustration rather than limitation. There are other solutions that may be used to achieve the same. For instance, the predictions can be used to select the closest distribution on the unit simplex. As another example, one can simply normalize each prediction by the sum of the predictions. All such solutions are within the scope of the present teaching.
In operation, when a conversion DCO ad A has won the auction for an impression that belongs to a traffic segment u, the objective is to maximize the chance for leading up to a conversion via exploiting the additional degree-of-freedom and selecting a combination of the DCO ad that has the best chance to entail a conversion. Accordingly, the objective is to maximize
$\Pr(\text{conversion} \mid u, A) = \sum_{C \in C_A} \Pr(\text{conversion} \mid C \text{ selected}, u, A) \cdot \Pr(C \text{ selected} \mid u, A)$
where the probability Pr(conversion | C selected, u, A) can be approximated by the true CVR prediction $P_C$ (see step 15 of the Algorithm 2 illustrated above) predicted using the trained auxiliary model, and the probability Pr(C selected | u, A) is the probability $Q_C$ that the combination distribution of DCO ad A assigns to combination C for the traffic segment. A trivial solution to maximize Pr(conversion | u, A) is to assign all the probability mass to the combination that has the highest CVR prediction. However, it may not be desirable to present the same combination repeatedly to all users in the traffic segment (every time DCO ad A wins the auction) because it may enhance the ad fatigue phenomenon and also prevent the system from exploring and following the dynamic trends, which may prefer other combinations in terms of CVR. Therefore, it may be preferable to “pull” the distribution towards a uniform distribution in a controlled manner so that it creates a trade-off between exploration and exploitation. A natural way to do the latter may be to add an entropy regularization term in optimizing Pr(conversion | u, A) and rewrite the above as
$\max_{Q} \; \sum_{C \in C_A} P_C Q_C - \alpha \sum_{C \in C_A} Q_C \ln Q_C \quad \text{s.t.} \quad \sum_{C \in C_A} Q_C = 1, \;\; Q_C \ge 0 \;\; \forall C \in C_A$
where the two constraints are added to ensure $Q=\{Q_C\}$ is a distribution function (in a vector representation), and α>0 is the regularization parameter. Then, a well-known result in convex analysis may be applied: the convex conjugate of $\phi(z)=\sum_{i=1}^{d} z_i \ln(z_i)$ defined over the unit simplex is
$\phi^*(x) = \ln\left(\sum_{i=1}^{d} e^{x_i}\right)$
and according to the conjugate-subgradient theorem, we have
$Q^* = \nabla \phi^*(P/\alpha) = \mathrm{SoftMax}(P/\alpha)$
where $P=\{P_C\}$ are the combinations' predictions (in a vector representation). Note that the result of $Q^*$ is a vector while the result of $\phi^*(x)$ is a scalar. By setting $\alpha=P_M/\beta$ where $P_M=\max\{P_C\}$, and because SoftMax is invariant under a shift, it can be derived that:
SoftMax(βP/PM)=SoftMax(−β(1−P/PM))
In this manner, we obtain the SoftMax combination distribution component of the Algorithm 2 (step 18). The final presentation of the SoftMax argument is preferred to allow relative interpretation, e.g., setting β=6.93 means that a 10% difference between the “best” combination and its runner-up entails approximately twice the probability. It is noted that the uniform components may be added to ensure a minimal amount of exploration.
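The translation of corrected predictions into a distribution can be sketched as follows. This is not Algorithm 2 itself: the uniform mixing weight `eps`, the function name `to_distribution`, and the way the low-conversion and low-prediction guards are passed in are assumptions made for illustration. The SoftMax argument follows the form −β(1 − P/P_M).

```python
from math import exp

def to_distribution(preds: dict, beta: float = 6.93, eps: float = 0.1,
                    n_conv: int = 1, n_c: int = 1, p_min: float = 1e-9) -> dict:
    """Turn per-combination CVR predictions into a draw distribution.

    eps is a hypothetical weight of the uniform (exploration) component;
    n_conv is the number of conversions observed for the DCO ad.
    """
    n = len(preds)
    uniform = {c: 1.0 / n for c in preds}
    p_max = max(preds.values())
    # Fall back to uniform for ads with too few conversions or tiny predictions.
    if n_conv < n_c or p_max < p_min:
        return uniform
    weights = {c: exp(-beta * (1.0 - p / p_max)) for c, p in preds.items()}
    total = sum(weights.values())
    return {c: (1.0 - eps) * weights[c] / total + eps / n for c in preds}

preds = {"C1": 0.010, "C2": 0.009}            # runner-up is 10% below the best
pure = to_distribution(preds, eps=0.0)         # SoftMax component only
mixed = to_distribution(preds, eps=0.1)        # with a uniform component
cold = to_distribution(preds, n_conv=0)        # too few conversions
```

With β=6.93 and no uniform component, the best combination receives about twice the probability of a runner-up whose prediction is 10% lower, since exp(6.93·0.1) is approximately 2.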
As discussed herein, the initial CVR predictions for combinations need to be corrected, and details about the correction operation are provided herein. There are multiple reasons for the correction. One is associated with the down sampling of the impressions (e.g., factor $r_{ds}$=100) performed to reduce the resources needed for training the auxiliary model. Another reason has to do with the fact that the conversions are not “joined” with their corresponding impressions, in order to avoid long training delays (conversions may be reported up to 30 days after they occur). For these reasons, to obtain a “correct” CVR prediction, the initial CVR predictions need to be compensated accordingly with respect to these operations. Below, an approach is disclosed that can accurately, on average, compensate via approximation. Assume there are V conversions and S skips (i.e., impressions without conversions) for a certain ad. Then, the average “raw” CVR with down sampling of $r_{ds}$ and the non-join operation may be written as
$CVR_{raw} = \frac{V}{V + (V+S)/r_{ds}}$
Since the correct average CVR is
$CVR = \frac{V}{V+S}$
it can be verified that
$CVR = \min\left\{ \frac{CVR_{raw}}{r_{ds}\,(1 - CVR_{raw})},\; 1 \right\}$
where the minimum operation is required to keep the corrected CVR from exceeding 1 for very high average “raw” conversion rates.
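The correction can be checked numerically with a short sketch. It assumes that, without joining, all V+S impressions serve as negative events and are down sampled by r_ds while the V conversions serve as positive events, so that the raw rate is V / (V + (V+S)/r_ds) and the corrected rate is raw / (r_ds·(1 − raw)) capped at 1. The function names are hypothetical.

```python
def raw_cvr(v: int, s: int, r_ds: float) -> float:
    """Average rate seen by the model: V positives vs. (V+S)/r_ds sampled negatives."""
    return v / (v + (v + s) / r_ds)

def correct_cvr(raw: float, r_ds: float) -> float:
    """Invert the down sampling and non-join effects; cap the result at 1."""
    if raw >= 1.0:
        return 1.0
    return min(raw / (r_ds * (1.0 - raw)), 1.0)

v, s, r_ds = 50, 99950, 100.0      # 50 conversions out of 100,000 impressions
raw = raw_cvr(v, s, r_ds)          # 50 positives vs. 1,000 sampled negatives
corrected = correct_cvr(raw, r_ds)
true_cvr = v / (v + s)
```

In this example the raw rate is close to 4.8%, far above the true 0.05%, and the correction recovers the true average rate.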
Unlike the CTR based DCO solution, which uses event (i.e., click) counting, the CVR based approach as discussed herein uses combination CVR prediction enabled via an auxiliary CF model and is capable of revealing collaborative conversion patterns that help in “filling” the gaps caused by data sparsity. In addition, the CVR based approach may combat the ad fatigue effect by reducing the probabilities of combinations the users are weary of, thereby increasing the probabilities of other less exploited combinations. Unlike the CTR based DCO successive elimination solution disclosed above, the CVR based approach is “stateless” and therefore may be easily trained on traffic associated with other candidate models. This makes the ramp-up process of the auxiliary model relatively easy compared to that of the CTR based model.
Both the CTR based and the CVR based approaches to generating the combination distributions may be applied to either CVR or CTR optimization goals. That is, conversions can be counted to obtain CVR measurements, and an auxiliary CTR prediction model can be trained to obtain CTR predictions. Both approaches are able to follow and discover trends that affect the attractiveness of different combinations throughout the ad's lifespan because regular retraining dynamically updates the models (the CTR or CVR based prediction models) and, hence, the combination distributions.
To implement various modules, units, and their functionalities described in the present disclosure, computer hardware platforms may be used as the hardware platform(s) for one or more of the elements described herein. The hardware elements, operating systems and programming languages of such computers are conventional in nature, and it is presumed that those skilled in the art are adequately familiar therewith to adapt those technologies to appropriate settings as described herein. A computer with user interface elements may be used to implement a personal computer (PC) or other type of workstation or terminal device, although a computer may also act as a server if appropriately programmed. It is believed that those skilled in the art are familiar with the structure, programming, and general operation of such computer equipment and as a result the drawings should be self-explanatory.
Computer 1200, for example, includes COM ports 1250 connected to and from a network connected thereto to facilitate data communications. Computer 1200 also includes a central processing unit (CPU) 1220, in the form of one or more processors, for executing program instructions. The exemplary computer platform includes an internal communication bus 1210, program storage and data storage of different forms (e.g., disk 1270, read only memory (ROM) 1230, or random-access memory (RAM) 1240), for various data files to be processed and/or communicated by computer 1200, as well as possibly program instructions to be executed by CPU 1220. Computer 1200 also includes an I/O component 1260, supporting input/output flows between the computer and other components therein such as user interface elements 1280. Computer 1200 may also receive programming and data via network communications.
Hence, aspects of the methods of information analytics and management and/or other processes, as outlined above, may be embodied in programming. Program aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Tangible non-transitory “storage” type media include any or all of the memory or other storage for the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide storage at any time for the software programming.
All or portions of the software may at times be communicated through a network such as the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, in connection with information analytics and management. Thus, another type of media that may bear the software elements includes optical, electrical, and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links, or the like, also may be considered as media bearing the software. As used herein, unless restricted to tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
Hence, a machine-readable medium may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, which may be used to implement the system or any of its components as shown in the drawings. Volatile storage media include dynamic memory, such as a main memory of such a computer platform. Tangible transmission media include coaxial cables, copper wire, and fiber optics, including the wires that form a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a physical processor for execution.
Those skilled in the art will recognize that the present teachings are amenable to a variety of modifications and/or enhancements. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be implemented as a software only solution, e.g., an installation on an existing server. In addition, the techniques as disclosed herein may be implemented as a firmware, firmware/software combination, firmware/hardware combination, or a hardware/firmware/software combination.
While the foregoing has described what are considered to constitute the present teachings and/or other examples, it is understood that various modifications may be made thereto and that the subject matter disclosed herein may be implemented in various forms and examples, and that the teachings may be applied in numerous applications, only some of which have been described herein. It is intended by the following claims to claim any and all applications, modifications and variations that fall within the true scope of the present teachings.
The present application is related to U.S. patent application Ser. No. ______ (Attorney Docket No.: 146555.550717), entitled “SYSTEM AND METHOD FOR THIN EXPLORE/EXPLOIT LAYER FOR PROVIDING ADDITIONAL DEGREE OF FREEDOM IN RECOMMENDATIONS”, and U.S. patent application Ser. No. ______ (Attorney Docket No. 146555.552969), entitled “METHOD AND SYSTEM FOR CLICK RATE BASED DYNAMIC CREATIVE OPTIMIZATION AND APPLICATION THEREOF”, both of which are hereby incorporated by reference in their entireties.