The present teaching generally relates to data processing. More specifically, the present teaching relates to big data analytics and modeling thereof.
With the development of the Internet and the ubiquitous network connections, more and more commercial and social activities are conducted online. Digitized content is served or recommended online to millions of users. Advertisements are displayed to the users while they consume the digitized content, and the users can interact with the ads by either viewing them or clicking on them to visit the advertiser's webpage, where they may make a purchase. To make online advertising more effective, targeting has been practiced. This includes targeting users from the perspective of advertisers and selecting appropriate ads for online users who may be interested in the content of the ads. Targeting-related processing is usually behind the scenes, and the goal is to match each ad with the user segments that are most likely to react to the ads positively in order to maximize the financial return. Ad targeting involves prediction of which segments a user or an ad opportunity context belongs to, from a very large list of possible segments. These segment-related products include interest segments, predictive audiences, . . . , lookalike segments, as illustrated in
There are various challenges in ad targeting. First, traditionally, targeting is performed through modeling users based on observed past user online behavior or preferences. It is commonly known that over the years, tracking user online activities has been widely achieved via, e.g., cookies. However, in recent years, cookies have been gradually phased out and this trend is continuing. In this brave new cookieless world, tracking and understanding user preferences and online behavior becomes unattainable. Given that, some targeting products for producing user-based audiences and segments for advertisers to target have started to shift to contextual-based modeling counterparts. In many situations, such contextual-based targeting products share most of the properties with the user-based counterpart products. This is particularly the case when so-called “panel users” (i.e., a set of users whose online behavior can be tracked, and can serve as exemplars of the desired behavior or preferences) are used to build models for Predictive Audiences (PA) and Lookalike Segments (LAL).
The second challenge has to do with modeling capacity and scalability. Current solutions treat an underlying prediction problem (e.g., predicting the probability of conversion or click-through rate) as a binary classification problem. This kind of solution leverages a separate binary classifier or model for estimating the conversion probability for each of the targeting segments. This is depicted in
An ad serving system using such traditional solutions is depicted in
Formulating the prediction problem this way leads to undesirable consequences, including lower predictive power of the models, inability to consider the interactions among different segments, and wasted hardware resources and computing time.
Thus, there is a need for a solution that addresses the challenges discussed above and enhances the operations of ad targeting.
The teachings disclosed herein relate to methods, systems, and programming for information management. More particularly, the present teaching relates to methods, systems, and programming for predictive targeting based on joint modeling of a plurality of audience segments.
In one example, a method, implemented on a machine having at least one processor, storage, and a communication platform capable of connecting to a network, is disclosed for predictive targeting. Training data are obtained with pairs of data. Each pair includes an ad opportunity context corresponding to an ad served to a plurality of audiences and a label vector having a plurality of labels, each of which indicates a reaction, with respect to the ad served, of a corresponding one of the plurality of audiences in the ad opportunity context. Based on the training data, model parameters of a joint predictive model are learned via machine learning based on an initialized model with initial model parameters by minimizing a loss in an iterative process. The learned joint predictive model is to be used to map an input context of an ad opportunity to an output label vector having a plurality of probabilities, each of which predicts a likelihood of a reaction of a corresponding one of the plurality of audiences to the input context of the ad opportunity.
In a different example, a system is disclosed for predictive targeting, including a training data generator, a model initializer, and a machine learning controller. The training data generator is configured for generating training data comprising pairs of data, each of the pairs includes an ad opportunity context corresponding to an ad served to a plurality of audiences and a label vector having a plurality of labels, each of which indicates a reaction, with respect to the ad served, of a corresponding one of the plurality of audiences in the ad opportunity context. The model initializer is configured for initializing a joint predictive model with initial model parameters. The machine learning controller is configured for machine learning, based on the training data, model parameters of the joint predictive model based on the initial model parameters by minimizing a loss in an iterative process, where the learned joint predictive model is to be used to map an input context of an ad opportunity to an output label vector having a plurality of probabilities, each of which predicts a likelihood of a reaction of a corresponding one of the plurality of audiences to the input context of the ad opportunity.
Other concepts relate to software for implementing the present teaching. A software product, in accordance with this concept, includes at least one machine-readable non-transitory medium and information carried by the medium. The information carried by the medium may be executable program code data, parameters in association with the executable program code, and/or information related to a user, a request, content, or other additional information.
Another example is a machine-readable, non-transitory and tangible medium having information recorded thereon for predictive targeting. The information, when read by the machine, causes the machine to perform various steps. Training data are obtained with pairs of data. Each pair includes an ad opportunity context corresponding to an ad served to a plurality of audiences and a label vector having a plurality of labels, each of which indicates a reaction, with respect to the ad served, of a corresponding one of the plurality of audiences in the ad opportunity context. Based on the training data, model parameters of a joint predictive model are learned via machine learning based on an initialized model with initial model parameters by minimizing a loss in an iterative process. The learned joint predictive model is to be used to map an input context of an ad opportunity to an output label vector having a plurality of probabilities, each of which predicts a likelihood of a reaction of a corresponding one of the plurality of audiences to the input context of the ad opportunity.
Additional advantages and novel features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The advantages of the present teachings may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities and combinations set forth in the detailed examples discussed below.
The methods, systems and/or programming described herein are further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:
In the following detailed description, numerous specific details are set forth by way of examples in order to facilitate a thorough understanding of the relevant teachings. However, it should be apparent to those skilled in the art that the present teachings may be practiced without such details. In other instances, well known methods, procedures, components, and/or systems have been described at a relatively high-level, without detail, in order to avoid unnecessarily obscuring aspects of the present teachings.
The present teaching discloses solutions that address challenges in ad targeting. To avoid modeling individual segments separately as in the traditional approaches, the present teaching relates to an integrated joint predictive approach for simultaneously predicting conversion probabilities of a large number of audience segments. To do so, a joint predictive model is trained via machine learning based on data related to all audience segments. In some embodiments, the joint predictive modeling is formulated as an extreme multi-label classification (XMLC) problem, which leverages the benefits of the conventional Factorization Machine (FM) model for the purpose of joint multi-audience conversion prediction. The present teaching discloses a solution that addresses the issue related to unavailable user identifiers, focusing on performance (e.g., conversion) prediction based on cookieless traffic involving several thousand contextual predictive audiences.
In framework 200, the learned XMLC ML model 230 is used to predict jointly the performance of a large number of segments. When an ad opportunity is presented with given context, a contextual segment selector 250 invokes a performance prediction unit 240, which estimates the probabilities of performance of all segments of users based on the XMLC ML model 230 and generates a probability vector P=[P1, P2, . . . , Pi, . . . , Pk], wherein each attribute in the probability vector represents the probability of performance (e.g., conversion) for a corresponding user segment. Based on the probability vector P, the contextual segment selector 250 then selects one or more segments as the target segments.
The framework as depicted in
The targeting unit 330 includes the performance prediction unit 240 and the contextual segment selector 250. As discussed herein, the performance prediction unit 240 is provided to estimate, based on the XMLC ML model 230, a probability vector P with probabilities of predicted performance with respect to a large number of segments, given the context associated with the current ad display opportunity. The contextual segment selector 250 is provided to select, based on the predicted performance probabilities for all segments, one or more segments as the targeted segments. In some embodiments, the targeting unit 330 may be implemented as a part of the DSP 320. In some embodiments, the targeting unit 330 may be provided as an independent service provider residing outside of the DSP 320 or even outside of the ad serving system 300.
In the illustrated embodiment, the back end of the ad serving system 300 is provided to establish and update the XMLC ML model 230 so that the XMLC ML model 230 can be used by the targeting unit 330 to target user segments. In
Taking the probability vector P as input, the contextual segment selector 250 selects, at 470, targeted segments according to some pre-determined selection criteria. Such targeted segments are then sent, as the output of the targeting unit 330, to the DSP 320. As shown above, the probability vector includes a plurality of probabilities, each of which corresponds to an estimated probability that a corresponding user segment achieves a certain performance (such as a conversion) given the contextual information associated with the ad opportunity. When multiple DSPs are involved, each of the DSPs performs its respective targeting with its selected segments. In some embodiments, the XMLC ML model 230 may be shared among multiple DSPs and may be trained based on training data from such DSPs. In some embodiments, each DSP may be associated with its own XMLC ML model, trained based on training data from the DSP.
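By way of a non-limiting illustration, the sketch below shows one way the contextual segment selector 250 might apply pre-determined selection criteria to the probability vector P. The threshold and top_k parameters, and the function name itself, are hypothetical assumptions; the actual selection criteria are not prescribed by the present teaching.

```python
import numpy as np

def select_target_segments(P, threshold=0.1, top_k=None):
    """Select targeted segments from the predicted probability vector P = [P1, ..., Pk]."""
    P = np.asarray(P)
    selected = np.flatnonzero(P >= threshold)      # segments whose predicted probability clears the threshold
    if top_k is not None:
        order = np.argsort(P[selected])[::-1]      # rank surviving segments by probability, highest first
        selected = selected[order[:top_k]]
    return selected.tolist()

# Example: probabilities for five segments; keep at most two segments above 0.2.
print(select_target_segments([0.05, 0.4, 0.25, 0.1, 0.3], threshold=0.2, top_k=2))  # [1, 4]
```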
Referring back to
The improvement facilitated by the ad serving system 300 as compared with the traditional ad serving system depicted in
First, the formulation of the conversion prediction problem across multiple (and a potentially large number of) contextual predictive audiences is presented herein in the form of an extreme multi-label classification problem, where each label represents an audience. Let S={(x1, y1), (x2, y2), . . . , (xN, yN)} be a set of labeled contexts of ad opportunities. An ad opportunity context xi∈{0,1}D corresponds to data from the data storage. Such data may be preprocessed so that it (1) represents a concatenation of one-hot encodings of the available contextual fields (e.g., resulting in D features, out of which D̂ are active), and (2) is associated with a vector yi∈{0,1}L including conversion labels for L audiences, for each i=1, . . . , N. Note that yi,l takes a value of 1 or 0 depending on whether or not a conversion was registered for audience l, as an indirect result of xi, ∀l=1, . . . , L. The objective is to learn a function ƒ: {0,1}D→[0,1]L that maps a context of a given ad opportunity x to a label vector y of estimated conversion probabilities for all contextual audiences. It is noted herein that the terms “label” and “audience” may be used interchangeably. In addition, the problem formulated above may be referred to as multi-audience conversion prediction.
In some embodiments, such (xi, yi) tuples are derived based on data stored in storage 340, which represent a collection of records of contexts of ad opportunities and their corresponding conversions (observed on any platform) across different predictive audiences. In some embodiments, the data collection process may involve three data sources: predictive audience definitions, contextual ad opportunities, and user conversion data. Some procedures may be deployed to extract, filter, and integrate data from different sources.
Regarding definitions of audience, data collection may include selecting records from the database of predictive segments such as tables concerning audience pixels, audience definitions, accounts, interest taxonomies, and pixel rules. When such tables are joined, the resulting records include information about a certain audience's identifier, dot pixel, dot rule, country code, and device screen type. In some embodiments, country codes and screen types may be extracted from the audience definition table. In some embodiments, only active audiences with valid pixel IDs may be considered, stemming from, e.g., a multi-tenant streaming system for real-time ingestion of event data (e.g., conversion events) and segment scoring. Such resulting records may be considered as audience conversion rules.
Regarding geographical location or geolocation, such information included in the audience definitions may be represented by country codes. In some embodiments, when representation of such codes is not compliant with the standardized use of this type of geolocation information across different data sources, such country codes from the audience definitions may be mapped to, e.g., corresponding ISO3 counterparts.
With respect to data related to contextual fields, features describing contexts of ad opportunities may be collected from relevant databases. In some embodiments, ad opportunities may be selected if each ad won an ad auction, the corresponding ad was displayed to a certain user, and a user impression was registered upon displaying the ad. In some embodiments, the data selection may also be limited to only ad opportunities that resulted in traffic-protected, valid, or viewable impressions. For such a set of filtered ad opportunities, some fields may be extracted such as event identifier, user_id, webpage top-level domain (TLD), webpage subdomain, Where On Earth Identifiers (WOEIDs) of city, country and region, a user's local time in terms of day-of-week and hour, device type, device category, operating system, browser type, mobile device manufacturer, mobile model, application name, publisher identifier, publisher category, identifier of the publisher's request, ad layout, ad position, ad placement, bidding host machine, video placement type, video content length, site placement, postal code, video player size, metropolitan area identifier, media channel, connection type, and carrier identifier.
In some embodiments, contextual fields, along with the geolocation field (i.e., the ISO3 country code), may be one-hot encoded, resulting in millions of binary features, each indicating the presence of a certain field's category. In some embodiments, features whose frequency of present (non-missing) values is below a threshold, e.g., lower than 5, may be filtered out. The remaining ones may be ordered according to their frequencies so that a certain top number of ordered features (e.g., the top 200,000) may be selected. The remaining low-frequency features may be replaced by an additional feature which may be designated to represent an “unknown” category.
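As a minimal sketch of the frequency-based feature filtering and selection described above, the following illustrative code builds a feature vocabulary by dropping features with fewer than a minimum number of present values, keeping the most frequent ones, and routing everything else to an "unknown" feature. The function names, the min_count parameter, and the "<unknown>" token are assumptions made for illustration only.

```python
from collections import Counter

def build_feature_vocab(records, min_count=5, max_features=200_000):
    """records: iterable of lists of one-hot feature names active in each ad opportunity context."""
    counts = Counter(f for rec in records for f in rec)
    kept = [f for f, c in counts.most_common() if c >= min_count][:max_features]  # frequency-ordered, top-N
    vocab = {f: i for i, f in enumerate(kept)}
    vocab["<unknown>"] = len(vocab)      # shared bucket for the remaining low-frequency features
    return vocab

def encode(record, vocab):
    """Map one context's active feature names to sorted indices of active binary features."""
    unk = vocab["<unknown>"]
    return sorted({vocab.get(f, unk) for f in record})
```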
With regard to labels on performance, e.g., conversion labels, historical activity trails of users (e.g., from the uat_fact table in the uat database) may be searched to determine whether an audience-targeted conversion rule was triggered by any activity within a preset period (e.g., seven days of registering a certain user impression). This may be achieved by joining the selected audience definitions with user activity records based on third party event identifiers associated with the audience definitions (i.e., the activity pixel and activity rule identifiers in the case of UAT). Such operations essentially associate users with binary conversion labels indicating for which advertiser (i.e., as a part of which audience) the users converted.
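The following is a simplified, hypothetical sketch of how such conversion labels might be derived by joining audience conversion rules with user activity records registered within the preset window. The table and column names (impressions, activities, audience_rules, pixel_id, rule_id, audience_id, event_time) are illustrative placeholders and do not reflect the actual schema of the uat database.

```python
import pandas as pd

def label_conversions(impressions, activities, audience_rules, window_days=7):
    """Return (impression event, audience) pairs for which a conversion rule fired within the window."""
    # Keep only user activities that match an audience's conversion rule (pixel/rule identifiers).
    conv = activities.merge(audience_rules, on=["pixel_id", "rule_id"])
    # Associate impressions with those activities through the user identifier.
    joined = impressions.merge(conv, on="user_id", suffixes=("_imp", "_act"))
    window = pd.Timedelta(days=window_days)
    in_window = (joined["event_time_act"] >= joined["event_time_imp"]) & \
                (joined["event_time_act"] <= joined["event_time_imp"] + window)
    hits = joined.loc[in_window, ["event_id", "audience_id"]].drop_duplicates()
    hits["label"] = 1                    # binary conversion label for that audience
    return hits
```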
In some embodiments, users' contextual, geolocation, and device type features may be associated with their corresponding conversion labels across all audiences based on, e.g., the users' identifiers and an audience definition (e.g., a BMW X audience of mobile web users in Canada is eligible when an ad opportunity is presented for a user browsing on a mobile phone in Canada). In some embodiments, a random selection may be performed on such associated data to identify a certain volume (e.g., one million) of ad opportunities that resulted in conversions within a period (e.g., a week) of time, and a certain volume (e.g., one million) of ad opportunities that did not result in any conversions. The resulting augmented dataset may then be packed into the sparse (LibSVM) data format and the corresponding vocabularies of features and labels may be generated.
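A minimal sketch of the balanced random selection and sparse (LibSVM) packing described above is shown below; the function names, the per-class sample size, and the assumption that label and feature vocabularies already map labels and features to integer indices are illustrative.

```python
import random

def balance_sample(positives, negatives, n_per_class=1_000_000, seed=0):
    """Randomly pick up to n_per_class converting and n_per_class non-converting ad opportunities."""
    rng = random.Random(seed)
    return (rng.sample(positives, min(n_per_class, len(positives))) +
            rng.sample(negatives, min(n_per_class, len(negatives))))

def to_libsvm_line(label_idx, feature_idx):
    """One example in sparse multi-label LibSVM format: 'l1,l2 f1:1 f2:1 ...'."""
    labels = ",".join(str(l) for l in sorted(label_idx))
    feats = " ".join(f"{j}:1" for j in sorted(feature_idx))
    return f"{labels} {feats}"

# Example row: conversions for audiences 3 and 17, active features 5, 42, and 1007.
print(to_libsvm_line([17, 3], [42, 5, 1007]))   # "3,17 5:1 42:1 1007:1"
```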
Such generated data may then be used for training the XMLC ML model 230. In some embodiments, dimensionality reduction on representations of features may be performed. For example, a feature embedding lookup may be created such that every feature, representing a binary random variable Xj from a vocabulary Vfeat={X1, X2, . . . , XD}, is assigned an embedding vj∈ℝM, such that M«D. The values of such created feature embeddings may be initialized with random uniform values. A mapping g:Vfeat→ℝM is used to retrieve the embedding vj=g(Xj) of the feature Xj, for every j=1, . . . , D. In formulating the learning, interactions among features may be considered according to the present teaching. When a large number of features is associated with each ad opportunity context, the feature interactions to be incorporated into the formulation may be limited to a certain extent. For example, in some situations, only second-degree feature interactions may be considered. This is illustrated in the following.
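For illustration only, the feature embedding lookup described above may be sketched as follows, assuming random uniform initialization and an index-based mapping g; the initialization range and function names are assumptions.

```python
import numpy as np

def init_feature_embeddings(D, M, seed=0):
    """Create the D x M feature embedding lookup (M << D) with random uniform initial values."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-0.01, 0.01, size=(D, M))

def g(V, j):
    """Mapping g: V_feat -> R^M, returning the embedding v_j = g(X_j) of the j-th feature."""
    return V[j]

V = init_feature_embeddings(D=200_000, M=16)
v_j = g(V, 123)      # 16-dimensional embedding of feature X_123
```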
A decision function with a second-degree feature interaction may be defined as:
where w0=[w0,1, w0,2, . . . , w0,L] is an L-dimensional vector in which w0,l is the bias term for the l-th audience. W=[wj,l]D×L is a two-dimensional matrix in which entry wj,l represents the weight of the j-th feature with respect to the l-th audience. Wint=[wj,k,lint]D̂×D̂×L is a three-dimensional matrix in which wj,k,lint is the strength of the interaction between the j-th and the k-th active features with respect to the l-th audience. X̂ is a set with cardinality D̂ containing the indices of the active features of x. V is a D×M embedding matrix (applied to all audiences) in which each row vj is the M-dimensional embedding of the j-th feature, meaning that vx̂j corresponds to the embedding of the j-th active feature. In the above formulation, the operator ⟨·,·⟩ computes the dot product between the embeddings of two features as shown below:
for each j, k=1, . . . , D̂.
This extends the capacity of the linear formulation given by the first two separate terms of equation (1) and thus allows for modeling between-feature interactions. In this exemplary formulation, instead of using a separate parameter for each feature interaction with respect to each audience, the feature interactions are essentially modeled by factorizing the interaction strengths. This is one of the central benefits of factorization machine models, which aids in obtaining estimates of the feature interaction values even when dealing with considerably sparse features, as is frequently the case with ad opportunity features.
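Since equation (1) is not reproduced above, the sketch below reflects only one assumed reading of the second-degree decision function, based on the definitions given: a per-audience bias, a linear term over the active features, and pairwise interaction strengths scaled by the dot products of the corresponding feature embeddings. The array shapes and function name are illustrative and should not be taken as the exact formulation.

```python
import numpy as np

def mlfm_scores(active_idx, w0, W, W_int, V):
    """Decision scores f_l(x) for all L audiences, given the indices of the active features of x.
    w0: (L,) biases; W: (D, L) per-feature weights; V: (D, M) shared embeddings;
    W_int: (D_hat_max, D_hat_max, L) interaction strengths between active-feature positions."""
    f = w0.copy()                                      # bias term per audience
    f += W[active_idx].sum(axis=0)                     # linear term over active features
    d_hat = len(active_idx)
    for j in range(d_hat):
        for k in range(j + 1, d_hat):
            dot = V[active_idx[j]] @ V[active_idx[k]]  # <v_xj, v_xk>: embedding dot product
            f += W_int[j, k] * dot                     # per-audience interaction contribution
    return f

# Toy example: D=6 features, L=3 audiences, M=2 embedding dimensions, up to 4 active features.
rng = np.random.default_rng(0)
w0, W, V = rng.normal(size=3), rng.normal(size=(6, 3)), rng.normal(size=(6, 2))
W_int = rng.normal(size=(4, 4, 3))
print(mlfm_scores([0, 2, 5], w0, W, W_int, V))
```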
To learn the parameters of the XMLC ML model 230, considering a dataset S={(x1, y1), (x2, y2), . . . , (xN, yN)} of labeled contexts of ad opportunities, the categorical cross-entropy loss for the i-th context is defined as:
where pi,l=P(yi,l|xi)=1/(1+e−ƒl(xi)) denotes the estimated probability of a conversion for the l-th audience given the i-th ad opportunity context xi.
In this exemplary formulation, a sigmoid function, instead of a softmax function, is used to obtain the conversion probabilities since an ad opportunity may result in conversions for multiple audiences. The parameters to be learned include the values of the embeddings for different features, as well as the model parameters w0*, W*, Wint*, V*. Such parameters may initially be assigned randomly initialized values and adjusted in the learning process by minimizing the total loss calculated over all ad opportunity contexts in S.
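A minimal sketch of such a per-context loss, assuming the standard sigmoid cross-entropy summed over all L audiences (the exact expression is not reproduced in the text above), is given below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def context_loss(f_scores, y, eps=1e-12):
    """Multi-label cross-entropy for one ad opportunity context:
    -sum_l [ y_l * log(p_l) + (1 - y_l) * log(1 - p_l) ], with p_l = sigmoid(f_l(x))."""
    p = sigmoid(np.asarray(f_scores, dtype=float))
    y = np.asarray(y, dtype=float)
    return float(-np.sum(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps)))

# Example: three audiences, only the second one converted.
print(context_loss([-2.0, 1.5, 0.0], [0, 1, 0]))
```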
This model and corresponding learning mechanism via multi-label factorization machine (MLFM) formulation has a space complexity of O(L+DM+D̂LM+LD̂(D̂−1)/2), where D is the number of features, D̂ is the number of active features, M is the feature embedding dimension, and L is the number of audience segments. Similarly, the time complexity becomes asymptotically linear in terms of D̂ and M, i.e., O(D̂M).
As depicted in
The training data generator 520 may be provided to organize the data into a form that is needed for further processing. For instance, each of the ad opportunity contexts needs to be paired with a corresponding label vector, representing the outcome performance after the ad display, to form S={(x1, y1), (x2, y2), . . . , (xN, yN)}, etc. As another example, each of the ad opportunity contexts xi may be encoded as a one-hot vector and then mapped to an embedding of a reduced dimension. The context feature vector initializer 530 may be provided to initialize the values of embeddings representing the features in the training ad opportunity contexts. The initialization of embeddings may be according to a profile stored in 540 that specifies the scheme used for initializing the embeddings. For instance, the profile may specify to use randomly generated numbers as initial values of the embeddings. Other profiles for initialization of embeddings may also be specified in 540 so that the learning mechanism can be flexibly adapted to a different scheme in operation.
The initialized embedding values may be stored in the XMLC ML model 230 as the initial state. At the same time, various weights used in the XMLC ML model 230, e.g., the model parameters w0*, W*, Wint*, V*, may be initialized by the model weight initializer 550. Similarly, such initialization may also be performed based on a scheme specified by the configured profile in 540. For instance, the initial model weights may all be assigned the same value, i.e., every weight is treated equally at the start. Once the model parameters of the XMLC ML model 230 are initialized, they are stored as current model parameters in 230. To learn the model parameters, the machine learning controller 560 manages an iterative learning process. It invokes the loss determiner 570 and the model parameter optimizer 580 in each iteration. For example, based on the current model parameters in 230, the loss determiner 570 may compute the loss during each iteration based on a specified loss function defined in a profile stored in the learning control profiles 540. Then the model parameter optimizer 580 determines, based on the loss computed, how to adjust the model parameters to minimize the loss. If the loss indicates that the learning has not converged, another iteration starts by computing a loss based on the adjusted model parameters. The machine learning controller 560 may control the learning process in accordance with some convergence condition, e.g., a level of loss defining convergence, specified in a learning control profile stored in 540. The learning process may not stop until the convergence condition is met.
Based on the initialized model parameters (including both embedding values as well as model weights), the machine learning controller 560 controls an iterative learning process involving steps 640-690. Specifically, the loss determiner 570 computes, at 640, the loss based on the current model parameters. It is then determined, at 650, whether the loss satisfies a convergence condition. If the loss does not satisfy the convergence condition, it means that the model parameters are not yet optimal and the model parameter optimizer 580 then proceeds to adjust, at 660, the model parameters by minimizing the loss. Once the model parameters are adjusted, the adjusted model parameters are stored in 230 for the next round of learning. To do so, the processing proceeds to step 640 to compute the loss based on the updated model parameters. The iterations continue until the learning converges, i.e., the loss meets the convergence condition, determined at 650. For instance, the loss may meet the convergence condition when the loss is below a pre-determined threshold. In this case, the current model parameters are converged, and they are stored, at 670, in 230 to represent the learned XMLC ML model 230.
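A simplified sketch of such an iterative learning loop is given below, using a generic linear multi-label scorer as a stand-in for the XMLC ML model 230; the dimensions, placeholder data, learning rate, and convergence threshold are all assumptions for illustration, and the real model would implement the decision function discussed above.

```python
import torch

D, L = 1_000, 50                                        # illustrative feature and audience counts
model = torch.nn.Linear(D, L)                           # stand-in for the XMLC ML model 230
loss_fn = torch.nn.BCEWithLogitsLoss(reduction="sum")   # sigmoid + cross-entropy over all audiences
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X = torch.rand(256, D)                                  # placeholder ad opportunity contexts
Y = (torch.rand(256, L) < 0.05).float()                 # placeholder sparse conversion labels

convergence_threshold = 1.0                             # example convergence condition
for step in range(10_000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), Y)                         # compute loss with current parameters (640)
    if loss.item() < convergence_threshold:             # convergence check (650)
        break
    loss.backward()
    optimizer.step()                                    # adjust parameters to reduce the loss (660)
```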
As data collection continues during ad serving operations, the trained XMLC ML model 230 may need to be adapted to a new environment or situation. Such adaptation may be regularly scheduled or dynamically activated when, e.g., enough new data have been collected from the ad serving operations. Thus, after the XMLC ML model 230 is learned and used in ad serving applications, it may be checked, either regularly or dynamically, whether additional training (adaptation) is needed at 680. If so, the process proceeds to step 610 to conduct another round of learning to produce an XMLC ML model 230 that is adapted to the new training data.
To implement various modules, units, and their functionalities described in the present disclosure, computer hardware platforms may be used as the hardware platform(s) for one or more of the elements described herein. The hardware elements, operating systems and programming languages of such computers are conventional in nature, and it is presumed that those skilled in the art are adequately familiar with them to adapt those technologies to appropriate settings as described herein. A computer with user interface elements may be used to implement a personal computer (PC) or other type of workstation or terminal device, although a computer may also act as a server if appropriately programmed. It is believed that those skilled in the art are familiar with the structure, programming, and general operation of such computer equipment and as a result the drawings should be self-explanatory.
Computer 800, for example, includes COM ports 850 connected to and from a network connected thereto to facilitate data communications. Computer 800 also includes a central processing unit (CPU) 820, in the form of one or more processors, for executing program instructions. The exemplary computer platform includes an internal communication bus 810, program storage and data storage of different forms (e.g., disk 870, read only memory (ROM) 830, or random-access memory (RAM) 840), for various data files to be processed and/or communicated by computer 800, as well as possibly program instructions to be executed by CPU 820. Computer 800 also includes an I/O component 860, supporting input/output flows between the computer and other components therein such as user interface elements 880. Computer 800 may also receive programming and data via network communications.
Hence, aspects of the methods of information analytics and management and/or other processes, as outlined above, may be embodied in programming. Program aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Tangible non-transitory “storage” type media include any or all of the memory or other storage for the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide storage at any time for the software programming.
All or portions of the software may at times be communicated through a network such as the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, in connection with information analytics and management. Thus, another type of media that may bear the software elements includes optical, electrical, and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links, or the like, also may be considered as media bearing the software. As used herein, unless restricted to tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
Hence, a machine-readable medium may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, which may be used to implement the system or any of its components as shown in the drawings. Volatile storage media include dynamic memory, such as a main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that form a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a physical processor for execution.
Those skilled in the art will recognize that the present teachings are amenable to a variety of modifications and/or enhancements. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be implemented as a software-only solution, e.g., an installation on an existing server. In addition, the techniques as disclosed herein may be implemented as firmware, a firmware/software combination, a firmware/hardware combination, or a hardware/firmware/software combination.
While the foregoing has described what are considered to constitute the present teachings and/or other examples, it is understood that various modifications may be made thereto and that the subject matter disclosed herein may be implemented in various forms and examples, and that the teachings may be applied in numerous applications, only some of which have been described herein. It is intended by the following claims to claim any and all applications, modifications and variations that fall within the true scope of the present teachings.