1. Technical Field
The invention relates generally to digital advertising. More particularly, the invention relates to a data management platform for digital advertising.
2. Description of the Related Art
Over the last decade, a number of radical changes have reshaped the worlds of digital advertising, marketing, and media. The first is an innovation called programmatic buying, which is the process of executing media buys in an automated fashion through digital platforms, such as real-time bidding exchanges (RTBEs) and demand-side platforms (DSPs). This method replaces the traditional use of manual processes and negotiations to purchase digital media. Instead, an advertisement (ad) impression is made available through an auction in an RTBE in real-time. Upon requests from RTBEs, DSPs choose whether to respond with bids and proposed ads on behalf of their advertisers for this impression. The entire end-to-end buying process between RTBEs and DSPs typically takes less than 150 ms including the network time, leaving less than 50 ms for DSPs to run their runtime pipelines. It is well understood that data, including user data, advertiser data, and contextual data, plays a central role in making such dynamic buying decisions optimal.
A second important shift is the prolific use of mobile devices, social networks, and video sites. As a result, marketers have gained powerful tools to reach customers through multiple channels such as, but not limited to, mobile, social, video, display, email, and search. There are numerous platforms dedicated to single channel optimization. For example, video channel platforms aim to maximize user engagement with video ads, while social ad platforms aim to increase the number of fans and likes of a given product. Regardless of channel, data-driven approaches have proven very effective at lifting campaign performance.
With the advance of such technologies, one challenge to marketers today is that marketing strategy has become more complicated than ever before. While much work has been done to optimize each individual channel, how different channels interact with each other is little understood. This is, however, very important because customers often interact with multiple touch points through multiple channels. One main obstacle is that while there are abundant data to leverage, such data may reside in different platforms and in different forms. As a result, it may be a non-trivial task to create a global dashboard by extracting aggregated reporting data from different platforms. Performing finer-grained analytics across channels, which may be important for assessing the effectiveness, attribution, and accurate rate of return of different channels, may be virtually impossible.
Recently, data management platforms (DMPs) have been emerging as a solution to the above challenge. A DMP may be a central hub that seamlessly and rapidly collects, integrates, manages, and activates large volumes of data.
An embodiment of the invention comprises a data management platform (DMP) that integrates the following functionalities:
1. Data integration: A DMP is configured to cleanse and integrate data from multiple platforms or channels with heterogeneous schemas. Importantly, such integration may have to happen at the finest level of granularity by linking the same audience or users across different platforms. By such functionality, deeper and more insightful audience analytics may be obtained across campaign activities.
2. Analytics: A DMP provides full cross-channel reporting and analytics capabilities. Examples may include, but are not limited to, aggregation, user behavior correlation analysis, multi-touch attribution (defined as attributing credit to the channels which contributed to a final action of an audience), tag management, analytical modeling, etc. Furthermore, such DMP may be delivered through cloud-based software-as-a-service (SaaS) to end users and provide them the flexibility to plug in their own analytical intelligence.
3. Activation: A DMP is configured to not only get data in, but also send data out in real-time. In other words, such DMP may need to make the insights actionable. For example, such DMP may be configured to perform modeling and scoring in real-time by combining online and offline data and sending the data to other platforms to optimize the downstream media and enhance the customer experience.
Thus, an embodiment of the invention provides a data management apparatus for digital advertising. A data integration processor is provided for collecting and storing data from providers, resolving heterogeneity of the data at schema and data levels, and performing validity checks of the data. An analytics processor is provided for receiving validated data from the data integration processor and providing to users a custom, nesting-aware, SQL-like query language and a library of data mining methods, machine learning models, and analytical user profiles (AUP). Further, an activation processor is provided for encapsulating complex computations performed in real-time, segment evaluation, and online user classification using runtime user profiles (RUP).
The following is an overview of an exemplary DMP in accordance with an embodiment of the invention. It has been found that DMPs are able to handle big data in batch mode, as well as in real-time, thus unifying techniques from multiple fields of data science, including databases, data mining, streaming, distributed systems, key-value stores, and machine learning as disclosed in K.-C. Lee, B. Orten, A. Dasdan, and W. Li, Estimating Conversion Rate in Display Advertising from Past Performance Data, in KDD, pages 768-776, 2012; X. Shao and L. Li. Data-driven Multi-touch Attribution Models, in KDD, pages 258-264, 2011 (“Shao”); etc.
The remainder of the discussion herein is organized as a high-level overview of an embodiment of a DMP and three main components thereof: data integration, analytics, and activation.
In an embodiment of the invention, an audience or user profile covers available information for a given anonymized user, including but not limited to, demographics, psychographics, campaign, and behavioral data. User profile data is typically collected from various sources. Such data may be first party data, i.e. historical user data collected by advertisers in their own private customer relationship management (CRM) systems, or third party data, i.e. data provided by third party data partners, typically each specializing in a specific type of data, e.g. credit scores, buying intentions, etc. In one embodiment, user profiles are treated as first-class citizens and are the basic units for offline analytics, as well as for real-time applications.
In an embodiment of the invention, user profile data may arrive in various types, formats, and cardinalities, which may be best captured using a nested relational data model. Logically, each user profile is one record, where some attributes of this record may themselves be tables, each storing a certain type of event. In addition to the digital marketing domain, the nested relational data model has already gained wide adoption in the field of big data, such as disclosed in S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. PVLDB, 3(1):330-339, 2010.
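By way of a non-limiting illustration, a nested user profile record might be sketched in Python as follows. The field names advertiser and ts follow the nested tables discussed later in this description; the class names, the demographics attribute, and the sample values are hypothetical and serve only to show one record containing nested event tables.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    advertiser: str  # advertiser the event relates to
    ts: int          # time stamp (integer seconds is an assumption)


@dataclass
class UserProfile:
    uid: str  # anonymized user id
    # Flat attributes of the record (hypothetical example attribute set).
    demographics: Dict[str, str] = field(default_factory=dict)
    # Nested tables: each attribute is itself a table of events.
    impressions: List[Event] = field(default_factory=list)
    actions: List[Event] = field(default_factory=list)


# One logical record per user, with events appended to the nested tables.
profile = UserProfile(uid="u1", demographics={"age_band": "25-34"})
profile.impressions.append(Event(advertiser="acme", ts=100))
profile.actions.append(Event(advertiser="acme", ts=250))
```

Keeping events inside the profile record, rather than in separate flat tables, is what allows per-user questions to be answered without a join across users.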
Based on the above-described functionality of DMPs, in accordance with an embodiment of the invention, a DMP maintains two versions of the user profiles. First, the analytical user profile (AUP) is designed for the purpose of offline analytics and data mining. In an embodiment of the invention, the AUP is stored in a Hadoop Distributed File System (HDFS), such as disclosed in Hadoop, Open Source Implementation of MapReduce at hadoop.apache.org. Second, a runtime user profile (RUP) is stored in a globally replicated key-value store to enable fast and reliable retrieval within a few milliseconds for real-time applications.
An embodiment of the invention can be understood with reference to
The data integration engine 3404, referred to herein as the Datahub, is responsible for gathering and storing data from first and third party providers, resolving heterogeneity at schema and data levels, e.g. disparate user ids, and performing the necessary validity checks. Once the data are received from the external partners, such data then flow into the other two components. In an embodiment, the analytics engine 3406 may be known as Cheetah, as disclosed in S. Chen, “Cheetah: A high performance, custom data warehouse on top of mapreduce,” PVLDB, 3(2):1459-1468, 2010 (“Cheetah document”), and has an AUP store as a data layer. The analytics engine 3406 provides data analysts with a custom, nesting-aware, SQL-like query language called Cheetah Query Language (CQL), in addition to a rich library of data mining methods and machine learning models. In an embodiment, a runtime engine 3408 runs on top of an RUP store. The runtime engine 3408 encapsulates the complex computations performed in real-time, such as but not limited to segment evaluation, online user classification, etc.
In an embodiment of the invention, user online activity data, e.g. in the form of impressions, clicks, and actions, is sent to the DMP 3402 by the DSPs 3410, e.g. for impression and click data, and the advertiser, e.g. for action data. Because an embodiment of the DMP may be integrated both with its own DSP, as well as other DSPs in the ecosystem, the online data may be obtained from those external DSPs. In such cases, they are considered to be third party media providers 3418, analogous to third party offline data providers 3420.
In the following discussion, particular, important components of one or more embodiments of the DMP 3402 are explained in more detail.
In an embodiment of the invention, a user profile may be a central data repository in the DMP. Each profile contains marketing campaign data, online behavior data, CRM data, etc. Some of such data are collected online, while others are collected by loading offline data files, which are typically keyed off disparate user ids from another platform.
In an embodiment of the invention, the DMP designed integration software is referred to as the Datahub and is used to receive offline data files. At a high level, the Datahub implements three steps:
In an embodiment of the invention, the Datahub handles scalability through multiple FTP servers, multi-pipeline concurrent loading, a Hadoop MapReduce computation model, and its own job scheduler to prioritize specific jobs. The Datahub also achieves immunity from bad data by an initial validation of offline data, thus shielding an embodiment of the DMP from dirty data. Additionally, to mitigate heterogeneity issues, the Datahub uses configuration files for instantiating different loading templates and a centralized catalog that supports schema evolution. The Datahub consistently saves metrics, such as files that were received and stored, records processed, records rejected, last successful pipeline step, and profiling times, in a database. Such monitoring information enables system alerts, client notifications, and billing statements. Furthermore, the Datahub may be configured to recover after failure through a fault tolerance protocol relying on persistent status files. In addition, by leveraging the nested data model of user profiles, the Datahub may incorporate more custom logic into a join algorithm, e.g. two data files may easily be differentiated and loaded incrementally.
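By way of a non-limiting illustration, the initial validation, the resolution of disparate user ids, and the saving of processed and rejected record counts described above might be sketched in Python as follows. The record layout (partner_uid, attributes), the id_map lookup table, and the function name load_offline_records are hypothetical and are not taken from the Datahub itself.

```python
from typing import Dict, List, Tuple


def load_offline_records(
    rows: List[dict], id_map: Dict[str, str]
) -> Tuple[List[dict], Dict[str, int]]:
    """Validate rows, map partner ids to internal ids, and count outcomes."""
    accepted: List[dict] = []
    metrics = {"processed": 0, "rejected": 0}
    for row in rows:
        metrics["processed"] += 1
        # Resolve the partner's user id to the platform's own id (assumed map).
        uid = id_map.get(row.get("partner_uid"))
        # Reject rows with an unknown id or with no payload, so that dirty
        # data never reaches the user profiles.
        if uid is None or not row.get("attributes"):
            metrics["rejected"] += 1
            continue
        accepted.append({"uid": uid, "attributes": row["attributes"]})
    return accepted, metrics


rows = [
    {"partner_uid": "p1", "attributes": {"credit_band": "A"}},
    {"partner_uid": "p9", "attributes": {"credit_band": "B"}},  # unknown id
    {"partner_uid": "p2", "attributes": {}},                    # empty payload
]
accepted, metrics = load_offline_records(rows, {"p1": "u1", "p2": "u2"})
```

In this sketch the metrics dictionary stands in for the monitoring database; a production pipeline would presumably persist these counts per file and per pipeline step.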
In an embodiment of the invention, analytics over AUPs may be based on Cheetah, which is a high performance, custom data warehouse, as disclosed in the Cheetah document, supra. Cheetah has a SQL-like query language (CQL), which also supports queries over nested data models. Below is an example query:
In an embodiment of the invention, there are two nested tables in the user profile: prof.actions and prof.impressions, which record a user's actions (or conversions) and impressions, respectively. Both nested tables have the field advertiser, which identifies the advertiser to which the action or impression relates, and the field ts, which is the time stamp. The query above therefore applies GROUP-BY to the column advertiser of the nested table prof.actions to compute the total number of actions, i.e. count( ), and the number of users who have the action, i.e. count(distinct uid), subject to a WHERE clause requiring that at least one impression from the same advertiser take place before an action. The filtering condition is composed of a sub-query, which calculates the total number of impressions occurring before the concerned action, i.e. b.ts<a.ts, from the same advertiser, i.e. a.advertiser=b.advertiser, by querying the nested table prof.impressions of the same user profile.
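By way of a non-limiting illustration, the semantics described above, namely a per-advertiser count of actions and of distinct users, restricted to actions preceded by at least one impression from the same advertiser in the same profile, might be sketched in Python as follows. This is not CQL and does not reflect how Cheetah executes queries; the dictionary layout of the profiles and the function name are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def actions_after_impression(profiles: List[dict]) -> Dict[str, Tuple[int, int]]:
    """Return {advertiser: (action_count, distinct_user_count)}."""
    action_counts: Dict[str, int] = defaultdict(int)
    users: Dict[str, set] = defaultdict(set)
    for prof in profiles:
        for a in prof["actions"]:
            # Sub-query: impressions in the SAME profile, from the same
            # advertiser, that occurred before this action (b.ts < a.ts).
            prior = sum(
                1
                for b in prof["impressions"]
                if b["advertiser"] == a["advertiser"] and b["ts"] < a["ts"]
            )
            if prior > 0:  # WHERE clause: at least one earlier impression
                action_counts[a["advertiser"]] += 1
                users[a["advertiser"]].add(prof["uid"])
    # GROUP BY advertiser: total actions and count(distinct uid).
    return {adv: (action_counts[adv], len(users[adv])) for adv in action_counts}


profiles = [
    {"uid": "u1",
     "impressions": [{"advertiser": "acme", "ts": 1}],
     "actions": [{"advertiser": "acme", "ts": 2},
                 {"advertiser": "acme", "ts": 3},
                 {"advertiser": "beta", "ts": 5}]},
    {"uid": "u2",
     "impressions": [{"advertiser": "acme", "ts": 10}],
     "actions": [{"advertiser": "acme", "ts": 5}]},
]
result = actions_after_impression(profiles)
```

Because both nested tables live in the same profile record, the correlation between impressions and actions is evaluated within one record at a time, with no join across users.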
In an embodiment of the invention, Cheetah employs a number of optimization techniques for AUP queries. To name a few, but not to be limiting:
In an embodiment of the invention, CQL allows for SQL-based aggregations and correlations between different audience events. Sometimes, marketers look for more advanced analytics, such as modeling and machine learning. One example is multi-touch attribution (MTA) as described in Shao, supra.
In an implementation of an embodiment, MTA is a billing model that defines how advertisers distribute credit, e.g. for a customer purchase, to their campaigns in different media channels, e.g. video, display, mobile, etc. For example, suppose a user sees a car ad on a Web browser. Later, the user sees a TV commercial about the same car, which makes the user more interested. Finally, after the user sees the ad again on a mobile phone, the user takes action and registers for a test drive. Marketers know that each of such media channels may contribute to a final conversion of an audience. However, a current common practice is last-touch attribution (LTA), where the last impression, here the one on the mobile phone, gets all of the credit. A better and fairer advertising ecosystem is expected to distribute the credit among the channels that contributed to the user's final action. This is the so-called multi-touch attribution problem. In an embodiment of the DMP, different MTA models are incorporated as user defined functions (UDFs) into CQL. This way, CQL users have the freedom to feed an MTA algorithm with arbitrary input data.
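By way of a non-limiting illustration, the difference between last-touch attribution and a multi-touch model might be sketched in Python as follows, using the car ad example above. The equal-weight (linear) model shown here is only one simple choice made for illustration; it is an assumption and is not the data-driven model described in Shao. The function names and channel labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def last_touch(touches: List[str]) -> Dict[str, float]:
    """LTA: the channel of the final touch receives all of the credit."""
    return {touches[-1]: 1.0}


def linear_multi_touch(touches: List[str]) -> Dict[str, float]:
    """Equal-weight MTA: every touch on the path receives the same share."""
    credit: Dict[str, float] = defaultdict(float)
    share = 1.0 / len(touches)
    for channel in touches:
        credit[channel] += share
    return dict(credit)


# Path from the example: Web display ad, then TV commercial, then mobile ad.
path = ["display", "tv", "mobile"]
lta = last_touch(path)
mta = linear_multi_touch(path)
```

Under last-touch attribution the mobile channel receives the whole conversion, whereas the equal-weight model credits each of the three channels with one third.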
In an embodiment of the invention, CQL, as well as the data mining UDFs, is exposed to external clients as a data service in the cloud and is configured such that the external clients may perform ad hoc analysis and obtain unique insights on their own.
For purposes of understanding herein, RUPs may refer to user profiles stored in profile stores for real-time applications. In an embodiment of the invention, as with AUPs, RUPs also have a nested data model and are updated incrementally and in real-time with new user events. Profile stores, as with other Not only SQL (NoSQL) systems, are high-performance, key-value stores for RUPs, with keys being user ids and values being RUPs. As important runtime components, profile stores are highly optimized to provide low-latency read/write RUP access, typically within a few milliseconds, and to support a peak load of 1,000,000 queries per second across multiple, geographically distributed data centers.
A design of an embodiment of a profile store is inspired by G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels' disclosure entitled, “Dynamo: Amazon's highly available key-value store,” in SOSP, pages 205-220, 2007 and by Voldemort, which may be found at www.project-voldemort.com/voldemort. In an embodiment of the invention, a software layer is built on top of Berkeley DB (BDB) that uses consistent hashing to achieve sharding, replication, consistency, and fault tolerance. The embodiment of the profile store also employs flash drives because hard disks may not be fast enough for the purpose. RUPs are replicated locally in each data center, as well as globally between data centers, to achieve high availability and local low-latency access.
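By way of a non-limiting illustration, the use of consistent hashing to choose which storage nodes hold a given RUP might be sketched in Python as follows. This sketch covers only key placement and replica selection; it does not model Berkeley DB, consistency, or failure handling. The class name, the virtual node count, the replica count, and the use of MD5 as the ring hash are all assumptions.

```python
import bisect
import hashlib
from typing import List


class ConsistentHashRing:
    """Maps user ids to an ordered list of replica nodes on a hash ring."""

    def __init__(self, nodes: List[str], vnodes: int = 64, replicas: int = 2):
        self.replicas = min(replicas, len(set(nodes)))
        # Each physical node is placed at many points (virtual nodes) on the
        # ring so that keys spread evenly across nodes.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def nodes_for(self, uid: str) -> List[str]:
        """Walk clockwise from the key's position, collecting distinct nodes."""
        start = bisect.bisect(self._points, self._hash(uid)) % len(self._ring)
        owners: List[str] = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == self.replicas:
                break
        return owners


ring = ConsistentHashRing(["store-a", "store-b", "store-c"])
owners = ring.nodes_for("user-42")  # first entry is the primary replica
```

A property of this placement scheme is that adding a node moves only the keys that the new node takes over, while every other key keeps its primary node, which limits data movement when a profile store is resized.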
In an embodiment of the invention, to guarantee real-time synchronized RUPs in every data center, an infrastructure called the replication bus is built and employed that incrementally replicates user events across data centers and distributes them to profile stores to keep RUPs up-to-date. The replication bus is highly optimized to synchronize tens of billions of events daily between data centers with an average end-to-end service level agreement (SLA) of within a few seconds.
Real-Time Processing Pipeline
It has been found that an important feature of a modern data management platform (DMP) is its ability to cope with data flow in real-time. An embodiment of the DMP herein disclosed is equipped with many real-time data processing components. In an embodiment of the invention, some real-time DMP components may consist of data, analytics, user modeling, complex event processing (CEP), and actionable signal generation.
In an embodiment of the invention, both AUP and RUP may store arbitrary user level data in a nested format. Ingress servers are responsible for receiving and storing data, looking up user profiles in real-time, and performing mapping between cookies when necessary. The platform supports multiple types of data, such as impression/click events, structured data events, or arbitrary key-value pair data events. These data events are available in RUPs in real-time for the platform to use for algorithmic computation and decision making, as well as analytics. Eventually, such events from RUP are replicated to AUP.
In an embodiment of the invention, the platform supports multiple real-time operations on the received data. Many of such operations may be modeled as complex event processing. For example, one entity might want to find if a user belongs to a particular set of predefined segments in real-time. The segments are represented as a complex Boolean expression of attributes defined by some predefined taxonomy. Often the segments may be significantly more complicated than simple Boolean expressions, e.g. having some user behavior constraints such as having seen a display advertisement in the last seven days. Such complex segments may be represented by some form of executable code that is evaluated against the RUP data in real-time.
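By way of a non-limiting illustration, a segment that combines a Boolean expression over taxonomy attributes with a behavior constraint, such as having seen a display advertisement in the last seven days, might be represented as executable code in Python as follows. The profile layout, the attribute names (auto_intender, existing_customer), and the helper names are hypothetical, and time stamps are assumed to be integer seconds.

```python
from typing import Any, Callable, Dict

Profile = Dict[str, Any]
Predicate = Callable[[Profile, int], bool]  # (profile, current time) -> bool

SEVEN_DAYS = 7 * 24 * 3600


def has_attr(name: str) -> Predicate:
    """True if the profile carries the given taxonomy attribute."""
    return lambda profile, now: name in profile["attributes"]


def not_(pred: Predicate) -> Predicate:
    return lambda profile, now: not pred(profile, now)


def all_of(*preds: Predicate) -> Predicate:
    return lambda profile, now: all(p(profile, now) for p in preds)


def seen_within(channel: str, seconds: int) -> Predicate:
    """Behavior constraint: an impression on `channel` in the last `seconds`."""
    def check(profile: Profile, now: int) -> bool:
        return any(
            imp["channel"] == channel and 0 <= now - imp["ts"] <= seconds
            for imp in profile["impressions"]
        )
    return check


# Segment: auto intenders who are not yet customers and who saw a display
# ad within the last seven days.
segment = all_of(
    has_attr("auto_intender"),
    not_(has_attr("existing_customer")),
    seen_within("display", SEVEN_DAYS),
)

now = 1_000_000
fresh = {"attributes": {"auto_intender"},
         "impressions": [{"channel": "display", "ts": now - 86_400}]}
member = segment(fresh, now)
```

Composing small predicates in this way keeps each segment definition declarative while still allowing constraints that a plain Boolean expression over attributes cannot express.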
Another example use of real-time computation on a user profile in accordance with an embodiment of the invention involves evaluating a user against machine-learned models. Such models may be specified by the users of the DMP in some proprietary format or by using an industry standard model specification language, such as Predictive Model Markup Language (PMML), an example of which may be found at en.wikipedia.org/wiki/Predictive_Model_Markup_Language. An example model may predict, based on the latest online activity, whether a user is a likely car buyer or is likely to apply for a credit card. Having such knowledge in real-time may be immensely valuable to clients because they may use such predictions as signals to bias their campaigns or take other actions in real-time.
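By way of a non-limiting illustration, evaluating a user against a machine-learned model at runtime might be sketched in Python as follows. The description does not specify a model family; logistic regression is assumed here purely because its scoring step is short. The feature names, weights, and bias are hypothetical values, and no PMML parsing is shown.

```python
import math
from typing import Dict


def score(features: Dict[str, float], weights: Dict[str, float], bias: float) -> float:
    """Return a probability-like score from a logistic regression model."""
    # Features without a learned weight contribute nothing to the score.
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical "likely car buyer" model and features derived from a profile.
weights = {"visited_auto_site": 2.0, "days_since_last_visit": -0.1}
features = {"visited_auto_site": 1.0, "days_since_last_visit": 3.0}
car_buyer_score = score(features, weights, bias=-1.0)
```

The resulting score could then serve as a real-time signal, for example by comparing it against a threshold chosen by the client.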
In an embodiment of the invention, in some cases, a computation for a particular algorithm may be significantly complex requiring multiple stages of a computation layer. Such style of computations may be simply thought of as a series of real-time MapReduce jobs processing the data step-by-step. The computation is represented by a continuous query language or by predefined operators using UDFs, which operate on RUPs in real-time. Such approach may solve complex tasks, such as learning a classification model, performing anomaly detection, or performing other data stream algorithms, such as maintaining top-K elements in a stream, as disclosed for example in A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of frequent and top-k elements in data streams. In ICDT, pages 398-412, 2005.
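By way of a non-limiting illustration, maintaining approximate top-K elements in a stream with a fixed number of counters might be sketched in Python as follows, in the style of the Space-Saving approach of Metwally et al. This sketch finds the minimum counter by a linear scan for brevity, which is a simplification rather than the data structure of the cited work; the class and method names are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple


class SpaceSaving:
    """Tracks approximate counts of frequent items using bounded memory."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.counts: Dict[Hashable, int] = {}
        self.errors: Dict[Hashable, int] = {}  # maximum overestimation per item

    def offer(self, item: Hashable) -> None:
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # Evict the item with the smallest counter; the newcomer inherits
            # that count plus one, and records it as possible overestimation.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            self.errors.pop(victim)
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top(self, k: int) -> List[Tuple[Hashable, int]]:
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


summary = SpaceSaving(capacity=3)
for event in ["a"] * 50 + ["b"] * 30 + ["c", "d", "e", "f"]:
    summary.offer(event)
leaders = summary.top(2)
```

Each stored count is an upper bound on the true frequency, and the count minus the recorded error is a lower bound, so heavy hitters remain in the summary even when many infrequent items pass through.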
In an embodiment of the invention, signals generated out of the computation layer are stored as unstructured data in RUPs and AUPs and may also be sent back to the clients through egress servers for immediate action. DSPs or other platforms may immediately leverage such signals for better user behavior prediction to achieve better campaign performance.
Digital advertising has now reached a state where the pipeline between publishers on the supply side and advertisers on the demand side necessitates many technology partners to help publishers and advertisers deal with real-time optimal decisioning on a huge scale. Among such technology partners, data management platforms may occupy a prominent role as the hub where data relevant to reaching the audience over different channels is integrated, analyzed, and shared. A high-level overview of one or more embodiments of the DMP as an example demand side platform has been disclosed. It is contemplated that due to efficiencies gained through real-time decisioning and the scales involved with more online usage, the future of advertising may be more real-time, which may imply more data and components in real-time.
The computer system 3500 includes a processor 3502, a main memory 3504 and a static memory 3506, which communicate with each other via a bus 3508. The computer system 3500 may further include a display unit 3510, for example, a liquid crystal display (LCD) or a cathode ray tube (CRT). The computer system 3500 also includes an alphanumeric input device 3512, for example, a keyboard; a cursor control device 3514, for example, a mouse; a disk drive unit 3516, a signal generation device 3518, for example, a speaker, and a network interface device 3520.
The disk drive unit 3516 includes a machine-readable medium 3524 on which is stored a set of executable instructions, i.e. software, 3526 embodying any one, or all, of the methodologies described herein below. The software 3526 is also shown to reside, completely or at least partially, within the main memory 3504 and/or within the processor 3502. The software 3526 may further be transmitted or received over a network 3528, 3530 by means of a network interface device 3520.
In contrast to the system 3500 discussed above, a different embodiment uses logic circuitry instead of computer-executed instructions to implement processing entities. Depending upon the particular requirements of the application in the areas of speed, expense, tooling costs, and the like, this logic may be implemented by constructing an application-specific integrated circuit (ASIC) having thousands of tiny integrated transistors. Such an ASIC may be implemented with CMOS (complementary metal oxide semiconductor), TTL (transistor-transistor logic), VLSI (very large scale integration), or another suitable construction. Other alternatives include a digital signal processing chip (DSP), discrete circuitry (such as resistors, capacitors, diodes, inductors, and transistors), field programmable gate array (FPGA), programmable logic array (PLA), programmable logic device (PLD), and the like.
It is to be understood that embodiments may be used as or to support software programs or software modules executed upon some form of processing core (such as the CPU of a computer) or otherwise implemented or realized upon or within a system or computer readable medium. A machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine, e.g. a computer. For example, a machine readable medium includes read-only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other form of propagated signals, for example, carrier waves, infrared signals, digital signals, etc.; or any other type of media suitable for storing or transmitting information.
Further, it is to be understood that embodiments may include performing operations and using storage with cloud computing. For the purposes of discussion herein, cloud computing may mean executing algorithms on any network that is accessible by internet-enabled or network-enabled devices, servers, or clients and that does not require complex hardware configurations, e.g. requiring cables, or complex software configurations, e.g. requiring a consultant to install. For example, embodiments may provide one or more cloud computing solutions that enable users, e.g. users on the go, to obtain advertising analytics or universal tag management in accordance with embodiments herein on such internet-enabled or other network-enabled devices, servers, or clients. It further should be appreciated that one or more cloud computing embodiments may include providing or obtaining advertising analytics or performing universal tag management using mobile devices, tablets, and the like, as such devices are becoming standard consumer devices.
Although the invention is described herein with reference to the preferred embodiment, one skilled in the art will readily appreciate that other applications may be substituted for those set forth herein without departing from the spirit and scope of the present invention. Accordingly, the invention should only be limited by the Claims included below.
This application claims priority to U.S. provisional patent application Ser. No. 61/801,001, filed Mar. 22, 2013, which application is incorporated herein in its entirety by this reference thereto.
Number | Date | Country
---|---|---
61801001 | Mar 2013 | US