Most commonly accepted methodologies for contextual web site analysis and ad serving follow a deferred-analysis model in which the first request for an advertisement logs the hosting page URL to an offline, queue-based system. In this traditional system, the first few ad requests are fulfilled with stock content or public service ads while the offline process works through the queue of pending URLs, crawling each page and analyzing its content in its entirety to derive a specific contextual classification of the page/URL, a very time- and CPU-intensive evaluation. Once the context (and its corresponding ad content) is derived, the URL entry in the database is coded accordingly, so that further requests from that hosting page/URL can simply be referenced against the now-preclassified context and served appropriate content. This commonly accepted model requires an extremely large database and substantial processing power at scale because the system must maintain an entry for every URL that hosts an ad placement. The context value of the page is also limited by the frequency with which the hosting page/URL is re-evaluated for new content.
The present invention provides an alternative method of achieving contextual ad serving without the need for this expansive infrastructure of storing every possible hosting URL, and it ensures always-current page context by evaluating the context of the hosting page/URL in real time, for every ad impression. This is achieved by reversing the model: rather than attempting to evaluate the corpus of all terms residing in the hosting page/URL, the system identifies the corpus of all terms relevant to the available ad inventory (i.e. a selective set of terms). The management of this selective set of terms (ContextBuckets), the manner in which these terms are associated with the appropriate ad content (Ad Content -> ContextBucket), and the mechanism for evaluating the hosting page/URL against this set of selective terms (via the condensing of the term sets into the TokenSpace) are the three primary claims in support of this filing.
Preferred and alternative examples of the present invention are described in detail below with reference to the following drawings:
The present invention provides a system and method for deriving contextually relevant associations between groups of pre-defined content and the web page in which that content will be rendered, for the purpose of delivering ad content with the highest possible contextual relevance to the target page content. The invention thus allows highly relevant content (e.g. advertising) to be delivered to a web page.
This method is herein referred to as the CLASS method. The CLASS method includes four basic object types: the TokenSpace, the ContextBucket, the Centroid, and the Document.
The ContextBucket serves as a named definition to be eventually associated with a collection of web content (ex: the advertisement content). The ContextBucket has two pieces of member data: a Name and a set of n-grams, which are used as a basis for generating a Centroid. The n-grams serve as descriptors for the ContextBucket.
The Centroid is a normalized representation of the ContextBucket. Normalization in this context refers to one of many available methods for down-casing and/or stemming n-grams, combined with an accept/reject methodology for n-grams.
The TokenSpace is a union of all normalized n-grams of each Centroid, ordered by an ordering function (ex: a Latin alphabetical sort).
A source-document represents the content being evaluated for contextual mapping.
The Document represents the normalized version of the source-document that will be used for term-vector distancing against the Centroids in the TokenSpace.
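Purely as an illustrative sketch, these four object types could be modeled as simple data holders; the field names and types below are assumptions chosen for illustration and are not part of the filing:

```python
# Minimal, illustrative data model for the CLASS object types (assumed names).
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ContextBucket:
    """A named definition to be associated with a collection of web content."""
    name: str
    ngrams: Set[str]                      # descriptor n-grams for the bucket


@dataclass
class Centroid:
    """Normalized representation of a ContextBucket."""
    bucket: ContextBucket
    ngrams: Set[str] = field(default_factory=set)       # accepted + normalized
    term_vector: List[int] = field(default_factory=list)


# The TokenSpace is simply the ordered union of all Centroid n-grams.
TokenSpace = List[str]


@dataclass
class Document:
    """Normalized view of a source-document, ready for term-vector distancing."""
    source_url: str
    ngrams: List[str]
    term_vector: List[int] = field(default_factory=list)
```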
The associations between these elements are visually represented in the FIGURE.
The CLASS method is described as follows:
At startup, the set of all defined ContextBuckets is iterated over and a Centroid is created for each ContextBucket. The set of n-grams is iterated over and each n-gram is either accepted or rejected by a Centroid-building function. Accepted n-grams are then normalized via one or more pluggable normalization providers and added to the Centroid. One example normalization is keyword stemming (stemming is a process for reducing inflected, or sometimes derived, words to their stem, base, or root form).
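A minimal sketch of this Centroid-building step is shown below; the accept/reject rule and the crude suffix stemmer are placeholder assumptions standing in for whichever pluggable providers an implementation configures:

```python
# Illustrative Centroid-building step: accept/reject, then normalize (assumed rules).
def accept_ngram(ngram: str) -> bool:
    # Example accept/reject rule: keep only alphabetic n-grams of one or two words.
    words = ngram.split()
    return 0 < len(words) <= 2 and all(w.isalpha() for w in words)


def normalize(ngram: str) -> str:
    # Example pluggable normalizations: down-casing plus a crude suffix stemmer.
    ngram = ngram.lower()
    for suffix in ("ing", "es", "s"):
        if ngram.endswith(suffix) and len(ngram) - len(suffix) >= 3:
            return ngram[: -len(suffix)]
    return ngram


def build_centroid_ngrams(bucket_ngrams):
    # Accepted n-grams are normalized and added to the (unfinished) Centroid.
    return {normalize(g) for g in bucket_ngrams if accept_ngram(g)}


print(build_centroid_ngrams({"Kitty Litters", "catnip", "Puppies!"}))
# -> {'kitty litter', 'catnip'} (set ordering may vary; "Puppies!" is rejected)
```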
There now exists a set of “unfinished” Centroids SC = {C0, …, Cn}. Next, the union of all of the n-grams of each Centroid is determined. The n-grams in the union are ordered via an ordering function; typically the natural (Latin alphabetical) order of the n-grams can be used. This ordered union of the n-grams of all Centroids is called the TokenSpace.
Next, each Centroid is bound to the TokenSpace and a term vector is computed for each Centroid in the TokenSpace. A term vector in this context is a simple list of integers corresponding to the TokenSpace, where each member of the list is equal to the count of the occurrences of the corresponding term from the TokenSpace in the provided Centroid or Document.
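A sketch of the TokenSpace construction and term-vector binding, under the same illustrative assumptions, might look like the following:

```python
# Illustrative TokenSpace construction and term-vector computation.
from collections import Counter
from typing import Iterable, List


def build_token_space(centroids: Iterable[Iterable[str]]) -> List[str]:
    # Ordered union of all centroid n-grams; natural (alphabetical) order here.
    union = set()
    for ngrams in centroids:
        union.update(ngrams)
    return sorted(union)


def term_vector(ngrams: Iterable[str], token_space: List[str]) -> List[int]:
    # One integer per TokenSpace term: its occurrence count in the input.
    counts = Counter(ngrams)
    return [counts[term] for term in token_space]


space = build_token_space([["alpha", "beta"], ["beta", "gamma"]])
print(space)                                           # ['alpha', 'beta', 'gamma']
print(term_vector(["beta", "beta", "alpha"], space))   # [1, 2, 0]
```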
As an example, assume there are two ContextBuckets, one describing cat-related content and the other describing dog-related content.
Assuming the rejection function accepts only dictionary words and the normalization function is simply down-casing, the following language definition is attained:
L=[“catnip”, “golden retriever”, “kitty”, “labrador”, “litter”, “pet”, “puppy”, “whiskers”]
Thus the two Centroids would each be cast as a term vector bound to L.
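For illustration only (a hypothetical assignment, not the filing's original example values), assume a “Cats” ContextBucket described by catnip, kitty, litter, pet and whiskers, and a “Dogs” ContextBucket described by golden retriever, labrador, pet and puppy. Bound to L, these hypothetical Centroids would carry the term vectors Cats = [1, 0, 1, 0, 1, 1, 0, 1] and Dogs = [0, 1, 0, 1, 0, 1, 1, 0], one count per term of L.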
The system is then ready to accept documents for categorization/mapping against the Centroids.
When the system is asked to categorize a source-document, it passes the source document to a Tokenizer. The role of the Tokenizer is to present a set of n-gram candidates to a Document Builder. The Tokenizer uses the same normalization and rejection functions as were configured for the generation of Centroids to process all keywords in the document. Only those normalized keywords/n-grams from the source document that exist in the TokenSpace can be represented as candidates. The Document Builder then builds a Document to represent the source data. Thus, the Document represents a normalized set of matching n-grams from the source-document.
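An illustrative sketch of the Tokenizer and Document Builder follows; for brevity it handles unigrams only, and the normalization and rejection functions are the same placeholder assumptions used above:

```python
# Illustrative Tokenizer / Document Builder (assumed helper names, unigrams only).
import re
from typing import List


def normalize(ngram: str) -> str:
    return ngram.lower()                    # example normalization: down-casing


def accept_ngram(ngram: str) -> bool:
    return ngram.isalpha()                  # example rejection rule


def tokenize(source_text: str, token_space: List[str]) -> List[str]:
    """Present n-gram candidates that survive normalization/rejection and
    already exist in the TokenSpace."""
    space = set(token_space)
    candidates = re.findall(r"[A-Za-z]+", source_text)
    return [normalize(t) for t in candidates
            if accept_ngram(t) and normalize(t) in space]


def build_document(url: str, source_text: str, token_space: List[str]):
    """Return (url, matching n-grams, term vector) for the source-document."""
    ngrams = tokenize(source_text, token_space)
    vector = [ngrams.count(term) for term in token_space]
    return url, ngrams, vector
```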
For example, if one were to attempt to categorize the contents of a (fictional) web page URL (http://www.kittylitter.com), the entirety of the page content is essentially reduced to a set of normalized n-grams derived from this document-source.
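As a purely hypothetical illustration (the original example values are not reproduced here), such a page might reduce to the normalized n-grams {catnip, kitty, litter, pet, whiskers}, yielding against the TokenSpace L above a term vector such as [3, 0, 5, 0, 7, 2, 0, 1], where each entry counts the occurrences of the corresponding term of L in the page.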
The document source URL, n-grams, and term-vector are constructed into a Document. Once this Document is constructed, it is passed to a BucketMapper, which categorizes the Document by mapping it to the Centroids in the system.
This mapping by the BucketMapper is performed by finding the Centroid with the “nearest-neighbor” term-vector to the requested Document in the TokenSpace.
Given the definition of the dot product

a · b = |a| |b| cos(θ),

where θ is the angle between vectors a and b, it follows that

θ = arccos((a · b) / (|a| |b|)).
This formula is used to calculate the angles between each Centroid and the given Document, and the Centroid with the lowest angle is chosen as the Centroid for the Document. Since the Centroid is simply a normalized version of the ContextBucket, the desired mapping from source-document to ContextBucket exists.
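A sketch of this nearest-neighbor selection, using the angle formula above and the hypothetical vectors from the earlier example (all names and values illustrative), might be:

```python
# Illustrative BucketMapper: pick the Centroid whose term vector forms the
# smallest angle with the Document's term vector.
import math
from typing import Dict, List


def angle(a: List[int], b: List[int]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    mag = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if mag == 0.0:
        return math.pi / 2                  # no overlap: treat as orthogonal
    return math.acos(max(-1.0, min(1.0, dot / mag)))


def map_to_bucket(doc_vector: List[int],
                  centroid_vectors: Dict[str, List[int]]) -> str:
    # Return the name of the ContextBucket whose Centroid is nearest in angle.
    return min(centroid_vectors,
               key=lambda name: angle(doc_vector, centroid_vectors[name]))


centroids = {"Cats": [1, 0, 1, 0, 1, 1, 0, 1], "Dogs": [0, 1, 0, 1, 0, 1, 1, 0]}
doc = [3, 0, 5, 0, 7, 2, 0, 1]              # hypothetical kittylitter.com page
print(map_to_bucket(doc, centroids))        # -> "Cats"
```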
Now that the association between the source-document (ex: http://www.kittylitter.com), its Document, and the mapped Centroid/ContextBucket have been derived, the ContextBucket can be used in association with the delivery of any desired web content.
For example, all ContextBuckets can be associated with one or more pieces of ad content. Once the source-document has been mapped to a ContextBucket, the associated ad content can be delivered to the source-document.
While the preferred embodiment of the invention has been illustrated and described, as noted above, many changes can be made without departing from the spirit and scope of the invention. Accordingly, the scope of the invention is not limited by the disclosure of the preferred embodiment. Instead, the invention should be determined entirely by reference to the claims that follow.
This application claims the benefit of U.S. Provisional Application Ser. No. 60/973,393 filed Sep. 18, 2007 and U.S. Provisional Application Ser. No. 60/986,680 filed Nov. 9, 2007, the contents of which are hereby incorporated by reference.