Broad-matching of keywords has become an important technique for online advertising platforms, search engines, and other applications concerned with keyword relevancy. Broad-matching, also referred to as advanced matching, is the process of identifying keywords that are related or similar to a given keyword in a context such as a web page or query string. Broad-matched keywords may be used for a variety of applications.
While broad-matching has been used for advertising and other applications, there have been shortcomings in its use. For example, in the realm of online advertising, the keywords that are of interest to users may change rapidly. Current broad-match algorithms cannot keep up with these trends. Estimations of relevancy may quickly become inaccurate. Learning machines for finding and ranking relevant matches may require complete offline retraining when new training data is available. The most effective broad-matching algorithm for a given time or context may not always be used or emphasized. Furthermore, training data may need to be labeled by humans.
Techniques related to keyword broad-matching are discussed below.
The following summary is included only to introduce some of the concepts discussed in the Detailed Description below. This summary is not comprehensive and is not intended to delineate the scope of the claimed subject matter, which is set forth by the claims presented at the end.
A source keyword may be received multiple times and in response a machine-learning algorithm may be used to produce or train a ranker that ranks respective matching-keywords that have been determined to match the source keyword. A portion or unit of content may be generated based on one of the ranked matching-keywords. The content is transmitted via a network to a client device and a user's impression of the content is recorded. The machine-learning algorithm may continue to learn about matching-keywords for arbitrary source keywords from recorded impressions (e.g., clickthrough data) and in turn inform or train a ranking component that ranks keywords. The learning alters how the machine-learning algorithm evaluates matching-keywords determined to match the source keyword. It should be noted that “keyword” is used herein in a manner consistent with the meaning it conveys to those of ordinary skill in the art of keyword matching; “keyword” refers to a single word or a short phrase of words that form a semantic unit.
Many of the attendant features will be explained below with reference to the following detailed description considered in connection with the accompanying drawings.
The present description will be better understood from the following detailed description read in light of the accompanying drawings, wherein like reference numerals are used to designate like parts in the accompanying description.
Embodiments discussed below relate to a learning-based approach for broad-matching. Learning may be based on implicit feedback (learning samples), which may be user impressions of, or responses to, decisions by a learning machine, for example, advertisement clickthrough logs where ads have been selected based on decisions by the learning machine. Multiple arbitrary similarity functions (including various existing broad-match algorithms) may be used by incorporating them as features of the learning samples. A learning algorithm may be used to continuously revise a hypothesis for predicting the likelihood of user agreement with a match. When user feedback (e.g., click, hover, ignore, etc.) is consistent with a prediction of the hypothesis (e.g., a user clicks an ad selected per a broad-match/expanded keyword as predicted by the hypothesis), the hypothesis is strengthened. When user feedback is inconsistent with a prediction of the hypothesis, the hypothesis is weakened. In one embodiment, the learning algorithm may reduce the influence of older training data (samples) on the hypothesis, i.e., a sample's impact on the hypothesis may diminish as new samples are obtained.
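A minimal sketch of the feedback-to-label mapping follows (Python is used for the illustrative sketches in this description); treating click and hover as affirming responses and anything else as non-affirming is an assumption, not a detail of the embodiments:

```python
def feedback_to_label(feedback: str) -> int:
    """Map recorded user feedback to a training label. Feedback consistent
    with the hypothesis's prediction (e.g., click, hover) strengthens it (+1);
    inconsistent feedback (e.g., ignore) weakens it (-1)."""
    return +1 if feedback in {"click", "hover"} else -1
```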
As mentioned above, in the realm of web-based advertising, advertisements may be submitted as bids on specific keywords, where a bid is an amount an advertiser will pay for a user's click on the advertisement. When a bid-for keyword occurs in a delivery context, an advertisement may be selected among candidates based on amounts of bids, degree of relevancy or estimated probability of being clicked, and the like.
While learning machine 120 may have many uses (e.g., query replacement, providing a user with a list of candidate synonyms, etc.), in the system described here it serves the advertisement platform 122.
To expand the scope of an advertisement and to obviate the need for an advertiser to laboriously maintain a complete and up-to-date set of keywords for an advertising topic, the advertisement platform 122 may use the broad-match learning machine 120 to expand the scope of bid-for keywords. To do so, as indicated by the arrows between the learning machine 120 and the advertisement platform 122, the advertisement platform 122 may pass a source/input keyword to the broad-match learning machine 120. The broad-match learning machine 120, embodiments of which will be explained in detail further below, receives the input keyword (e.g., “skis”) and identifies broad-matching keywords, that is, keywords that one or more broad-match algorithms have determined to be similar to the input keyword (semantically, textually, etc.). The broad-match learning machine 120 evaluates the broad-match keywords using the learned hypothesis and ranks them according to their various features. In one embodiment, ranking is performed offline and ranked matches are accessed online with lookups. One or more of the top-ranked broad-match keywords are returned or transmitted (e.g., via a network, bus, etc.) to the advertisement platform 122, which then uses the broad-match keywords to select one or more advertisements. Note that the components of the system may be combined, divided, or distributed among one or more computing devices.
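For illustration, the serving path just described might be sketched as follows; the individual broad-match algorithms, the featurizer, and the learned weight vector are assumed interfaces rather than details from the embodiments:

```python
from typing import Callable, List, Sequence

def broad_match(input_keyword: str,
                matchers: Sequence[Callable[[str], List[str]]],
                featurize: Callable[[str, str], List[float]],
                weights: List[float],
                top_k: int = 3) -> List[str]:
    """Gather candidate keywords from several black-box broad-match
    algorithms, score each with the learned hypothesis (a linear weight
    vector over features), and return the top-ranked candidates."""
    candidates = {kw for matcher in matchers for kw in matcher(input_keyword)}
    def score(kw: str) -> float:
        return sum(w * f for w, f in zip(weights, featurize(input_keyword, kw)))
    return sorted(candidates, key=score, reverse=True)[:top_k]
```

In the offline-ranking embodiment mentioned above, the sorted results would be precomputed and then fetched by lookup at serving time.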
The advertisement platform 122 may receive input/source keywords from a variety of sources, for example, from search queries submitted by users or from the content of web pages requested by a client 124.
The user of client 124 views and possibly interacts (or declines to interact) with the content or the advertisement. The user's impression (reaction, response, etc.) is captured and logged. In the advertisement example, the user's impression may be recorded in the form of a clickthrough response (e.g., clicking, hovering, etc.), stored in a click-through log 128. Clickthrough may involve a server recording a request for a web page that originated from a known web page, or an advertisement, etc. In the click-through log 128 and/or a data store used by the broad-match learning machine 120, information is stored that correlates click-through log 128 entries with the corresponding broad-match keyword and input keyword that were used to select the advertisement to which the entry corresponds (i.e., the log entry and the input/match keyword pair are linked or stored together). The click-through log entries and their respective keyword pairs are then used to train the broad-match learning machine 120.
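One way to link log entries to keyword pairs is a record like the following sketch (the field names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class ClickLogEntry:
    """Illustrative click-through log record correlating an impression
    with the keyword pair used to select the advertisement."""
    input_keyword: str        # the original bid-for/source keyword
    broad_match_keyword: str  # the expanded keyword actually used
    ad_id: str                # the advertisement that was displayed
    clicked: bool             # the user's recorded impression/response
```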
The broad-match learning machine 120 receives a stream of training samples. A sample 130 may include a click-through log entry (a user's impression of or response to an ad) and a corresponding keyword pair (or information for linking to the same). The broad-match learning machine 120 uses the user's impression to revise the hypothesis that was used to select or rank the broad-match keyword. Generally, if the user's impression affirms or ratifies the hypothesis's previous determination, then the hypothesis is revised or updated to strengthen the predicted likelihood of the broad-match keyword. Conversely, if the user's impression does not affirm or ratify the previous determination, then the hypothesis is revised to reduce the rank of the broad-match keyword relative to other broad-match keywords matching the input/source keyword. Details of how hypotheses may be revised will be described further below.
As mentioned earlier, the broad-match learning machine may continue to operate even while re-learning from incoming samples; the broad-match learning machine may continue to handle keywords for clients 132 while learning/re-training for one or more keywords. In other words, the broad-match learning machine may be an online type of learning machine that can learn from its previous decisions (on-the-fly in some cases). In this document, “online learning” refers to the known class of learning algorithms. In one embodiment, a hypothesis of the learning algorithm may include information about the relative contributions of arbitrary “black-box” broad-match algorithms to the ranking/prediction for the corresponding broad-match keyword.
The content platform 146 can be any server/service that uses keywords to generate content and provide the content to users via a network 148. For example, in the case of an advertisement platform, keywords are matched to advertisements to select an advertisement. In other embodiments, the content may simply inform a search (e.g., by setting the search category to products), or the content may be a web page whose subject matter is informed by the received matching keywords 145. The platform 146 transmits the output or content 150 thus generated or selected. A client such as an e-mail application or browser 152 receives and displays the content 150 in some visible form such as text, video, an image, etc. The user's reaction or behavior with respect to the displayed content 150 is captured. For example, the user's response may be in the form of an amount of time that the content 150 was displayed, an indication of whether a pointer was hovered over the content, a log of subsequent web pages visited, an answer to a direct inquiry presented through the browser 152 (e.g., “is this the topic of interest to you?”), and so on.
The captured impression is eventually provided to the broad-match learning machine 142, for example in the form of logs 154, data tables, etc. The impression data may be transmitted directly to the broad-match learning machine 142, may be provided via the content platform 146, and so on. In turn, the broad-match learning machine 142 uses the impression log 154 or other form of feedback to learn; that is, it adjusts how it evaluates broad-matching keywords, generally by strengthening the weight of matches that were affirmed by the user and reducing the weight of matches that were rejected by the user. New samples or impressions may be given greater weight or impact than older samples, such that over time the impact, effect, or influence of prior impressions or samples fades. Details are discussed further below.
The selected 186 broad-match keyword is passed to a content platform 146 as discussed above. Based on the broad-match keyword, the content platform 146 generates or selects 190 content 192 and provides same to a user, recording the user's impression thereof and facilitating linking of the recorded impression with the recorded decision. For example, the recorded impression may include the broad-match keyword and the user's response thereto (e.g., “user clicked”).
A learning or training component 194 may update the recorded 188 hypothesis using the recorded impression, even while the hypothesis continues to be used or available for servicing other matching requests. The updating may be performed by reading 196 the impression, correlating 198 it with the previously recorded 188 decision, applying 200 a learning algorithm (e.g., a perceptron 202 or other algorithm, described below) to revise the hypothesis, which is then stored 204 and used for future matching for the received 180 input keyword.
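The training component's cycle might be sketched as follows, assuming an impression log, a table of previously recorded decisions keyed by a shared identifier, and a model exposing update() and save() methods; all names here are illustrative:

```python
def train_from_impressions(impression_log, recorded_decisions, model):
    """One pass of the training component: read each impression, correlate
    it with the recorded decision, apply the learning algorithm, and store
    the revised hypothesis for future matching."""
    for impression in impression_log:                        # reading 196 the impression
        decision = recorded_decisions[impression.ad_id]      # correlating 198 with the decision
        label = +1 if impression.clicked else -1             # affirmed vs. not affirmed
        model.update(decision.feature_vector, label)         # applying 200 the learning algorithm
    model.save()                                             # storing 204 the revised hypothesis
```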
As mentioned above, the features extracted or computed for a keyword that is to be broad-matched may include selections or estimations performed by multiple broad-match algorithms. That is, off-the-shelf or other broad-match algorithms, perhaps taking into account different aspects of keywords, such as lexical properties, context, etc. may be used.
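A sketch of such a feature vector follows; each off-the-shelf broad-match algorithm is wrapped as a similarity function contributing one feature, and the extra lexical-overlap feature is an illustrative assumption:

```python
from typing import Callable, List, Sequence

def make_feature_vector(source_kw: str, match_kw: str,
                        similarity_fns: Sequence[Callable[[str, str], float]]) -> List[float]:
    """Build a feature vector for a (source keyword, match keyword) pair in
    which each black-box broad-match algorithm supplies one feature."""
    features = [sim(source_kw, match_kw) for sim in similarity_fns]
    # An example of an additional lexical feature: shared-word count.
    features.append(float(len(set(source_kw.split()) & set(match_kw.split()))))
    return features
```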
Algorithm 260 is based on a modification of the max-margin voted perceptron algorithm, a discriminative online linear classifier described in detail elsewhere. Averaging may be used instead of voting, which may simplify computation. While the averaged perceptron is a robust, efficient classifier, it does not immediately account for drift, because its hypothesis is an average of all weight vectors observed in the past. Algorithm 260 modifies the averaged perceptron such that the hypothesis is a multiplicatively re-weighted mean. This effectively corresponds to averaging with an exponential time decay; a simplified form is described below.
The result, algorithm 260, which may be called the Amnesiac Averaged Perceptron (AAP), processes training examples as a stream, updating the current hypothesis (weights w) when a training example is misclassified (according to user clickthrough feedback, for example), with the update being based on hinge loss. The averaged hypothesis (weights w_avg) is maintained as a running average and is used for actual prediction. The amnesia rate α dictates how much influence recent examples have on the averaged hypothesis compared to past examples. After a certain number of examples, continuous scaling by α will lead to numeric overflow, which may be resolved by periodic scaling of w_avg, N, and η.
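The following minimal sketch illustrates one way the AAP update might be implemented, using the w, w_avg, N, α, and η notation above; the hyperparameter values, class interface, and rescaling threshold are illustrative assumptions rather than details of algorithm 260 itself:

```python
import numpy as np

class AmnesiacAveragedPerceptron:
    """Sketch of an amnesiac averaged perceptron: a perceptron whose
    prediction hypothesis is a multiplicatively re-weighted running mean."""

    def __init__(self, n_features: int, alpha: float = 1.0001, margin: float = 1.0):
        self.w = np.zeros(n_features)      # current hypothesis
        self.w_avg = np.zeros(n_features)  # decayed running sum of hypotheses (used for prediction)
        self.N = 0.0                       # running sum of example weights (normalizer)
        self.eta = 1.0                     # weight of the current example; grows by alpha each step
        self.alpha = alpha                 # amnesia rate: alpha > 1 favors recent examples
        self.margin = margin               # hinge-loss margin

    def predict(self, x: np.ndarray) -> float:
        # Uncalibrated margin score; dividing by N normalizes the decayed average.
        return float(self.w_avg @ x) / self.N if self.N > 0 else 0.0

    def update(self, x: np.ndarray, y: int) -> None:
        """x: feature vector; y: +1 (affirming feedback) or -1 (rejecting)."""
        if y * (self.w @ x) < self.margin:  # nonzero hinge loss: correct the hypothesis
            self.w = self.w + y * x
        self.w_avg += self.eta * self.w     # fold current hypothesis into the decayed average
        self.N += self.eta
        self.eta *= self.alpha              # newer examples receive exponentially more weight
        if self.eta > 1e100:                # periodic scaling of w_avg, N, and eta avoids overflow
            self.w_avg /= self.eta
            self.N /= self.eta
            self.eta = 1.0
```

Because η grows multiplicatively, each new example's contribution outweighs older ones, which produces the exponential time decay described above.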
A simplified form of algorithm 260 will now be described. Given samples x_1, x_2, . . . x_n, where x_1 is the oldest sample and x_n is the newest sample, a current weight vector w (hypothesis), at the time of the nth sample, will be equal or proportional to:

α^(n−1)·x_1 + α^(n−2)·x_2 + . . . + α·x_(n−1) + x_n
where α is the amnesia or decay factor. Other techniques may be used to effectuate decay; the present technique is offered as an efficient and simple choice. Other online learning classifiers may be modified for similar effect. The hypothesis is a running statistic in that, at the time of any sample x_i, the previous samples are already reflected in the values of the current weights of the hypothesis w; the previous samples' contributions are captured in the current w and need not be separately maintained.
Because the algorithm produces uncalibrated predictions of clicks (or of other feedback, such as hovers, protracted display, etc.), a sigmoid calibration may be employed to convert predictions to actual probabilities; sigmoid calibration is known to be effective for converting the output of max-margin classifiers to probabilities.
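A sketch of the sigmoid (Platt-style) calibration follows, where parameters A and B are assumed to be fit on held-out (score, click) pairs, e.g., by maximum likelihood:

```python
import math

def calibrated_click_probability(score: float, A: float, B: float) -> float:
    """Map an uncalibrated max-margin score to a probability of a click
    (or other affirming feedback) via a fitted sigmoid."""
    return 1.0 / (1.0 + math.exp(A * score + B))
```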
The learning process may be improved by incorporating feature selection. Given a large number of candidate features (see feature vector 232), some features may add little predictive value or even degrade accuracy, so selecting a subset of informative features can improve both quality and efficiency.
Greedy feature selection may be used, based on a holdout set. Greedy feature selection begins with a set of selected features, S (which is initially empty). For each feature f_i not yet in S, a model is trained and evaluated using the feature set S ∪ {f_i}. The feature that provides the largest performance gain is added to S, and the process is repeated until no single feature improves performance. Feature selection may be conducted in an online fashion when evaluating the quality of each individual feature.
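The greedy procedure might be sketched as follows, assuming a caller-supplied train_and_evaluate function that trains a model on a feature subset and scores it on the holdout set:

```python
from typing import Callable, Set

def greedy_feature_selection(candidates: Set[str],
                             train_and_evaluate: Callable[[Set[str]], float]) -> Set[str]:
    """Forward greedy selection: repeatedly add the single feature with the
    largest holdout gain until no feature improves performance."""
    selected: Set[str] = set()
    best_score = train_and_evaluate(selected)  # baseline with no features
    while True:
        best_feature, best_gain = None, 0.0
        for f in candidates - selected:
            gain = train_and_evaluate(selected | {f}) - best_score
            if gain > best_gain:
                best_feature, best_gain = f, gain
        if best_feature is None:  # no single feature improves performance
            return selected
        selected.add(best_feature)
        best_score += best_gain
```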
Embodiments and features discussed above can be realized in the form of information stored in volatile or non-volatile computer or device readable media. This is deemed to include at least media such as optical storage (e.g., CD-ROM), magnetic media, flash ROM, or any current or future means of storing digital information. The stored information can be in the form of machine executable instructions (e.g., compiled executable binary code), source code, bytecode, or any other information that can be used to enable or configure computing devices to perform the various embodiments discussed above. This is also deemed to include at least volatile memory such as RAM and/or virtual memory storing information such as CPU instructions during execution of a program carrying out an embodiment, as well as non-volatile media storing information that allows a program or executable to be loaded and executed. The embodiments and features can be performed with the memory and processor(s) of any type of computing device, including portable devices, workstations, servers, mobile wireless devices, and so on.