With the wide adoption of search engines, such as MS Live Search, search engine advertising has become an increasingly important tool for businesses to reach consumers. Search engine advertising often involves placing a banner advertisement or sponsored link in a prominent place among a number of search results. The sponsored advertisement or link is typically chosen based on bidding for keywords associated with user queries submitted to websites. An advertiser winning the bid for a given keyword will have its advertisement or link displayed when a user enters that keyword in a search query.
To select an optimal set of keywords for bidding, advertisers often utilize keyword tools. These tools typically provide a number of keyword statistics, such as search volume, cost per click, search volume trends, and estimated advertisement position, based on advertisement click-through data, and enable an advertiser to see the sources from which traffic has been generated.
In various embodiments, a computing device is configured to facilitate selection of keywords for bidding by an advertiser of a website. To facilitate selection, the computing device may process a click-through log to determine measures of competitiveness for a plurality of websites extracted from the click-through log. In some embodiments, the computing device may then, for one of the websites, determine a ranking of competing websites based at least in part on the measures of competitiveness. Also, in various embodiments, the computing device may, for a concept keyword of interest to an advertiser of one of the websites, determine a ranking of competing websites for that concept keyword based at least in part on the measures of competitiveness. Further, in some embodiments, the processing may further comprise determining one or more concept keywords for each of the plurality of websites, each concept keyword-website pair having an associated score, and calculating the measures of competitiveness based at least in part on the associated scores.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Non-limiting and non-exhaustive examples are described with reference to the following Figures:
In various embodiments, the competitive analysis 202 may process a click-through log containing entries for queries 208 and websites 210 to determine measures of competitiveness for the websites 210. In some embodiments, this process may involve determining one or more concept keywords for each website 210, creating a bipartite graph of the concept keywords and websites 210, and performing a Markov walk algorithm on the graph to calculate the measures of competitiveness. These operations are described in greater detail below with reference to
In various embodiments, the search server 302 may be any sort of computing device or devices known in the art, such as personal computers (PCs), laptops, servers, phones, personal digital assistants (PDAs), set-top boxes, and data centers. For example, search server 302 may be a server associated with Microsoft Windows Live Search or some other search application. Search server 302 may provide users with search capabilities, allowing users to enter search queries and receive, in response, a plurality of search results. In various embodiments, the search results may include the banner ads and sponsored links described above with regard to
In various embodiments, click-through log 304 can be a file of any format known in the art. For example, click-through log 304 may be a database file, a plain-text file, or an XML file. Further, click-through log 304 may comprise lists of queries and the websites that a user clicked through to in response to receiving the queries' search results. For example, click-through log 304 may comprise a table having queries in one column and websites in another column. A given query or website may repeat in a number of rows of the table, as one query might lead to click-throughs to several websites, and one website may be clicked through to based on several queries. Table 1, below, illustrates an exemplary table of a click-through log 304. In some embodiments, in addition to queries and websites, the click-through log 304 may also store a frequency for each query-website pair, the frequency being the number of times that the query resulted in a click-through to the website.
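As a minimal sketch of the query-website-frequency layout described above, the following Python snippet aggregates raw click events into per-pair frequencies. The event list, the domain names, and the helper names `build_click_through_log` and `queries_for` are hypothetical and serve only as illustration; an actual click-through log 304 may be stored in any format.

```python
from collections import Counter

# Hypothetical raw click events: (query, website clicked through to).
clicks = [
    ("cheap airline tickets", "travel-a.example"),
    ("airline tickets", "travel-a.example"),
    ("airline tickets", "travel-b.example"),
    ("cheap airline tickets", "travel-a.example"),
]

def build_click_through_log(events):
    """Count how many times each (query, website) pair was clicked through."""
    return Counter(events)

def queries_for(log, website):
    """Return the query corpus of one website as {query: frequency}."""
    return {q: f for (q, w), f in log.items() if w == website}

log = build_click_through_log(clicks)
corpus_a = queries_for(log, "travel-a.example")
```

Here `corpus_a` plays the role of the per-website query corpus from which concept keywords may later be extracted.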
As shown in
Also, in some embodiments, search server 302 and computing device 306 may be connected by at least one networking fabric (not shown). For example, the server 302 and device 306 may be connected by a local area network (LAN), a public or private wide area network (WAN), and/or by the Internet. In some embodiments, the server 302 and device 306 may implement between themselves a virtual private network (VPN) to secure the communications. Also, the server 302 and device 306 may utilize any communications protocol known in the art, such as the Transmission Control Protocol/Internet Protocol (TCP/IP) set of protocols. In other embodiments, rather than being coupled by a networking fabric, the server 302 and device 306 may be locally or physically coupled.
As is further illustrated in
In various embodiments, concept keyword determination module 310 (hereinafter “keyword module 310”) may determine one or more concept keywords for at least some of the websites appearing in the click-through log 304. A concept keyword may, for example, be a phrase that appears in several of the queries associated with a website and be an independent n-gram that has a semantic meaning. Further, the concept keyword may not be a navigational word or stop word. To determine the concept keywords for each website, keyword module 310 may first create, for each website, a PAT tree of the queries associated with that website. The keyword module 310 then calculates association scores for n-grams extracted from those queries and applies a local maxima algorithm to select the n-grams with the highest association scores as concept keywords. Next, the keyword module 310 filters out navigational words and stop words from the concept keywords, and calculates scores for each concept keyword based on its frequency of appearance among the queries for the website. Then, the keyword module 310 may select the top K concept keywords with the highest scores as the one or more concept keywords for the website. Keyword module 310 may then repeat these operations for some or all of the other websites listed in the click-through log 304.
As mentioned, keyword module 310 may first create, for each website, a PAT tree (“PAT tree” is an abbreviation for “Patricia tree”) of the queries associated with that website. Keyword module 310 may organize the queries into a PAT tree, in some embodiments, to facilitate efficient retrieval of n-grams from the queries. PAT trees are well-known to those of ordinary skill in the art and accordingly will not be described further.
In various embodiments, keyword module 310 may then retrieve n-grams from the PAT tree. Each n-gram may be a sequence of one or more terms t1, . . . , tn extracted from one or more queries of the query corpus organized by the PAT tree. Upon retrieving/extracting each n-gram, keyword module 310 may calculate a symmetric conditional probability (SCP) score for that n-gram. The keyword module 310 may use the SCP score to estimate the degree of association of the substrings comprising an n-gram. In some embodiments, the SCP score for an n-gram may be defined as:
where tj is a term, t1, . . . , tn is a sequence of terms comprising an n-gram, and p(t1, . . . , tn) is a probability of the occurrence of the n-gram t1, . . . , tn in the query corpus of the website. In some embodiments, if each substring of an n-gram has a similar occurrence to the n-gram, the SCP score for that n-gram will be high, indicating a strong degree of cohesion for that n-gram. For example, if the n-gram “airline tickets” appears 1000 times, and the substrings, “airline” and “tickets” each also appear 1000 times, that would indicate that the substrings only tend to appear together, as the n-gram. Such an n-gram will have a high SCP score, with what is considered “high” varying from embodiment to embodiment.
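One form of SCP commonly used in the multiword-extraction literature, and consistent with the symbols defined above, divides the squared n-gram probability by the average, over all n−1 split points, of the product of the probabilities of the two resulting substrings: SCP(t1, . . . , tn) = p(t1, . . . , tn)^2 / [(1/(n−1)) Σ_{i=1..n−1} p(t1, . . . , ti) p(ti+1, . . . , tn)]. The Python sketch below assumes that form; it is an illustration, not necessarily the exact definition used by keyword module 310. It also assumes every probability is estimated as a count divided by the same corpus size, in which case the corpus size cancels and raw counts can be used directly. The function names are hypothetical, and a plain counter stands in for the PAT tree.

```python
from collections import Counter

def ngram_counts(queries, max_n=3):
    """Count every n-gram (n <= max_n) over a corpus given as {query: frequency}."""
    counts = Counter()
    for query, freq in queries.items():
        terms = query.split()
        for n in range(1, max_n + 1):
            for i in range(len(terms) - n + 1):
                counts[tuple(terms[i:i + n])] += freq
    return counts

def scp(ngram, counts):
    """Assumed SCP form: joint^2 / average over split points of prefix * suffix."""
    if len(ngram) < 2:
        raise ValueError("this sketch only scores n-grams with n >= 2")
    joint = counts[ngram]
    splits = [counts[ngram[:i]] * counts[ngram[i:]] for i in range(1, len(ngram))]
    avg = sum(splits) / len(splits)
    return joint ** 2 / avg if avg else 0.0

# "airline" and "tickets" only ever appear together, as in the example above.
cohesive = scp(("airline", "tickets"), ngram_counts({"airline tickets": 1000}))

# "airline" also appears in another phrase, so the bigram is less cohesive.
mixed = ngram_counts({"airline tickets": 1000, "airline food": 1000})
weaker = scp(("airline", "tickets"), mixed)
```

Under these assumptions the first corpus yields the maximum score of 1.0 for “airline tickets”, while the second yields 0.5.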
In some embodiments, after calculating the SCP score for each n-gram, the keyword module 310 may calculate the context dependency (CD) score for each n-gram. The CD score may help measure the lexical boundaries for each n-gram. In some embodiments, the CD score for an n-gram may be defined as:
where tj is a term, t1, . . . , tn is a sequence of terms comprising an n-gram, LC(t1, . . . , tn) is the number of unique left adjacent words appearing in the query corpus of the website, and RC(t1, . . . , tn) is the number of unique right adjacent words appearing in the query corpus of the website. LC( ) or RC( ) are equal to the frequency of the n-gram if there are no left adjacent or right adjacent words, respectively. The CD score can be used to determine if the n-gram is dependent on a certain string containing it. For example, if the n-gram only occurs when the string including it occurs, the score of the n-gram may be close to 0.
The keyword module 310 may then combine the SCP and CD scores by multiplying the SCP and CD scores together for each n-gram to arrive at an association/SCPCD score for each n-gram.
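A minimal sketch of the CD and SCPCD calculations follows. It assumes the CD score takes the form LC(t1, . . . , tn) × RC(t1, . . . , tn) / freq(t1, . . . , tn)^2, where freq is the frequency of the n-gram in the query corpus of the website; this form is an assumption chosen to match the behavior described above (a score near 0 for an n-gram that only occurs inside one longer string), and the function names are hypothetical.

```python
def context_stats(queries, ngram):
    """Return (freq, LC, RC) for an n-gram over a corpus given as {query: frequency}."""
    n = len(ngram)
    freq = 0
    left, right = set(), set()
    for query, qfreq in queries.items():
        terms = query.split()
        for i in range(len(terms) - n + 1):
            if tuple(terms[i:i + n]) == ngram:
                freq += qfreq
                if i > 0:
                    left.add(terms[i - 1])      # unique left-adjacent word
                if i + n < len(terms):
                    right.add(terms[i + n])     # unique right-adjacent word
    # Per the description, LC/RC equal the n-gram frequency when no adjacent word exists.
    lc = len(left) if left else freq
    rc = len(right) if right else freq
    return freq, lc, rc

def cd(queries, ngram):
    """Assumed CD form: LC * RC / freq^2."""
    freq, lc, rc = context_stats(queries, ngram)
    return (lc * rc) / (freq ** 2) if freq else 0.0

def scpcd(scp_score, cd_score):
    """Association score: the product of the SCP and CD scores."""
    return scp_score * cd_score

corpus = {"new york hotels": 50, "new york": 50}
dependent = cd(corpus, ("york",))                     # always preceded by "new"
standalone = cd({"new york": 100}, ("new", "york"))   # no adjacent words at all
```

In this toy corpus, “york” always follows “new”, so its CD score is close to 0, whereas a phrase with no adjacent words scores 1.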
In various embodiments, after calculating the SCPCD scores for each n-gram, the keyword module 310 may apply a local maxima algorithm to the n-grams to select a number of n-grams having the highest SCPCD scores. Utilizing this algorithm, the keyword module 310 may compare the SCPCD score of an n-gram to those of its antecedent and successor n-grams. The antecedent n-gram may be a substring of the n-gram under consideration, having one less term than the n-gram under consideration. For example, if the n-gram is t1, . . . , tn, its antecedent n-gram may be t2, . . . , tn. The successor n-gram may be a string containing the n-gram under consideration, having one more term than the n-gram under consideration. For example, if the n-gram is t1, . . . , tn, its successor n-gram may be t1, . . . , tn+1. Keyword module 310 compares the score of the n-gram to the scores of its antecedent and successor n-grams, and if the score of the n-gram is a local maximum (i.e., is higher than those of the antecedent and successor), the n-gram is selected as a concept keyword. In some embodiments, the local maxima algorithm may be “relaxed” if the n-gram appears with a frequency exceeding some pre-determined threshold (i.e., even if the n-gram is not a local maximum, it may still be selected if it appears often enough).
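The selection step can be sketched as follows. This is an illustrative reading, not a definitive implementation: it assumes antecedents and successors are taken on both the left and the right side of the n-gram (the description gives one example of each), and the function name, parameter names, and scores are hypothetical.

```python
def select_local_maxima(scores, freqs=None, relax_threshold=None):
    """Select n-grams whose SCPCD score beats all antecedents and successors.

    scores: {ngram tuple: SCPCD score}; freqs: optional {ngram tuple: frequency}.
    """
    selected = []
    for gram, score in scores.items():
        # Antecedents: the (n-1)-grams obtained by dropping the first or last term.
        antecedents = [g for g in (gram[1:], gram[:-1]) if g and g in scores]
        # Successors: the (n+1)-grams that extend the n-gram by one term.
        successors = [g for g in scores
                      if len(g) == len(gram) + 1 and (g[1:] == gram or g[:-1] == gram)]
        is_max = all(score > scores[g] for g in antecedents + successors)
        # Relaxation: a sufficiently frequent n-gram is kept even if not a local maximum.
        frequent = (relax_threshold is not None and freqs is not None
                    and freqs.get(gram, 0) > relax_threshold)
        if is_max or frequent:
            selected.append(gram)
    return selected

scores = {
    ("airline",): 0.2,
    ("tickets",): 0.3,
    ("airline", "tickets"): 0.9,
    ("cheap", "airline", "tickets"): 0.4,
    ("cheap",): 0.1,
    ("cheap", "airline"): 0.15,
}
strict = select_local_maxima(scores)
relaxed = select_local_maxima(scores, freqs={("cheap",): 500}, relax_threshold=100)
```

With these illustrative scores, only “airline tickets” is a local maximum; the relaxed variant additionally keeps “cheap” because its frequency exceeds the threshold.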
In various embodiments, after selecting a number of n-grams as concept keywords, the keyword module 310 may filter out keywords having navigation roles. Keywords may have navigational roles if they contain terms similar to the URL of the website. To compute whether a term is navigational, the keyword module 310 may use the Levenshtein distance between the URL and the term. If the term is navigational, the keyword module 310 may filter the keyword associated with it out of the set of selected concept keywords. In some embodiments, however, before filtering out a keyword containing a navigational term, the keyword module 310 may check if the navigational term is present in a dictionary of terms determined to be “meaningful”, such as “games”, “weather”, or “shoes”, with what is “meaningful” varying from embodiment to embodiment. Also, in various embodiments, the keyword module 310 may filter out concept keywords that consist only of stop words.
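The filtering step might look like the sketch below. The normalized-distance threshold `max_ratio`, the stop-word list, the use of a single URL label (for example, the domain name without its suffix), and all function names are assumptions made for illustration; the description only states that the Levenshtein distance between the URL and the term is used.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

STOP_WORDS = {"the", "of", "a", "in", "for", "to"}  # illustrative list
MEANINGFUL = {"games", "weather", "shoes"}          # example terms from the description

def is_navigational(term, url_label, max_ratio=0.3):
    """Treat a term as navigational if it is close to the URL label (assumed threshold)."""
    if term in MEANINGFUL:
        return False  # dictionary check: meaningful terms are never filtered
    dist = levenshtein(term.lower(), url_label.lower())
    return dist / max(len(term), len(url_label)) <= max_ratio

def filter_keywords(keywords, url_label):
    """Drop keywords made only of stop words or containing a navigational term."""
    kept = []
    for kw in keywords:
        terms = kw.split()
        if all(t in STOP_WORDS for t in terms):
            continue
        if any(is_navigational(t, url_label) for t in terms):
            continue
        kept.append(kw)
    return kept

kept = filter_keywords(
    ["example login", "exampel", "airline tickets", "of the", "weather"], "example")
```

Normalizing the distance by the longer string's length is one design choice among several; it lets near-misspellings of the site name, such as “exampel”, be treated as navigational as well.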
In some embodiments, after filtering the selected concept keywords, the keyword module 310 may calculate scores for each of the concept keywords. The score may be unique to the pair of each concept keyword and a website (since the same concept keyword may be determined for multiple websites, and have a different score for each). In various embodiments, keyword module 310 may calculate the score for each concept keyword based on the frequency of appearance of the concept keyword within the query corpus of the website for which the concept keyword was determined. In some embodiments, after calculating the scores, the keyword module 310 may select the top K scoring concept keywords as the one or more concept keywords determined for the website.
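A minimal sketch of frequency-based scoring and top-K selection follows. It assumes the score of a concept keyword is the summed frequency of the website's queries that contain the keyword as a contiguous phrase; the function name and sample data are hypothetical.

```python
from collections import Counter

def top_k_keywords(queries, keywords, k=3):
    """Score each keyword by the total frequency of queries containing it; keep the top k."""
    scores = Counter()
    for kw in keywords:
        kw_terms = kw.split()
        n = len(kw_terms)
        for query, freq in queries.items():
            terms = query.split()
            # The keyword must appear as a contiguous run of terms in the query.
            if any(terms[i:i + n] == kw_terms for i in range(len(terms) - n + 1)):
                scores[kw] += freq
    return scores.most_common(k)

queries = {"cheap airline tickets": 30, "airline tickets": 50, "hotel deals": 10}
best = top_k_keywords(queries, ["airline tickets", "hotel deals", "cheap"], k=2)
```

The resulting (keyword, score) pairs are specific to one website, matching the keyword-website pair scores described above.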
As further illustrated by
In various embodiments, calculation module 312 may first generate a bipartite graph of the concept keywords and websites. The bipartite graph may comprise two partitions: one for the concept keywords and another for the websites. Each concept keyword and website may be represented by a node. The concept keyword nodes may each be connected to one or more websites by an edge, and the websites may be connected by those same edges to one or more concept keywords. Also, each edge may be associated with a score of the concept keyword-website pair that it represents, those scores described in greater detail above.
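As an illustrative sketch, the bipartite graph can be held as a score matrix with one row per website and one column per concept keyword, where a zero entry means that no edge exists. The input layout, the function name, and the sample scores are assumptions.

```python
def build_bipartite_graph(pair_scores):
    """Build a website-by-keyword score matrix from {(website, keyword): score}."""
    websites = sorted({w for w, _ in pair_scores})
    keywords = sorted({c for _, c in pair_scores})
    w_index = {w: j for j, w in enumerate(websites)}
    c_index = {c: k for k, c in enumerate(keywords)}
    # matrix[j][k] holds the score of the edge between website j and keyword k.
    matrix = [[0.0] * len(keywords) for _ in websites]
    for (w, c), score in pair_scores.items():
        matrix[w_index[w]][c_index[c]] = float(score)
    return websites, keywords, matrix

websites, keywords, matrix = build_bipartite_graph({
    ("travel-a.example", "airline tickets"): 3,
    ("travel-a.example", "cheap flights"): 1,
    ("travel-b.example", "airline tickets"): 2,
})
```

Each edge is stored once, so the same score is read whether the edge is traversed from website to keyword or from keyword to website.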
An exemplary bipartite graph is illustrated by
In various embodiments, after creating the bipartite graph, calculation module 312 may perform a Markov walk algorithm on the graph. As a preliminary to performing the algorithm, however, the calculation module 312 may first calculate transition probability matrices based on the scores associated with each edge. For a graph with n concept keywords and m websites, there is an m×n symmetric matrix of scores. The matrix would be symmetric because the score for entry m1n1 would be the same as the score for n1m1. Once the score matrix is defined, the calculation module 312 may use it to define two transition probability matrices. The first transition probability matrix includes transition probabilities from a website wj at a time t to a concept keyword ck at time t+1 (with j ranging from 1 to m and k ranging from 1 to n). The probabilities of the first matrix may be defined to normalize out wj, such that:
where sjk is the score entry in the m×n matrix at wjck, Pt+1|t (ck|wj) denotes the transition probability from wj at a time t to ck at time t+1, and wherein i ranges over all concept keywords connected to wj. Based on the defined probabilities, the first matrix Pwc may be defined as [Pt+1|t (ck|wj)]jk. The matrix Pwc would also be of size m×n and would be row stochastic (i.e., the entries for a given row would sum to 1).
The second transition probability matrix includes transition probabilities from a concept keyword ck at a time t to a website wj at time t+1. The probabilities of the second matrix may be defined to normalize out ck, such that:
where sjk is the score entry in the m×n matrix at wjck, Pt+1|t (wj|ck) denotes the transition probability from ck at a time t to wj at time t+1, and wherein i ranges over all websites connected to ck. Based on the defined probabilities, the second matrix Pcw may be defined as [Pt+1|t (wj|ck)]kj. The matrix Pcw would be of size n×m and would also be row stochastic (i.e., the entries for a given row would sum to 1).
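One normalization consistent with the description above divides each score by the sum of the scores on the edges leaving the source node: Pt+1|t (ck|wj) = sjk / Σi sji for the first matrix and Pt+1|t (wj|ck) = sjk / Σi sik for the second. The following sketch assumes these forms, which follow from the row-stochastic property stated above; the function names are hypothetical.

```python
def row_normalize(matrix):
    """Scale each row so that its entries sum to 1 (all-zero rows are left as zeros)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([x / total if total else 0.0 for x in row])
    return normalized

def transition_matrices(scores):
    """scores[j][k] is the score of website j with keyword k (an m x n matrix)."""
    p_wc = row_normalize(scores)                      # m x n, website -> keyword
    transposed = [list(col) for col in zip(*scores)]  # n x m
    p_cw = row_normalize(transposed)                  # n x m, keyword -> website
    return p_wc, p_cw

p_wc, p_cw = transition_matrices([[2, 2, 0],
                                  [0, 1, 3]])
```

For this two-website, three-keyword example, `p_wc` is 2×3 and `p_cw` is 3×2, and every row of both matrices sums to 1.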
After defining the two probability matrices, the calculation module 312 may then define an initial vector v0 by assigning an initial value to each website. In calculating the vector v0, the calculation module 312 may select one of the websites as a “seed node”. In some embodiments, calculation module 312 may select the website for which competitors are to be determined as the “seed node”. The seed node is assigned a value of 1, and all other nodes in the vector (i.e., all other websites in the graph) are assigned values of 0.
With the vector v0 and probability matrices Pwc and Pcw as inputs, calculation module 312 may perform a Markov walk algorithm. The Markov walk may initialize a variable v to v0 and then repeat, until a convergence point is reached, the following operations:
compute u = Pwc^T v;
compute v = α Pcw^T u + (1−α) v0, where α ∈ [0,1), and where ^T denotes the matrix transpose.
For example, referring again to
In various embodiments, the Markov walk may be considered complete when v asymptotically converges to a result vector v*. The result vector v* may also be a one-dimensional vector with most or all of the websites having a score/weight between 0 and 1, and the sum of all weights/scores equaling 1. These scores may represent the posterior probabilities that a website wj is associated with the seed node (the website initially assigned a value of 1). Since these posterior probabilities may reflect a degree of competition with the seed node, they may serve as measures of competitiveness/competition scores for each website.
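The iteration can be sketched in plain Python as follows. The value of α, the convergence tolerance, the iteration cap, and the sample matrices are assumptions chosen for illustration; the description only requires α ∈ [0,1) and repetition until a convergence point is reached.

```python
def markov_walk(p_wc, p_cw, seed, alpha=0.85, tol=1e-10, max_iter=1000):
    """Iterate u = Pwc^T v and v = alpha * Pcw^T u + (1 - alpha) * v0 until v stabilizes."""
    m, n = len(p_wc), len(p_cw)
    v0 = [1.0 if j == seed else 0.0 for j in range(m)]  # seed node gets 1, others 0
    v = v0[:]
    for _ in range(max_iter):
        # Propagate website weights to the concept keywords.
        u = [sum(p_wc[j][k] * v[j] for j in range(m)) for k in range(n)]
        # Propagate back to the websites and mix in the initial vector.
        new_v = [alpha * sum(p_cw[k][j] * u[k] for k in range(n)) + (1 - alpha) * v0[j]
                 for j in range(m)]
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v

# Three websites and three keywords; website 2 shares no keyword with the others.
p_wc = [[0.75, 0.25, 0.0],
        [0.5, 0.5, 0.0],
        [0.0, 0.0, 1.0]]
p_cw = [[0.6, 0.4, 0.0],
        [1 / 3, 2 / 3, 0.0],
        [0.0, 0.0, 1.0]]
v_star = markov_walk(p_wc, p_cw, seed=0)
```

In this toy graph the weights sum to 1, website 1 receives a substantial weight because it shares keywords with the seed, and the disconnected website 2 receives a weight of 0.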
As is further illustrated by
In various embodiments, after determining the ranking, ranking module 314 may also determine keyword groupings of competing websites. To determine concept keywords to select for groupings, the ranking module 314 may propagate the measures of competitiveness from the nodes of the bipartite graph associated with the competing websites to the concept keywords associated with those websites. As with the Markov walk algorithm above, the propagation may be based on the transition probabilities from the websites at time t to the concept keywords at time t+1. After propagating the measures of competitiveness to the concept keywords, the ranking module 314 may select the top N concept keywords, based on the propagated scores, as keywords around which to build keyword groupings. Each keyword grouping may comprise such a selected concept keyword and the top competing websites for that concept keyword. After selecting the concept keywords, the ranking module 314 may determine the top competing websites for each concept keyword. In various embodiments, the ranking module 314 may determine the top competing websites for a concept keyword based on the scores associated with each concept keyword-website pair or based on transition probabilities. The websites with the highest scores/transition probabilities for a concept keyword may be selected as the websites comprising the keyword grouping.
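A minimal sketch of building keyword groupings follows. It assumes a single propagation step from websites to keywords and ranks the websites of each keyword by transition probability (one of the two options mentioned above); the function name, parameter names, and sample values are hypothetical.

```python
def keyword_groupings(v, p_wc, p_cw, keywords, websites, top_n=2, top_sites=2):
    """Group top competing websites around the top-N propagated concept keywords."""
    m, n = len(websites), len(keywords)
    # One propagation step: push each website's competition score to its keywords.
    u = [sum(p_wc[j][k] * v[j] for j in range(m)) for k in range(n)]
    top_keywords = sorted(range(n), key=lambda k: u[k], reverse=True)[:top_n]
    groupings = {}
    for k in top_keywords:
        # Rank websites for this keyword by keyword -> website transition probability.
        ranked = sorted(range(m), key=lambda j: p_cw[k][j], reverse=True)[:top_sites]
        groupings[keywords[k]] = [websites[j] for j in ranked if p_cw[k][j] > 0]
    return groupings

p_wc = [[0.75, 0.25, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
p_cw = [[0.6, 0.4, 0.0], [1 / 3, 2 / 3, 0.0], [0.0, 0.0, 1.0]]
groups = keyword_groupings(
    v=[0.58, 0.42, 0.0],  # illustrative measures of competitiveness
    p_wc=p_wc, p_cw=p_cw,
    keywords=["airline tickets", "cheap flights", "garden tools"],
    websites=["site-a", "site-b", "site-c"],
)
```

Here the two keywords reachable from the competing websites form the groupings, and the unrelated keyword is left out because its propagated score is 0.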
For example,
As is further shown in
The computing device may then determine one or more concept keywords for each of a plurality of websites extracted from the click-through log, block 404. The determining of the one or more concept keywords, block 404, is further illustrated by
In some embodiments, the computing device may then calculate associated scores for each concept keyword-website pair based on the frequencies with which queries extracted from the click-through log resulted in click-throughs to websites, block 406.
In various embodiments, the computing device may then calculate measures of competitiveness for the plurality of websites based at least in part on the associated scores, block 408. The calculating, block 408, is further illustrated by
As shown in
In various embodiments, the computing device may then propagate measures of competitiveness to nodes of the concept keywords in a bipartite graph (described in
After selecting the concept keywords, the computing device may then select a number of websites associated with the selected number of concept keywords to create keyword groupings of competing websites, block 414.
Next, the determining may include retrieving n-grams from the queries and calculating scores for the n-grams, block 404b. In some embodiments, the n-gram scores may include one or both of symmetrical conditional probabilities and/or context dependencies.
In various embodiments, the computing device may then apply a local maxima algorithm to the n-grams and, based on results of the algorithm, select one or more of the n-grams as the one or more concept keywords, block 404c.
The computing device may then filter out navigational keywords from the concept keywords based on comparisons of the concept keywords to website identifiers, block 404d, and/or filter out stop words from the concept keywords, block 404e.
The calculating may further include performing a Markov walk algorithm on the bipartite graph, block 408b. In some embodiments, performing the Markov walk algorithm may further include propagating a weight assigned to a seed node of the bipartite graph between partitions of the bipartite graph based on the concept keyword-website pair scores until a convergence point is reached.
In a very basic configuration, computing device 700 may include at least one processing unit 702 and system memory 704. Depending on the exact configuration and type of computing device, system memory 704 may be volatile (such as RAM), non-volatile (such as ROM, flash memory, etc.) or some combination of the two. System memory 704 may include an operating system 705, one or more program modules 706, and may include program data 707. The operating system 705 may include a component-based framework 720 that supports components (including properties and events), objects, inheritance, polymorphism, reflection, and provides an object-oriented component-based application programming interface (API), such as that of the .NET™ Framework manufactured by Microsoft Corporation, Redmond, Wash. The device 700 may be of a configuration demarcated by a dashed line 708.
Computing device 700 may also have additional features or functionality. For example, computing device 700 may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Such additional storage is illustrated in
Computing device 700 may also contain communication connections 716 that allow the device to communicate with other computing devices 718, such as over a network. Communication connections 716 are one example of communication media. Communication media may typically be embodied by computer readable instructions, data structures, program modules, etc.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
References are made in the detailed description to the accompanying drawings that are part of the disclosure and which illustrate embodiments. Other embodiments may be utilized and structural or logical changes may be made without departing from the scope of the disclosure. Therefore, the detailed description and accompanying drawings are not to be taken in a limiting sense, and the scope of embodiments is defined by the appended claims and equivalents.
Various operations may be described, herein, as multiple discrete operations in turn, in a manner that may be helpful in understanding embodiments; however, the order of description should not be construed to imply that these operations are order-dependent. Also, embodiments may have fewer operations than described. A description of multiple discrete operations should not be construed to imply that all operations are necessary.
The description may use perspective-based descriptions such as up/down, back/front, and top/bottom. Such descriptions are merely used to facilitate the discussion and are not intended to restrict the scope of embodiments.
The terms “coupled” and “connected,” along with their derivatives, may be used herein. These terms are not intended as synonyms for each other. Rather, in particular embodiments, “connected” may be used to indicate that two or more elements are in direct physical or electrical contact with each other. “Coupled” may mean that two or more elements are in direct physical or electrical contact. However, “coupled” may also mean that two or more elements are not in direct contact with each other, but yet still cooperate or interact with each other.
The description may use the phrases “in an embodiment,” or “in embodiments,” which may each refer to one or more of the same or different embodiments. Furthermore, the terms “comprising,” “including,” “having,” and the like, as used with respect to embodiments, are synonymous.
For the purposes of the description, a phrase in the form “A/B” means A or B. For the purposes of the description, a phrase in the form “A and/or B” means “(A), (B), or (A and B)”. For the purposes of the description, a phrase in the form “at least one of A, B, and C” means “(A), (B), (C), (A and B), (A and C), (B and C), or (A, B and C)”. For the purposes of the description, a phrase in the form “(A)B” means “(B) or (AB)”; that is, A is an optional element.