Not applicable.
In the field of online advertising, determining on which web pages to place advertisements can be an important decision. It can be desirable to place advertisements on a web page that a specific target market frequently visits, or on a web page that is related to the marketed product. It can also be desirable to place advertisements on a search results page corresponding to a particular search query. Conventionally, advertisers can bid on search queries submitted by users of a search engine in order to display their advertisements on the corresponding search results page.
An advertiser may want to associate as many search terms and variations of those search terms as possible with their advertisements. Such search terms may include abbreviated terms that may refer to one or more expanded phrases. When bidding on particular abbreviated terms, an advertiser may desire to invest only in those abbreviated terms that will lead to search results that are related to the advertised product or service. Conventionally, advertisers have to manually select which abbreviated terms correspond to search results for their related product or service. Accordingly, it may be desirable to provide a more precise way in which advertisers can determine if certain abbreviated terms produce desired search results.
A system and method are disclosed for creating a database of expansion phrases for abbreviated terms. In an embodiment, an abbreviated term is submitted and a results set corresponding to the submitted abbreviated term is received. The results set can comprise at least one search result. One or more possible expansion phrases can be generated from the results set. At least one expansion phrase can be selected from the possible expansion phrases based on filter rules. The selected expansion phrases may be ranked according to a ranking algorithm and associated with the corresponding abbreviated term.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The invention introduces a system and method for creating a database of expansion phrases for abbreviated terms. Such a database can be helpful for determining the most common expansions of abbreviated terms. In an embodiment, the method can submit an abbreviated term and receive a corresponding results set. One or more possible expansion phrases can be generated from the results set, and expansion phrases can be selected from the possible expansion phrases using one or more filter rules. The selected expansion phrases can be ranked, associated with the abbreviated term, and stored in a database.
Search engine 104, query log database 106, abbreviation deduction manager 108, context-based similarity system 118, and third party source 120 can be a server including a workstation running the Microsoft Windows®, MacOS™, Unix, Linux, Xenix, IBM AIX™, Hewlett-Packard UX™, Novell Netware™, Sun Microsystems Solaris™, OS/2™, BeOS™, Mach, Apache, OpenStep™ or other operating system or platform. As shown in
Client 102 can include a communication interface. The communication interface may be an interface that allows the client 102 to be directly connected to any other client, server, or device, or that allows the client 102 to be connected to a client, server, or device over network 122. Network 122 can include, for example, a local area network (LAN), a wide area network (WAN), or the Internet. In an embodiment, the client 102 can be connected to another client, device, or server via a wireless interface.
Query log database 106 can store search queries submitted by users of search engine 104 or another search engine. In an embodiment, the context-based similarity system 118 can be used to discover key phrases and/or measure their similarity by utilizing the usage context information from search engine query logs. The similarity levels between two key phrases can then be used to narrow down the search space of several tasks in online keyword auctions, such as finding keyword/abbreviation pairs, finding frequent misspellings of a given keyword, finding key phrases with similar intention, and/or finding keywords which are semantically related, and the like.
In an embodiment, the context-based similarity system provides a mechanism for determining similarity between key phrases using usage context information (e.g., information apart from a focus term of a search) in search query logs. Thus, key phrases can be found which have a similar intention and/or are related conceptually by looking at the similarity of key phrase patterns around them. Moreover, algorithms can be applied for limiting the search space to only those key phrases which are similar to the given key phrase. This can make the algorithms computationally tractable and may also provide a higher accuracy for the final results.
Noise Filtering: This pass includes, but is not limited to, the following: First, the query logs are passed through a URL filter, which filters out queries that are URLs. This step is important for noise reduction because some of the queries in search engine logs are URLs. In an embodiment, non-alphanumeric characters, except punctuation marks, are omitted from the queries. In an embodiment, queries containing valid patterns of punctuation marks such as “.” “,” “?” and quotes and the like are broken down into multiple parts at the boundaries of the punctuation.
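The noise filtering pass can be illustrated with a minimal Python sketch. The URL pattern, the set of punctuation marks, the lowercasing, and the function name filter_noise are assumptions made for illustration; the description above does not specify how URLs or valid punctuation patterns are recognized.

```python
import re

# Hypothetical pattern for queries that look like URLs (an assumption; the
# text does not say how the URL filter recognizes them).
URL_PATTERN = re.compile(
    r"^(https?://|www\.)\S+$|^\S+\.(com|net|org)(/\S*)?$", re.IGNORECASE
)

# Punctuation marks at which a query is broken into parts (an assumed set).
SPLIT_PATTERN = re.compile(r"[.,?!;:\"]+")


def filter_noise(queries):
    """Drop URL-like queries, split the rest at punctuation, strip other symbols."""
    parts = []
    for query in queries:
        query = query.strip()
        if URL_PATTERN.match(query):
            continue  # URL filter: the whole query is discarded
        for piece in SPLIT_PATTERN.split(query):
            # Keep letters, digits, and whitespace; drop any other symbol.
            cleaned = re.sub(r"[^0-9A-Za-z\s]", " ", piece)
            cleaned = " ".join(cleaned.lower().split())  # lowercasing is an assumption
            if cleaned:
                parts.append(cleaned)
    return parts
```

For instance, under these assumptions the query "cheap flights, new york" would be broken into the two parts "cheap flights" and "new york".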
Low-frequency word filtering: In this pass, frequencies of individual words that occur in the entire query logs are determined. At the end of this pass, words which have a frequency lower than a pre-set threshold limit are discarded. This pass eliminates the generation of phrases containing infrequent words in the next step. Typically, if a word is infrequent then a phrase which contains this word is likely infrequent as well.
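As a rough illustration of this pass, the following Python sketch counts word frequencies over the cleaned queries and keeps only the words that meet a threshold. The function name frequent_words and the parameter min_count are hypothetical, and splitting on whitespace is an assumed tokenization.

```python
from collections import Counter


def frequent_words(queries, min_count):
    """Return the set of words whose frequency over all queries is at least min_count."""
    counts = Counter(word for query in queries for word in query.split())
    # Words with a frequency below the pre-set threshold are discarded.
    return {word for word, count in counts.items() if count >= min_count}
```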
Key-phrase candidate generation: In this pass, possible phrases up to a pre-set length of N words are generated for each query, where N is an integer from one to infinity. Typically, a phrase that contains an infrequent word, begins with a stop-word, ends with a stop-word, and/or appears in a pre-compiled list of non-standalone key phrases is not generated. At the end of the pass, frequencies of phrases are counted and infrequent phrases are discarded. The remaining list of frequent phrases is called a “key phrase candidate list.”
Key-phrase determination: For each query, the best break is estimated by a scoring function that assigns each candidate break a score equal to the sum of (n−1)×frequency+1 over its constituent key phrases. Here, n is the number of words in the given key phrase and can be an integer from one to infinity. Once the best break is determined, the real count of each constituent key phrase of the best query break is incremented by 1. This pass outputs the query breakup to a file for later use in generating a Co-occurrence Graph.
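The candidate generation pass might be sketched in Python as follows. The stop-word list, the function name candidate_phrases, and all parameter names are illustrative assumptions; the set of frequent words is taken to be the output of the preceding low-frequency word filtering pass.

```python
from collections import Counter

# Assumed, illustrative stop-word list; the text does not enumerate one.
STOP_WORDS = {"the", "of", "a", "in", "for", "and"}


def candidate_phrases(queries, frequent, max_len, min_phrase_count,
                      non_standalone=frozenset()):
    """Count phrases of up to max_len words and keep the frequent ones."""
    counts = Counter()
    for query in queries:
        words = query.split()
        for start in range(len(words)):
            for end in range(start + 1, min(start + max_len, len(words)) + 1):
                gram = words[start:end]
                if gram[-1] not in frequent:
                    break  # every longer phrase from this start has the same infrequent word
                if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                    continue  # phrase begins or ends with a stop-word
                phrase = " ".join(gram)
                if phrase in non_standalone:
                    continue  # listed as a non-standalone key phrase
                counts[phrase] += 1
    # Infrequent phrases are discarded at the end of the pass.
    return {p: c for p, c in counts.items() if c >= min_phrase_count}
```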
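One possible reading of the scoring function is sketched below in Python. It assumes the per-phrase score is ((n − 1) × frequency) + 1, that any single word may stand alone as a constituent, and that multi-word constituents must appear in the key phrase candidate list. The names best_break, phrase_freq, and max_len are hypothetical.

```python
from functools import lru_cache


def best_break(query, phrase_freq, max_len):
    """Return (score, phrases) for the highest-scoring break of the query."""
    words = tuple(query.split())

    def phrase_score(gram):
        n = len(gram)
        # Assumed reading of the formula: ((n - 1) * frequency) + 1.
        return (n - 1) * phrase_freq.get(" ".join(gram), 0) + 1

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(words):
            return 0, ()
        best = None
        for end in range(start + 1, min(start + max_len, len(words)) + 1):
            gram = words[start:end]
            phrase = " ".join(gram)
            # Multi-word constituents must be known candidates (an assumption).
            if len(gram) > 1 and phrase not in phrase_freq:
                continue
            tail_score, tail = solve(end)
            candidate = (phrase_score(gram) + tail_score, (phrase,) + tail)
            if best is None or candidate[0] > best[0]:
                best = candidate
        return best

    return solve(0)
```

Under this reading, longer and more frequent phrases dominate the score, so a query such as "new york hotels" would tend to be broken at "new york" rather than into three single words. The real count of each returned phrase would then be incremented by 1.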
One can make an additional pass through the list of key phrases generated in the above step and discard the key phrases with a real frequency below a certain threshold when the count of obtained key phrases exceeds the maximum that is needed.
Co-occurrence Graph generation: Using the query breakup file generated in a key phrase extraction process, a key phrase Co-occurrence Graph is generated. A Co-occurrence Graph is a graph with key phrases as nodes and edge weights representing the number of times two key phrases are part of the same query. For example, if a breakup of a query had three key phrases, namely, a, b, and c then the weights of the following edges are incremented by 1: {a,b}, {a,c} and {b,c}.
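The edge-weight update described above can be sketched in Python. The representation of the graph as a mapping from sorted phrase pairs to weights, and the name build_cooccurrence, are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(breakups):
    """Map each unordered key-phrase pair to the number of queries containing both."""
    weights = Counter()
    for phrases in breakups:
        # Sorting gives each undirected edge {a, b} a single canonical key (a, b).
        for a, b in combinations(sorted(set(phrases)), 2):
            weights[(a, b)] += 1
    return weights
```

A breakup with key phrases a, b, and c increments the edges {a,b}, {a,c}, and {b,c} by 1 each, matching the example in the text.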
Co-occurrence Graph pruning: Once the Co-occurrence Graph has been generated, noise is removed by pruning edges with a weight less than a certain threshold. Next, nodes which have fewer than a certain threshold number of edges are pruned. Edges associated with these nodes are also removed. Further, the top K edges for each node are determined, where K is an integer from one to infinity. Edges, except those falling into the top K of at least one node, are then removed from the graph.
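The three pruning steps might look as follows in Python. This is a sketch under assumptions: the graph is a mapping from (a, b) pairs to weights, node pruning is done in a single pass (the text does not say whether it is repeated after degrees change), and the names prune_graph, min_weight, min_degree, and top_k are hypothetical.

```python
from collections import defaultdict


def prune_graph(weights, min_weight, min_degree, top_k):
    """Prune light edges, low-degree nodes, and edges outside every node's top K."""
    # Step 1: remove edges whose weight is below the threshold.
    edges = {e: w for e, w in weights.items() if w >= min_weight}

    # Step 2: remove nodes with too few edges, along with their edges (single pass).
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    edges = {(a, b): w for (a, b), w in edges.items()
             if degree[a] >= min_degree and degree[b] >= min_degree}

    # Step 3: keep an edge only if it is among the top K edges of at least one node.
    incident = defaultdict(list)
    for (a, b), w in edges.items():
        incident[a].append(((a, b), w))
        incident[b].append(((a, b), w))
    keep = set()
    for node_edges in incident.values():
        node_edges.sort(key=lambda item: item[1], reverse=True)
        keep.update(edge for edge, _ in node_edges[:top_k])
    return {e: w for e, w in edges.items() if e in keep}
```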
Similarity Graph creation: A new graph called the Similarity Graph is then created. The set of nodes of this graph is the key phrases which remain as nodes in the Co-occurrence Graph after Co-occurrence Graph pruning.
Similarity Graph edge computation: For each pair {n1, n2} of nodes in the Similarity Graph, an edge {n1, n2} is created if and only if the similarity value S(n1,n2) for the two nodes in the Co-occurrence Graph is greater than a threshold T. The weight of the edge {n1,n2} is S(n1,n2). The similarity value S(n1,n2) is defined as the cosine distance between the vectors {e1n1, e2n1 . . . } and {e1n2, e2n2 . . . }, where e1n1, e2n1 . . . are the weights of the edges connecting node n1 in the Co-occurrence Graph and e1n2, e2n2 . . . are the weights of the edges connecting node n2 in the Co-occurrence Graph. Cosine distance between two vectors V1 and V2 is computed as follows: (V1·V2)/(|V1|×|V2|). A total of approximately nC2 distance computations, where n is the number of nodes, are required at this stage.
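The edge computation can be sketched in Python as below. The sketch assumes each node's vector is indexed by its neighbors in the Co-occurrence Graph, with a missing edge counted as zero; the function names and the dictionary representation are illustrative, not taken from the description.

```python
import math
from itertools import combinations


def neighbor_vector(node, weights):
    """Sparse vector of a node: neighbor -> co-occurrence edge weight."""
    vec = {}
    for (a, b), w in weights.items():
        if a == node:
            vec[b] = w
        elif b == node:
            vec[a] = w
    return vec


def cosine(v1, v2):
    """(V1 . V2) / (|V1| x |V2|) for sparse vectors stored as dicts."""
    dot = sum(w * v2.get(key, 0) for key, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def similarity_edges(weights, threshold):
    """Create edge {n1, n2} with weight S(n1, n2) only when S exceeds the threshold."""
    nodes = sorted({n for edge in weights for n in edge})
    vectors = {n: neighbor_vector(n, weights) for n in nodes}
    result = {}
    for n1, n2 in combinations(nodes, 2):  # roughly nC2 comparisons
        s = cosine(vectors[n1], vectors[n2])
        if s > threshold:
            result[(n1, n2)] = s
    return result
```

Two key phrases that co-occur with the same neighbors in the same proportions receive a value of 1 under this formula, which is why an abbreviation and its expansion would be expected to score highly.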
Similarity Graph edge pruning: The top E edges by edge weight for each node in the Similarity Graph are then determined, where E is an integer from one to infinity. The edges, except those falling in the top E edges of at least one node, are removed. Typically, the value of E is approximately 100.
Output: The Similarity Graph generated above is output.
The Similarity Graph can be stored in a hash table data structure for very quick lookups of key phrases that have a similar usage context as the given key phrase. The keys of such a hash table are the key phrases and the values are a list of key phrases which are neighbors of the hash key in the Similarity Graph. The main parameter to control the size of this graph is the minimum threshold value for frequent key phrases in the key phrase extraction process. The size of the Similarity Graph is roughly directly proportional to the coverage of key phrases. Hence, this parameter can be adjusted to suit a given application and/or circumstances.
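A minimal sketch of such a hash table in Python follows, using a built-in dict. Ordering each neighbor list by descending edge weight is an added assumption, as are the name to_lookup_table and the edge representation.

```python
from collections import defaultdict


def to_lookup_table(similarity_edges):
    """Build key phrase -> list of neighboring key phrases from weighted edges."""
    table = defaultdict(list)
    for (a, b), weight in similarity_edges.items():
        table[a].append((b, weight))
        table[b].append((a, weight))
    # Assumption: neighbors are listed most-similar first.
    return {
        phrase: [n for n, _ in sorted(neighbors, key=lambda item: item[1], reverse=True)]
        for phrase, neighbors in table.items()
    }
```

Looking up a key phrase is then a single dictionary access, for example table.get("nba", []).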
Referring back to
The abbreviated term output component 122 can be, for example, a program that is configured to output a plurality of different abbreviated terms. In an embodiment, the plurality of different abbreviated terms are output to either a search engine or a similarity graph. In an embodiment, similar phrase generation component 110 can be used to receive an output from a search engine or a similarity graph, wherein the output is a results set including at least one result. If the results set is received from the search engine, the results set can be a search results set including at least one search result corresponding to a query. In an embodiment, the query can be an abbreviated term received from the abbreviated term output component 122. If the results set is received from a similarity graph, the results set can be a nodes set including at least one node corresponding to a query. In an embodiment, the query can be an abbreviated term received from the abbreviated term output component.
Once the output is received, the similar phrase generation component can be configured to generate all possible expansion phrases from the output. In an embodiment, the expansion phrases are generated based on the query that was submitted to generate the output. The abbreviation detection component 112 can be configured to select expansion phrases from the possible expansion phrases based on filter rules. In an embodiment, a selected expansion phrase can be an expansion phrase that is most relevant to the query. The level of relevancy can be determined utilizing a relevancy determination algorithm employed by the abbreviation detection component. The ranking component 116 can be configured to rank the selected expansion phrases according to a ranking algorithm employed by the ranking component. The expansion phrase database 114 can associate and store the ranked expansion phrases with the corresponding query. In another embodiment, the expansion phrase database 114 can include expansion phrases and corresponding abbreviated terms received from one or more third party sources 120.
At operation 706, possible expansion phrases are generated from the results of the results set. In an embodiment in which the results set is received from a similarity graph, the possible expansion phrases are generated by extracting the most relevant M nodes that are related to the abbreviated term, where M is an integer from one to infinity. The level of relevancy of the nodes to the abbreviated term can be determined by an employed algorithm.
In an embodiment in which the results set is received from a search engine, the possible expansion phrases are generated by selecting the first P search results and generating possible expansion phrases from the selected search results up to length X, where P and X are integers from one to infinity and X is the number of terms in the expansion phrase. The expansion phrases can be generated from the titles of the search results, the snippets of the search results, or both the titles and snippets of the search results. The snippet of a search result can be the text that accompanies the title of the search result. For example, referring to
At operation 708, expansion phrases from the possible expansion phrases are selected based on filter rules. In an embodiment, a selected expansion phrase can be a possible expansion phrase that is closely related to the abbreviated term. An algorithm utilizing any number of filter rules can be employed by the invention to determine how closely related the possible expansion phrase is to the abbreviated term. For example, one filter rule could be that each of the letters in the abbreviated term stands for a corresponding first letter of a word in the selected expansion phrase. For example, referring to
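Generating possible expansion phrases from the first P search results can be sketched in Python as follows. The representation of a search result as a dictionary with "title" and "snippet" keys, the lowercase alphanumeric tokenization, and the function and parameter names are all assumptions made for illustration.

```python
import re


def possible_expansions(results, top_p, max_len, use_titles=True, use_snippets=True):
    """Collect every phrase of up to max_len words from the first top_p results."""
    phrases = set()
    for result in results[:top_p]:
        texts = []
        if use_titles:
            texts.append(result.get("title", ""))
        if use_snippets:
            texts.append(result.get("snippet", ""))
        for text in texts:
            words = re.findall(r"[a-z0-9]+", text.lower())  # assumed tokenization
            for start in range(len(words)):
                for end in range(start + 1, min(start + max_len, len(words)) + 1):
                    phrases.add(" ".join(words[start:end]))
    return phrases
```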
Another example of a filter rule could be that the first letter in the abbreviated term is the first letter of the first word in the selected expansion phrase and the other letters of the abbreviated term can be found anywhere else in the selected expansion phrase. For example, referring to
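The two example filter rules can be sketched as Python predicates. The first rule is read here as requiring exactly one word per letter of the abbreviated term, and the second as requiring the remaining letters to appear in order after the first letter; both readings, and the function names, are assumptions, since the text does not state whether word counts must match or whether letter order matters.

```python
def matches_initials(abbr, phrase):
    """Rule 1: each letter of abbr is the first letter of the corresponding word."""
    words = phrase.lower().split()
    abbr = abbr.lower()
    return len(words) == len(abbr) and all(
        word[0] == letter for word, letter in zip(words, abbr)
    )


def matches_first_letter_then_rest(abbr, phrase):
    """Rule 2: first letters agree; remaining letters occur later, in order (assumed)."""
    abbr = abbr.lower()
    text = phrase.lower()
    if not abbr or not text or text[0] != abbr[0]:
        return False
    remaining = iter(text[1:])
    # 'letter in remaining' consumes the iterator, which enforces left-to-right order.
    return all(letter in remaining for letter in abbr[1:])
```

Under these readings, "nba" passes the first rule against "national basketball association", while "tv" fails the first rule but passes the second against "television".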
At operation 710, the selected expansion phrases are ranked. In an embodiment, the selected expansion phrases are ranked in order of the frequency with which the selected expansion phrases are found within query log database 106 (
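Ranking by query log frequency might be sketched in Python as follows. The sketch assumes that a phrase is "found" in a logged query when it appears there as a contiguous sequence of whole words, and that the query log is available as a list of strings with single spaces between words; the name rank_by_query_frequency is hypothetical.

```python
def rank_by_query_frequency(phrases, query_log):
    """Sort expansion phrases by how many logged queries contain them, most frequent first."""
    def frequency(phrase):
        needle = f" {phrase.lower()} "
        # Padding with spaces restricts matches to whole-word boundaries.
        return sum(needle in f" {query.lower()} " for query in query_log)

    return sorted(phrases, key=frequency, reverse=True)
```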
While particular embodiments of the invention have been illustrated and described in detail herein, it should be understood that various changes and modifications might be made to the invention without departing from the scope and intent of the invention. The embodiments described herein are intended in all respects to be illustrative rather than restrictive. Alternate embodiments will become apparent to those skilled in the art to which the present invention pertains without departing from its scope.
From the foregoing it will be seen that this invention is one well adapted to attain all the ends and objects set forth above, together with other advantages, which are obvious and inherent to the system and method. It will be understood that certain features and sub-combinations are of utility and may be employed without reference to other features and sub-combinations. This is contemplated and within the scope of the appended claims.