1. Field of the Invention
The present invention generally relates to information retrieval systems. In particular, the present invention relates to information retrieval systems that index and retrieve hierarchically-defined resources.
2. Background
Textual advertisements (“ads”) are ubiquitous on the World Wide Web today, as they appear on a wide variety of Web pages ranging from obscure forums to major search engine results pages. Online textual ads connect advertisers to prospective customers, and clicks on these ads by Web users account for a significant fraction of the revenue for many online businesses: from news publishers to blogs to Web search engines. According to at least one observer, as of 2008, textual advertising comprised approximately 40% of the $22 billion online advertising market, which is predicted to double in size over the next 5 years.
Textual ads are distributed on the Web via two primary channels, namely, “sponsored search” and “content match.” In sponsored search, textual ads are shown alongside the search results produced by a Web search engine, while in content match, contextually relevant ads are displayed within the content of generic Web pages. Historically, sponsored search evolved by allowing advertisers to explicitly bid on queries (termed “bid phrases” or “bid terms”) that they wished to display their ads for. In this paradigm, the burden of ad selection was placed primarily on the advertisers, since they needed to skillfully choose a comprehensive list of queries relevant for their ads. This scenario is called “exact match,” since the query and the bid phrase must match exactly for an ad to be shown.
However, it quickly became apparent that it is virtually impossible to explicitly enumerate billions of less popular “tail” queries and impractical to cover them via exact match, even though such queries provide valuable advertising opportunities. To unlock the revenue potential of these numerous yet individually infrequent queries, the “advanced match” method was introduced. Here, the query and the bid term no longer need to match exactly, and ads are selected algorithmically by a search engine or ad delivery system.
Recently, information retrieval techniques have been proposed for advanced match by indexing the ads as documents using the ad text visible to the user as well as the ad's bid phrases. Ads are then selected using an “ad query” that is generated from the user's query (sponsored search) or the Web page on which ads are to be displayed (content match). The ad query is executed against the index of ads using standard information retrieval matching and ranking techniques. Most implementations of advanced match make the simplifying assumption that ads are atomic units that are independent of each other, even though ads from the same advertiser could be quite similar or nearly identical.
In practice, however, textual ads are defined and organized as a hierarchically-structured database with several types of entities. In accordance with one such hierarchically-structured database, each advertiser has one or more accounts. Each account in turn contains one or more campaigns and each campaign includes one or more ad groups. Each ad group typically includes multiple creatives (i.e., the visible text of the ad) and bid phrases. Bid phrases correspond to different products or services offered by the advertiser, while creatives represent different ways to advertise those products or services. Any creative can be paired with any bid term within the same ad group to create an actual ad displayed to the user.
This hierarchical structure makes indexing of ads highly non-trivial, as naively indexing all possible “displayable” ads leads to a prohibitively large and ineffective index. Ad retrieval using such an index will not only be slow but will also result in suboptimal precision. There are currently no known techniques for indexing ads that exploit the hierarchical structure of the ad corpus in order to build a compact and effective index for advanced match. Additionally, there are no known ad retrieval methods that are capable of exploiting the hierarchical structure of the ad corpus to retrieve more relevant advertisements than those yielded by conventional methods.
Novel and efficient methods are described herein for indexing ads and other resources that are defined and organized in accordance with a hierarchical schema. In accordance with at least one embodiment, an ad corpus is transformed into a collection of hierarchically structured textual documents. An indexing technique that exploits the hierarchical structure is then applied to construct a compact yet effective ad index that can be used for performing advanced match or other ad retrieval functions. In certain embodiments in which a displayable ad comprises a combination of a creative and a bid phrase contained within a particular ad group, the indexing operation includes combining a field of text representative of a creative within an ad group with fields of text representative of all of the bid phrases within the same ad group to create a single indexing unit. In other embodiments, the indexing operation includes combining fields of text representative of all the creatives and all the bid phrases within an ad group to create a single indexing unit.
Various retrieval methods are also described herein that are capable of exploiting the hierarchical structure of the ad corpus to retrieve more relevant ads than those yielded by conventional methods. In accordance with one embodiment, the retrieval method comprises obtaining a list of the most relevant indexing units with respect to an ad query, wherein each indexing unit comprises a field of text representative of a creative contained in an ad group and fields of text representative of all the bid terms within the same ad group, extracting a ranked list of creatives therefrom, filtering the ranked list of creatives by ad group, and then retrieving the bid term associated with each creative on the list that is deemed most relevant to the ad query. In accordance with another embodiment, the retrieval method comprises obtaining a list of the most relevant indexing units with respect to an ad query, wherein each indexing unit comprises fields of text representative of all the creatives and bid terms contained within an ad group, retrieving the creative associated with each ad group represented in the list that is deemed most relevant to the ad query, and retrieving the bid term associated with each ad group represented in the list that is deemed most relevant to the ad query.
In further embodiments, the relevancy ranking obtained by using at least one of the aforementioned retrieval methods is improved by re-ranking a list of retrieved <creative, bid term> pairs based on a plurality of relevancy-related features associated with each <creative, bid term> pair and/or the ad group with which such pair is associated. Such re-ranking may comprise, for example, calculating a relevancy score for each retrieved <creative, bid term> pair that is a linear combination of weighted relevancy-related features associated with each <creative, bid term> pair and/or the ad group with which such pair is associated. The weights applied to each feature may be optimized, for example, by employing a learning to rank algorithm.
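By way of non-limiting illustration only, the linear re-ranking described above may be sketched in Python as follows. The feature names, the weight values and the function name `rerank` are hypothetical and are not taken from this specification; in practice the weights would be obtained by a learning to rank algorithm rather than chosen by hand.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (creative, bid term)


def rerank(pairs: List[Tuple[Pair, Dict[str, float]]],
           weights: Dict[str, float]) -> List[Tuple[Pair, float]]:
    """Re-rank pairs by a weighted linear combination of their features."""
    scored = []
    for pair, features in pairs:
        # Features that have no entry in `weights` contribute nothing.
        score = sum(weights.get(name, 0.0) * value
                    for name, value in features.items())
        scored.append((pair, score))
    # Highest combined score first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical features and weights, for illustration only.
weights = {"creative_score": 0.5, "bid_term_score": 0.3, "ad_group_score": 0.2}
retrieved = [
    (("Book Rome hotels", "rome hotels"),
     {"creative_score": 0.6, "bid_term_score": 0.1, "ad_group_score": 0.4}),
    (("Cheap flights to Rome", "rome flights"),
     {"creative_score": 0.2, "bid_term_score": 0.9, "ad_group_score": 0.4}),
]
reranked = rerank(retrieved, weights)
```

In this toy input the second pair overtakes the first after re-ranking, because its strong bid term feature outweighs its weaker creative feature under the assumed weights.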
Importantly, the indexing and retrieval methods described herein in the context of an ad retrieval system can easily be extended to other information retrieval systems in which the retrieval units are structured hierarchically.
Further features and advantages of the invention, as well as the structure and operation of various embodiments of the invention, are described in detail below with reference to the accompanying drawings. It is noted that the invention is not limited to the specific embodiments described herein. Such embodiments are presented herein for illustrative purposes only. Additional embodiments will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein.
The accompanying drawings, which are incorporated herein and form part of the specification, illustrate the present invention and, together with the description, further serve to explain the principles of the invention and to enable a person skilled in the relevant art(s) to make and use the invention.
The features and advantages of the present invention will become more apparent from the detailed description set forth below when taken in conjunction with the drawings, in which like reference characters identify corresponding elements throughout. In the drawings, like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements. The drawing in which an element first appears is indicated by the leftmost digit(s) in the corresponding reference number.
The following detailed description refers to the accompanying drawings that illustrate exemplary embodiments of the present invention. However, the scope of the present invention is not limited to these embodiments, but is instead defined by the appended claims. Thus, embodiments beyond those shown in the accompanying drawings, such as modified versions of the illustrated embodiments, may nevertheless be encompassed by the present invention.
References in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” or the like, indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Furthermore, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to implement such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
As shown in
Ad database 102 comprises a hierarchically-structured database that stores textual ads.
As used herein, the term “creative” generally refers to the visible text of an ad. Three example creatives, denoted creative 3021, creative 3022 and creative 3023, are shown as being contained within ad group entity “ad group 2” in
As used herein, a “bid phrase” generally refers to one or more terms that are associated with a product or service offered by an advertiser. Historically, the term “bid phrase” arose from sponsored search systems in which advertisers were allowed to explicitly bid on user queries that they wished to display their ads for. However, as used herein, the term “bid phrase” is not limited to phrases that are bid upon by advertisers and may encompass phrases that become associated with a product or service offered by an advertiser via other means. A plurality of example bid phrases 3041, 3042, 3043, 3044, 3045, . . . 304N are shown as being contained within ad group entity “ad group 2” in
Bid phrases thus correspond to different products or services offered by an advertiser, while creatives represent different ways to advertise those products or services. In accordance with the example database implementation illustrated in
Ad indexer 104 comprises a system that obtains and/or generates information about each ad stored in ad database 102 and stores such information in another database, denoted ad index 106. This process may be referred to as indexing the ads stored in ad database 102. The information stored in ad index 106 is used by ad retriever 114 within search engine 108 to identify ads stored within ad database 102 that are deemed most relevant to an ad query and to rank such identified ads in order of relevancy.
As will be discussed in more detail herein, ad indexer 104 advantageously indexes the ads stored in ad database 102 in a manner that exploits the hierarchical structure associated with ad database 102. In certain embodiments, this process results in the generation of an ad index that is more compact and effective than indexes that would be generated using conventional indexing approaches. A detailed explanation of various indexing methods that may be used by ad indexer 104 will be provided herein.
Search engine 108 comprises a system that is designed to help users, such as a user of user computer 112, search for and obtain access to resources that are stored at a multitude of different interconnected nodes within World Wide Web 110. Such resources may include, for example, Web pages, text files, audio files, image files, video files, or the like. Search engine 108 may comprise, for example, a publicly-available Web search engine such as Yahoo!® Search (www.yahoo.com), provided by Yahoo! Inc. of Sunnyvale, Calif., Bing™ (www.bing.com), provided by Microsoft® Corporation of Redmond, Wash., or Google™ (www.google.com), provided by Google Inc. of Mountain View, Calif.
Search engine 108 provides an interface that enables a user of user computer 112 to submit a user query that relates to one or more resources of interest. The user query may comprise, for example, a text query comprising one or more query terms. Responsive to receiving the user query, search engine 108 executes a search to identify resources on World Wide Web 110 that are deemed relevant to the user query, ranks the identified resources in accordance with a relevancy ranking scheme, and then returns a search results page to the user via user computer 112, wherein the search results page includes identified resources sorted in order of relevancy. In one embodiment, for each identified resource, the search results page includes a URL associated with the resource, a title associated with the resource, and a short summary that briefly describes the resource. The URL may be provided in the form of a link that, when activated by a user, causes user computer 112 to retrieve the associated resource from a node within World Wide Web 110.
As shown in
As will be discussed in more detail herein, ad retriever 114 advantageously retrieves ads in a manner that exploits the hierarchical structure associated with ad database 102. In certain embodiments, this process results in the retrieval of more relevant ads than those retrieved using conventional information retrieval techniques. A detailed explanation of various retrieval methods that may be used by ad retriever 114 will be provided herein.
User computer 112 is intended to broadly represent any system or device that is capable of interacting with a search engine, such as search engine 108. In certain embodiments, user computer 112 comprises a processor-based system or device that executes a Web browser or other software that enables a user to submit queries to and receive search results from search engine 108 via World Wide Web 110 or some other communication channel. Depending upon the implementation, such system or device may comprise, for example, a desktop computer system, a laptop computer, a tablet computer, a gaming console, a personal digital assistant, a smart telephone, a portable media player, or the like. Although only one user computer 112 is shown in
As shown in
Ad database 202 comprises a hierarchically-structured database that stores textual ads. Ad database 202 may comprise a database having essentially the same structure as ad database 102 as described above in reference to
Ad indexer 204 comprises a system that obtains and/or generates information about each ad stored in ad database 202 and stores such information in another database, denoted ad index 206. Ad indexer 204 may operate in a like manner to ad indexer 104 as described above in reference to
Ad delivery system 208 comprises a system that is designed to provide contextually-relevant ads for insertion within generic Web pages (as opposed to search results pages) published by various entities via World Wide Web 210. Ad delivery system 208 is configured to provide such advertisements in response to ad requests received via World Wide Web 210. Depending upon the implementation, such ad requests may emanate from a Web page server 212 that is responsible for delivering a Web page to a user computer 214 or from the user computer 214. In the former case, Web page server 212 may embed the advertisements in the Web page before delivering the Web page to user computer 214. In the latter case, a Web browser executing on user computer 214 may dynamically insert the advertisements into the Web page when displaying the Web page to the user.
As shown in
Like ad retriever 114 discussed above in reference to
Web page server 212 is intended to broadly represent any system or device that is capable of serving Web pages over World Wide Web 210. User computer 214 is intended to broadly represent any system or device that is capable of displaying such Web pages. For example, user computer 214 may be any of the various system or device types described above in reference to user computer 112 of
The operating environments of
In view of the hierarchical definition of an ad as discussed above in reference to ad databases 102 and 202, the ad retrieval problem can be formulated as the retrieval of <creative, bid phrase> pairs from a structured schema, wherein each pair comprises a displayable ad. Embodiments of the present invention arise from postulating ad retrieval as a structured retrieval problem, where the unit of retrieval is defined hierarchically. As will be discussed herein, there are several crucial tradeoffs that must be analyzed when dealing with this unique retrieval scenario.
Naïvely indexing all possible retrieval units (i.e., all possible <creative, bid phrase> combinations) using standard information retrieval indexing approaches would result in a significant amount of wasted storage, since each creative and bid phrase will be indexed multiple times due to the Cartesian product semantics. Most of the inverted indexing algorithms used in modern search engines incur increased cost with larger index sizes, making the naïve approach infeasible in practice. Hierarchically-structured indexing schemes are described herein that avoid this problem by reducing the amount of duplication. Novel ranking algorithms will also be described herein that exploit the hierarchical structure and are both more efficient and more effective than algorithms that utilize the naïve indexing approach. In one embodiment, this is achieved by employing a multi-phase retrieval approach in which an ad group is first retrieved, then an optimal creative is selected, and finally a bid phrase is chosen that makes the resulting ad the most relevant for a given query.
As noted above, a target retrieval unit in accordance with one embodiment is a <creative, bid phrase> pair, which together comprise a displayable ad. Thus, in accordance with such an embodiment, the goal of the retrieval function is to produce a ranked list of relevant <creative, bid phrase> pairs. Since the terms “bid phrase” and “bid term” are used interchangeably herein, in the following, the <creative, bid phrase> pair will be referred to as a <creative, bid term> pair.
The example indexes described herein contain two types of textual fields: creative fields (c) and bid term fields (t). In one embodiment, each creative field consists of three sub-fields: title, description and URL. This is consistent with the example creatives 3021-3023 described above in reference to
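$$|C| + |T| = |G| \, (|\bar{C}| + |\bar{T}|)$$
where $|\bar{C}|$ and $|\bar{T}|$ denote the average number of creatives and bid terms per ad group, respectively.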
Various retrieval methods described herein utilize a scoring function to measure the relevancy of an entity, such as an indexing unit, with respect to a query, such as an ad query. Any suitable scoring function presently known or hereinafter developed may be used to implement these methods. In certain embodiments, a scoring function associated with a well-known language modeling approach to information retrieval is utilized. In accordance with this approach, indexing units are scored by their probability of generating the query terms. Formally, given a query q and an indexing unit u, each indexing unit is scored using a unigram language model
$$p(q \mid u) = \prod_{i=1}^{|q|} p(w_i \mid u) \qquad (1)$$
where p(w_i|u) is estimated using Dirichlet smoothing, so that the final scoring function is as follows:
$$\mathrm{score}(q, u) = \sum_{i=1}^{|q|} \log \frac{tf_{w_i,u} + \mu \, tf_{w_i,C} / |C|}{|u| + \mu} \qquad (2)$$
where tf_{w,k} is the number of occurrences of a term w, in either a particular indexing unit (k=u) or the entire collection (k=C), |u| and |C| are the corresponding total term counts, and μ is a free parameter, which controls the amount of smoothing.
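For illustration only, a Dirichlet-smoothed query likelihood score of the kind described above may be sketched in Python as follows. The sketch assumes the standard form of the estimate, in which each query term contributes log((tf_{w,u} + μ·tf_{w,C}/|C|) / (|u| + μ)). The function name, the whitespace-tokenized inputs, the default value of μ and the decision to skip terms that never occur in the collection are all assumptions of the sketch.

```python
import math
from collections import Counter
from typing import List


def dirichlet_score(query: List[str], unit: List[str],
                    collection_tf: Counter, collection_len: int,
                    mu: float = 1000.0) -> float:
    """Sum over query terms of log((tf_wu + mu * tf_wC / |C|) / (|u| + mu))."""
    unit_tf = Counter(unit)
    score = 0.0
    for w in query:
        p_coll = collection_tf[w] / collection_len  # background probability
        if p_coll == 0.0:
            continue  # assumption: ignore terms unseen in the whole collection
        score += math.log((unit_tf[w] + mu * p_coll) / (len(unit) + mu))
    return score


# Toy collection of two indexing units (already tokenized).
units = [["cheap", "rome", "flights"], ["rome", "hotels", "deals"]]
collection_tf = Counter(term for u in units for term in u)
collection_len = sum(len(u) for u in units)

query = ["rome", "flights"]
s0 = dirichlet_score(query, units[0], collection_tf, collection_len)
s1 = dirichlet_score(query, units[1], collection_tf, collection_len)
```

Both units contain "rome", so the difference between `s0` and `s1` comes entirely from the term "flights", which occurs only in the first unit.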
In the exemplary retrieval methods described herein, indexing units are ranked in descending order based on their score. As many ad retrieval applications require only a limited number of ads to be retrieved (e.g., a sponsored search application that returns only a limited number of ads to a user in response to her query), only a list of top K indexing units [u(1), . . . , u(K)] is retrieved in accordance with these example embodiments. Furthermore, each unit u in the list is associated with an ad group identifier gIDu. To promote diversity and advertiser coverage in the ranked list, certain embodiments may require that [gIDu(1), . . . , gIDu(K)] are unique, thereby limiting the number of ads retrieved from a single ad group to one.
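The uniqueness constraint on ad group identifiers may be sketched, for illustration only, as follows. The tuple layout and the function name are hypothetical. Note that this variant keeps scanning the ranked list until K distinct ad groups have been found, which is one possible design choice; filtering a fixed list of the top K units instead may return fewer than K results.

```python
from typing import List, Tuple

ScoredUnit = Tuple[str, str, float]  # (unit id, ad group id, score)


def top_k_unique_groups(scored_units: List[ScoredUnit],
                        k: int) -> List[ScoredUnit]:
    """Return up to k units, best score first, at most one per ad group."""
    if k <= 0:
        return []
    ranked = sorted(scored_units, key=lambda u: u[2], reverse=True)
    seen_groups = set()
    result = []
    for unit_id, group_id, score in ranked:
        if group_id in seen_groups:
            continue  # a higher-scoring unit from this ad group was kept
        seen_groups.add(group_id)
        result.append((unit_id, group_id, score))
        if len(result) == k:
            break
    return result


scored = [("u1", "g1", -2.0), ("u2", "g1", -1.5),
          ("u3", "g2", -3.0), ("u4", "g3", -4.0)]
top = top_k_unique_groups(scored, k=2)
```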
Three example indexing strategies and corresponding retrieval methods will now be described that may be used for performing ad retrieval. These strategies may be used, for example, to obtain a list of ads relevant to an ad query for use in applications such as sponsored search or content match. Each of the indexing methods described herein may be implemented by ad indexer 104 as described above in reference to system 100 of
Each of the strategies presented below takes into account the hierarchical structure of ad database 102 and ad database 202 as described above. The strategies are focused on the indexing of the two lower levels of the hierarchical ad structure—namely the ad group and the creative-bid term levels. The underlying principles of these strategies, however, are general enough to easily extend them to the higher levels of the ad hierarchy.
As will be made clear below, each of the strategies presented differs in its choice of the atomic indexing unit and its ranking algorithm. Table 1 provides a formal definition of indexing units and estimated index sizes for each strategy. In what follows, a detailed description is provided of each indexing strategy and associated retrieval strategy.
1. Term Coupling Based Index and Retrieval
In the first indexing scheme, each indexing unit represents a <creative, bid term> pair <c,t>. This is the most fine-grained indexing unit, and this approach effectively indexes the Cartesian product of the creatives and bid terms in each ad group. The resulting index is referred to herein as CTInd. As noted above, indexing all possible <creative, bid term> pairs in each ad group can result in a prohibitively large and ineffective index and thus this approach is considered sub-optimal. However, this approach is described herein to provide a basis for comparison for other more efficient hierarchically structured ad indexing schemes.
At step 404, indexing units are generated by combining each field of text representative of a creative contained within an ad group with each field of text representative of a bid term contained within the same ad group. This step is performed across all ad groups in the ad database. As noted above, this step effectively indexes the Cartesian product of the creatives and bid terms in each ad group.
At step 406, the indexing units are stored in a database that is accessible to an information retrieval system, such as an ad retrieval system.
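For illustration only, the generation of CTInd indexing units may be sketched in Python as follows. The `AdGroup` and `CTUnit` types, the in-memory list standing in for the index database, and the toy ad groups are hypothetical and serve only to make the Cartesian product explicit.

```python
from itertools import product
from typing import Dict, List, NamedTuple


class AdGroup(NamedTuple):
    creatives: List[str]
    bid_terms: List[str]


class CTUnit(NamedTuple):
    group_id: str
    creative: str
    bid_term: str


def build_ct_index(ad_groups: Dict[str, AdGroup]) -> List[CTUnit]:
    """Create one indexing unit per <creative, bid term> pair in each ad group."""
    index = []
    for group_id, group in ad_groups.items():
        # Cartesian product of the creatives and bid terms of this ad group.
        for creative, bid_term in product(group.creatives, group.bid_terms):
            index.append(CTUnit(group_id, creative, bid_term))
    return index


ad_groups = {
    "g1": AdGroup(["Cheap flights to Rome", "Fly to Rome today"],
                  ["rome flights", "cheap flights", "italy airfare"]),
    "g2": AdGroup(["Rome hotel deals"], ["rome hotels", "hotels in italy"]),
}
ct_index = build_ct_index(ad_groups)
```

With two creatives and three bid terms in the first ad group and one creative and two bid terms in the second, the sketch yields 2×3 + 1×2 = 8 indexing units.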
At step 504, a relevancy score is calculated for each indexing unit in the CTInd index with respect to the ad query. In an embodiment, the relevancy score for each indexing unit is calculated using the scoring function described above in Equation 2. However, persons skilled in the relevant art(s) will appreciate that other suitable scoring functions presently known or hereinafter developed may be used to perform this step.
At step 506, the K indexing units having the highest relevancy scores are identified and, at step 508, a ranked list of <creative, bid term> pairs is obtained from the K indexing units.
At step 510, the <creative, bid term> pairs in the ranked list are filtered to ensure that only a single <creative, bid term> pair per ad group is represented in the ranked list. This step may entail identifying <creative, bid term> pairs in the ranked list that belong to the same ad group and then removing all but the highest ranked <creative, bid term> pair in the set of identified pairs. Identifying <creative, bid term> pairs that belong to the same ad group may comprise identifying <creative, bid term> pairs that are associated with the same ad group identifier, gIDu.
At step 512, the filtered list of <creative, bid term> pairs produced by step 510 is returned to the entity that provided the ad query.
As demonstrated by flowchart 500, since the CTInd index is generated by indexing all possible <creative, bid term> pairs, a retrieval process that utilizes this index is essentially a single-level process. No post-processing is required, since indexing units correspond directly to displayable ads. As shall be discussed below, other more compact indexes require some additional ranking after the initial retrieval is performed. The CTInd index, on the other hand, only requires filtering by ad group to retrieve a single displayable ad per ad group. The foregoing retrieval algorithm is referred to herein as CTRank. Algorithm 1 shows the retrieval algorithm pseudocode.
[Algorithm 1: CTRank pseudocode, operating on indexing units in C × T]
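For illustration only, the CTRank retrieval flow (score, keep the top K units, filter by ad group) may be sketched in Python as follows. The term-overlap scorer is a simple stand-in chosen to keep the sketch short and is not the scoring function of Equation 2; the tuple layout and function names are likewise hypothetical.

```python
from typing import Callable, List, Tuple

Unit = Tuple[str, str, str]  # (ad group id, creative, bid term)


def overlap_score(query: str, text: str) -> float:
    """Stand-in scorer (an assumption): number of query terms found in text."""
    words = set(text.lower().split())
    return float(sum(1 for w in query.lower().split() if w in words))


def ct_rank(query: str, index: List[Unit], k: int,
            score: Callable[[str, str], float] = overlap_score) -> List[Unit]:
    # Score every unit on the concatenation of its creative and bid term.
    scored = [(score(query, f"{c} {t}"), (g, c, t)) for g, c, t in index]
    # Keep the K highest-scoring units (ties keep their index order).
    top_k = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    # Filter so that only the highest-ranked pair per ad group survives.
    seen, result = set(), []
    for _, unit in top_k:
        if unit[0] not in seen:
            seen.add(unit[0])
            result.append(unit)
    return result


index = [
    ("g1", "Cheap flights to Rome", "rome flights"),
    ("g1", "Cheap flights to Rome", "italy airfare"),
    ("g1", "Fly to Rome today", "rome flights"),
    ("g2", "Rome hotel deals", "rome hotels"),
]
ads = ct_rank("cheap rome flights", index, k=3)
```

In this toy example the three highest-scoring units all belong to ad group g1, so filtering leaves a single displayable ad even though K is 3. This illustrates how near-duplicate units from one ad group can crowd other ad groups out of the top K.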
In the CTInd index, the average number of indexing units per ad group is a product of cardinalities |
2. Creative Coupling Based Index and Retrieval
In a second indexing scheme in accordance with an embodiment of the present invention, each indexing unit represents a single creative c coupled with all the bid terms associated with its ad group gIDc. This indexing scheme advantageously produces a much smaller index than CTInd, since it does not require computing a Cartesian product of every creative and every bid term in an ad group. This index is referred to herein as CrtvInd.
At step 604, indexing units are generated by combining each field of text representative of a creative contained within an ad group with fields of text representative of all bid terms contained within the same ad group. This step is performed across all ad groups in the ad database. As noted above, this step results in a plurality of indexing units, wherein each indexing unit is a single creative c coupled with all the bid terms associated with its ad group gIDc.
At step 606, the indexing units are stored in a database that is accessible to an information retrieval system, such as an ad retrieval system.
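For illustration only, the generation of CrtvInd indexing units may be sketched in Python as follows. The `AdGroup` and `CrtvUnit` types and the in-memory list standing in for the index database are hypothetical.

```python
from typing import Dict, List, NamedTuple


class AdGroup(NamedTuple):
    creatives: List[str]
    bid_terms: List[str]


class CrtvUnit(NamedTuple):
    group_id: str
    creative: str
    bid_terms: List[str]  # all bid terms of the creative's ad group


def build_crtv_index(ad_groups: Dict[str, AdGroup]) -> List[CrtvUnit]:
    """Create one indexing unit per creative, coupled with its group's bid terms."""
    return [CrtvUnit(group_id, creative, list(group.bid_terms))
            for group_id, group in ad_groups.items()
            for creative in group.creatives]


ad_groups = {
    "g1": AdGroup(["Cheap flights to Rome", "Fly to Rome today"],
                  ["rome flights", "cheap flights", "italy airfare"]),
    "g2": AdGroup(["Rome hotel deals"], ["rome hotels", "hotels in italy"]),
}
crtv_index = build_crtv_index(ad_groups)
```

The number of units equals the number of creatives (three in this toy corpus), rather than the number of <creative, bid term> pairs.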
At step 704, a relevancy score is calculated for each indexing unit in the CrtvInd index with respect to the ad query. In an embodiment, the relevancy score for each indexing unit is calculated using the scoring function described above in Equation 2. However, persons skilled in the relevant art(s) will appreciate that other suitable scoring functions presently known or hereinafter developed may be used to perform this step.
At step 706, the K indexing units having the highest relevancy scores are identified and, at step 708, a ranked list of creatives is obtained from the K indexing units.
At step 710, creatives in the ranked list are filtered to ensure that only a single creative per ad group is represented in the ranked list. This step may entail identifying creatives in the ranked list that belong to the same ad group and then removing all but the highest ranked creative in the set of identified creatives. Identifying creatives that belong to the same ad group may comprise identifying creatives that are associated with the same ad group identifier, gIDc.
At step 712, for each creative in the filtered list, the bid term associated therewith that is the most relevant to the ad query is identified. In one embodiment, this step comprises applying the scoring function of Equation 2 to each bid term associated with each creative in the filtered list and selecting the bid term that receives the highest score for each creative. This step results in the generation of a ranked list of <creative, bid term> pairs.
At step 714, the list of <creative, bid term> pairs produced by step 712 is returned to the entity that provided the ad query.
As can be seen from the foregoing, a retrieval process that utilizes the CrtvInd index to retrieve a <creative, bid term> pair <c, t> first retrieves a ranked list of creatives, filters them by ad group, and then retrieves the bid term with the highest score associated with each creative in the list. This retrieval method is referred to herein as CrtvRank. Algorithm 2 shows the retrieval algorithm pseudocode.
[Algorithm 2: CrtvRank pseudocode, operating on indexing units in C]
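For illustration only, the CrtvRank retrieval flow (rank creatives, filter by ad group, then select the best bid term for each surviving creative) may be sketched in Python as follows. The term-overlap scorer is a stand-in and is not the scoring function of Equation 2; the tuple layout and function names are hypothetical.

```python
from typing import Callable, List, Tuple

Unit = Tuple[str, str, List[str]]  # (ad group id, creative, bid terms)
Ad = Tuple[str, str, str]          # (ad group id, creative, bid term)


def overlap_score(query: str, text: str) -> float:
    """Stand-in scorer (an assumption): number of query terms found in text."""
    words = set(text.lower().split())
    return float(sum(1 for w in query.lower().split() if w in words))


def crtv_rank(query: str, index: List[Unit], k: int,
              score: Callable[[str, str], float] = overlap_score) -> List[Ad]:
    # Score each unit on its creative plus all bid terms of its ad group.
    scored = [(score(query, " ".join([c] + ts)), (g, c, ts))
              for g, c, ts in index]
    top_k = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    seen, result = set(), []
    for _, (g, c, ts) in top_k:
        if g in seen:
            continue  # keep only the highest-ranked creative per ad group
        seen.add(g)
        # Second phase: choose the bid term that scores highest on its own.
        best_term = max(ts, key=lambda t: score(query, t))
        result.append((g, c, best_term))
    return result


g1_terms = ["rome flights", "cheap flights", "italy airfare"]
index = [
    ("g1", "Cheap flights to Rome", g1_terms),
    ("g1", "Fly to Rome today", g1_terms),
    ("g2", "Rome hotel deals", ["rome hotels", "hotels in italy"]),
]
ads = crtv_rank("cheap rome flights", index, k=3)
```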
In the CrtvInd index, for each creative that is indexed there is, on average, 1+|
3. Ad Group Coupling Based Index and Retrieval
In a third indexing scheme in accordance with an embodiment of the present invention, the indexing unit represents the ad group itself. That is to say, each indexing unit represents all of the creatives and all of the bid terms associated with a particular ad group. The index produced in accordance with this approach is referred to herein as AdGrpInd.
At step 804, indexing units are generated by combining fields of text representative of all creatives contained within an ad group with fields of text representative of all bid terms contained within the same ad group. This step is performed across all ad groups in the ad database. As noted above, this step results in a plurality of indexing units, wherein each indexing unit represents the ad group itself. At step 806, the indexing units are stored in a database that is accessible to an information retrieval system, such as an ad retrieval system.
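For illustration only, the generation of AdGrpInd indexing units may be sketched in Python as follows. The `AdGroup` and `AdGrpUnit` types and the in-memory list standing in for the index database are hypothetical.

```python
from typing import Dict, List, NamedTuple


class AdGroup(NamedTuple):
    creatives: List[str]
    bid_terms: List[str]


class AdGrpUnit(NamedTuple):
    group_id: str
    creatives: List[str]
    bid_terms: List[str]


def build_adgrp_index(ad_groups: Dict[str, AdGroup]) -> List[AdGrpUnit]:
    """Create one indexing unit per ad group; every field is stored once."""
    return [AdGrpUnit(group_id, list(group.creatives), list(group.bid_terms))
            for group_id, group in ad_groups.items()]


ad_groups = {
    "g1": AdGroup(["Cheap flights to Rome", "Fly to Rome today"],
                  ["rome flights", "cheap flights", "italy airfare"]),
    "g2": AdGroup(["Rome hotel deals"], ["rome hotels", "hotels in italy"]),
}
adgrp_index = build_adgrp_index(ad_groups)
```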
At step 904, a relevancy score is calculated for each indexing unit in the AdGrpInd index with respect to the ad query. In an embodiment, the relevancy score for each indexing unit is calculated using the scoring function described above in Equation 2. However, persons skilled in the relevant art(s) will appreciate that other suitable scoring functions presently known or hereinafter developed may be used to perform this step.
At step 906, the K indexing units having the highest relevancy scores are identified and, at step 908, a ranked list of ad groups is obtained from the K indexing units.
At step 910, for each ad group in the list, the creative associated therewith that is the most relevant to the ad query is identified. In one embodiment, this step comprises applying the scoring function of Equation 2 to each creative associated with each ad group in the list and selecting the creative that receives the highest score for each ad group.
At step 912, for each ad group in the list, the bid term associated therewith that is the most relevant to the ad query is identified. In one embodiment, this step comprises applying the scoring function of Equation 2 to each bid term associated with each ad group in the list and selecting the bid term that receives the highest score for each ad group.
The combination of steps 910 and 912 results in the generation of a ranked list of <creative, bid term> pairs. At step 914, this ranked list of <creative, bid term> pairs is returned to the entity that provided the ad query.
Thus, in order to retrieve a <creative, bid term> pair in accordance with the foregoing method, first a ranked list of ad groups is retrieved. Then, for each ad group, a creative and a bid term with the highest relevancy scores are retrieved. Since the creative and the bid term scores are independent, and assuming that the scoring function is monotonic, the retrieved <c, t> pair is the pair with the highest score in the ad group. This retrieval algorithm is referred to herein as AdGrpRank. The algorithm pseudocode is presented in Algorithm 3.
It is noted that the AdGrpInd index is the only index type described herein with no duplicated fields. The number of indexing units is the same as the number of ad groups, |G|, and for each ad group there are, on average, |
This is the most compact index of the three alternatives described herein. Note also that this indexing scheme eliminates the need for filtering the results by the ad group identifier gIDu, since by definition each retrieved ad will be constructed from a different ad group.
In view of the foregoing, it can be seen that the CrtvInd and AdGrpInd indexes, which combine like entity types having the same parent entity in the ad corpus (e.g., bid terms within the same ad group or creatives and bid terms within the same ad group), are far more compact than the CTInd index. This will advantageously result in reduced storage requirements for each index and in reduced execution time for ad retrieval processes performed using such indexes.
In the previous section, basic retrieval algorithms were described for each of the proposed indexing structures; these algorithms rank the retrieved ads by relevancy to the ad query. In some embodiments, additional steps may be performed to re-rank the retrieved ads in order to achieve improved or optimal ranking performance. It has been observed that as the size of the indexing unit grows beyond a representation of a simple <creative, bid term> pair (as is the case with the CrtvInd and AdGrpInd indexes described above), certain challenges present themselves that may hinder the performance of the aforementioned retrieval methods. The re-ranking methods described in this section are intended to address those challenges.
In the indexing strategies described in the preceding section, the indexed and retrieved units are hierarchically structured from atomic and composite fields. Previous work on structured document retrieval shows that a combination of field scores often yields better retrieval performance than matching each field independently.
To test this hypothesis, a simple experiment was designed that combined the scores obtained for the ad group and the <creative, bid term> pair retrieved by AdGrpRank (Algorithm 3). Formally, ads were re-ranked based on a mixture of scores:
$$sc_q(g,c,t) = \lambda \, sc_q(u_{c,t}) + (1-\lambda) \, sc_q(g) \qquad (3)$$
where $u_{c,t}$ is a <creative, bid term> pair <c, t> treated as a single indexing unit, and $sc_q(\cdot)$ is as defined in Equation 2.
It should be noted that, in contrast to previous work, where mixture models usually yield improvements, combining the <creative, bid term> pair score with the ad group score is detrimental: setting λ=1, and thereby ignoring the ad group structure, yields the best performance. It is postulated that this is a result of several key traits that differentiate ad retrieval applications such as sponsored search from other information retrieval tasks.
For example, sponsored search is characterized by a large variance in ad group lengths. While some ad groups contain only a few bid terms, others can have up to 1,000 bid terms associated with them. This causes a strong length bias: the probability of shorter documents (ad groups with a smaller number of terms) to be retrieved is higher than their probability of being relevant. While length bias is a well-known phenomenon in information retrieval, its effects are more pronounced in sponsored search, since the latter has a much higher variance of document lengths.
Another issue that is unique to sponsored search is the cohesiveness of the ad groups. In traditional retrieval, it is usually assumed that documents, even structured ones, are about a single topic; however, this assumption does not necessarily hold for ad groups. A set of bid terms associated with an ad group can range from being focused and cohesive (associated with a single service or product) to being fragmented or even scattered (associated with several products or even with a broad set of generic bid terms), depending on the strategy of the advertiser. There is also a possibility that some advertisers might misuse the bidding mechanisms by purposefully associating unrelated bid terms with their ad groups in an attempt to attract more customers.
Since existing structured retrieval techniques do not address these issues, a novel structured re-ranking approach is described herein that is specifically tailored towards ad retrieval applications such as sponsored search. The approach relies on the ad structure and employs features that go beyond the simple mixture model in Equation 3.
In one embodiment, the structured re-ranking method is a generalization of the mixture model presented in the previous section. It takes into account a given <creative, bid term> pair and the associated ad group. The method may be applied, for example, after performing an initial retrieval round using AdGrpRank (although other retrieval methods such as CrtvRank may be used). Then, the method re-ranks the initially retrieved ads using a linear model
$$rSc_q(g,c,t) = \sum_{i=1}^{n} \lambda_i \, f_i(g,c,t) \qquad (4)$$
where $f_i(\cdot)$ is a feature function, $\lambda_i$ is the weight assigned to the i-th feature function, and n is the number of such functions used for the re-ranking.
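As an illustration of re-ranking with such a linear model, the following sketch scores each candidate as a weighted sum of its feature values and sorts by that score. The function names and the representation of a candidate as an identifier plus a list of precomputed feature values are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def linear_score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of feature values: sum over i of weights[i] * features[i]."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * f for w, f in zip(weights, features))


def struct_rerank(
    candidates: List[Tuple[str, Sequence[float]]],
    weights: Sequence[float],
) -> List[str]:
    """Re-rank candidate ads (id, feature values) by descending linear score."""
    scored = [(linear_score(feats, weights), ad_id) for ad_id, feats in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ad_id for _, ad_id in scored]


# Hypothetical features per candidate: [pair score, ad group score].
candidates = [("a", [0.9, 0.1]), ("b", [0.5, 0.8])]
pair_only = struct_rerank(candidates, [1.0, 0.0])   # weight on the pair score only
mixture = struct_rerank(candidates, [0.5, 0.5])     # equal mixture of both scores
```

With exactly two features (the pair score and the ad group score) and weights λ and 1−λ, this reduces to the mixture model of Equation 3.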
At step 1104, the ranked list of <creative, bid term> pairs is re-ranked based on a plurality of relevancy-related features associated with each <creative, bid term> pair and/or the ad group with which such pair is associated. Such re-ranking may comprise, for example, calculating a relevancy score for each retrieved <creative, bid term> pair that is a linear combination of weighted relevancy-related features associated with each <creative, bid term> pair and/or the ad group with which such pair is associated. The linear combination may be that defined in Equation 4 above. As will be discussed below, the weights applied to each feature may be optimized. For example, the weights applied to each feature may be optimized by employing a learning to rank algorithm.
At step 1106, the re-ranked list of <creative, bid term> pairs is provided to an entity requesting ads.
The re-ranking procedure outlined above is referred to herein as StructRank. A pseudocode representation showing the application of the procedure to the AdGrpRank retrieval method is shown below as Algorithm 4.
The choice of features used in StructRank will play a crucial role in the resulting algorithm performance. Limiting the choice of features to field scores alone yields the mixture model in Equation 3. Since this model does not succeed in outperforming the baseline method that uses no structural information (CTRank), the set of features used by StructRank may be expanded to address certain issues specific to sponsored search that were described above. Using this expanded set of features will result in substantial performance improvements on a number of tasks.
Example features that may be used by StructRank as well as example methods that may be used for parameter optimization will now be described, although other features and methods not described herein may be used. The example features described below have the form f(g,c,t), and are therefore defined over a <c, t> pair and an ad group associated with it. The example features are as follows.
crtvTermPairScore is the score of the given <creative, bid term> pair. It is equivalent to the $sc_q(u_{c,t})$ component in the mixture model in Equation 3.
adGrpScore is the score of the entire ad group from which the <creative, bid term> pair was selected. It is equivalent to the $sc_q(g)$ component in the mixture model in Equation 3.
adGrpTermCount is the number of term fields in a given ad group. As previously mentioned, the number of bid terms associated with an ad group can vary significantly. Since document length can affect the document's prior probability of retrieval, the number of bid terms can be explicitly added as a feature to the model.
adGrpEntropy is the entropy of an ad group. The entropy is computed over the individual words of the ad group as $-\sum_{w \in g} p_g(w) \log p_g(w)$, where the probability $p_g(w)$ of word w is computed using a maximum likelihood estimate.
Following previous work, in which entropy was found to be related to document heterogeneity, entropy is used as an estimate of ad group cohesiveness: ad groups with smaller entropy will tend to be more cohesive.
adGrpQueryCover produces a real number r ∈ [0,1], where r is the fraction of query words “covered” by any field in the ad group. It is common, especially for longer queries, that not all query terms will appear in the selected creative and term pair. The adGrpQueryCover feature helps differentiate between the ad groups that achieve high relevance scores due to a disproportionate repetition of a single query word (bid term spamming) and those that achieve high relevance scores due to a more comprehensive coverage of query words.
AdGrp[Field]Ratio is the fraction of fields of type [Field] in an ad group that match at least one query word. Intuitively, one expects that ad groups with high field match ratios with respect to a query will yield more relevant <creative, bid term> pairs, since those ad groups will tend to be more focused on the query topic. [Field] denotes either one of the sub-fields of the creative field or the entire bid term field. This produces three features based on the location of the match: AdGrpURLRatio, AdGrpTitleRatio and AdGrpTermRatio.
Additional features other than those outlined above may also be used to implement the re-ranking method.
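Three of the example features above can be sketched as follows. This is only an illustration under stated assumptions: words are obtained by lowercased whitespace splitting, the maximum likelihood estimate is taken to be the relative frequency of a word within the ad group, the natural logarithm is used, and query coverage is computed over distinct query words. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List


def _words(text: str) -> List[str]:
    # Assumed tokenization: lowercase and split on whitespace.
    return text.lower().split()


def ad_grp_entropy(fields: Iterable[str]) -> float:
    """Entropy over the words of an ad group (assumes relative-frequency estimates)."""
    counts = Counter(w for field in fields for w in _words(field))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def ad_grp_query_cover(query: str, fields: Iterable[str]) -> float:
    """Fraction of distinct query words that appear in any field of the ad group."""
    query_words = set(_words(query))
    if not query_words:
        return 0.0
    group_words = {w for field in fields for w in _words(field)}
    return len(query_words & group_words) / len(query_words)


def ad_grp_field_ratio(query: str, fields: List[str]) -> float:
    """Fraction of the given fields (all of one type) matching at least one query word."""
    if not fields:
        return 0.0
    query_words = set(_words(query))
    matched = sum(1 for field in fields if query_words & set(_words(field)))
    return matched / len(fields)
```

For instance, passing only the bid term fields of an ad group to `ad_grp_field_ratio` would give a value playing the role of AdGrpTermRatio, while passing only titles would correspond to AdGrpTitleRatio.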
Various methods may be used for optimizing the free parameters in the ranking function in Equation 4. When the number of free parameters is small (as, for instance, is the case in Equation 3), it is possible to optimize the parameters using an exhaustive search over the parameter space. However, when more features are introduced, such an exhaustive search quickly becomes infeasible.
To address this problem, the large and growing body of literature on learning to rank methods for information retrieval may be relied upon. Learning to rank methods allow effective parameter optimization for ranking functions with respect to various retrieval metrics, even when the number of free parameters is high.
In one embodiment, a simple yet effective learning to rank approach is employed that directly optimizes the retrieval metric of choice. For example, the retrieval metric may be an nDCG metric. It is easy to see that the ranking function of Equation 4 is linear with respect to λi. Therefore, the coordinate ascent algorithm proposed by Metzler and Croft may be used (see D. Metzler and W. B. Croft, Linear Feature-based Models for Information Retrieval, Information Retrieval, 10(3):257-274, 2007). This algorithm iteratively optimizes a multivariate objective function (such as, for example, rScq (g,c,t) of Equation 4) by performing a series of one-dimensional line searches. It repeatedly cycles through each parameter λi, holding all other parameters fixed while optimizing λi. This process is performed iteratively over all parameters until the gain in the target metric is below a certain threshold.
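The cycling procedure described above can be sketched as follows. This is a simplified illustration rather than the Metzler and Croft implementation: each one-dimensional line search is approximated by trying a fixed list of candidate values, and the `metric` callable (which in practice might compute, for example, mean nDCG of the re-ranked results over training queries) is an assumption supplied by the caller.

```python
from typing import Callable, List, Sequence


def coordinate_ascent(
    metric: Callable[[List[float]], float],
    init: Sequence[float],
    candidates: Sequence[float],
    tol: float = 1e-6,
    max_rounds: int = 100,
) -> List[float]:
    """Maximize metric(weights) by optimizing one weight at a time."""
    weights = list(init)
    best = metric(weights)
    for _ in range(max_rounds):
        round_start = best
        for i in range(len(weights)):
            # Line search over weight i while all other weights stay fixed.
            for value in candidates:
                trial = weights[:i] + [value] + weights[i + 1:]
                score = metric(trial)
                if score > best:
                    best, weights = score, trial
        # Stop once a full cycle over all weights gains less than the threshold.
        if best - round_start < tol:
            break
    return weights


# Toy objective with a known maximum at (0.5, -1.0), for illustration only.
def toy_metric(w: List[float]) -> float:
    return -((w[0] - 0.5) ** 2 + (w[1] + 1.0) ** 2)


learned = coordinate_ascent(toy_metric, [0.0, 0.0], [-1.0, -0.5, 0.0, 0.5, 1.0])
```

Because the objective is only evaluated and never differentiated, the same loop can be used with non-smooth retrieval metrics.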
Although the coordinate ascent algorithm may be used for its simplicity and efficiency, any other learning to rank approach that estimates the parameters for linear models can be used. Other possible learning to rank algorithms include ranking SVMs, SVMMAP or RankNet.
Due to the linearity of the above-defined ranking function, query dependent features (e.g., query length) cannot be readily incorporated into StructRank, since they will have the same contribution across all documents associated with a query. While this can be addressed by using non-linear rankers, such rankers typically require more training data and are more prone to overfitting than linear models. As a middle ground between linear and non-linear approaches, an embodiment bins the queries, and trains a specific model for each bin. Any query-dependent feature (or combination thereof) can be used for query binning. In various experiments it was found that binning by query length is both conceptually simple and empirically effective for retrieval optimization.
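Query binning by length can be sketched as follows. The bin boundaries (queries of up to 2 words, 3 to 4 words, and 5 or more words) and the per-bin weight vectors are hypothetical values chosen only for illustration; the source does not specify them.

```python
from typing import Dict, List, Sequence


def length_bin(query: str, boundaries: Sequence[int] = (2, 4)) -> int:
    """Return the index of the length bin for a query (boundaries are assumed)."""
    n_words = len(query.split())
    for bin_index, upper in enumerate(boundaries):
        if n_words <= upper:
            return bin_index
    return len(boundaries)


def weights_for_query(query: str, models: Dict[int, List[float]]) -> List[float]:
    """Select the weight vector that was trained for this query's bin."""
    return models[length_bin(query)]


# Hypothetical per-bin weight vectors, e.g. learned separately for each bin.
models = {0: [1.0, 0.0], 1: [0.7, 0.3], 2: [0.5, 0.5]}
chosen = weights_for_query("cheap flights to paris", models)
```

Each bin has its own linear model, so query length can influence the ranking even though it is constant across all documents retrieved for a given query.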
The indexing and retrieval methods described above in the context of an ad retrieval system can easily be extended to other information retrieval systems in which the retrievals are structured hierarchically. To help illustrate this, general methods for performing hierarchically-structured indexing and retrieval will now be described in reference to
In particular,
At step 1204, indexing units are generated by combining fields of text representative of hierarchically-related entities in the hierarchical database, wherein the combining includes combining fields of text representative of entities having a same type and a same parent entity and wherein each resource is obtainable from one or more entities represented by the fields of text included in a single indexing unit.
At step 1206, the indexing units are stored in a database accessible to the information retrieval system.
At step 1304, one or more of the indexing units are selected based at least on the relevancy scores.
At step 1306, the one or more resources are identified based on the selected indexing unit(s), wherein each identified resource is obtainable from one or more entities represented by the fields of text included in a selected indexing unit.
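The general indexing and retrieval steps above can be sketched as follows. This is a minimal illustration under assumptions: entities are given as (parent identifier, entity type, text) tuples, one indexing unit is built per parent, an in-memory dictionary stands in for the database, and a simple word-overlap count stands in for the relevancy score.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Entity = Tuple[str, str, str]  # (parent_id, entity_type, text), an assumed layout
Unit = Dict[str, List[str]]    # entity_type -> texts sharing the same parent


def build_indexing_units(entities: List[Entity]) -> Dict[str, Unit]:
    """Combine fields of entities that have the same type and the same parent."""
    units: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for parent_id, entity_type, text in entities:
        units[parent_id][entity_type].append(text)
    return {parent: dict(fields) for parent, fields in units.items()}


def score_unit(query: str, unit: Unit) -> float:
    """Assumed relevancy score: occurrences of query words across all fields."""
    query_words = set(query.lower().split())
    return float(sum(
        1
        for texts in unit.values()
        for text in texts
        for word in text.lower().split()
        if word in query_words
    ))


def retrieve(query: str, units: Dict[str, Unit], k: int) -> List[str]:
    """Select the k indexing units with the highest relevancy scores."""
    return sorted(units, key=lambda p: score_unit(query, units[p]), reverse=True)[:k]


entities = [
    ("g1", "creative", "paris flights"),
    ("g1", "term", "paris"),
    ("g2", "term", "shoes"),
]
units = build_indexing_units(entities)
selected = retrieve("paris", units, k=1)
```

The resources to return are then assembled from the entities stored in each selected unit, for example by picking the best-scoring text of each entity type.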
Any of the operational components of systems 100 or 200 as described above in reference to
As shown in
System 1500 also includes a main memory 1506, preferably random access memory (RAM), and may also include a secondary memory 1520. Secondary memory 1520 may include, for example, a hard disk drive 1522, a removable storage drive 1524, and/or a memory stick. Removable storage drive 1524 may comprise a floppy disk drive, a magnetic tape drive, an optical disk drive, a flash memory, or the like. Removable storage drive 1524 reads from and/or writes to a removable storage unit 1528 in a well-known manner. Removable storage unit 1528 may comprise a floppy disk, magnetic tape, optical disk, or the like, which is read by and written to by removable storage drive 1524. As will be appreciated by persons skilled in the relevant art(s), removable storage unit 1528 includes a computer usable storage medium having stored therein computer software and/or data.
In alternative implementations, secondary memory 1520 may include other similar means for allowing computer programs or other instructions to be loaded into system 1500. Such means may include, for example, a removable storage unit 1530 and an interface 1526. Examples of such means may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, and other removable storage units 1530 and interfaces 1526 which allow software and data to be transferred from removable storage unit 1530 to system 1500.
System 1500 may also include a communication interface 1540. Communication interface 1540 allows software and data to be transferred between system 1500 and external devices. Examples of communication interface 1540 may include a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, or the like. Software and data transferred via communication interface 1540 are in the form of signals which may be electronic, electromagnetic, optical, or other signals capable of being received by communication interface 1540. These signals are provided to communication interface 1540 via a communication path 1542. Communications path 1542 carries signals and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link and other communications channels.
As used herein, the terms “computer program medium” and “computer readable medium” are used to generally refer to media such as removable storage unit 1528, removable storage unit 1530 and a hard disk installed in hard disk drive 1522. Computer program medium and computer readable medium can also refer to memories, such as main memory 1506 and secondary memory 1520, which can be semiconductor devices (e.g., DRAMs, etc.). These computer program products are means for providing software to system 1500.
Computer programs (also called computer control logic, programming logic, or logic) are stored in main memory 1506 and/or secondary memory 1520. Computer programs may also be received via communication interface 1540. Such computer programs, when executed, enable system 1500 to implement features of the present invention as discussed herein. Accordingly, such computer programs represent controllers of the computer system 1500. Where an aspect of the invention is implemented using software, the software may be stored in a computer program product and loaded into system 1500 using removable storage drive 1524, interface 1526, or communication interface 1540.
The invention is also directed to computer program products comprising software stored on any computer readable medium. Such software, when executed in one or more data processing devices, causes the data processing device(s) to operate as described herein. Embodiments of the present invention employ any computer readable medium, known now or in the future. Examples of computer readable media include, but are not limited to, primary storage devices (e.g., any type of random access memory) and secondary storage devices (e.g., hard drives, floppy disks, CD-ROMs, zip disks, tapes, magnetic storage devices, optical storage devices, MEMS, nanotechnology-based storage devices, etc.).
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be understood by those skilled in the relevant art(s) that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined in the appended claims. Accordingly, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.