The present invention relates to the field of Internet searching. In particular, the present invention discloses a method and related system for deciding which external corpora to integrate into primary search engine results.
Traditional web search engines retrieve a ranked list of URLs in response to a query from a user. Increasingly, search results include content from specialized sub-collections or corpora known as “verticals”, which may include non-text media collections such as images and videos, as well as genre-specific subsets of the web such as news and blogs. When a general web search engine has access to or maintains vertical search engines, one important task becomes the detection and presentation of relevant vertical results, known as “vertical selection”. An objective is to maximize user satisfaction by presenting the appropriate vertical display or displays: this includes the presentation of no display when the user is best satisfied by general web results.
One important aspect of the task of generating a single ranked list from multiple sub-collections (“distributed information retrieval”) is deciding which sub-collections to search given a user's query. This may be approached using query classification techniques that automatically match queries to a predefined set of categories. This predefined set may comprise topical categories such as games, business, or health. However, this methodology is incomplete because it does not take into account other sources of evidence, such as user feedback.
Disclosed is a computer-implemented method and system for deciding which external corpora, such as verticals, to integrate into primary Internet search engine results in response to a query. In some embodiments, the method includes:
In step 310, a first estimate of the probability of each vertical being relevant to the query is computed using the offline evidence from step 100. These estimates (which encompass k+1 outcomes for k possible verticals: one for each vertical, and one for the case in which no vertical is relevant) are incorporated into a statistical quantity known as the “prior” probability distribution.
In step 320, if user feedback results for the vertical relevance to the query are available, those results are used to modify the prior distribution obtained from the offline results. This may be accomplished in different ways, depending on the functional form assumed for the prior distribution. Two prior distributions include: 1) Beta (or multiple Beta) prior distribution, and 2) logistic normal prior distribution. The prior distribution modified by the user feedback data is called the posterior probability distribution. This posterior distribution incorporates the offline evidence or data, and also incorporates the user feedback data. It provides an improved estimate of the probabilities of relevance of the possible verticals to the query.
In optional step 325, user feedback data from similar queries is incorporated into the user feedback for the current query. This step is positioned differently depending on the functional form of the prior distribution used. For a multiple Beta prior, similar query user feedback is incorporated just following step 310. For a logistic normal prior, similar query user feedback is incorporated into the current query user feedback during step 320.
In step 330, display decisions are made, based on the results from steps 320 and 325. Among the possible display decisions are: 1) present the vertical with the highest probability of relevance, or no vertical if that outcome is most probable, and 2) randomly choose a vertical, with the probability of selection being proportional to each vertical's probability of relevance.
Step 300, the gathering of the offline data, is in itself a complex task. A more detailed flow description of step 300 is illustrated in
In step 400, for this embodiment, corpus features are incorporated from two distinct sets of corpora: collections of vertical-sampled documents, obtained using a variation of query-based sampling, and collections of Wikipedia-sampled articles, each mapped to a vertical heuristically using the Wikipedia article's categories. In one embodiment, four types of corpus features are used: retrieval effectiveness features; ReDDE features; soft ReDDE features; and categorical features. Retrieval effectiveness features may use Clarity, a cross-language information retrieval system, to quantify the predicted effectiveness of the query's retrieval on a vertical-representative corpus. ReDDE (Relevant Document Distribution Estimation, a resource-ranking algorithm that explicitly tries to estimate the distribution of relevant documents across the set of available databases) features are derived from a retrieval on an index that combines vertical-representative documents (either vertical- or Wikipedia-sampled). Soft ReDDE features generalize the ReDDE features: instead of having each document map to a single vertical, a soft document-to-vertical membership is derived using the similarity between the document and the vertical. Finally, categorical features are derived from the labels of automatically classified documents (e.g., sports, health, science). A query's membership in a category is proportional to the number of top-ranked documents assigned to the category. In other words, the data gathered from the corpus features correlates a query with verticals according to several mappings between query content and vertical content. Note that although ReDDE and Clarity are two examples of corpus-based features, other corpus-based features may be used without deviating from the spirit or scope of the invention.
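As an illustrative (non-limiting) sketch of the categorical features described above, the following computes a query's soft membership in each topical category as the fraction of top-ranked retrieved documents classified into that category. The function name and the normalization to a fraction are illustrative assumptions, not part of the embodiment.

```python
from collections import Counter

def categorical_features(top_doc_categories, all_categories):
    """Soft membership of a query in each topical category.

    top_doc_categories: category label of each top-ranked document
    retrieved for the query (e.g., ["sports", "health", ...]).
    Membership is proportional to the number of top-ranked documents
    assigned to the category, normalized here to sum to 1.
    """
    counts = Counter(top_doc_categories)
    total = sum(counts.values()) or 1  # avoid division by zero
    return {c: counts.get(c, 0) / total for c in all_categories}
```

A query whose top results are mostly sports pages would thus receive a high "sports" membership and near-zero membership elsewhere.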
In step 405, data is gathered from query log features. The use of query log features is motivated by the assumption that a vertical's relevance to a query may correlate with the likelihood of that query being issued directly to the vertical. In one embodiment, vertical-specific unigram language models were built using one year of queries issued directly by users to the vertical in question. Query log features use the query generation probability given by the vertical's query-log language model. Note that other non-language model query log features may be used, in isolation or in combination. The data gathered from the query log, in other words, takes a specific vertical and models which queries were directed to that vertical over the past year.
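A minimal sketch of the query generation probability under a vertical's query-log unigram language model follows. The add-alpha smoothing scheme is an assumption made for the sketch; the embodiment does not specify how unseen terms are smoothed.

```python
import math

def query_generation_logprob(query_terms, vertical_lm, vocab_size, alpha=0.1):
    """Log probability of generating the query from a vertical's
    query-log unigram language model.

    vertical_lm: dict mapping term -> count in the vertical's query log.
    Add-alpha smoothing (an assumption) handles unseen terms.
    """
    total = sum(vertical_lm.values())
    logp = 0.0
    for t in query_terms:
        p = (vertical_lm.get(t, 0) + alpha) / (total + alpha * vocab_size)
        logp += math.log(p)
    return logp
```

A query such as "election" scores higher under a news vertical's query-log model than under a travel vertical's, providing evidence for news-vertical intent.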
In step 410, data is gathered from query string features. Query string features derive from the query string itself, independent of external resources. For example, if the query contains the word “news”, we may assume news vertical intent. The rule-based vertical trigger features, used in one embodiment, are based on 45 classes that characterize vertical intent (e.g., weather, zip code, music artist, movie). Each trigger class is associated with manual classification rules using regular expressions and dictionary lookups. In addition, a rule-based geographic entity tagger is used to assign probabilities to the set of geographic entities appearing in the query (e.g., city, country, landmark). Each of these geography types is considered a separate feature. Note that the query string features described herein are exemplary and other query string features may be used without deviating from the spirit or scope of the invention.
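The rule-based trigger features above can be sketched as regular-expression matches over the query string. The three trigger classes and patterns below are hypothetical stand-ins for the embodiment's 45 manually constructed classes.

```python
import re

# Hypothetical trigger classes; the embodiment uses 45 manually
# constructed classes (weather, zip code, music artist, movie, ...).
TRIGGER_RULES = {
    "weather": re.compile(r"\b(weather|forecast|temperature)\b", re.I),
    "zip_code": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
    "movie": re.compile(r"\b(movie|showtimes|trailer)\b", re.I),
}

def trigger_features(query):
    """Binary rule-based vertical-trigger features for a query string."""
    return {name: int(bool(rule.search(query)))
            for name, rule in TRIGGER_RULES.items()}
```

A dictionary lookup (e.g., a list of known music artists) would be implemented analogously, testing set membership of query terms rather than a pattern match.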
Referring back to
Referring now to
The first exemplary form of prior distribution is a multiple beta prior. Beta distributions are generally described in Wikipedia at http://en.wikipedia.org/wiki/Beta_distribution.
pqv ~ Beta(aqv, bqv) (1)
with the parameters aqv and bqv, which control the shape of the prior distribution, being derived from the offline model probability πqv as follows:
aqv = μπqv, bqv = μ(1−πqv) (2)
The inputs for the prior distribution, therefore, are πqv (the offline probability model) and μ, a hyper-parameter set by the system designer that may be any positive number. A large value of μ will concentrate the distribution around πqv, whereas a small value of μ will spread out the distribution.
In step 615, using the prior distribution of equations (1) and (2), and assuming that positive and negative feedback information is available for the query-vertical pairs (Rqv is defined as the number of clicks (i.e., positive feedback), and Sqv as the number of skips (i.e., negative feedback)), the posterior mean estimate of the probability of relevance is

(aqv + Rqv)/(aqv + bqv + Vqv),

where Vqv = Rqv + Sqv is the total number of feedback events observed for the query-vertical pair.
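A minimal sketch of the multiple-Beta machinery follows, assuming the standard Beta-Bernoulli conjugate update: the prior shape parameters come from the offline estimate per equation (2), and the posterior mean blends that prior with the observed click feedback. Function names are illustrative.

```python
def beta_prior_params(pi_qv, mu):
    """Shape parameters of the Beta prior from the offline estimate
    pi_qv and the designer-set concentration mu (equation (2))."""
    return mu * pi_qv, mu * (1.0 - pi_qv)

def beta_posterior_mean(pi_qv, mu, clicks, total_feedback):
    """Posterior mean of p_qv after observing `clicks` positive
    feedback events out of `total_feedback` feedback events
    (standard conjugate update; a sketch under that assumption)."""
    a, b = beta_prior_params(pi_qv, mu)
    return (a + clicks) / (a + b + total_feedback)
```

With no feedback the posterior mean equals the offline estimate; as feedback accumulates, it converges toward the observed click rate, with μ controlling how quickly the prior is overridden.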
The second exemplary form of prior distribution is the logistic normal prior. Logistic normal distributions are described in J. Aitchison and S. M. Shen, “Logistic-normal distributions: Some properties and uses,” Biometrika, 67(2): 261-272, August 1980. The flow of this method is illustrated in
where: W and
In step 710, as can be derived using this type of prior distribution, the posterior mean is expressed as:
Rqv and
It has been found that the logistic normal prior method is best suited to cases where there is a clear preferred vertical, whereas the multiple beta method is more effective in cases of similarly rated, or ambiguous, verticals.
Referring now to
A corpus-based similarity measure using language models of retrieved results is used to detect similarity between queries. In an embodiment, in step 800, two query language models are compared using the Bhattacharyya correlation, which is described in Wikipedia at http://en.wikipedia.org/wiki/Bhattacharyya_distance. The Bhattacharyya correlation ranges between 0 and 1 and is defined as

B(qi, qj) = Σw √(P(w|θqi)P(w|θqj)),
where P(w|θqi) is the probability of term w under θqi, the language model associated with query qi.
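The Bhattacharyya correlation above admits a direct implementation over unigram language models represented as term-to-probability dictionaries; only terms appearing in both models contribute to the sum.

```python
import math

def bhattacharyya(p, q):
    """Bhattacharyya correlation between two unigram language models,
    given as dicts mapping term -> probability. Ranges between 0 and 1;
    equals 1 when the two distributions are identical and 0 when their
    supports are disjoint.
    """
    return sum(math.sqrt(p[w] * q[w]) for w in set(p) & set(q))
```

Two queries that retrieve similar result sets yield similar language models and hence a correlation near 1, signaling that feedback from one query is informative for the other.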
The information from similar queries is incorporated as follows for the two types of priors discussed. First, in step 805a, for the multiple beta model, the prior of the candidate query is modified to become p̂qv, known as the nearest neighbor estimate of pqv, given by:

p̂qv = (1/Zq) Σq′ B(q, q′) pq′v,

where Zq is a normalization factor equal to Σq′ B(q, q′). In step 810a, the offline model estimate πqv is then modified and computed to equal
π̂qv = (1−λq)πqv + λqp̂qv, (9)
where λ is a designer-set parameter that can range from 0 to 1, which controls the importance of the nearest-neighbor estimate relative to the offline model estimate. λq equals λ multiplied by the maximum similarity value over the set of similar queries q′.
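A sketch of steps 805a and 810a for a single vertical v follows, combining the similarity-weighted nearest-neighbor estimate with the blend of equation (9). The parallel-list representation of neighbors is an illustrative choice.

```python
def adjusted_offline_estimate(pi_qv, neighbor_estimates, similarities, lam):
    """Blend the offline estimate pi_qv with the nearest-neighbor
    estimate p-hat_qv for one vertical (steps 805a/810a, equation (9)).

    neighbor_estimates[i]: feedback-based estimate p_{q'v} for the i-th
    similar query q'; similarities[i]: Bhattacharyya correlation B(q, q').
    lam_q = lam * (maximum similarity), so dissimilar neighborhoods
    contribute little.
    """
    Z = sum(similarities)  # normalization factor Z_q
    p_hat = sum(s * p for s, p in zip(similarities, neighbor_estimates)) / Z
    lam_q = lam * max(similarities)
    return (1.0 - lam_q) * pi_qv + lam_q * p_hat
```

When the most similar neighbor is only weakly similar, λq shrinks and the estimate stays close to the offline value πqv, as intended.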
Second, in step 805b, for the logistic normal prior model, similar query data is incorporated by adding elements to covariance matrix Σ. Using this method, it can be derived that the similar query data modifies the exponents aqv and bqv in equation (6) to become
Thus, the similar query feedback data modifies the current query user feedback equations. Note that use of the Bhattacharyya similarity measure is exemplary: other types of similarity measures, such as cosine similarity, may be used without deviating from the spirit or scope of the invention.
Referring to
The addition of a random aspect (known as the ε-greedy method) presents random displays for queries with some probability ε. Another randomization method, referred to herein as the Boltzmann method, exploits the posterior means across verticals. This method can be broadly described as follows: randomly choose a vertical with a probability proportional to the probability of relevance of that vertical. A visual representation of the randomness injected into the selection is throwing darts at a board with regions corresponding to the various verticals, where the area of each region is proportional to the corresponding vertical's probability of relevance. Thus verticals with a higher likelihood of relevance are included in the random component more often than verticals with a lower likelihood of relevance.
Specifically, using the Boltzmann method, in order to incorporate a random element, the decision about which vertical to present is sampled from a multinomial over verticals, this multinomial being derived from the estimated vertical relevance probabilities p̃qv. An exemplary form of the multinomial is a Boltzmann distribution of the form

P(v) = (1/Z) exp(p̃qv/τ),

where Z = Σv exp(p̃qv/τ), and τ, a positive quantity, is a designer-set parameter which controls the uniformity of the random vertical selection. As τ approaches ∞, the selection approaches a uniform distribution over verticals (maximally random); as τ approaches zero, it approaches a deterministic choice of the most probable vertical.
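Both display-decision strategies can be sketched as follows: Boltzmann sampling draws a vertical from the distribution above, and ε-greedy selection falls back to the highest-probability vertical except with probability ε. Function names and the default parameter values are illustrative.

```python
import math
import random

def boltzmann_select(p_relevance, tau=0.1, rng=random):
    """Sample a vertical from a Boltzmann distribution over the
    estimated relevance probabilities (dict: vertical -> estimate).
    Larger tau -> closer to uniform; tau -> 0 approaches argmax.
    """
    weights = {v: math.exp(p / tau) for v, p in p_relevance.items()}
    Z = sum(weights.values())
    r = rng.random() * Z
    cum = 0.0
    for v, w in weights.items():
        cum += w
        if r <= cum:
            return v
    return v  # numerical safety for r ~ Z

def epsilon_greedy_select(p_relevance, epsilon=0.1, rng=random):
    """With probability epsilon show a uniformly random vertical;
    otherwise show the vertical with the highest estimated relevance."""
    if rng.random() < epsilon:
        return rng.choice(list(p_relevance))
    return max(p_relevance, key=p_relevance.get)
```

The exploration induced by either method generates feedback for verticals that the current estimates would otherwise never display, which is what allows the posterior updates of step 320 to correct a poor offline prior.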
Evaluation
An important aspect of the methods described herein is evaluating the effectiveness of the display decisions. Table 1 summarizes the results for the best performing runs of the algorithms described herein, for all queries.
Table 1 lists a quantity called the normalized Umacro, the normalized macro-averaged utility, for the various algorithms. The average utility for an individual query is computed by summing the comparison between the user intent and the prediction over the set of times the query was issued. This individual query average utility is then averaged over the set of queries to obtain the macro-averaged utility. A normalization factor equal to the best expected value for macro-averaged utility is incorporated to obtain the normalized Umacro. The upper bound on normalized Umacro is 1 (i.e., a perfect system has a performance equal to 1). A designer-set parameter, δ, which ranges between 0 and 1, is defined as the probability of correctly detecting user feedback (i.e., it introduces noise into the feedback). The higher the value of δ, the more accurate and less noisy is the feedback. Note that preferred adaptation algorithms are robust to noisy feedback.
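The macro-averaging described above can be sketched as follows. The per-issuance utility function used here (1 for a correct prediction, 0 otherwise) and the omission of the normalization factor are simplifying assumptions for illustration.

```python
def macro_averaged_utility(per_query_records):
    """Macro-averaged utility: average the per-issuance utility over
    all issuances of each query, then average over queries.

    per_query_records: one list per query of (intent, prediction) pairs,
    one pair per time the query was issued. The 0/1 utility below is an
    illustrative assumption; normalization by the best expected value
    (yielding normalized Umacro) is omitted.
    """
    per_query_means = []
    for issuances in per_query_records:
        utilities = [1.0 if intent == pred else 0.0
                     for intent, pred in issuances]
        per_query_means.append(sum(utilities) / len(utilities))
    return sum(per_query_means) / len(per_query_means)
```

Macro-averaging weights each distinct query equally, so performance on rare queries counts as much as performance on frequent ones.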
Row 1 in Table 1 represents the offline estimate, without user feedback. Row 2 is the Multiple Beta model with a uniform prior (i.e., a feedback-only model); Row 3 is Multiple Beta with the offline π prior; Row 4 incorporates similar query intent; Row 5 adds ε-greedy randomization; Row 6 utilizes the Boltzmann form for the randomization. Rows 7-10 follow the same pattern as Rows 3-6, but using the Logistic Normal prior model.
The results summarized in Table 1 demonstrate that, although feedback-only models can outperform offline-only models, combining the two results in significant improvements. It is seen that using a logistic normal prior outperforms multiple beta priors across all queries. However, it can also be seen that multiple beta priors with randomized decision making provides stable performance for both single and multiple intent queries, i.e., queries for which multiple verticals are relevant. Multiple Beta priors outperform logistic normal priors for multiple intent queries.
System Considerations
A server system, as defined herein, may include a single server computer or a plurality of server computers. The servers may be located at a single facility or at multiple facilities. In some embodiments, the vertical module may comprise a plurality of servers, such as server systems 9301 to 930N. The vertical selector may comprise one or more additional servers, coupled to and accessible by the server systems for the vertical module, such as server systems 9301 to 930N. In addition, the third parties to the query processing system, such as integrator networks, third party agents and third party recipients, may comprise one or more servers, such as servers 9301 to 930N. As such, servers 9301 to 930N are intended to represent a broad class of server farm architectures and may be configured in any manner without deviating from the spirit or scope of the invention.
The client system 910 may include a desktop personal computer, workstation, laptop, PDA, cell phone, any wireless application protocol (WAP) enabled device, or any other device capable of communicating directly or indirectly to a network. The client system 910 typically runs a web-browsing program that allows a user of the client system 910 to request and receive content from server systems 9301 to 930N over network 920. The client system 910 typically includes one or more user interface devices 940 (such as a keyboard, a mouse, a roller ball, a touch screen, a pen or the like) for interacting with a graphical user interface (GUI) of the web browser on a display (e.g., monitor screen, LCD display, etc.).
In some embodiments, the client system 910 and/or system servers 9301 to 930N are configured to perform the methods described herein. The methods of some embodiments may be implemented in software or hardware configured to optimize the selection of additional content to be displayed to a user.
The computer system 1000 may further include a mass storage device 1020, peripheral device(s) 1030, portable storage medium drive(s) 1040, input control device(s) 1070, a graphics subsystem 1050, and an output display 1060. For purposes of simplicity, all components in the computer system 1000 are shown in
The portable storage medium drive 1040 may operate in conjunction with a portable non-volatile storage medium, such as a compact disc read only memory (CD-ROM), to input and output data and code to and from the computer system 1000. In one embodiment, the query processing system software is stored on such a portable medium, and is input to the computer system 1000 via the portable storage medium drive 1040. The peripheral device(s) 1030 may include any type of computer support device, such as an input/output (I/O) interface, to add additional functionality to the computer system 1000. For example, the peripheral device(s) 1030 may include a network interface card for interfacing the computer system 1000 to a network.
The input control device(s) 1070 provide a portion of the user interface for a user of the computer system 1000. The input control device(s) 1070 may include an alphanumeric keypad for inputting alphanumeric and other key information, a cursor control device, such as a mouse, a trackball, stylus, or cursor direction keys. In order to display textual and graphical information, the computer system 1000 may contain the graphics subsystem 1050 and the output display 1060. The output display 1060 may include a cathode ray tube (CRT) display or liquid crystal display (LCD). The graphics subsystem 1050 receives textual and graphical information, and processes the information for output to the output display 1060. The components contained in the computer system 1000 are those typically found in general purpose computer systems, and in fact, these components are intended to represent a broad category of such computer components that are well known in the art.
In some embodiments, the query processing system is software that includes a plurality of computer executable instructions for implementation on a general-purpose computer system. Prior to loading into a general-purpose computer system, the query processing system software may reside as encoded information on a computer readable medium, such as a hard disk drive, non-volatile memory (e.g., flash), compact disc read only memory (CD-ROM) or DVD.
Some embodiments may include a computer program product which is a storage medium (media) having instructions stored thereon/in that may be used to control, or cause, a computer to perform any of the processes of the invention. The storage medium may include, without limitation, any type of disk including floppy disks, mini disks (MDs), optical disks, DVDs, CD-ROMs, micro-drives, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices (including flash cards), magnetic or optical cards, nanosystems (including molecular memory ICs), RAID devices, remote data storage/archive/warehousing, or any type of media or device suitable for storing instructions and/or data.
Stored on any one of the computer readable media, some implementations include software for controlling both the hardware of the general purpose/specialized computer or microprocessor, and for enabling the computer or microprocessor to interact with a human user or other mechanism utilizing the results of the invention. Such software may include without limitation device drivers, operating systems, and user applications. Ultimately, such computer readable media further include software for performing aspects of the invention, as described above.
Included in the programming (software) of the general/specialized computer or microprocessor are software modules for implementing the teachings of the invention according to the processes described above.
In one hardware implementation, the query processing system may comprise a dedicated processor including processor instructions for performing the functions described herein. Circuits may also be developed to perform the functions described herein.
It is not expected that the invention should be limited to the exact embodiments described herein. It should be apparent to those skilled in the art that changes and modifications can be made without departing from the inventive concept. By way of example, other types of query string, log, corpus, and feedback features can be combined. These include classifiers using user feedback information as features directly combined with non-feedback features.
The techniques described herein have application for use in cases where the vertical is owned by the search engine (e.g., the corpora are properties of the general search engine). They may also be used when the corpora are not owned by the search engine (e.g., a digital library which only provides a limited interface to the general search engine). Furthermore, they can also be used for non-vertical content such as “calculators” or other automatic processes which impact web search results. The scope of the invention should be construed in view of the claims.
Number | Name | Date | Kind |
---|---|---|---|
7769746 | Lu et al. | Aug 2010 | B2 |
20020062216 | Guenther et al. | May 2002 | A1 |
20040264780 | Zhang et al. | Dec 2004 | A1 |
20090281894 | Ratnaparkhi | Nov 2009 | A1 |
Number | Date | Country | |
---|---|---|---|
20110131246 A1 | Jun 2011 | US |