Web search engines have been developed that allow users to search for information on a network such as the Internet by submitting a query including of one or more query terms. For example, a user may enter a query into a data entry field of a web page associated with a web search engine. The web search engine can access the query and search the network for information that satisfies the query. Web search results that depict, among other things, one or more URLs to documents that satisfy the query are found by the web search engine and are displayed.
One obstacle to obtaining the proper web search results is that users often misspell the query terms associated with their query. To alleviate this problem, many web search engines check the spelling of the query terms and provide suggestions to the user for correcting the query.
Traditional research on spelling correction has relied on pre-defined lexicons for detecting spelling errors. However, this method does not work well for web query spelling correction since there is no lexicon that can cover the vast number of query terms that may be used for performing a web search.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
This Summary is provided to introduce concepts concerning query spelling correction which are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
A technology for query spelling correction is disclosed. In one method approach, web search results generated based on a query term are received. The web search results are used as a part of determining a correction candidate for the query term if the query term is incorrectly spelt. For example, web search results that include the word “vacuum” may be used to determine that a query term “vaccum” is incorrectly spelt. The word “vacuum” from the web search results can be used as a correction candidate for the incorrectly spelt query term “vaccum.”
The accompanying drawings, which are incorporated in and form a part of this specification, illustrate embodiments of the technology for query spelling correction and, together with the description, serve to explain principles discussed below:
The drawings referred to in this description should be understood as not being drawn to scale except if specifically noted.
Reference will now be made in detail to embodiments of the present technology for query spelling correction, examples of which are illustrated in the accompanying drawings. While the technology for query spelling correction will be described in conjunction with various embodiments, it will be understood that they are not intended to limit the present technology for query spelling correction to these embodiments. On the contrary, the presented technology for query spelling correction is intended to cover alternatives, modifications and equivalents, which may be included within the spirit and scope the various embodiments as defined by the appended claims. Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present technology for query spelling correction. However, the present technology for query spelling correction may be practiced without these specific details. In other instances, well known methods, procedures, components, and circuits have not been described in detail as not to unnecessarily obscure aspects of the present embodiments.
Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present detailed description, discussions utilizing terms such as “receiving,” “using,” “configuring,” “obtaining,” “determining,” “making,” “analyzing,” “providing,” “performing” or the like, refer to the actions and processes of a computer system, or similar electronic computing device. The computer system or similar electronic computing device manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission, or display devices. The present technology for query spelling correction is also well suited to the use of other computer systems such as, for example, optical and mechanical computers.
According to one embodiment, web search results are used as a part of determining a correction candidate for an incorrectly spelt query term. First, web search results can be used as a part of determining whether a query term is correctly spelt. For example, navigational queries usually contain terms that are parts of destination URLs, which may contain out-of-vocabulary terms since there are millions of sites on the Internet. Inferring navigational queries through URL fragment matching can help reduce the probability that a query that includes a query term for an uncommon web site name, such as “innovet” would be incorrectly modified to a more commonly used word such as “innovate.” In another example, web search results can be used to identify acronyms. For example, assume that the query includes the query term “SCOPI” which is an acronym for an organization “Systeme de Communication et de Partage de Information™.” Traditional query spelling correction models would probably incorrectly modify “SCOPI” to “scope.” Frequently an acronym is specified in text using a text pattern where the full name of an organization is followed by the acronym of the organization in parenthesis. The incorrect modification of the acronym “SCOPI” can be prevented by analyzing web search results based on text patterns to identify acronyms.
Second, web search results can be used as a part of determining correction candidates for incorrectly spelt query terms. For example, terms associated with web search results provide information as to a user's intention. In a specific example, if a user incorrectly spelt a query term “vacuum” as “vaccum” there is a high probability that the web search results obtained from the incorrectly spelt query term “vaccum” will include the correctly spelt word “vacuum.”
A user may enter a query into a data entry field of a web page associated with a web search engine. The query may include one or more query terms. The query terms may be separated by blanks. The query terms may be, among other things, words or acronyms. The web search engine can access the query and search the network for information that satisfies the query. Examples of queries include “Vaccum cleaner,” “vacuum cleaner,” “Vaccum,” “Vacuum,” “scopi,” and so on. “Vaccum cleaner” and “Vacuum cleaner” are examples of queries that include more than one query terms.
Web search results that depict for example one or more URLs to documents that satisfy the query are found by the web search engine.
Web search results can be used to identify acronyms using for example text patterns, according to one embodiment. Referring to
Terms associated with web search results provide information as to a user's intention, according to one embodiment. In a specific example, if a user incorrectly spelt a query term “vacuum” as “vaccum” there is a high probability that the web search results obtained from the incorrectly spelt query term “vaccum” will include the correctly spelt word “vacuum.” This can be attributed to the collective link text distribution on the Internet where many links with misspelled text point to sites with correctly spelt text. For example, although the web search results depicted in
Web search results typically include fewer web pages for incorrectly spelt query terms than for correctly spelt query terms. For example, approximately 1.5 million web pages may be returned for the incorrectly spelt query term “vaccum” and approximately 72 million web pages may be returned for the correctly spelt query term “vacuum.”
Web search results for a correctly spelt query term such as “vacuum” and an incorrectly spelt version such as “vaccum” typically include similar context information. For example, the web search results for “vacuum” and “vaccum” may both include terms such as “cleaner,” “pump,” “bag,” or “systems.” The similar context of the web pages returned for searches performed using “vacuum” and “vaccum” can be used to indicate that “vaccum” is an incorrect spelling of “vacuum.”
As will become more evident acronyms, context, number of terms associated with web search results are just a few features that can be used as a part of query spelling correction.
A maximum entropy model is used to model the conditional probability Pτ(c|q) (also known as a “posterior probability”) associated with equation 1 (
Equation 2 (
is a normalization factor and ƒi(c,q) is a feature function defined over query q and correction candidate c. n is the total number of features and λi is the corresponding feature weight. Feature weights can be optimized by maximizing the posterior probability on a training set by selecting the λi that maximizes the log probability of the correct query given the incorrect query, for example, across the entire training set. λs can be optimized using numerical optimization algorithms such as the Generalized Iterative Scaling (GIS) algorithm by maximizing the posterior probability of the training set which contains, for example, manually labeled set of query-truth pairs as expressed by equation 3 (
There are various possible ways of interpreting one or more query terms associated with a query. The various possible ways of interpreting a user specified query q may be referred to as a confusion set C, according to one embodiment. An implementation of a maximum entropy model can be trained using a subset of a confusion set C. A correction candidate c is a query string that is one possible way of interpreting all or part of the user specified query q, according to one embodiment. The correction candidate c may be selected from the confusion set C.
According to one embodiment, correction candidates associated with a confusion set are generated for each query term associated with a query from a term extracted from a source, such as query logs or web search results. Continuing the example, in this case a query may be “vacccum cleaner,” the query term may be “vaccum” and the correction candidate may be “vacuum,” which is extracted from a source such as a query log or web search results. Various spelling correction methods such as generating correction candidates based on edit distance or phonetic similarity can be used. According to one embodiment, the relative proximity of two characters on a standard QWERTY keyboard layout is used as a part of determining edit distance. Phonetic similarity can be determined using a text-to-phoneme converter or by retrieving a phonetic description from a lexicon, for example. The correction candidates for an entire query can be generated by composing the correction candidates of each individual query term. For example, assume that q=w1 . . . wn, and that the confusion set of wi is Cwi, then according to one embodiment the confusion set of q is Cw1{circle around (×)}Cw2{circle around (×)} . . . Cwn. More specifically, for a query q that equals w1 w2, assume that w1 has candidates c11 and c12 and w2 has candidates c21 and c22, then the confusion set C is {c11c21,c11c22,c12c21,c12c22}, according to one embodiment. For simplicity, compound and composite errors were not denoted in Cw1{circle around (×)}Cw2{circle around (×)} . . . Cwn.
A confusion set can be pruned, for example, by roughly ranking the candidates based on the statistical n-gram language model estimated from a source, such as query logs. A subset of the confusion set C that results for example from the pruning and that contains for example a specified number of top-ranked (most probable) candidates can be used for offline training of an implementation of the maximum entropy model or for online re-ranking, or a combination thereof. The number of correction candidates can be used as a parameter to balance top-line performance and run-time efficiency.
A maximum entropy model can be used to describe the possibility that an event will occur based on pieces of evidence by estimating the probabilistic distribution of the pieces of evidence, according to one embodiment. A type of evidence can be thought of as a feature. A feature may be implemented as a variable in a software program. Values for features (referred to herein as “feature values”) can be obtained from various sources. For example, feature values can be obtained from a lexicon, query logs, or web search results, as will become more evident. A maximum entropy model can be configured using the feature values.
The following is a list of features (also referred to as “initial features”), according to one embodiment. Values for initial features can be obtained, for example, from a lexicon or a query log.
The following is a list of features (referred to herein as “web search report features”), according to one embodiment. Values for web search report features can be obtained, for example, from a web search report.
The feature values depend, according to one embodiment, on factors such as what the features are, initial features or web search report features for example, what query was used, what sources the feature values were obtained from and so on.
According to one embodiment, a maximum entropy model can be configured with feature sets (also referred to herein as feature values).
Various features can be used by a maximum entropy model to determine a set of one or more correction candidates. To illustrate, assume that a user specified the incorrectly spelt query term “vaccum” and the correction candidate “vacuum” is suggested. However, the system does not know whether “vacuum” is correct. For example, referring to
The correction candidates are ranked according to one embodiment to determine one or more top ranked correction candidates, according to one embodiment. Continuing the example of an incorrectly spelt query term “vaccum” and the suggested correction candidate “vacuum,” the system may not be able to determine based on model M1 whether “vacuum” is correct. So according to one embodiment, the correction candidate(s) are ranked. For example, the correction candidate c, which in this illustration would be the correctly spelt “vacuum,” is processed by the web search engine to obtain second web search results Rc.
Feature set S2 is obtained from the web search results Rc. The feature set S2 is for web search result features. Feature set S2 is generated for each candidate based on Rc. The feature set S2 contains feature values pertaining to document similarities between Rq and Rc, according to one embodiment. For example, cosine measurements based on query term frequency vectors of Rq and Rc can be used to generate feature values S2. Feature set S2 can be used to configure the maximum entropy model resulting in model M2. M2 may be used to re-rank the candidates.
As depicted in
All of, or a portion of, the embodiments described by flowchart 600 can be implemented using computer-readable and computer-readable program code (also known as “instructions”) which reside, for example, in computer-usable media of a computer system or like device. The computer-usable media can be any kind of memory that instructions can be stored on. Examples of the computer-usable media include but are not limited to a disk, a compact disk (CD), a digital video device (DVD), read only memory (ROM), flash, and so on. As described above, certain processes and steps are realized, in one embodiment, as a series of instructions (e.g., software program) that reside within computer readable memory of a computer system and are executed by the processor of the computer system. When executed, the instructions cause the computer system to implement the functionality of various embodiments as described below.
In preparation of the flowchart 600, assume that queries that were received at a web search engine may be stored in a query log and that feature values S0 for one or more initial features were obtained from a lexicon or a query log, or a combination thereof. Various initial features are described, among other places, in the “Features” section.
At 620, web search results generated based on a query term are received. For example, a user specified query q (
At 630, the web search results are used as a part of determining a correction candidate for the query term if the query term is incorrectly spelt. For example, feature values S1 (
The feature values S0 and S1 are used to configure a maximum entropy model associated with the determiner 520 resulting in model M1, according to one embodiment. The feature values S1 may be obtained for features such as a number of pages, URL fragments and so on associated with the web search results Rq. For example, the number of pages for the incorrectly spelt “vaccum” would be fewer than the number of pages for the correctly spelt “vacuum” which can be used as an indication of what the correct spelling is. In another example, the number of times that the query term or the correction candidate occur in the web search results Rq can be used as a part of determining whether a query term is misspelt or whether a correction candidate under consideration should be made a correction candidate. For example, the number of times that the correction candidate “vacuum” occurs in web search results Rq would probably far exceed the number of times that the query term “vaccum” occurs in the web search results Rq. Further, the web search results for both “vaccum” and “vacuum” would probably include similar content such as “cleaner,” “bags,” and so on further indicating that “vaccum” and “vacuum” are potentially the same word. The feature values S1 may include a URL fragment from the web search results Rq that can be used to determine if a query term is incorrectly spelt. For example, if the query term were “SCOPI” and the web search results Rq included a URL for the organization “Systeme de Communication et de Partage de Information™” with a URL fragment “SCOPI” the determiner 520 would be able to use the URL fragment to determine that “SCOPI” is correctly spelt and not modify “SCOPI” to “scope.”
M1 (
According to one embodiment, a top ranked correction candidate is determined based on the correction candidate c. For example, the correction candidate c, which in this illustration would be the correctly spelt “vacuum,” is processed by the web search engine to obtain second web search results such as web search results Rc. Feature values S2 are obtained from the second web search. The feature values S2 may be for one or more web search results features which are described, among other places, in the “Features” section. Feature values S2 can be received or determined by the determiner 520 and are used to configure the maximum entropy model associated with the determiner 520 resulting in model M2. Any of the features that values were obtained for S1 could also be used for S2, according to one embodiment. M2 may be used to re-rank correction candidates.
With reference now to
System 700 of
Referring still to
Referring still to
The computer-readable and computer-executable instructions reside, for example, in data storage features such as computer usable volatile memory 708, computer usable non-volatile memory 710, and/or data storage unit 712 of
Although the subject matter has been described in a language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
This Application is related to U.S. patent application Ser. No. 11/589,557 by Mu Li and Ming Zhou, filed on Oct. 30, 2006 and entitled “DISTRIBUTIONAL SIMILARITY-BASED MODELS FOR QUERY CORRECTION” with attorney docket no. 317147.01, assigned to the assignee of the present patent application the contents of which are incorporated herein.