The field of the invention generally relates to information retrieval methods, and more particularly, to a method and system for information retrieval that improves the relevance of search results obtained using a search engine. In one aspect of the invention, a method and system for retrieving documents or web pages uses a search engine to provide relevant information to the user. Information retrieval is based, at least in part, on the use of adaptive language processing methods to resolve ambiguities inherent in human language.
Current search engines rank search results based on many assumptions that must be made in advance. These assumptions can be, for example, the user's desired information or goal, whether the user is looking for specific content seen before, researching a novel topic, or locating some resource. Many times, the search engine must assume the meanings of ambiguous queries submitted by the requester. Such ambiguous queries are common due to the short queries typically input to the search engine. Moreover, in many languages, particularly the English language, words have multiple meanings. Finally, ambiguities often arise from poorly formed queries. In these situations current search engines make broad assumptions, implementing so-called “majority rules.” For example, a search engine might assume that a user issuing the query “jaguar” is looking for the JAGUAR automobile because that is what 80% of users were looking for previously. These assumptions, however, often turn out to be incorrect.
Consequently, it becomes increasingly difficult for search engines to find the non-majority usages of terms. Conventional search engines thus have difficulty “searching beyond the norm.” For example, if a user is looking for the Jaguars football team or the JAGUAR operating system produced by APPLE COMPUTER, the requester would have to add additional query words to the search. Alternatively, the requester would have to attempt using complex “advanced search” features. Neither method, however, guarantees better results. As a result, requesters are often left to wade through pages and pages of irrelevant documents. This problem is only exacerbated by the ever increasing volume of content being created and archived.
Most current search engines locate pages or documents based on one or more “keywords,” which are usually defined by words separated by spaces and/or punctuation marks. Search engines usually first pre-process a collection of documents to generate reverse indexes. An entry in a reverse index contains a keyword, such as “watch” or “check,” and a list of documents within the collection that contain the keyword of interest. When a user issues a query such as “watch check babysitter” to the search engine, the search engine can quickly retrieve the list of documents containing these three keywords by looking up the reverse indexes. This avoids the need to search the entire collection of documents for each query, which, of course, is a time-consuming process. Recently, more sophisticated search engines, such as GOOGLE and TEOMA, improve keyword searching by prioritizing the search results via measures of relevancy based on how the stored documents reference each other via hypertext links. For example, a higher degree of linking may be used as a proxy for relevancy.
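The reverse-index look-up described above can be sketched as follows. This is a minimal illustrative sketch, not the described system's implementation; the toy document collection and function names are assumptions for illustration only.

```python
# Minimal sketch of a reverse (inverted) index over a toy in-memory
# document collection; the collection and function names are illustrative.
from collections import defaultdict

def build_reverse_index(docs):
    """Map each keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def lookup(index, *terms):
    """Return ids of documents containing all query terms."""
    results = [index.get(t, set()) for t in terms]
    return set.intersection(*results) if results else set()

docs = {
    1: "watch the babysitter check in",
    2: "check your watch",
    3: "the babysitter will watch the child and check the time",
}
index = build_reverse_index(docs)
hits = lookup(index, "watch", "check", "babysitter")  # documents 1 and 3
```

Looking up each term's posting set and intersecting them is what lets the engine avoid scanning the whole collection per query.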
Unfortunately, keyword-based search engines fail to account for the many ambiguities present in all natural (e.g., human) languages. For example, the word “driver” has multiple meanings: “driver” may refer to an operator of a vehicle, a piece of computer software, a type of tool, a golf club, and the like. When a user is seeking documents containing a particular type of driver, he or she can either: (1) sort through the results manually to eliminate documents using a different meaning of “driver,” or (2) compose complex queries to make the request less ambiguous, such as “(golf (driver or club)) and not (golf cart driver),” or (3) wade through the “advanced search” interface(s) in order to reduce the irrelevant documents returned by the search engines. These options are, however, time consuming and tedious, and require users to expend additional effort in understanding, or worse, adapting their own searches to the particulars of a search engine to improve their search results.
A better model is for the search engine to comprehend or “understand” the documents as humans reading them would. As such, the search engine would extract the meanings commonly understood by those reading the document or web page. In doing so, documents or web pages are organized based on the meanings of the words and not the words themselves. In this scenario the number of irrelevant documents can be greatly reduced, thereby improving the user's experience and the search engine's effectiveness in retrieving relevant documents. Unfortunately, understanding natural language texts requires resolving ambiguities inherent in all natural languages, a task that can be difficult even for humans. Similarly, computer programs written to analyze words contained in documents are also unable to resolve these ambiguities reliably.
Current search engines suffer from the limitation in that they leave these linguistic ambiguities unresolved. Attempts have been made, however, to develop models aimed to mitigate this problem. Generally, the most common approaches can be divided into two major groups: (1) feature-based models and (2) language-based models. Feature-based models extract features from documents and convert them into predefined representations, such as feature vectors, categories, clusters, and statistical distributions. These transformations enable the approximation of the closeness in meaning between documents and the requesters' queries by calculations done using these representations. Unfortunately, these representations need to be stored in addition to the index, thereby greatly increasing the storage requirements of the search engine utilizing such a system. This option is less desirable for most large-scale search engines capable of handling the number of documents and web pages contained on a network such as, for instance, the Internet. One can reduce the size of these representations to save space, but this also decreases their effectiveness and, thus, their utility.
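The feature-based models described above can be illustrated with a brief sketch: documents and queries become term-count feature vectors, and closeness in meaning is approximated by cosine similarity between those vectors. The vectorization scheme and example texts below are assumptions chosen for illustration, not the representations of any particular prior-art system.

```python
# Illustrative sketch of a feature-based model: documents and queries are
# converted to term-count vectors, and closeness in meaning is approximated
# by cosine similarity. The example texts are hypothetical.
import math
from collections import Counter

def vectorize(text):
    """A trivial feature extraction: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(v1, v2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v1[t] * v2[t] for t in v1)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

doc_a = vectorize("golf driver club golf swing")
doc_b = vectorize("install the printer driver software")
query = vectorize("golf driver")

sim_a = cosine(query, doc_a)  # shares two terms with the query
sim_b = cosine(query, doc_b)  # shares only "driver"
```

Note that such vectors must be stored alongside the index, which is the storage burden the passage above identifies; and because they count words rather than meanings, both "driver" senses contribute to the same dimension.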
Furthermore, these approaches still rely on “shallow” features, i.e., the words themselves, to approximate the underlying semantics. That is, current models treat documents as a “bag of words,” where each word is represented by its presence and neighboring words. Therefore, these approaches ignore the well-formed structures of natural language, a simplification with several problems. The following four sentence fragments illustrate the problem with these models:
Because these four fragments contain the same four words, a “bag of words” model will treat them all as semantically equivalent, i.e., having the same meaning. A human reader, however, would easily see that this is not the case. One improvement is to retain ordering and proximity information of these keywords, but the true semantic meaning remains inaccessible. For example, assume a user queries a search engine with “Apple” and “fell.” A search engine based on the so-called shallow features will find the following three sentences equally relevant because “Apple” and “fell” appear next to each other:
A human reader would understand that it is “shares” and “the man” that fell in the second and third sentence, respectively. It is all too common for a user to read a document returned by a search engine only to find it irrelevant because the engine ignored such linguistic ambiguities.
A different approach is for a search engine to analyze the documents to extract their meaning—an area of research called natural language processing (NLP). This field studies various approaches that can best resolve language ambiguities, including linguistics based, data-driven, and semantics based techniques. The goal is to recover the semantics intended by the author. For example, the model may identify the computer software meaning of “driver” in the following sentence:
“The driver needed by the Golf computer game can be found here.”
A search engine can then create indices based on the meaning instead of the keywords, i.e., a semantic index or conceptual index. A user looking for a software driver using such a search engine would not be inundated with documents regarding golf clubs or vehicle operators, for example. Moreover, structural ambiguities such as the “Apple fell” example discussed above are also resolved to properly identify the long-distance dependencies between words.
There are two major obstacles preventing a search engine from realizing these benefits of NLP techniques. These include accuracy and efficiency. Although NLP's accuracy has been steadily improving, it has not improved the accuracy of information retrieved on a large scale. This is because the accuracy level of resolving linguistic ambiguity (i.e., disambiguation) is still lacking, and thus the errors made cancel the benefits NLP provides. One reason for this canceling effect is that the information retrieval (IR) models usually accept only one interpretation from the NLP systems. In doing so, however, disambiguation errors are treated as correct by the IR systems, thus producing the nullifying effect.
The second challenge is efficiency. Because of the enormous number of documents linked to the Internet, processing large amounts of text can be too time consuming to be practical. For example, full analysis of sentential structure, i.e., parsing, requires a significant amount of time (e.g., at least polynomial time). Resolving references made with articles and pronouns can involve complex aligning procedures. Reconstructing the structure of a discourse requires complex record-keeping and sophisticated algorithms. Therefore, applications of these more “in-depth” NLP techniques are hampered by the amount of computational resources needed, especially when dealing with the enormous and fast-growing collection on the Internet.
Related to the efficiency issue is accuracy. While algorithms that avoid in-depth analysis exist and thus reduce the amount of computation resources needed, they come at a price of lowered accuracy. That is, the improved efficiency is made possible by ignoring, for example, long-distance dependencies and complex relations within texts. The challenge is in striking a delicate balance between accuracy, efficiency, and practicality. Thus, the goal is to provide an information retrieval system and method that can accurately resolve natural language ambiguities to improve the system's search quality, while at the same time is efficient such that it can be used to index large collections such as the Internet and keep pace with its phenomenal growth.
There thus is a need for a system and method that efficiently searches and identifies relevant information for a requester. The system and method would advantageously account for lexical ambiguities. Moreover, in certain embodiments, the method would provide the user with a simple way to eliminate results that are unwanted. The system and method also would present the most relevant information to the requester in a manner that mitigates or eliminates entirely the process of wading through lists of unrelated or irrelevant documents.
In one aspect of the invention, an improved system and method for information retrieval is provided that improves the resolution of ambiguities prevalent in human languages. This system and method includes four main components including: (1) an adaptive method for natural language processing, (2) an improved method for incorporating language ambiguities into indexes, (3) an improved method for disambiguating requesters' queries, and (4) an improved method for generating user feedback based on the disambiguated queries.
In one aspect of the invention, the language processing used in the present invention is an adaptive and integrative approach to resolving ambiguities, referred to as the Adaptive Language Processing (ALP) module. The ALP module is adaptive in the sense that it balances the need for accuracy and efficiency. The process begins with resolving part-of-speech and word sense ambiguities based on local information, making it more efficient. However, if additional analysis is performed, such as chunking, full parsing, anaphora resolution, etc., the NLP model leverages this additional information to improve the method's accuracy. Consequently, the method balances efficiency with accuracy, in that ambiguities are quickly resolved in a first pass, and if more accuracy is needed, more computation can be allocated.
An important aspect of ALP's output, which is also maintained throughout the IR model, is a measure of confidence (MOC) parameter or value. This MOC value represents the amount of confidence, or conversely, the amount of ambiguity, the model associates with each ambiguous decision. Because current NLP models are not 100% accurate, and because some ambiguities can sometimes be intentional, the present invention entertains multiple interpretations as well as their associated confidence measures. The MOC value allows the model to better integrate multiple sources of ambiguities into interpretations that are more semantically coherent. The result is reduced retrieval errors, an improved user experience, as well as improved reliability as NLP technology improves.
For example, using the earlier “driver” query, the ALP module is not forced to make only a single decision for “driver,” a difficult task because of the limited context. Instead, the ALP module produces a MOC value for each possible meaning, such as 50% confident for the “software driver” meaning, 35% confident for the “golf club” meaning, etc. This measure is then maintained and utilized throughout the IR model to improve search quality. The MOC value may also be retained to provide user assistance.
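The multiple-interpretation output described above can be sketched as a mapping from each candidate meaning to its MOC value. This is a hedged illustration only: the sense inventory, the prior confidence figures, and the stubbed context handling are all hypothetical placeholders, not the ALP module's actual model.

```python
# Hedged sketch: instead of committing to one sense, the disambiguator
# attaches a measure of confidence (MOC) to every candidate meaning.
# The sense inventory and scores below are hypothetical.
def disambiguate(word, context):
    senses = {
        "driver": {"software driver": 0.50, "golf club": 0.35,
                   "vehicle operator": 0.15},
    }
    scores = dict(senses.get(word, {word: 1.0}))
    # A fuller implementation would adjust the scores using the context;
    # here the priors are returned unchanged as a placeholder.
    return scores

moc = disambiguate("driver", context=["found", "a"])
best = max(moc, key=moc.get)  # most confident meaning, kept with rivals
```

The key point is that the lower-confidence meanings are retained rather than discarded, so a downstream ranking step can still recover them.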
In one aspect of the invention, a user's query is processed by the following steps. First, a list of documents or web pages and associated MOC values are retrieved from the reverse indexes. These MOC values are then used to disambiguate the user's query via a “confidence intersection” formed by a matrix of the various ambiguous meanings attributable to a particular query vis-à-vis the number of documents containing the queried term(s). The documents or web pages are then sorted based on the disambiguated query, presenting more semantically relevant results higher on the list. Optionally, a list of alternative interpretations of the query is provided for the user. If the wrong interpretation is chosen initially, users can readily choose the correct one and quickly eliminate irrelevant results.
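The query-processing steps above can be sketched as follows. The data shapes are assumptions: each reverse-index entry is taken to store per-document MOC values keyed by sense, and the combination rule (multiplying query-side and document-side MOC values) is an illustrative choice, not necessarily the claimed computation.

```python
# Sketch of a "confidence intersection": query-side MOC values are crossed
# with document-side MOC values to pick the most confident shared meaning
# and to score each document. Data shapes and the combining rule are
# illustrative assumptions.
def confidence_intersection(query_moc, doc_sense_moc):
    """query_moc: sense -> query-side MOC value.
    doc_sense_moc: doc_id -> {sense: document-side MOC value}.
    Returns (best_sense, doc_id -> combined confidence score)."""
    scores = {}
    sense_totals = {s: 0.0 for s in query_moc}
    for doc_id, senses in doc_sense_moc.items():
        best = 0.0
        for sense, q in query_moc.items():
            combined = q * senses.get(sense, 0.0)
            sense_totals[sense] += combined
            best = max(best, combined)
        scores[doc_id] = best
    best_sense = max(sense_totals, key=sense_totals.get)
    return best_sense, scores

query_moc = {"software driver": 0.5, "golf club": 0.35}
doc_sense_moc = {
    "d1": {"software driver": 0.9},
    "d2": {"golf club": 0.8},
    "d3": {"software driver": 0.2, "golf club": 0.3},
}
best_sense, scores = confidence_intersection(query_moc, doc_sense_moc)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Sorting on the combined scores puts documents matching the dominant interpretation first, while documents matching the alternative interpretation remain on the list rather than being discarded.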
An additional benefit of the semantic-based IR model enabled by NLP is its ability to suggest additional search terms based on conceptual similarity. The uniqueness of this approach is that the suggestions are more relevant since they are based on the disambiguated queries. Furthermore, the suggestions are compiled automatically during the language analysis step done by the ALP module. These suggestions are linguistically correct and semantically disambiguated. Moreover, the suggestions reflect and adapt to the ever-changing body of documents searched by the search engine. Consequently, these suggestions provide to the users instant access to relevant documents that are semantically similar to their current query.
In one aspect of the invention, a method of indexing documents for use with a search engine includes the steps of identifying the words contained in a document. The words are processed in an adaptive language processing module so as to associate each word with a measure of confidence (MOC) value, the MOC value being associated with a particular meaning of the word. Each word and its MOC value is stored in a reverse index along with location information for the document. The documents may be indexed using, for example, a crawler and an indexer.
In the method described above, each word within a document may also be associated with a part-of-speech tag identifying the grammatical usage of the word within the document. The part-of-speech tag may be associated with a MOC value. In addition, in the method described above, each word within a document may also be associated with a word sense value identifying a particular meaning of the word. The word sense value may be associated with a MOC value.
In still another embodiment of the invention, a method of retrieving documents using a search engine includes providing a reverse index including one or more keywords and a list of documents containing the one or more keywords, the reverse index further including a MOC value associated with the one or more keywords. One or more query terms are input to the search engine. Based on the input query terms, one or more meanings of the query terms are identified and each meaning is associated with a MOC value. A list of documents is then retrieved containing the one or more query terms, wherein the documents are ranked at least in part on the MOC value associated with the one or more keywords contained in the document and the MOC value associated with each query term meaning.
In one preferred aspect of the invention, the documents having a keyword meaning most similar to the query term with the highest MOC value are ranked higher. This ranked list may be presented to the user on his or her computer (or other device) to provide a list of documents that are more relevant than lists returned by conventional search engines.
In one aspect of the invention, the user may be presented with one or more alternative queries. The one or more alternative queries may comprise known phrases formed by consecutive query terms. The alternative queries may be ranked according to their respective usage frequencies. Alternatively, the one or more alternative queries may be based at least in part on part-of-speech pairings of multiple keywords contained within the documents. In yet another embodiment, the alternative queries may be based in part on synonym(s) of one or more query terms. Alternatively, the one or more queries may be based in part on definition(s) of the input query terms. In still another aspect, the alternative queries may be based at least in part on the disambiguated query. The alternative queries may also be presented to the user in a ranked order. For example, alternative queries may be ranked based on usage frequency or on semantic similarity to the input query.
In another embodiment of the invention, a method of retrieving documents using a search engine includes providing a reverse index including one or more keywords and a list of documents containing the one or more keywords, the reverse index further including a MOC value associated with the one or more keywords. One or more query terms are input into the search engine. The query terms are disambiguated by obtaining a MOC value for each query term based at least in part on the meaning of each query term. A list of documents is retrieved containing the one or more query terms, wherein the retrieved documents are initially ranked based at least in part on the MOC value associated with the keyword contained in the document and the measure of confidence value associated with each query term meaning. The list of documents is then re-ranked based at least in part on the semantic similarity of each document to the disambiguated query. The semantic similarity of a document to the disambiguated query may be determined by looking up pre-computed distances between every two concepts within an ontology.
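The two-stage ranking described above can be sketched as an initial MOC-based ordering blended with a semantic-similarity score. The similarity table stands in for the pre-computed pair-wise concept distances, and the blending weight `alpha` is a hypothetical parameter introduced for illustration.

```python
# Sketch of the two-stage ranking: an initial MOC-based score is blended
# with semantic similarity to the disambiguated query. All values and the
# blending weight are hypothetical.
def rerank(initial, similarity_to_query, alpha=0.5):
    """initial: doc_id -> MOC-based score.
    similarity_to_query: doc_id -> semantic similarity in [0, 1].
    Blends the two scores and returns doc ids, best first."""
    blended = {
        d: (1 - alpha) * s + alpha * similarity_to_query.get(d, 0.0)
        for d, s in initial.items()
    }
    return sorted(blended, key=blended.get, reverse=True)

initial = {"d1": 0.45, "d2": 0.40, "d3": 0.10}   # from the MOC ranking
similarity = {"d1": 0.2, "d2": 0.9, "d3": 0.5}   # from the ontology table
order = rerank(initial, similarity)
```

With these illustrative numbers, a document that scored slightly lower on MOC alone but is much closer to the disambiguated query ("d2") moves to the top of the re-ranked list.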
In another embodiment of the invention, a method of retrieving documents using a search engine includes submitting a query to a search engine and presenting a user with a list of documents, the list including an exclusion tag associated with each document in the list. One or more exclusion tags in the list are selected to exclude one or more documents. Next, a similarity measure is determined for each document in the list based at least in part on the similarity of the document to those documents associated with a selected exclusion tag. The list is then re-ranked based on the determined similarity measure, wherein those documents most similar to the excluded documents are demoted or removed from the re-ranked list.
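The exclusion-tag re-ranking above can be sketched as follows. The Jaccard word-overlap measure is an illustrative stand-in for whatever similarity measure an embodiment would actually use, and the threshold is a hypothetical parameter.

```python
# Hedged sketch of exclusion-tag re-ranking: documents similar to those a
# user excluded are demoted or dropped. Jaccard word overlap and the
# threshold are illustrative stand-ins for the similarity measure.
def jaccard(a, b):
    """Word-set overlap between two texts, in [0, 1]."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def rerank_with_exclusions(results, excluded_ids, threshold=0.5):
    """results: list of (doc_id, text), best first.
    Documents at or above the similarity threshold to any excluded
    document are removed; the remainder are demoted in proportion to
    that similarity."""
    excluded_texts = [t for d, t in results if d in excluded_ids]
    kept = []
    for doc_id, text in results:
        if doc_id in excluded_ids:
            continue
        sim = max((jaccard(text, e) for e in excluded_texts), default=0.0)
        if sim < threshold:
            kept.append((sim, doc_id))
    kept.sort()  # least similar to the exclusions first
    return [d for _, d in kept]

results = [
    ("d1", "jaguar car dealer prices"),
    ("d2", "jaguar car dealer reviews"),
    ("d3", "jaguar big cat habitat"),
]
reranked = rerank_with_exclusions(results, excluded_ids={"d1"})
```

Excluding one automobile page here also removes the near-duplicate automobile page, leaving the wildlife page, which is the behavior the embodiment describes.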
The user may also be presented with a list of categories, wherein each category includes an exclusion tag associated therewith, wherein selection of the exclusion tag associated with a particular category excludes documents from the re-ranked list that fall within the particular category.
In another aspect of the invention, an improved method for ranking the relevance of search results is provided. This method includes three general steps including: (1) providing a user-interface component that is easy for requesters to specify the results they do not want (the documents to eliminate), (2) computing a similarity measure of all the results to those eliminated, and (3) based on the similarities, re-ranking the results list so those with similar content to the eliminated documents are ranked lower or removed entirely.
According to still another embodiment of the invention, a method of retrieving documents using a search engine includes establishing a user preference for a plurality of categories of documents, submitting a query to a search engine, determining a similarity measure between the documents based at least in part on the similarity of the documents to the established category preferences, and presenting the user with a list of documents, wherein the documents are ranked based on the determined similarity measure.
It is thus an object of the invention to provide a method and system for retrieving information using a search engine. The method and system provide more relevant documents to a user by efficiently and accurately resolving linguistic ambiguities contained in both documents and submitted queries. A method is also provided that permits the display or presentation of the most relevant documents to a user. Irrelevant or unwanted documents can easily be removed from returned query lists to limit or eliminate the need to sift through pages of returned documents. Further features and advantages will become apparent upon review of the following drawings and description of the preferred embodiments.
Still referring to
Thus a sample output from the ALP module 120 for the sentence “He found a driver” would be the following:
He PRP(1.0)[#1(1.0)]
found VBD(0.8)[#1(0.4)/#2(0.5)/#3(0.1)] ADJ(0.1)[#1(1.0)] NN(0.1)[#1(1.0)]
a DT(1.0)[#1(1.0)]
driver NN(1.0)[#1(0.4)/#2(0.3)/#3(0.3)]
The symbol following the word is the part-of-speech tag (PRP for pronouns, VBD for past tense verbs, DT for determiners, and NN for nouns). The number appearing after the POS tag is the MOC value generated by the ALP module 120, such as 0.8 for “found” being a verb and 0.1 for being an adjective. Following the POS tags are the word sense numbers and their respective MOC values. In this example, “driver” has three noun senses, and due to the ambiguous context, all three senses are almost equally likely.
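The annotated output above can be represented as structured data. The following is a hedged sketch of a parser for exactly the `TAG(moc)[#n(moc)/...]` notation shown; the regular expressions and field names are assumptions based on that sample, not a specification of the ALP module 120.

```python
# Sketch of parsing the ALP output notation shown above into structured
# data. The regular expressions assume exactly the TAG(moc)[#n(moc)/...]
# shape of the sample output.
import re

TAG_RE = re.compile(r"(\w+)\(([\d.]+)\)\[([^\]]*)\]")
SENSE_RE = re.compile(r"#(\d+)\(([\d.]+)\)")

def parse_alp_line(line):
    """Split 'word TAG(m)[#1(m)/...] ...' into (word, list of readings)."""
    word, rest = line.split(None, 1)
    readings = []
    for tag, tag_moc, senses in TAG_RE.findall(rest):
        sense_moc = {int(n): float(m) for n, m in SENSE_RE.findall(senses)}
        readings.append({"pos": tag, "moc": float(tag_moc),
                         "senses": sense_moc})
    return word, readings

word, readings = parse_alp_line(
    "found VBD(0.8)[#1(0.4)/#2(0.5)/#3(0.1)] ADJ(0.1)[#1(1.0)] NN(0.1)[#1(1.0)]"
)
```

Each reading keeps both the part-of-speech MOC and the per-sense MOC values, so nothing from the ambiguous analysis is discarded before indexing.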
Additionally, the ALP module 120 generates optional document summaries 122, which are used when search results are returned to the users. The document summaries 122 can be simply the textual portions of the original documents, or condensed versions of the documents like an abstract or synopsis. The document summaries 122 may be presented to the user adjacent to each document identified in a search result list.
The outputs of the ALP module 120, along with the associated MOC values, are processed by an indexer 124 to generate a reverse index (or indices) 126. This process is illustrated in greater detail below. The reverse index 126 can be continually updated as documents are added and/or updated. For example, crawlers or bots may continually or regularly retrieve documents so that the reverse index 126 contains up-to-date entries.
Still referring to
Most current search engines simply ignore these ambiguities and treat them as keywords, or use heuristics to make initial guesses as to what the user intended. In contrast, the system and method of the present invention improves upon this process by disambiguating the query 128 using a query processor 130 which is described in more detail below.
The output of the query processor 130 is a list of documents containing the query terms. Additionally, a ranked list of possible interpretations of the user's ambiguous query 128 is produced, the first of which is considered the most plausible. The output from the query processor 130 is then sent to the results processor 132, which then ranks the list of documents by their relevance. The search results are then combined, formatted, and ultimately displayed to the user 134 via a monitor or the like.
Traditionally, when an error in language analysis is made, it causes a document to be permanently indexed by a search engine using the incorrect meaning. This problem is further exacerbated when highly ambiguous queries 128 are submitted. Consequently, conventional search engines return fewer relevant documents to the user. Moreover, the documents that are returned may contain irrelevant or the wrong content.
In contrast, the present system and method for information retrieval 100 maintains multiple interpretations and associates each with a confidence measure (e.g., MOC value). This is done for both the documents being searched as well as the user's query 128. With reference to
The goal of this process 138 is to choose the most confident meanings of query words that are contained in documents. This is an advancement over vector-based or ontology-based retrieval methods in that query disambiguation is based on the documents being searched, rather than a predefined computation of semantic similarity. Consequently, the system and method described herein is a dynamic method of disambiguation by mapping queries 128 to their meanings based on the ever-changing content of the document collection. This is an improvement over conventional approaches, where query disambiguation, if done at all, is done based on static methods for calculating similarity, regardless of the document collection.
A second task of the confidence intersection process 138 is to obtain a measure of document relevancy to the query 128. The MOC score for each document computed during confidence intersection process 138 is the system's certainty about the documents containing the correct meanings of the query words. By sorting on the document confidence scores, documents most similar to the disambiguated query are ranked higher on the results list, whereas less likely and possibly erroneous interpretations are placed lower on the list.
Still referring to
There are three types of suggestions the present system and method offers to the users. The first is based on phrasal ambiguities. Assume, for example, that the user's query is “special interest stall drives” (without the quotes). This is an ambiguous query with multiple interpretations. A safe and default action is simply to search for documents containing all four query terms. However, it is most likely that the user is searching for “‘special interest’ ‘stall’ ‘drives’”, i.e., “special interest” is meant as a compound noun phrase. In a scenario such as this, the suggestion module 140 would produce less ambiguous queries to help users refine their searches. In this example, the suggestion module 140 would produce the following four interpretations by adding quotes around phrases:
(1) “special interests” stall drives
(2) “special interests stall” drives
(3) special interests “stall drives”
(4) special “interests stall drives”
These four phrasal suggestions are generated by looking up known phrases that are composed of consecutive query terms. These known phrases are automatically identified by a chunker that is part of the ALP module 120 as the ALP module 120 processes the document collection. Additionally, the potential suggestions may be weighted by their usage frequency, identifying the most likely phrase as “special interests” in this example. This look-up procedure is done efficiently using dynamic programming techniques which are known to those skilled in the art. In the above example, the other alternatives make little sense. However, since the suggestions are weighted by their usage frequency, the less useful suggestions are ranked lower. In one embodiment, the less frequent alternatives may be disposed of entirely and not presented to the users.
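The phrase look-up above can be sketched as follows. The phrase table and usage counts are hypothetical, and a simple scan over all consecutive term runs stands in for the dynamic-programming procedure mentioned above.

```python
# Sketch of the phrasal-suggestion look-up: every run of consecutive query
# terms matching a known phrase is quoted, and candidates are weighted by
# usage frequency. The phrase table and counts are hypothetical.
def phrasal_suggestions(terms, phrase_freq):
    suggestions = []
    for i in range(len(terms)):
        for j in range(i + 2, len(terms) + 1):  # phrases of 2+ words
            phrase = " ".join(terms[i:j])
            if phrase in phrase_freq:
                quoted = terms[:i] + ['"%s"' % phrase] + terms[j:]
                suggestions.append((phrase_freq[phrase], " ".join(quoted)))
    suggestions.sort(reverse=True)  # most frequent known phrase first
    return [s for _, s in suggestions]

phrase_freq = {"special interest": 9000, "stall drives": 40}
out = phrasal_suggestions(["special", "interest", "stall", "drives"],
                          phrase_freq)
```

With these illustrative counts, the compound-noun reading ‘"special interest" stall drives’ ranks first, mirroring the weighting behavior described above.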
A second type of suggestion addresses part-of-speech (POS) ambiguity. For example, “drives” can be a noun, as in “floppy drives,” or a verb, as in “Jane drives.” The present invention provides suggestions, exactly as in this example, to distinguish this ambiguity. Specifically, a noun can be expanded into a noun phrase, a noun-verb, a verb-noun, or an adjective-noun pair. Likewise, a verb can be expanded into a noun-verb, verb-noun, adverb-verb, or verb-adverb pair. Similarly, an adjective can be expanded into an adjective-noun or a noun-is-adjective pair. Lastly, adverbs can be expanded into an adverb-verb, verb-adverb, or adverb-adjective pair.
These POS suggestion pairs are generated and weighted by their usage frequency within the documents, compiled by the indexer 124. Therefore, these suggestions are not predetermined by a dictionary or database. Rather, they are dynamically generated and updated with the content of the documents. As a result, pairings with increasing popularity, as well as archaic or technical terms, are automatically incorporated.
The third type of suggestion is based on word sense ambiguity. This is the most challenging method for automatic suggestion. While synonym lists can be used, they are often long and can become laborious for the users to read. One possibility is to associate a short phrase with each unique concept within a lexicon, such as “financial bank,” “river bank,” and “racetrack bank” for these three senses of “bank.” The drawback of this method is the manual efforts needed to create and update these phrases and concepts.
Still another option is to use the definitions and/or example sentences from dictionary glossaries, which is a less labor intensive approach. However, this would also demand more from the user in reading the definitions. Also, they are less compositional if the queries contain multiple ambiguous words. Ultimately the decision is made by the system builder choosing a tradeoff between these intertwined parameters.
One additional function of the query refinement suggestion module 140 is to generate conceptually similar search queries. This is especially useful when users are searching conceptually or are unsure of the exact vocabulary. Two such methods for automatically generating relevant suggestions are presented, both centered on the disambiguated queries. This is an improvement over current suggestion methods, which are simply based on collocations of keywords. Collocations are generally unreliable since they are based on “shallow” linguistic features, in that suggestions are based on words that frequently occur next to each other, whether they are conceptually relevant or not. Even with ad hoc heuristics to extract more informative collocations, they are still not semantically disambiguated. Therefore, collocations such as “downloadable driver” and “driver education” are both suggested even though the users are unlikely to be searching for both meanings of “driver.”
The advantage of having the queries disambiguated is the semantic context they provide; for example, a “computer driver” query would not produce suggestions about operating cars. Eliminating the noise from the suggestions based on semantic similarity is important to their usefulness. The suggestions are first compiled into a database during the indexing step in preparing the reverse indices 126, where the disambiguated phrases produced by the chunking step of the ALP module 120 are saved to a database, along with their usage frequencies. For making suggestions, phrases that appear in the list of result documents are tallied and weighted by their usage frequencies.
The suggestions can be ranked based on their frequencies alone, or further refined based on their semantic similarities to the query. One approach is to use semantic distance as a measure of semantic similarity. This is typically computed over an ontology in which concepts are connected in a hierarchy; the semantic distance between two concepts is the number of “hops,” or degrees of separation, between them. Suggestions refined in this way are focused more on semantic relevance and less on usage frequency. One downside to this approach is the added complexity and computation. The ultimate tradeoff between complexity and resource utilization on one hand and relevance of search results on the other is left to the system builder.
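The hop-counting and frequency-weighting steps above can be sketched as follows. This is a minimal illustration only, not the patented implementation: the toy ontology, the `PARENT` table, and the `freq / (1 + distance)` weighting are all hypothetical choices made here for concreteness.

```python
from collections import deque

# Hypothetical toy ontology: child -> parent links form a concept hierarchy.
PARENT = {
    "sports_car": "car", "car": "vehicle", "truck": "vehicle",
    "vehicle": "entity", "software": "entity",
}

def semantic_distance(a, b):
    """Number of hops between two concepts, treating parent links as undirected edges."""
    edges = {}
    for child, parent in PARENT.items():
        edges.setdefault(child, set()).add(parent)
        edges.setdefault(parent, set()).add(child)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")  # concepts not connected in the hierarchy

def rank_suggestions(query_concept, suggestions):
    """suggestions: list of (phrase, concept, usage_frequency).
    Discount each suggestion's frequency by its distance from the query concept."""
    scored = [
        (phrase, freq / (1 + semantic_distance(query_concept, concept)))
        for phrase, concept, freq in suggestions
    ]
    return sorted(scored, key=lambda x: -x[1])
```

With this toy hierarchy, “car” is two hops from “truck” (via “vehicle”) but three hops from “software,” so car-related suggestions are boosted relative to equally frequent software-related ones.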
One metric is based on semantic relatedness 146, a concept introduced earlier for ranking suggestions. This improves the results by grouping and boosting or promoting documents that are more semantically similar to the query. That is, the semantic closeness of the entire document to the query is computed via semantic distance. This can be computed efficiently by pre-computing the distance between every pair of concepts within an ontology and saving the distances into a database 148. With the database 148, the semantic similarity of a document to the disambiguated query is computed by looking up the pair-wise values between concepts within the document and the query terms. It is important to note that the disambiguated query 142 is essential to this step because semantic similarity cannot be calculated without it. While semantic distance has been described as a preferred method of determining semantic relatedness, other measures of similarity can be used provided they can be computed efficiently.
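The pre-compute-then-look-up scheme described above might be sketched as follows. The in-memory dictionary standing in for database 148, and the averaging of inverse distances into a single document score, are assumptions of this sketch rather than details from the specification.

```python
import itertools

def precompute_distances(concepts, distance_fn):
    """Build the pair-wise concept distance table (standing in for database 148)."""
    table = {}
    for a, b in itertools.combinations(concepts, 2):
        d = distance_fn(a, b)
        table[(a, b)] = table[(b, a)] = d
    return table

def document_similarity(doc_concepts, query_concepts, table):
    """Score a document against the disambiguated query by averaging
    inverse distances looked up from the pre-computed table."""
    if not doc_concepts or not query_concepts:
        return 0.0
    total = 0.0
    for d in doc_concepts:
        for q in query_concepts:
            dist = 0 if d == q else table.get((d, q), float("inf"))
            total += 1.0 / (1.0 + dist)
    return total / (len(doc_concepts) * len(query_concepts))
```

At query time no graph traversal is needed: scoring a document reduces to constant-time lookups per concept pair, which is what makes the semantic-relatedness metric practical at scale.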
Other metrics for ranking the documents are common to current search engines and may be implemented in the present system and method. These may include one or more of term frequency, text formatting, text positioning, document interlinking, document freshness, and others. These metrics are compiled and stored in a database of document attributes 150 during pre-processing. A weight may be assigned to each metric to gauge its importance; the weights may be chosen or altered by the system builder.
Still referring to
Lastly, in one embodiment of the invention, optional suggestions generated by the query refinement suggestion module 140 are incorporated by a results formatter 154 to compose the final formatting of the results page for the user. Options for the formatting include HTML, XML, and the like, depending on user preferences and the application. The formatted result page is then returned to the user for display 134.
The information retrieval system 100 overcomes this shortcoming by inferring what the user is searching for conceptually. However, due to the limited context, reliably disambiguating the query 128 is difficult. While most would assume that the user is searching for something akin to “my car motor stops,” such assumptions can often be wrong and lead to irrelevant results. In this example, minimal assumptions are made during the disambiguation step 142 by the ALP module 120, such that an equal likelihood is given that “stalls” means “delay or stop” (a noun sense) or “bring to a standstill” (a verb sense). This can be seen by the equal MOC values 204 (0.4 in this case) associated with the query 128. This output constitutes the initial query disambiguation 142 and is further refined as described below.
The same process may be undertaken with respect to the query word “engine.” In one aspect of the invention, the sense or meaning of “engine” may be determined independently of the sense or meaning of “stall.” This may be preferred, for example, if efficiency is a concern. In another aspect of the invention, query ambiguity is resolved across multiple query terms.
For each of the three documents, a permutation of the different meanings of the query words is generated to determine the combined likelihood of that particular meaning combination being used within the document. In this step, the query words influence each other because the system examines which senses are most likely to co-occur within the same set of documents. In doing so, the query terms do not have to be semantically similar to each other, as was necessary in previous methods that rely on the query terms alone. Instead, the information retrieval system 100 looks for the most commonly used senses of the query terms within the documents containing them. Therefore, the present invention leverages the content of the documents to automatically disambiguate the senses of query terms.
The final step 218 in automatically disambiguating the query 128 is to select the maximal sense combination across all three documents 200a, 200b, 200c, which in this example is the first sense for “engine” and the third sense for “stall.” If further refinement is desired, an optional semantic similarity processing step 220 between each sense combination can be added as a measure of semantic plausibility. The result is an automatic, efficient and accurate method to disambiguate the users' queries 128.
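The permutation-and-selection procedure of steps described above can be sketched as follows. The dictionary shapes here (per-word sense lists, per-document sense likelihoods) and the product-then-sum scoring rule are hypothetical simplifications chosen for illustration, not the claimed method itself.

```python
import itertools

def disambiguate(query_senses, doc_sense_likelihoods):
    """
    query_senses: {word: [sense, ...]}
    doc_sense_likelihoods: one dict per result document, mapping
        (word, sense) -> likelihood that the document uses that sense.
    Enumerate every sense combination, score its combined likelihood in
    each document, and return the combination maximal across all documents.
    """
    words = list(query_senses)
    best_combo, best_score = None, -1.0
    for combo in itertools.product(*(query_senses[w] for w in words)):
        score = 0.0
        for doc in doc_sense_likelihoods:
            # combined likelihood of this meaning combination within one document
            joint = 1.0
            for w, s in zip(words, combo):
                joint *= doc.get((w, s), 0.0)
            score += joint
        if score > best_score:
            best_combo, best_score = dict(zip(words, combo)), score
    return best_combo
```

For the “engine stalls” example, a document strongly evidencing the motor sense of “engine” and the stopping sense of “stall” pushes that joint combination above alternatives such as “economic engine delayed,” without requiring the two query words to be similar to each other.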
Another suggestion method generates 224 related concepts 226 such as “prevent engine knocks” and “fuel cleaners.” These suggestions may be based on linguistically accurate meanings that were collected and stored in a language database 148.
Referring to
As stated earlier, the relevancy of the documents 200a-200c is computed based on the automatically disambiguated query terms. However, this automated process is not infallible. Therefore, in one aspect of the invention, search results are displayed along with alternate query interpretations 222 and suggested related concepts 226. In this example, the automatically determined interpretation is “car engine stops,” which is shown at the top of the list as a reference for the user. Likely alternate interpretations are provided below as links that encode the exact meanings of these alternates. For example, if the user chooses the alternate meaning of “economic engine delayed,” query disambiguation need not be repeated (such processing having already occurred). Instead, search results are re-scored and ranked such that documents containing the “economic engine” meaning are presented first. In this example, document 200a (Document #72) would then be ranked highest. In addition, as seen in
The embodiment illustrated in
In this case, the similarity computation measures how similar each document is to the removed document. For example, if the user excluded a “driver” listing for computer software, the similarity measurement would be made between each document in the list and “computer software.” The relevance of the results is then adjusted to be inversely proportional to this similarity, since the user indicated his or her disinterest in documents pertaining to computer software. Thus, the results are re-ranked so that documents about software are demoted or removed entirely, while more relevant documents, such as ones about car drivers, replace them. Therefore, by a simple click of the mouse, the user removes not only the irrelevant document (e.g., document 406), but also those similar to it. In this way, the invention allows users to make their search results more relevant, intuitively and with minimal effort.
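The inverse-proportional re-ranking described above might look like the following sketch. The tuple layout of `results`, the `1 - similarity` demotion factor, and the `drop_threshold` for removing near-duplicates are assumptions of this illustration, not specifics from the disclosure.

```python
def rerank_after_exclusion(results, excluded_concept, similarity_fn, drop_threshold=0.9):
    """
    results: list of (doc_id, relevance, concept).
    Demote each document in proportion to its similarity (in [0, 1]) to the
    excluded document's concept; drop near-duplicates of the excluded topic.
    """
    reranked = []
    for doc_id, relevance, concept in results:
        sim = similarity_fn(concept, excluded_concept)
        if sim >= drop_threshold:
            continue  # essentially the same topic as the excluded listing
        # relevance adjusted inversely to similarity with the exclusion
        reranked.append((doc_id, relevance * (1.0 - sim), concept))
    return sorted(reranked, key=lambda r: -r[1])
```

A single exclusion click thus propagates: the software “driver” page is removed, other software pages sink, and car-driver pages rise without any query reformulation by the user.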
The effectiveness of the re-ranking lies in computing the similarity measure. There are various methods for this computation, such as probabilistic classification, semantic similarity, neural networks, and vector-based clustering. The particular method of similarity determination can vary. For example, the method can be trained via positive or negative evidence, and a similarity value can be computed given new data. In one aspect of the invention, the positive evidence is composed of the documents that the user did not exclude. That is, the documents that a user is interested in are determined implicitly, as the inverse of those the user excluded explicitly. The positive evidence can also be gathered explicitly by user preferences (as explained below with respect to
Given a collection of positive and negative evidence, a probabilistic classifier, for example, can be trained to compute the probability of exclusion given the context from the documents, i.e., P(exclude=true|<context>). Once trained, the classifier can then compute this probability for each document in the results, which is the likelihood of it being similar to the set of excluded documents. This probability is then factored into the inverse re-ranking process described above.
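A minimal sketch of such a classifier follows, assuming a bag-of-words context and Naive Bayes with Laplace smoothing; the class name `ExclusionClassifier` and its interface are hypothetical, and a production system would use a more capable model.

```python
import math
from collections import Counter

class ExclusionClassifier:
    """Toy Naive Bayes estimating P(exclude=true | <context>) from word evidence."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}

    def train(self, words, excluded):
        """Record one document's words as positive (excluded) or negative evidence."""
        self.counts[excluded].update(words)
        self.docs[excluded] += 1

    def prob_exclude(self, words):
        """Return P(exclude=true | words) under the Naive Bayes assumption."""
        total = self.docs[True] + self.docs[False]
        vocab = len(set(self.counts[True]) | set(self.counts[False]))
        logp = {}
        for label in (True, False):
            lp = math.log((self.docs[label] + 1) / (total + 2))  # smoothed prior
            n = sum(self.counts[label].values())
            for w in words:
                lp += math.log((self.counts[label][w] + 1) / (n + vocab))
            logp[label] = lp
        # normalize the two log-likelihoods into P(exclude=true | context)
        m = max(logp.values())
        e = {k: math.exp(v - m) for k, v in logp.items()}
        return e[True] / (e[True] + e[False])
```

Once trained on the excluded documents (negative evidence) and the retained ones (implicit positive evidence), the returned probability is the factor folded into the inverse re-ranking step described above.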
Another possibility is to use semantic similarity to measure the likeness of two documents. For example, a race car driver is semantically closer to a truck driver than to computer software. Conversely, a software driver is semantically closer to an electronic circuit driver than to vehicle operators. The most common method for comparing semantic similarity is via an ontology, in which concepts are organized in a hierarchy and grouped into semantically similar concepts. To determine the similarity between two concepts, one can simply use the semantic distance between them, i.e., the number of hops or degrees of separation between related concepts. Optionally, the semantic distance may be augmented or modified with semantic density and probabilistic weighting.
Semantic similarity is attractive because it is more intuitive and can be more efficient. However, the challenge lies in first categorizing each document into a concept inside an ontology, such as by using a probabilistic classifier to compute the probability of a category given the document context, P(category|&lt;context&gt;). That is, each document is first mapped into a “conceptual” space. In serving a user's exclusion request, therefore, measuring the similarity between the result documents and the training set reduces to fast look-ups of similarity between the concepts into which they are mapped.
With respect to the user-based preferences embodiment, the ontology-based approach to determining similarity is amenable to such user customization, allowing each user to specify their interest in particular concepts, such as computers versus sports versus shopping. This information can be used to rank result relevance without any explicit user input (i.e., exclusion) by comparing each search result to the user's profile. Upon explicit feedback from the user, the results can be further tailored to the needs of the user.
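The profile-based ranking just described can be sketched as a weighted comparison of each result's concept against the user's declared interests. The profile dictionary, the averaging scheme, and the pluggable `concept_similarity` function are all illustrative assumptions, not elements recited in the specification.

```python
def profile_score(result_concept, user_profile, concept_similarity):
    """
    user_profile: {concept: interest weight in [0, 1]},
        e.g. {"computers": 0.9, "sports": 0.2}.
    Score a result by its similarity to the user's declared interests,
    requiring no explicit exclusion input at query time.
    """
    if not user_profile:
        return 0.0
    return sum(
        weight * concept_similarity(result_concept, concept)
        for concept, weight in user_profile.items()
    ) / len(user_profile)
```

Results can be sorted by this score as a personalization prior, then further re-ranked as explicit exclusion feedback arrives, combining both forms of evidence described above.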
While embodiments of the present invention have been shown and described, various modifications may be made without departing from the scope of the present invention. The invention, therefore, should not be limited except as by the following claims and their equivalents.
This Application claims priority to U.S. Provisional Patent Application No. 60/671,396 filed on Apr. 14, 2005. U.S. Provisional Patent Application No. 60/671,396 is incorporated by reference as if set forth fully herein.
Filing Document | Filing Date | Country | Kind | 371c Date
---|---|---|---|---
PCT/US2006/014358 | 4/13/2006 | WO | 00 | 10/10/2007
Number | Date | Country
---|---|---
60671396 | Apr 2005 | US