The present invention relates to search result snippet analysis, and in particular to search result snippet analysis for query expansion and result filtering.
The Internet (Web) has become a store of information on virtually every conceivable topic. The easy accessibility of such vast amounts of information is unprecedented. In the past, someone seeking even the most basic information related to a topic was required to refer to a book or visit a library, spending many hours without a guarantee of success. However, with the advent of computers and the Internet, an individual can obtain virtually any information within a few clicks of a keyboard.
A consumer electronics (CE) device in a network can be enriched by enabling the device to seamlessly obtain related information from the Internet while the user enjoys the content available at home. However, at times, finding the right piece of information from the Internet can be difficult. The complexity of natural language, with characteristics such as polysemy, makes retrieving the proper information a non-trivial task. The same word, when used in different contexts, can imply completely different meanings. For example, the word “sting” may mean a bee sting when used in entomology, an undercover operation in a spy novel, and the name of an artist when used in a musical context. In the absence of any information about the context, it is difficult to obtain the proper results.
Further, querying a search engine not only requires entering keywords using a keyboard, but typically requires several iterations of refinement before the desired results are obtained. Forming a good query requires the user to have at least some knowledge about the context of the information needed, as well as the ability to translate that knowledge into appropriate words in a query.
Conventional approaches to finding concepts that are related to a query can be classified into two categories: (1) search result categorization and (2) query expansion. In search result categorization the results returned by a search engine in response to a query are categorized into different subtopics by using a clustering method. Naive Bayes Classifier, Hierarchical Clustering and Suffix Tree Clustering are some of the methods used for such clustering. However, such categorization techniques are computationally expensive and require entire documents to be clustered in order to obtain a good approximation of their themes. This is difficult to achieve in CE devices (e.g., TV, DVR, cell phone, PDA, MP3 player) because of their inherent constraints on hardware space. Further, the time required to fetch the documents and process them makes such techniques infeasible for real-time use. Recent research shows that snippets returned by a search engine can be used instead of documents, without considerable decrease in the precision of clustering. However, irrespective of whether snippets or documents themselves are used, the clusters formed by these approaches are not very precise.
In query expansion, instead of clustering the received search results, the search result content is analyzed to determine and recommend concepts that are related to, and more specific instances of, the original query. For example, if the original query is “Canada,” the recommended topics might be “Canada Map,” “Canada Language,” or “Canada Geography.” However, typically, entire documents are processed to arrive at a set of related topics. As above, fetching and analyzing entire documents is an expensive process, both in terms of time and space. On a PC with considerable processing power and storage capacity, this may be a conceivable approach, but not on a resource-constrained device such as a CE device in a local network such as a home network.
Further, searching for a specific topic on a large network such as the Internet typically requires multiple iterations of manually entering a search query and refining it depending upon the relevance of the results returned. This also requires the user to be skilled in the techniques for forming queries. The difficulty is exacerbated on a CE device where the user's involvement in the process has to be minimized so as to let the user enjoy the content rather than worry about forming proper queries. There is, therefore, a need for a method and system that provides search result snippet analysis for query expansion and result filtering.
The present invention provides a method and system that enable search result snippet analysis for query expansion and result filtering. Further, a technique for post processing search result snippets is provided to suggest topics for further search and extracting terms related to the search topic for later use.
In one embodiment this involves query formation and search result snippet analysis for query expansion and result filtering. Further, post processing of snippets enables suggesting topics for further searching and extracting terms related to the search topic for later use.
Such a search and analysis process further allows extraction of most relevant information from resources for user viewing and selection. This is performed by suggesting topics relevant to the original query and receiving user selections for query modification and further searching.
In one embodiment, such searching and analysis is implemented in a CE device that can be connected to a local network. The searching and analysis requires minimal user involvement, can be performed in an online fashion (i.e., in real-time) and requires small memory and processing power. The present invention further enables extracting, and presenting to the user, subtopics related to the original query, in a way that is practical to perform in real-time on a CE device. Such an extraction and presentation method is not expensive in terms of the amount of memory space required and does not require the user to guide the process.
In one example, an initial query is formed based on local metadata sources and a user's current activity. The query is sent to a search engine for searching and returning snippets. The returned snippets are then indexed, and analyzed for identifying and extracting any relevant information therefrom. The extracted information is used for query expansion by forming a set of subtopics of the original query, which can be presented to the user and/or searched further.
These and other features, aspects and advantages of the present invention will become understood with reference to the following description, appended claims and accompanying figures.
In one example implementation of the present invention, an initial query is formed based on local metadata sources in a local network and a user's current activity in the network (e.g., playing a CD). The query is provided to a search engine for searching and returning snippets. The returned snippets are then indexed and analyzed for identifying and extracting relevant information (including specific terms) therefrom. The extracted information is used for query expansion by forming a set of subtopics of the original query, which can be presented to the user and/or searched further. The snippets further allow identifying terms that are relevant to the original query. The identified terms can be stored locally and used later as additional contextual terms for refining a query for forming a new query.
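The query expansion step described above can be illustrated with a minimal Python sketch. It assumes that a subtopic is formed simply by appending an extracted term to the original query; the function name `expand_query` and the `max_subtopics` parameter are hypothetical and are not taken from the description.

```python
def expand_query(original_query, extracted_terms, max_subtopics=5):
    """Form subtopic queries by appending each extracted term to the query.

    Assumption for illustration: duplicates (case-insensitive) and terms
    identical to the original query are skipped.
    """
    base = original_query.strip()
    seen = set()
    subtopics = []
    for term in extracted_terms:
        cleaned = term.strip()
        key = cleaned.lower()
        if not cleaned or key == base.lower() or key in seen:
            continue
        seen.add(key)
        subtopics.append(f"{base} {cleaned}")
        if len(subtopics) == max_subtopics:
            break
    return subtopics


if __name__ == "__main__":
    # Terms as they might be extracted from snippets returned for "Canada".
    for subtopic in expand_query("Canada", ["Map", "Language", "Geography"]):
        print(subtopic)
```

The resulting subtopics could be presented to the user for selection or submitted as new queries; terms that are not used immediately could be stored locally as contextual terms for later queries.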
As used herein, a snippet comprises a piece of information (i.e., text) that is returned as a part of the search results by a typical search engine. A snippet includes short bits of a web page. For example, if a search is for “Afghanistan” on Google, the first search result (www.afghan-web.com) has the following snippet: “Afghanistan Online provides updated news and information on Afghan culture, history, politics, society, languages, sports, publications, communities, . . . .”
The devices 20 and 30, respectively, can implement the UPnP protocol for communication therebetween. Those skilled in the art will recognize that the present invention is useful with other network communication protocols such as JINI, HAVi, 1394, etc. The network 10 can comprise a wireless network, a wired network, or a combination thereof.
Search result snippet analysis includes extracting relevant concepts from search results (snippets) and presenting them to the user.
Example scenarios are now described for better understanding of the present invention.
This example scenario describes how the present invention can be used to enrich a user's TV viewing experience by enabling her to find more interesting information about the current content from a resource (e.g., the Internet). The TV is connected to the user's home network, and implements snippet analysis for query expansion and result filtering according to the present invention. An example viewing session on the TV is conducted by the user as follows:
This example scenario describes how the present invention can be used to extract contextual words relevant to a topic, which can be stored and used later for query formation. Said topic can be a topic selected by the user from topics that are relevant to current content being viewed on a content player connected to a home network. The content player implements snippet analysis for query expansion and result filtering according to the present invention. An example listening session on the content player is conducted by the user as follows:
The system 300 utilizes the following components: Broadcast Unstructured Data Sources (e.g., subtitles, closed captions) 301, a Local Metadata Cache 303, Local Content Sources 307, Application States 309, a Broadcast Data Extractor and Analyzer 306, a Local Contextual Information Gatherer 302, a Contextual Information Deriver 304, a Client User Interface (UI) 310, a Correlation Framework 305, an Internet Metadata Gatherer from Structured Sources 318, Internet Structured Data Sources (e.g., CDDB) 320, a query 322, a Search Engine Interface 324, web pages 326, a Snippet Analyzer 328, and Internet Unstructured Data Sources (e.g., web pages) 330. The function of each component is further described below.
The Broadcast Unstructured Data Sources 301 comprises unstructured data embedded in media streams. Examples of such data sources include cable receivers, satellite receivers, TV antennas, radio antennas, etc.
The Local Contextual Information Gatherer (LCIG) 302 collects metadata and other contextual information about the contents in the local network. The LCIG 302 also derives additional contextual information from existing contextual information. The LCIG 302 further performs one or more of the following functions: (1) gathering metadata from local sources whenever new content is added to the local content/collection, (2) gathering information about a user's current activity from the states of applications running on the local network devices (e.g., devices 20, 30 in
The LCIG 302 includes a Contextual Information Deriver (CID) 304 which, as discussed above, derives new contextual information from existing information. For this purpose, the CID 304 uses a local taxonomy of metadata related concepts. An example of such taxonomy is discussed in relation to
The LCIG 302 further maintains a local metadata cache 303, and stores the collected metadata in the cache 303. The cache 303 provides an interface for other system components to add, delete, access, and modify the metadata in the cache 303. For example, the cache 303 provides an interface for the CID 304, Local Content Sources 307, Internet Metadata Gatherer from Structured Sources 318, Broadcast Data Extractor and Analyzer 306, Document Theme Extractor 308 and Snippet Analyzer 328, etc., for extracting metadata from local or external sources.
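The add, delete, access, and modify interface of the cache can be pictured with the following Python sketch. The class name `LocalMetadataCache`, its method names, and the use of a content identifier as the key are assumptions made for illustration; the description does not specify how the cache is organized.

```python
class LocalMetadataCache:
    """Hypothetical in-memory cache keyed by a content identifier."""

    def __init__(self):
        self._entries = {}

    def add(self, content_id, metadata):
        # Store a copy so later changes by the caller do not alter the cache.
        self._entries[content_id] = dict(metadata)

    def access(self, content_id):
        # Return a copy; an unknown identifier yields an empty record.
        return dict(self._entries.get(content_id, {}))

    def modify(self, content_id, **fields):
        if content_id not in self._entries:
            raise KeyError(content_id)
        self._entries[content_id].update(fields)

    def delete(self, content_id):
        # Deleting an absent entry is treated as a no-op.
        self._entries.pop(content_id, None)


if __name__ == "__main__":
    cache = LocalMetadataCache()
    cache.add("cd-001", {"artist": "Sting", "type": "music"})
    cache.modify("cd-001", genre="rock")
    print(cache.access("cd-001"))
```

Any component that gathers metadata, such as the CID 304 or the Snippet Analyzer 328, could call the same small set of operations.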
The Broadcast Data Extractor and Analyzer (BDEA) 306 receives contextual information from the Correlation Framework (CF) 305 described further below, and uses that information to guide the extraction of a list of terms from data embedded in the broadcast content. The BDEA 306 then returns the list of terms back to the CF 305.
The Local Content Sources 307 includes information about the digital content stored in the local network (e.g., on CD's, DVD's, tapes, internal hard disks, removable storage devices).
The Local Application States 309 includes information about the current user activity using one or more devices 20 or 30 (e.g., the user is listening to music using a DTV).
The client UI 310 provides an interface for user interaction with the system 300. The UI 310 maps user interface functions to a small number of keys, receives user input from the selected keys and passes the input to the CF 305 in a pre-defined form. Further, the UI 310 displays the results from the CF 305 when instructed to by the CF 305. An implementation of the UI 310 includes a module that receives signals from a remote control and a web browser that overlays on a TV screen.
The Metadata Gatherer from Structured Sources 318 gathers metadata about local content from the Internet Structured Data Sources 320. The Internet Structured Data Sources 320 includes data with semantics that are closely defined. Examples of such sources include Internet servers that host XML data enclosed by semantic-defining tags, Internet database servers such as CDDB, etc.
The query 322 is a type of encapsulation of the information desired, and is searched for, such as on the Internet. The query 322 is formed by the CF 305 from the information and metadata gathered from the local and/or external network.
The Search Engine Interface (SEI) 324 inputs a query 322 and transmits it to one or more search engines over the Internet, using a pre-defined Internet communication protocol such as HTTP. The SEI 324 also receives the response to the query from said search engines, and passes the response (i.e., search results) to the component or device that issued the query.
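The role of the SEI 324 can be sketched in Python as follows. The endpoint URL, the parameter names `q` and `n`, and the injected `transport` callable are all hypothetical; real search engines define their own request formats, and the sketch does not model any particular one.

```python
from urllib.parse import urlencode


def build_search_url(endpoint, query, max_results=10):
    """Encode the query as HTTP GET parameters (parameter names are assumed)."""
    return f"{endpoint}?{urlencode({'q': query, 'n': max_results})}"


class SearchEngineInterface:
    """Sends a query to each configured engine and hands back each response."""

    def __init__(self, endpoints, transport):
        self.endpoints = list(endpoints)
        # transport: callable(url) -> response text; a real HTTP client
        # would be supplied here, a stub is enough for illustration.
        self.transport = transport

    def submit(self, query, on_results):
        for endpoint in self.endpoints:
            url = build_search_url(endpoint, query)
            # Pass the response to the component that issued the query.
            on_results(endpoint, self.transport(url))


if __name__ == "__main__":
    sei = SearchEngineInterface(
        ["https://search.example/find"],
        transport=lambda url: f"(stub response for {url})",
    )
    sei.submit("sting artist", lambda endpoint, response: print(response))
```

Injecting the transport keeps the sketch independent of any specific HTTP library and lets the issuing component receive results through a callback.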
The Web Pages 326 comprise any web pages on the Internet that are returned as a result of a query. In one example, when a query is sent to a search engine, the search engine returns a list of URLs that are relevant to that query. For each relevant URL, most search engines also return a small piece of text, such as a snippet, from the corresponding web page. The main purpose of the snippets is to provide the user with a brief overview of what the web page is about. The snippet is either taken from the web page itself or from the meta tags of the web page. Different search engines have different techniques for generating these snippets.
The Snippet Analyzer 328 inputs the search results and a query from the CF 305. The Snippet Analyzer 328 then analyzes snippets from the search results and extracts from the snippets terms that are relevant to the query. The extracted terms are provided to the CF 305.
The Internet Unstructured Data Sources 330 includes data or data segments with semantics that cannot be analyzed (e.g., free text). Internet servers that host web pages typically contain this type of data.
The CF 305 orchestrates search result snippet analysis for query expansion and result filtering, by performing the following steps:
The CF 305 can comprise: a Query Execution Planner (not shown) that provides a plan that carries out a user request, a Correlation Plan Executor (not shown) that executes the plan by orchestrating actions and correlating the results so as to deliver better results to the user, and a Correlation Constructor (not shown) that either works with the Query Execution Planner to form the plan through correlating data gathered from external sources and the data gathered from home, or forms the plan automatically through the correlation.
In the example shown in
The example functional block diagram in
The SA 328 further includes an optional Stemmer 404 that stems the snippets so that different words having the same stem are treated as one word. In one example, the Stemmer 404 stems both “continuously” and “continuing” to “continue.” In another embodiment, the snippet text is not stemmed. The SA 328 further includes an Indexer 406 that indexes the processed (cleaned) snippets, and thus creates an index (list) of terms 412 from the snippets. Then, for each term, the Indexer 406 stores the following information in the index 412: (1) the snippets in which the term occurs, (2) the number of times it occurs, and (3) its location in each snippet. Using this information, the Indexer 406 then calculates the weight of each term using a TF-IDF type score.
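As a rough illustration of the indexing step, the following Python sketch records, for each term, the snippets in which it occurs and its positions in each snippet, and then computes a weight. The tokenizer, the function names `build_index` and `term_weights`, and the specific weight formula (total occurrence count multiplied by log(1 + N/df), where N is the number of snippets and df the number of snippets containing the term) are assumptions; the description only calls for a TF-IDF type score. The optional stemmer is modeled as a caller-supplied function.

```python
import math
import re
from collections import defaultdict


def build_index(snippets, stem=None):
    """Map each term to {snippet_id: [positions]}.

    The occurrence count of a term is the total number of recorded
    positions. `stem` is an optional callable applied to every token.
    """
    index = defaultdict(lambda: defaultdict(list))
    for snippet_id, text in enumerate(snippets):
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for position, token in enumerate(tokens):
            term = stem(token) if stem else token
            index[term][snippet_id].append(position)
    return index


def term_weights(index, num_snippets):
    """Assumed TF-IDF type score: tf * log(1 + N / df)."""
    weights = {}
    for term, postings in index.items():
        tf = sum(len(positions) for positions in postings.values())
        df = len(postings)
        weights[term] = tf * math.log(1 + num_snippets / df)
    return weights


if __name__ == "__main__":
    snippets = ["Sting tour dates", "Sting album and tour", "bee sting remedy"]
    index = build_index(snippets)
    for term, weight in sorted(term_weights(index, len(snippets)).items()):
        print(f"{term}: {weight:.3f}")
```

Because only short snippets are indexed rather than whole documents, the index stays small, which is consistent with the memory constraints of a CE device.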
The SA 328 further includes a Phrase Identifier 408 that identifies important phrases using frequency and co-occurrence information stored in the index 412 along with a set of rules. This is used in identifying multi-word phrases such as “United Nations,” “Al Qaeda,” etc. In one example, the Phrase Identifier 408 internally maintains three lists: (1) a list of proper nouns, (2) a dictionary, and (3) a list of stop words. The Phrase Identifier 408 uses an N-gram based approach for phrase extraction, wherein to capture a phrase of length “N” words in a text, a window of size “N” words is slid across the text and all possible phrases (of length “N” words) are collected. Then the words in the collected phrases are passed through the following set of three example rules to filter out what are considered to be meaningless phrases: (1) A word ending with punctuation cannot be in the middle of a phrase; (2) For a phrase of two or more words, the first word in the phrase cannot be a stop word, other than the two articles “the” (definite) and “a/an” (indefinite), and the rest of the words cannot be stop words other than conjunctive stop words like “the,” “on,” “at,” “of,” “in,” “by,” “for,” “and,” etc. This is because the above-mentioned stop words are often used to combine two or more words, e.g., “war on terror,” “wizard of oz,” “the beauty and the beast,” etc.; and (3) Proper nouns and words not present in the dictionary are treated as meaningful phrases.
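The sliding-window extraction and the first two example rules can be sketched in Python as shown below. The word lists are small placeholders chosen for illustration, the function name `candidate_phrases` is hypothetical, and rule (3) is omitted because it depends on a proper-noun list and a dictionary that are not reproduced here. Frequency and co-occurrence scoring from the index is likewise left out.

```python
ARTICLES = {"the", "a", "an"}
# Stop words that may join other words inside a phrase (rule 2).
CONNECTORS = {"the", "on", "at", "of", "in", "by", "for", "and"}
# Placeholder stop-word list; a real list would be much longer.
STOP_WORDS = ARTICLES | CONNECTORS | {"is", "was", "to", "with", "it", "this", "that"}
PUNCTUATION = ".,;:!?"


def candidate_phrases(text, n):
    """Slide a window of n words over the text and keep phrases passing rules 1-2."""
    words = text.split()
    phrases = []
    for start in range(len(words) - n + 1):
        window = words[start:start + n]
        # Rule 1: a word ending with punctuation may only be the last word.
        if any(word[-1] in PUNCTUATION for word in window[:-1]):
            continue
        tokens = [word.strip(PUNCTUATION).lower() for word in window]
        if not all(tokens):
            continue
        if n >= 2:
            first, rest = tokens[0], tokens[1:]
            # Rule 2a: the first word may be a stop word only if it is an article.
            if first in STOP_WORDS and first not in ARTICLES:
                continue
            # Rule 2b: later words may be stop words only if they are connectors.
            if any(t in STOP_WORDS and t not in CONNECTORS for t in rest):
                continue
        phrases.append(" ".join(tokens))
    return phrases


if __name__ == "__main__":
    print(candidate_phrases("The war on terror continues. It is costly.", 3))
```

Note that the rules alone still admit candidates such as “the war on”; in the description, frequency and co-occurrence information from the index 412 is used together with the rules to decide which phrases are important.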
The SA 328 further includes a Term Extractor 410 that extracts the highest score terms and phrases 414 from the index 412 and sends the terms and phrases 414 to the CF 305.
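Selecting the highest-scoring entries can be sketched in Python as follows. The function name `extract_top_terms`, the parameter `k`, and the representation of the index as a mapping from term or phrase to weight are assumptions made for illustration.

```python
import heapq


def extract_top_terms(weights, k=5):
    """Return the k terms or phrases with the highest weights, best first."""
    best = heapq.nlargest(k, weights.items(), key=lambda item: item[1])
    return [term for term, _score in best]


if __name__ == "__main__":
    # Hypothetical weights, as might be produced for the query "sting".
    weights = {"police": 3.0, "tour": 1.8, "album": 1.4, "remedy": 0.7}
    print(extract_top_terms(weights, k=2))
```

Using a bounded selection rather than a full sort keeps the work proportional to the small number of terms that are actually forwarded to the CF 305.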
In another example, the sequence of operation of Phrase Identifier 408 and Indexer 406 can be interchanged. In that case, the text is first passed through a Phrase Identifier 408 to capture phrases and then the captured phrases are indexed as explained above.
Accordingly, searching and analysis according to the present invention makes the process of extracting relevant information from resources (e.g., the Internet) user-friendly, by suggesting topics relevant to the original query. Such searching and analysis requires minimal user involvement, can be performed in an online fashion (i.e., in real-time) and requires little memory and processing power, making it suitable for devices such as CE devices. Subtopics related to the original query are extracted and presented to the user in a way that is practical to perform in real-time on a CE device, is not expensive in terms of the amount of memory space required, and does not require the user to guide the process.
As noted, example partial taxonomy 500 is shown in
As is known to those skilled in the art, the example architectures described above, according to the present invention, can be implemented in many ways, such as program instructions for execution by a processor, as logic circuits, as an application specific integrated circuit, as firmware, etc. The present invention has been described in considerable detail with reference to certain preferred versions thereof; however, other versions are possible. Therefore, the spirit and scope of the appended claims should not be limited to the description of the preferred versions contained herein.