Modern computer networks facilitate storage and access of large amounts of data. For example, many websites (in the wider world), and data-stores (in the enterprise), contain large text corpora which can be accessed via communication networks. Due to the amount of data stored in this way, it is often difficult to locate a specific document, or documents related to a certain subject, etc. Typically, these sites and data-stores provide a search facility, or search engine, to allow a user to search for useful or desired information from the stored text corpora.
However, the provided search engine often has limited functionality and the returned results may not be adequate for a user's needs. More recently, advances have been made in providing more capable search tools which, for example, may include support for personalized searches or context based query enrichment.
While it might be desired to include such functionality in an existing search engine, this may not always be practical. For example, a user may not have control over a remotely provided resource, or it may be difficult to modify a legacy system to include the new functionality.
Embodiments of the present invention are further described hereinafter by way of example only with reference to the accompanying drawings, in which:
Embodiments of the invention provide advanced search functionality locally for accessing a remotely stored corpus of information. One approach to locally implement a more advanced search engine is to download an entire database of the corpus into a local server or server farm, index the documents, and run the improved search on the local copy of the corpus. This approach requires heavy memory resources and requires access to the underlying database behind a provided search engine, which may not always be available. A further complication arises when the corpus is regularly updated, as is often the case in real-world examples, as it then becomes necessary to ensure consistency between the downloaded database and the original copy held remotely.
The search engine provides search functionality for the contents of the database, returning a list of one or more documents present in the database in response to a search query provided over the network. Thus, to achieve a standard search of the corpus a user submits a search query to client apparatus 100 which passes the query to the search engine 104, via the network 102. The search engine 104 identifies one or more documents relating to the query present in the database 106 and provides the identified documents to the client apparatus 100.
For a search taking advantage of the advanced search functionality, the advanced search module 108 receives the search query submitted by the user and accesses the corpus 106 via the search engine 104 to generate the advanced search results, as will be discussed in greater detail below.
Embodiments of the present invention allow a user to apply more advanced search criteria at the client apparatus 100, such as to allow for personalized search or context based query enrichment, without requiring any change in the functionality of the search engine 104. In particular, a Corpus-Oriented User-Related Search Engine (COURSE) can be simulated at the client apparatus 100 using a standard search engine 104 to access the text corpus 106.
In order to provide the enhanced search capability, some statistics relating to the text corpus should be obtained prior to any searches of the corpus material being made. For example, to understand the relative importance of certain search terms in the context of the corpus, the frequency with which those terms appear in the corpus should be known. Typically, this has been achieved by analyzing the complete corpus to measure the frequencies for terms. However, downloading the whole corpus for analysis may be impractical, particularly in the case of very large remotely stored corpora.
According to embodiments of the invention, a sampling approach is applied to obtain frequency statistics for the appearance of terms in the corpus. By downloading a certain portion of the documents of the corpus, and analyzing the downloaded documents, it is possible to estimate term frequencies for terms in the corpus as a whole. For example, one percent of the documents of the corpus may be sufficient to allow frequency statistics for the whole corpus to be estimated. For each term, an inverse document frequency (IDF) can be estimated based on the downloaded documents.
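The sampling approach can be illustrated with a minimal Python sketch. The function names, the tokenizer, and the smoothed IDF form log((1 + n) / (1 + df)) + 1 are assumptions made for illustration only; the description does not specify which IDF variant or which sampling procedure is used.

```python
import math
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into alphanumeric terms (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sample_ids(doc_ids, fraction=0.01, seed=None):
    """Pick a random fraction of document identifiers, at least one."""
    k = max(1, round(len(doc_ids) * fraction))
    return random.Random(seed).sample(doc_ids, k)


def estimate_idf(sampled_docs):
    """Estimate an IDF value per term from the sampled documents only.

    The smoothed form log((1 + n) / (1 + df)) + 1 is an assumption.
    """
    n = len(sampled_docs)
    doc_freq = Counter()
    for text in sampled_docs:
        doc_freq.update(set(tokenize(text)))  # count each term once per document
    return {term: math.log((1 + n) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


# Usage: the three strings stand in for documents downloaded from the corpus.
sample = ["java class loader", "java coffee beans", "python class"]
idf = estimate_idf(sample)
```

Terms that occur in fewer sampled documents receive a larger estimate, so in this toy sample "python" is weighted above "java".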
Using a sampling approach, as outlined above, it is possible that any initially generated statistics may not accurately reflect the contents of the corpus. However, as the steps 302 and 304 are repeated, different portions of the corpus may be considered leading to the generated IDF estimates becoming more accurate over time.
Since the client apparatus 100 does not have direct control over the weights of the search terms as applied by the remote search engine 104, the ordering of the search results may be different than desired. More importantly, since only part of the results are examined at the client apparatus 100, the ordering of search results by the search engine 104 may omit some documents considered as important at the client apparatus 100. For this reason, the client apparatus 100 requests more results from the search engine 104 than required for implementing the advanced search. For example, the client apparatus 100 may request four hundred search results, where it is desired only to use the one hundred most relevant.
In step 404 of the method 400, the text content of each document received from the search engine 104 is extracted. Using this information a weight is assigned for each document, taking into account one or more of the following items:
The received search results are then sorted according to the assigned weight values, and the highest-weighted portion, for example the top one hundred weighted documents, is taken as a hit list. It is assumed that this hit list does not change dramatically whether four hundred search result documents are received from the search engine 104 or many more. In other words, it is assumed that the most relevant results also have a high probability of being highly ranked by the search engine 104 supplied by the web site or data-store.
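The over-fetch and rerank step can be sketched as follows. The weighting function shown, a sum of estimated IDF values for the query terms found in a document, is a hypothetical stand-in, since the actual weighting criteria are not reproduced here; `rerank` and `idf_overlap_score` are illustrative names.

```python
def idf_overlap_score(query_terms, idf):
    """Build a hypothetical weight function: sum of IDF estimates of the
    query terms that occur in the document text."""
    def score(doc_text):
        present = set(doc_text.lower().split())
        return sum(idf.get(term, 0.0) for term in query_terms if term in present)
    return score


def rerank(results, score, keep=100):
    """Weight every fetched result on the client and keep the top `keep`.

    The sort is stable, so equally weighted documents stay in the order
    in which the remote search engine returned them.
    """
    weighted = [(score(doc), doc) for doc in results]
    weighted.sort(key=lambda pair: pair[0], reverse=True)
    return weighted[:keep]


# Usage: four fetched results reduced to a hit list of two.
idf = {"java": 1.2, "class": 1.5, "and": 0.1}
results = ["java and coffee", "java class loader", "class schedule", "and so on"]
hit_list = rerank(results, idf_overlap_score(["java", "class"], idf), keep=2)
```

In practice the client would request, for example, four hundred results and call `rerank` with `keep=100`.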
In a next step 406, the query is extended based on correlated terms present in the documents of the hit list, i.e. terms present in the documents of the hit list having a high correlation with the terms of the original query are identified to provide a context aware extension of the original search query. A method of identifying highly correlated terms is discussed below.
Let D be the sequence of all documents, ordered by their weight. Let di be the ith document in D, and wi its weight. Assume that for every document outside the hit list the weight is zero (so w is the weight vector of all documents). For each term tj let δj be a vector of the same length, where δij (the ith element in δj) is an indicator of whether the jth term appears in the ith document. We now compute the weighted correlation between the term and the set of results:
Note that in order to compute the above expression, to determine the weighted correlation between each term and the set of results, we only need the frequency of the term tj, the weights of the documents in the hit list, and δij for the documents in the hit list. The frequencies are assessed using the sampled statistics computed according to method 300 illustrated in
It should also be noted that a term present in the original query may not necessarily be part of the second, extended, query. Take for example the query “java and class”, and assume “and” is not a stop word. In this case, the word “and” is unlikely to be strongly correlated with the top results and thus will not appear in the second query string.
After analysis of the terms present in the documents of the hit list, a number of the most correlated terms are chosen in step 408 to constitute the second, extended, query. For example, the top twenty terms, or all terms having a correlation above a certain threshold value, may be selected.
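As an illustration of correlation-based term selection, the sketch below assumes an ordinary Pearson correlation between the weight vector w and the indicator vector of a term. This is an assumption, not necessarily the expression used in the embodiments. Because w is zero outside the hit list, only the hit-list weights, the hit-list indicators, and corpus-level estimates are needed. The parameter names `doc_fraction` (estimated fraction of corpus documents containing the term) and `total_docs` (estimated corpus size) are hypothetical.

```python
import math


def weighted_correlation(hit_weights, hit_indicators, doc_fraction, total_docs):
    """Pearson correlation between document weights and a term indicator.

    Weights are zero outside the hit list, so all sums run over the hit
    list only; the means are still taken over `total_docs` documents.
    """
    n = total_docs
    mean_w = sum(hit_weights) / n
    var_w = sum(w * w for w in hit_weights) / n - mean_w ** 2
    var_d = doc_fraction * (1.0 - doc_fraction)  # variance of a 0/1 indicator
    if var_w <= 0.0 or var_d <= 0.0:
        return 0.0  # correlation is undefined for a constant vector
    mean_wd = sum(w for w, present in zip(hit_weights, hit_indicators) if present) / n
    return (mean_wd - mean_w * doc_fraction) / math.sqrt(var_w * var_d)


def top_terms(correlations, k=None, threshold=None):
    """Select the k most correlated terms and/or those above a threshold."""
    ranked = sorted(correlations.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        ranked = [(term, c) for term, c in ranked if c > threshold]
    return [term for term, _ in ranked[:k]]


# Usage: a rare term in the top hits versus a term found almost everywhere.
weights = [3.0, 2.0, 1.0]
correlations = {
    "jvm": weighted_correlation(weights, [1, 1, 0], 0.01, 1000),
    "and": weighted_correlation(weights, [1, 1, 1], 0.9, 1000),
}
extended_query = top_terms(correlations, k=1)
```

A term that is rare in the corpus but concentrated in the highly weighted documents scores well, while a ubiquitous term such as "and" scores close to zero. Since `doc_fraction` comes from sampled statistics, the computed value is an estimate and may fall slightly outside the range from -1 to 1.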
The second query is submitted to the supplied search engine 104, and a second set of search results is obtained from the search engine at step 410.
The second set of search results may then be analyzed to extract the text content and identify terms, and then to assign a weight value to each document as applied to the documents of the first search results in step 404. The same criteria may be used to assign a weight value to the documents of the second search results as are used to assign weights to the documents of the first search results. Thus, a document containing query terms with high correlation will have higher weight. Finally, the results are reranked in order to reflect the weights assigned to the documents according to those parameters.
The reranked documents can then be presented to the user of the client apparatus 100 as an output of the context aware search.
According to some embodiments, the search is further personalized to the user. In order to perform personalized search, it is assumed that the identity of the user is known to the system (e.g., by logging in). For a given query, the personal details, e.g. the user name, are added as additional terms to the query; the query is then invoked in the supplied search engine. An alternative method of adding personalized search results is submitting two separate queries: one with the original terms, and the second requiring that the results contain the user name. The result lists from the two queries will be concatenated and weighted as described above.
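The two-query alternative can be sketched as below. The `search` callable is a hypothetical wrapper around the supplied search engine, quoting the user name to require it is an assumed query syntax, and the removal of duplicates during concatenation is an added assumption rather than something stated in the description.

```python
def personalized_results(search, query_terms, user_name, limit=400):
    """Run the original query and a user-specific query, then concatenate.

    `search(query_string, limit)` is assumed to return a list of document
    identifiers in the order ranked by the remote search engine.
    """
    base_query = " ".join(query_terms)
    general = search(base_query, limit)
    personal = search(f'{base_query} "{user_name}"', limit)

    merged, seen = [], set()
    for doc_id in general + personal:
        if doc_id not in seen:  # keep the first occurrence only
            seen.add(doc_id)
            merged.append(doc_id)
    return merged


# Usage with a stand-in for the remote search engine.
def fake_search(query, limit):
    index = {
        "java class": ["d1", "d2"],
        'java class "alice"': ["d2", "d3"],
    }
    return index.get(query, [])[:limit]


merged = personalized_results(fake_search, ["java", "class"], "alice")
```

The merged list would then be weighted and reranked on the client in the same way as the results of a non-personalized search.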
Throughout the description and claims of this specification, the words “comprise” and “contain” and variations of them mean “including but not limited to”, and they are not intended to (and do not) exclude other moieties, additives, components, integers or steps. Throughout the description and claims of this specification, the singular encompasses the plural unless the context otherwise requires. In particular, where the indefinite article is used, the specification is to be understood as contemplating plurality as well as singularity, unless the context requires otherwise.
Features, integers, characteristics, compounds, chemical moieties or groups described in conjunction with a particular aspect, embodiment or example of the invention are to be understood to be applicable to any other aspect, embodiment or example described herein unless incompatible therewith. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive. The invention is not restricted to the details of any foregoing embodiments. The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed.
The reader's attention is directed to all papers and documents which are filed concurrently with or previous to this specification in connection with this application and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US12/48863 | 7/30/2012 | WO | 00 | 10/29/2014 |