The present application claims priority from Japanese application JP 2004-174363 filed on Jun. 11, 2004, the content of which is hereby incorporated by reference into this application.
The present invention relates to a document retrieval system, and more particularly to an associative search system that displays a summary of a search result from multiple viewpoints.
With the widespread use of computers and the Internet, the digitization of document information is advancing rapidly. As the amount of accessible information increases, locating the necessary information within it is becoming an important challenge. Moreover, there is an increasing demand to examine the relevance levels of documents across plural document databases. For example, there is a growing demand to search for encyclopedia items related to newspaper articles of interest.
Keyword search, which is presently in practical use, allows the user to switch among plural document databases for a search. However, it cannot retrieve a document set related to a document set contained in a given document database, whether from the same document database or from other document databases (a search method called document associative search).
Within a single document database, the document associative search with a document set as search input can be implemented simply by calculating the relevance levels among documents in advance. For plural document databases, however, the number of document combinations whose relevance levels must be calculated in advance grows explosively as the number of document databases increases, so the document associative search is practically impossible by this approach.
In contrast to this, JP-A No. 155758/2000 “Document Retrieval Method and Document retrieval Service for Plural Document Databases” discloses a method that efficiently retrieves, from arbitrary document databases, a document set related to a document set in a user-specified document database. This method achieves rapid document associative search by using only the characteristic words within the document set supplied as search input. This method enables the user to perform accurate and efficient document retrieval by examining relevance levels of document sets while switching among plural document databases of different types. This method also aids the user in determining whether a search result is satisfactory, by extracting characteristic words occurring in a document set obtained as the search result and presenting them to the user as a summary of the search result. [Patent document 1] JP-A No. 155758/2000
To achieve document retrieval based on words, documents are indexed by the words occurring in them. The same is true for the method disclosed in JP-A No. 155758/2000. To extract characteristic words from a document, the importance of each word contained in the document is calculated using statistical measures (e.g., the tf*idf method), and the words are extracted in descending order of importance. It is common to create one index for one document database. However, technical terms (disease names, gene names, protein names, etc. in the biomedicine field) and fact information (for example, protein-protein interactions in the biomedicine field) are difficult to extract as characteristic words because they are buried in the general word distribution. Since a single index yields a summary limited to one viewpoint as a search result, the summary display may not be satisfactory when that viewpoint does not match the user's query and interest.
The present invention has been made in view of the above circumstances and provides a document retrieval system that provides a summary display of a search result from multiple viewpoints matching the user's interest.
To solve the above-mentioned problem, the present invention indexes one document database in plural ways to enable a summary display of a search result from multiple viewpoints.
For example, one document database is indexed by ordinary words, technical terms, and fact information. To establish correspondences among the indexed versions of the document database, individual documents are managed by common identifiers so that a summary of a given document can be created using the different indexes.
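As a minimal sketch of this arrangement, the following Python fragment indexes the same small document set twice, once by ordinary words and once by protein names, and keys both indexes by a common document identifier. The sample documents, the `PROTEIN_NAMES` dictionary, and all function names are hypothetical and serve only as illustration; an actual system would use morphological analysis, stemming, or a term recognizer in place of these simple extractors.

```python
from collections import defaultdict

# Hypothetical documents, keyed by the common identifier shared by all indexes.
DOCUMENTS = {
    "doc1": "p53 binds MDM2 in tumor cells",
    "doc2": "MDM2 regulates p53 degradation",
}

# Hypothetical dictionary of technical terms (assumption for illustration).
PROTEIN_NAMES = {"p53", "MDM2"}


def extract_words(text):
    """Ordinary-word viewpoint: lowercase whitespace tokens."""
    return [w.lower() for w in text.split()]


def extract_proteins(text):
    """Technical-term viewpoint: tokens found in the protein dictionary."""
    return [w for w in text.split() if w in PROTEIN_NAMES]


def build_index(documents, extractor):
    """Return (forward, inverted) indexes for one viewpoint."""
    forward = {}
    inverted = defaultdict(set)
    for doc_id, text in documents.items():
        terms = extractor(text)
        forward[doc_id] = terms
        for term in terms:
            inverted[term].add(doc_id)
    return forward, dict(inverted)


# One document database, indexed in plural ways.
INDEXES = {
    "word": build_index(DOCUMENTS, extract_words),
    "protein": build_index(DOCUMENTS, extract_proteins),
}


def terms_for(doc_id, viewpoint):
    """Look up the terms of one document under the chosen viewpoint."""
    forward, _ = INDEXES[viewpoint]
    return forward[doc_id]
```

Because both indexes use the same identifiers, `terms_for("doc1", "word")` and `terms_for("doc1", "protein")` describe the same document from two viewpoints.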
A document retrieval system of the present invention includes a search client having: an input part that inputs queries; a part for showing search result that displays searched document sets; and a part for showing topic words that displays summaries of the searched document sets, and a search server having: a document database that stores indexed plural documents; a part for search that retrieves, in response to a received query, highly related documents from the document database; and a part for summarization by extracting topic words that creates, for a given document set, a summary using the indexes, wherein plural different types of indexes are provided as the indexes.
The part for showing topic words of the search client displays plural types of summaries correspondingly to different viewpoints. The part for showing search result includes a part for selecting documents that selects the documents to become keys to a next search from a displayed document set, and the part for showing topic words includes a part for selecting topic words that selects the elements to become keys to a next search from elements of a displayed summary.
By viewing summaries from multiple viewpoints for a document set obtained as a search result, the user can grasp the nature of the search result more appropriately. Moreover, since relations among the viewpoints can be grasped through the documents subject to retrieval, the search result can be analyzed in more detail.
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
The respective means for search 402, 502, and 602 of the search servers 40, 50, and 60 retrieve, in response to a query sent from the associative search server 30, highly related document sets from the document databases 403, 503, and 603, and return a search result with weighted relevance levels to the associative search server 30. The means for search can be implemented by known keyword search methods.
The keyword search method splits, to increase the efficiency of search processing, a document contained in a document database into words (performs morphological analysis for Japanese documents and stemming processing for English documents), and creates indexes to indicate what words are contained in what documents. During search execution, since the created indexes are read into main storage, the search processing can be performed at high speed. In
The respective means for summarization by extracting topic words 401, 501, and 601 of the search servers 40, 50, and 60 create a summary of a document set retrieved from the document databases 403, 503, and 603. The summary here refers to a set of words indicating the contents of the document set. As the means for summarization by extracting topic words, existing methods disclosed in JP-A No. 155758/2000 are available. The above-mentioned indexes are also used to create summaries. That is, what words are contained in a given document is determined by referring to the indexes.
As an example, the frequencies of words contained in all documents in the document set whose summary is to be created are counted. Generally, since words occurring more frequently in a document set are more representative of the document set, they are more likely to be contained in a summary. However, common words such as “SURU (perform)” that occur frequently in any document are not suitable as topic words. Therefore, topic words are usually selected also in consideration of the occurrences of the words in the document database to which the document set belongs. Specifically, words that occur more frequently in the specified document set and less frequently in the whole document database are more characteristic, and more suitable as topic words characterizing the document set, because they occur conspicuously only in that document set. To be more specific, a weight is calculated for each word in the document set by a suitable function that takes as input the occurrence frequency of the word in the document set and its occurrence frequency in the document database, and words having a weight equal to or greater than a given threshold are adopted as topic words.
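The selection described above can be sketched as follows. The text does not specify the weighting function, so this sketch assumes a tf*idf-style weight: the occurrence count in the document set multiplied by the logarithm of the inverse document frequency in the database. The function name `topic_words` and the data layout are hypothetical, and the document set is assumed to be a subset of the database.

```python
import math
from collections import Counter


def topic_words(doc_set, database, threshold):
    """Select topic words for doc_set.

    doc_set, database: dicts mapping document ID -> list of words.
    doc_set is assumed to be a subset of database.
    The weight is an assumed tf*idf-style function, not the only possible one.
    """
    n_docs = len(database)

    # Number of database documents in which each word occurs.
    doc_freq = Counter()
    for words in database.values():
        doc_freq.update(set(words))

    # Occurrence frequency of each word within the document set.
    set_freq = Counter()
    for words in doc_set.values():
        set_freq.update(words)

    weights = {
        w: set_freq[w] * math.log(n_docs / doc_freq[w]) for w in set_freq
    }
    selected = [w for w, weight in weights.items() if weight >= threshold]
    # Descending weight; alphabetical order breaks ties deterministically.
    return sorted(selected, key=lambda w: (-weights[w], w))


database = {
    "d1": ["perform", "protein", "binding"],
    "d2": ["perform", "protein", "assay"],
    "d3": ["perform", "weather"],
    "d4": ["perform", "stock"],
    "d5": ["perform", "sports"],
}
doc_set = {k: database[k] for k in ("d1", "d2")}
result = topic_words(doc_set, database, threshold=1.0)
```

In this toy data, "perform" occurs in every document of the database, so its weight is zero and it is excluded, while "protein" is frequent in the document set but rare in the database and ranks first.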
The search client 20 includes means for inputting query 201, means for showing search result 202, and means for showing topic words 203.
The associative search server 30 includes: means for analyzing queries 301 that analyzes queries sent from the search client 20; means for constructing queries 302 that distributes queries sent from the search client 20 to the search servers 40, 50, and 60; and means for requesting topic words 303 that requests topic words for document sets from the search servers 40, 50, and 60.
The means for analyzing queries 301 analyzes a query sent from the search client 20 and identifies the words contained in it to create a search key. The means for analyzing queries 301 includes at least a morphological analysis process of splitting sentences into words for Japanese text, and a stemming process of reducing words to their base forms and attaching parts of speech for English text.
A query sent to the means for constructing queries 302 is one of: (1) a word set created by the means for analyzing queries 301; (2) a set of document IDs sent from the means for showing search result 202 (means for selecting document sets) included in the search client 20; or (3) a word set sent from the means for showing topic words 203 (means for selecting topic words) included in the search client 20. When a query is (1) or (3), the word set is sent to the search server as the query. When a query is (2), the means for requesting topic words 303 requests, from the search server, a summary of the document set corresponding to the set of document IDs, and the received topic word set is sent to a search server as the query. To which search server the means for constructing queries 302 sends a query depends on the contents of the indexes the search servers hold; its operation will be described using an example described later.
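The three cases above can be sketched as a small dispatch function. The tuple representation of a query, the kind labels, and the `request_topic_words` callback (standing in for the means for requesting topic words 303) are assumptions made for this illustration; the choice of destination search server is left out.

```python
def construct_query(query, request_topic_words):
    """Turn an incoming query into the word set sent to a search server.

    query: (kind, payload) tuple; the kind labels are hypothetical names
           for cases (1), (2), and (3).
    request_topic_words: callable that takes a list of document IDs and
           returns a topic word list (stand-in for the means 303).
    """
    kind, payload = query
    if kind in ("analyzed_words", "selected_topic_words"):
        # Cases (1) and (3): the word set itself is the query.
        return list(payload)
    if kind == "document_ids":
        # Case (2): obtain a summary of the documents and use it as the query.
        return list(request_topic_words(payload))
    raise ValueError(f"unknown query kind: {kind!r}")


def fake_request_topic_words(doc_ids):
    """Hypothetical stub that returns fixed topic words per document."""
    summaries = {"doc1": ["p53", "MDM2"], "doc2": ["BAX"]}
    words = []
    for doc_id in doc_ids:
        words.extend(summaries.get(doc_id, []))
    return words


q1 = construct_query(("analyzed_words", ["tumor", "cell"]), fake_request_topic_words)
q2 = construct_query(("document_ids", ["doc1", "doc2"]), fake_request_topic_words)
q3 = construct_query(("selected_topic_words", ["p53"]), fake_request_topic_words)
```

Passing the summary request in as a callback keeps the dispatch logic independent of how the search servers are reached.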
In conventional associative search systems, one document database has been indexed from only one viewpoint. The present invention intends to increase user convenience by indexing one document database from multiple viewpoints. The requirements for achieving this are (1) creating indexes from multiple viewpoints, and (2) managing identical documents contained in the plural indexed document databases by common identifiers. By managing the identical documents by the common identifiers, correspondence can be maintained between the respective indexes for the document sets obtained as search results. Therefore, topic words can be created for identical document sets from different viewpoints.
Hereinafter, the flow of processing will be described using sequence diagrams of
The following describes the flow of processing with reference to the sequence diagram of
A sequence diagram of
First, the case of performing a re-search from documents obtained as a search result is described. The user selects the documents to become keys to the re-search by using the means for selecting documents 202 of the search client 20. The identifiers of the selected documents are transmitted to the associative search server 30 (T21). The means for requesting topic words 303 of the associative search server 30 transmits a request to create a summary of the selected documents to the search server 40 (T22). The means for summarization by extracting topic words 401 of the search server 40 creates topic words using the index 404. Specifically, as described previously, it statistically selects important words by the same method as described in JP-A No. 155758/2000 to create topic words. The created topic words are transmitted to the associative search server 30 (T23).
When the user executes a re-search only from documents, obtained topic words are transmitted to the search server 40 by the means for constructing queries 302 of the associative search server 30 (T25). The means for search 402 of the search server 40 searches the document database 403 by using the index 404, and transmits the result to the associative search server 30 (T26). Subsequent processing is the same as processing after the means for summarization by extracting topic words in the sequence diagram of
When performing a re-search from topic words, the user selects the words to become keys to the re-search by using the means for selecting topic words 203 of the search client 20. At this time, words of multiple viewpoints may be specified at the same time. Selected words or word identifiers are transmitted to the associative search server 30 (T24). Subsequent processing is the same as processing after the means for constructing queries in the sequence diagram of
By performing a re-search using topic words created from a certain viewpoint, the relation between that viewpoint and other viewpoints can be grasped through the document databases. As an example, when a re-search is performed using topic words composed of protein names, documents related to the selected protein names are obtained, and moreover, protein interactions related to the selected protein names can be obtained. This enables a detailed analysis of the search result from different viewpoints.
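The protein-name example can be sketched as follows, assuming two indexes over the same documents that share common identifiers. The index contents, the term notation such as "p53-MDM2", and the function names are hypothetical, and the summary here ranks terms by raw count rather than by the statistical weighting described earlier.

```python
from collections import Counter

# Two hypothetical indexes of one document database, keyed by common IDs.
PROTEIN_INDEX = {
    "doc1": ["p53", "MDM2"],
    "doc2": ["p53", "BAX"],
    "doc3": ["EGFR"],
}
INTERACTION_INDEX = {
    "doc1": ["p53-MDM2"],
    "doc2": ["p53-BAX"],
    "doc3": [],
}


def search(index, keys):
    """Return IDs of documents whose indexed terms include any search key."""
    wanted = set(keys)
    return [doc_id for doc_id, terms in index.items() if wanted & set(terms)]


def summarize(index, doc_ids):
    """List the terms of the given documents under another viewpoint.

    Simplification: terms are ranked by raw count, ties alphabetically.
    """
    counts = Counter(term for doc_id in doc_ids for term in index[doc_id])
    return sorted(counts, key=lambda term: (-counts[term], term))


# Re-search keyed by a protein name selected from the topic words.
hits = search(PROTEIN_INDEX, ["p53"])
interactions = summarize(INTERACTION_INDEX, hits)
```

The document identifiers returned by the protein-name search are used directly against the interaction index, which is what lets the relation between the two viewpoints be read off the same documents.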
The following describes a variant of the present invention with reference to
In the first embodiment, the viewpoint from which a summary of a search result is created is fixed in advance. However, plural search servers holding indexes from multiple viewpoints may be provided in advance so that the user can select the desired viewpoint to be used.
Means for selecting viewpoints 2013 presents, as viewpoints (view1, view2), three selectable viewpoints (index by “gene”, index by “protein”, and index by “protein interaction”). The user selects a viewpoint from which a summary is to be obtained. In an example of
After this, the user inputs a query to a query input area 2011 and clicks a search command button 2012 to perform a search. Subsequent processing is the same as that in the first embodiment.
The following describes a variant of the present invention with reference to
In the first embodiment, different servers hold indexes having been created from multiple viewpoints. Specifically, the index of
In the first embodiment, the means for constructing queries 302 of the associative search server 30 controls to which search server a query is to be issued according to the type of the query. As shown in
Number | Date | Country | Kind |
---|---|---|---|
2004-174363 | Jun 2004 | JP | national |