A first embodiment of the present invention will be explained with reference to
As shown in
In the document searching apparatus 1, when a user turns on the electric power thereof, the CPU 101 runs a program that is called a loader and is stored in the ROM 102. A program that is called an Operating System (OS) and manages hardware and software in the computer is read from the HDD 104 into the RAM 103 so that the OS is activated. The OS runs a program according to an operation by the user, reads information, and stores information. A typical example of an OS is Windows (registered trademark). Operation programs that run on such an OS are called application programs. Application programs include not only programs that operate on a predetermined OS, but also programs that cause an OS to take over execution of a part of various types of processes described later, as well as programs that are contained in a group of program files that constitute predetermined application software or an OS.
The document searching apparatus 1 has a structured-document searching program stored in the HDD 104, as an application program. In this sense, the HDD 104 functions as a storage medium that has stored therein the structured-document searching program.
Generally, each of the application programs to be installed in the HDD 104 included in the document searching apparatus 1 is recorded in one of storage media 110 including optical disks such as CD-ROMs and Digital Versatile Disks (DVDs), various types of magneto optical disks, various types of magnetic disks such as flexible disks, and media that use various methods such as semiconductor memories, so that the operation programs recorded on the storage media 110 can be installed into the HDD 104. Thus, storage media 110 that are portable, like optical information recording media such as CD-ROMs and magnetic media such as Floppy Disks (FDs), can also be each used as a storage medium for storing therein an application program. Further, it is also acceptable to install application programs into the HDD 104 after obtaining the application programs from an external source via, for example, the communication controlling device 106.
In the document searching apparatus 1, when the structured-document searching program that operates on the OS is run, the CPU 101 performs various types of computation processes and controls the functional units in an integrated manner, according to the structured-document searching program. Of the various types of computation processes performed by the CPU 101 included in the document searching apparatus 1, characteristic processes according to the first embodiment will be explained below.
The input unit 11 has a function of receiving an input of a search query from a user. The converting unit 12 has a function of converting the search query received by the input unit 11 into a search query that is suitable for conducting a search in structured documents being a search target. The searching unit 13 has a function of conducting a search in the structured documents by using the search query converted by the converting unit 12. The output unit 14 has a function of presenting a search result obtained by the searching unit 13 to the user.
The conversion rule DB 15 is a database that stores therein conversion rules 20.
The “search method used after conversion” is a portion that specifies a search method that corresponds to the converted search target element and the converted query sentence. This item is specified because it is necessary to specify an optimal search method for the converted query sentence for the reason that, for example, a suitable method for processing words can be different between when a search is conducted in a document written in Japanese and when a search is conducted in a document written in English. As another example, when a Kanji/Kana sentence (i.e., a sentence written by using both Chinese characters and Japanese phonetic characters) obtained as a result of performing automatic audio recognition on information uttered by a speaker is expressed in an element specified by “/audio recognition”, and also the reading of the “/audio recognition” that uses the Japanese phonetic characters is expressed in an element specified by “/audio recognition reading”, an input query sentence is converted into a query sentence written in the Japanese phonetic characters with respect to the “/audio recognition reading” portion, and a search method that uses “edit distance” is used.
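As an illustration of the “edit distance” search method mentioned above, the following is a minimal sketch of the standard Levenshtein distance. The function name `edit_distance` is an assumption made for this example; the embodiment does not state which variant of edit distance is used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

Under this sketch, a query reading could be treated as matching a reading stored in the “/audio recognition reading” element when the distance falls below some threshold; such a threshold is likewise an assumption and is not specified in the embodiment.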
The structured document index DB 16 is a database that stores therein structured document indexes 30.
For example, in the vocabulary index 31 shown in
Next, a schematic procedure of the process performed with the configuration above will be explained. First, the input unit 11 receives a search query that has been input by a user and forwards the received search query to the converting unit 12. The converting unit 12 serves as a query converting unit. Having received the search query from the input unit 11, the converting unit 12 converts the search query by using the conversion rules 20 stored in the conversion rule DB 15 and forwards the converted search query to the searching unit 13. The searching unit 13 serves as a document searching unit. The searching unit 13 conducts a search on constituent elements of structured documents, based on the structured document indexes 30 stored in the structured document index DB 16, by using the search query received from the converting unit 12, and forwards a search result to the output unit 14. The output unit 14 serves as a search-result presenting unit. The output unit 14 presents the received search result to the user.
Next, the converting unit 12 will be explained further in detail.
In this situation, a process of “conducting a search for a document that contains SHIZEN GENGO (=natural language) in the YOUYAKU (=summary) and returning the title thereof as a result” that is performed on structured documents like the one shown in
In the present example, in the search query received from the input unit 11, the search target element specifying portion is “/YOUYAKU J (=summary J)”; the query sentence portion is “SHIZEN GENGO SHORI (=natural language processing)”; and the search method specifying portion is “TF-IDF search with Japanese words”.
Next, the converting unit 12 checks the search target element specified in the search query received from the input unit 11 (step S2). As a result, it is understood that the element “YOUYAKU J (=summary J)” has been specified.
Subsequently, the converting unit 12 looks for a search target element after a conversion, the conversion method for the query sentence, and the search method, with respect to the specified search target element, according to the conversion rules 20 of which some examples are shown in
After that, the converting unit 12 converts the search query according to the method found at step S3 (step S4). In the present example, the query sentence “SHIZEN GENGO SHORI (=natural language processing)” within the search query received from the input unit 11 is translated into “natural language processing” according to the conversion rule 20.
As a result of the process described above, the input search query in which ‘the search target element specifying portion is “/YOUYAKU J (=summary J)”; the query sentence portion is “SHIZEN GENGO SHORI (=natural language processing)”; and the search method specifying portion is “TF-IDF search with Japanese words”’ is converted into a search query in which ‘the search target element specifying portion is “/YOUYAKU E (=summary E)”; the query sentence portion is “natural language processing”; and the search method specifying portion is “TF-IDF search with English words”’.
Finally, the converting unit 12 forwards the converted search query to the searching unit 13 (step S5).
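The conversion performed at steps S2 to S4 can be sketched as follows. This is only an illustrative outline under stated assumptions: the class names, field names, and the one-entry dictionary standing in for the “English translation” are hypothetical and are not part of the embodiment, and returning the query unchanged when no rule matches is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SearchQuery:
    target_element: str  # search target element specifying portion
    sentence: str        # query sentence portion
    method: str          # search method specifying portion


@dataclass(frozen=True)
class ConversionRule:
    input_element: str                      # search target element within input search query
    converted_element: str                  # search target element after conversion
    convert_sentence: Callable[[str], str]  # conversion method for the query sentence
    converted_method: str                   # search method used after conversion


# Hypothetical stand-in for a translation step (one entry only).
_TOY_DICTIONARY = {"SHIZEN GENGO SHORI": "natural language processing"}


def translate_to_english(sentence: str) -> str:
    return _TOY_DICTIONARY.get(sentence, sentence)


RULES: List[ConversionRule] = [
    ConversionRule("/YOUYAKU J", "/YOUYAKU E", translate_to_english,
                   "TF-IDF search with English words"),
]


def convert_query(query: SearchQuery, rules: List[ConversionRule]) -> SearchQuery:
    # Step S2: check the specified search target element.
    # Step S3: look for a rule whose input element matches it.
    for rule in rules:
        if rule.input_element == query.target_element:
            # Step S4: convert the element, the sentence, and the method.
            return SearchQuery(rule.converted_element,
                               rule.convert_sentence(query.sentence),
                               rule.converted_method)
    return query  # assumption: no matching rule leaves the query unchanged


converted = convert_query(
    SearchQuery("/YOUYAKU J", "SHIZEN GENGO SHORI",
                "TF-IDF search with Japanese words"),
    RULES,
)
```

In this sketch, `converted` holds the search target element “/YOUYAKU E”, the query sentence “natural language processing”, and the search method “TF-IDF search with English words”, which corresponds to the converted search query described above.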
The conversion method for the query sentence is not limited to the example shown in
Next, the searching unit 13 will be explained further in detail. By using the search query received from the converting unit 12 and the structured document indexes 30, the searching unit 13 conducts a search in structured documents and forwards a result to the output unit 14.
Next, the searching unit 13 processes the query sentence in correspondence with the search method (step S12). In the present example, a stemming process is performed on the query sentence “natural language processing” so that “natural”, “language”, and “process” are extracted as search words.
Next, the searching unit 13 checks a structure (i.e., an element) that is used as the search target (step S13). In the present example, it is understood that the structure (i.e., the element) being the search target is “/YOUYAKU E (=summary E)”.
Subsequently, the searching unit 13 searches for a document that contains information suitable for the query sentence within the target structure (i.e., the target element) (step S14). In the present example, it is understood that, based on the vocabulary index 31 included in the structured document indexes 30, “natural”, “language”, and “process” appear in the “/YOUYAKU E (=summary E)” in the structured document 2, and that the structured document 2 is a suitable search result.
Finally, the searching unit 13 obtains the structured document 2 from the main text index and forwards it to the output unit 14 as the search result (step S15).
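Steps S12 to S14 can be illustrated with the following sketch. The layout assumed here for the vocabulary index (word, then element path, then a set of document IDs), the crude suffix-stripping function used in place of a real stemming process, and the requirement that every search word appear in the target element are all assumptions made for this example.

```python
from typing import Dict, Set

# Assumed layout: word -> {element path -> IDs of documents containing the word}
VOCABULARY_INDEX: Dict[str, Dict[str, Set[int]]] = {
    "natural": {"/YOUYAKU E": {2}},
    "language": {"/YOUYAKU E": {2}},
    "process": {"/YOUYAKU E": {2}},
}


def toy_stem(word: str) -> str:
    # Hypothetical stand-in for a stemmer: strips a trailing "ing" only.
    return word[:-3] if word.endswith("ing") and len(word) > 5 else word


def search(sentence: str, element: str,
           index: Dict[str, Dict[str, Set[int]]]) -> Set[int]:
    # Step S12: process the query sentence into search words.
    words = [toy_stem(w) for w in sentence.lower().split()]
    # Steps S13 and S14: look up each word within the target element only.
    postings = [index.get(w, {}).get(element, set()) for w in words]
    # Assumption: a document matches only if it contains every search word.
    return set.intersection(*postings) if postings else set()


hits = search("natural language processing", "/YOUYAKU E", VOCABULARY_INDEX)
```

With the sample index above, `hits` contains only the document ID 2, in line with the example in which the structured document 2 is found.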
The output unit 14 presents an output result as shown in
As explained above, according to the first embodiment, a new search query is generated by converting, according to the predetermined rule, a query sentence that constitutes a search query and an element being a search target of the query sentence. For example, the predetermined rule can be set so that, when the search target element in a search query is “/YOUYAKU J (=summary J)”, the search target element is converted into the search target element “/YOUYAKU E (=summary E)”, then “English translation” is applied to the input query sentence, and a “TF-IDF search with English words” is performed by using the converted search target element and the converted query sentence. With this rule, it is possible to conduct a search for a document that contains the character string “natural language processing” within the element “summary”, based on a search query indicating that a search should be conducted for a document that contains “SHIZEN GENGO SHORI (=natural language processing)” within the element “YOUYAKU (=summary)”. Consequently, it is possible to search for a document desired by a user in a flexible manner.
Next, a second embodiment will be explained with reference to
The difference between the second embodiment and the first embodiment is that the searching unit 13 has a function of conducting a search in structured documents by using both a query input by a user and a search query converted by the converting unit 12 and rearranging the structured documents found in the search in an appropriate order.
A schematic procedure of the process according to the second embodiment will be explained below. First, the input unit 11 receives a search query input by a user and forwards the received search query to the converting unit 12. Having received the search query from the input unit 11, the converting unit 12 converts the search query by using the conversion rules 20 stored in the conversion rule DB 15 and forwards the converted search query and the input search query to the searching unit 13. The searching unit 13 conducts a search on constituent elements of structured documents, based on the structured document indexes 30 stored in the structured document index DB 16 by using both the converted search query and the input search query received from the converting unit 12 and forwards a search result to the output unit 14. The output unit 14 presents the received search result to the user.
Next, the converting unit 12 will be explained further in detail. The converting unit 12 according to the second embodiment is different from the converting unit 12 according to the first embodiment in that the conversion rules 20 include weights for adjusting scores that are used when a search is conducted in structured documents by using a search query converted according to the conversion rules 20.
For example, the converting unit 12 according to the second embodiment receives, from the input unit 11, a search query in which the search target element specifying portion is “/YOUYAKU J (=summary J)”; the query sentence portion is “SHIZEN GENGO SHORI (=natural language processing)”; and the search method specifying portion is “TF-IDF search with Japanese words”. The converting unit 12 then converts the received search query into a search query in which the search target element specifying portion is “/YOUYAKU E (=summary E)”; the query sentence portion is “natural language processing”; and the search method specifying portion is “TF-IDF search with English words”, by using the conversion rules 20 shown in
Next, the searching unit 13 will be explained further in detail. By using the converted search query including the weight and the input search query that have been received from the converting unit 12 as well as the structured document indexes 30, the searching unit 13 conducts a search in structured documents and forwards a result to the output unit 14.
Next, the searching unit 13 processes the query sentences in the two types of search queries received from the converting unit 12, in correspondence with the search methods (step S22). In the present example, a stemming process is performed on the converted query sentence “natural language processing” so that “natural”, “language”, and “process” are extracted as search words. Also, a morphological analysis is performed on the query sentence “SHIZEN GENGO SHORI (=natural language processing)” that has been input by the user so that “SHIZEN (=natural)”, “GENGO (=language)”, and “SHORI (=processing)” are extracted as search words.
Subsequently, the searching unit 13 checks the structures (i.e., the elements) that are used as the search targets for the two types of search queries (step S23). In the present example, it is understood that the structures (i.e., the elements) being the search targets are “/YOUYAKU E (=summary E)” and “/YOUYAKU J (=summary J)”.
After that, the searching unit 13 conducts a search for a document that contains information suitable for the query sentence within the target structure (i.e., the target element) for each of the two types of search queries (step S24). When the search is conducted in the structured documents 1, 2, and 3 shown in
In the next step, the searching unit 13 rearranges the search results in an appropriate order based on the scores thereof (step S25). According to the second embodiment, each of the documents is scored by using the TF-IDF method. As the TF, the frequency indicating how often a word in question appears in the search target element is used. As the IDF, for simplicity, 1/DF (Document Frequency: the number of documents in which a word in question appears) is used. In this situation, for example, it is assumed that “SHIZEN” is considered as the same word as its translated equivalent “natural”; “GENGO” is considered as the same word as its translated equivalent “language”; and “SHORI” is considered as the same word as its translated equivalent “processing”. Based on this assumption, the score of the document 1 is expressed as below:
(TF-IDF of the word “SHIZEN”)+(TF-IDF of the word “GENGO”)+(TF-IDF of the word “SHORI”)=1*1/3+1*1/3+1*1/3=1
The score of the document 2 is expressed as below:
(TF-IDF of the word “natural”)+(TF-IDF of the word “language”)+(TF-IDF of the word “process”)=1*1/3+1*1/3+1*1/3=1
The score of the document 3 is expressed as below:
(TF-IDF of the word “SHIZEN”)+(TF-IDF of the word “GENGO”)=1*1/3+1*1/3=0.67
In addition, the searching unit 13 applies the weight “0.8” for adjusting the score to the document 2 that is the search result from the converted search query. As a result of this process, the score of the document 2 is further expressed as below:
1*0.8=0.8
As a result of the processes described above, the scores of the documents found in the search can be expressed as below:
the score of the document 1>the score of the document 2>the score of the document 3
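The score calculation and the rearrangement at step S25 can be reproduced with the following sketch. The TF and DF values are taken directly from the expressions above; the dictionary names are hypothetical, each Japanese word is represented by its translated equivalent as assumed above, and a weight of 1.0 (that is, no adjustment) is assumed for the documents 1 and 3.

```python
# DF of each word, as used in the expressions above (1/DF serves as the IDF).
DOCUMENT_FREQUENCY = {"natural": 3, "language": 3, "process": 3}

# TF of each word within the search target element of each document.
TERM_FREQUENCY = {
    1: {"natural": 1, "language": 1, "process": 1},  # SHIZEN, GENGO, SHORI
    2: {"natural": 1, "language": 1, "process": 1},
    3: {"natural": 1, "language": 1},                # SHIZEN, GENGO
}

# Weight for adjusting the score: 0.8 for the result of the converted
# search query (document 2); 1.0 is assumed for the other documents.
WEIGHT = {1: 1.0, 2: 0.8, 3: 1.0}


def score(doc_id: int) -> float:
    tf = TERM_FREQUENCY[doc_id]
    raw = sum(count / DOCUMENT_FREQUENCY[word] for word, count in tf.items())
    return raw * WEIGHT[doc_id]


scores = {doc_id: score(doc_id) for doc_id in TERM_FREQUENCY}
ranking = sorted(scores, key=scores.get, reverse=True)  # [1, 2, 3]
```

The resulting scores are approximately 1, 0.8, and 0.67 for the documents 1, 2, and 3, respectively, so that `ranking` lists the documents in the order 1, 2, 3.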
Finally, the searching unit 13 obtains main text information of the search results from the main text index and forwards the obtained information to the output unit 14, together with the ranking order of the scores (step S26).
The output unit 14 presents the search results together with the ranking order, as shown in
As explained above, according to the second embodiment, the searching unit 13 conducts a search in structured documents by using both a search query input by a user and a search query converted by the converting unit 12 and rearranges the structured documents found in the search in an appropriate order. Thus, it is possible to obtain a search result desired by the user.
In the example shown in
Next, a third embodiment will be explained with reference to
The difference between the third embodiment and the first embodiment is that the converting unit 12 has a function of also converting a presented element specifying portion specified in a search query input by a user.
The difference in a relevant module between the first embodiment and the third embodiment will be explained below.
For example, it is assumed that the input unit 11 receives a search query in which the search target element specifying portion is “/YOUYAKU J (=summary J)”; the query sentence portion is “SHIZEN GENGO SHORI (=natural language processing)”; the search method specifying portion is “TF-IDF search with Japanese words”; and the presented element specifying portion is “/title J”, as a search query that has been input by a user and indicates that “a search should be conducted for a document that contains SHIZEN GENGO SHORI in YOUYAKU J and the title J should be returned as a result”. The input unit 11 forwards the search query to the converting unit 12.
Having received from the input unit 11 the search query in which the search target element specifying portion is “/YOUYAKU J (=summary J)”; the query sentence portion is “SHIZEN GENGO SHORI (=natural language processing)”; the search method specifying portion is “TF-IDF search with Japanese words”; and the presented element specifying portion is “/title J”, the converting unit 12 according to the third embodiment converts the search query by using the conversion rules 20 shown in
As shown in
Among the conversion rules 20, the converting unit 12 looks for a rule that has the same “search target element within input search query” as the search target element specifying portion in the input search query and also has the same “presented element within input search query” as the presented element specifying portion in the input search query. As a result, the converting unit 12 finds the rule of which the ID is “1”.
Next, the converting unit 12 converts the input search query according to the rule of which the ID is “1”. As a result of this process, the search query in which the search target element specifying portion is “/YOUYAKU J (=summary J)”; the query sentence portion is “SHIZEN GENGO SHORI (=natural language processing)”; the search method specifying portion is “TF-IDF search with Japanese words”; and the presented element specifying portion is “/title J” is converted into a search query in which the search target element specifying portion is “/YOUYAKU E (=summary E)”; the query sentence portion is “natural language processing”; the search method specifying portion is “TF-IDF search with English words”; and the presented element specifying portion is “/title E”. The result of the conversion is forwarded from the converting unit 12 to the searching unit 13.
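The rule lookup according to the third embodiment, in which both the search target element and the presented element must match, can be sketched as follows. The dictionary keys and the function name are hypothetical; only the values for the rule of which the ID is “1” are taken from the example above.

```python
from typing import Dict, List, Optional

RULES: List[Dict[str, str]] = [
    {
        "id": "1",
        "input_target": "/YOUYAKU J",     # search target element within input search query
        "input_presented": "/title J",    # presented element within input search query
        "converted_target": "/YOUYAKU E",
        "converted_presented": "/title E",
        "converted_method": "TF-IDF search with English words",
    },
]


def find_rule(target: str, presented: str,
              rules: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    # A rule applies only when BOTH the search target element and the
    # presented element of the input search query match the rule.
    for rule in rules:
        if rule["input_target"] == target and rule["input_presented"] == presented:
            return rule
    return None  # assumption: no rule found means no conversion


rule = find_rule("/YOUYAKU J", "/title J", RULES)
```

In this sketch, `rule` is the rule of which the ID is “1”, and its converted presented element is “/title E”; a query that specifies “/YOUYAKU J” together with a different presented element would find no rule.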
The searching unit 13 conducts a search in structured documents by using the search query received from the converting unit 12 and the structured document indexes 30 and forwards a result to the output unit 14.
The searching unit 13 receives, from the converting unit 12, the search query in which the search target element specifying portion is “/YOUYAKU E (=summary E)”; the query sentence portion is “natural language processing”; the search method specifying portion is “TF-IDF search with English words”; and the presented element specifying portion is “/title E”. When the searching unit 13 conducts a search in documents, for example, as shown in
Finally, the searching unit 13 obtains information subordinate to “/title E” specified in the presented element specifying portion within the search result from the main text index 33 and forwards the obtained information to the output unit 14 as a search result.
The output unit 14 presents an output result, for example, as shown in
As explained above, according to the third embodiment, because the converting unit 12 also converts the presented element specifying portion specified in the search query input by the user, it is possible to output, for the user, an element that is appropriate as a search result.
Additional advantages and modifications will readily occur to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details and representative embodiments shown and described herein. Accordingly, various modifications may be made without departing from the spirit or scope of the general inventive concept as defined by the appended claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
2006-264202 | Sep 2006 | JP | national |