The present invention generally relates to an information processing apparatus, an information processing method, an information processing program, and a recording medium, and particularly relates to sorting of information to be searched.
A technique for searching electronic data and displaying search results is becoming increasingly important, because the number of search results grows as the amount of information to be searched grows. The information sought is buried in a large number of search results, so that finding it is becoming difficult. As one such search technique, it has been proposed to execute a search based on a search condition set according to an analysis of an input search request and to order the search results by a unit which calculates predetermined scores, for example.
In such a search technique as described above, to increase the speed of the search, words and the like are extracted in advance from the documents to be searched to create an index, and the created index is saved (see Patent Document 1, for example). Patent Document 1 discloses a proposed method of obtaining correct search results when, as the number of documents to be searched increases, the documents are divided into multiple sets of documents and an index is created for each of the multiple sets.
In the above-described technique of calculating the predetermined scores, TF (Term Frequency), which is the number of times a search term or the like included in the specified search condition appears in each document, and DF (Document Frequency), which is the number of documents that include the search term or the like, are used. Creating the index in advance as described above therefore makes it possible to complete a search within a short time period.
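The TF and DF values mentioned above can be illustrated with a minimal sketch. The function name `build_index`, the whitespace tokenization, and the sample documents are assumptions made only for illustration; they are not taken from Patent Document 1 or from the embodiments.

```python
from collections import Counter


def build_index(docs):
    """Return (tf, df) for a mapping of document ID -> text.

    tf[doc_id][term] is the number of times the term appears in that document.
    df[term] is the number of documents that contain the term at least once.
    Tokenization by whitespace is an assumption for this sketch.
    """
    tf = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
    df = Counter()
    for counts in tf.values():
        # Each document contributes at most 1 to the DF of a term.
        df.update(counts.keys())
    return tf, df


# Hypothetical sample documents.
docs = {
    "001": "A B A C",
    "002": "B C",
    "003": "A D D",
}
tf, df = build_index(docs)
```

Because the index is built once in advance, a later search only needs to look up `tf` and `df` rather than rescan the documents.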
Moreover, depending on the search condition, documents to be searched may be limited. For example, for searching a patent document, this includes cases such that, in addition to specifying words in the document, classifying information such as the IPC (International Patent Classification) or an FI (File Index) is set. When the classifying information is set in such a manner as described above, the search using the above-mentioned terms is carried out within the scope of simultaneously-specified classifying information, i.e., within the scope of the limited population.
Patent Document 1: JP2007-233752A
Here, for example, a search using terms is carried out using TF and DF as described above. In one such technique, the calculation treats a term as more important, and gives it a higher score, the smaller its DF is. The DF is pre-registered in the above-described index. In the related art, even when the population is limited, scores are calculated using the DF registered in the index.
However, as described above, if the population is limited and the number of documents and images included in the documents to be searched decreases, the frequency of occurrence of information to be searched changes, so that calculating the scores using the DF which is pre-registered in the index could cause an inaccurate score to be calculated.
Moreover, because image information may be converted to a one-dimensional code sequence so that the scores are calculated using a technique similar to term searching, the above-described problem may arise not only with the DF used in term searching but also in image searching.
According to an embodiment of the present invention, an information processing apparatus is provided which determines, based on fitness for a specified condition, an order of displaying a plurality of to-be-searched information items which items are pre-stored, including a specifying-condition information obtaining unit which obtains specifying-condition information on the specified condition; an index-information obtaining unit which obtains index information which includes information-element inclusion-mode information on to-be-searched information including an information element of the to-be-searched information items and the information element included in any of the to-be-searched information items; a population-limiting information obtaining unit which obtains population-limiting information, which population-limiting information is included in the obtained specifying-condition information, which population-limiting information limits information for which the order is determined of the to-be-searched information items; an index-information modifying unit which modifies the information-element inclusion mode information included in the obtained index information to information on to-be-searched information including the information element of the to-be-searched information items limited by the population-limiting information; and a fitness calculating unit for calculating the fitness based on the information element inclusion mode information associated with an information element corresponding to a specifying information element in the modified index information and an information element included in the specifying-condition information.
Other objects, features, and advantages of the present invention will become more apparent from the following detailed descriptions when read in conjunction with the accompanying drawings, in which:
Descriptions are given next, with reference to the accompanying drawings, of embodiments of the present invention.
The present invention is not limited to the specifically disclosed embodiments, but variations and modifications may be made without departing from the scope of the present invention.
Embodiments according to the present invention are described, referring to
Embodiment 1
Below embodiments of the present invention are described in detail with reference to the drawings.
In the present embodiment, description is provided for an information search system including an information search apparatus for searching patent documents.
The to-be-searched information DB 200 stores patent documents information as information to be searched. In other words, the information to be searched according to the present embodiment is the patent documents information stored in the to-be-searched information DB 200. As shown in
Next, a hardware configuration of the information search apparatus 1 according to the present embodiment is described.
The CPU 10, which is a processing unit, controls operations of the entire information search apparatus 1. The RAM 20, which is a volatile storage medium allowing high-speed reading and writing of information, is used as a work area for the CPU 10 to process the information. The ROM 30, which is a read-only non-volatile storage medium, stores therein programs (e.g., firmware). The HDD 40, which is a non-volatile storage medium allowing reading and writing of information, stores therein an OS (Operating System) and various control programs and application programs, etc. The I/F 50 connects the bus 80 and various hardware units and networks, and controls them. The LCD 60 is a visual user interface for the user to confirm the status of the information search apparatus 1. The operating unit 70 is a user interface (a keyboard or a mouse, etc.) for the user to input information into the information search apparatus 1. As described in conjunction with
In such a hardware configuration as described above, a software controller is configured by a program stored in a storage medium such as the ROM 30, the HDD 40, or an optical disk (not shown) being read into the RAM 20, and operating as controlled by the CPU 10. The software controller and the hardware are combined to form a functional block which implements a function of an information search apparatus 1 according to the present embodiment.
Next, a functional block of the information search apparatus 1 according to the present embodiment is described with reference to
The information input unit 110, which is arranged such that a user operates the information search apparatus 1 to input information into the search controller 100, is implemented by the I/F 50 and the operating unit 70 (shown in
The display unit 130, which is arranged for displaying the operating status of the information search apparatus 1 and the search results, is implemented by the I/F 50 and the LCD 60 (shown in
a) is a diagram showing information stored in the to-be-searched information DB 200. As shown in
a) is a diagram showing information stored in the entry information storage 140. As shown in
An example of
The search controller 100, which is arranged to serve a search function of the information search apparatus 1 according to the present embodiment, has a specifying-condition information obtaining unit 101, a specifying-condition information analyzing unit 102, an entry-information obtaining unit 103, a fitness calculating unit 104, and a calculation result processor 105. The search controller 100 is configured by a program loaded into the RAM 20 (shown in
The specifying-condition information obtaining unit 101 obtains, as specifying-condition information, information input by a user via the information input unit 110 or information input over a network via the network interface 120. The specifying-condition information obtaining unit 101 is configured by a program loaded into the RAM 20 (shown in
With reference to
The specifying-condition information analyzing unit 102 analyzes the specifying-condition information obtained by the specifying-condition information obtaining unit 101, and converts the analyzed information to an information format suited to the mode of calculating the fitness. Moreover, in analyzing the specifying-condition information, the specifying-condition information analyzing unit 102 determines whether a condition limiting the population to be searched (below called population-limiting information) is included in the specifying-condition information. In other words, the specifying-condition information analyzing unit 102 functions as a population-limiting information obtaining unit. The population-limiting information obtaining unit is configured by a program loaded into the RAM 20 (shown in
In the present embodiment, the specifying-condition information analyzing unit 102 detects the population-limiting information. As described in detail below, classifying information is treated as population-limiting information when the classifying information is specified as shown in
Here, modes of analyzing and transforming specifying-condition information by the specifying-condition information analyzing unit 102 are described with reference to
Then, of those words so parsed, the specifying-condition information analyzing unit 102 deletes words which do not have a meaning by themselves, and extracts only words which do have a meaning by themselves. In the present embodiment, words “A”, “B”, “C” and “D” are extracted. The words extracted as shown in
The entry-information obtaining unit 103 obtains entry information from the entry information storage 140. In other words, the entry-information obtaining unit 103 functions as an index information obtaining unit. The index information obtaining unit is configured by a program loaded into the RAM 20 (shown in
The fitness calculating unit 104 calculates the fitness of each document stored in the to-be-searched information DB 200 with respect to the condition specified by specifying-condition information based on converted specifying-condition information input from the specifying-condition information analyzing unit 102 and entry information input from the entry information obtaining unit 103. The fitness calculating unit 104 is configured by a program loaded into the RAM 20 (shown in
The calculation result processor 105 generates fitness-information displaying information for displaying, on the display unit 130, or a display of the client apparatus 2, a fitness list per document calculated by the fitness calculating unit 104. In other words, the calculation result processor 105 functions as a display information generator. The display information generator is configured by a program loaded into the RAM 20 (shown in
Next, an operation of the information search system according to the present embodiment is described with reference to the drawings.
The specifying-condition information transmitted to the information search apparatus 1 is input to the information search apparatus 1 from the network interface 120, and is obtained by the specifying-condition information obtaining unit 101 of the search controller 100 (S702). The specifying-condition information obtaining unit 101 inputs the obtained specifying-condition information to the specifying-condition information analyzing unit 102 (S702). The specifying-condition information analyzing unit 102 obtains the specifying-condition information from the specifying-condition information obtaining unit 101 and analyzes the input specifying-condition information (S703).
In S703, the specifying-condition information analyzing unit 102 converts a normal sentence included in the specifying-condition information as described in
a) and 8(b) are drawings showing analysis results information transmitted to the entry information obtaining unit 103 by the specifying-condition information analyzing unit 102. As shown in
a) is a drawing showing an example of a case such that, of population-limiting analysis results information, population-limiting information is included in the specifying-condition information. If the specifying-condition information includes the population-limiting information, the presence-of-population limiting information becomes information indicating “present”, so that the population-limiting information becomes actual population-limiting information. In the present embodiment, as explained in
On the other hand,
Once the population-limiting analysis results information is obtained from the specifying-condition information analyzing unit 102, the entry-information obtaining unit 103 obtains the entry information from the entry information storage 140 (S705). Then, based on the population-limiting information included in the population-limiting analysis results information obtained from the specifying-condition information analyzing unit 102, the entry-information obtaining unit 103 obtains the narrowed-down documents list information from the to-be-searched information DB 200 (S706). In
As shown in
Upon obtaining the entry information and narrowed-down documents list information, the entry-information obtaining unit 103 modifies the entry information based on an ID of a document included in the narrowed-down documents list information (S707). With reference to
As described in
Once the modification of the entry information is completed, the entry information obtaining unit 103 transmits the modified entry information, which is a result of modification, to the fitness calculating unit 104 (S708). Upon obtaining converted specifying-condition information from the specifying-condition information analyzing unit 102 and obtaining modified entry information from the entry information obtaining unit 103, the fitness calculating unit 104 calculates the fitness of the documents to the specifying condition information based on the respective information items obtained (S709).
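The modification of the entry information in S707 can be sketched as follows. This is a minimal sketch, assuming that the entry information maps each word to the IDs of the documents that include it; the function name `modify_entry_information`, the dictionary layout, and the sample IDs outside "001" to "005" are hypothetical and only serve the illustration.

```python
def modify_entry_information(entries, narrowed_ids):
    """Restrict each word's document list to the narrowed-down population.

    entries: mapping of word -> list of document IDs including the word
             (assumed layout of the entry information).
    narrowed_ids: document IDs in the narrowed-down documents list.
    Returns a mapping of word -> {"doc_ids": [...], "df": int}, where df is
    recounted within the narrowed-down population.
    """
    narrowed = set(narrowed_ids)
    modified = {}
    for word, doc_ids in entries.items():
        kept = sorted(d for d in doc_ids if d in narrowed)
        modified[word] = {"doc_ids": kept, "df": len(kept)}
    return modified


# Hypothetical entry information over a larger collection.
entries = {
    "A": ["001", "002", "006", "008"],
    "B": ["001", "003", "007"],
    "C": ["009"],
}
narrowed = ["001", "002", "003", "004", "005"]
modified = modify_entry_information(entries, narrowed)
```

In this sketch the DF of "A" drops from 4 to 2 once documents outside the narrowed-down list are removed, which is the kind of change the modified entry information is meant to capture.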
Here, a mode of calculating the fitness by the fitness calculating unit 104 in S709 is explained. The document fitness is determined according to the following equation:
$$\text{(Document fitness)} = \sum_{n=1}^{i} \mathrm{Score}_n \qquad (1)$$
Here, $\mathrm{Score}_n$ indicated in equation (1) is the fitness for the search word n in the respective document. Here the search word n is, in this embodiment, the respective keywords of “A” through “D”, as shown in
Moreover, $\mathrm{Score}_n$ is determined according to the following equation (2):
$$\mathrm{Score}_n = \mathrm{Hit}_n \times \mathrm{Weight}_n \qquad (2)$$
Here, $\mathrm{Hit}_n$ shown in equation (2) is the number of hits for the search word n in the respective documents, or TF (Term Frequency). The fitness calculating unit 104 obtains, in S709, $\mathrm{Hit}_n$ information by accessing the to-be-searched information DB 200. Moreover, $\mathrm{Weight}_n$ shown in equation (2) is a weight value for the search word n.
Furthermore, $\mathrm{Weight}_n$ is determined according to the following equation (3):
$$\mathrm{Weight}_n = \log\left(\frac{\text{number of narrowed-down documents}}{\mathrm{DF}_n}\right) \qquad (3)$$
Here, the number of narrowed-down documents in equation (3) is the number of documents included in the narrowed-down documents list information shown in
Upon calculating the fitness to the specifying-condition information of the documents with ID “001” to “005” using equations (1) to (3), the fitness calculating unit 104 transmits the calculation result to the calculation result processor 105 (S710). Based on the calculation result obtained from the fitness calculating unit 104, the calculation result processor 105 generates display information for displaying calculation results, and transmits the generated information via the network interface 120 to the client apparatus 2 (S711). The client apparatus 2 displays the calculation result based on display information received from the information search apparatus 1 (S712).
In the related-art fitness calculating techniques, the DF of the unmodified entry information and the total number of documents stored in the to-be-searched information DB 200 are used to determine $\mathrm{Weight}_n$ regardless of whether the population is limited, so that the calculated fitness may be inaccurate depending on how the population is limited. In contrast, the search controller 100 of the information search apparatus 1 according to the present embodiment determines whether the specifying-condition information includes population-limiting information. If the population-limiting information is included, the entry information is modified based on the population-limiting information, and the DF of the modified entry information and the number of narrowed-down documents are used to determine $\mathrm{Weight}_n$. This makes it possible to more accurately calculate the fitness of the respective documents to the specifying-condition information.
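Equations (1) through (3) can be sketched in code as follows. This is a minimal sketch under stated assumptions: the natural logarithm is used because the base of the logarithm is not specified, a word whose DF is zero in the narrowed-down population is given a weight of zero, and the function names and sample values are hypothetical.

```python
import math


def weight(num_narrowed_docs, df):
    """Equation (3): log(number of narrowed-down documents / DF_n).

    Assumption: a word that appears in no narrowed-down document (df == 0)
    gets weight 0.0, since equation (3) is undefined in that case.
    """
    if df == 0:
        return 0.0
    return math.log(num_narrowed_docs / df)


def document_fitness(hits, df_by_word, num_narrowed_docs):
    """Equations (1) and (2): sum over search words of Hit_n * Weight_n.

    hits: mapping of search word -> number of hits (TF) in one document.
    df_by_word: mapping of search word -> DF from the modified entry information.
    """
    return sum(
        hit * weight(num_narrowed_docs, df_by_word.get(word, 0))
        for word, hit in hits.items()
    )


# Hypothetical values for one document in a narrowed-down population of 5.
hits = {"A": 3, "B": 1}
df_by_word = {"A": 1, "B": 5}
fitness = document_fitness(hits, df_by_word, num_narrowed_docs=5)
```

In this example the word "B" appears in all five narrowed-down documents, so its weight is log(5/5) = 0 and only "A" contributes to the fitness.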
As described above, whether the population-limiting information is included in the specifying-condition information is detected according to the present embodiment. In the present embodiment, a case has been described such that when classifying information such as IPC and FI is set in the specifying-condition information, it is detected as population-limiting information. For information used as classifying information, when using Japanese patent documents, it is also possible to use an F term (File forming term). Moreover, for searching United States patent publications, it is also possible to use a Current US Classification, etc. Such a mode as described above allows easily determining the presence of population-limiting information in the specifying-condition information.
Moreover, an example has been explained in the above embodiment of searching documents whose IPC is “G06F 17/30”. However, searching documents whose IPC is not “G06F 17/30” is also possible. In other words, a condition excluding documents whose IPC is “G06F 17/30” may also be used as population-limiting information. Moreover, multiple items of classifying information may be specified, rather than the single item specified in the above embodiment.
Moreover, a condition other than classifying information which is specified in the specifying-condition information may also limit the population. Such a mode is described with reference to a drawing.
In the example in
Similarly, specifying “never include” for the keyword “E”, which means calculating the fitness only for documents not including the keyword “E”, may be a condition limiting the population. Thus, in this case, the specifying-condition information analyzing unit 102 determines that the population-limiting information is included. In other words, the specifying-condition information analyzing unit 102 obtains, as the population-limiting information, information on an information element specified by the specifying-condition information, wherein documents including the information element are excluded from the search. In such a mode as described above, it is also possible to easily determine the presence of population-limiting information in the specifying-condition information.
Moreover, as shown in
Moreover, not limited to the exemplary modes, any information used for limiting the documents stored in the to-be-searched information DB 200 for which the fitness is calculated according to the above equations (1) through (3) may be detected as population-limiting information to achieve the same effect as the above. For example, in a patent documents search, a bibliographic item, which is additional information added to the documents, may be specified to execute the search. Such specifying of the bibliographic item may be population-limiting information limiting the population to be searched. Such bibliographic items include human information (applicant, inventor, etc.), date information (filing date, publication date, priority date, etc.), and prosecution history. In other words, the specifying-condition information analyzing unit 102 obtains, as the population-limiting information, additional-information specifying information which is specified in the specifying-condition information and which specifies additional information added to the information to be searched. Such a mode as described above also makes it possible to easily determine the presence of population-limiting information in the specifying-condition information.
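The kinds of population-limiting information discussed above can be sketched as a simple filter. This is a minimal sketch, not the embodiment's implementation: the function name `limit_population`, the per-document dictionary layout, the `must_include` parameter, and the sample records (including the second IPC symbol) are assumptions made only for illustration.

```python
def limit_population(docs, ipc=None, must_include=(), never_include=()):
    """Return the sorted IDs of documents that satisfy the limiting conditions.

    docs: mapping of document ID -> {"ipc": str, "words": set of words}
          (assumed layout).
    ipc: if given, keep only documents with this classification.
    must_include: words every kept document has to contain (assumed condition).
    never_include: words no kept document may contain.
    """
    narrowed = []
    for doc_id, doc in docs.items():
        if ipc is not None and doc["ipc"] != ipc:
            continue
        words = doc["words"]
        if any(w not in words for w in must_include):
            continue
        if any(w in words for w in never_include):
            continue
        narrowed.append(doc_id)
    return sorted(narrowed)


# Hypothetical documents.
docs = {
    "001": {"ipc": "G06F 17/30", "words": {"A", "B"}},
    "002": {"ipc": "G06F 17/30", "words": {"A", "E"}},
    "003": {"ipc": "H04N 1/00", "words": {"A", "B"}},
}
by_class = limit_population(docs, ipc="G06F 17/30")
without_e = limit_population(docs, ipc="G06F 17/30", never_include=["E"])
with_b = limit_population(docs, must_include=["B"])
```

A bibliographic item such as an applicant or a filing date could be handled in the same way by adding a further field to each record and a further condition to the filter.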
Next, with reference to
As described above, in the information search system according to the present embodiment, the information processing apparatus which calculates the fitness to the search condition based on the frequency of occurrence of predetermined information in the group of information items to be searched makes it possible to suitably execute the calculation of the fitness even when what is to be searched is limited.
In the above explanation, as explained using
For example, even in image search, image information included in what is to be searched (documents information in the above embodiment) is converted to a one-dimensional code sequence and image information input as a search condition is converted to a one-dimensional code sequence to make it possible to calculate the score in a technique similar to the words search. In this way, the above embodiments are applied and the fitness is suitably calculated not only in the words search but also in other search modes such as the image search.
Moreover, in the above embodiment, an example has been explained of patent documents as documents to be searched. In addition, the above embodiment is applicable for searching a library book. In such a case, the above-described classifying information such as IPC, etc., is replaced by information classifying books, including classifying numbers for Nippon Decimal Classification.
Furthermore, an example has been described in the above embodiment of a to-be-searched information DB 200 which is separately provided from the information search apparatus 1. However, the to-be-searched information DB 200 may be arranged within a storage area inside the information search apparatus 1. Moreover, in the above explanation, an example has been explained of the information search apparatus 1 and the to-be-searched information DB 200 being directly connected as shown in
Similarly, in the above explanation, an example has been explained such that the entry information storage 140 is provided within the information search apparatus 1. Such a mode as described above allows the search controller 100 to quickly obtain entry information, making it possible to reduce the time needed for the search. In addition, the entry information storage 140 may be configured as a different apparatus, for example, as a server connected to a network. In this case, the information search apparatus 1 accesses the entry information storage 140 via the network interface 120 and obtains the above-described entry information.
Moreover, in the above-described explanation, an example has been described of the user operating the client apparatus 2 and utilizing the function of the information search apparatus 1 which functions as a server via a network. In addition, the information input unit 110 and the display unit 130, shown in
Moreover, in the above explanation, an example has been explained of the information search apparatus 1 being a server connected to the client apparatus 2 via the network. In addition, it is also possible for an MFP (Multifunction Peripheral), which is connected to a LAN (Local Area Network) such as an office LAN, to have the same function. Moreover, not only an MFP, but any apparatus which is connected to a network and has the functions of the information search apparatus 1 according to the present embodiment makes it possible to obtain the same effect as what is described in the above.
Embodiment 2
In the present embodiment, an example is explained of adding other elements to the mode explained using equations (1) through (3) in the embodiment 1. For the element with the same letter as the embodiment 1, an element identical with or corresponding to what is in the embodiment 1 is shown, so that the explanation is omitted.
As explained in the equations (1) and (2) in the embodiment 1, $\mathrm{Hit}_n$, which is the frequency of occurrence of the search word n in the respective documents, is used. As a result, a more frequent occurrence of the search word n leads to the calculated fitness of the respective documents becoming correspondingly higher. However, the total length, or the amount of information included, differs from one document to another. A larger amount of information included leads to a correspondingly higher possibility of the occurrence of the search word. Thus, a document which includes a larger amount of information is likely to have a higher calculated fitness, which prevents an accurate calculation of the fitness.
In order to overcome such problems as described above, in the related art techniques, the average information amount of documents information stored in the to-be-searched information DB 200, i.e., the average data length is referenced to adjust the calculated fitness. However, the referenced average data length is an average of all documents information stored in the to-be-searched information DB 200, so that the accuracy of adjusting the fitness is compromised as in the embodiment 1 when the population is limited.
The above problem is explained in further detail using an extreme example. Consider a case in which the average data length after the population limiting is 150 KB and the average data length before the population limiting is 150 MB, and consider a document A with a data length of 100 KB and a document B with a data length of 200 KB. Measured against the average data length after the limiting, the data length of the document A and the data length of the document B differ greatly. Measured against the average data length before the limiting, on the other hand, they differ only by an amount which may be regarded as within an error range. In other words, when the fitness is adjusted based on the average data length before the limiting, only a small adjustment of around the error range is performed, so that it is not possible to suitably adjust the fitness based on the data length. The present embodiment solves such problems as described above.
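The difference between the two references can be made concrete with a short worked calculation. It expresses each data length as a ratio to the reference average, assuming 1 MB = 1000 KB so that 150 MB = 150,000 KB; the use of a ratio is only an illustrative measure, not the adjustment prescribed by the embodiment.

```latex
\begin{align*}
\text{After limiting (average } 150\ \text{KB):}\quad
  & \frac{100}{150} \approx 0.67, \qquad \frac{200}{150} \approx 1.33 \\
\text{Before limiting (average } 150{,}000\ \text{KB):}\quad
  & \frac{100}{150{,}000} \approx 0.00067, \qquad \frac{200}{150{,}000} \approx 0.0013
\end{align*}
```

Against the limited population, the two documents sit clearly on opposite sides of the average (0.67 versus 1.33). Against the whole collection, both ratios are close to zero and differ by less than 0.001, so an adjustment based on that reference barely distinguishes them.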
The documents search system according to the present embodiment represents a mode such that, of the processes explained in
Upon obtaining the narrowed-down documents list information, the entry information obtaining unit 103 obtains the data length of each document information item included in the narrowed-down documents list information (S1104). In other words, the entry information obtaining unit 103 functions as an information amount obtaining unit. The information amount obtaining unit is configured by a program loaded into the RAM 20 shown in
a) is a drawing showing the data lengths of all documents information items stored in the to-be-searched information DB, and their average value. As shown in
Upon obtaining the data length of the respective document information items included in the narrowed-down documents list information, the entry-information obtaining unit 103 modifies the entry information in a manner similar to S707 of the embodiment 1 (S1105). Thereafter, the average data length, which is obtained in S1104 and the entry information which is modified in S1105 are transmitted to the fitness calculating unit 104 (S1106), and the process is completed.
The fitness calculating unit 104, which has obtained the modified entry information and the average data length information from the entry-information obtaining unit 103, calculates the fitness for each document using the calculations of equations (1) through (3). Then, the fitness calculated for each document is adjusted based on the average data length information obtained from the entry information obtaining unit 103. It is possible to use existing techniques for this adjusting process. For example, when the data length of the document to be adjusted for the fitness is determined and the determined data length is longer than the average data length, the calculated fitness is adjusted such that it is reduced based on the difference. On the other hand, when the data length of the document to be adjusted is shorter than the average data length, the calculated fitness is adjusted such that it is increased based on the difference.
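The adjusting process described above can be sketched as follows. Since the text only states that existing techniques may be used, this sketch swaps in a pivoted length normalization of the kind used in BM25-style scoring as one possible choice; the function name `adjust_fitness`, the parameter `b`, and its default value are assumptions and not part of the embodiment.

```python
def adjust_fitness(fitness, doc_length, avg_length, b=0.75):
    """Adjust a calculated fitness by the document's data length.

    The fitness is divided by (1 - b) + b * (doc_length / avg_length), so a
    document longer than the average has its fitness reduced and a shorter
    one has it increased. b in [0, 1] controls the strength of the adjustment
    (assumed parameter; b = 0 disables the adjustment).
    """
    norm = (1.0 - b) + b * (doc_length / avg_length)
    return fitness / norm


# Hypothetical values: average data length of the narrowed-down set is 150 KB.
short_doc = adjust_fitness(10.0, doc_length=100, avg_length=150)
long_doc = adjust_fitness(10.0, doc_length=200, avg_length=150)
average_doc = adjust_fitness(10.0, doc_length=150, avg_length=150)
```

The point of the embodiment is the choice of `avg_length`: it is the average over the documents limited by the population-limiting information rather than over all documents in the to-be-searched information DB 200.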
Here, it is described, in the above embodiment, that the fitness calculating unit 104 adjusts the calculated fitness. However, as the adjusted fitness is the fitness which is actually used, the fitness calculating unit 104 may also be said to calculate the fitness based on the average data length. In this way, in the information search system according to the present embodiment, the average data length of a set of documents limited by the population-limiting information is determined, and the fitness calculated for the respective documents is adjusted based on the average data length.
Moreover, in the above embodiment, a mode has been explained of calculating the average data length by the entry-information obtaining unit 103, which obtains the population-limiting information. However, the population-limiting information may be input to the fitness calculating unit 104, which calculates the average data length, thereby obtaining the same effect as what is described in the above.
Moreover, in the above embodiment, as described in S1104 in
However, when the data amount (i.e., the number of documents) for calculating the average data length is large, the amount of computation required for the calculation of the average data length becomes huge. In such a case, sampling may be used to reduce the amount of computation. In other words, it may be arranged that the entry-information obtaining unit 103 determines, as information on the total of the information amounts included in the respective limited information items to be searched, the average of the information amounts included in documents randomly or non-randomly extracted from the documents included in the narrowed-down documents list information.
Similarly, the fitness calculating unit 104 may use, as a reference, the average data length of documents randomly or non-randomly extracted from documents included in the to-be-searched information DB 200 when adjusting the fitness based on the average data length information obtained from the entry-information obtaining unit 103. Such a mode as described above makes it possible to reduce the amount of computation needed for calculating the average data length. For sampling documents from the narrowed-down documents list information, various schemes are possible for reducing the sampling error. For example, documents included in the narrowed-down documents list information may be sorted based on the respective information amounts, and odd- or even-numbered documents only may be extracted to reduce the calculated average data length error. Such an error reducing scheme using sampling as described above may also be applicable when the fitness calculating unit 104 calculates the average data length of documents included in the to-be-searched information DB 200 for adjusting the fitness based on the average data length information obtained from the entry-information obtaining unit 103.
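The sorted odd- or even-numbered sampling scheme mentioned above can be sketched as follows. This is a minimal sketch; the function name `sampled_average_length` and the sample data lengths are hypothetical.

```python
def sampled_average_length(lengths, use_odd=True):
    """Estimate the average data length from every other document.

    The data lengths are sorted, and only the odd-numbered (1st, 3rd, ...) or
    even-numbered (2nd, 4th, ...) documents are averaged, which halves the
    number of values summed while keeping the sample spread over the whole
    range of lengths.
    """
    if not lengths:
        raise ValueError("no documents in the narrowed-down list")
    ordered = sorted(lengths)
    start = 0 if use_odd else 1  # index 0 is the 1st (odd-numbered) document
    sample = ordered[start::2] or ordered  # fall back if the sample is empty
    return sum(sample) / len(sample)


# Hypothetical data lengths in KB for six narrowed-down documents.
lengths = [120, 80, 200, 100, 150, 250]
odd_estimate = sampled_average_length(lengths, use_odd=True)
even_estimate = sampled_average_length(lengths, use_odd=False)
true_average = sum(lengths) / len(lengths)
```

With these values the odd-numbered sample slightly underestimates and the even-numbered sample slightly overestimates the true average of 150 KB, since each sample takes the smaller or the larger document of each adjacent pair in the sorted order.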
The present application is based on the Japanese Priority Application No. 2008-120482 filed on May 2, 2008, the entire contents of which are hereby incorporated by reference.
Number | Date | Country | Kind |
---|---|---|---|
2008-120482 | May 2008 | JP | national |
Number | Name | Date | Kind |
---|---|---|---|
20020073095 | Ohga | Jun 2002 | A1 |
20020107842 | Biebesheimer et al. | Aug 2002 | A1 |
20040230570 | Hatta et al. | Nov 2004 | A1 |
20050210008 | Tran et al. | Sep 2005 | A1 |
20060074860 | Ishiguro et al. | Apr 2006 | A1 |
20060230031 | Ikeda et al. | Oct 2006 | A1 |
20060248055 | Haslam et al. | Nov 2006 | A1 |
20070208719 | Tran | Sep 2007 | A1 |
20070233659 | Kim | Oct 2007 | A1 |
20090132496 | Chen et al. | May 2009 | A1 |
20090234688 | Masuyama et al. | Sep 2009 | A1 |
Number | Date | Country |
---|---|---|
2003-323457 | Nov 2003 | JP |
2007-233752 | Sep 2007 | JP |
WO 2006115260 | Nov 2006 | WO |
Entry |
---|
Office Action issued Aug. 7, 2012 in Japanese Patent Application No. 2008-120482. |
Number | Date | Country | |
---|---|---|---|
20090276418 A1 | Nov 2009 | US |