The present application claims priority from Japanese application JP2006-318460 filed on Nov. 27, 2006, the content of which is hereby incorporated by reference into this application.
The present invention relates to a technology of generating a retrieval index to be used for a document retrieving system.
One conventional technology that enables a computer to quickly retrieve a document containing a designated character string is the index-based technology (referred to as the first system). The index in the first system includes (1) an index item that designates a keyword in a document to be retrieved and (2) document identification information that identifies a document having the index item, together with index information that designates the location of the index item in that document. Further, in a document retrieving method that uses such an index, as in the first system, the index items of the documents are managed in a tree structure often called a trie.
A trie is a tree structure generated by sharing, as a common node, a partial character string that is common to the keywords (each referred to simply as a key) included in a set of character strings to be retrieved (the set being referred to as a key set). The trie is used for retrieving an index. The computer decomposes the character string of a term to be retrieved into keys and traces the nodes of the trie with each key. When the trace reaches the last node of the trie, the computer can read the pointer information set to the last node and then read the index information for the term to be retrieved on the basis of the pointer information.
The summary of this trie will be described with reference to
The trie 100 shown in
Herein, when the computer retrieves, along this trie 100, the document number of a document containing the character string “(a-i-ti)” and its character location in the document, the computer executes the following operation.
At first, the computer traces the one-gram node of “(a)”, then the two-gram node of “(i)” following the one-gram node, and then the three-gram node of “(ti)” following the two-gram node. Next, the computer reads the index information 101 about “(a-i-ti)” from a predetermined area of a storage area by referring to the pointer information item 102 (ptr61) set to the last node of “(ti)”. That is, the computer reads the document number (document identification information) 103 of a document having “(a-i-ti)”, that is, “001”, and the character location 104 of “(a-i-ti)” in the document, that is, “21”.
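The tracing operation just described may be illustrated by the following minimal Python sketch. It is not taken from the cited publications; the names `Node`, `insert`, `lookup` and `index_store` are hypothetical, and a dictionary stands in for the predetermined storage area that holds the index information.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple

@dataclass
class Node:
    # Child nodes keyed by one symbol (one "gram") each.
    children: Dict[str, "Node"] = field(default_factory=dict)
    # Pointer information; set only on the last node of a key.
    ptr: Optional[str] = None

def insert(root: Node, key: Sequence[str], ptr: str) -> None:
    """Add a key to the trie and set pointer information on its last node."""
    node = root
    for symbol in key:
        node = node.children.setdefault(symbol, Node())
    node.ptr = ptr

def lookup(root: Node, key: Sequence[str],
           index_store: Dict[str, List[Tuple[str, int]]]) -> Optional[List[Tuple[str, int]]]:
    """Trace the nodes for key, then read the index information via the pointer."""
    node = root
    for symbol in key:
        node = node.children.get(symbol)
        if node is None:
            return None  # the key is not registered in the trie
    if node.ptr is None:
        return None      # reached an inner node, not a last node
    return index_store[node.ptr]

# Hypothetical storage area: pointer -> list of (document number, character location).
index_store = {"ptr61": [("001", 21)]}
root = Node()
insert(root, ("a", "i", "ti"), "ptr61")
result = lookup(root, ("a", "i", "ti"), index_store)
```

Here `result` holds the document number “001” and the character location 21, mirroring the example values used in the description.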
In the following description, the terms “pointer information 102” and “index information 101” are often referred to as the “pointer information item(s) 102” and the “index information item(s) 101”, each of which is connected with each node.
The foregoing operation is disclosed in JP-A-11-143901 and JP-A-59-148922.
In order to make the retrieval of the index information of the document faster when the computer manages the indexes with the foregoing tries, it is possible to increase the size of each index information item and the number of grams in each trie (that is, the number of characters of the partial character string (symbol string) common to the keys). However, if the trie has such a greater number of grams, the trie may no longer fit within the memory capacity. This shortcoming becomes a great obstacle especially when mounting a document retrieving system on an instrument with a small memory capacity such as a portable phone or a DVD (Digital Versatile Disk) player.
It is therefore an object of the present invention to overcome the foregoing shortcoming and provide a method and a device which are arranged to realize a fast document retrieval along a trie even if the method and the device are applied to an instrument with a small memory capacity.
To achieve the foregoing object, according to an aspect of the invention, a computer (a device for retrieving a symbol string) provided with a main storage unit and a secondary storage unit first generates a trie. Then, the computer calculates a total of the required retrieval times of the index information items connected with the nodes composing the generated trie by referring to the required retrieval time of the index information retrieved along the trie. Next, the computer determines whether the calculated required retrieval time of each node is equal to or less than a predetermined threshold value. The computer then generates an index layered node by grouping, as a family with relation to the same parent node, nodes selected from those whose required retrieval time is equal to or less than the predetermined threshold value. A first trie is generated by replacing the nodes to be grouped and the nodes following them with the index layered node, and this first trie is stored in a predetermined area of the main storage unit. The nodes to be grouped and the nodes following them are moved, as a second trie, to a predetermined area of the secondary storage unit. Then, pointer information that designates the storage area of the second trie is set to the index layered node of the first trie. With this arrangement, when the computer retrieves the index information by referring to a symbol string (including a character string) included in the term to be retrieved, it traces the first trie stored in the main storage unit and then accesses the second trie stored in the secondary storage unit. Here, a symbol string means a sequence of symbols whose symbol codes are generated by dividing a one-byte or two-byte character code into units of two bits or four bits.
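As an illustration of the symbol string mentioned above, the following Python sketch divides character codes into two-bit or four-bit symbols. The function name `to_symbols` is hypothetical, and emitting the high-order bits first is an assumption; the description does not specify the order.

```python
from typing import List

def to_symbols(code: bytes, bits: int = 4) -> List[int]:
    """Divide each byte of a character code into symbols of `bits` bits.

    Assumption: the high-order bits of each byte come first.
    """
    if bits not in (2, 4):
        raise ValueError("bits must be 2 or 4")
    mask = (1 << bits) - 1
    symbols = []
    for byte in code:
        # Walk from the high-order group down to the low-order group.
        for shift in range(8 - bits, -1, -bits):
            symbols.append((byte >> shift) & mask)
    return symbols

one_byte_as_4bit = to_symbols(b"\x41", bits=4)      # one-byte code 0x41
one_byte_as_2bit = to_symbols(b"\x41", bits=2)
two_byte_as_4bit = to_symbols(b"\x82\xa0", bits=4)  # a two-byte code
```

With four-bit symbols, each node of a trie has at most 16 children, and with two-bit symbols at most 4, regardless of the character set.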
As described above, the symbol string retrieving device according to one aspect of the invention keeps the trie layered as the first trie and the second trie and stores them in the main storage unit and the secondary storage unit, respectively. Hence, even if the instrument (such as a computer) has a main storage unit (such as a memory) of small capacity, a trie of a large size may be provided in the instrument. That is, the symbol string retrieving device can retrieve a document along the trie at high speed. Further, when generating the first trie, the symbol string retrieving device groups nodes in the first trie as a family with relation to their parent node. Hence, the nodes of the first trie stored in the main storage unit may be reduced in number. That is, the reduced size of the first trie makes it easier to provide the trie even in a computer whose main storage unit (such as a memory) has a small capacity. Moreover, in the first trie, the nodes to be grouped as a family with relation to the parent node are restricted to the nodes, and the nodes following them, for which the total of the required retrieval times of the index information items is equal to or less than the predetermined threshold value. That is, for the nodes for which the total of the required retrieval times of the index information items is more than the threshold value, the symbol string retrieving device can reach the index information immediately without passing through the second trie. This arrangement makes it possible to improve the efficiency of retrieval with the trie.
According to the present invention, even an instrument with a small memory capacity can retrieve a document at high speed along the trie.
The other objects and methods of achieving the objects will be readily understood in conjunction with the description of embodiments of the present invention and the drawings.
Other objects, features and advantages of the invention will become apparent from the following description of the embodiments of the invention taken in conjunction with the accompanying drawings.
Hereafter, the best modes of carrying out the present invention (referred to as the embodiments) will be described with reference to the appended drawings.
As shown in
The display (or an output unit) 201 displays the retrieved result supplied by the CPU 203. The keyboard (or an input unit) 202 is used for inputting commands for registering and retrieving text 206 and a term to be retrieved (often referred to as a retrieval term). The CPU 203 executes the programs to be discussed below. Those programs are executed to register an index and retrieve a keyword to be retrieved. The main storage unit 209 temporarily stores the programs for registering and retrieving an index, data to be inputted or outputted, and so forth. The secondary storage unit 205 stores the data and the programs.
The secondary storage unit 205 is provided with a disk cache (not shown). This disk cache is used for copying part of data recorded on a storage unit with a slow access speed like a harddisk drive so that the read of the data may be made faster. This disk cache is composed of a semiconductor memory like a RAM (Random Access Memory) included in the secondary storage unit 205. Further, the main storage unit 209 is also composed of the semiconductor memory like a RAM. The secondary storage unit 205 is composed of a harddisk drive (HDD) or a flash memory.
The secondary storage unit 205 stores a system control program 212 that controls the overall system 200, a document registration control program 210 and an index generating and registering program 213, both of which function as a registration program, and a retrieval control program 211 and an index retrieving program 221, both of which function as a retrieving program. Those programs are read out to the main storage unit 209 and executed under the control of the CPU 203.
Herein, the summary of each of the foregoing programs will be described below.
The system control program 212 controls an input and output to be executed by a user through the display 201 and the keyboard 202. Further, the program 212 controls the execution of the other programs as well.
The document registration control program 210 is a program that controls the index generating and registering program 213.
The index generating and registering program 213 is arranged to have a trie initializing program 214, an index information generating program 215, and an index layering program 216. The trie initializing program 214 is a program which initializes trie(s). The execution of this trie initializing program 214 through the CPU 203 leads to the realization of the function of the trie initializing unit claimed in a claim. The index information generating program 215 is a program that generates the index information 207 (to be discussed below). The index layering program 216 is a program that layers the index, that is, divides the trie into two layers.
This index layering program 216 is arranged to have an index layered node generating program 217, an index retrieval time comparing program 218, an adjacent partial character string retrieving program 219, and an index layered node dividing program 220.
The index layered node generating program 217 is a program that generates an index layered node (to be discussed later in detail). The execution of the index layered node generating program 217 through the CPU 203 leads to the realization of the function of an index layered node generating unit claimed in a claim.
The index retrieval time comparing program 218 is a program that compares the required retrieval time of the index information 207 with a target retrieval time (to be discussed later in detail). The execution of the index retrieval time comparing program 218 through the CPU 203 leads to the realization of the function of the index retrieval time comparator claimed in a claim.
The adjacent partial character string retrieving program 219 is a program that searches the nodes having the same parent node (that is, the twin nodes) in the trie. The execution of the adjacent partial character string retrieving program 219 through the CPU 203 leads to the realization of the function of the adjacent partial symbol string retrieving unit claimed in a claim.
The index layered node dividing program 220 is a program that divides the index layered node if the size of the lower trie (the second trie) of the layered tries exceeds the predetermined threshold value.
Further, the index retrieving program 221 is composed of an upper partial character string retrieving program 222 and a lower partial character string retrieving program 223. The upper partial character string retrieving program 222 is a program that retrieves the upper trie (the first trie) of the layered tries. The lower partial character string retrieving program 223 is a program that retrieves the lower trie (the second trie) of the layered tries. The execution of the index retrieving program 221 through the CPU 203 leads to the realization of the function of the index retrieving unit claimed in a claim.
The secondary storage unit 205 stores the text 206 that is the document data and the index information 207 of the text 206. Further, a lower partial character string storage area 208 for storing the second trie is secured in the secondary storage unit 205.
The details of the foregoing programs will be set forth in the sections of describing the registering process and the retrieving process included in this embodiment.
The process for registering the document data (the text 206) inputted by the user is executed by the document registration control program 210, which is executed by the system control program 212 run by the CPU 203.
In turn, the index generating and registering program 213 will be described by using the PAD (Program Analysis Diagram) shown in
At first, the CPU 203 shown in
Next, the CPU 203 starts the index information generating program 215 so that the program 215 generates the index information 207 and stores the index information 207 in the secondary storage unit 205 (S301). In particular, the CPU 203 extracts from the text 206 stored in the secondary storage unit 205 a predetermined partial character string, a document number (a document identification information) 227 belonging to the text 206, and its character location (appearing location information) 228, generates the index information 207, and then stores the index information 207 in the secondary storage unit 205.
For example, the CPU 203 starts the index information generating program 215. The program 215 is executed to generate from the text 206 of “ . . . . (a-i-ti) . . . ” of the document number “001” the index information item 207 that designates the character string of (a-i-ti)” is included in the document of the document number “001” and “21” is the character location of the head character (a)” of the character string (a-i-ti)” in the document. Then, the program is also executed to store the generated index information item 207 in the secondary storage unit 205. Further, the CPU 203 measures the retrieval time required for retrieving the index information item 207 (required retrieval time) with respect to each index information item 207 and then adds the required retrieval time to the corresponding index information item 207.
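The generation of index information items in the step S301 may be pictured with the following Python sketch. It is an illustration under stated assumptions, not the program 215 itself: the function name `build_index_information` is hypothetical, every run of `gram` consecutive symbols is taken as a partial character string, and character locations are counted from 0 (the description does not state the base). The measurement of the required retrieval time is omitted.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Key = Tuple[str, ...]
Posting = Tuple[str, int]  # (document number, character location of the head symbol)

def build_index_information(doc_number: str, symbols: Sequence[str],
                            gram: int = 3) -> Dict[Key, List[Posting]]:
    """Collect, for each partial string of `gram` symbols, where it appears."""
    index: Dict[Key, List[Posting]] = defaultdict(list)
    for pos in range(len(symbols) - gram + 1):
        key = tuple(symbols[pos:pos + gram])
        index[key].append((doc_number, pos))
    return dict(index)

# Hypothetical text of document "001", written as romanized symbols.
text = ["a", "i", "ti", "ke", "n"]
index_information = build_index_information("001", text)
```

Each entry of `index_information` corresponds to one index information item: the key is the partial character string, and the value lists the document number and the location of its head symbol.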
Next, the CPU 203 starts the index layering program 216. Then, the CPU 203 executes the process for layering the index on the basis of the index information 207 generated by the index information generating program 215 (S302). This process for layering the index will be described later in detail with reference to
In turn, the trie initializing program 214 will be described in detail by using the PAD shown in
At first, the CPU 203 shown in
Then, the CPU 203 sets to each last node of the trie the pointer information of the index information item 207 corresponding with the character string (S402).
Herein, the trie generated by the trie initializing program 214 operated by the CPU 203 will be described with reference to
As illustrated in
For example, in the trie 501 shown in
Though the description is left out in
In this pre-setting, the CPU 203 sets the required retrieval time of the index information item 207 connected with the last node to the last node of the trie 501 (for example, the three-gram node of the trie shown in
For example, consider the case that the nodes of (a)” to (n)” are connected as the three-gram node with the two-gram node of (a)” in the trie 501 shown in
Though in
In turn, the index layering program 216 and the index retrieval time comparing program 218 will be described in detail with the PAD shown in
At first, the CPU 203 reads the trie generated by the trie initializing program 214 from the trie storage area 226 of the main storage unit 209. At the same time, the CPU 203 sets initial values of the variables (total, M, N, L, P) to be used for running the index layering program 216. Herein, the CPU 203 sets total=0, M=1, N=1, L=1, and P=1 as the initial values (S600).
This variable “total” is used for calculating a total value of the required retrieval times set to the nodes of the trie. The variable “M” is used for counting the number of the nodes each required retrieval time of which is equal to or more than the target retrieval time (which will be simply referred to as the nodes of the longer required retrieval time). The variable “N” is used for counting the number of processed adjacent nodes. The variable “L” is used for counting the number of processed nodes each required retrieval time of which is less than the target retrieval time (which will be simply referred to as the nodes of the shorter required retrieval time). The variable “P” is used by the variable “total” for counting the number of the nodes of the shorter required retrieval time. The target retrieval time is a threshold value to be used so that the CPU 203 may determine if the concerned node is grouped as a family with relation to a parent node. This target retrieval time is stored in the predetermined area of the main storage unit 209.
Next, the CPU 203 starts the adjacent partial character string retrieving program 219. The program 219 is executed to search the adjacent nodes and count the number of the nodes (S601). At first, the CPU 203 counts the number of the one-gram nodes in the trie. That is, the CPU 203 counts the number of twin nodes with the 0-gram node (not shown) of the trie as a parent node. For example, the CPU 203 counts the one-gram node of (a)” in the trie shown in
Then, the CPU 203 determines if the value of the variable “N” is equal to or less than the value counted in the step S601 (S602). If the CPU 203 determines in the step S602 that it is, the CPU 203 goes to a step S603.
Next, the CPU 203 selects one of the adjacent nodes which have not been processed yet (S603). For example, the unprocessed node of (a)” is selected from the one-gram nodes of (a)” to (wa)”.
Turning back to the step S602, if the variable “N” exceeds the value counted in the step S601, the operation goes to a step S607. That is, when the CPU 203 finishes the layering of all the nodes the required retrieval times of which are less than the target retrieval time (the nodes of the partial character string the required retrieval times of which do not exceed the target retrieval time), the CPU 203 goes to the step S607.
After the CPU 203 selects the node in the step S603, the CPU 203 reads the required retrieval time set to the selected node (S604). For example, the CPU 203 reads the required retrieval time set to the one-gram node of “(a)” in the trie 501 shown in
At first, the CPU 203 determines if the required retrieval time set to the node selected in the step S603 of
If the required retrieval time set to the node selected in the step S603 is equal to or more than the target retrieval time (Yes in the step S700 of
Afterwards, the CPU 203 puts the variable “P” to “0” and the variable “total” to “0” (S702) and then goes to the step S606. That is, the CPU 203 determines that the nodes of the longer required retrieval time (the nodes of the partial character strings of the longer required retrieval time) are not to be grouped as a family with relation to a parent node and shifts its operation to the adjacent node. For example, when the required retrieval time set to the one-gram node of (a)” in the trie shown in
On the other hand, when the required retrieval time set to the node selected in the step S603 (See
Then, the CPU 203 causes the index retrieval time comparing program 218 to start so that it is determined if the variable “total” to which the required retrieval time is added reaches the target retrieval time (S704). If the variable “total” with an addition of the required retrieval time is made equal to or more than the target retrieval time (Yes in S704), the CPU 203 determines if the value of the variable “P” exceeds 1 (S705). If the variable “P” exceeds 1 (Yes in S705), that is, if another node of the partial character string of the shorter required retrieval time is left in the adjacent nodes, the operation of the CPU 203 goes to the step S706. For example, when the CPU 203 adds the required retrieval time “1.0” set to the one-gram node of (i)” to the variable “total”, if the added value becomes equal to or more than the target retrieval time and another node of the partial character string of the shorter required retrieval time (for example, the one-gram node of (a)”) is left in the adjacent nodes, the CPU 203 goes to the step S706. On the other hand, when the variable “P” is equal to or less than 1 (No in S705), the CPU 203 goes to the step S606 of
If the variable “total” to which the required retrieval time is added is still less than the target retrieval time (No in S704), the CPU 203 increments the value of the variable “P” (S709) and then goes to the step S605 of
In the step S706, the CPU 203 starts the index layered node generating program 217. Then, the CPU 203 groups the nodes of the shorter required retrieval time as a family with relation to a parent node and layers the trie through the grouped nodes. The process of grouping the nodes as a family with relation to a parent node and layering the trie to be executed by the index layered node generating program 217 will be described later in detail with reference to
Next, the CPU 203 starts the index layered node dividing program 220 (S707). Then, the CPU 203 divides the grouped nodes and the layered trie. The division of the grouped nodes and the layered trie will be described later in detail with reference to
Then, the CPU 203 puts the value of the variable “P” to “0” and the value of the variable “total” to “0” (S708). Then, the CPU 203 shifts its operation to the step S606 of
Turning back to
At first, the CPU 203 determines if the variable “L” is equal to or less than the variable “M” (the number of the nodes of the partial character strings of the longer required retrieval time+1) (S607). Herein, when the variable “L” is equal to or less than the variable “M”, the CPU 203 selects one node that is not processed yet from the nodes of the partial character strings of the longer required retrieval time (S608). For example, when the one-gram node of (i) in the trie 501 shown in
Then, the CPU 203 increments the value of the variable “L” (S609) and searches the nodes following the node selected in the step S608 (S610). For example, the CPU 203 searches the two-gram node following the one-gram node of “(u)” in the trie 501 shown in
On the other hand, if no following node exists, the CPU 203 goes back to the step S608, in which the CPU 203 starts the process of the node that is not processed yet. That is, in the trie 501 shown in
In turn, the index layered node generating program 217 will be described in detail through the use of the PAD shown in
The CPU 203 reads the nodes that are to be grouped as a family with relation to a parent node (that is, the partial character strings of the shorter required retrieval time) from the main storage unit 209 and generates the index layered node in which those nodes are grouped as a family with relation to a parent node (S800).
For example, when all the nodes other than the two-gram nodes of (a)” and (i)” (that is, the two-gram nodes of (u)” to (n)”) in the trie 501 shown in
Further, the CPU 203 copies the nodes to be grouped as a family with relation to a parent node and the nodes connected therewith into a working area 225. Then, the CPU 203 deletes the nodes to be grouped and the nodes connected therewith from the trie and then puts the index layered node in the place where the nodes that are to be grouped are located. That is, the nodes that are grouped and the nodes connected therewith are replaced with the index layered node. Next, the CPU 203 deletes the nodes as described above and stores in the upper partial character string storage area 224 the trie with the index layered node located therein as the first trie (S801).
For example, in the trie 501 shown in
The foregoing operation of the CPU 203 makes it possible to keep the number of nodes and the size of the generated first trie small. Hence, the document registering and retrieving system 200 may be provided with the trie even if the capacity of the main storage unit 209 of the system is small.
Further, the CPU 203 layers the nodes connected with the index information items 207 of the shorter required retrieval time but does not layer the nodes connected with the index information items 207 of the longer required retrieval time. Hence, when retrieving an index information item 207 of the shorter required retrieval time, the retrieving operation of the CPU 203 passes through the second trie stored in the secondary storage unit 205, while when retrieving an index information item 207 of the longer required retrieval time, the retrieving operation goes directly from the first trie stored in the main storage unit 209 to the index information item 207 without passing through the second trie. This operation makes it possible to improve the efficiency of retrieving the index information items 207 throughout the whole system.
Next, the CPU 203 generates the second trie connected with the index layered node generated in the step S800 and then stores the second trie in the lower partial character string storage area 208 shown in
After the storage area of the second trie is defined as described above, the CPU 203 sets, to the index layered node that serves as the connector to the second trie, the pointer information item that designates the storage area of the second trie.
For example, in the step S802, the CPU 203 reads from the working area 225 the two-gram nodes of (u)” to (n)” of the trie shown in
When the CPU 203 retrieves the index information item 906, the foregoing operation makes it possible to jump from the index layered node of the first trie to the second trie (or the root of the second trie) following the index layered node and then reach the index information item 906.
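The layering of the steps S800 to S802 and the resulting jump from the first trie to the second trie may be sketched in Python as follows. This is a simplified illustration, not the program 217: it groups every sibling whose required retrieval time is below the target into a single index layered node, whereas the embodiment accumulates the variable “total” while visiting the adjacent nodes. The names `Node`, `LAYERED`, `layer_children`, `find` and the dictionary `secondary` (standing in for the lower partial character string storage area 208) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Sequence

@dataclass
class Node:
    children: Dict[str, "Node"] = field(default_factory=dict)
    time: float = 0.0                 # required retrieval time set on the node
    index_ptr: Optional[str] = None   # pointer to an index information item
    second_ptr: Optional[str] = None  # pointer to a second trie (index layered node only)

LAYERED = "<other>"  # hypothetical label of the index layered node

def layer_children(parent: Node, target: float,
                   secondary: Dict[str, Node], area: str) -> None:
    """Replace the short-time children of parent with one index layered node."""
    short = {s: c for s, c in parent.children.items() if c.time < target}
    if len(short) < 2:
        return  # nothing worth grouping
    secondary[area] = Node(children=short)  # root of the second trie
    for symbol in short:
        del parent.children[symbol]         # remove the grouped nodes from the first trie
    parent.children[LAYERED] = Node(time=sum(c.time for c in short.values()),
                                    second_ptr=area)

def find(root: Node, key: Sequence[str], secondary: Dict[str, Node]) -> Optional[str]:
    """Trace the first trie, jumping to the second trie at an index layered node."""
    node = root
    for symbol in key:
        nxt = node.children.get(symbol)
        if nxt is None:
            layered = node.children.get(LAYERED)
            if layered is None:
                return None
            node = secondary[layered.second_ptr]  # jump to the root of the second trie
            nxt = node.children.get(symbol)
            if nxt is None:
                return None
        node = nxt
    return node.index_ptr

# Hypothetical trie: (a) -> (i) -> {(ti), (tu), (nu), (ke)} with made-up retrieval times.
root = Node()
i_node = root.children.setdefault("a", Node()).children.setdefault("i", Node())
for sym, t in [("ti", 5.0), ("tu", 4.0), ("nu", 0.5), ("ke", 0.3)]:
    i_node.children[sym] = Node(time=t, index_ptr="ptr_" + sym)
secondary: Dict[str, Node] = {}
layer_children(i_node, target=1.0, secondary=secondary, area="ptr331")
```

After the call, the node of “(i)” keeps the children “(ti)” and “(tu)” and gains one index layered node, while “(nu)” and “(ke)” are reachable only through the second trie.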
After the foregoing process, the CPU 203 causes the index layered node dividing program 220 to divide the index layered node according to the size of the second trie.
In turn, the index layered node dividing program 220 will be described in detail by using the PAD shown in
At first, the CPU 203 of
Herein, if the size of the second trie is equal to or less than the capacity of the disk cache of the secondary storage unit 205 (No in the step S1000), the CPU 203 does not divide the index layered node, while if the size of the second trie is more than the capacity of the disk cache (Yes in the step S1000), the CPU 203 reads the index layered node, stored in the upper partial character string storage area 224, onto the working area 225 and divides the index layered node (S1001). In the step S1001, the divided index layered nodes are put back to the upper partial character string storage area 224 shown in
In the step S1001, the number of divisions should be as small as possible within the range in which the size of the second trie following each divided index layered node is equal to or less than the capacity of the disk cache. That is, the division in the step S1001 preferably makes the size of each divided second trie equal to or less than the capacity of the disk cache while keeping the number of divided second tries as small as possible. This is because each division increases the number of divided second tries and accordingly the number of index layered nodes in the first trie, thereby making the first trie larger.
Then, the CPU 203 reads the second trie stored in the storage area 208 onto the working area 225 and divides the second trie according to the division of the index layered node in the step S1001 (S1002). Next, the CPU 203 puts the root of the second trie in each of the divided second tries and then stores the result in the storage area 208.
After the storage area of the divided second tries is defined, the CPU 203 sets the pointer information item for the storage area of the second trie to the index layered node divided in the step S1001 (S1003).
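One possible way to choose the division of the steps S1001 and S1002 is sketched below in Python. It is an assumption-laden illustration, not the program 220: the function name `divide_layered_node` is hypothetical, the sub-tries under the second trie are kept in their sibling order and divided into contiguous ranges, and the size of a divided second trie is taken as the plain sum of its sub-trie sizes (the root added to each divided second trie is ignored).

```python
from typing import List, Sequence, Tuple

def divide_layered_node(subtrie_sizes: Sequence[Tuple[str, int]],
                        cache_capacity: int) -> List[List[str]]:
    """Split ordered sub-tries into as few contiguous groups as possible,
    each with a total size not exceeding the disk cache capacity."""
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for symbol, size in subtrie_sizes:
        if size > cache_capacity:
            raise ValueError("a single sub-trie exceeds the cache capacity")
        # Close the current group when adding this sub-trie would overflow it.
        if current and current_size + size > cache_capacity:
            groups.append(current)
            current, current_size = [], 0
        current.append(symbol)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Hypothetical sub-trie sizes in k, with a disk cache capacity of 6 k.
sizes = [("nu", 2), ("ne", 3), ("no", 2), ("ha", 4)]
divided = divide_layered_node(sizes, cache_capacity=6)
```

Filling each group greedily before opening the next yields the smallest number of contiguous groups, which matches the stated preference for keeping the number of index layered nodes in the first trie small. Each resulting group would correspond to one divided index layered node with its own pointer information item.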
Herein, the dividing process of the index layered node will be described in detail with reference to
For example, in the first trie 1100 shown in
Hence, the CPU 203 divides the second trie 1102 so that the size of the second trie 1102 is equal to or less than 6 k and accordingly divides the index layered node 1101.
For example, the CPU 203 divides the three-gram index layered node 1101 “other than (ti) and (tu)” into two index layered nodes, that is, the index layered node 1200 (“(a)” to “(mu)”) and the index layered node 1201 (“(me)” to “(n)”), as shown in
In particular, as shown in the graph of
The foregoing division of the index layered node executed by the CPU 203 allows the size of the second trie to be equal to or less than the capacity of the disk cache located in the secondary storage unit 205. Hence, the CPU 203 can retrieve the index information items 207 through the disk cache at high speed.
Next, the procedure by which the CPU 203 retrieves the index information through the index generated by the foregoing process will be described. The retrieval of the index information item 207 concerning the retrieval term inputted by a user is executed when the CPU 203 causes the system control program 212 to start the retrieval control program 211. The retrieval control program 211 in turn starts the index retrieving program 221.
The index retrieving program 221 will be described in detail by using the PAD shown in
At first, the CPU 203 divides the inputted retrieval term into consecutive character strings (S1400). Herein, the number of characters in each divided character string is equal to or less than the gram number (predetermined length) of the index. For example, if the term to be retrieved is “(a-i-nu-jin)”, since the index shown in
Next, the CPU 203 continuously executes the following process of S1402 to S1404 for each of the divided character strings of the term to be retrieved (S1401). For example, if the term of (a-i-nu-jin)” is divided into two character strings of (a-i-nu) and (jin)_”, the process of S1402 to S1404 is executed twice.
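The division of the retrieval term in the step S1400 may be sketched in Python as follows. The function name `split_term` is hypothetical, and the sketch leaves the final, shorter piece unpadded, which may differ from how the embodiment represents it.

```python
from typing import List, Sequence, Tuple

def split_term(symbols: Sequence[str], gram: int = 3) -> List[Tuple[str, ...]]:
    """Cut the term into consecutive pieces of at most `gram` symbols."""
    return [tuple(symbols[i:i + gram]) for i in range(0, len(symbols), gram)]

pieces = split_term(["a", "i", "nu", "jin"], gram=3)
```

For the four-symbol term of the example, `pieces` contains two character strings, so the process of S1402 to S1404 would run twice.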
Then, the CPU 203 starts the upper partial character string retrieving program 222. Afterwards, the CPU 203 traces the first trie about the divided character string and reads the pointer information item of the second trie set to the end node of the first trie (S1402). By this operation, the CPU 203 retrieves the character string (upper partial character string) included in the first trie from the divided character string and reads the pointer information item of the lower partial character string (character string included in the second trie) following the upper partial character string.
For example, the CPU 203 traces the one-gram node of (a)”, the two-gram node of (i)”, and the three-gram node of “other than (ti) and (tu)” on the first trie 900 shown in
Next, the CPU 203 starts the lower partial character string retrieving program 223. In succession, based on the pointer information item of the second trie read in the step S1402, the CPU 203 accesses the second trie. Then, the CPU 203 traces the nodes of the second trie and reads onto the working area 225 the index information item 207 designated by the pointer information item (pointer information item of the index information) set to the end node of the second trie (S1403).
For example, based on the pointer information item “ptr331” of the second trie set to the three-gram node of “other than (ti) and (tu)” of the first trie 900 shown in
Next, the CPU 203 extracts the document number 227 and the character location (location information) 228 including the concerned character string from the read index information item 207 and then stores them onto the working area 225 (S1404).
For example, the CPU 203 extracts the document number “001” and the character location “21” including “(a-i-nu)” stored in the index information item of “(a-i-nu)” shown by the reference number 907 of
The CPU 203 executes the foregoing process for each of the divided character strings of the term to be retrieved. Specifically, after the process of the character string “(a-i-nu)” is finished, the CPU 203 executes the same process for the character string “(jin)_”. That is, the CPU 203 extracts the document number and the character location (location information) of the document including the character string “(jin)_” and stores them onto the working area 225.
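The two-stage lookup of steps S1402 to S1404 may be pictured with the following minimal sketch. It assumes that both tries are held as nested dictionaries in memory. The marker `OTHER`, the function `lookup_chunk`, and the key `idx_a_i_nu` are hypothetical names introduced only for this illustration; the pointer name “ptr331”, the document number “001”, and the character location “21” follow the example above.

```python
OTHER = object()  # stands for a grouped node such as "other than (ti) and (tu)"

# First trie: upper partial character strings ending in a second-trie pointer.
first_trie = {"a": {"i": {OTHER: "ptr331"}}}
# Second tries, addressed by pointer: lower partial character strings.
second_tries = {"ptr331": {"nu": "idx_a_i_nu"}}
# Index information: lists of (document number, character location).
index_information = {"idx_a_i_nu": [("001", 21)]}


def lookup_chunk(chunk, first_trie, second_tries, index_information):
    node, depth = first_trie, 0
    # S1402: trace the first trie until a pointer (a string here) is reached.
    while isinstance(node, dict):
        if depth < len(chunk) and chunk[depth] in node:
            node = node[chunk[depth]]
            depth += 1
        elif OTHER in node:
            node = node[OTHER]  # grouped node; the character is resolved below
        else:
            return []
    # S1403: follow the pointer and trace the second trie with the rest.
    node = second_tries.get(node)
    if node is None:
        return []
    for ch in chunk[depth:]:
        if not isinstance(node, dict) or ch not in node:
            return []
        node = node[ch]
    if isinstance(node, dict):
        return []  # the chunk ended before an end node was reached
    # S1404: read the document numbers and character locations.
    return index_information.get(node, [])


postings = lookup_chunk(["a", "i", "nu"], first_trie, second_tries, index_information)
```

In this sketch the first trie stays small because every three-gram other than the explicitly listed ones shares one grouped node, and the remaining character is resolved only after the pointer is followed.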
Upon completion of extracting the location information of all the character strings, the CPU 203 extracts the location information items in the same locational relation from the location information of each character string stored in the working area 225 (S1405). That is, the CPU 203 retrieves the location information of the character strings listed in the same locational relation as the range of the retrieval terms and outputs the location information.
For example, the CPU 203 extracts the document number “001” and the character location “21” for the location information of “(a-i-nu)”. Further, though not shown, the CPU 203 extracts the document number “001” and the character location “24” for the location information of “(jin)_”. In this case, both of the character strings have the same document number, and the character string “(jin)_” (the head character “(ji)” is the 24th) is located to follow the character string “(a-i-nu)” (the head character “(a)” is the 21st). That is, both of the character strings are listed in the same locational relation as the retrieval term. Hence, the CPU 203 can retrieve the information that the character string “(a-i-nu-jin)” is located at the character location “21” or later in the document of the document number “001”.
The foregoing operation allows the CPU 203 to obtain the location information of the retrieval term in the document.
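The extraction of step S1405 may be sketched as follows. This sketch assumes that “the same locational relation” means that each divided character string begins exactly where the preceding one ends; the function name `match_positions` and its arguments are hypothetical.

```python
def match_positions(chunk_postings, chunk_lengths):
    """Return (document number, start location) pairs for the whole term.

    chunk_postings[i] lists (document number, character location) pairs
    for the i-th divided character string; chunk_lengths[i] is its length
    in characters.
    """
    if not chunk_postings:
        return []
    later = [set(p) for p in chunk_postings[1:]]
    hits = []
    for doc, start in chunk_postings[0]:
        offset = 0
        matched = True
        for length, postings in zip(chunk_lengths, later):
            offset += length  # expected start of the next chunk
            if (doc, start + offset) not in postings:
                matched = False
                break
        if matched:
            hits.append((doc, start))
    return hits


# Values from the example: "(a-i-nu)" at 21 and "(jin)_" at 24 in document "001".
result = match_positions([[("001", 21)], [("001", 24)]], [3, 1])
```

For the example values, the second character string is expected at 21 + 3 = 24 in document “001”, which matches its recorded location, so the term is reported at location 21.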
In the document registering and retrieving system according to the second embodiment, whether a certain node is to be grouped is determined based on the size of the index information 207 (the total size of the index information) instead of the required retrieval time of the index information 207.
As shown in
The trie initializing program 214A is executed to add to each node of the trie the information of the size of the index information 207 (the total size of the index information) following the node.
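As a rough illustration of this size annotation, the following sketch sums, for each node, the sizes of all index information items at or below the node. The dictionary layout, the field names `index_size`, `children`, and `total_size`, and the byte counts are assumptions for this sketch only.

```python
def annotate_total_size(node):
    """Store in node["total_size"] the summed index information size
    of the node and of every node below it, and return that sum."""
    total = node.get("index_size", 0)
    for child in node.get("children", {}).values():
        total += annotate_total_size(child)
    node["total_size"] = total
    return total


# Hypothetical trie fragment; the sizes are made-up byte counts.
trie = {
    "children": {
        "a": {
            "children": {
                "i": {"index_size": 120, "children": {}},
                "u": {"index_size": 30, "children": {}},
            }
        }
    }
}
annotate_total_size(trie)
```

After the call, the node for “a” carries the value 150, which is the kind of per-node total that a later size comparison could consult.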
Further, the index layering program 216A causes the index information size comparing program 218A to compare the size of the index information (the total size of the index information) of one node with that of another node and determines, based on the compared result, whether the concerned node is to be layered in the index.
The procedure of the index layering program 216A will be described with reference to
The CPU 203 selects a node in the step S1603 and then reads the size of the index information item set to the selected node (S1604). For example, the CPU 203 reads the size of the index information item 207 set to the one-gram node of “(a)” of the trie 501 shown in
At first, the CPU 203 determines if the size of the index information item 207 set to the node selected in the step S1603 is equal to or more than a predetermined threshold value (that is, the threshold value of the size of the index information item) (S1700 shown in
If the size of the index information item set to the node selected in the step S1603 is equal to or more than the predetermined threshold value (the predetermined threshold value of the index information) (Yes in the step S1700), the process from S1701 to S1702 is executed. This process is similar to the process of S701 to S702 shown in
On the other hand, if in the step S1700 the size of the index information item set to the node selected in the step S1603 is less than the threshold value (No in the step S1700), the CPU 203 adds the size of the index information item set to the node selected in the step S1603 to the variable “total” (S1703).
Then, the CPU 203 causes the index information size comparing program 218A to determine whether the variable “total” to which the size of the index information item is added is equal to or more than the predetermined threshold value (S1704). If the variable “total” to which the size of the index information item is added is equal to or more than the foregoing predetermined threshold value (the predetermined threshold value of the index information) (Yes in the step S1704), it is determined whether the value of the variable “P” is 1 or more (S1705). If the variable “P” exceeds 1 (Yes in the step S1705), that is, if another node with the size of the partial character string being less than the threshold value (referred to as the node of the smaller character string) is adjacent to the concerned node, the process goes to the step S1706. On the other hand, if the variable “P” is 1 or less (No in the step S1705), the CPU 203 causes the process to go to the step S1606 shown in
If the variable “total” to which the size of the index information item is added is less than the foregoing predetermined threshold value (the predetermined threshold value of the index information) (No in the step S1704), the CPU 203 increments the variable “P” (S1709) and then causes the process to go to the step S1606 shown in
In the step S1706, the CPU 203 causes the index layered node generating program 217 to start. Then, the CPU 203 groups the nodes of the smaller character string as a family, and the trie is layered in relation to this node (S1706). The subsequent process of S1707 to S1708 is similar to the process of S707 to S708 shown in
The process of S1607 shown in
As described above, the use of the size (the total size) of the index information item 207 makes it possible for the CPU 203 to generate the retrieval-efficient trie.
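The idea of grouping nodes by the accumulated size of their index information may be pictured with the simplified sketch below. It does not reproduce the exact branch structure of steps S1700 to S1709. It assumes a greedy pass over sibling nodes in order, and it assumes that a pending run of small nodes is closed whenever a large node is met so that each group stays contiguous. The function name `group_small_nodes` and the sizes are hypothetical.

```python
def group_small_nodes(sibling_sizes, threshold):
    """Greedily group adjacent sibling nodes by index information size.

    sibling_sizes is a list of (label, size) pairs in sibling order.
    A node whose size reaches the threshold stays on its own; smaller
    neighbors are accumulated until their total reaches the threshold.
    """
    groups, pending, total = [], [], 0
    for label, size in sibling_sizes:
        if size >= threshold:
            if pending:  # assumption: close the open run to keep groups contiguous
                groups.append(pending)
                pending, total = [], 0
            groups.append([label])
            continue
        pending.append(label)
        total += size  # plays the role of the variable "total"
        if total >= threshold:
            groups.append(pending)
            pending, total = [], 0
    if pending:
        groups.append(pending)
    return groups


# Hypothetical sizes for five sibling nodes and a threshold of 40.
groups = group_small_nodes(
    [("a", 50), ("i", 10), ("u", 15), ("e", 30), ("o", 5)], 40
)
```

Here the node “a” remains a node of its own, while “i”, “u”, and “e” are gathered into one group once their combined size reaches the threshold.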
The foregoing embodiments have been described with reference to the case that the nodes in the trie use the Japanese characters of “hiragana”. In place of the characters “hiragana”, the other Japanese characters of “katakana” or “Kanji” may be used instead. Further, if the text 206 includes characters of a language other than Japanese, these characters may be used for the nodes in the trie. FIG. 18 shows the index of this embodiment.
For example, if the text 206 is written in English, the trie generated by the trie initializing programs 214 and 214A executed by the document registering and retrieving systems 200 and 200A includes the nodes each of which corresponds to one alphabetic character as shown in
In the foregoing embodiments, the index information 207 has been the index information of the character string included in the text 206. Instead of the character string, the picture data or the moving image data may be used as the index information.
Further, the document registering and retrieving system 200 or 200A may be arranged to exclude the index layered node dividing program 220. In particular, the system 200 or 200A may be arranged not to divide the index layered node after generating the index layered node.
Moreover, the system 200 or 200A is arranged to have both the index generating and registering program 213 and the index retrieving program 221. Those programs 213 and 221 may be separated from each other. In particular, apart from the computer that causes the index generating and registering program 213 to generate the index, there may be provided another computer that causes the index retrieving program 221 to retrieve the index.
In addition, the secondary storage unit 205 of the system 200 or 200A may be installed outside.
In the foregoing embodiment, one character code may be matched to one gram. For example, for a 2-byte character code, two bytes (16 bits) may be matched to one gram, while for a 1-byte character code, one byte (8 bits) may be matched to one gram. Further, one gram may match to any bit length without being limited by the character code. In this arrangement, for example, in order to register and retrieve the symbol string, the trie may be generated so that the symbol code of four bits or two bits may be set as one gram.
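As an illustration of a gram that is shorter than one character code, the following sketch cuts a byte string into symbols of a fixed bit length. It is restricted to bit lengths that divide eight, which is an assumption of this sketch rather than a limit of the embodiment; the function name `split_bits` is hypothetical.

```python
def split_bits(data: bytes, bits_per_gram: int):
    """Split bytes into grams of bits_per_gram bits, most significant bits first."""
    if bits_per_gram not in (1, 2, 4, 8):
        raise ValueError("this sketch supports only 1, 2, 4, or 8 bits per gram")
    mask = (1 << bits_per_gram) - 1
    grams = []
    for byte in data:
        # Walk from the high bits of the byte down to the low bits.
        for shift in range(8 - bits_per_gram, -1, -bits_per_gram):
            grams.append((byte >> shift) & mask)
    return grams


four_bit = split_bits(b"\xa7", 4)  # two 4-bit grams per byte
two_bit = split_bits(b"\xa7", 2)   # four 2-bit grams per byte
```

Each resulting value could then serve as one node label in the trie, in the same way that one character does in the embodiments above.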
In the foregoing embodiment, the system 200 or 200A is arranged to store the trie connected below the grouped nodes in the lower partial character string storage area 208 in the trie form. Without being limited to this form, for example, the trie may be stored in the secondary storage unit 205 in the B tree form so that the CPU 203 may more easily access the data. Further, in order to reduce the disk capacity, the reduced trie may be stored in the secondary storage unit 205.
The programs included in the foregoing embodiments may be supplied in the computer-readable recording medium (like a CD-ROM) or through a network (like the Internet).
While the present invention has been described with reference to the particular illustrative embodiments, it is not to be restricted by those embodiments but only by the appended claims. It is to be appreciated that those skilled in the art can change or modify the embodiments without departing from the scope and spirit of the present invention.
It should be further understood by those skilled in the art that although the foregoing description has been made on embodiments of the invention, the invention is not limited thereto and various changes and modifications may be made without departing from the spirit of the invention and the scope of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
2006-318460 | Nov 2006 | JP | national |