The present disclosure relates to subject matter contained in Japanese Patent Application No. 2009-129156 filed on May 28, 2009, which is expressly incorporated herein by reference in its entirety.
1. Field
The present invention relates to a neighbor searching apparatus for a database.
2. Related Art
A multidimensional indexing technique is a technique used for range searching or neighbor searching for a data set represented as points in a feature quantity space, such as feature quantities and component data extracted from multimedia data. This technique involves sectioning a feature quantity space with graphic elements in an inclusion relation in order to improve the efficiency of searching. Examples of the multidimensional indexing technique include R-tree and R*-tree that use a rectangle as a bounding graphic element (referred to as a cell), SS-tree that uses a sphere as a cell, and SR-tree that uses the overlapping part of a sphere and a rectangle as a cell.
Furthermore, a framework that facilitates implementation of multidimensional indexing along an abstract tree has been proposed (for example, Joseph M. Hellerstein, Jeffrey F. Naughton and Avi Pfeffer. “Generalized Search Trees for Database Systems.”, Proc. 21st Int'l Conf. on Very Large Data Bases, Zürich, September 1995, 562-573.).
These indexing techniques are based on the concept that a multidimensional space is hierarchically divided to limit the range of searching, because limiting the range of searching reduces the amount of calculation accordingly. However, in a high dimensional space, the distance from a certain point to its nearest point becomes almost indistinguishable from the distance from the point to its furthest point. This phenomenon, known as “the curse of dimensionality”, poses a problem that the range of searching cannot be limited, and as a result, the required amount of calculation approaches the amount for linear searching. In order to cope with this problem of the high dimensional space, approximate nearest neighbor searching has been studied (for example, Arya, S., Mount, D. M., Netanyahu, N. S., Silverman, R., and Wu, A., “An optimal algorithm for approximate nearest neighbor searching.”, 1994. In Proceedings of the ACM-SIAM symposium on Discrete Algorithms.).
However, the searching system described in the reference can be applied only to a balanced tree, and the searching scheme depends on the framework. Thus, the searching system has a problem that a searching scheme suitable for a given target cannot be selected.
Furthermore, the conventional approximate neighbor searching involves increasing the pruning range to (1+ε) times indiscriminately for every node. However, a large subtree (a subtree having a large number of subordinate points) and a small subtree (a subtree having a small number of subordinate points) differ in importance and search cost.
An object of the present invention is to provide a neighbor searching apparatus that can select an index suitable for each search target.
Another object of the present invention is to optimize the trade-off between the search time and the search accuracy by changing the degree of pruning based on node information (including the size of the bounding region and the number of points in the node).
According to a first aspect of the present invention, a neighbor searching apparatus is proposed. The neighbor searching apparatus comprises: storage means (a storage unit) that stores a meta table containing index-dependent meta data associated with a data structure of each index; database means (a database unit) that, when receiving an instruction from a user, searches for an index associated with the instruction and makes indexing means (an indexing unit) perform processing associated with the instruction using the index-dependent meta data associated with the index; and the indexing means that performs the processing associated with the instruction using the index-dependent meta data based on the instruction from the database means.
According to a second aspect of the present invention, a neighbor searching apparatus is proposed. The neighbor searching apparatus is a neighbor searching apparatus that searches for point data that exists in the proximity of a specified query point, and a search region for the query point is determined depending on the number of subordinate points of each node in such a manner that a search range for a node having a larger number of subordinate points is smaller than a search range for a node having a smaller number of subordinate points.
According to the present invention, a neighbor searching apparatus that can select an index suitable for each search target can be provided.
Furthermore, according to the present invention, the trade-off between the search time and the search accuracy can be optimized by changing the degree of pruning based on node information (including the size of the bounding region and the number of points in the node).
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention, and together with the general description given above and the detailed description of the embodiments given below, serve to explain the principles of the invention.
In the following, embodiments of the present invention will be described with reference to the drawings.
Definitions of key terms used in this specification are given below.
“Multidimensional data (point data)” refers to a piece of data composed of a plurality of values.
“k-neighbor searching” refers to a searching method that searches for k points existing in the proximity of a given point (query).
“Approximate neighbor searching” refers to searching for a neighbor in an approximate manner. The approximate neighbor searching does not always provide the best result but is advantageous in that it is quicker than ordinary neighbor searching.
“Number of subordinate points of a tree node” refers to the number of pieces of point data subordinate to a node including a subtree.
“Number of page accesses” refers to the number of I/Os. The “page” used in this context means a region of a certain size. The number of page accesses is used as an indicator of the performance of a database. This factor is not device-dependent, and the number of I/Os has a greater influence on the length of the processing time of most devices than the amount of calculation.
“Minimal bounding sphere (MBS)” refers to a hypersphere including all the subordinate points of a node.
“Minimal bounding rectangle (MBR)” refers to a hyperrectangle including all the subordinate points of a node.
“SR-tree” refers to a multidimensional index structure that defines the overlapping region of an MBS and an MBR as a bounding region.
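As an illustration of the MBS, MBR and SR-tree definitions above, the following minimal sketch computes both bounding shapes for a set of points and a lower bound on the distance from a query to their overlapping region. All function names are hypothetical and are not taken from this specification. For simplicity the sphere is centered at the centroid of the points, so it bounds the points but is not necessarily the minimal sphere.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def bounding_rectangle(points: Sequence[Point]) -> Tuple[Point, Point]:
    """Return the lower and upper corners of the axis-aligned bounding rectangle."""
    dim = len(points[0])
    lo = tuple(min(p[i] for p in points) for i in range(dim))
    hi = tuple(max(p[i] for p in points) for i in range(dim))
    return lo, hi


def bounding_sphere(points: Sequence[Point]) -> Tuple[Point, float]:
    """Return a centroid-centered bounding sphere (assumption: not necessarily minimal)."""
    dim = len(points[0])
    center = tuple(sum(p[i] for p in points) / len(points) for i in range(dim))
    radius = max(math.dist(center, p) for p in points)
    return center, radius


def dist_to_rectangle(q: Point, lo: Point, hi: Point) -> float:
    """Distance from q to the nearest point of the rectangle (0 if q is inside)."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))


def dist_to_sphere(q: Point, center: Point, radius: float) -> float:
    """Distance from q to the nearest point of the sphere (0 if q is inside)."""
    return max(0.0, math.dist(q, center) - radius)


def dist_to_sr_region(q: Point, points: Sequence[Point]) -> float:
    """Lower bound on the distance from q to the overlap of sphere and rectangle."""
    lo, hi = bounding_rectangle(points)
    center, radius = bounding_sphere(points)
    # The overlap lies inside both shapes, so the larger of the two distances
    # is a valid lower bound on the distance to any subordinate point.
    return max(dist_to_rectangle(q, lo, hi), dist_to_sphere(q, center, radius))
```

Taking the larger of the two distances is what lets the overlapping region prune more aggressively than either shape alone.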
A neighbor searching apparatus according to a first embodiment of the present invention is a system that performs neighbor searching.
The neighbor searching apparatus is an information processing apparatus that comprises a central processing unit (CPU), a main memory (RAM), a read only memory (ROM) and an input/output device (I/O) and optionally an external storage device, such as a hard disk drive, or a system including such an information processing apparatus. For example, the neighbor searching apparatus is a computer, a cellular phone, an HD recorder or a home electric appliance. The ROM or the hard disk drive of the neighbor searching apparatus stores a program, the program is loaded into the main memory, and the CPU executes the program to implement the neighbor searching apparatus.
[1.1.1. Storage Part]
The storage part 10, which corresponds to storage means (or a storage unit) according to the present invention, has a function of storing data used for searching. More specifically, the storage part 10 stores a node table 11, a point table 12 and a meta table 13.
The node table 11 is data (table) that describes node information for indexes.
The point table 12 is data (table) that describes information about in which node each point is included.
Although a node can include point data, it is assumed that only the leaf nodes have point data in this tree data 40. The number of pieces of point data is 28, and point IDs from 1 to 28 are assigned to the 28 pieces of point data. In
The meta table 13 is data (table) that describes meta information for indexes.
The index-dependent meta data is data used by the indexing part 30 to perform neighbor searching or the like. In the following, an example of the index-dependent meta data will be described. Although the index-dependent meta data will be described below on the assumption that the index type is SR-tree, SR-tree is not the only index type that can be used in the present invention, and the searching apparatus 1 according to the present invention can be applied to any scheme that can create an index that allows neighbor searching or the like.
Referring back to
[1.1.2. Database Managing Part]
The database managing part 20, which corresponds to database means (or a database unit) according to the present invention, has a function of processing a data access to the storage part 10 in response to a request from the indexing part 30. That is, the database managing part 20 has only to recognize the data content (the index-dependent meta data 136, for example) of the index as a byte string of a fixed length and does not need to consider or process the data content.
In addition, in response to receiving an instruction from a user, the database managing part 20 uses the index-dependent meta data in the meta table 13 to search for an indexing technique associated with (suitable for) the instruction and makes the indexing part 30 perform a procedure to execute the instruction.
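The treatment of index-dependent meta data as an opaque, fixed-length byte string can be sketched as follows. This is a minimal illustration under stated assumptions: the field layout (sphere center, radius, rectangle corners, each stored as an 8-byte little-endian float) and all function names are hypothetical, chosen only to show how the length can depend on the point dimension alone.

```python
import struct
from typing import Sequence, Tuple


def get_metadata_length(dimension: int) -> int:
    """Byte length of the assumed layout: center (d) + radius (1) + lo (d) + hi (d)."""
    return 8 * (3 * dimension + 1)


def pack_sr_metadata(center: Sequence[float], radius: float,
                     lo: Sequence[float], hi: Sequence[float]) -> bytes:
    """Serialize the assumed SR-tree cell description into a fixed-length blob."""
    d = len(center)
    return struct.pack(f"<{3 * d + 1}d", *center, radius, *lo, *hi)


def unpack_sr_metadata(blob: bytes, dimension: int) -> Tuple[tuple, float, tuple, tuple]:
    """Inverse of pack_sr_metadata; only the indexing side needs to call this."""
    d = dimension
    values = struct.unpack(f"<{3 * d + 1}d", blob)
    return values[:d], values[d], values[d + 1:2 * d + 1], values[2 * d + 1:]


# The database side stores and returns the blob without interpreting it.
meta_table = {}  # hypothetical stand-in for the meta table: index name -> bytes
meta_table["sr_index"] = pack_sr_metadata((1.0, 1.0), 1.5, (0.0, 0.0), (2.0, 2.0))
```

Under this layout the database side only needs the length reported for a given dimension, while the decoding logic stays entirely with the index implementation.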
[1.1.3. Indexing Part]
The indexing part 30, which corresponds to indexing means (or an indexing unit) according to the present invention, has a function of creating the index-dependent meta data and performing searching using the index-dependent meta data.
Specific examples of the procedure performed by the indexing part 30 will be listed below.
(1) Create
This procedure is invoked to create an index on the database. When this procedure is invoked, the created index is returned.
(2) Connect
This procedure is invoked to connect to an index on the database. When this procedure is invoked, the index of the connection destination is returned.
(3) Insert (Index, Id, Point)
A procedure of inserting (id, point) in an index is performed.
(4) Delete (Index, Id)
A procedure of deleting the entry of id from an index is performed.
(5) knnSearch (Index, Query, k, eps)
This is a procedure of performing kNN searching. As a result of this procedure, the k points closest to a query are retrieved using an error coefficient eps and returned.
(6) searchByID (Index, Id)
This is a procedure of returning the point of id.
(7) costKNN (Index)
This is a procedure of estimating and returning the kNN search cost.
(8) getMetadataLength (Dimension)
The indexing part 30 returns the region length of the index-dependent meta data with reference to the point dimension.
(9) Free (Index)
This is a procedure of releasing an index object on a memory.
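The procedures listed above can be pictured as a common interface that every index type implements. The sketch below is only an illustration: the class and method names are hypothetical Python renderings of the listed procedures, the linear-scan class is a stand-in rather than an index described in this specification, and selecting an index by its estimated kNN cost is an assumption about one possible use of costKNN.

```python
import math
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

Point = Tuple[float, ...]


class Index(ABC):
    """Hypothetical interface mirroring Insert, Delete, knnSearch, searchByID, costKNN."""

    @abstractmethod
    def insert(self, point_id: int, point: Point) -> None: ...

    @abstractmethod
    def delete(self, point_id: int) -> None: ...

    @abstractmethod
    def knn_search(self, query: Point, k: int, eps: float) -> List[Tuple[int, float]]: ...

    @abstractmethod
    def search_by_id(self, point_id: int) -> Point: ...

    @abstractmethod
    def cost_knn(self) -> float: ...


class LinearScanIndex(Index):
    """Simplest possible implementation: every point is examined on each search."""

    def __init__(self) -> None:
        self._points: Dict[int, Point] = {}

    def insert(self, point_id: int, point: Point) -> None:
        self._points[point_id] = point

    def delete(self, point_id: int) -> None:
        del self._points[point_id]

    def knn_search(self, query: Point, k: int, eps: float = 0.0) -> List[Tuple[int, float]]:
        # Exact scan; eps is accepted for interface compatibility and ignored.
        ranked = sorted((math.dist(p, query), i) for i, p in self._points.items())
        return [(i, d) for d, i in ranked[:k]]

    def search_by_id(self, point_id: int) -> Point:
        return self._points[point_id]

    def cost_knn(self) -> float:
        # Assumed cost model: one distance computation per stored point.
        return float(len(self._points))


def choose_index(candidates: Dict[str, Index]) -> str:
    """Assumption: pick the index whose estimated kNN cost is lowest."""
    return min(candidates, key=lambda name: candidates[name].cost_knn())
```

Because the caller depends only on the interface, a different index type can be substituted without changing the code that issues the search.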
Referring back to
[1.1.4. Input Part, Output Part]
The input part 40 is a keyboard, a pointing device, a touch panel or the like and is used by the user to input an instruction or other information. The input information includes an index specified to be used, a specified point (query) for searching, and the number k of elements for k-neighbor searching, for example.
The output part 50 is a display, a printer, a speaker or the like and is used to make an inquiry to the user or output the search result to the user.
A second embodiment of the present invention is the neighbor searching apparatus described above that is configured to perform approximate neighbor searching by changing the degree of pruning depending on the size of the node (cell).
A conventional approximate neighbor searching technique applies a uniform approximation coefficient to every node. However, a large subtree (a subtree having a large number of subordinate points) and a small subtree (a subtree having a small number of subordinate points) differ in importance and search cost. That is, from the viewpoint that a large subtree has a large number of subordinate points, the subtree is likely to include a neighbor point but requires a higher search cost because it includes a large number of points. On the other hand, from the viewpoint that a large subtree has a large bounding region, the subtree is not likely to include a neighbor point in a particular part of the large bounding region (the subordinate points can be unevenly distributed). A small subtree has the opposite characteristics.
The neighbor searching apparatus 1 according to this embodiment performs approximate neighbor searching using a search region 1104 for the large subtree and a search region 1103 for the small subtree. If a nearest point to the query point 1100 lies in a search region, the point data 1106 included in the subtree is treated as a target point of approximate neighbor searching. If the nearest point does not lie in a search region, the point data in the subtree is not treated as a target (in other words, the subtree is pruned).
In general, as shown in
In the example shown in
According to this embodiment, approximate neighbor searching is performed by changing a value that determines the size (radius) of the search regions 1103 and 1104 for the large subtree 1101 and the small subtree 1102. The search region is defined as a circle (a hypersphere in a multidimensional space) centered at the query point 1100 and having a radius r. The radius r is determined according to the following formula (Expression 1).
r=(provisional k in the course of searching−distance between neighbor bounding region and query)/(1+ε′) [Expression 1]
Once the approximate neighbor searching process is started, the indexing part 30 acquires a query point q, the number k of points to be searched for and an approximation coefficient ε as user instruction information. The instruction information is input by the user through the input part 40, transmitted to the database managing part 20 and then passed from the database managing part 20 to the indexing part 30.
The indexing part 30 refers to the meta table 13, or more specifically, the index-dependent meta data 136 and denotes the root node (Root) by N (stores the root node as a node N) (step S10). Then, the indexing part 30 arranges the cells in the node N in ascending order of distance from the query point and stores the result as C (step S20).
Then, the indexing part 30 retrieves one cell from the result C. The cell is denoted by C0. Besides, the indexing part 30 deletes the cell C0 from the result C (step S30).
Then, the indexing part 30 calculates ε′ (epsilon prime: referred to as a modified approximation coefficient in order to distinguish it from the approximation coefficient ε) from the approximation coefficient ε.
The following (Expression 2) is a formula for calculating the modified approximation coefficient ε′.
In the above formula, γ is a constant (which can also be given by the query).
ε′ meets a condition that 0≦ε′≦ε.
Therefore, departure from the worst case guarantee for the given approximation coefficient ε does not occur.
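The reasoning behind this guarantee can be spelled out as a short derivation. The symbols are introduced here for illustration only: d_k denotes the provisional distance from the query point q to the k-th candidate, and mindist(C0, q) denotes the distance from q to the nearest point of the cell C0. A cell is pruned only when the comparison of step S40 fails, and because 0≦ε′≦ε this gives:

```latex
\mathrm{mindist}(C_0, q) \;\ge\; \frac{d_k}{1+\varepsilon'} \;\ge\; \frac{d_k}{1+\varepsilon}
\quad\Longrightarrow\quad
d_k \;\le\; (1+\varepsilon)\,\mathrm{mindist}(C_0, q) \;\le\; (1+\varepsilon)\,\lVert p - q \rVert
\quad \text{for every point } p \text{ in } C_0 .
```

In other words, no point in a pruned cell can be closer to the query than the provisional k-th distance divided by (1+ε), which is the same bound obtained when ε itself is used for every cell.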
The modified approximation coefficient ε′ is used to determine the radius r of the search regions 1103 and 1104 according to the following formula (Expression 3).
r=(current provisional k−distance between neighbor bounding region and query)/(1+ε′) [Expression 3]
Then, the indexing part 30 determines whether or not the distance from the query point q to the point of the cell C0 nearest to the query point is smaller than the distance from the query point q to the k-th point data in the search result multiplied by 1/(1+ε′) (step S40).
If it is determined in step S40 that the distance from the query point q to the nearest point of the cell C0 is smaller than the distance from the query point q to the k-th point data in the search result multiplied by 1/(1+ε′) (that is, if YES in step S40), the indexing part 30 designates the node indicated by the cell C0 as a new node N (step S50). Then, the indexing part 30 determines whether or not the new node N is a leaf node (step S60). If it is determined in step S60 that the node N is not a leaf node (that is, if NO in step S60), the indexing part 30 returns to the processing in step S20. On the other hand, if it is determined in step S60 that the node N is a leaf node (that is, if YES in step S60), the indexing part 30 calculates the distance between each piece of the point data in the cell C0 and the query point q and replaces the k-th point data in the previously retrieved point data with any point data closer to the query point than the k-th point data (step S70).
Then, the indexing part 30 sorts the point data that are candidates for neighbor point data in order of distance from the query point (step S80). Then, the indexing part 30 designates the parent node of the current node N as the node N again and designates the set of cells of the parent node as C again (step S90). Then, the indexing part 30 returns to step S30 described above.
If it is determined in step S40 that the distance from the query point q to the nearest point of the cell C0 is not smaller than the distance from the query point q to the k-th point data in the search result multiplied by 1/(1+ε′) (that is, if NO in step S40), the indexing part 30 determines whether or not the current node N is a root node (step S100). If it is determined that the node N is a root node (that is, if YES in step S100), the indexing part 30 ends the approximate neighbor searching process and outputs the first to k-th point data stored at this point as the approximate neighbor search result. On the other hand, if it is determined that the node N is not a root node (that is, if NO in step S100), the indexing part 30 proceeds to step S90 described above and continues the approximate neighbor searching process.
This is the end of the description of an example of the approximate neighbor searching process according to this embodiment.
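The flow of steps S10 to S100 can be approximated by the following recursive sketch. It is a simplification under stated assumptions, not the implementation of the embodiment: cells are bounded by rectangles only, the tree is built by a hypothetical bulk loader, recursion replaces the explicit return to the parent node, and the modified approximation coefficient is supplied as a callable `eps_prime` whose default simply returns ε (the conventional uniform behavior). All names are illustrative.

```python
import heapq
import math
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    lo: Point    # lower corner of the bounding rectangle
    hi: Point    # upper corner of the bounding rectangle
    count: int   # number of subordinate points
    children: List["Node"] = field(default_factory=list)
    points: List[Tuple[int, Point]] = field(default_factory=list)  # leaves only


def build(points: Sequence[Tuple[int, Point]], leaf_size: int = 4) -> Node:
    """Hypothetical bulk loader: split on the widest axis until leaves are small."""
    coords = [p for _, p in points]
    dim = len(coords[0])
    lo = tuple(min(p[i] for p in coords) for i in range(dim))
    hi = tuple(max(p[i] for p in coords) for i in range(dim))
    node = Node(lo, hi, len(points))
    if len(points) <= leaf_size:
        node.points = list(points)
        return node
    axis = max(range(dim), key=lambda i: hi[i] - lo[i])
    ordered = sorted(points, key=lambda ip: ip[1][axis])
    mid = len(ordered) // 2
    node.children = [build(ordered[:mid], leaf_size), build(ordered[mid:], leaf_size)]
    return node


def min_dist(node: Node, q: Point) -> float:
    """Distance from q to the nearest point of the node's bounding rectangle."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for l, h, x in zip(node.lo, node.hi, q)))


def knn_search(root: Node, q: Point, k: int, eps: float,
               eps_prime: Callable[[float, Node], float] = lambda eps, node: eps
               ) -> List[Tuple[float, int]]:
    """Return up to k (distance, id) pairs in ascending order of distance."""
    best: List[Tuple[float, int]] = []  # max-heap via negated distance

    def kth() -> float:
        # Provisional k-th distance; infinite until k candidates are held.
        return -best[0][0] if len(best) == k else math.inf

    def visit(node: Node) -> None:
        if not node.children:  # leaf: update the candidates (cf. steps S70, S80)
            for pid, p in node.points:
                d = math.dist(p, q)
                if d < kth():
                    heapq.heappush(best, (-d, pid))
                    if len(best) > k:
                        heapq.heappop(best)
            return
        # Cells in ascending order of distance from the query (cf. step S20).
        for cell in sorted(node.children, key=lambda c: min_dist(c, q)):
            e = eps_prime(eps, cell)  # must satisfy 0 <= e <= eps
            if min_dist(cell, q) < kth() / (1.0 + e):  # cf. step S40
                visit(cell)
            else:
                break  # leave this node, as in the NO branch of step S40

    visit(root)
    return sorted((-neg_d, pid) for neg_d, pid in best)
```

A size-aware rule can be supplied as any callable that returns a value between 0 and `eps`, for instance one that reads `node.count`. With `eps=0.0` the sketch degenerates to an exact search, which gives a convenient baseline when comparing search time against accuracy.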
From the results shown in
Additional advantages and modifications will readily occur to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details or representative embodiments shown and described herein. Accordingly, various modifications may be made without departing from the spirit or scope of the general inventive concept as defined by the appended claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
2009-129156 | May 2009 | JP | national |