This invention relates to efficiently performing an approximate indexing and similarity search in large collections of high-dimensional data signatures.
While there are many prior proposals for indexing high-dimensional data (for a survey, see H. Samet, Foundations of Multidimensional and Metric Data Structures, Morgan Kaufmann, 2006), they have all been shown to suffer from the so-called dimensionality curse (see R. Bellman: Adaptive Control Processes: A Guided Tour. Princeton Univ. Press (1961)), which means that when the dimensionality of the data goes beyond a certain limit, all search and indexing methods aiming at exact answers to the problem have been shown to perform slower than a sequential scan of the data signature collection. As a result, none of these approaches has been shown to be applicable to large data signature sets.
One paradigm for attacking the dimensionality curse problem is to project high-dimensional data signatures to random lines, which was introduced by Kleinberg (see Two Algorithms for Nearest-Neighbor Search in High Dimensions, Jon M. Kleinberg, 1997) and subsequently used in many other high-dimensional indexing techniques. Such projections have two main benefits. First, in some cases, they can alleviate data distribution problems. Second, they allow for a clever dimensionality reduction, by projecting to fewer lines than there are dimensions in the data.
Fagin et al. presented in their paper, “Efficient similarity search and classification via rank aggregation” (Proceedings of the ACM SIGMOD, San Diego, Calif., 2003), an algorithm called (O)MEDRANK, which projects the data signatures onto a single random line per index and stores the identifiers organized in a B+-tree on a data store. This algorithm is described in US patent application 20040249831.
Since the OMEDRANK algorithm needs B+-trees for its query retrieval, Lejsek et al., in their paper “A case-study of scoring schemes for the PvS-index” (Proceedings of CVDB, Baltimore, Md., 2005), proposed an enhanced version of the OMEDRANK algorithm called the PvS-index, which redundantly saves these B+-trees to disk for fast lookup. The PvS-index suffers, however, from its static nature, which does not support updates as soon as nodes need to be split. Further drawbacks are the limited number of random lines (one line per hierarchy), the inefficient disk storage caused by using multiple B+-trees, and its tight coupling to the OMEDRANK algorithm and the Euclidean distance.
Another strategy in high-dimensional indexing follows the idea of Locality Sensitive Hashing (LSH), published by Indyk et al. in “Similarity search in high dimensions via hashing”, in the Proceedings of VLDB, Edinburgh, 1999 and “Locality-sensitive hashing using stable distributions”, MIT Press, 2006. LSH is not based on a sorted tree structure, but on hashing the data signatures into buckets. The hash function is constructed by projecting each data signature onto a small set of random lines with fixed cardinality. Each of the projections is categorized into buckets and each of these buckets is assigned an identifier. By concatenating all the identifiers of the projections, a hash value is constructed, and all data signatures yielding the same hash value are stored together on the data store.
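For illustration, the LSH key construction described above can be sketched as follows. This is a minimal sketch, not the cited implementation; the names `lsh_key`, `lines` and `bucket_width` are assumptions of this sketch.

```python
import numpy as np

def lsh_key(signature, lines, bucket_width):
    """Build an LSH hash key for one data signature (illustrative sketch).

    `lines` is a (k, d) array of random projection lines and `bucket_width`
    is the fixed bucket length on each line; both names are assumptions
    for illustration, not terminology from the cited publications.
    """
    # Project the signature onto each of the k random lines.
    projections = lines @ signature
    # Quantize each projection into an equal-width bucket identifier.
    bucket_ids = np.floor(projections / bucket_width).astype(int)
    # Concatenating the k bucket identifiers yields the hash keyword.
    return tuple(bucket_ids)
```

Signatures whose projections fall into the same bucket on every line receive the same key and are therefore stored together.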
Joly et al. have presented, in “Content-Based Copy Detection using Distortion-Based Probabilistic Similarity Search” (IEEE Transactions on Multimedia, 2007), a video-retrieval system based on Hilbert space-filling curves for fast high-dimensional retrieval. This method has, however, been tuned specifically for this particular and rather low-dimensional application, and it still needs a sequential scan at the end of query processing.
The present invention, in the present context called the NV-tree (Nearest Vector tree) is a data structure designed to provide efficient approximate nearest neighbor search in very large high-dimensional collections. Specifically, the indexing technique is based on a combination of projections to lines through the high-dimensional space and repeated segmentation of those lines. The resulting index structure is stored on a data store in a manner which allows for extremely efficient search, requiring only one data store access per query data signature.
In essence, it transforms costly nearest neighbor searches in the high-dimensional space into efficient uni-dimensional accesses using a combination of projections of data signatures to lines and partitioning of the projected space.
By repeating the process of projecting and partitioning, data is eventually separated into small partitions or “clusters” which can be easily fetched from the data store with a single data read operation, and which are highly likely to contain all the close neighbors in the collection. In a very high-dimensional space, such “clusters” may overlap. In contrast to the prior-art PvS-index, the present invention is capable of handling any distance metric, as long as the projection preserves some distance information. Therefore, the drawback of Euclidean distance being the only distance measure is eliminated. Furthermore, the present invention provides methods for searching data of any size in constant time. Moreover, in the prior art, the search quality of the PvS-index was highly dependent on the random line generated at the beginning of index creation. The NV-tree greatly improves the search quality by selecting the best line for a given set of data signatures; in the present context, the best line is the line with the largest projection variance.
Also, while the PvS-index needs to sort the projected values after each partitioning step, the NV-tree only sorts the projected values when they are written to the data file, making the creation process considerably more efficient than the PvS-index.
In addition, since the PvS-index had to use many B+-trees, an inefficient disk-storage structure was created, which led to a significant enlargement of the index and a significant startup cost. In fact, the NV-tree can store just the data signature identifiers and, optionally, the location information on the line for every n-th point, which results in a very compact data structure. Furthermore, the NV-Tree also supports non-overlapping partitioning, while the PvS-Index strictly requires such overlaps between the borders of partitions.
While the PvS-index is built on top of the OMEDRANK algorithm, the NV-tree is a general indexing technique for high-dimensional nearest neighbor queries. Since the PvS-index stores only the B+-trees needed for the OMEDRANK in its index, it is not a general data structure.
Compared to prior art on Locality Sensitive Hashing (LSH), the NV-tree is a tree structure while LSH is a flat hash-based structure. LSH projects the data signatures onto a set of random lines and segments each individual line into equal-length buckets. In contrast to the NV-Tree, all projection steps are performed right at the start of the creation/search/update procedure, similarly to the PvS-Index.
Since LSH is an ε-Approximate Nearest Neighbor Search, the width of those buckets (also referred to as the radius) depends on the chosen ε-threshold. The data signature is then assigned an identifier for each line according to the bucket it has been projected to. In the next step, these identifiers are concatenated into a hash keyword (the word's size being equivalent to the number of lines), which identifies the cluster on the data store where a point is stored.
While the NV-Tree always guarantees constant access time to the data store, LSH suffers from the unpredictability of its hash bucket sizes. Individual hash buckets may be empty or may contain many thousands of data signatures, often exceeding the capacity of a single data store access. Furthermore, LSH suffers from the lack of rankings inside its hash buckets. Therefore, its authors suggest loading all actual data signatures referenced in a hash bucket from the data store and calculating the actual distances between them and the query point. For large collections, this leads to additional O(bucket size) random accesses to the data store. In contrast, the NV-Tree does not need to calculate the distances, as it is a purely rank-based solution.
The size of an entry in an LSH hash table on the data store consists of at least one identifier and a control hash value per data signature, while the NV-Tree stores just the sorted identifiers inside a leaf node. In order to look up those identifiers, we insert a key value for every 8th to 64th identifier in the leaf node. This enables a leaf node to store up to 100% more identifiers. A non-overlapping NV-Tree therefore needs only about half the space on a data store compared to an LSH hash table, while an overlapping NV-Tree provides significantly better accuracy of the results but requires a multiple of that space, because its storage requirement is inherently exponential.
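The sparse-key leaf layout described above can be sketched as follows. This is an illustrative sketch only: `key_stride=16` is one arbitrary choice within the 8th-to-64th spacing mentioned in the text, and the function names are assumptions, not part of the invention.

```python
from bisect import bisect_left

def build_leaf(id_proj_pairs, key_stride=16):
    """Illustrative leaf layout: identifiers sorted by projected value,
    with a key stored only for every key_stride-th identifier."""
    pairs = sorted(id_proj_pairs, key=lambda p: p[1])
    ids = [ident for ident, _ in pairs]
    # Sparse key array: (position in leaf, projected value) pairs.
    sparse_keys = [(idx, proj) for idx, (_, proj) in enumerate(pairs)
                   if idx % key_stride == 0]
    return ids, sparse_keys

def locate(sparse_keys, query_proj):
    """Return the (lo, hi) index range of identifiers surrounding
    query_proj; hi is None past the last stored key."""
    positions = [proj for _, proj in sparse_keys]
    j = bisect_left(positions, query_proj)
    lo = sparse_keys[max(j - 1, 0)][0]
    hi = sparse_keys[j][0] if j < len(sparse_keys) else None
    return lo, hi
```

Storing only every k-th key is what frees the space for up to twice as many identifiers per leaf, at the cost of a short in-memory scan between two sparse keys.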
In a first aspect the present invention relates to a method for creating a search tree for data mining with constant search time. The method comprises the steps of: generating a set of isotropic random lines and storing said lines in a line pool, projecting all the data signatures onto a projection line from said pool, building a tree node storing information characterizing said line, segmenting said line into two or more line partitions, and further segmenting said two or more line partitions. The first steps of the method are repeated for each partition until a stop value is reached, and then the data signatures of the current partition are projected to a line. Thereafter the data signatures are sorted and stored.
In another aspect the present invention relates to a method for inserting data signatures into a search tree. The method comprises the steps of: traversing said search tree, projecting data signatures representing said data to be inserted onto a projection line, and selecting one or more paths based on a projection value obtained in the projection step. The first three steps are repeated until said data signature is projected to a value belonging to a partition stored in said data store. The next step involves searching for the location among the pre-sorted data signatures of said partition, and finally the data signature is stored.
In another aspect the present invention relates to a method for storing data from a search tree. The method comprises the steps of: selecting a projection line, projecting data signatures representing the data to be searched for onto the projection line, building a tree node storing information characterizing the partition and the line, and segmenting the line into two or more partitions. Next, the partitions are segmented further and the previous steps are repeated for each partition until a stop value is reached. Then the data signatures of the current partition are projected to a line, the data signatures are sorted and, finally, the data signatures are stored.
In another aspect the present invention relates to a method for deleting a data signature from a search tree. The method comprises the steps of: traversing said search tree, and projecting the data signatures representing said data signatures to be deleted onto a projection line. Next, one or more paths are selected based on a projection value obtained in the second step of the method, and the first steps are repeated until said data signature is projected to a value belonging to a partition stored in said data store. Then the appropriate location among the pre-sorted data signatures of said partition is located, and the data signature is deleted.
In another aspect the present invention relates to a method for data mining a search tree with constant search time. The method comprises the steps of: traversing a search tree to a leaf, retrieving one or more data signatures from said leaf and reading the data pointed to by said data signatures. Next, one or more values are located in said data, one or more data signatures are referenced and the n nearest data signature neighbors are retrieved. Finally, the search is terminated.
In another aspect, a computer program or suite of computer programs is provided, arranged such that, when executed on a processor, said program or suite of programs causes said processor to perform the methods of the present invention. Furthermore, a computer-readable data storage medium is provided for storing the computer program or at least one of the suite of computer programs mentioned above.
In the following section, the NV-tree data structure, tree creation, insertion, deletion and search are described in detail. In order to clarify the technical jargon commonly used in the field of the invention, a definition of terms is provided.
Constant Time:
NV-Tree:
NV-Tree Leaf Node:
Load Factor:
Minimum Load Factor:
Data Signature:
Data Signature Identifier:
Data Mine:
Data Store:
Lp Distance:
Projection:
Projection Line:
General (k-) Nearest Neighbor Search:
ε-Approximate Nearest Neighbor Search:
Contrast filtered Nearest Neighbor Search:
SIFT:
Search Quality:
First, a large set of isotropic random lines is generated and kept in a line pool. When the construction of an NV-tree index starts, all data signatures are considered to be part of a single temporary partition. Data signatures belonging to the partition are first projected onto a single projection line through the high-dimensional space. For best retrieval quality the line with the largest projection variance is chosen from the line pool.
The projected values are then partitioned into disjoint sub-partitions based on their position on the projection line. In case of overlapping NV-Trees, sub-partitions are created for redundant coverage of partition borders. These overlapping partitions may cover just a small part exactly around the partition borders or may grow to fully adjoining overlapping partitions. Strategies for partitioning are described in detail later in the text.
This process of projecting and partitioning is repeated for all the new sub-partitions, which are of smaller cardinality, using a new projection line at each level. A branch can stop at any time, as soon as the number of data signatures in a sub-partition reaches a specified lower limit that fits within a single data store access. When a branch stops, the following steps are performed:
Overall, an NV-tree consists of: a) a hierarchy of small inner nodes, which are kept in memory during query processing and guide the data signature search to the appropriate leaf node; and b) leaf nodes, which are stored on a data store and contain the references (data signature identifiers) to the actual data signatures.
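The construction loop described above can be sketched as follows. This is a minimal, non-overlapping, balanced sketch under stated assumptions: the dict-based node layout, the tiny `LEAF_CAPACITY`, and all names are illustrative choices, not the actual compact node format of the invention.

```python
import numpy as np

LEAF_CAPACITY = 4   # signatures per leaf; tiny value for illustration only
N_PARTITIONS = 2    # fan-out per level; illustrative choice

def best_line(signatures, line_pool):
    """Pick the pool line with the largest projection variance."""
    variances = [np.var(signatures @ line) for line in line_pool]
    return line_pool[int(np.argmax(variances))]

def build_nv_tree(ids, signatures, line_pool):
    """Recursive NV-tree construction sketch (balanced, non-overlapping)."""
    line = best_line(signatures, line_pool)
    proj = signatures @ line
    order = np.argsort(proj)
    if len(ids) <= LEAF_CAPACITY:
        # Leaf: store identifiers sorted by their projected value.
        return {"leaf": [ids[i] for i in order]}
    # Inner node: split the sorted projections into equal-cardinality parts.
    chunks = np.array_split(order, N_PARTITIONS)
    children = [build_nv_tree([ids[i] for i in c], signatures[c], line_pool)
                for c in chunks]
    return {"line": line,
            "borders": [float(proj[c[-1]]) for c in chunks[:-1]],
            "children": children}
```

Every recursion level projects the current partition onto its own best-variance line and partitions the projected values, until a partition fits within a single leaf.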
During query processing, the query data signature first traverses the intermediate nodes of the NV-tree. At each level of the tree, the query data signature is projected to the projection line associated with the current node.
In case of overlapping partitions, the search is directed to the sub-partition whose center point is closest to the projection of the query data signature; otherwise the search follows the partition to which the projection is assigned. This process of projection and choosing the right sub-partition is repeated until the search reaches a leaf partition.
The leaf partition is then read from the data store and the query data signature is projected onto the projection line of the leaf partition. Then the search returns the data signature identifiers which are closest to that projection.
Note that since the leaf partitions have a fixed size, the NV-tree guarantees a query processing time of one data store read regardless of the size of the data signature collection. Larger collections require more projections and therefore deeper NV-trees, but still only a single access to the data store.
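The in-memory descent described above can be sketched as follows. The dict-based node layout ({'line', 'borders', 'children'} for inner nodes, {'ids'} for leaves) is an assumption of this sketch for a non-overlapping tree, not the actual compact format.

```python
import numpy as np
from bisect import bisect_right

def nv_search(node, query):
    """Descend a non-overlapping NV-tree sketch to its leaf.

    At each level the query signature is projected onto the node's
    line and the matching sub-partition is chosen by comparing the
    projected value against the stored partition borders. Reaching
    the leaf corresponds to the single data store read."""
    while "children" in node:
        proj = float(node["line"] @ query)
        node = node["children"][bisect_right(node["borders"], proj)]
    return node["ids"]
```

Only the projections and border comparisons happen in memory; the leaf fetch is the one data store access, independent of collection size.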
The cost of query processing is the sum of the following three factors:
The NV-Tree is composed of a hierarchy of small intermediate nodes that eventually point to much larger leaf nodes. Each intermediate node contains the following four arrays:
All leaf nodes are stored on a large data store and each leaf node is at most the size of a single data store read. The leaf nodes on the data store contain an array of data signature identifiers sorted by their projected value.
A partitioning strategy is likewise needed at every level of the NV-tree.
The Balanced partition strategy partitions data based on cardinality. Therefore, each sub-partition gets the same number of data signatures, and eventually all leaf partitions are of the same size. Although node fan-out may vary from one level to the other, the NV-tree becomes balanced as each leaf node is at the same height in the tree.
The Unbalanced partitioning strategy uses distances instead of cardinalities. In this case, sub-partitions are created such that the absolute distance between their boundaries is equal. All the data signatures in each interval belong to the associated sub-partition. With this strategy the projections lead to a significant variation in the cardinalities of sub-partitions. To implement the Unbalanced strategy, the standard deviation sd and mean m of the projections along the projection line are calculated. Then a parameter α is used to determine the partition borders as . . . , m−2αsd, m−αsd, m, m+αsd, m+2αsd, . . . .
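The Unbalanced border computation above can be sketched directly; the parameter `half_span` (how many borders to lay on each side of the mean) is an illustrative assumption of this sketch.

```python
import numpy as np

def unbalanced_borders(projections, alpha, half_span=2):
    """Partition borders for the Unbalanced strategy:
    ..., m - 2*alpha*sd, m - alpha*sd, m, m + alpha*sd, m + 2*alpha*sd, ...
    where m and sd are the mean and standard deviation of the
    projected values along the line."""
    m = float(np.mean(projections))
    sd = float(np.std(projections))
    return [m + k * alpha * sd for k in range(-half_span, half_span + 1)]
```

Because the borders are equidistant in projected space rather than in cardinality, the resulting sub-partitions can hold very different numbers of signatures, which is exactly the trade-off against the Balanced strategy.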
With both strategies, each line can be partitioned into up to 100 sub-partitions, which tends to produce shallow and wide NV-Trees, while partitioning into very few (2-10) partitions per line yields deep and narrow trees.
Furthermore, the two strategies can be flexibly interleaved; the resulting structure is called a hybrid NV-Tree.
Overlapping is an additional feature that may be applied flexibly for any node and for both partitioning strategies. It creates additional partitions covering the area around the partition borders.
Inserting or deleting a data signature from the NV-tree is performed according to the following process:
In the case when a partition is full, i.e. when, once a new item has been added, it would take more than one data read access to retrieve the partition, it needs to be split to accommodate more values. All data signatures in the partition are partitioned into sub-partitions, each sub-partition is projected and then appended to the data file. Additionally, the reference to the old sub-partition on disk is replaced with a reference to a new node in the NV-tree, which references the newly created sub-partitions.
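The split step described above can be sketched as follows; the function name, the dict-free return shape and the fan-out parameter are assumptions of this sketch, not the actual on-disk procedure.

```python
import numpy as np

def split_leaf(ids, signatures, line_pool, n_parts=2):
    """Sketch of the leaf-split step: when a leaf overflows, its
    signatures are re-projected onto the best-variance line from the
    pool and partitioned into sub-partitions that replace the old
    leaf (a new inner node then references the sub-partitions)."""
    variances = [np.var(signatures @ line) for line in line_pool]
    line = line_pool[int(np.argmax(variances))]
    order = np.argsort(signatures @ line)
    parts = [[ids[i] for i in chunk]
             for chunk in np.array_split(order, n_parts)]
    return line, parts
```

In the real structure the new sub-partitions are appended to the data file and the old leaf reference is swapped for a reference to the new inner node.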
In the case that the number of data signatures in a partition x drops below a certain threshold (usually a small fraction of the leaf node's total storage capacity), it has to be merged with its sibling leaf nodes and, when some siblings are inner nodes, also with the children of those inner nodes. All data signatures of the current node and its siblings are loaded from the data store, a best line for the whole set is found and, if the set does not fit within a single leaf node, it is again split into sub-partitions. Afterwards, the parent node of x is reorganized or replaced and all old leaf nodes are marked as obsolete.
In the case that the siblings of a leaf node to be merged are together ancestors of more than 15-100 other leaf nodes, the merge step might refrain from re-indexing this whole large sub-tree and instead distribute the remaining signatures in that leaf node among the children of a neighboring partition.
Efficient and effective search of contrast filtered nearest neighbors can be improved by using more than one NV-Tree. The result sets might simply be merged with a naive aggregation algorithm (successively popping off the highest-ranked identifier from each result list), or the aggregation might take into account in how many result sets a data signature is found, which implies a higher ranking in the final result.
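The second, occurrence-aware aggregation can be sketched as follows. The scoring rule (sort by number of result sets, ties broken by best rank position) is one plausible reading of the text, offered as an illustrative assumption rather than the patented scheme.

```python
from collections import Counter

def aggregate(result_lists, k):
    """Merge ranked result lists from several NV-Trees (sketch).

    Identifiers found in more result sets receive a higher final
    ranking; ties are broken by the best (lowest) rank position
    any list assigned to the identifier."""
    votes = Counter()
    best_rank = {}
    for results in result_lists:
        for rank, ident in enumerate(results):
            votes[ident] += 1
            best_rank[ident] = min(best_rank.get(ident, rank), rank)
    ranked = sorted(votes, key=lambda i: (-votes[i], best_rank[i]))
    return ranked[:k]
```

A signature returned by all trees thus outranks one returned by only a single tree, regardless of its position within that single list.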
The implementations of the invention being described can obviously be varied in many ways. Such variations are not to be regarded as a departure from the spirit and scope of the present invention, and all such modifications as would be obvious to one skilled in the art are intended to be included within the scope of the following claims.
The nearest neighbor search capability of the NV-tree has been compared to its closest competitors: Locality Sensitive Hashing and the PvS-Index. High emphasis was placed on a fair choice of the parameters: the factor α of the NV-Tree versus the radius and word size of LSH.
For the experiments a collection of 179.4 million SIFT (Scale Invariant Feature Transform) data signatures was used, extracted from an archive of about 150 000 high-quality press photos. In order to evaluate the retrieval quality of the different high-dimensional index structures, several transformed versions of images from the collection were created. The transformations include rotation, cropping, affine distortions and convolution filters.
The inner product in Euclidean space was chosen as a commonly used projection function. It has to be noted, however, that the NV-Tree has also been evaluated with L1 and L2-distance as projection function. Out of these transformed images, a set of 500 000 query data signatures were created which were evaluated in the following four different setups:
All three NV-Tree setups consistently returned a result set of 1000 nearest neighbor candidates, guaranteeing good result quality as can be seen from Table 1. For LSH, no guarantee can be given on how many neighbors are retrieved. With the chosen setup it yielded a minimum of 47 neighbors and a maximum of 156,256, with a median of 465.
Preliminary studies performed on this and other large data signature collections have shown that the definition of a nearest neighbor search is only meaningful in the context of contrast filtered nearest neighbor search. This can be theoretically justified by the results published in Beyer et al., “When is nearest neighbor meaningful”, Lecture Notes in Computer Science 1540:217-235, 1999, showing that a nearest neighbor must be significantly closer to a query point than most of the other points in the dataset in order to be considered meaningful. A contrast-based definition of nearest neighbors also best approximates the human notion of neighbors: in sparsely populated areas (as in the countryside), neighbors can be several kilometers of absolute distance away from each other, while in densely populated areas (as in cities), the absolute distance between neighbors is just a few meters.
In the presented experiment, the contrast ratio is defined by d(n100, q)/d(ni, q) > 1.8, where n100 refers to the hundredth nearest neighbor retrieved by an exact nearest neighbor search (a linear scan). Evaluating this definition on the 500,000 query data signatures, a total of 248,212 data signatures surpassed the contrast filter.
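The contrast filter above can be expressed as a short function. The fallback to the largest available distance for lists shorter than 100 entries is an illustrative assumption of this sketch.

```python
def contrast_filtered(distances, threshold=1.8):
    """Keep candidate distances d(n_i, q) satisfying the contrast
    ratio d(n_100, q) / d(n_i, q) > threshold, where d(n_100, q) is
    the distance to the hundredth exact nearest neighbor (here the
    100th smallest value, or the largest available for short lists,
    which is an illustrative choice)."""
    d100 = sorted(distances)[min(99, len(distances) - 1)]
    return [d for d in distances if d100 / d > threshold]
```

A candidate passes only if it is markedly closer to the query than the hundredth exact neighbor, mirroring the meaningfulness criterion of Beyer et al.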
With only one data store access, the overlapping NV-Tree performs best in terms of search time and still achieves acceptable recall performance. Its major drawback is the huge demand on disk space. This can be slightly reduced by using the whole bandwidth of a single disk access (32 4 KB pages instead of 6 4 KB pages for the given experimental setup) and avoiding one level of overlapping. In order to achieve the same retrieval quality, the leaf node was itself structured as a small tree containing another two levels of non-overlapping projections and, finally, a sorted array of identifiers, where up to 4 such arrays were touched during the aggregation.
The space requirement can be reduced significantly by removing the overlapping. Non-overlapping NV-Trees do not deliver as good result quality; therefore, an aggregation of at least 3 trees is needed to get acceptable recall. In order to achieve comparable recall quality, LSH needs at least 12 hash tables, which cause at least 12 accesses to the data store (one per hash table and possibly more).
Although the recall for the experiments presented is very high for searching such a huge collection, the results are lacking in precision. While such low precision is well acceptable in local data signature applications, where many data signatures “vote” on the similarity of an object, it is unacceptable for user-oriented single data signature applications, since a user can only scan a handful of results and not several hundred.
In order to increase precision, now returning at most 8 nearest neighbor candidates, we need to add more index structures and perform a more sophisticated aggregation of the results:
The results in Table 2 show that the NV-Tree configuration clearly outperforms LSH in terms of disk space and search speed for a comparable amount of recall quality and precision. The PvS-Framework suffers from its huge storage demand on the data store, together with a rather low recall when using only 3 indices.
The NV-tree is a general data structure for high dimensional nearest neighbor search. It supports any kind of data dimensionality and multiple distance measures, and is applicable for at least the following applications:
Number | Date | Country | Kind |
---|---|---|---|
8499 | Jun 2006 | IS | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/IS07/00014 | 6/6/2007 | WO | 00 | 4/3/2009 |