The present disclosure generally relates to vector search engines, and relates in particular to a high performance vector search engine based on dynamic multi-transformation coefficient traversal.
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
It is well known to experts in the field that achieving sub-linear similarity search over a high dimensional, large vector data set is difficult. Research results have been obtained on limited data sets such as time series data and face image data, etc. These results mainly focused on variations of statistical clustering analysis such as: (a) “static” supporting vector analysis that divides the data set into a smaller number of clusters to facilitate the search operation; or (b) “dynamic” tree structures to support a hierarchical search. Due to the phenomenon in high dimensional large vector data whereby pairwise distances between vectors tend to concentrate within a narrow standard deviation around the average distance, clustering and tree-based partitioning methods are not effective for high dimensional, large vector data sets. Therefore, it is necessary to investigate new methods that can improve the speed and accuracy of the similarity search operation for high dimensional, large vector data sets.
Independent of the statistical analysis methods developed for similarity search over large vector data sets, Wavelet transformation has been applied to various problems of signal/image processing. The advantages of Wavelet transformation include: (a) generality of the transformation; (b) adaptability of the transformation; (c) the transformation is hierarchical; (d) the transformation is lossless; and (e) efficiency of the transformation.
A similarity search engine includes a transformation module performing multiple iterations of transformation on a high dimensional vector data set. A scanning module supports dynamic selection of coefficients generated by the multiple iterations, and stores and utilizes search results in subsequent search operations. A dynamic query vector tree constructed from one or more input queries enhances search performance using multiple scans. Subsequent scans have a reduced candidate vector set and increased nearest neighbor vectors in a query vector set compared to previous scans.
Further areas of applicability will become apparent from the description provided herein. It should be understood that the description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
The drawings described herein are for illustration purposes only and are not intended to limit the scope of the present disclosure in any way.
The following description is merely exemplary in nature and is not intended to limit the present disclosure, application, or uses.
A new set of information can be generated from original vector data using iterative transformation. Traditionally, signal analysis is done based on coefficients generated from one iteration of a transformation, such as the Haar or Fourier transformation. While one iteration of a transformation can achieve multi-resolution and distance-preserving properties, more information can be extracted from the data by observing coefficients over multiple iterations of transformations.
For example, for an n-dimensional vector V, one can apply the Haar transform Haar(V) c times to get the c-transformed vector Vc (i.e., Vc=Haar(Haar( . . . Haar(V) . . . )) for c complete iterations). These c iterations of the Haar transform generate (n×c) coefficients, which can be labeled as aij where 0<i<c+1 and 0<j<n+1. Then, one can select k coefficients from the (n×c) coefficients to form an approximation vector (i.e., the approximation vector is {aij} where aij is the j-th coefficient of the i-th Haar transform of the original vector). One can also quantize the elements of each vector using a lower number of bits. In particular, it is possible to use quantization(projection(V)) to describe a final approximation vector. The approximation vector can be much smaller than the original vector, but, due to the multiple iterations of transformation and the selection of coefficients, it can retain a dense and sufficient representation of the information contained in the original vector to support similarity search operations with good accuracy.
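The iterated transform and coefficient selection described above can be sketched as follows. This is an illustrative sketch only: the function names, the normalized pairwise average/difference form of the Haar step, and the largest-magnitude selection rule are assumptions, not features mandated by the disclosure.

```python
import numpy as np

def haar_step(v):
    """One complete Haar transform pass over a vector of even length:
    normalized pairwise averages followed by pairwise differences."""
    v = np.asarray(v, dtype=float)
    avg = (v[0::2] + v[1::2]) / np.sqrt(2.0)
    diff = (v[0::2] - v[1::2]) / np.sqrt(2.0)
    return np.concatenate([avg, diff])

def iterated_haar(v, c):
    """Apply the Haar transform c times, collecting all n coefficients
    after each iteration, yielding a (c x n) coefficient matrix a[i][j]."""
    coeffs = []
    x = np.asarray(v, dtype=float)
    for _ in range(c):
        x = haar_step(x)
        coeffs.append(x.copy())
    return np.vstack(coeffs)

def approximation_vector(v, c, k):
    """Select k coefficients out of the (n*c) generated ones to form a
    compact approximation vector; here, the k of largest magnitude."""
    flat = iterated_haar(v, c).ravel()
    idx = np.argsort(-np.abs(flat))[:k]
    return idx, flat[idx]
```

Because each normalized Haar pass is orthonormal, every iteration preserves the L2 norm of the vector, which is why distances computed on the selected coefficients can approximate distances on the original vectors.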
Based on the idea explained above, a similarity search engine can allow a user to dynamically select the best method for a specific set of applications. In particular, the search engine can support dynamic selection of coefficients generated by multiple iterations of transformation on a high dimensional vector data set. The search engine can also store and utilize some of the search results in subsequent search operations. The stored reference vector set (accumulated from previous search operations) can speed up the search operation. Further, the search engine can build a dynamic query vector tree from one or more input queries to enhance search performance using multiple scans, each with a reduced candidate vector set and increased nearest neighbor vectors in the query set.
Referring to
To increase the efficiency of the search operation, the architecture also stores the nearest neighbor information obtained from previous nearest neighbor search results into each approximation vector. This nearest neighbor information is collected in an index file and represented in
Turning now to
This operation returns vectors close to each other rather than vectors close to only one query point. The maximum number of scans at step (c) is bounded by ⌈log(K)⌉ iterations.
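The multi-scan behavior, in which each scan prunes the candidate set until only the nearest neighbors remain, might be sketched as follows. The halving rule and function names are illustrative assumptions used only to show why the scan count is logarithmically bounded; the disclosed engine prunes using approximation vectors and the query tree rather than full-precision distances.

```python
import numpy as np

def multi_scan_knn(candidates, query, k):
    """Illustrative multi-scan search: each scan keeps the closer half of
    the remaining candidates, so starting from K candidates at most about
    ceil(log2(K)) scans occur before only k survivors remain."""
    cand = np.asarray(candidates, dtype=float)
    q = np.asarray(query, dtype=float)
    scans = 0
    while len(cand) > k:
        d = np.linalg.norm(cand - q, axis=1)   # distance of each candidate to the query
        keep = max(k, len(cand) // 2)          # halve the candidate set, but keep at least k
        cand = cand[np.argsort(d)[:keep]]
        scans += 1
    return cand, scans
```

Each scan halves the surviving candidate set, so the number of scans grows with the logarithm of the initial candidate count rather than linearly.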
The advantages of this search engine over the existing art are numerous. For example, the selection of dominant coefficients is from multiple transformations rather than one transformation. Dominant coefficient selection based on the standard deviation of coefficients across the sample vector set allows for a statistically more accurate calculation of L2 (Euclidean) distance. Separating the uniformly distributed elements and obtaining an approximation using a quantization method can further increase the accuracy of the distance calculation while reducing the total amount of data bits involved in the calculation.
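The standard-deviation-based selection and the quantization step mentioned above admit a compact sketch. The helper names and the uniform scalar quantizer are assumptions for illustration; the disclosure does not specify a particular quantization scheme.

```python
import numpy as np

def dominant_coefficient_positions(coeff_matrix, k):
    """Pick the k coefficient positions with the largest standard
    deviation across a sample of transformed vectors; high-variance
    positions discriminate best when estimating L2 distance.
    coeff_matrix is (num_samples x num_coefficients)."""
    std = np.std(coeff_matrix, axis=0)
    return np.argsort(-std)[:k]

def quantize(values, bits, lo, hi):
    """Uniform scalar quantization of coefficient values to the given
    bit width over the range [lo, hi], reducing bits per element."""
    levels = (1 << bits) - 1
    clipped = np.clip(np.asarray(values, dtype=float), lo, hi)
    return np.round((clipped - lo) / (hi - lo) * levels).astype(int)
```

Positions whose coefficients barely vary across the sample contribute almost nothing to distances between approximation vectors, so dropping them loses little accuracy while shrinking the representation.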
Also, by accumulatively storing nearest neighbor information in the approximation vectors, the search engine can provide fast retrieval of nearest neighbors by reusing the computation results obtained from previous search operations.
Further, the query tree supports simultaneous comparison of an input approximation vector against a query vector and its associated nearest neighbor vectors. If the input vector is similar to the query vector and its nearest neighbors, it is more likely that the original vector will be similar to the query vector.
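The neighborhood comparison at a query tree node might look like the following sketch. The acceptance rule (requiring closeness to the query and to every stored neighbor within one threshold) is a hypothetical simplification; the disclosure only states that agreement with the neighborhood raises confidence in a match.

```python
import numpy as np

def matches_query_node(candidate, query, neighbors, threshold):
    """Accept a candidate approximation vector only if it is close both
    to the query vector and to that query's stored nearest neighbors;
    a candidate similar to the whole neighborhood is more likely to be
    a true match in the original vector space."""
    refs = [np.asarray(query, dtype=float)]
    refs += [np.asarray(n, dtype=float) for n in neighbors]
    c = np.asarray(candidate, dtype=float)
    return all(np.linalg.norm(c - r) <= threshold for r in refs)
```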