Embodiments of the present invention relate generally to visual search systems and, more specifically, to systems, circuits, and methods for providing improved compact descriptors of an image or object that reduce the bandwidth required to communicate these descriptors in visual search systems.
The widespread use of mobile devices equipped with high-resolution cameras is increasingly pushing computer vision applications toward mobile scenarios. The common paradigm is that of a user taking a picture of the surroundings with a mobile device to obtain informative feedback on its content. This is the case, e.g., in mobile shopping, where a user can shop just by taking pictures of products, or in landmark recognition, which assists when visiting places of interest. As visual search in the aforementioned scenarios typically needs to be performed over a large image database, applications communicate wirelessly with a remote server to send visual information and receive the informative feedback. As a result, a constraint is set by the bandwidth of the communication channel, whose use ought to be carefully optimized to bound communication costs and network latency. For this reason, a compact but informative image representation is sent to the remote server, typically in the form of a set of local feature descriptors (e.g. SIFT, SURF) extracted from the image.
Despite the summarization of image content into local features, the size of state-of-the-art descriptors cannot meet bandwidth requirements.
Disclosed embodiments are directed to methods, systems, and circuits of generating compact descriptors for transmission over a communications network. A method according to one embodiment includes receiving an uncompressed descriptor, performing zero-thresholding on the uncompressed descriptor to generate a zero-threshold-delimited descriptor, quantizing the zero-threshold-delimited descriptor to generate a quantized descriptor, and coding the quantized descriptor to generate a compact descriptor for transmission over a communications network. The uncompressed and compact descriptors may be 3D descriptors, such as where the uncompressed descriptor is a SHOT descriptor. The operation of coding can be ZeroFlag coding, ExpGolomb coding, or Arithmetic coding, for example.
Visual search for mobile devices relies on wirelessly transmitting a compact representation of the query image, generally in the form of feature descriptors, to a remote server. Descriptors are therefore compressed, so as to reduce bandwidth occupancy and network latency. Given the impressive pace of growth of 3D video technology, 3D visual search applications for the mobile and robotic markets will become a reality. Accordingly, embodiments described herein are directed to compressed 3D descriptors, a fundamental building block for such prospective applications. Based on an analysis of several compression approaches, different embodiments are directed to the generation and use of a compact version of a state-of-the-art 3D descriptor. Experimental data presented herein for a vast dataset demonstrate the ability of these embodiments to achieve compression rates as high as 98% with a negligible loss in 3D visual search performance.
A representative visual search system 100 is illustrated in
As illustrated in
In the following description, certain details are set forth to provide a sufficient understanding of the present invention, but one skilled in the art will appreciate that the invention may be practiced without these particular details. Furthermore, one skilled in the art will appreciate that the example embodiments described below do not limit the scope of the present invention, and will also understand that various modifications, equivalents, and combinations of the disclosed example embodiments and components of such embodiments are within the scope of the present invention. Illustrations of the various embodiments, when presented by way of illustrative examples, are intended only to further illustrate certain details of the various embodiments, and should not be interpreted as limiting the scope of the present invention. Finally, in other instances below, the operation of well-known components, processes, algorithms and protocols has not been shown or described in detail to avoid unnecessarily obscuring the present invention.
A research trend addressing effective compression of feature descriptors has emerged recently, aiming to save communication bandwidth while minimizing the loss in descriptive power. Several techniques aimed at descriptor compression, also known as compressed or compact descriptors, have been proposed in the literature. The perceived market potential of mobile visual search has also led to the establishment of an MPEG committee which is currently working on the definition of a new standard focused on Compact Descriptors for Visual Search (CDVS).
Techniques for feature detection and description from 3D data have also been proposed in the literature, a topic recently fostered by the advent of accurate and low-cost 3D sensors, such as the Microsoft Kinect and the Asus Xtion. Popular applications of 3D features include shape retrieval within 3D databases (e.g. Google 3D Warehouse), 3D reconstruction from range views, and recognition and categorization of 3D objects. On the other hand, driven by the developments of 3D video technologies (3D movies, 3D TVs, 3D displays), embedded low-cost 3D sensors have started appearing on a number of diverse mobile devices. For instance, this is the case for new smartphones and tablets, such as the LG Optimus 3D P920, LG Optimus Pad, HTC EVO 3D and Sharp Aquos SH-12C, as well as game consoles, like the Nintendo 3DS. A study by In-Stat claims that the market for 3D mobile devices is growing steadily and rapidly, and that by 2015 it will exceed 148 million units. Accordingly, new research is investigating the development of 3D data acquisition technologies specifically conceived for mobile devices. Interestingly, novel technologies for 3D data acquisition are also being developed for smartphones not equipped with 3D sensors: this is the case of the Trimensional iPhone app.
Given the predicted fast development of the 3D ecosystem, demand is envisioned for new applications that will require querying a 3D database by means of 3D data gathered on-the-fly by mobile devices or robots. They will likely adhere to the paradigm of current 2D visual search applications, as illustrated in
Key to the foreseen 3D search scenarios is therefore a novel research topic dealing with compact 3D descriptors, which ought to be developed to effectively support transmission of the relevant local information tokens related to the query 3D scene. A state-of-the-art 3D descriptor, e.g. SHOT, is considered in order to develop several approaches relying on recent data compression techniques. Experiments on a vast 3D dataset allowed the identification of the most favorable trade-off between the conflicting requirements of a high compression rate and a limited performance loss with respect to the original uncompressed descriptor. The results turn out to be quite satisfactory: an average compression rate of around 98% can be achieved with a negligible loss in performance.
As far as 2D compact descriptors are concerned, most techniques proposed so far deal with SIFT. An interesting and thorough review of SIFT compression approaches can be found in the literature, where three different categories of compression schemes are identified: hashing, transform coding and vector quantization. In the first category, each descriptor is associated with a hash code; these codes are then compared based on their Euclidean or Hamming distance. Examples of such methods are Locality Sensitive Hashing, Similarity Sensitive Coding and Spectral Hashing. Transform coding, instead, is a technique used for audio and video compression (e.g. JPEG). A conventional transform coder takes an input vector X and transforms it into another vector Y=TX of the same size, then quantizes this new vector. The transformation decorrelates the different dimensions of the original vector, in order to make quantization more effective and reduce the loss in performance. The decoder takes the transformed and quantized vector and applies an inverse transformation to obtain an estimate of the original vector. Examples of transform coding schemes are the Karhunen-Loève Transform and ICA-based transforms. Finally, compression based on vector quantization subdivides the descriptor space into a fixed number of bins (a codebook) using clustering techniques such as the k-means algorithm. Subsequently, instead of a descriptor, its associated codeword ID can be sent. Two examples are Product Quantization and Tree Structured Vector Quantization. Although generally able to yield small distortions of the original signal, the main disadvantage of such approaches is that the codebook must be present at both the encoder and the decoder side. In our scenario, this requires the codebook to be stored on the mobile device and transmitted, which could be cumbersome due to its often considerable size. Moreover, if the codebook is modified at run-time, an additional transmission overhead is required to keep the encoder and decoder synchronized. Another possibility deals with the use of a data-independent codebook, such as in Type Coding. In this case, the codebook is based on a regular grid defined over the descriptor space, which usually implies more distortion but requires neither local storage of the codebook nor any synchronization overhead.
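The transform-coding round trip described above can be illustrated with a minimal sketch, using a simple Karhunen-Loève (PCA) transform learned from a training set followed by uniform scalar quantization. All function names, the per-descriptor range side information and the parameter values are illustrative assumptions rather than a specific scheme from the literature.

```python
import numpy as np

def learn_klt(training_descriptors):
    """Learn an orthonormal transform T (rows = principal directions) from training data."""
    X = np.asarray(training_descriptors, dtype=np.float64)
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)          # columns are eigenvectors, ascending variance
    return eigvecs[:, ::-1].T                 # rows sorted by decreasing variance

def transform_encode(x, T, bits=6):
    """Y = T x, then uniform scalar quantization of each coefficient."""
    y = T @ x
    lo, hi = y.min(), y.max()                 # per-descriptor range (sent as side information)
    levels = 2 ** bits - 1
    q = np.round((y - lo) / (hi - lo) * levels).astype(np.uint16)
    return q, lo, hi

def transform_decode(q, lo, hi, T, bits=6):
    """Dequantize and apply the inverse (transpose) of the orthonormal transform."""
    levels = 2 ** bits - 1
    y_hat = q.astype(np.float64) / levels * (hi - lo) + lo
    return T.T @ y_hat                        # estimate of the original descriptor

# toy usage with random "descriptors"
rng = np.random.default_rng(0)
train = rng.random((500, 32))
T = learn_klt(train)
x = rng.random(32)
q, lo, hi = transform_encode(x, T)
x_hat = transform_decode(q, lo, hi, T)
print("reconstruction error:", np.linalg.norm(x - x_hat))
```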
As an alternative to SIFT, one of the best-known compact descriptors is CHoG, which has been reported to offer the best trade-off between compression rate and visual search performance when compared to other compact descriptors. To build the CHoG descriptor, an uncompressed descriptor (UHoG) is first extracted; like SIFT, it is a vector of histograms of gradient orientations, but it carries out spatial binning according to a DAISY configuration instead of a 4×4 square grid. Subsequently, UHoG is compressed by means of Type Coding to obtain the CHoG descriptor.
An example of a compact 3D descriptor was proposed in the prior art by Papadakis et al.; however, it is not a local 3D descriptor but a hybrid 2D/3D global descriptor. To compress the descriptor, the authors apply scalar quantization followed by Huffman Coding, reaching a compression rate of 92.6%.
The SHOT descriptor encodes a signature of histograms of topological traits. A 3D spherical grid of radius r, made up of 32 sectors, is centered at the keypoint to be described and oriented according to a unique local reference frame which is invariant with respect to rotations and translations. For each spherical grid sector, a one-dimensional histogram is computed, built up by accumulating the cosine (discretized into bS bins) of the angle between the normal at the keypoint and the normal of each of the points belonging to the spherical grid sector for which the histogram is being computed. The final descriptor is then formed by orderly juxtaposing all histograms together according to the local reference frame. To better deal with quantization effects, quadrilinear interpolation is applied to each accumulated element. Finally, to improve robustness with respect to point-density variations, the descriptor is normalized to unit length. When color information is available together with depth, as is the case with RGB-D data provided by the Kinect, an additional set of histograms can be computed, where the L1 norm between the color triplet of the center point and that of each point of the current spherical grid sector is accumulated in each histogram, quantized into bC bins (usually bC ≠ bS). The SHOT code is publicly available as a stand-alone library, as well as part of the open source Point Cloud Library.
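As a rough illustration of how each per-sector histogram described above could be accumulated, the following sketch bins the cosine of the angle between the keypoint normal and each neighbor normal into bS bins, concatenates the per-sector histograms and normalizes the result to unit length. It deliberately omits the unique local reference frame, the actual 32-sector spherical partition and the quadrilinear interpolation of the real SHOT algorithm; all function names are illustrative only.

```python
import numpy as np

def sector_histogram(keypoint_normal, sector_normals, b_s=10):
    """Histogram of cos(angle) between the keypoint normal and the normals
    of the points falling inside one spherical-grid sector."""
    n = keypoint_normal / np.linalg.norm(keypoint_normal)
    M = sector_normals / np.linalg.norm(sector_normals, axis=1, keepdims=True)
    cosines = np.clip(M @ n, -1.0, 1.0)            # dot products of unit normals, in [-1, 1]
    hist, _ = np.histogram(cosines, bins=b_s, range=(-1.0, 1.0))
    return hist.astype(np.float64)

def shot_like_descriptor(keypoint_normal, sectors, b_s=10):
    """Concatenate per-sector histograms and normalize to unit length,
    as done for the final SHOT descriptor."""
    histograms = [sector_histogram(keypoint_normal, s, b_s) for s in sectors]
    d = np.concatenate(histograms)
    norm = np.linalg.norm(d)
    return d / norm if norm > 0 else d

# toy usage: 32 sectors, each with 20 random neighbor normals
rng = np.random.default_rng(1)
sectors = [rng.normal(size=(20, 3)) for _ in range(32)]
keypoint_normal = np.array([0.0, 0.0, 1.0])
descriptor = shot_like_descriptor(keypoint_normal, sectors)
print(descriptor.shape)   # (320,) in this simplified sketch; the real SHOT layout differs slightly
```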
As previously mentioned, the embodiments described herein investigate compression schemes suitable for achieving compact 3D descriptors. Several state-of-the-art data compression algorithms have been analyzed, and four approaches have been derived, corresponding to four embodiments shown in
Zero Thresholding builds on the intuition that, generally, 3D surfaces intersect only a limited portion of a volumetric neighborhood around a keypoint, which suggests that a number of proposed 3D descriptors are quite sparse, i.e. have many values equal (or close) to zero. This is, indeed, the case for SHOT, for which this intuition has been verified experimentally, finding that typically more than 50% of the elements are null. This characteristic may be exploited by a lossless compression step, i.e. by using just a few bits to encode the zero value. Moreover, it turns out to be even more effective to also threshold to zero those elements having a small value: this step is referred to as Zero Thresholding (ZT).
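A minimal sketch of the Zero Thresholding step, assuming the descriptor is available as a NumPy array; the threshold value is only an example taken from the discussion of Table I below, and the toy data are synthetic.

```python
import numpy as np

def zero_threshold(descriptor, threshold=0.01):
    """Set to zero every element whose value is <= threshold (Zero Thresholding)."""
    d = np.asarray(descriptor, dtype=np.float64)
    return np.where(d <= threshold, 0.0, d)

# toy usage: synthetic data loosely mimicking a sparse SHOT descriptor of 352 elements
rng = np.random.default_rng(2)
descriptor = rng.exponential(scale=0.005, size=352)
zt = zero_threshold(descriptor, threshold=0.01)
print("null elements after ZT: %.1f%%" % (100.0 * np.mean(zt == 0.0)))
```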
Table I shows the percentage of elements that are less than or equal to a threshold within a set of SHOT descriptors extracted from the two datasets that will be presented below, namely Kinect and Spacetime.
As demonstrated by Table I, a threshold equal to 0.01 yields a percentage of null elements as high as 83%, while thresholding at 0.1 allows reaching 94%. However, it was observed (see e.g.
Regarding Quantization, the original SHOT descriptor represents each element as a double-precision floating-point number. Given the SHOT normalization step, which results in all its elements ranging between 0 and 1, it is possible to quantize each value with a fixed number of bits, thus reducing the descriptor size. Since it was found that the descriptor performance starts to deteriorate when using fewer than 4 bits for the quantization step, the experiments described below are carried out using 6 and 4 bits. It is worth noting that, depending on the coarseness of the quantization, this step can also subsume the previous ZT step (e.g. this occurs in the case of 4 bits, where all values smaller than 1/32 are quantized to 0).
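A sketch of the fixed-bit quantization step, assuming each (zero-thresholded) element lies in [0, 1] thanks to SHOT's unit-length normalization. The exact scaling used in the embodiments is not spelled out beyond the 1/32 observation above; the rounding scheme below (scale by 2^b, round, clamp) is one choice consistent with that observation and is an assumption.

```python
import numpy as np

def quantize(descriptor, bits=6):
    """Uniformly quantize values in [0, 1] to integers on 'bits' bits.
    With bits=4, every value smaller than 1/32 rounds to 0, consistent
    with the observation in the text (assumed scaling)."""
    scale = 2 ** bits
    q = np.round(np.asarray(descriptor) * scale).astype(np.int64)
    return np.clip(q, 0, scale - 1)

def dequantize(q, bits=6):
    """Map quantized integers back to approximate values in [0, 1]."""
    return np.asarray(q, dtype=np.float64) / (2 ** bits)

x = np.array([0.0, 0.02, 0.3, 0.97, 1.0])
q6 = quantize(x, bits=6)
print(q6, dequantize(q6, bits=6))
```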
ZeroFlag Coding (ZFC) may also be used as a way to exploit the usually high number of null values present in the descriptor (especially after Zero Thresholding). It efficiently encodes sequences of “zeros” by means of an additional flag bit, F, which is inserted before every element different from zero and before every sequence of zeros, according to the following rules: F=1 means that the next element is not zero and is followed by a fixed number of bits representing the quantized value of this element; F=0 means that the next element is a sequence of zeros and is followed by a fixed number of bits indicating the length of the sequence. This approach requires specifying the maximum length of a zero sequence. Since in our experiments good performance was obtained with a maximum length equal to 16, 5 bits were used to encode each zero sequence (1 flag bit plus 4 bits to encode the length of the sequence). A zero sequence longer than 16 elements is split into multiple sequences.
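A sketch of ZeroFlag Coding over a quantized descriptor, written as a list of bits for clarity. Run lengths of 1 to 16 are stored on 4 bits (as length minus 1), matching the 5-bit-per-zero-run figure above; longer runs are split. The bit-level packing and helper names are illustrative.

```python
def zfc_encode(values, value_bits=6, max_run=16):
    """ZeroFlag-code a sequence of non-negative quantized integers into a list of bits."""
    bits = []
    i = 0
    while i < len(values):
        if values[i] != 0:
            bits.append(1)                                       # flag: next is a non-zero value
            bits.extend(int(b) for b in format(values[i], f'0{value_bits}b'))
            i += 1
        else:
            run = 0
            while i < len(values) and values[i] == 0 and run < max_run:
                run += 1
                i += 1
            bits.append(0)                                       # flag: next is a run of zeros
            bits.extend(int(b) for b in format(run - 1, '04b'))  # run length 1..16 on 4 bits
    return bits

def zfc_decode(bits, value_bits=6):
    """Decode a ZeroFlag-coded bit list back into the quantized integer sequence."""
    values, i = [], 0
    while i < len(bits):
        flag = bits[i]; i += 1
        if flag == 1:
            values.append(int(''.join(map(str, bits[i:i + value_bits])), 2))
            i += value_bits
        else:
            run = int(''.join(map(str, bits[i:i + 4])), 2) + 1
            i += 4
            values.extend([0] * run)
    return values

data = [0, 0, 0, 5, 0, 0, 12, 0] + [0] * 20
coded = zfc_encode(data)
assert zfc_decode(coded) == data
print(len(coded), "bits instead of", len(data) * 6)
```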
ExpGolomb Coding (EGC) is a compression algorithm allowing the use of few bits to represent small values, the number of required bits increasing with increasing numerical values. The algorithm is controlled by a parameter k, which in our investigation is set to 0, so that each null element, particularly frequent in SHOT especially after Zero Thresholding, is represented by just one bit.
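A sketch of order-k Exp-Golomb coding for non-negative integers; with k = 0, as used in the described embodiments, the value 0 is coded with a single "1" bit, so long runs of nulls remain cheap. Bit-string handling and function names are again illustrative.

```python
def expgolomb_encode(value, k=0):
    """Order-k Exp-Golomb code of a non-negative integer, returned as a bit string."""
    v = (value >> k) + 1                 # high part coded with order-0 Exp-Golomb
    high = bin(v)[2:]                    # binary representation, MSB first
    prefix = '0' * (len(high) - 1)       # unary prefix giving the length of 'high'
    suffix = format(value & ((1 << k) - 1), f'0{k}b') if k else ''
    return prefix + high + suffix

def expgolomb_decode(bits, k=0):
    """Decode one order-k Exp-Golomb codeword from the start of a bit string.
    Returns (value, number_of_bits_consumed)."""
    zeros = 0
    while bits[zeros] == '0':
        zeros += 1
    high = int(bits[zeros:2 * zeros + 1], 2) - 1
    low = int(bits[2 * zeros + 1:2 * zeros + 1 + k], 2) if k else 0
    return (high << k) | low, 2 * zeros + 1 + k

for v in [0, 1, 2, 5, 17]:
    code = expgolomb_encode(v)           # k = 0: value 0 -> "1", 1 -> "010", 2 -> "011", ...
    assert expgolomb_decode(code)[0] == v
    print(v, '->', code)
```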
The idea behind Arithmetic Coding (AC) is to represent highly frequent values with few bits, the number of bits increasing as the symbol becomes less frequent (or less probable). Frequencies can be estimated through a training stage, wherein the probability distribution associated with the symbols is learned. Alternatively, they can be learned without a specific training stage in an adaptive manner, whereby at the beginning all symbols have the same probability and each frequency is then updated every time a symbol is encoded (or decoded). In the latter case, there is no overhead due to initial codebook synchronization between encoder and decoder. In our study, the adaptive version of the AC algorithm was considered since it is more generally applicable, a training stage not being feasible in several application scenarios related to 3D visual search descriptors. A detailed explanation of the AC algorithm can be found in the literature, which also provides the implementation of the adaptive version of the algorithm used.
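The adaptive behaviour described above can be sketched by showing only the probability model that would drive an arithmetic coder: all symbols start with count 1 (a uniform distribution), and a symbol's count is incremented each time it is encoded or decoded, so encoder and decoder stay synchronized without transmitting any codebook. The full interval-narrowing coding loop is intentionally omitted; the class and method names are illustrative.

```python
class AdaptiveModel:
    """Adaptive frequency model conceptually shared by the arithmetic
    encoder and decoder: identical updates keep both sides in sync."""

    def __init__(self, num_symbols):
        self.counts = [1] * num_symbols          # uniform start: every symbol is possible

    def interval(self, symbol):
        """Cumulative probability interval [low, high) currently assigned to 'symbol'.
        An arithmetic coder would narrow its working range to this interval."""
        total = sum(self.counts)
        low = sum(self.counts[:symbol]) / total
        high = low + self.counts[symbol] / total
        return low, high

    def update(self, symbol):
        """Called after a symbol is encoded (or decoded)."""
        self.counts[symbol] += 1

model = AdaptiveModel(num_symbols=64)            # e.g. 6-bit quantized descriptor values
for s in [0, 0, 0, 3, 0, 17, 0]:                 # zeros dominate after Zero Thresholding
    low, high = model.interval(s)
    model.update(s)
print("P(0) after a few symbols:", model.interval(0)[1] - model.interval(0)[0])
```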
Given an m-dimensional symbol s, Type Coding (TC) associates it with its nearest neighbor q on a regular m-dimensional lattice. Hence, the index associated with q is transmitted instead of s. The lattice can be built as described in the literature, the main benefit being that its structure is independent of the data, so that TC does not require storage and transmission of any codebook. Besides m, Type Coding relies on another parameter, n, which controls the number of elements constituting the lattice, so that the total number of lattice elements coincides with the number of partitions of the parameter n into m non-negative terms, given by the following multiset coefficient:

K(m, n) = C(n + m − 1, m − 1),  (1)

where C(a, b) denotes the binomial coefficient.
The number of bits needed to encode each index is at most:

B(m, n) = ⌈log2 K(m, n)⌉.  (2)
To experiment with TC, SHOT was subdivided into equally sized sub-vectors and TC compression was applied to each sub-vector. As TC requires the elements to be encoded to sum up to 1, the set of normalization factors associated with the sub-vectors was appended at the end of the compressed descriptor. Finally, the array formed by the normalization factors is itself normalized between 0 and 1, and then quantized with 8 bits to reduce its storage (this last normalization factor need not be stored). This allows the normalization step to be reversed at the end of the decoding stage with a limited loss due to normalization-factor compression, as otherwise the information content of the descriptor would be distorted by the different normalization factors.
As SHOT consists of 32 histograms, the performance of TC was evaluated by combining them into sub-vectors of k histograms, with k equal to 1, 2, 4, 8 or 16. Considering, for instance, the parameter bS equal to 10, as proposed for the original SHOT descriptor, the parameter m in (1) can be set to m = k·(bS + 1), with k = 1, . . . , 16. From equation (2) it is possible to determine the size of the compact descriptor, and thus the overall compression rate, for different parameter choices.
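Equations (1) and (2) can be checked numerically for these parameter choices. The sketch below reproduces the m = 176, n = 100 case, assuming a shape-only SHOT of 352 double-precision elements (bS = 10) and one 8-bit normalization factor per sub-vector as described above; this accounting of the payload is an illustrative assumption, not a statement from the experiments.

```python
import math

def lattice_size(m, n):
    """Equation (1): number of lattice points, i.e. partitions of n into m non-negative terms."""
    return math.comb(n + m - 1, m - 1)

def index_bits(m, n):
    """Equation (2): bits needed to encode one lattice index."""
    return math.ceil(math.log2(lattice_size(m, n)))

m, n = 176, 100                               # k = 16 histograms per sub-vector
bits_per_subvector = index_bits(m, n)         # 256 bits for this choice
num_subvectors = 352 // m                     # SHOT with bS = 10 has 352 elements (assumed)
payload = num_subvectors * bits_per_subvector
payload += num_subvectors * 8                 # one 8-bit normalization factor per sub-vector (assumed)
uncompressed = 352 * 64                       # 352 double-precision values
print(bits_per_subvector, "bits per sub-vector,", payload, "bits total")
print("compression rate: %.2f%%" % (100.0 * (1 - payload / uncompressed)))
```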
Table II shows data obtained by choosing, for each value of m, the value of n that minimizes the accuracy loss with respect to the uncompressed SHOT descriptor. It can be noticed that the choice m=176, n=100, i.e. k=16, provides the highest compression rate.
Therefore, the experiments used these values, so as to favor compactness of the descriptor. However, it is worth pointing out that: i) the computational complexity (thus, encoding and decoding time) of TC grows with m and n, and ii) the algorithm internally uses integers represented with a large number of bits, which may be difficult to handle both in software and in hardware. With the choice m=176, n=100, the resulting descriptor consists of two 256-bit integers: to handle them, a specific software library for large integers was used, which causes a significant increase in the computational burden. As for the experiments including color information, m and n have been set according to the same principle, in particular m = 16·(bS + 1) for the shape part and m = 16·(bC + 1) for the color part, with n = 100.
The described approaches for achieving compact 3D descriptors are evaluated and compared here in terms of performance and compression rate with respect to the uncompressed SHOT. The case of 3D shape data as well as that of RGB-D data is described herein.
Experiments were carried out over five different datasets, two of which also contain color information and will be used in the experiments concerning RGB-D descriptors. Three of these datasets are those originally used in the experimental evaluation of SHOT: the Spacetime dataset, containing 6 models and 15 scenes acquired with the Spacetime Stereo technique; the Kinect dataset, containing 6 models and 17 scenes acquired with a Microsoft Kinect device; and the Stanford dataset, containing 6 models and 45 scenes built by assembling 3D data obtained from the Stanford repository.
Two additional datasets, namely Virtual Stanford and Virtual Watertight, were built using, respectively, 6 models from the Stanford repository and 13 models from the Watertight dataset. The scenes in these datasets were created by randomly placing 3 to 5 models close to each other and then rendering 2.5D views in the form of range maps, with the aim of mimicking a 3D sensor such as the Kinect. To this end, a Kinect simulator was used which first generates depth maps from specific vantage points by ray casting, then adds Gaussian noise and quantizes the z-coordinates, with both the noise variance and the quantization step increasing with distance. Finally, bilateral filtering was applied to the depth maps to reduce noise and quantization artifacts.
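A minimal sketch of the depth-dependent noise and quantization model described above; the specific functional forms relating noise standard deviation and quantization step to distance, and the numeric constants, are assumptions for illustration rather than the exact model used in the experiments.

```python
import numpy as np

def simulate_kinect_noise(depth_map, sigma_at_1m=0.0015, quant_at_1m=0.002, seed=0):
    """Add Gaussian noise and quantize z-coordinates, with both the noise standard
    deviation and the quantization step growing with distance (quadratically here,
    roughly mimicking a triangulation-based sensor such as the Kinect)."""
    rng = np.random.default_rng(seed)
    z = np.asarray(depth_map, dtype=np.float64)             # depth in meters
    sigma = sigma_at_1m * z ** 2                             # distance-dependent noise level
    noisy = z + rng.normal(0.0, 1.0, z.shape) * sigma
    step = quant_at_1m * z ** 2                              # distance-dependent quantization step
    return np.round(noisy / step) * step

depth = np.full((4, 4), 2.0)                                 # a flat surface 2 m away
print(simulate_kinect_noise(depth))
```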
All these datasets include, for each scene, ground-truth information (i.e. the list of model instances present in the scene, together with their rotation and translation with respect to the original model).
To evaluate the performance of compact descriptors, the process first extracts a predefined number of keypoints from each model via random sampling, then relies on ground-truth information to select the scene points that exactly match those extracted from the models. To simulate the presence of outliers, the process also randomly extracts a predefined number of keypoints from clutter, which have no correspondent among the models. For each keypoint, the SHOT descriptor is computed. As for the SHOT parameters, the size of the radius r and the numbers of shape and color bins (bS and bC) were tuned so as to adapt them to the specific characteristics of each dataset. The tuned values, listed in Table III, are used by all the considered compact descriptors.
After computation of the descriptors, each vector is first encoded and then decoded. This is done also for the model descriptors, so as to account for the distortions introduced by compression. Subsequently, the matching stage compares the descriptors extracted from each model to those identified in each scene based on the Euclidean distance in the descriptor space. More precisely, descriptors are matched based on the ratio-of-distances criterion in one embodiment. Correspondences are then compared with the ground truth to compute the number of True Positives and False Positives at different values of the matching threshold, thus attaining Precision-Recall curves. It is important to point out that, as shown in
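A sketch of the matching stage based on the ratio-of-distances criterion described above: for each (decoded) model descriptor the two nearest scene descriptors are found by Euclidean distance, and the correspondence is accepted only if the ratio between the two distances falls below a matching threshold. The threshold value, function names and toy data are illustrative assumptions.

```python
import numpy as np

def match_ratio_of_distances(model_desc, scene_desc, ratio_threshold=0.8):
    """Return (model_index, scene_index) pairs passing the ratio-of-distances test."""
    matches = []
    scene = np.asarray(scene_desc, dtype=np.float64)
    for i, d in enumerate(np.asarray(model_desc, dtype=np.float64)):
        dists = np.linalg.norm(scene - d, axis=1)          # Euclidean distances to all scene descriptors
        nn, second = np.argsort(dists)[:2]                  # two nearest neighbors
        if dists[nn] < ratio_threshold * dists[second]:
            matches.append((i, int(nn)))
    return matches

# toy usage: correspondences would then be checked against ground truth to build
# Precision-Recall curves by sweeping the matching threshold
rng = np.random.default_rng(3)
scene = rng.random((100, 352))
model = scene[:10] + rng.normal(scale=0.01, size=(10, 352))  # noisy copies of 10 scene descriptors
print(match_ratio_of_distances(model, scene))
```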
These results show that the ZFC, EGC and AC pipelines using 6-bit quantization, as well as Type Coding, are notably effective, achieving high compression rates (between 96% and 98%) with a negligible loss in performance compared to the uncompressed SHOT descriptor. Among the compared approaches, AC and Type Coding yield the best compression rates, with AC based on 6-bit quantization performing slightly better than Type Coding (average compression rate 97.71% vs. 97.66%). Moreover, as discussed previously, Type Coding with parameters tuned to achieve a performance level comparable to AC turns out to be significantly less efficient. In particular, with our implementation, encoding with Type Coding is on average between 3 and 4 times slower than encoding with ZT, quantization and AC (i.e. 0.26 ms vs. 0.07 ms per descriptor), while decoding can be up to two orders of magnitude slower (i.e. 0.58 ms vs. 0.05 ms per descriptor). Therefore, the pipeline based on AC seems the preferred choice to attain a compact SHOT descriptor for 3D shape data.
These findings are confirmed by the results of the experiments on RGB-D data (i.e. using both 3D shape and color), as shown in
Again, 6-bit ZFC, EGC and AC as well as Type Coding exhibit a performance level indistinguishable from the uncompressed SHOT while providing excellent compression rates. Also with RGB-D data, 6-bit AC appears to be the best compact descriptor, due to its higher average compression rate (98.58% vs. 98.11% for Type Coding) and lower computational complexity with respect to Type Coding.
Other compact 3D descriptors, such as the hybrid 2D/3D global descriptor discussed above as prior art, achieve an inferior compression rate (92.6%) compared to the embodiments described herein.
Finally, experiments were also run using a state-of-the-art 3D keypoint detector instead of random sampling. The results confirmed the trend observed with random keypoint selection, as regards both the compression rates and the negligible accuracy loss with respect to the uncompressed descriptor (the reader is referred to the supplementary material for the details concerning the experiment using this detector).
The above embodiments demonstrate how the use of suitable compression techniques can greatly reduce the redundancy of a state-of-the-art 3D descriptor, providing a dramatic shrinking of the descriptor size with a negligible loss in performance. Among the considered approaches, the one based on Arithmetic Coding is preferable to Type Coding, the latter being the compression method deployed by the best-known compact image descriptor (i.e. CHoG). A key intuition behind the devised compression pipelines is to leverage the sparsity of the considered 3D descriptor, a feature that is likely to be advantageous also with several other 3D descriptors relying on a volumetric support. Embodiments described herein may be used for searching and knowledge discovery in large remote databases given query 3D data sensed by next-generation mobile devices and robots.
One skilled in the art will understand that even though various embodiments and advantages of the present invention have been set forth in the foregoing description, the above disclosure is illustrative only, and changes may be made in detail, and yet remain within the broad principles of the invention. For example, many of the components described above may be implemented using either digital or analog circuitry, or a combination of both, and also, where appropriate, may be realized through software executing on suitable processing circuitry. It should also be noted that the functions performed can be combined to be performed by fewer elements or process steps depending upon the actual embodiment being used in the system 100 of
The present application claims benefit of U.S. Provisional Patent Application Nos. 61/596,149, 61/596,111, and 61/596,142, all filed Feb. 7, 2012, and all of which are incorporated herein by reference in their entireties.