This invention relates to an approach to efficient computation of inner products, and in particular relates to efficient inner product computation in image or video processing.
A number of image and video analysis approaches involve computation of feature vector representations for an entire image or video, or for portions (e.g., spatial patches) of such images or videos. Approaches to determining the similarity of vector representations include distance-based and direction-based approaches. An example of a distance-based approach uses a Euclidean distance (i.e., the square root of the sum of squared differences of corresponding elements of the vectors), while an example of a direction-based approach uses an inner product metric (i.e., a sum of the products of corresponding elements of the vectors). Some approaches involve projection of a vector representation onto basis vectors from a predetermined set. Such projections also involve inner product calculations.
Projection approaches include basis selection approaches in which the basis vectors used to represent a particular feature vector are selected from a larger predetermined “dictionary” of basis vectors. One such approach is called “Orthogonal Matching Pursuit” (OMP), in which a series of sequential decisions to add basis vectors to the representation is made. These decisions involve computation of inner products between the as-yet unselected basis vectors from the dictionary and a residual vector formed from the component of the feature vector not yet represented in the span of the selected basis vectors.
One prior approach to computation of an inner product between two vectors u and ν uses a random projection technique. The Johnson-Lindenstrauss theorem is a basis for “Locality Sensitive Hashing” (LSH), in which, for a given data vector ν, a bit vector h(ν)∈{0,1}ᵖ is computed such that

hᵢ(ν)=└rᵢᵀν┘ for i=1, . . . , p.
Here, the rᵢ are random projection vectors, and p is the number of projections. Let └x┘ denote an operator such that └x┘=1 if x≥0 and └x┘=0 otherwise, applied element-wise to vectors. Let P be a p×m projection matrix of random vectors, P=[r₁ . . . rₚ]ᵀ. The bit-vector construction can then be written as h(ν)=└Pν┘.
As a consequence of the Johnson-Lindenstrauss theorem, the dot product between two data vectors u and ν can be approximated from the Hamming distance between their bit vectors, ∥h(u)−h(ν)∥₁. In particular, the angle between u and ν is approximately π∥h(u)−h(ν)∥₁/p, so that uᵀν≈∥u∥ ∥ν∥ cos(π∥h(u)−h(ν)∥₁/p).
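By way of a non-limiting illustration, the following Python sketch (using the NumPy library; the identifiers and dimensions are illustrative only) constructs the bit vectors and recovers an approximate inner product from the Hamming distance:

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 64, 256                    # data dimension, number of projections
P = rng.standard_normal((p, m))   # rows of P are the projection vectors r_i

def bits(v):
    # h(v) = |_ P v _| : a bit is 1 where the projection is >= 0, else 0
    return (P @ v >= 0).astype(np.uint8)

u = rng.standard_normal(m)
v = rng.standard_normal(m)
d = np.sum(bits(u) != bits(v))    # Hamming distance ||h(u) - h(v)||_1
estimate = np.linalg.norm(u) * np.linalg.norm(v) * np.cos(np.pi * d / p)
print(estimate, u @ v)            # approximate vs. exact inner product
```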
Another prior approach provides a way of choosing P to be sparse, with non-zero entries that are ±1. An approach referred to as “Comparison Random Projection” (CRP) uses a construction of P with independent entries

Pᵢⱼ = +1 with probability 1/(2q), 0 with probability 1−1/q, and −1 with probability 1/(2q),

for example, with q=1 or 3. (Any positive scale factor on P may be omitted, since only the signs of the projections enter the bit vectors.) Because the projection Pν does not require multiplications, the overall computation is reduced.
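A minimal, illustrative sketch of one way such a sparse ±1 matrix may be drawn (the scale factor often associated with this construction is omitted, since only the signs of the projections are used below) is:

```python
import numpy as np

def crp_matrix(p, m, q=3, seed=0):
    # Each entry is +1 or -1 with probability 1/(2q) each, and 0 with
    # probability 1 - 1/q, so rows are sparse for q > 1.
    rng = np.random.default_rng(seed)
    return rng.choice([1, 0, -1], size=(p, m),
                      p=[1 / (2 * q), 1 - 1 / q, 1 / (2 * q)])

P = crp_matrix(p=256, m=64, q=3)
# A product P @ v therefore reduces to short signed sums of elements of v.
```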
Another prior approach, the Fast Johnson-Lindenstrauss Transform (FJLT), provides a way of choosing P as a product of a sparse random projection (SRP) matrix P_SRP, with s non-zero elements per row drawn from a normal distribution, multiplied by a Hadamard matrix H and a random ±1 diagonal matrix D as:
P_FJLT = P_SRP H D
Note that for a feature vector ν, the projection P_FJLT ν can be computed as P_FJLT ν=P_SRP(HDν)=P_SRP ṽ, where ṽ=HDν. The transform HD has the effect of making ṽ non-sparse even if ν is sparse.
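The densifying effect of ṽ=HDν can be illustrated in a few lines (illustrative only; the scipy.linalg.hadamard routine constructs H for m a power of two):

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
m = 64
H = hadamard(m)                              # entries are +/-1
D = np.diag(rng.choice([1, -1], size=m))     # random +/-1 diagonal

v = np.zeros(m)
v[3] = 1.0                                   # a maximally sparse input
v_tilde = H @ D @ v
print(np.count_nonzero(v), np.count_nonzero(v_tilde))   # prints: 1 64
```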
There is a need for computationally efficient approaches to determining inner products between feature vectors, and more specifically, there is a need for efficient and accurate basis selection for techniques such as Orthogonal Matching Pursuit.
In one aspect, in general, a method for machine-implemented image feature processing includes accepting a data representation of a plurality of m-dimensional feature vectors ν representing a processing of an image or video signal. For each feature vector ν, a p-dimensional binary vector h is formed from the data representation of the feature vector ν using a first procedure. This first procedure includes transforming the feature vector ν to form a transformed feature vector ṽ, including applying a machine-implemented computation to compute elements of the transformed feature vector as an additive combination of elements of the feature vector. Each element of h is determined according to a sign of a selected additive combination of elements of ṽ. For each feature vector ν, a data transformation of the feature vector ν is formed by computing at least one approximation of an inner product of ν with another vector u using a second procedure that includes comparing the bit vector h formed from ν with a bit vector formed from u. An image analysis process is then performed according to the plurality of transformed feature vectors.
Advantages can include reducing or eliminating the need for multiplication operations, thereby increasing processing speed, reducing power requirements for processing, and/or reducing circuit complexity of processing hardware. Furthermore, the approach can improve the accuracy of the approximation of the inner product in situations in which one or both of the input vectors are sparse or have a small number of relatively larger magnitude elements. This advantage can be particularly significant when an input vector to an inner product is a residual vector in a sequential basis selection (e.g., Orthogonal Matching Pursuit) procedure.
Other features and advantages of the invention are apparent from the following description, and from the claims.
The use of a computationally efficient inner product is described in the context of an Orthogonal Matching Pursuit (OMP) for determining representations of feature vectors for image or video signals received by a signal analysis system. It should be understood that this same approach to efficient computation of inner products is applicable in a variety of other image and video analysis approaches.
Generally, the OMP approach can be summarized as follows, recognizing that computational implementations do not necessarily implement mathematical operations in the order or manner as shown. A dictionary Φ=[aᵢ; i=1, . . . , n] with aᵢ∈ℝᵐ is assumed, such that m≪n and ℝᵐ=Span(Φ), and such that the aᵢ are unit norm vectors. Very generally, the OMP process involves an iteration in which, at the pth step, the dictionary vector with index

kₚ = arg maxₖ |aₖᵀνₚ₋₁|

is selected, where νₚ is the residual (I−Pₚ)x of the feature vector x, Pₚ denoting the orthogonal projection onto the span of the first p selected dictionary vectors (so that ν₀=x).
Note that determining the pth index kₚ can take n−p+1 inner products. Although certain approximations of this step are known, the computation of the inner products in the selection of the dictionary elements remains a key computational requirement in OMP and similar approaches.
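For reference, a compact, non-limiting Python sketch of this baseline OMP iteration with exact inner products (using the NumPy library, with a least-squares solve standing in for the projection onto the selected span) is:

```python
import numpy as np

def omp(Phi, x, n_iter):
    """Baseline OMP. Phi: m x n dictionary, unit-norm columns; x: m-vector."""
    residual, selected = x.copy(), []
    for _ in range(n_iter):
        scores = np.abs(Phi.T @ residual)   # n - p + 1 useful inner products
        scores[selected] = -np.inf          # skip already-selected atoms
        selected.append(int(np.argmax(scores)))
        A = Phi[:, selected]                # re-fit coefficients, new residual
        coef, *_ = np.linalg.lstsq(A, x, rcond=None)
        residual = x - A @ coef
    return selected, coef
```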
Referring to the figure, a signal analysis system receives image or video signals, computes feature vectors from them, and passes the feature vectors to an OMP module 110, which produces transformed feature vectors 106.
The coefficients for the selected dictionary vectors aₖ, determined as described below, form the transformed feature vector a.
An approach to computing an inner product uᵀν between two vectors (e.g., between a dictionary vector and a feature vector or a residual of the feature vector) uses a comparison random projection approach. First, transformed vectors ũ=HDu and ṽ=HDν are computed. Note that multiplication by the Hadamard matrix H (which has entries ±1), by the random ±1 diagonal matrix D, or by their product does not involve multiplications. Rather, such a product involves only addition or subtraction of the elements of the vector being multiplied, with each element occurring exactly once in each sum (i.e., either as a positive or as a negative term); a sketch of this computation is given below. Therefore, this matrix product does not require multiplications (i.e., does not include multiplications or sets of additions that effectively implement multiplications). Note that because HᵀH=mI for an m×m Hadamard matrix H and DᵀD=I, it follows that uᵀν=ũᵀṽ/m. Then we define
h(ν)=└Pṽ┘=└PHDν┘,
and choose P to be the sparse matrix P_CRP defined above, with q=3, so that the inner product is approximated as

uᵀν ≈ ∥u∥ ∥ν∥ cos(π∥h(u)−h(ν)∥₁/p).
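As an illustration of the earlier observation that multiplication by HD reduces to per-element sign flips followed by additions and subtractions, the following Python sketch (illustrative only; it implements an unnormalized fast Walsh-Hadamard butterfly for m a power of two) also checks the identity uᵀν=ũᵀṽ/m:

```python
import numpy as np

def hd_transform(v, signs):
    """Compute H D v using only additions/subtractions (fast Walsh-Hadamard).

    signs: the +/-1 diagonal of D; len(v) must be a power of two.
    """
    x = np.where(signs > 0, v, -v).astype(float)   # D v: sign flips only
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b                     # butterfly of sums...
            x[i + h:i + 2 * h] = a - b             # ...and differences
        h *= 2
    return x

rng = np.random.default_rng(0)
m = 8
signs = rng.choice([1, -1], size=m)
u, v = rng.standard_normal(m), rng.standard_normal(m)
# u^T v equals (HDu)^T (HDv) / m, since H^T H = mI and D^T D = I
print(u @ v, hd_transform(u, signs) @ hd_transform(v, signs) / m)
```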
Note that because P_CRP has non-zero entries that are ±1, the matrix computation Pṽ does not require multiplications in its implementation. When the original feature vector ν is of dimension m, H and D are m×m matrices, and P_CRP is a p×m rectangular matrix, generally with p>m and s non-zero entries per row, s/2 positive and s/2 negative. For example, for m=64 (e.g., pixel values in 8×8 patches), p=248 and s=2, 8, 16, 24, or 32.
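Combining the pieces, a non-limiting end-to-end sketch of the bit-vector construction and the resulting inner product approximation (dimensions follow the example above; NumPy's dense matrix product stands in for multiplication-free hardware) is:

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(1)
m, p, q = 64, 248, 3
# Sparse 0/+1/-1 projection drawn as in the CRP construction above
P = rng.choice([1, 0, -1], size=(p, m), p=[1/(2*q), 1 - 1/q, 1/(2*q)])
HD = hadamard(m) * rng.choice([1, -1], size=m)   # H times the diagonal D

def h(v):
    # h(v) = |_ P H D v _|; every matrix entry is 0 or +/-1, so both
    # products reduce to signed additions (dense matmul used for brevity)
    return (P @ (HD @ v) >= 0).astype(np.uint8)

u, v = rng.standard_normal(m), rng.standard_normal(m)
d = int(np.sum(h(u) != h(v)))                    # ||h(u) - h(v)||_1
approx = np.linalg.norm(u) * np.linalg.norm(v) * np.cos(np.pi * d / p)
print(approx, float(u @ v))                      # approximation vs. exact
```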
Referring again to the figure, the OMP module 110 makes use of this comparison-based approximation of the inner product as follows.
Computation of a transformed feature vector a corresponding to the input vector x involves an iteration processing successive residuals νₚ 142. A random projection module 140 computes the projected residual h(νₚ)=└PHDνₚ┘, starting from h(ν₀)=└PHDx┘ for the initial residual ν₀=x. At the pth iteration, the OMP procedure involves a search

kₚ = arg maxₖ |cos(π∥h(aₖ)−h(νₚ₋₁)∥₁/p)|,

where the Hamming distance ∥h(aₖ)−h(νₚ₋₁)∥₁ is computed by a comparison inner product module 150, which also controls the search (arg max) over dictionary elements k. The comparison inner product module 150 augments the selected basis 152 with kₚ. A residual and basis computation module 160 then determines the best coefficient vector a for the selected dictionary items, and computes the next residual νₚ, which it passes to the random projection module 140, which updates the projected residual h(νₚ) 142 for the next iteration.
After all the iterations (e.g., determined by a stopping rule such as a total number of iterations, a characteristic of the residual, etc.), the transformed feature vector a 162 is output from the OMP module 110 as one of the transformed feature vectors 106.
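As a non-limiting software sketch of one possible organization of the iteration performed by modules 140, 150, and 160 (illustrative identifiers; a fixed iteration count stands in for the stopping rule, and the Hamming-distance approximation replaces exact inner products in the selection):

```python
import numpy as np
from scipy.linalg import hadamard

def comparison_omp(Phi, x, n_iter, p=248, q=3, seed=0):
    """OMP with atom selection via bit-vector Hamming distances.

    Phi: m x n dictionary with unit-norm columns; m a power of two.
    """
    rng = np.random.default_rng(seed)
    m, n = Phi.shape
    P = rng.choice([1, 0, -1], size=(p, m), p=[1/(2*q), 1 - 1/q, 1/(2*q)])
    HD = hadamard(m) * rng.choice([1, -1], size=m)
    bits = lambda v: P @ (HD @ v) >= 0           # h(v) = |_ P H D v _|

    atom_bits = bits(Phi)                        # p x n bits (module 140's role)
    residual, selected = x.copy(), []
    for _ in range(n_iter):
        # Module 150's role: score atoms by the Hamming-distance
        # approximation of |a_k^T residual| and take the arg max
        d = np.sum(atom_bits != bits(residual)[:, None], axis=0)
        score = np.abs(np.cos(np.pi * d / p))
        score[selected] = -np.inf
        selected.append(int(np.argmax(score)))
        # Module 160's role: least-squares coefficients and the new residual
        A = Phi[:, selected]
        coef, *_ = np.linalg.lstsq(A, x, rcond=None)
        residual = x - A @ coef
    return selected, coef
```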
Note that application of the transformation HD to the feature and dictionary vectors as described above is only one example of a broader range of transforms that preserve the inner product and that generally reduce the sparse nature of the vectors. Other choices include wavelet transforms, etc. Furthermore, the sparse random projection P_CRP can be replaced with other matrices with entries 0 and ±1.
The techniques described above have been applied experimentally to scene analysis and to a novel video classification application. The video classification application involves applying the OMP algorithm to visual feature vectors, e.g., Scale Invariant Feature Transform (SIFT) or Speeded-Up Robust Features (SURF) vectors. After OMP, distribution statistics of the resultant projection coefficients are computed, and video classification is then performed based on these statistics. The video classification application is especially computationally challenging, requiring the use of the OMP algorithm on a very large number of feature vectors. To give an example of the amount of computation, typical feature extraction techniques result in approximately a thousand vectors per video frame; therefore, even a short five-minute video clip captured at a 30 Hz frame rate will have about 9 million feature vectors. Several current video research projects attempt to analyze data sets having hundreds of thousands of videos, effectively requiring processing of billions of feature vectors. The algorithmic speedup described here becomes especially crucial in such computationally intensive applications. Further examples, applications, and comparisons to other techniques are found in “Efficient Orthogonal Matching Pursuit using sparse random projections for scene and video classification,” Proc. 2011 IEEE International Conference on Computer Vision (ICCV), 6-13 Nov. 2011, pp. 2312-2319, which is incorporated herein by reference.
Implementations of the approach described above can include software, hardware, or a combination of hardware and software. For example, a hardware implementation may include special-purpose circuitry for computing the term ∥h(aₖ)−h(νₚ₋₁)∥₁. Software can include instructions for causing a data processing system to perform steps of the approaches. The data processing system can include a special-purpose processor, a general purpose processor, a signal processor, etc. The instructions can be machine-level instructions, or may be represented in a programming language, which may be compiled or interpreted. The software may be stored on a non-transitory medium (e.g., a volatile or non-volatile memory). In some examples, a system includes image acquisition modules and/or feature extraction modules integrated together with the feature processing involving the inner product implementations as described above. Some such examples may include integration within a single integrated circuit or multi-chip module. In some examples, a data representation in a hardware description language (e.g., Verilog) of circuitry for implementing an approach described above may be stored on a non-transitory medium and provided to impart functionality on a device specification system that is used as part of a process of designing and manufacturing integrated circuits embodying the approach.
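As one illustration of why the term ∥h(aₖ)−h(νₚ₋₁)∥₁ is well suited to such circuitry, the Hamming distance between bit vectors packed into machine words reduces to an exclusive-or followed by a population count, sketched here in Python (3.10+ for int.bit_count):

```python
def hamming(a: int, b: int) -> int:
    # ||h_u - h_v||_1 for bit vectors packed into integers:
    # one XOR followed by a population count; no per-element arithmetic
    return (a ^ b).bit_count()

print(hamming(0b10110, 0b01110))   # prints: 2
```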
It is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the appended claims. Other embodiments are within the scope of the following claims.