The nearest neighbor classifier is one of the most popular approaches to classification. It is naturally suited to tasks involving many classes; for example, there can be thousands of classes in face recognition and in OCR for Chinese characters. The success of a nearest neighbor classifier depends heavily on the quality of the distance metric over the data, so metric learning has become an important component of machine learning. A learned distance metric can also be transferred to similar tasks. For example, a distance metric can be learned from a group of subjects with many training face images, and the learned metric can then be used to recognize a different group of subjects with only one face image per subject. One major issue of metric learning is its prohibitive computational cost, because the training algorithms typically operate on pairs or triples of training examples.
Many approaches exist for distance metric learning. Unsupervised approaches, such as the PCA family, have been widely used; for example, Eigenface has been used for face recognition and gender recognition. In cases where additional label information is available, supervised approaches generally lead to higher-quality distance metrics. Among supervised approaches, linear discriminant analysis, such as Fisherface, is widely used because of its simplicity and relatively high quality. Machine learning practitioners usually pursue approaches that yield higher-quality metrics, as the cost of computation keeps decreasing.
To learn a metric for nearest neighbor classifiers in many-class problems, we prefer nearest-neighbor-based approaches, because triple constraints are weaker than pairwise constraints and are directly related to the decision rule of nearest-neighbor classification: to classify correctly, we essentially need to ensure the triple-wise relationship that the distance between an instance x and an instance y from the same class is smaller than the distance between x and an instance z from a different class, while caring less about the absolute values of pairwise distances. In the remainder of this document, we call this triple-wise approach nearest-neighbor-based metric learning.
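For concreteness, a minimal statement of this triple-wise relationship, using the squared distance d²(·,·; A) defined in the detailed description below, is:

```latex
d^2(y, x; A) \;<\; d^2(z, x; A)
\qquad \text{whenever } y \sim x \text{ (same class) and } z \not\sim x \text{ (different class)}.
```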
In one aspect, a nearest-neighbor-based distance metric learning process includes applying an exponential-based loss function to provide a smooth objective; and determining an objective and a gradient of both hinge-based and exponential-based loss functions, using a computer, in time quadratic in the number of instances.
Implementations of the above aspects may include one or more of the following. The method includes adding a regularization term to control a generalization error. The method includes adding trace norm constraints to control a generalization error. The method includes using an exponential-based loss function to generate a smooth loss objective. A hinge loss function can be used for learning metrics. The method includes using a sorted order to determine the objective and gradient. An exponential-based loss function for learning metrics can be used. A within-class soft-max distance and a between-class soft-min distance can be used to determine the objective and gradient; these distances are defined over the within-class and between-class squared distances, respectively.
The method includes using the concavity of x^p (0 &lt; p &lt; 1) and log(1 + bx) (b &gt; 0) to determine an upper bound of a loss function. The learned distance metric can be used to classify, recognize, or retrieve data. The regularization terms and constraint terms control the generalization error and reduce the overall error. The loss function and its gradient can be expressed in terms of the following quantities: Z_{x,y} is the set of data z not belonging to the class of x and satisfying 1 + d²(y,x) ≥ d²(z,x); Y_{x,z} is the set of data y belonging to the same class as x and satisfying 1 + d²(y,x) ≥ d²(z,x); w_{x,v} takes one value if v is in the same class as x and another value if v is not; X is a p×N matrix whose j-th column is the feature vector of x_j; and W is an N×N matrix whose (i,j)-th element is w_{x_i,x_j}.
The method includes determining an exponential type of surrogate function ψ(ξ) = ξ^ρ, where ρ ∈ (0, 1], and the corresponding gradient with respect to the squared distance.
The method includes determining a logit type of surrogate function with a parameter β &gt; 0, and the corresponding gradient with respect to the squared distance.
The method includes determining a gradient matrix from X, W, and S, where W is an N×N matrix whose (i,j)-th element is w_ij and S is an N×N diagonal matrix whose i-th diagonal element is Σ_j (w_ij + w_ji).
Advantages of the preferred system may include one or more of the following. The system efficiently determines a distance metric, which is a crucial component for nearest neighbor classification and information retrieval. The system uses lower-complexity techniques for determining loss functions and their gradients. The system also applies smooth surrogate loss functions to achieve better convergence rates. Evaluation on a number of datasets shows the benefit of efficient computation of gradients and the fast convergence rates of smooth loss functions. Overall, the system is advantageous in that it has:
1. Less complexity in the processing.
2. Faster operation in learning the distance metric.
3. Higher quality in the learned metric.
Other embodiments compute gradients for smooth surrogate functions. The smoothness of those loss functions makes it possible to use Nesterov's method to achieve a faster convergence rate.
Next, details of the distance determination are discussed. One embodiment uses a squared distance metric, defined as d²(x,y; A) = (x − y)ᵀ A (x − y) for all pairs of instances x and y, where A is a positive semi-definite p×p matrix. The nearest-neighbor error is
ε = Pr( d²(y,x; A) ≥ d²(z,x; A) | y ~ x, z ≁ x ),   (1)
where y ~ x denotes that x and y belong to the same class, and z ≁ x denotes that z belongs to another class. Usually, we assume that y is uniformly sampled from the class that x belongs to, and that z is uniformly sampled from the classes that x does not belong to. The goal of this metric learning problem is to learn the parameter A from a set of N training instances so as to minimize the error ε in Eq. (1).
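As an illustration, the squared distance and a Monte Carlo estimate of the error in Eq. (1) might be computed as follows; this is a minimal sketch, and the function and variable names are illustrative rather than part of the described system.

```python
import numpy as np

def squared_dist(x, y, A):
    """Squared distance d^2(x, y; A) = (x - y)^T A (x - y)."""
    d = x - y
    return float(d @ A @ d)

def empirical_nn_error(X, labels, A, n_samples=2000, seed=0):
    """Monte Carlo estimate of Eq. (1): the probability that a same-class
    instance y is at least as far from x as a different-class instance z.
    X is p x N (columns are instances)."""
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    errors, trials = 0, 0
    for _ in range(n_samples):
        i = rng.integers(N)
        same = np.flatnonzero((labels == labels[i]) & (np.arange(N) != i))
        diff = np.flatnonzero(labels != labels[i])
        if len(same) == 0 or len(diff) == 0:
            continue
        y = X[:, rng.choice(same)]
        z = X[:, rng.choice(diff)]
        trials += 1
        if squared_dist(y, X[:, i], A) >= squared_dist(z, X[:, i], A):
            errors += 1
    return errors / max(trials, 1)
```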
Let φ be a nondecreasing surrogate function with φ(0) = 1. By Markov's inequality, the nearest-neighbor error is bounded by ε ≤ l(A) + R_N,   (2)
where l(A) = E_{x, y~x, z≁x}[ φ( d²(y,x; A) − d²(z,x; A) ) ] is the true expectation over all instances, and R_N is the generalization error.
The concept of target neighbors is introduced to choose y from a subset of instances in the same class that are close to instance x. With this concept, we modify the notation y ~ x to mean that y is a target neighbor of x, where y is assumed to be uniformly sampled from the target neighbors of x. Our tests show that choosing more target neighbors (or all same-class instances) is preferred when the dimension is high and/or the number of exemplar instances is small.
To minimize the prediction error, the regularized framework in Eq. (3) minimizes l(A) and controls R_N, according to the analysis of the generalization error in Section 3.
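One plausible form of the framework in Eq. (3), consistent with the terms described in the next sentence (the trade-off parameter λ and the trace bound η are assumptions rather than values given in the original text), is:

```latex
\min_{A \succeq 0} \; f(A) \;=\; \hat{l}(A) + \lambda\,\|A\|_F^2
\qquad \text{subject to} \qquad \operatorname{tr}(A) \le \eta,
```

where l̂(A) is the empirical counterpart of l(A) over the N training instances.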
where ∥A∥_F is the Frobenius norm of A, which controls the generalization error, and the trace tr(A) further controls the generalization error and enforces that A is of low rank. Let A be the matrix that optimizes f. For simplicity, we write d²(x,y) for d²(x,y; A).
To solve the constrained problem of Eq. (3), we can use a projected gradient descent method. In each step, the process performs a gradient step followed by a projection onto the feasible set.
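A minimal sketch of one such step is shown below. It assumes the feasible set is the positive semi-definite cone with a trace bound η; the projection shown clips negative eigenvalues and then rescales to satisfy the trace bound, which is a simple heuristic rather than the exact Euclidean projection, and all names are illustrative.

```python
import numpy as np

def project_psd_trace(A, eta):
    """Project A onto the symmetric PSD cone, then rescale so tr(A) <= eta.
    The rescaling step is a heuristic approximation of the exact projection."""
    A = 0.5 * (A + A.T)               # symmetrize
    w, V = np.linalg.eigh(A)
    w = np.clip(w, 0.0, None)         # drop negative eigenvalues
    tr = w.sum()
    if tr > eta:
        w *= eta / tr                 # enforce the trace constraint
    return (V * w) @ V.T

def projected_gradient_step(A, grad, step, eta):
    """One iteration: gradient step on the objective, then projection."""
    return project_psd_trace(A - step * grad, eta)
```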
An efficient process for the hinge loss is discussed next. When we use the hinge loss function φ(ξ) = [1 + ξ]_+, Problem (3) differs from the large margin nearest neighbor classifier (LMNN) in its normalization, regularization, and trace constraint. LMNN does not normalize the loss for each instance. In practice, when the number of target neighbors for each class is the same, the overall empirical loss of LMNN and that in Eq. (2) are not much different as long as no class dominates the dataset. In more general cases, Eq. (2) shows that the normalized loss bounds the expected error. In addition, LMNN uses the total within-class squared distance as regularization, while we use the Frobenius norm as regularization based on the generalization analysis. The trace constraint is used to enforce low rank and has a certain effect on the generalization error based on our analysis.
In one embodiment, the loss function and its gradient can be written in terms of the following quantities: Z_{x,y} is the set of data z not belonging to the class of x and satisfying 1 + d²(y,x) ≥ d²(z,x); Y_{x,z} is the set of data y belonging to the same class as x and satisfying 1 + d²(y,x) ≥ d²(z,x); w_{x,v} takes one value if v is in the same class as x and another value if v is not; X is a p×N matrix whose j-th column is the feature vector of x_j; and W is an N×N matrix whose (i,j)-th element is w_{x_i,x_j}.
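One plausible form of the normalized hinge loss that is consistent with these definitions (the exact per-instance normalization is an assumption) is:

```latex
\hat{l}(A) \;=\; \frac{1}{N}\sum_{x}\frac{1}{|Y_x|\,|Z_x|}
\sum_{y \in Y_x}\sum_{z \in Z_x}
\bigl[\,1 + d^2(y, x; A) - d^2(z, x; A)\,\bigr]_+ ,
```

where Y_x denotes the target neighbors of x and Z_x the instances from other classes; the sets Z_{x,y} and Y_{x,z} then collect exactly the triples whose hinge terms are active.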
The efficient computation of the loss function and its gradient depends on whether we can compute all |Y_{x,z}|, |Z_{x,y}|, and the corresponding sums of squared distances over these sets efficiently.
One exemplary process determines the objective and gradient of the LMNN-style hinge loss by exploiting the sorted order of distances, as sketched below.
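A minimal Python sketch of such a process, assuming the normalized hinge loss form sketched above; the normalization, the choice of target neighbors, and all helper names are assumptions. For each anchor it sorts the between-class distances once, so the counts |Z_{x,y}| and |Y_{x,z}| and the associated sums come from binary searches and prefix sums, and the gradient is assembled as X(S − W − Wᵀ)Xᵀ.

```python
import numpy as np

def hinge_objective_and_weights(X, labels, A, n_targets=3):
    """Normalized hinge objective and the pair-weight matrix W.
    X is p x N (columns are instances); A is a p x p PSD matrix."""
    p, N = X.shape
    AX = A @ X
    sq = np.sum(X * AX, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ AX)   # all squared distances

    loss = 0.0
    W = np.zeros((N, N))                                # W[i, j] = w_{x_i, x_j}
    for i in range(N):
        same = np.flatnonzero((labels == labels[i]) & (np.arange(N) != i))
        diff = np.flatnonzero(labels != labels[i])
        if len(same) == 0 or len(diff) == 0:
            continue
        targets = same[np.argsort(D2[i, same])[:n_targets]]   # target neighbors
        d_diff = np.sort(D2[i, diff])
        prefix = np.concatenate(([0.0], np.cumsum(d_diff)))
        norm = 1.0 / (len(targets) * len(diff))
        for y in targets:
            thr = 1.0 + D2[i, y]
            k = int(np.searchsorted(d_diff, thr, side='right'))  # |Z_{x,y}|
            loss += norm * (k * thr - prefix[k])   # sum of active hinge terms
            W[i, y] += norm * k
        d_t = np.sort(1.0 + D2[i, targets])
        for z in diff:
            m = len(d_t) - int(np.searchsorted(d_t, D2[i, z], side='left'))  # |Y_{x,z}|
            W[i, z] -= norm * m
    return loss / N, W / N

def hinge_gradient(X, W):
    """Gradient via sum_{i,j} W[i,j](x_i - x_j)(x_i - x_j)^T = X (S - W - W^T) X^T."""
    S = np.diag(W.sum(axis=1) + W.sum(axis=0))
    return X @ (S - W - W.T) @ X.T
```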
To achieve the optimal convergence rate, we need smooth surrogate functions. However, certain embodiments cannot be modified to work for general surrogate functions. To achieve efficient computation of the loss functions and their gradients, we introduce a special family of exponential-based smooth surrogate functions. Let φ(ξ) = ψ(exp(ξ)), where ψ is a concave nondecreasing function with ψ(0+) = 0 and ψ(1) = 1. Because of the concavity of ψ, the triple-wise expectation for each instance x can be bounded in terms of a soft-max distance δ+(x) and a soft-min distance δ−(x), as sketched below.
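A plausible form of this bound, under assumed definitions of δ+(x) and δ−(x) that match their description below as a soft-max of within-class squared distances and a soft-min of between-class squared distances (the exact averaging over Y_x and Z_x is an assumption), is:

```latex
\mathbb{E}_{y \sim x,\, z \not\sim x}\!\left[\psi\!\left(e^{\,d^2(y,x) - d^2(z,x)}\right)\right]
\;\le\;
\psi\!\left(\mathbb{E}_{y,z}\!\left[e^{\,d^2(y,x) - d^2(z,x)}\right]\right)
\;=\;
\psi\!\left(e^{\,\delta_+(x) - \delta_-(x)}\right),
\quad \text{where} \quad
\delta_+(x) = \log \frac{1}{|Y_x|}\sum_{y \in Y_x} e^{\,d^2(y,x)},
\qquad
\delta_-(x) = -\log \frac{1}{|Z_x|}\sum_{z \in Z_x} e^{-d^2(z,x)}.
```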
Clearly, δ+(x) is the soft-max of the squared distances of all instances similar to x, and δ−(x) is the soft-min of the squared distances of all instances not similar to x. The concavity of ψ allows us to reduce the comparison among triples to a comparison between the soft-max of the within-class distances and the soft-min of the between-class distances, which enables us to compute the empirical loss function and its gradient within O(N²p + Np²) time. The details are shown in the following examples of ψ:
1. ψ(ξ) = ξ^ρ, where ρ ∈ (0, 1]. This is the exponential type of surrogate function; its gradient with respect to the squared distance follows from the chain rule (see the sketch after this list).
2. A logit type of surrogate function with a parameter β &gt; 0; its gradient with respect to the squared distance likewise follows from the chain rule.
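A sketch of plausible concrete forms, under the assumed definitions of δ+(x) and δ−(x) above; the specific logit form of ψ shown here is an assumption, chosen because it is concave, nondecreasing, and satisfies ψ(0+) = 0 and ψ(1) = 1:

```latex
\text{Exponential type:}\quad
\varphi(\xi) = \psi(e^{\xi}) = e^{\rho \xi},
\qquad
\frac{\partial \varphi}{\partial \xi} = \rho\, e^{\rho \xi};
\qquad
\text{Logit type:}\quad
\psi(\xi) = \frac{\log(1 + \beta \xi)}{\log(1 + \beta)},
\quad
\varphi(\xi) = \frac{\log\!\left(1 + \beta e^{\xi}\right)}{\log(1 + \beta)},
\qquad
\frac{\partial \varphi}{\partial \xi} = \frac{\beta e^{\xi}}{\left(1 + \beta e^{\xi}\right)\log(1 + \beta)}.
```

Evaluating these at ξ = δ+(x) − δ−(x) and multiplying by the soft-max or soft-min weight of each individual squared distance gives the gradient with respect to that distance.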
For all these cases, the smoothed loss l̃(A) is convex with respect to A. The gradient matrix can be expressed in terms of X, W, and S, where W is an N×N matrix whose (i,j)-th element is w_ij and S is an N×N diagonal matrix whose i-th diagonal element is Σ_j (w_ij + w_ji).
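Given these definitions, a plausible form of the gradient matrix, following the standard identity Σ_{i,j} w_ij (x_i − x_j)(x_i − x_j)ᵀ = X(S − W − Wᵀ)Xᵀ (any overall scaling by the number of instances is an assumption), is:

```latex
\nabla_A \tilde{l}(A) \;=\; X \left( S - W - W^{\mathsf T} \right) X^{\mathsf T}.
```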
For each instance, we can compute all distances within O(Np) time, each δ+(x) and δ−(x) within O(N) time, and all w_ij within O(N) time, with an additional O(p²) for the matrix gradient computation. Thus the total time for l̃ and its gradient is O(N²p + Np²). Given the above equations, the algorithm for computing the losses and their derivatives is straightforward, so we do not list the pseudo code here. The memory consumption can be limited to O(N + p²).
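A minimal Python sketch of the exponential-type smooth loss and its gradient, under the assumed per-anchor loss φ(δ+(x) − δ−(x)) = e^{ρ(δ+(x) − δ−(x))} and the soft-max/soft-min definitions sketched earlier; normalization choices and helper names are assumptions.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def smooth_loss_and_gradient(X, labels, A, rho=0.5):
    """Exponential-type surrogate loss and gradient in O(N^2 p + N p^2) time.
    X is p x N (columns are instances); A is a p x p PSD matrix."""
    p, N = X.shape
    AX = A @ X
    sq = np.sum(X * AX, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ AX)   # pairwise squared distances

    loss = 0.0
    W = np.zeros((N, N))          # W[i, j] = weight of pair (x_i, x_j) in the gradient
    for i in range(N):
        same = np.flatnonzero((labels == labels[i]) & (np.arange(N) != i))
        diff = np.flatnonzero(labels != labels[i])
        if len(same) == 0 or len(diff) == 0:
            continue
        # soft-max of within-class and soft-min of between-class squared distances
        delta_plus = logsumexp(D2[i, same]) - np.log(len(same))
        delta_minus = -(logsumexp(-D2[i, diff]) - np.log(len(diff)))
        l_i = np.exp(rho * (delta_plus - delta_minus))   # per-anchor surrogate loss
        loss += l_i
        # chain rule: spread the per-anchor gradient over individual distances
        W[i, same] += rho * l_i * softmax(D2[i, same])
        W[i, diff] -= rho * l_i * softmax(-D2[i, diff])
    loss /= N
    W /= N
    # sum_{i,j} W[i,j](x_i - x_j)(x_i - x_j)^T = X (S - W - W^T) X^T
    S = np.diag(W.sum(axis=1) + W.sum(axis=0))
    return loss, X @ (S - W - W.T) @ X.T
```

In practice the squared distances are often rescaled before exponentiation to avoid overflow; the logsumexp calls above only stabilize the δ± terms, not the per-anchor loss itself.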
In tests, we find that the performance is usually sensitive to the choice of the number of target neighbors. For classification, if we have a large number of exemplar instances, a small number of target neighbors usually results in better performance, because exemplar instances far from the testing instance in the Euclidean metric are usually also far from it in the learned metric when the dimension is low; the uniform-distribution assumption for selecting exemplar instances then does not hold, and choosing a smaller number of target neighbors can benefit performance. However, in retrieval, the exemplar instance is simply the query instance, so it is better to use most of the instances as target neighbors, even if they are not that similar to the query instance.
The system is based on an analysis of the generalization error of nearest-neighbor-based distance metric learning approaches. The analysis suggests using the regularized minimization framework for distance metric learning to control the generalization error. The system uses efficient techniques to compute the objective loss functions and their gradients for the hinge, exponential, and logit losses within O(N²p + Np² + p³) time and O(N + p²) working memory. Embodiments of the system use smooth surrogate functions for faster convergence rates than the nonsmooth hinge loss function. Tests confirm the accuracy of these approaches and the efficiency of the computation.
The invention may be implemented in hardware, firmware or software, or a combination of the three. Preferably the invention is implemented in a computer program executed on a programmable computer having a processor, a data storage system, volatile and non-volatile memory and/or storage elements, at least one input device and at least one output device.
By way of example, a block diagram of a computer to support the system is discussed next. The computer preferably includes a processor, random access memory (RAM), a program memory (preferably a writable read-only memory (ROM) such as a flash ROM) and an input/output (I/O) controller coupled by a CPU bus. The computer may optionally include a hard drive controller which is coupled to a hard disk and CPU bus. Hard disk may be used for storing application programs, such as the present invention, and data. Alternatively, application programs may be stored in RAM or ROM. I/O controller is coupled by means of an I/O bus to an I/O interface. I/O interface receives and transmits data in analog or digital form over communication links such as a serial link, local area network, wireless link, and parallel link. Optionally, a display, a keyboard and a pointing device (mouse) may also be connected to I/O bus. Alternatively, separate connections (separate buses) may be used for I/O interface, display, keyboard and pointing device. Programmable processing system may be preprogrammed or it may be programmed (and reprogrammed) by downloading a program from another source (e.g., a floppy disk, CD-ROM, or another computer).
Each computer program is tangibly stored in a machine-readable storage media or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage media or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
The invention has been described herein in considerable detail in order to comply with the patent Statutes and to provide those skilled in the art with the information needed to apply the novel principles and to construct and use such specialized components as are required. However, it is to be understood that the invention can be carried out by specifically different equipment and devices, and that various modifications, both as to the equipment details and operating procedures, can be accomplished without departing from the scope of the invention itself.
Although specific embodiments of the present invention have been illustrated in the accompanying drawings and described in the foregoing detailed description, it will be understood that the invention is not limited to the particular embodiments described herein, but is capable of numerous rearrangements, modifications, and substitutions without departing from the scope of the invention. The following claims are intended to encompass all such modifications.
The present application claims priority to Provisional Application Ser. No. 61/491,606 filed May 31, 2011, the content of which is incorporated by reference.
Other Publications:
Kilian Weinberger, John Blitzer, and Lawrence Saul. Distance metric learning for large margin nearest neighbor classification. In Y. Weiss, B. Scholkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pp. 1473-1480. MIT Press, Cambridge, MA, 2006.
Kilian Q. Weinberger and Lawrence K. Saul. Fast solvers and efficient implementations for distance metric learning. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008.