The invention relates generally to local supervised learning, and more particularly to selecting a subset of training points from a training data set based on a single query point.
In supervised learning, training data are used to estimate a function that predicts an unknown output from a given input vector, called a query point. In local learning methods, for a given query point, the function is determined by training points that are “near” the query point. Nearness can be determined by some distance metric.
Examples of local learning methods include nearest-neighbor regression and classification, and locally weighted regression. Two example applications include prediction of future values of a time series based on past values, and detection of whether a particular object is present in an image based on pixel values.
In such problems, the training data set D is a set of pairs D={(x1, y1), . . . , (xM, yM)}⊂χ×Y, where χ denotes the space of input patterns, e.g., χ=R^C, and Y the space of outputs. Each pair includes an input vector xi and an output yi. A function ŷ=F(x), which estimates the output from the corresponding input vector, is learned from the training data set.
In local learning methods, for each query point xq, the local function F(x) is learned based on only the training data points in the training set that are near the input query point xq. The training points that are near the query point are usually selected from the k nearest points in the training dataset according to the distance metric. Alternatively, the selected training points are less than some distance threshold d from the query point.
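The two conventional selection rules described above can be sketched as follows. This is a minimal illustration, not the invention's method; the function names and the choice of the Euclidean metric are assumptions for the example.

```python
import numpy as np

def k_nearest(X, x_q, k):
    """Select the k training points nearest to the query x_q (Euclidean metric assumed)."""
    d = np.linalg.norm(X - x_q, axis=1)
    return X[np.argsort(d)[:k]]

def within_radius(X, x_q, d_max):
    """Select all training points closer than the distance threshold d_max to x_q."""
    d = np.linalg.norm(X - x_q, axis=1)
    return X[d < d_max]
```

Note that `k_nearest` always returns exactly k points regardless of how spread out they are, while `within_radius` can return anywhere from zero to all of the training points; neither rule accounts for the shape of the resulting neighborhood.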
The idea behind local learning methods is that the data can have different characteristics in different parts of the input space, and that data close to the query point should be the most useful for learning the function that predicts the desired output from the given input.
In an example application, it is desired to predict the daily power demand. Different factors can influence the demand load at different times of the year. If the query point corresponds to a summer day, then it can be advantageous to learn a function F( ) based on only the summer days in the training data set.
However, using the k nearest neighbors or all neighbors within some distance d does not always give the best performance.
It is desired to provide a new notion of the local neighborhood along with a method for determining which training points belong to this neighborhood.
A method selects a subset of training points near a query point from a set of training points by maximizing a cumulative similarity, wherein the cumulative similarity measures a similarity of the query point to each point in the subset and a similarity of points in the subset to each other.
Conventional nearest-neighbor methods select neighborhood points that are near to the query point without regard for whether the resulting neighborhood is compact.
The method according to the embodiments of the invention includes a compactness criterion to improve performance when the input training data are non-uniformly distributed in the input space.
Our subset of local neighborhood points XN 302 is
XN=argmaxX⊂χ G(X)=DT(X)+λe−H(X), λ>0,  (1)

where the function argmax 310 returns the subset X that maximizes the function G, DT(X)=Σx∈X exp(−∥x−xq∥p), p=1, 2, is a cumulative similarity from the query point 305 to the training subset X⊂χ, e−H(X) evaluates an inverse range of the distribution induced by X, and λ is a control parameter. By the “range” of a distribution, we mean the number of points in the sample space if the sample space is discrete, and the length of the interval on which the probability density function is different from zero if the sample space is the real line, according to the properties of exponential entropy. H is the Shannon entropy. The Shannon entropy is a measure of the average information content that is missing when the value of the random variable is unknown.
We estimate the Shannon entropy assuming a Gaussian distribution N(μ, Σ) as

H(X)=(1/2) ln((2πe)^C |Σ|),

where μ is the mean of the points in the subset, Σ is their covariance, and C is the dimensionality of the input training points 301.
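The objective of Equation (1) and the Gaussian entropy estimate can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the default λ=1, and the small diagonal term added to keep the covariance invertible for tiny subsets are not part of the original description.

```python
import numpy as np

def cumulative_similarity(X, x_q, p=2):
    # D_T(X) = sum over x in X of exp(-||x - x_q||_p), p in {1, 2}
    return float(np.sum(np.exp(-np.linalg.norm(X - x_q, ord=p, axis=1))))

def gaussian_entropy(X):
    # H(X) = (1/2) ln((2*pi*e)^C |Sigma|), Gaussian assumption
    C = X.shape[1]
    Sigma = np.atleast_2d(np.cov(X, rowvar=False))
    Sigma = Sigma + 1e-9 * np.eye(C)  # regularization (assumption) for degenerate subsets
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (C * np.log(2.0 * np.pi * np.e) + logdet)

def objective_G(X, x_q, lam=1.0, p=2):
    # G(X) = D_T(X) + lambda * exp(-H(X)), lambda > 0
    return cumulative_similarity(X, x_q, p) + lam * np.exp(-gaussian_entropy(X))
```

A compact subset has small |Σ|, hence small H(X) and large e^(−H(X)), so the second term rewards compactness exactly as the text describes.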
The goal of Equation (1) is to maximize the cumulative similarity to the largest cluster in the vicinity of a query point. Our objective is to find the subset of the training data 302 in a way that is adaptive to the underlying structure of the training data patterns.
The combinatorial optimization nature of this problem is a key difference from the greedy approach used in conventional nearest-neighbor methods. The objective function defined in Equation (1), which we maximize, has a mathematical property known as supermodularity. A function
ƒ: R^k→R

is supermodular if

ƒ(x∨y)+ƒ(x∧y)≧ƒ(x)+ƒ(y)

for all x, y∈R^k, where x∨y denotes the component-wise maximum, and x∧y the component-wise minimum, of x and y.
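The lattice operations and the supermodularity inequality can be checked numerically for a simple example. The product of non-negative coordinates, used below, is a standard textbook example of a supermodular function and is an assumption of this sketch, not a function appearing in the invention.

```python
import numpy as np

def lattice_check(f, x, y):
    """Check f(x v y) + f(x ^ y) >= f(x) + f(y) for a single pair (x, y)."""
    join = np.maximum(x, y)   # component-wise maximum, x v y
    meet = np.minimum(x, y)   # component-wise minimum, x ^ y
    return f(join) + f(meet) >= f(x) + f(y)

# For x=(1,4), y=(3,2): f(x v y)+f(x ^ y) = 12+2 = 14 >= f(x)+f(y) = 4+6 = 10
```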
Maximizing the supermodular function is equivalent to minimizing a submodular function. Therefore, we can apply the conventional tools of submodular optimization to optimize this function.
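A simple greedy subset-growing loop gives the flavor of such an optimization. This sketch is a common heuristic for set objectives, not the invention's exact solver; exact submodular minimization would use dedicated tools, and all names below are assumptions for illustration.

```python
import numpy as np

def greedy_select(points, score, max_size):
    """Greedily grow a subset, at each step adding the point that most
    increases score(subset); stop when no candidate improves the score.

    `score` maps an array of points to a scalar, e.g., the objective G
    of Equation (1) evaluated for a fixed query point.
    """
    chosen, remaining = [], list(range(len(points)))
    best = -np.inf
    while remaining and len(chosen) < max_size:
        gains = [score(points[chosen + [i]]) for i in remaining]
        j = int(np.argmax(gains))
        if gains[j] <= best:
            break  # no remaining candidate improves the objective
        best = gains[j]
        chosen.append(remaining.pop(j))
    return points[chosen]
```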
After the optimal subset of points is determined using the above procedure, the subset of points can be used to train 320 any classification or regression method.
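As one concrete instance of the training step 320, a local linear regression can be fitted on the selected subset. Ordinary least squares is chosen here only to keep the sketch self-contained; any regressor could be substituted.

```python
import numpy as np

def local_linear_predict(X_sub, y_sub, x_q):
    """Fit ordinary least squares on the selected neighborhood points
    (X_sub, y_sub) and predict the output at the query point x_q."""
    A = np.hstack([X_sub, np.ones((len(X_sub), 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(A, y_sub, rcond=None)
    return float(np.append(x_q, 1.0) @ coef)
```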
The figures clearly show that our method outperforms the conventional method. In the submodular method, the neighborhood points are selected adaptively, with a smaller number selected at the head of the distribution.
Heteroscedastic Support Vector Regression
The neighborhood training data selected as described above can be used to train any regression or classification method. We now describe one such technique for heteroscedastic support vector regression, which is an extension of support vector regression that uses local neighborhoods to find local regression functions. In statistics, a sequence of random variables is heteroscedastic when the random variables have different variances.
Heteroscedastic support vector regression estimates a function F(x) of the form
F(x)=wTx+b,
where w is a weight vector, T denotes the transpose, and b is a scalar.
The function is found by solving the following optimization problem:
where I is an N×N identity matrix, N is the dimensionality of the input vector, ξi and ξ*i are slack variables, ε is an error tolerance, p∈{1, 2} determines the penalty type, and
is an empirical covariance for the training points in the neighborhood of xi, where Xi is the subset of the neighborhood points, ki is the number of points in Xi, and
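The empirical neighborhood covariance described above can be sketched as follows. The maximum-likelihood 1/k_i normalization is an assumption of this example; the text does not state whether the unbiased 1/(k_i−1) form is intended instead.

```python
import numpy as np

def neighborhood_covariance(X_i):
    """Empirical covariance of the k_i points in the neighborhood subset X_i,
    computed about the neighborhood mean with 1/k_i normalization (assumed)."""
    k_i = len(X_i)
    mu = X_i.mean(axis=0)       # neighborhood mean
    D = X_i - mu                # centered points
    return (D.T @ D) / k_i
```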
Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications may be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.
Publication: US 20110137829 A1, Jun. 2011, United States.