The present invention relates to the field of machine learning generally and in particular to machine learning using parallel processing.
Machine learning seeks to permit computers to analyze known examples that illustrate the relationship between outputs and observed variables. One such approach is known as Probably Approximately Correct learning or “PAC” learning. PAC learning involves having a machine learner receive examples of things to be classified together with labels describing each example. Such examples are sometimes referred to as labeled examples. The machine learner generates a prediction rule or “classifier” (sometimes referred to as a “hypothesis”) based on observed features within the examples. The classifier is then used to classify future unknown data. One illustrative application of machine learning is speech recognition. When machine learning is applied to this application, the labeled examples might include a large number of sound samples. Each sound sample is drawn from a human speaker and contains one or more features. The features might be attributes of the signal (or in some cases a transform of the signal) such as amplitude for example. Each sample is given a label by human reviewers, such as a specific phoneme or word heard by the reviewer when the sound is played. Thus, if a sample were that of a person uttering the word “cat” then the label assigned to the sample would be “cat.” The goal of machine learning is to process the labeled examples to generate classifiers that will correctly classify future examples as the proper phoneme or word within an acceptable error level.
A boosting algorithm is one approach for using a machine to generate a classifier. Various boosting algorithms are known, including MadaBoost and AdaBoost. Boosting algorithms in some cases involve repeatedly calling a weak learner algorithm to process subsets of labeled examples. These subsets are drawn from the larger set of labeled examples using probability distributions that can vary each time the weak learner is called. With each iteration, the weak learner algorithm generates a classifier that is not especially accurate. The boosting algorithm combines the classifiers generated by the weak learner algorithm. The combination of classifiers constitutes a single prediction rule that should be more accurate than any one of the classifiers generated by the weak learner algorithm.
Disclosed herein are embodiments of techniques for machine learning. One aspect of the disclosed embodiments is a technique for generating a classifier using n-dimensional examples in a dataset. The technique includes storing in a computer memory at least some of the n-dimensional examples. A random projection of the n-dimensional examples generates a transform of d-dimensional examples. A set of examples is then drawn from the d-dimensional examples. Using at least one processor, a linear program operation is performed using as constraints at least some of the set of examples drawn from the plurality of d-dimensional examples. A classifier is generated based on the solution of the at least one linear program operation.
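For purposes of illustration only, the following is a minimal sketch of this technique in Python using NumPy and SciPy. It assumes a dense random ±1 projection matrix scaled by 1/√d and the margin-maximizing linear program described later in this disclosure; the function names, parameter values, and toy data are hypothetical and do not limit the disclosed embodiments.

```python
import numpy as np
from scipy.optimize import linprog

def train_classifier(X, y, d=25, m=60, seed=0):
    """Random projection of n-dimensional examples to d dimensions, followed by a
    margin-maximizing linear program over m drawn examples."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    A = rng.choice([-1.0, 1.0], size=(n, d))          # random projection matrix (assumed form)
    Xd = (X @ A) / np.sqrt(d)                         # d-dimensional examples, d < n
    idx = rng.choice(len(Xd), size=min(m, len(Xd)), replace=False)
    Xs, ys = Xd[idx], y[idx]                          # the set of drawn examples
    k = len(Xs)
    # LP variables z = [v (d), slack s_t (k), margin s]; minimizing -s maximizes s.
    c = np.zeros(d + k + 1)
    c[-1] = -1.0
    A_eq = np.hstack([ys[:, None] * Xs, -np.eye(k), -np.ones((k, 1))])  # y_t(v.x'_t) - s_t - s = 0
    bounds = [(-1, 1)] * d + [(0, 2)] * k + [(-2, 2)]
    res = linprog(c, A_eq=A_eq, b_eq=np.zeros(k), bounds=bounds, method="highs")
    v = res.x[:d]
    return lambda Z: np.sign(((Z @ A) / np.sqrt(d)) @ v)   # classifier h(x) = sign(v . Phi(x))

# Toy usage: a 100-dimensional halfspace, keeping only examples with a clear margin.
rng = np.random.default_rng(1)
w = rng.normal(size=100)
w /= np.linalg.norm(w)
X = rng.normal(size=(500, 100))
keep = np.abs(X @ w) > 1.0
X, y = X[keep], np.sign(X[keep] @ w)
h = train_classifier(X, y)
print("training accuracy:", np.mean(h(X) == y))
```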
Another aspect of the disclosed embodiments is a machine for generating a classifier from n-dimensional examples in a dataset. The machine includes a memory and a computing system having at least one processor. Some of the n-dimensional examples are stored in the memory. The computing system is configured to determine d-dimensional examples using a random projection of n-dimensional examples stored in memory. In some cases, d is less than n. The computing system is further configured to draw a set of examples from the d-dimensional space. Using some of the d-dimensional examples as constraints, the computing system solves at least one linear program operation. The computing system then generates a classifier based on the solution of the at least one linear program operation.
Another aspect of the disclosed embodiments is a computer storage medium encoded with a computer program. The computer program has instructions that, when executed by a data processing apparatus, cause the data processing apparatus to perform operations. These operations include storing in a computer memory at least some of the n-dimensional examples and determining a d-dimensional space using a random projection of the at least some n-dimensional examples. In some cases, d is less than n. The operations further include drawing a set of examples from the d-dimensional space and solving at least one linear program operation. The linear program operation is solved using constraints which are based on at least some of the set of examples drawn from the d-dimensional space. A classifier is generated based on the solution of the at least one linear program.
The description herein makes reference to the accompanying drawings, wherein like reference numerals refer to like parts throughout the several views.
In the embodiments below, a method (and an apparatus for implementing the method) is disclosed for machine learning that facilitates the use of parallel processing.
Memory 14 is random access memory (RAM), although any other suitable type of storage device can be used. Memory 14 includes code and data that are accessed by processor 12 using a bus 16. Memory 14 includes an operating system 18, application programs 20 (including programs that permit processor 12 to perform the methods described herein), and dataset 22 (including the examples used for machine learning and the classifiers generated as a result of machine learning). Machine learning system 10 also includes secondary storage 24, which can be a disk drive. Because the example sets used for machine learning contain a significant amount of information, the example sets can be stored in whole or in part in secondary storage 24 and loaded into memory 14 as needed for processing. Machine learning system 10 includes input-output device 26, which is also coupled to processor 12 via bus 16. Input-output device 26 can be a display or other human interface that permits an end user to program and otherwise use machine learning system 10.
One of the functions of machine learning system 10 is to generate classifiers that permit machine learning system 10 or another computer to programmatically determine the correct label of a specimen. The classifiers are generated using a dataset of labeled examples, such as dataset 40 shown in the accompanying drawings.
To aid in the understanding of machine learning system 10, the following mathematical constructs are explained. Machine learning system 10 implements learning in accordance with a range of learning models including the PAC model of learning. In one exemplary embodiment, machine learning system 10 learns a target halfspace f(x)=sign(w·x), where w is an unknown unit vector and x is an example drawn from a set of examples having an unknown probability distribution D over the unit ball Bn={x ∈ Rn: ∥x∥2 ≤ 1}. Distribution D may have support on {x ∈ Bn: |w·x| ≥ γ}, that is, on examples whose margin with respect to w is at least γ. For convenience, the target halfspace f(x) is sometimes referred to as an n-dimensional, γ-margin halfspace.
As explained below, machine learning system 10 accepts as input at least a portion of a dataset of labeled examples (x, f(x)), where each example x is independently drawn from D and f(x) is the label assigned to example x. From this input, machine learning system 10 generates as output a classifier h: Rn→{−1,1} that in some cases satisfies the condition Prx~D[h(x)≠f(x)] ≤ ε, where ε is a target error parameter.
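As a concrete illustration of this learning model, the short Python sketch below draws synthetic labeled examples (x, f(x)) from a γ-margin halfspace on the unit ball and measures the empirical error Pr[h(x)≠f(x)] of a candidate classifier; the sampling scheme and parameter values are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def sample_margin_halfspace(num, n, gamma, rng):
    """Draw examples x from inside the unit ball, keeping only those with |w.x| >= gamma,
    and label each one by f(x) = sign(w.x) for an unknown unit vector w."""
    w = rng.normal(size=n)
    w /= np.linalg.norm(w)
    examples = []
    while len(examples) < num:
        x = rng.normal(size=n)
        x /= np.linalg.norm(x) * rng.uniform(1.0, 2.0)   # a point inside the unit ball
        if abs(w @ x) >= gamma:                          # enforce the gamma-margin condition
            examples.append(x)
    X = np.array(examples)
    return X, np.sign(X @ w), w

def empirical_error(h, X, y):
    """Empirical estimate of Pr[h(x) != f(x)]."""
    return np.mean(h(X) != y)

rng = np.random.default_rng(0)
X, y, w = sample_margin_halfspace(1000, n=50, gamma=0.05, rng=rng)
print("error of the target halfspace itself:", empirical_error(lambda Z: np.sign(Z @ w), X, y))
```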
Machine learning is sometimes computationally intensive. For example, one known method for machine learning is the Perceptron algorithm, which can run in poly(n, 1/γ, 1/ε) time, using O(1/(εγ²)) labeled examples in Rn, to learn an unknown n-dimensional γ-margin halfspace to accuracy 1−ε.
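For context, the sketch below is the classic (sequential) Perceptron update on γ-margin data; it is a standard algorithm and not the parallel technique disclosed herein, and the data-generation details are illustrative assumptions.

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    """Classic Perceptron: add y*x to w whenever an example is misclassified.
    On unit-norm examples with margin gamma it makes at most 1/gamma^2 mistakes."""
    w = np.zeros(X.shape[1])
    for _ in range(max_passes):
        mistakes = 0
        for x, label in zip(X, y):
            if label * (w @ x) <= 0:        # mistake (or zero margin): update w
                w += label * x
                mistakes += 1
        if mistakes == 0:                   # all training examples correctly classified
            break
    return w

# Toy usage on separable unit-norm data with an (approximate) margin of 0.1.
rng = np.random.default_rng(0)
target = rng.normal(size=20)
target /= np.linalg.norm(target)
X = rng.normal(size=(300, 20))
X /= np.linalg.norm(X, axis=1, keepdims=True)
keep = np.abs(X @ target) > 0.1
X, y = X[keep], np.sign(X[keep] @ target)
w = perceptron(X, y)
print("training error:", np.mean(np.sign(X @ w) != y))
```

Because each update depends on the classifier produced by the previous mistake, the Perceptron is inherently sequential, which motivates the parallel approach described next.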
To reduce the time required for computation, machine learning system 10 performs some operations in parallel. In one exemplary embodiment, machine learning system 10 can learn an unknown n-dimensional γ-margin halfspace to accuracy 1−ε using a number of processors that is a polynomial function of n, 1/γ, and 1/ε, in a time that is a polynomial function of 1/γ, log n, and log(1/ε).
In this example, machine learning system 10 is used to generate a classifier for distinguishing spoken words, phonemes, or other units of language.
An audio recording such as specimen 28 has certain attributes or features, such as frequency or amplitude or transforms of these values, which are arbitrarily designated for this example as features F1, . . . , Fn, where n is the number of dimensions of the labeled examples. These features are associated with the categorization of specimen 28 as an utterance of the word “A” (as in the sentence “A dog barked.”) versus some other sound. Many specimens of utterances can be studied and classified, and the results can be summarized in a dataset of training examples such as dataset 40.
For example, row 42c of dataset 40 contains the values of example x3. Example x3 is a vector (7.950, 6.717, −8.607, −9.658, 5.571, . . . , −8.010) whose elements indicate the real number values of particular features in example x3. Thus, column 44a of row 42c contains a “7.950” indicating that example x3 includes the feature F1 having a value of 7.950.
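A dataset organized like dataset 40 can be held in memory as a feature matrix with one row per example and a parallel vector of labels. The sketch below uses the feature values of example x3 quoted above (truncated to the six values shown); the second row and both labels are hypothetical placeholders.

```python
import numpy as np

# Rows are examples; columns are features F1, F2, ... (only six features shown here).
features = np.array([
    [7.950, 6.717, -8.607, -9.658, 5.571, -8.010],   # values quoted for example x3 (truncated)
    [1.250, -0.340, 2.110, 4.780, -3.620, 0.905],    # a hypothetical additional example
])
labels = np.array([+1, -1])                          # hypothetical labels, one per row
print(features.shape, labels.shape)                  # (2, 6) (2,)
```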
Machine learning system 10 accepts as input training examples such as dataset 40 and generates as output a classifier that can be used to predict the correct label of new data. The operation of machine learning system 10 is explained below in detail; however, in general terms, machine learning system 10 accomplishes parallel operations as follows. Machine learning system 10 executes a boosting module B as described below.
At block 54, machine learning system 10 initializes a probability distribution Pt as a uniform distribution. At block 56, machine learning system 10 calls a weak learner module W, passing the probability distribution Pt as an argument. Weak learner module W returns a classifier in the form of a vector w (see
At block 60, machine learning system 10 determines whether the variable t is less than T, a predetermined constant. The value of T is selected based on the number of iterations of boosting module B that are to be performed. Generally, the more iterations of boosting module B, the greater the accuracy of boosted classifier G. If t is not less than T, then processing of boosting module B terminates at a block 62. The final boosted classifier G will contain T elements, each being a weighted weak classifier (as generated by weak learner module W). The final boosted classifier G can be applied to an unknown example, and the label of the unknown example will be generated by classifier G as the weighted results of each weak classifier G1, . . . GT.
If at block 60 t is less than T, then at block 64 machine learning system 10 increments t by one. At block 66, machine learning system 10 calculates a new probability distribution Pt. The specific methodology for calculating Pt can depend on the type of boosting algorithm employed by boosting module B. Machine learning system 10 then calls weak learner module W at block 56 and the processing continues through another iteration of blocks 56 through 66 as described above.
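A compact sketch of a boosting loop with this structure appears below. It follows an AdaBoost-style reweighting (the disclosure also contemplates MadaBoost and other boosting algorithms) and substitutes a simple single-feature threshold rule for weak learner module W, which in the disclosed embodiments instead solves a linear program; all names and constants are illustrative.

```python
import numpy as np

def weak_learner(X, y, P):
    """Stand-in for module W: the best single-coordinate sign rule under distribution P."""
    best, best_err = (0, 1.0), 1.0
    for i in range(X.shape[1]):
        for sign in (+1.0, -1.0):
            err = P[np.sign(sign * X[:, i]) != y].sum()
            if err < best_err:
                best, best_err = (i, sign), err
    return best, best_err

def boost(X, y, T=15):
    """Blocks 54-66: start uniform, call W, reweight, and accumulate the boosted classifier G."""
    m = len(X)
    P = np.full(m, 1.0 / m)                      # block 54: uniform distribution P_t
    G = []                                        # list of (weight, weak classifier) pairs
    for t in range(T):
        (i, sign), err = weak_learner(X, y, P)    # block 56: call weak learner module W
        err = np.clip(err, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)     # weighting for this weak classifier
        G.append((alpha, i, sign))
        pred = np.sign(sign * X[:, i])
        P = P * np.exp(-alpha * y * pred)         # block 66: compute the new distribution P_t
        P = P / P.sum()
    return G

def predict(G, X):
    """The boosted classifier: the sign of the weighted votes of G_1, ..., G_T."""
    votes = sum(alpha * np.sign(sign * X[:, i]) for alpha, i, sign in G)
    return np.sign(votes)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + 0.5 * X[:, 2])
G = boost(X, y)
print("training error:", np.mean(predict(G, X) != y))
```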
An example of rounding to the nearest integer multiple would be rounding a value of 1.623 to 162/100. At block 72, machine learning system 10 determines an intermediate transform ΦA(x′) of the rounded examples, where x′ is the result of rounding x at block 68 and A is a randomly selected projection matrix. At block 74, machine learning system 10 rounds each component of the intermediate transform to the nearest multiple of 1/(8⌈d/γ⌉), resulting in a d-dimensional space or example set that is the n- to d-dimensional transformation of the n-dimensional set of examples x. For a given set of examples x, machine learning system 10 can compute ΦA(x) using O(n log(1/γ)/γ²) processors in O(log(n/γ)) time.
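A sketch of this transformation is shown below. It assumes the intermediate transform uses a dense random ±1 matrix scaled by 1/√d (a common random-projection construction) and a first-stage rounding grid of 1/100 per the example above; the disclosure does not require those particular choices.

```python
import numpy as np

def phi_A(X, d, gamma, rng):
    """Blocks 68-74: round the n-dimensional examples, project them to d dimensions with a
    random matrix A, and round each component of the result to a multiple of 1/(8*ceil(d/gamma))."""
    Xr = np.round(X * 100.0) / 100.0                  # block 68: round to nearest multiple of 1/100 (assumed grid)
    n = X.shape[1]
    A = rng.choice([-1.0, 1.0], size=(n, d))           # random projection matrix (assumed +/-1 form)
    Z = (Xr @ A) / np.sqrt(d)                          # block 72: intermediate d-dimensional transform
    grid = 1.0 / (8.0 * np.ceil(d / gamma))            # block 74: rounding grid 1/(8*ceil(d/gamma))
    return np.round(Z / grid) * grid

rng = np.random.default_rng(0)
X = rng.uniform(-0.1, 0.1, size=(5, 1000))             # five 1000-dimensional examples
print(phi_A(X, d=50, gamma=0.1, rng=rng).shape)         # (5, 50)
```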
At block 78, machine learning system 10 determines a linear program LP as follows. Given (x′1, y1), . . . , (x′m, ym) ∈ Bd×{−1,1}, a linear program LP with α=γ′/2 is:
minimize −s such that yt(v·xt′)−st=s and 0 ≤ st ≤ 2 for all t ∈ [m]; −1 ≤ vi ≤ 1 for all i ∈ [d]; and −2 ≤ s ≤ 2,
where s is the minimum margin over all examples, and st is the difference between each example's margin and s. The subspace L is defined by the equality constraints yt(v·xt′)−st=s.
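For illustration, the sketch below builds exactly this linear program for a set of drawn d-dimensional examples and solves it with an off-the-shelf LP solver (the disclosed embodiments instead use the interior-point weak learner described below); the toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def solve_margin_lp(Xp, y):
    """Maximize s subject to y_t (v . x'_t) - s_t = s, 0 <= s_t <= 2,
    -1 <= v_i <= 1, and -2 <= s <= 2, as stated in the text."""
    m, d = Xp.shape
    # Decision vector z = [v_1..v_d, s_1..s_m, s]; minimizing -s maximizes s.
    c = np.zeros(d + m + 1)
    c[-1] = -1.0
    # One equality row per example: y_t (v . x'_t) - s_t - s = 0.
    A_eq = np.hstack([y[:, None] * Xp, -np.eye(m), -np.ones((m, 1))])
    b_eq = np.zeros(m)
    bounds = [(-1, 1)] * d + [(0, 2)] * m + [(-2, 2)]
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    v, s = res.x[:d], res.x[-1]
    return v, s

# Toy usage: separable d-dimensional examples inside the unit ball.
rng = np.random.default_rng(0)
w = rng.normal(size=10)
w /= np.linalg.norm(w)
Xp = rng.normal(size=(40, 10))
Xp /= (np.linalg.norm(Xp, axis=1, keepdims=True) + 1.0)   # scale every example inside the unit ball
y = np.sign(Xp @ w)
v, s = solve_margin_lp(Xp, y)
print("weights v:", np.round(v, 3), "minimum margin s:", round(s, 4))
```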
With LP established, at block 80 machine learning system 10 initializes a variable k and other parameters. The variable k is set equal to 2. The constant K (which determines the number of iterations of weak learner module W) is set to a predetermined value based on the size of LP and the desired precision of its solution. Other parameters are initialized, including η1=1, and a precision parameter α is set based on the desired bitwise precision.
At blocks 82 through 96 as described below, weak learner module W will accept a vector u as input and, using linear programming, generate as output u(k) after k iterations, where the final output will include a weak classifier in the first d elements u(k)1, . . . , u(k)d. The input vector u has as its first elements u1, . . . , ud a set of weightings generated at block 78. These d weights are a crude classifier over the d dimensions of the transformed example set. Input vector u has as its next m elements ud+1, . . . , ud+m values corresponding to each of the m examples. The remaining j elements ud+m+1, . . . , ud+m+j of input vector u are values used by linear programs as slack or margin variables, such as s described above. These margin variables measure the margin by which the linear program has succeeded in correctly classifying all of the examples. If the margin variable is positive and large, then all of the examples have been classified correctly by a large margin. If the margin variable is negative, then some example is classified incorrectly. If the margin variable is positive but small, then all examples have been classified correctly, but at least one example is classified correctly by only a small margin.
At block 82, a subroutine Abt is called to generate u(1) taking u as input. Subroutine Abt is known in the art and is taught, for example, at Section 9.6.4 of S. P. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004. In general terms, subroutine Abt accepts as input a preliminary solution to a convex optimization problem (in this case u) and outputs a solution for which the Newton decrement is relatively small. In some cases, the properties of Abt can be stated mathematically as follows: Suppose for any η>0, Abt is given u with rational components such that Fη(u)−optη ≤ p2. After a number of iterations of Newton's method and back-tracking line search, Abt returns u+ that (i) satisfies ∥nη(u+)∥u+ ≤ 1/9; and (ii) has rational components whose bit lengths are bounded by a polynomial in d, the bit length of u, and the bit length of the matrix A for which L={v: Av=0}.
In some exemplary embodiments, u(1) can meet the following criterion: ∥nη1(u(1))∥u(1) ≤ 1/9. In each subsequent iteration, machine learning system 10 computes w(k) as:
w(k)=u(k−1)+nηk(u(k−1))
The computation of w(k) involves computing one step of Newton's method (as indicated by the notation nηk). This computation can be performed in parallel using processors 12. For example, Newton's method requires the performance of matrix inversion, which can be accomplished using Reif's algorithm, as is understood by those skilled in the art. Reif's algorithm can be performed in parallel.
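The sketch below shows a generic damped Newton iteration with backtracking line search on a small convex function; each iteration's linear solve (here np.linalg.solve) is the operation that a parallel solver in the spirit of Reif's algorithm would distribute across processors. The objective, constants, and stopping rule are illustrative and are not the barrier function used by weak learner module W.

```python
import numpy as np

def damped_newton(f, grad, hess, u, alpha=0.25, beta=0.5, iters=50):
    """Newton's method with backtracking line search (cf. Boyd & Vandenberghe, ch. 9).
    Each iteration solves H(u) p = -g(u); that linear solve is the step a parallel
    matrix-inversion routine would accelerate."""
    for _ in range(iters):
        g, H = grad(u), hess(u)
        if np.linalg.norm(g) < 1e-10:                    # already (numerically) stationary
            break
        p = np.linalg.solve(H, -g)                       # Newton step direction
        t = 1.0
        while f(u + t * p) > f(u) + alpha * t * (g @ p): # backtracking (Armijo) line search
            t *= beta
        u = u + t * p
    return u

# Toy usage: a strictly convex objective with an easily computed gradient and Hessian.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 2.0])
f = lambda u: 0.5 * u @ Q @ u - b @ u + np.sum(np.exp(u))
grad = lambda u: Q @ u - b + np.exp(u)
hess = lambda u: Q + np.diag(np.exp(u))
print("minimizer:", np.round(damped_newton(f, grad, hess, np.zeros(2)), 4))
```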
At block 88, machine learning system 10 computes r(k) by rounding each component of w(k) to the nearest multiple of ε, and then projecting back onto L, the subspace described above in connection with block 78. At block 90, machine learning system 10 runs subroutine Abt starting with r(k) to obtain u(k). At block 92, a determination is made as to whether k is less than K. If k is not less than K, processing of weak learner module W terminates at block 94. Otherwise, if k is less than K, then machine learning system 10 increments k by one at block 96. Machine learning system 10 then re-performs the processes of blocks 84 through 92 as just described. When weak learner module W terminates at block 94, it returns to boosting module B a weak classifier that is extracted as the first d elements of vector u(k).
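The round-then-project step at block 88 can be illustrated as follows; the grid size, the constraint matrix, and the use of a pseudoinverse for the orthogonal projection onto L = {v: Av = 0} are illustrative assumptions rather than the disclosed implementation.

```python
import numpy as np

def round_and_project(w, A, grid):
    """Round each component of w to the nearest multiple of `grid`, then project the result
    orthogonally back onto L = {v : A v = 0} so the equality constraints hold again."""
    r = np.round(w / grid) * grid                # rounding pushes w slightly off the subspace
    return r - np.linalg.pinv(A) @ (A @ r)       # remove the component outside the null space of A

# Toy usage: 3 equality constraints over 8 variables.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 8))
w = rng.normal(size=8)
w -= np.linalg.pinv(A) @ (A @ w)                 # start from a point with A w = 0
v = round_and_project(w, A, grid=1e-3)
print("max |A v| after projection:", float(np.max(np.abs(A @ v))))   # ~1e-15
```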
The above-described embodiments have been described in order to allow easy understanding of the present invention and do not limit the present invention. On the contrary, the invention is intended to cover various modifications and equivalent arrangements included within the scope of the appended claims, which scope is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structure as is permitted under the law.
This application claims priority to U.S. Provisional Patent Application No. 61/535,112, filed Sep. 15, 2011, which is hereby incorporated by reference in its entirety.
U.S. Patent Documents: US 2010/0198900 A1, Gifford, Aug. 2010.
Other Publications:
Weston et al. “Feature Selection for SVMs”, 2004, pp. 19. http://www.ee.columbia.edu/~sfchang/course/svia-F04/slides/feature_selection_for_SVMs.pdf. |
Zhou et al. “Linear programming support vector machines”, Pattern Recognition, 35, 2002, pp. 2927-2936. |
Fischer et al. “An Asynchronous Parallel Newton Method”, Mathematical Programming 42 (1988) 363-374. |
Arriaga, R. and Vempala, S. An algorithmic theory of learning: Robust concepts and random projection. FOCS, 1999. |
Block, H. The Perceptron: a model for brain functioning. Reviews of Modern Physics, 34:123-135, 1962. |
Blum, A. Random Projection, Margins, Kernels, and Feature-Selection. In LNCS vol. 3940, pp. 52-68, 2006. |
Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929-965, 1989. |
Bradley, J. K. and Schapire, R. E. Filterboost: Regression and classification on large datasets. In NIPS, 2007. |
Bshouty, N. H., Goldman, S.A. and Mathias, H.D. Noise-tolerant parallel learning of geometric concepts. Information and Computation, 147(1):89-110, 1998. |
Collins, M., Schapire, Robert E. and Singer, Y. Logistic regression, adaboost and bregman distances. Machine Learning, 48(1-3):253-285, 2002. |
Domingo, C. and Watanabe, O. MadaBoost: a modified version of AdaBoost. In COLT, 2000. |
Freund, Y. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285, 1995. |
Freund, Y. An adaptive version of the boost-by-majority algorithm. Machine Learning, 43(3):293-318, 2001. |
Greenlaw, R., Hoover, H.J. and Ruzzo, W.L. Limits to Parallel Computation: P-Completeness Theory. Oxford University Press, New York, 1995. |
Kalai, A. and Servedio, R. Boosting in the presence of noise. Journal of Computer & System Sciences, 71 (3):266-290, 2005. |
Karmarkar, N. A new polynomial time algorithm for linear programming. Combinatorica, 4:373-395, 1984. |
Kearns, M. and Mansour, Y. On the boosting ability of top-down decision tree learning algorithms. In STOC, 1996. |
Kearns, M. and Vazirani, U. An Introduction to Computational Learning Theory. MIT Press, 1994. |
Long, P. and Servedio, R. Martingale boosting. In Proc. 18th Annual Conference on Learning Theory (COLT), pp. 79-94, 2005. |
Long, P. and Servedio, R. Adaptive martingale boosting. In Proc. 22nd Annual Conference on Neural Information Processing Systems (NIPS), pp. 977-984, 2008. |
Mansour, Y. and McAllester, D. Boosting using branching programs. Journal of Computer & System Sciences, 64 (1):103-112, 2002. |
Reif, John H. Efficient Parallel Factorization and Solution of Structured and Unstructured Linear Systems, Journal of Computer and System Sciences, 2005. |
Renegar, J. A polynomial-time algorithm, based on Newton's method, for linear programming. Mathematical Programming, 40:59-93, 1988. |
Rosenblatt, F. The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-407, 1958. |
Schapire, R. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990. |
Servedio, R. Smooth boosting and learning with malicious noise. Journal of Machine Learning Research. 4:633-648, 2003. |
Valiant, L. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984. |
Vitter, J.S. and Lin, J. Learning in parallel. Inf. Comput., 96(2):179-202, 1992. |
DIMACS 2011 Workshop. Parallelism: A 2020 Vision, 2011. |
NIPS 2009 Workshop. Large-Scale Machine Learning: Parallelism and Massive Datasets, 2009. |
Schapire, Robert E., The Boosting Approach to Machine Learning an Overview, MSRI Workshop on Nonlinear Estimation and Classification, 2002 (23 pp). |