In various data classification techniques, a set of tagged data points in Euclidean space is processed in a training phase to determine a partition of the space into various classes. The tagged points may represent features of non-numerical objects such as scanned documents. Once the classes are determined, a new set of points can be classified based on the classification model constructed during the training phase. Training may be supervised or unsupervised.
For a detailed description of various illustrative principles, reference will now be made to the accompanying drawings in which:
In accordance with various implementations, numbers are extracted from non-numerical data so that a computing device can further analyze the extracted numerical data and/or perform a desirable type of operation on the data. The extracted numerical data may be referred to as “data points” or “coordinates.” A type of technique for analyzing the numerical data extracted from non-numerical data includes determining a unique set of polynomials for each class of interest and then evaluating the polynomials on a set of data points. For a given set of data points, the polynomials of one of the classes may evaluate to 0 or approximately 0. Such polynomials are referred to as “approximately-zero polynomials.” The data points are then said to belong to the class corresponding to those particular polynomials.
All references herein to determining whether a polynomial evaluates to zero include determining whether the polynomial evaluates to approximately zero (e.g., within a tolerance parameter).
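The tolerance-based test can be sketched as follows. This is a minimal illustration, not the described implementation: the helper names `is_approximately_zero` and `vanishes_on`, the tolerance values, and the unit-circle polynomial are all assumptions chosen for the example.

```python
def is_approximately_zero(value, tolerance=1e-6):
    """Return True when |value| is within the tolerance parameter of zero."""
    return abs(value) <= tolerance


def vanishes_on(polynomials, points, tolerance=1e-6):
    """Return True when every polynomial is approximately zero on every point."""
    return all(
        is_approximately_zero(poly(*point), tolerance)
        for poly in polynomials
        for point in points
    )


# Hypothetical class polynomial: x^2 + y^2 - 1 vanishes on the unit circle.
circle = lambda x, y: x * x + y * y - 1.0

# The last point is slightly off the circle, imitating measurement noise.
points = [(0.6, 0.8), (1.0, 0.0), (0.0, 1.0005)]

loose = vanishes_on([circle], points, tolerance=1e-2)
strict = vanishes_on([circle], points, tolerance=1e-6)
```

With the looser tolerance the noisy point is still accepted, whereas the stricter tolerance rejects it, which is why the tolerance is treated as a tunable parameter.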
Measurements can be made on many types of non-numerical data (also referred to as data features). For example, in the context of alphanumeric character recognition, multiple different measurements can be made for each alphanumeric character encountered in a scanned document. Examples of such measurements include the average slope of the lines making up the character, a measure of the widest portion of the character, a measure of the highest portion of the character, etc. The goal is to determine a suitable set of polynomials for each possible alphanumeric character. Thus, capital A has a unique set of polynomials, B has its own unique set of polynomials, and so on. Each polynomial is of degree n (n could be 1, 2, 3, etc.) and may use some or all of the measurement values as inputs.
The classes depicted in
Part of the analysis, however, is determining which polynomials to use for each alphanumeric character. A class of techniques called Approximate Vanishing Ideal (AVI) may be used to determine polynomials to use for each class. The word “vanishing” refers to the fact that a polynomial evaluates to 0 for the right set of input coordinates. “Approximate” means that the polynomial only has to evaluate to approximately 0 for classification purposes. Many of these techniques, however, are not stable. Lack of stability means that the polynomials do not perform well in the face of noise. For example, if there is some distortion of the letter A or extraneous pixels around the letter, the polynomial(s) for the letter A may not vanish to 0 at all even though the measurements were made for a letter A. Some AVI techniques are based on a pivoting technique which is fast but inherently unstable.
The implementations discussed below are directed to a Stable Approximate Vanishing Ideal (SAVI) technique which, as its name suggests, is stable in the face of noise in the input data. The techniques described herein are further able to model data points that sit on a union of multiple varieties, that is, data points corresponding to multiple classes that are generally indivisible and thus difficult to divide into individual training data sets.
The non-transitory storage device 130 is shown in
The distinction among the various engines 102-116 and among the software modules 132-146 is made herein for ease of explanation. In some implementations, however, the functionality of two or more of the engines/modules may be combined together into a single engine/module. Further, the functionality described herein as being attributed to each engine 102-116 is applicable to the software module corresponding to each such engine (when executed by processor 120), and the functionality described herein as being performed by a given module (when executed by processor 120) is applicable as well to the corresponding engine.
The functions performed by the various engines 102-112 of
The method of
A polynomial is a sum of multiple monomials, and each monomial has a particular degree (the monomial 2X^3 is a degree 3 monomial). The degree of a polynomial is the maximum degree of its constituent monomials. Operations 202 and 204 of
At 202, the method comprises, for each of the plurality of data points, determining a neighborhood of data points about that data point; this operation may be performed by neighborhood determination engine 102. The neighborhood of a particular data point consists of the data points that are “close to” it, for example, points that are within a predefined threshold distance of the data point. The threshold distance may be user-specified.
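A minimal sketch of this neighborhood step, assuming Euclidean distance and assuming that each point counts as a member of its own neighborhood (the text does not say either way). The function name `neighborhood` and the sample points are hypothetical.

```python
import math


def neighborhood(points, center, threshold):
    """Return the points within `threshold` Euclidean distance of `center`.

    Assumption: the center itself is included, since its distance is 0.
    """
    return [q for q in points if math.dist(q, center) <= threshold]


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (1.0, 1.0)]

# One neighborhood per data point, using a user-specified threshold.
neighborhoods = {p: neighborhood(points, p, threshold=0.25) for p in points}
```

A brute-force scan like this costs one distance computation per pair of points; a spatial index could replace it for larger data sets without changing the result.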
At 204, a SAVI technique is performed on each such neighborhood of data points. More specifically, for each such neighborhood of points, the method includes the following operations, which are further described below:
Generating the projection set of polynomials may be performed by the projection engine 104. The projection engine 104 may process the set of candidate polynomials to generate a projection set of polynomials by, for example, computing a projection of the space of linear combinations of the candidate polynomials of degree d onto the polynomials of degree less than d that do not evaluate to 0 on the set of points. In the first iteration of operations 202 and 204 of
For the initial data point for which a neighborhood is determined and the operations of 202 and 204 are performed, the candidate polynomials are predetermined. For each subsequent data point, the candidate polynomials used in operations 202, 204 are the resulting polynomials generated by operations 202 and 204 being performed on the preceding data point.
The following is an example of the computation of the linear combination of the candidate polynomials of degree d on the polynomials of degree less than d that do not evaluate to 0 on each neighborhood of data points. The projection engine 104 may multiply the polynomials of degree less than d that do not evaluate to 0 by the polynomials of degree less than d that do not evaluate to 0 evaluated on the neighborhood of data points and then multiply that result by the candidate polynomials of degree d evaluated on the neighborhood of data points. In one example, the projection engine 104 computes:
$$E_d = O_{<d}\, O_{<d}(P)^{t}\, C_d(P)$$
where $O_{<d}$ represents the set of polynomials that do not evaluate to 0 and are of degree lower than d, $O_{<d}(P)^{t}$ represents the transpose of the matrix of the evaluations of the $O_{<d}$ polynomials, and $C_d(P)$ represents the evaluation of the candidate set of polynomials on the neighborhood of data points (P). $E_d$ represents the projection set of polynomials evaluated on the neighborhood of data points.
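The evaluated form of this projection can be sketched with NumPy. The sketch assumes a layout that the text does not specify: rows index the data points of the neighborhood, columns index polynomials, and the columns of the lower-degree evaluation matrix are orthonormal. The names `projection_on_lower_degree`, `O_eval`, and `C_eval` are hypothetical.

```python
import numpy as np


def projection_on_lower_degree(O_eval, C_eval):
    """Evaluate O_<d(P) O_<d(P)^t C_d(P) on the neighborhood P.

    O_eval: (points x lower-degree polynomials), assumed orthonormal columns.
    C_eval: (points x candidate polynomials of degree d).
    """
    return O_eval @ (O_eval.T @ C_eval)


# Neighborhood P of four points in the plane.
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Only lower-degree polynomial: the constant, scaled to unit norm on P.
O_eval = np.ones((len(P), 1)) / np.sqrt(len(P))

# Degree-1 candidates x and y, evaluated on P.
C_eval = P.copy()

E_eval = projection_on_lower_degree(O_eval, C_eval)
```

In this small case the projection onto the constant polynomial is simply the column mean, so every entry of `E_eval` is 0.5, and what remains after removing it is orthogonal to the lower-degree evaluations.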
Generating the subtraction matrix may be performed by the subtraction engine 106. The subtraction engine 106 subtracts the projection set of polynomials evaluated on the neighborhood of data points from the candidate polynomials evaluated on the neighborhood of data points to generate a subtraction matrix of evaluated polynomials, that is:
Subtraction matrix=Cd(P)−Ed(P)
The subtraction matrix represents the difference between evaluations of polynomials of degree d on the data points within the neighborhood, and evaluations of polynomials of lower degrees on such data points.
The SVD engine 108 computes the singular value decomposition (SVD) of the subtraction matrix. The SVD of the subtraction matrix may result in the three matrices $U$, $S$, and $V^{t}$. $U$ is a unitary matrix. $S$ is a rectangular diagonal matrix in which the values on the diagonal are the singular values of the subtraction matrix. $V^{t}$ is the transpose of a unitary matrix and thus also a unitary matrix. That is:
$$\text{Subtraction matrix} = U S V^{t}$$
A matrix may be represented as a linear transformation between two distinct spaces. To better analyze the matrix, rigid (i.e., orthonormal) transformations may be applied to the space. The “best” rigid transformations may be the ones which will result in the transformation being on a diagonal of a matrix, and that is exactly what the SVD achieves. The values on the diagonal of the S matrix are called the “singular values” of the transformation.
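As a hedged illustration of the subtraction and SVD steps, the following sketch uses `numpy.linalg.svd` on a small hand-built example. The matrices are hypothetical evaluations (rows are data points, columns are candidate polynomials), and the reduced form of the SVD is used for brevity.

```python
import numpy as np

# Hypothetical evaluations of two degree-1 candidates on four data points.
C_eval = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Hypothetical evaluations of the projection set on the same points.
E_eval = np.full_like(C_eval, 0.5)

# Subtraction matrix: candidates minus their lower-degree projection.
subtraction = C_eval - E_eval

# Reduced SVD: U has orthonormal columns, s holds the singular values,
# and Vt is the transpose of an orthogonal matrix.
U, s, Vt = np.linalg.svd(subtraction, full_matrices=False)

# Multiplying the factors back together recovers the subtraction matrix.
reconstructed = U @ np.diag(s) @ Vt
```

Here both singular values equal 1, meaning neither combination of the two candidates vanishes on these four points.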
For each neighborhood of data points, operation 204 results in one or more evaluated resulting polynomials (e.g., a unique set of polynomials for each data point neighborhood). Neighborhoods of data points that have similar polynomials are likely to be part of the same class. As such, at 206, the method includes clustering (206) the evaluated resulting polynomials into multiple clusters to cluster the various data points into the various classes. The clustering operation may be performed by the clustering engine 110. Any of a variety of clustering algorithms may be used.
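Because any clustering algorithm may be used, the following is only one possible sketch. It assumes each neighborhood's resulting polynomial is summarized by a coefficient vector, and that two vectors describe the same polynomial up to sign and scale when the absolute cosine similarity is high. The function `greedy_cluster`, the similarity cutoff, and the sample vectors are all hypothetical.

```python
import math


def normalize(vector):
    """Scale a coefficient vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def greedy_cluster(vectors, min_similarity=0.95):
    """Label each vector with the first cluster whose representative it matches."""
    representatives, labels = [], []
    for vector in vectors:
        unit = normalize(vector)
        for index, rep in enumerate(representatives):
            # Absolute value: p and -p vanish on the same points.
            if abs(sum(a * b for a, b in zip(unit, rep))) >= min_similarity:
                labels.append(index)
                break
        else:
            representatives.append(unit)
            labels.append(len(representatives) - 1)
    return labels


# Coefficients (a, b) of a*x + b*y found for five neighborhoods.
coefficients = [(1.0, -1.0), (0.98, -1.02), (-1.0, 1.0), (1.0, 1.0), (1.01, 0.99)]
labels = greedy_cluster(coefficients)
```

The first three neighborhoods lie near the line x = y and the last two near x = -y, so they fall into two clusters.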
At 208, for each cluster of data points, the method includes partitioning the evaluated resulting polynomials based on a threshold. The partitioning engine 112 partitions the polynomials resulting from the SVD of the subtraction matrix based on a threshold. The threshold may be preconfigured to be 0 or a value greater than but close to 0. Any polynomial that results in a value on the points less than the threshold is considered to be a polynomial associated with the class of points being learned, while all other polynomials then become the candidate polynomials for the subsequent iteration of the SAVI process.
In one implementation, the partitioning engine 112 sets $U_d$ equal to $(C_d - E_d) V S^{-1}$ and then partitions the polynomials of $U_d$ according to the singular values to obtain $G_d$ and $O_d$. $G_d$ is the set of polynomials that evaluate to less than the threshold on the points. $O_d$ is the set of polynomials that do not evaluate to less than the threshold on the points.
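One way to read this partitioning step is sketched below. It is an interpretation under stated assumptions rather than the described implementation: the columns of V are treated as coefficient vectors over the candidate polynomials, columns whose singular value is at or below the threshold form the approximately-zero set, and only the remaining columns are scaled by the inverse singular values (which avoids dividing by a near-zero value). The function name and the sample matrix are hypothetical.

```python
import numpy as np


def partition_by_singular_values(subtraction, threshold):
    """Split combinations of candidates into vanishing and non-vanishing sets."""
    U, s, Vt = np.linalg.svd(subtraction, full_matrices=False)
    V = Vt.T
    vanishing = s <= threshold

    # Combinations that evaluate to approximately zero on the points.
    G_coeffs = V[:, vanishing]
    # Remaining combinations, scaled by S^{-1} to unit norm on the points.
    O_coeffs = V[:, ~vanishing] / s[~vanishing]
    return G_coeffs, O_coeffs


# Hypothetical subtraction matrix for three points on the line y = x:
# columns are the mean-centered candidates x and y.
subtraction = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])

G_coeffs, O_coeffs = partition_by_singular_values(subtraction, threshold=1e-8)
```

For this data the single vanishing combination is proportional to x - y, which is the polynomial that is zero on the line.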
The partitioning engine 112 also may increment the value of d and multiply the set of candidate polynomials in degree d−1 that do not evaluate to 0 on the points by the degree 1 candidate polynomials that do not evaluate to 0 on the points. The partitioning engine 112 further computes $D_d = O_1 \times O_{d-1}$ and then sets the candidate set of polynomials for the next iteration of the SAVI process to be the orthogonal complement in $D_d$ of $\operatorname{span} \bigcup_{i=1}^{d-1} G_i \times O_{d-i}$.
The results of the process of
At 222, the method includes initializing the candidate polynomials. This operation also may be performed by the initialization engine and may include initializing the dimension to 1 to begin the process with dimension 1 polynomials.
At 224, the method further includes determining (e.g., by the neighborhood determination engine 102) the neighborhood of data points about each selected point p, as described above. In one example, the neighborhood determination engine 102 determines the neighborhood by selecting data points within a threshold distance of the selected point p. At 226, a SAVI process 240 is performed on the neighborhood of data points about initial point p. This SAVI process 240 is designated as SAVI_A simply to differentiate from a slightly different SAVI_B process 280 described below in
Referring to
Operation 242 includes generating a projection set of polynomials by computing a projection set of space linear combination of the candidate polynomials of degree d (d=1 in this initial iteration of the method of
At 244, SAVI_A process 240 includes subtracting the projection set of polynomials (from operation 242) evaluated on the neighborhood of data points from the set of candidate polynomials evaluated on the data points to generate a subtraction matrix of evaluated resulting polynomials.
At 246, the SAVI_A process 240 includes computing a singular value decomposition of the subtraction matrix of the evaluated resulting polynomials.
Referring back to
When all data points have been processed, then at 234 the polynomials computed for each neighborhood of data points are clustered (e.g., by clustering engine 110 as described above). At 235, a representative polynomial from each cluster is chosen. At 236, the chosen clustered polynomials are partitioned (e.g., by partitioning engine 112) into approximately-zero polynomials and non-approximately-zero polynomials.
Operations 224-232 may be repeated for higher dimension polynomials (2, 3, etc.) before clustering and partitioning the polynomials.
The candidate polynomials considered for each neighborhood of data points may include two or more polynomials that are duplicates. Such duplicates should be eliminated from consideration to make the process more efficient. In some implementations, the polynomials are represented by the various engines/modules in “concrete form,” that is, in terms of their explicit mathematical representation. An example of a polynomial in concrete form is 2X^3+4XY^2-17X^2Y^2+4Y^3.
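A concrete form can be held in memory in many ways; the sketch below assumes one of them, a dictionary mapping exponent tuples to coefficients, and encodes the polynomial 2X^3 + 4XY^2 - 17X^2Y^2 + 4Y^3. The helper names `degree` and `evaluate` are hypothetical.

```python
# Keys are (exponent of X, exponent of Y); values are coefficients.
polynomial = {
    (3, 0): 2.0,    # 2 X^3
    (1, 2): 4.0,    # 4 X Y^2
    (2, 2): -17.0,  # -17 X^2 Y^2
    (0, 3): 4.0,    # 4 Y^3
}


def degree(poly):
    """Degree of a polynomial: the maximum total degree of its monomials."""
    return max(sum(exponents) for exponents in poly)


def evaluate(poly, point):
    """Evaluate the polynomial at a point given as a tuple of coordinates."""
    total = 0.0
    for exponents, coefficient in poly.items():
        term = coefficient
        for coordinate, exponent in zip(point, exponents):
            term *= coordinate ** exponent
        total += term
    return total


poly_degree = degree(polynomial)
value_at_1_1 = evaluate(polynomial, (1.0, 1.0))
value_at_2_1 = evaluate(polynomial, (2.0, 1.0))
```

The number of possible monomials grows quickly with the degree and the number of variables, so a table of this kind can become large.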
Saving such concrete forms in storage, however, may create a significant burden on storage capacity. As such, in other implementations, rather than representing polynomials in concrete form, polynomials are represented based on an iterative algorithm. For each degree d, various singular value decompositions are performed as described above. Each polynomial constructed during the process described herein is constructed either by multiplying polynomials previously constructed, subtracting existing polynomials, multiplying by one of the matrices in the SVD, or by taking several rows of the subtraction matrix. The information that is used to represent each polynomial thus may include the applicable SVDs, the polynomials of the previous step in the process that were multiplied together, and which rows of the subtraction matrix correspond to the approximately-zero polynomials and which rows do not.
With polynomials being represented in the form described above, it may be difficult to determine if two or more of such representations represent the same polynomial. That is, the same polynomial may be represented in multiple such forms. To eliminate multiple representations of the same polynomial, the method of
Referring to
The candidate polynomials for each neighborhood of data points are first evaluated on the random set of points Q. If any two candidate polynomial representations result in the same value for all points Q, then such representations are considered to be describing the same polynomials and are duplicates—one of such representations is thus removed from further consideration.
At 252, an initial data point p is selected as well as the random set of points Q. Points Q may be previously determined and stored in non-transitory storage device 130 and thus selecting points Q may include retrieving the points Q from the storage device. At 254, the method of
At 256, a modified version of the SAVI_A process is run on the random set of points Q, and is referred to as the SAVI_B process 280. An example of the SAVI_B process 280 run on points Q is illustrated in
Referring briefly to
At 258, the method includes removing duplicate candidate polynomials based on the random set of points Q and may be performed by the polynomial duplicate removal engine 116. In one example, the set of candidate polynomials are all evaluated on all of points Q and a determination is made as to whether any two (or more) polynomials evaluate to the same value for at least a threshold number of points Q (e.g., for at least 20 points Q). If so, such candidate polynomials are considered duplicates and one of such candidate polynomials is removed from further consideration.
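The duplicate test on the random points Q might look like the sketch below. Several details are assumptions: polynomials are modeled as plain Python callables, "the same value" is checked within a small floating-point tolerance, and the first representation encountered is the one kept. The function name `remove_duplicates` and its parameters are hypothetical.

```python
import random


def remove_duplicates(polynomials, Q, min_matches, tolerance=1e-9):
    """Drop polynomials that agree with an earlier one on at least min_matches points."""
    kept = []  # pairs of (polynomial, its values on Q)
    for poly in polynomials:
        values = [poly(*q) for q in Q]
        is_duplicate = any(
            sum(abs(a - b) <= tolerance for a, b in zip(values, kept_values))
            >= min_matches
            for _, kept_values in kept
        )
        if not is_duplicate:
            kept.append((poly, values))
    return [poly for poly, _ in kept]


# Random evaluation points Q (seeded so the example is repeatable).
rng = random.Random(0)
Q = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(20)]

candidates = [
    lambda x, y: (x + y) ** 2,               # one representation
    lambda x, y: x * x + 2 * x * y + y * y,  # same polynomial, written differently
    lambda x, y: x - y,                      # a different polynomial
]

unique = remove_duplicates(candidates, Q, min_matches=20)
```

Two distinct polynomials agree only on a lower-dimensional set, so randomly drawn points are very unlikely to make them look identical.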
Referring again to
Once the approximately-zero polynomials are determined for each class, the polynomials can be used to classify new data points. A module/engine may be included to receive a new data point to be classified and to evaluate all of the various approximately-zero polynomials on the data point to be classified. The new data point is assigned to whichever class's approximately-zero polynomials evaluate to approximately zero for the point (or at least less than the evaluations of all other classes' approximately-zero polynomials on the point).
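The classification rule can be sketched as follows. The scoring choice (the largest absolute evaluation among a class's polynomials) is an assumption, since the text does not say how several polynomials of one class are combined. The class names, polynomials, and the function `classify` are hypothetical.

```python
def classify(point, class_polynomials):
    """Return the class whose approximately-zero polynomials are closest to zero."""
    def score(polynomials):
        # Assumed aggregate: worst-case magnitude over the class's polynomials.
        return max(abs(poly(*point)) for poly in polynomials)

    return min(class_polynomials, key=lambda label: score(class_polynomials[label]))


# Hypothetical approximately-zero polynomials learned for two classes.
class_polynomials = {
    "circle": [lambda x, y: x * x + y * y - 1.0],
    "line": [lambda x, y: x - y],
}

label_a = classify((0.6, 0.81), class_polynomials)  # near the unit circle
label_b = classify((2.0, 2.01), class_polynomials)  # near the line y = x
```

A tolerance check could be added on top of this so that a point far from every class is rejected rather than assigned to the least-bad class.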
The above discussion is meant to be illustrative of the principles and various embodiments of the present invention. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/US2013/052848 | 7/31/2013 | WO | 00 |