The present invention, in some embodiments thereof, relates to data processing and, more particularly, but not exclusively, to a method and system for reducing dimensionality of data.
Many computer applications are used to extract information from large amounts of data. For example, computer applications are employed to extract useful trends and correlations from large databases of raw data. It may involve consolidating and summarizing huge databases containing many items and making data viewable along multidimensional axes, while allowing the variables of interest to be changed at will in an interactive fashion.
Typically, a multidimensional database stores and organizes data in a way that better reflects how a user would want to view the data than is possible in a two-dimensional spreadsheet or relational database file. Multidimensional databases are generally better suited to handle applications with large volumes of numeric data and that require calculations on numeric data, although they are not limited to such applications.
A dimension within multidimensional data is typically a basic categorical definition of data. Each dimension may have a hierarchy associated with it, and each hierarchy may have any number of levels.
Dimensionality reduction is a known data processing technique, in which a dataset is simplified by scaling down its dimensions. Dimensionality reduction is particularly useful for the purpose of recognition and classification. The efficiency of dimensionality reduction tools is typically measured in terms of the required computational resources, which generally depend on the number of data points in the dataset. Dimensionality reduction typically involves a tradeoff between sparse local operators that involve less than quadratic complexity, and multi-scale models with quadratic complexity.
One family of dimensionality reduction techniques is known as multidimensional scaling (MDS), which attempts to map all pairwise distances between data points into low-dimensional Euclidean domains. MDS can be used as a technique for storing data elements as a relative set of points in a space having reduced dimensionality with respect to the size of the set. The relative location of the points is dependent upon the element similarities or dissimilarities, which are interpreted as a set of distances between the points.
For example, U.S. Pat. No. 6,569,096 discloses an MDS technique in which sets of observable samples are generated, where each observable sample is generated by a different algorithm combining a number of factors. Data on observable relative differences between samples are collected for each set, and multidimensional statistical analysis is performed on the collected data. U.S. Pat. No. 8,645,440 discloses an MDS technique in which an iterative optimization technique and a vector extrapolation technique are sequentially and repeatedly applied on a vector of coordinates which represents the coordinates of data elements of the dataset. U.S. Pat. No. 7,496,597 discloses a technique for representing an MDS space as a hierarchical data structure with root nodes and leaf nodes.
According to an aspect of some embodiments of the present invention there is provided a method of reducing a dimensionality of a dataset. The method comprises: calculating an interpolation matrix based on a Laplacian eigenbasis matrix of a sparse representation of the dataset; applying multidimensional scaling (MDS) to a transformation matrix of the interpolation matrix, thereby providing a reduced dataset; and storing the reduced dataset in a computer readable medium.
According to some embodiments of the invention the method further comprises obtaining a kernel function for defining diffusion distances over the sparse representation, wherein the calculation of the interpolation matrix is based on the kernel function but not on the diffusion distances.
According to some embodiments of the invention the sparse representation of the dataset is characterized by a dissimilarity matrix, wherein the calculation of the interpolation matrix is also based on the dissimilarity matrix.
According to some embodiments of the invention the method further comprises calculating the dissimilarity matrix.
According to some embodiments of the invention the method comprises selecting a subset of elements from the dataset, thereby providing the sparse representation.
According to some embodiments of the invention the selection of the subset is by a farthest-point sampling procedure.
According to some embodiments of the invention the method comprises calculating the Laplacian eigenbasis matrix.
According to some embodiments of the invention the calculation of the dissimilarity matrix comprises calculating a dissimilarity measure between every two elements of the selected subset.
According to some embodiments of the invention the dissimilarity measure comprises a geodesic distance over a manifold defined by the dataset. According to some embodiments of the invention the calculation of the interpolation matrix comprises applying an optimization procedure to traces of matrices obtained by transformations of an eigenvalue matrix of the Laplacian eigenbasis by the interpolation matrix.
According to some embodiments of the invention the calculation of the interpolation matrix comprises transforming a matrix describing the sparse representation using a matrix constructed from the Laplacian eigenbasis matrix, from an eigenvalue matrix of the Laplacian eigenbasis, and from a projection matrix describing a projection of the dataset on the sparse representation.
According to some embodiments of the invention the transformation matrix of the interpolation matrix is a matrix defined as a transformation of the interpolation matrix using the Laplacian eigenbasis matrix.
According to some embodiments of the invention the MDS is effected by a singular value decomposition procedure followed by an eigen decomposition procedure.
According to some embodiments of the invention the dataset comprises coordinates describing a plurality of objects.
According to some embodiments of the invention the dataset comprises images of handwritten characters or symbols.
According to some embodiments of the invention the dataset comprises biometric data.
According to some embodiments of the invention the dataset comprises audio data.
According to some embodiments of the invention the dataset comprises video data.
According to some embodiments of the invention the dataset comprises biological data.
According to some embodiments of the invention the dataset comprises chemical data.
According to some embodiments of the invention the dataset describes signals acquired by a medical device.
According to some embodiments of the invention the dataset comprises meteorological data.
According to some embodiments of the invention the dataset comprises seismic data.
According to some embodiments of the invention the dataset comprises hyperspectral data.
According to some embodiments of the invention the dataset comprises financial data.
According to some embodiments of the invention the dataset comprises marketing data.
According to some embodiments of the invention the dataset comprises a textual corpus.
According to an aspect of some embodiments of the present invention there is provided a computer software product, comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a data processor, cause the data processor to access a dataset and execute the method as described above and optionally further detailed hereinbelow.
According to an aspect of some embodiments of the present invention there is provided a system of reducing a dimensionality of a dataset. The system comprises a data processor configured for accessing the dataset, calculating an interpolation matrix based on a Laplacian eigenbasis matrix of a sparse representation of the dataset, and applying multidimensional scaling (MDS) to a transformation matrix of the interpolation matrix.
According to some embodiments of the invention the data processor is configured for receiving a kernel function for defining diffusion distances over the sparse representation, wherein the calculation of the interpolation matrix is based on the kernel function but not on the diffusion distances.
According to some embodiments of the invention the sparse representation of the dataset is characterized by a dissimilarity matrix, and wherein the data processor is configured for calculating the interpolation matrix also based on the dissimilarity matrix.
According to some embodiments of the invention the data processor is configured for calculating the dissimilarity matrix. According to some embodiments of the invention the data processor is configured for selecting a subset of elements from the dataset, thereby providing the sparse representation.
According to some embodiments of the invention the data processor is configured for selecting the subset by employing a farthest-point sampling procedure.
According to some embodiments of the invention the data processor is configured for calculating the Laplacian eigenbasis matrix.
According to some embodiments of the invention the data processor is configured for calculating the dissimilarity matrix by calculating a dissimilarity measure between every two elements of the selected subset.
According to some embodiments of the invention the dissimilarity measure comprises a geodesic distance over a manifold defined by the dataset.
According to some embodiments of the invention the data processor is configured for calculating the interpolation matrix by applying an optimization procedure to traces of matrices obtained by transformations of an eigenvalue matrix of the Laplacian eigenbasis by the interpolation matrix.
According to some embodiments of the invention the data processor is configured for calculating the interpolation matrix by transforming a matrix describing the sparse representation using a matrix constructed from the Laplacian eigenbasis matrix, from an eigenvalue matrix of the Laplacian eigenbasis, and from a projection matrix describing a projection of the dataset on the sparse representation.
According to some embodiments of the invention the transformation matrix of the interpolation matrix is a matrix defined as a transformation of the interpolation matrix using the Laplacian eigenbasis matrix.
According to some embodiments of the invention the MDS is effected by a singular value decomposition procedure followed by an eigen decomposition procedure.
Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
Implementation of the method and/or system of embodiments of the invention can involve performing or completing selected tasks manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of embodiments of the method and/or system of the invention, several selected tasks could be implemented by hardware, by software or by firmware or by a combination thereof using an operating system.
For example, hardware for performing selected tasks according to embodiments of the invention could be implemented as a chip or a circuit. As software, selected tasks according to embodiments of the invention could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system. In an exemplary embodiment of the invention, one or more tasks according to exemplary embodiments of method and/or system as described herein are performed by a data processor, such as a computing platform for executing a plurality of instructions. Optionally, the data processor includes a volatile memory for storing instructions and/or data and/or a non-volatile storage, for example, a magnetic hard-disk and/or removable media, for storing instructions and/or data. Optionally, a network connection is provided as well. A display and/or a user input device such as a keyboard or mouse are optionally provided as well.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.
In the drawings:
The present invention, in some embodiments thereof, relates to data processing and, more particularly, but not exclusively, to a method and system for reducing dimensionality of data.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
The present inventors discovered that a dimensionality of a dataset can be efficiently reduced by considering the geometrical properties of the dataset. Specifically, the present inventors found that it is advantageous to consider a dataset of elements as a sampled version of a manifold, e.g., a Riemannian 2-manifold, and to reduce the dimensionality of the dataset by utilizing the geometry of the manifold.
Some of the operations for reducing the dimensionality of a dataset are matrix operations. Representative examples of operations include summation, multiplication, decomposition, transformation, and calculations of eigenvectors and eigenvalues. All these operations are well known to those skilled in the art of matrix operations. Herein, matrices are represented by bold letters.
The method can be embodied in many forms. For example, it can be embodied on a tangible medium such as a computer for performing the method steps. It can be embodied on a computer readable medium, preferably a non-volatile computer readable medium, comprising computer readable instructions for carrying out the method steps. It can also be embodied in an electronic device having digital computer capabilities arranged to run the computer program on the tangible medium or execute the instructions on a computer readable medium.
Computer programs implementing the method of the present embodiments can commonly be distributed to users over a communication network, such as the internet, or on a distribution medium such as, but not limited to, a CD-ROM or a flash drive. From the communication network or distribution medium, the computer programs can be copied to a hard disk or a similar intermediate storage medium. The computer programs can be run by loading the computer instructions either from their distribution medium or their intermediate storage medium into the execution memory of the computer, configuring the computer to act in accordance with the method of the present embodiments. All these operations are well-known to those skilled in the art of computer systems.
The method begins at 10 and continues to 11 at which a dataset is received. The dataset is typically composed of a set of n elements $\{V_1, V_2, \ldots, V_n\}$, wherein each element is a multidimensional element. The number of dimensions of each element is denoted dim. The elements of the dataset can represent any type of data, including, without limitation, coordinates describing a plurality of objects (e.g., CAD data, biological surfaces), audio data, video data, biological data (e.g., protein expression data, genomic data), chemical data (e.g., chemical structures, chemical properties), biometric data, signals acquired by a medical device (e.g., EEG data, MEG data, MRI data), meteorological data, seismic data, hyperspectral data, financial data, marketing data, a textual corpus, images of handwritten characters or symbols and the like.
The elements of the dataset optionally and preferably sample a manifold, preferably a Riemannian 2-manifold, and can therefore be considered as points on the manifold. For example, when the elements of the dataset represent coordinates describing an object, the manifold can be the surface of the object. It is appreciated that a manifold, particularly a Riemannian 2-manifold, can also be defined for the aforementioned types of data even if the elements of the dataset are not coordinates on a surface of an object.
The metric tensor of the manifold is denoted by the upper case letter G. G defines distances on the manifold, scalar products between vectors or vector fields that are tangential to the manifold, and scalar products between functions that are defined on the manifold. The determinant of the metric tensor G is denoted by the lower-case letter g, and the discretization matrix of the square root of g is denoted A. For example, when the manifold represents a triangulated surface, A can be a diagonal matrix whose $A_{ii}$ element is the sum of areas of all triangles that share the surface vertex i.
The method optionally and preferably continues to 12 at which a sparse representation of the dataset is constructed. Alternatively, the method can receive the sparse representation as input from an external source (e.g., user input). The sparse representation is defined over a subset of $m_s$ ($m_s < n$) elements from the dataset, which subset can be selected by the method or can be received as input from an external source (e.g., user input). The selection of the subset (in embodiments in which such operation is employed) can be according to any sampling technique known in the art. In some embodiments of the present invention the sampling is by a procedure known as the farthest-point sampling procedure, which is described in Eldar et al., “The farthest point strategy for progressive image sampling,” IEEE Trans. Image Processing, 1997, 6(9):1305-1315, the contents of which are hereby incorporated by reference. This procedure is known to be 2-optimal in the sense of covering.
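The farthest-point strategy can be illustrated with a minimal sketch. It assumes Euclidean distances between elements (on a manifold, geodesic distances would take their place), and the function name `farthest_point_sampling`, the `start` index and the sample points are hypothetical names and values chosen for illustration only.

```python
import numpy as np

def farthest_point_sampling(points, m_s, start=0):
    """Greedily pick m_s indices; each new index is farthest from those already picked."""
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    if not 0 < m_s <= n:
        raise ValueError("m_s must satisfy 0 < m_s <= n")
    selected = [start]
    # min_dist[i] is the distance from element i to its nearest selected element.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(m_s - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        # Euclidean distance is an assumption made for this sketch.
        dist_to_new = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected

# Hypothetical dataset of five 2-D elements.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [10.0, 1.0], [5.0, 5.0]])
subset = farthest_point_sampling(pts, 3)
```

Each iteration costs one pass over the n elements, so selecting $m_s$ elements in this sketch takes on the order of $n \cdot m_s$ distance evaluations.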
The sparse representation is characterized by a dissimilarity matrix D, which contains a dissimilarity measure between every two elements of the selected subset.
It is not necessary for the dissimilarity matrix D to be calculated. As will be explained below, it was found by the present inventors that the dimensionality of the dataset can be reduced without knowledge of the matrix elements of the matrix D. Yet, in some embodiments, the matrix D is used for the dimensionality reduction. In these embodiments, the matrix D can be calculated by the method or received as input.
Each matrix element of D represents a dissimilarity measure between two elements of the subset. In some embodiments of the present invention the dissimilarity measure between two elements relates to a geodesic distance between the respective two points of the manifold. For example, the dissimilarity measure can be the square of the geodesic distance. The value of the geodesic distance between two points is the length of the minimal geodesic of the manifold which passes through the points.
The calculation of geodesic distance matrices is well known in the art. In some embodiments of the invention the calculation of D is performed using the fast marching method (FMM), found, e.g., in J. A. Sethian, “A fast marching level set method for monotonically advancing fronts,” Proc. Nat. Acad. Sci., 1996, 93(4):1591-1595; and R. Kimmel and J. A. Sethian, “Computing geodesic paths on manifolds,” Proc. US National Academy of Science, 1998, 95:8431-8435, the contents of which are hereby incorporated by reference. FMM is an efficient numerical method to compute a first-order approximation of the geodesic distances. Given a set of points on the manifold, a distance map from these points to other points on the manifold is obtained as the solution of an Eikonal equation.
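As a rough illustration of how the entries of D may be filled, the sketch below approximates geodesic distances by shortest paths over a weighted graph using Dijkstra's algorithm. This is a simpler substitute technique and not the fast marching method; the graph `adj`, the index list `sampled` and the function name `dijkstra` are hypothetical.

```python
import heapq
import math

def dijkstra(adj, source):
    """Shortest-path distances from source over a graph {node: [(neighbor, weight), ...]}."""
    dist = {v: math.inf for v in adj}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry, a shorter path was already found
        for v, w in adj[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical 4-vertex graph standing in for a discretized manifold.
adj = {
    0: [(1, 1.0), (2, 2.5)],
    1: [(0, 1.0), (2, 1.0)],
    2: [(1, 1.0), (0, 2.5), (3, 1.0)],
    3: [(2, 1.0)],
}
sampled = [0, 3]  # indices of the selected subset
# Squared graph distance between every two sampled elements.
D = [[dijkstra(adj, i)[j] ** 2 for j in sampled] for i in sampled]
```

Graph shortest paths are restricted to mesh edges and therefore tend to overestimate true geodesic lengths, which is one reason a front-propagation method such as FMM may be preferred.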
In some embodiments of the present invention the dissimilarity measure between two elements relates to a diffusion distance between the respective two points on the manifold. The concept of diffusion distance is known in the art and found in Berard et al., 1994, “Embedding Riemannian manifolds by their heat kernel,” Geom Funct Anal 4(4):373-398, and Coifman et al., 2006 “Diffusion maps,” Appl Comput Harmon Anal 21(1):5-30, the contents of which are hereby incorporated by reference.
Generally, given a kernel function K(·), the diffusion distance D between two points x and y satisfies $D^2(x,y) = \sum_k (\phi_k(x) - \phi_k(y))^2 K(\lambda_k)$, where $\phi_k$ represents the kth eigenfunction of a Laplace operator (e.g., the Laplace-Beltrami operator) on the manifold, and $\lambda_k$ is its associated eigenvalue. It was found by the present inventors that when the dissimilarity measure relates to the diffusion distance, it is sufficient to obtain the kernel function for reducing the dimensionality of the dataset, whereby the diffusion distances themselves (the individual matrix elements of the matrix D) are not required.
The method optionally continues to 13 at which a Laplacian eigenbasis of the sparse representation of the dataset is obtained. The Laplacian eigenbasis can be expressed as a Laplacian eigenbasis matrix Φ, whose ith column is the ith eigenfunction of the Laplace operator, where $1 \le i \le m_s$. The Laplacian eigenbasis can be calculated by the method or received as input from an external source (e.g., user input).
In some embodiments of the present invention the columns of the Laplacian eigenbasis matrix Φ are eigenfunctions of the Laplace-Beltrami operator (LBO). The LBO is defined over a non-planar surface, and is generally defined as the divergence of the gradient of the surface. The LBO is typically expressed by means of a discretization matrix L. Typically, the diagonal of L is the negative sum of the off-diagonal elements of the corresponding row. Any discretization matrix can be used for constructing the Laplacian eigenbasis. In some embodiments of the present invention L satisfies the relation $L = A^{-1} W$, where W is a weight matrix.
In some embodiments, the weight matrix W is defined in terms of cotangent edge weights which are suitable for constructing a discrete LBO operator on a triangle mesh that discretizes the manifold. Cotangent edge weights are weights that are assigned to the edges of the triangles of the meshes, and that are proportional to cotangents of angles between edges. The use of the cotangent function is particularly useful since it expresses the ratio between a scalar product and a vector product between two edges. Typically, an edge is assigned with a weight that is proportional to cotangents of angles between edges that share triangles with it. When an edge is on the boundary of the surface, it is associated with one angle and it can be assigned with a weight that is proportional to the cotangent of the angle against the edge at a vertex of the triangle opposite to the edge. When an edge is internal with respect to the boundary of the surface, it is associated with two triangles and it can be assigned with a weight that is proportional to the sum of cotangents of the angles against the edge at the vertices of the two triangles opposite to the edge.
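The diffusion distance relation can be evaluated directly once eigenpairs and a kernel are available. The following is a minimal sketch; the heat kernel $K(\lambda) = e^{-t\lambda}$, the value of `t`, and the small path-graph Laplacian standing in for a manifold are all assumptions made for illustration.

```python
import numpy as np

def diffusion_distance_sq(phi, lam, kernel):
    """Evaluate D^2(x, y) = sum_k (phi_k(x) - phi_k(y))^2 K(lambda_k) for all pairs."""
    weights = kernel(lam)                        # K(lambda_k), shape (m,)
    diff = phi[:, None, :] - phi[None, :, :]     # phi_k(x) - phi_k(y), shape (n, n, m)
    return np.einsum("xyk,k->xy", diff ** 2, weights)

def heat_kernel(lam, t=0.5):
    """Assumed kernel for this sketch; any kernel function could be substituted."""
    return np.exp(-t * lam)

# Hypothetical stand-in for the Laplace operator: Laplacian of a 4-vertex path graph.
lap = np.array([[ 1.0, -1.0,  0.0,  0.0],
                [-1.0,  2.0, -1.0,  0.0],
                [ 0.0, -1.0,  2.0, -1.0],
                [ 0.0,  0.0, -1.0,  1.0]])
lam, phi = np.linalg.eigh(lap)                   # columns of phi are eigenvectors
D2 = diffusion_distance_sq(phi, lam, heat_kernel)
```

The resulting matrix is symmetric with a zero diagonal, as expected of squared distances.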
A portion of a triangle mesh is illustrated in
where E is the set of edges of the triangle mesh.
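A minimal sketch of assembling cotangent edge weights and the diagonal area matrix A on a small mesh follows. The factor of one half on each cotangent is a common convention assumed here (the text only requires proportionality), and the function name `cotangent_weights` and the two-triangle mesh are hypothetical.

```python
import numpy as np

def cotangent_weights(vertices, triangles):
    """Return (W, A): cotangent weight matrix and diagonal vertex-area matrix."""
    V = np.asarray(vertices, dtype=float)
    n = len(V)
    W = np.zeros((n, n))
    area = np.zeros(n)
    for tri in triangles:
        for r in range(3):
            i, j, k = tri[r], tri[(r + 1) % 3], tri[(r + 2) % 3]
            # The angle at vertex k lies opposite to edge (i, j).
            u, v = V[i] - V[k], V[j] - V[k]
            # Cotangent = scalar product divided by the norm of the vector product.
            cot = np.dot(u, v) / np.linalg.norm(np.cross(u, v))
            W[i, j] += 0.5 * cot   # the 1/2 factor is an assumed convention
            W[j, i] += 0.5 * cot
        a, b, c = (V[t] for t in tri)
        # Each vertex accumulates the area of every triangle that shares it.
        area[list(tri)] += 0.5 * np.linalg.norm(np.cross(b - a, c - a))
    # Diagonal entries are the negative sum of the off-diagonal entries of each row.
    W[np.diag_indices(n)] = -W.sum(axis=1)
    return W, np.diag(area)

# Hypothetical mesh: a unit square split into two triangles (3-D coordinates).
verts = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
tris = [(0, 1, 2), (0, 2, 3)]
W, A = cotangent_weights(verts, tris)
L = np.linalg.solve(A, W)   # L = A^{-1} W
```

In this example the diagonal edge (0, 2) faces two right angles, so its weight is zero, while each boundary edge receives a weight from the single angle opposite to it.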
In some embodiments, the weight matrix W is the graph Laplacian, where the graph is constructed by connecting nearest neighbors. Graph Laplacian matrices are known in the art and found, for example, in Belkin et al., 2001, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” Advances in Neural Information Processing Systems (MIT Press, Cambridge, Mass.), pp 585-591, the contents of which are hereby incorporated by reference.
Once the discretization matrix L of the LBO is obtained, the eigenvectors of L can be calculated and the matrix Φ can be constructed by setting its ith column to be the ith eigenvector of L.
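One way to obtain the eigenvectors of $L = A^{-1} W$ with a symmetric eigensolver is sketched below. It assumes W is symmetric and A is diagonal with positive entries, and it works through the symmetrized matrix $A^{-1/2} W A^{-1/2}$, which has the same eigenvalues as L. The function name `laplacian_eigenbasis` and the example matrices are hypothetical.

```python
import numpy as np

def laplacian_eigenbasis(W, A, m):
    """Return (lam, Phi): the m eigenpairs of L = A^{-1} W with smallest |eigenvalue|."""
    a_inv_sqrt = 1.0 / np.sqrt(np.diag(A))
    # A^{-1/2} W A^{-1/2} is symmetric and shares its eigenvalues with A^{-1} W.
    S = a_inv_sqrt[:, None] * W * a_inv_sqrt[None, :]
    lam, psi = np.linalg.eigh(S)
    # Sorting by magnitude works for either sign convention of W.
    order = np.argsort(np.abs(lam))[:m]
    Phi = a_inv_sqrt[:, None] * psi[:, order]    # map back to eigenvectors of L
    return lam[order], Phi

# Hypothetical inputs: path-graph weight matrix and a diagonal area matrix.
W = np.array([[ 1.0, -1.0,  0.0,  0.0],
              [-1.0,  2.0, -1.0,  0.0],
              [ 0.0, -1.0,  2.0, -1.0],
              [ 0.0,  0.0, -1.0,  1.0]])
A = np.diag([1.0, 2.0, 2.0, 1.0])
lam, Phi = laplacian_eigenbasis(W, A, 3)
```

The columns returned by this sketch are orthonormal with respect to A, that is, $\Phi^T A \Phi = I$.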
The method continues to 14 at which an interpolation matrix α is calculated based on the Laplacian eigenbasis matrix Φ, and optionally also based on the dissimilarity matrix D. The calculation of α can be done in more than one way.
In some embodiments of the present invention an optimization procedure is applied to the traces of the matrices $\alpha^T \Lambda \alpha$ and $\alpha \Lambda \alpha^T$, where Λ is an eigenvalue matrix of the Laplacian eigenbasis.
Herein, expressions of the form $C R C^T$ and $C^T R C$, where C and R are some matrices, are referred to as “the transformation of the matrix R using the matrix C”.
Thus, the optimization procedure is applied to traces of matrices obtained by transformations of the matrix Λ by the matrix α. The optimization procedure is preferably subjected to the constraint $(\Phi \alpha \Phi^T)_{ij} = D(V_i, V_j)$. This optimization can be written as:
where the summation is over the indices of the sparse representation, and the notation $\|\cdot\|_F$ represents the Frobenius norm.
In some embodiments of the present invention α is calculated by transforming a matrix F describing the sparse representation using a matrix M, according to the relation $\alpha = M F M^T$. The matrix F can be defined as $F_{ij} = F(V_i, V_j)$, where F is a function over the set of elements $V_i$, and is defined for pairs of elements in the sparse representation. The matrix M is preferably constructed from several matrices, including the Laplacian eigenbasis matrix Φ, the eigenvalue matrix Λ, and a projection matrix B describing a projection of the dataset on the sparse representation. A preferred expression for the matrix M is $M = 2\mu(\Lambda + \mu \Phi^T B^T B \Phi)^{-1} \Phi^T B^T$, where μ is a predetermined regularity parameter. A typical value for μ is from about 0.1 to about 0.9. In experiments performed by the present inventors μ was selected to be 0.5.
In embodiments in which the dissimilarity measures relate to diffusion distances, α is optionally and preferably calculated according to the relation $\alpha = -2K$, where K is a diagonal matrix whose elements are calculated using the kernel function of the diffusion length, for example, $K_{kk} = K(\lambda_k)$.
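The relation $\alpha = M F M^T$ with the preferred expression for M can be sketched as follows. The shapes are assumptions made for this sketch: Φ is n × m, B is an $m_s$ × n selection matrix that picks the sampled rows, and F is $m_s$ × $m_s$. The function name `interpolation_matrix` and the random inputs are hypothetical.

```python
import numpy as np

def interpolation_matrix(Phi, lam, B, F, mu=0.5):
    """Compute alpha = M F M^T with M = 2 mu (Lam + mu Phi^T B^T B Phi)^{-1} Phi^T B^T."""
    Lam = np.diag(lam)
    BPhi = B @ Phi                               # eigenbasis restricted to the subset
    # Solve the linear system rather than forming the inverse explicitly.
    M = 2.0 * mu * np.linalg.solve(Lam + mu * BPhi.T @ BPhi, BPhi.T)
    return M @ F @ M.T

# Hypothetical inputs for illustration only.
rng = np.random.default_rng(0)
n, m, m_s = 8, 3, 4
Phi = np.linalg.qr(rng.standard_normal((n, m)))[0]    # stand-in eigenbasis
lam = np.arange(m, dtype=float)                       # stand-in eigenvalues 0, 1, 2
B = np.eye(n)[[0, 2, 5, 7]]                           # selects m_s sampled rows
F = rng.standard_normal((m_s, m_s))
F = 0.5 * (F + F.T)                                   # symmetric, like a dissimilarity
alpha = interpolation_matrix(Phi, lam, B, F, mu=0.5)
```

The result is an m × m matrix, so its size depends on the number of eigenvectors retained rather than on the number n of elements in the dataset.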
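The relation $\alpha = -2K$ can be checked numerically: after double centering, $\Phi \alpha \Phi^T$ coincides with the matrix of squared diffusion distances, because the terms that depend on only one of the two points are removed by the centering. The sketch below assumes a heat kernel $K(\lambda) = e^{-\lambda}$ and a 5-vertex path-graph Laplacian as stand-ins.

```python
import numpy as np

n = 5
# Hypothetical Laplacian of a 5-vertex path graph.
lap = np.diag([1.0, 2.0, 2.0, 2.0, 1.0]) - np.eye(n, k=1) - np.eye(n, k=-1)
lam, Phi = np.linalg.eigh(lap)

k_diag = np.exp(-lam)                 # K_kk = K(lambda_k), heat kernel assumed
alpha = -2.0 * np.diag(k_diag)
D_tilde = Phi @ alpha @ Phi.T         # transformation of alpha using Phi

# Squared diffusion distances computed explicitly, for comparison only.
diff = Phi[:, None, :] - Phi[None, :, :]
D2 = (diff ** 2 * k_diag).sum(axis=2)

J = np.eye(n) - np.ones((n, n)) / n   # centering matrix J_ij = delta_ij - 1/n
centered_explicit = J @ D2 @ J
centered_spectral = J @ D_tilde @ J
```

The two centered matrices agree to numerical precision, which illustrates why the individual diffusion distances need not be computed in this case.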
The method preferably continues to 15 at which multidimensional scaling (MDS) is applied to a matrix $\tilde{D}$ defined as a transformation matrix of α. Preferably, $\tilde{D}$ is the transformation of α using Φ, e.g., $\tilde{D} = \Phi \alpha \Phi^T$. Generally, the MDS is applied so as to provide a reduced dataset including n elements (corresponding to the n elements of the original dataset), wherein the dimension of each element is k<dim. This corresponds to the first k columns of the matrix
where J is defined as $J_{ij} = \delta_{ij} - 1/n$, $\delta_{ij}$ being the Kronecker delta. It was nevertheless found by the present inventors that it is not necessary to calculate the matrix elements of $\tilde{D}$ for the purpose of executing the MDS procedure, as will now be explained.
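For orientation, the role of the centering matrix J can be seen in classical MDS, sketched below on a full matrix of squared distances. This is the textbook procedure, shown under the assumption that it is the MDS variant of interest; it forms the full n × n matrix and is therefore not the reduced-cost route described in the text. The function name `classical_mds` and the sample points are hypothetical.

```python
import numpy as np

def classical_mds(D_sq, k):
    """Embed n elements in k dimensions from an n x n matrix of squared distances."""
    n = D_sq.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # J_ij = delta_ij - 1/n
    G = -0.5 * J @ D_sq @ J                    # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(G)
    top = np.argsort(vals)[::-1][:k]           # k largest eigenvalues
    # Clip tiny negative eigenvalues caused by round-off before the square root.
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

# Hypothetical planar points; their squared Euclidean distances form D_sq.
rng = np.random.default_rng(1)
X = rng.standard_normal((6, 2))
D_sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
Y = classical_mds(D_sq, 2)
```

Because the input distances here are exactly Euclidean in two dimensions, the two-dimensional embedding Y reproduces them up to a rigid motion.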
In some embodiments of the present invention the MDS is applied by performing a singular value decomposition (SVD) to the matrix
where H is a matrix defined as $H = \Phi^T \alpha J \Phi$, and selecting a portion of the vectors obtained by the SVD. The number k of the selected vectors is preferably less than the dimension dim of the dataset.
In some embodiments of the present invention the MDS is effected by an SVD procedure followed by an eigen-decomposition procedure. The SVD procedure is preferably applied to the matrix JΦ so as to express this matrix as $J\Phi = S U V^T$. The eigen-decomposition procedure is applied to the matrix $U V^T \alpha V U$, so as to express this matrix as $U V^T \alpha V U = P \Gamma P^T$. Once the matrices S, P and Γ are calculated from the SVD and eigen-decompositions, a matrix Q defined as $Q = S P \Gamma^{1/2}$ is calculated. The skilled person would appreciate that the matrix $Q Q^T$ satisfies the relation $Q Q^T = J \tilde{D} J^T$, so that the MDS can be completed by calculating the first k columns of Q, where k is lower than the dimension dim of the dataset. The advantage of this technique is that it at least partially overcomes potential distortions caused by ignoring high-frequency components.
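The SVD followed by eigen-decomposition can be sketched numerically as below. The inputs are hypothetical: Φ is a random matrix with orthonormal columns, and α is taken to be symmetric positive semi-definite so that $\Gamma^{1/2}$ is real, which is an assumption of this sketch. Note that in the code `u` holds the singular values, so `np.diag(u)` plays the role of U and `S` holds the left singular vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 8, 4, 2
Phi = np.linalg.qr(rng.standard_normal((n, m)))[0]   # stand-in eigenbasis, n x m
C = rng.standard_normal((m, m))
alpha = C @ C.T                                      # symmetric PSD (assumption)
J = np.eye(n) - np.ones((n, n)) / n                  # J_ij = delta_ij - 1/n

# SVD of J Phi, written as J Phi = S U V^T with U diagonal.
S, u, Vt = np.linalg.svd(J @ Phi, full_matrices=False)
U = np.diag(u)

# Eigen-decomposition of the small m x m matrix U V^T alpha V U = P Gamma P^T.
gamma, P = np.linalg.eigh(U @ Vt @ alpha @ Vt.T @ U)
gamma = np.clip(gamma, 0.0, None)                    # guard against round-off
Q = S @ P @ np.diag(np.sqrt(gamma))                  # Q = S P Gamma^{1/2}

# Keep the k columns associated with the largest eigenvalues.
order = np.argsort(gamma)[::-1]
embedding = Q[:, order[:k]]

D_tilde = Phi @ alpha @ Phi.T                        # formed here only for checking
```

Only an n × m SVD and an m × m eigen-decomposition are needed in this sketch; the n × n matrix $\tilde{D}$ is formed solely to verify the relation $Q Q^T = J \tilde{D} J^T$.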
It is expected that during the life of a patent maturing from this application many relevant MDS procedures will be developed and the scope of the term MDS procedure is intended to include all such new technologies a priori.
Once the MDS is completed, the reduced dataset can be stored in a computer readable medium, more preferably a non-transitory computer readable medium.
The method ends at 15.
A display device 40 is shown in communication with data processor 32, typically via I/O circuit 34. Data processor 32 issues to display device 40 graphical and/or textual output images generated by CPU 36. A keyboard 42 is also shown in communication with data processor 32, typically via I/O circuit 34.
It will be appreciated by one of ordinary skill in the art that system 30 can be part of a larger system. For example, system 30 can also be in communication with a network, such as connected to a local area network (LAN), the Internet or a cloud computing resource of a cloud computing facility.
In some embodiments of the invention data processor 32 of system 30 is configured for accessing the dataset, calculating the interpolation matrix based on a Laplacian eigenbasis matrix of the sparse representation of the dataset, and applying the MDS procedure to a transformation matrix of the interpolation matrix, as further detailed hereinabove.
In some embodiments of the invention system 30 communicates with a cloud computing resource (not shown) of a cloud computing facility, wherein the cloud computing resource accesses the dataset, calculates the interpolation matrix based on a Laplacian eigenbasis matrix of the sparse representation of the dataset, and applies the MDS procedure to a transformation matrix of the interpolation matrix, as further detailed hereinabove.
The method as described above can be implemented in computer software executed by system 30. For example, the software can be stored in or loaded to memory 38 and executed on CPU 36. Thus, some embodiments of the present invention comprise a computer software product which comprises a computer-readable medium, more preferably a non-transitory computer-readable medium, in which program instructions are stored. The instructions, when read by data processor 32, cause data processor 32 to access the dataset and execute the method as described above.
As used herein the term “about” refers to ±10%.
The word “exemplary” is used herein to mean “serving as an example, instance or illustration.” Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.
The word “optionally” is used herein to mean “is provided in some embodiments and not provided in other embodiments.” Any particular embodiment of the invention may include a plurality of “optional” features unless such features conflict.
The terms “comprises”, “comprising”, “includes”, “including”, “having” and their conjugates mean “including but not limited to”.
The term “consisting of” means “including and limited to”.
The term “consisting essentially of” means that the composition, method or structure may include additional ingredients, steps and/or parts, but only if the additional ingredients, steps and/or parts do not materially alter the basic and novel characteristics of the claimed composition, method or structure.
As used herein, the singular form “a”, “an” and “the” include plural references unless the context clearly dictates otherwise. For example, the term “a compound” or “at least one compound” may include a plurality of compounds, including mixtures thereof.
Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases “ranging/ranges between” a first indicated number and a second indicated number and “ranging/ranges from” a first indicated number “to” a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
Various embodiments and aspects of the present invention as delineated hereinabove and as claimed in the claims section below find experimental support in the following examples.
Reference is now made to the following examples, which together with the above descriptions illustrate some embodiments of the invention in a non-limiting fashion.
Manifold learning refers to the process of mapping given data into a simple low-dimensional domain that reveals properties of the data. When the target space is Euclidean, the procedure is also known as flattening. The flat embedding is usually a simplification process that aims to preserve, as much as possible, distances between data points in the original space, while being efficient to compute. One family of flattening techniques is multidimensional scaling (MDS), which attempts to map all pairwise distances between data points into small dimensional Euclidean domains. Review of MDS applications in psychophysics can be found in Ref. [12], which includes the computational realization that human color perception is 2D.
The geometry of given data points can be explored by computing all pairwise distances. Then, a flattening procedure attempts to keep the distance between all couples of corresponding points in the low dimensional Euclidean domain. Computing pairwise distances between points was addressed by extension of local distances on the data graph [1, 13], or by consistent numerical techniques for distance computation on surfaces [14]. The complexity involved in storing all pairwise distances is quadratic in the number of data points, which is a limiting factor in large databases.
Alternative models try to keep the size of the input as low as possible by limiting the input to distances of just nearby points. One such example is locally linear embedding [15], which attempts to map data points into a flat domain where each feature coordinate is a linear combination of the coordinates of its neighbors. The minimization, in this case, is for keeping the combination similar in the given data and the target flat domain. Because only local distances are analyzed, the pairwise distances matrix is sparse, with an effective size of O(n), where n is the number of data points. Along the same line, the Hessian locally linear embedding [16] tries to better fit the local geometry of the data to the plane. Belkin and Niyogi [17] suggested embedding data points into the Laplace-Beltrami eigenspace for the purpose of data clustering. Only distances between nearby points are needed, by which a Laplace-Beltrami operator (LBO) is defined. The data can then be projected onto the first eigenvectors that correspond to the smallest eigenvalues of the LBO. The question of how to exploit the LBO decomposition to construct diffusion geometry was addressed in Refs. [18, 19].
De Silva and Tenenbaum [20] recognized the computational difficulty of dealing with a full matrix of pairwise distances, and proposed working with a subset of landmarks, which is used for interpolation in the embedded space. Bengio et al. [21] proposed to extend subsets by Nyström extrapolation, within a space that is an empirical Hilbert space of functions obtained through kernels that represent probability distributions. However, they did not incorporate the geometry of the original data. Asano et al. [22] subdivided the problem into O(√n) subsets of O(√n) data points each. It was found by the present Inventors that such a reduction may be feasible provided the distances between data points can be computed in constant time, which is seldom the case.
The computational complexity of multidimensional scaling was addressed by a multigrid approach in Ref. [23] and vector extrapolation techniques in Ref. [24]. In both cases the acceleration, although effective, required all pairwise distances, an O(n²) input. In Ref. [25], geodesics that are suspected to distort the embedding due to topology effects were filtered out in an attempt to reduce distortions. It was found by the present Inventors that while such filters eliminate some of the pairwise distances, it is not to the extent of substantial computational or memory savings, because the goal was mainly reduction of flattening distortions. In Ref. [26], the eigenfunctions of the LBO on a surface were interpolated into the volume bounded by the surface. This procedure was designed to overcome the need to evaluate the LBO inside the volume, as proposed in Ref. [27]. Both models deal with either surface or volume LBO decomposition of objects in ℝ³ and were not designed to flatten and simplify structures in big data. Another dimensionality reduction method is principal component analysis (PCA). The PCA procedure projects the points to a low-dimensional space by minimizing the least-square fitting error. In Ref. [28], the relation between PCA and MDS was investigated through kernel PCA.
The present Example exploits the property that the gradient magnitude of distance functions on the manifold is equal to 1 almost everywhere. It was found by the present Inventors that interpolation of such smooth functions can be efficiently obtained by projecting the distances into the LBO eigenspace. The present Inventors employed such interpolation and successfully reduced the complexity of the flattening procedure. The efficiency was established by considering a sparse representation of the pairwise distances that are projected onto the LBO leading eigenfunctions.
Broadly speaking, the differential structure of the manifold is captured by the eigenbasis of the LBO, whereas its multiscale structure is encapsulated by sparse sampling of the pairwise distances matrix. The present Example demonstrates this idea by extracting the $\mathbb{R}^4$ structure of points in $\mathbb{R}^{10,000}$, canonizing surfaces to cope with nonrigid deformations, and mapping images of handwritten digits to the plane. The technique of the present embodiments operates on data in any dimension.
A Riemannian manifold $M$ having a metric G is considered. The manifold and its metric induce an LBO, often denoted by $\Delta_G$, and here, without loss of generality, by $\Delta$. The LBO is self-adjoint and defines a set of functions called eigenfunctions, denoted $\varphi_i$, such that $\Delta\varphi_i = \lambda_i\varphi_i$ and
$$\int_M \varphi_i(x)\,\varphi_j(x)\,da(x) = \delta_{ij},$$
where da is an area element. These functions have been used to construct descriptors [30] and linearly relate between metric spaces [31]. In particular, for problems involving functions defined on a triangulated surface, when the number of vertices of $M$ is large, the dimensionality of the functional space can be reduced by considering smooth functions defined in a subspace spanned by just a couple of eigenfunctions of the associated LBO.
To prove the efficiency of LBO eigenfunctions in representing smooth functions, consider a smooth function ƒ.
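As used herein the term “smooth,” in the context of a function f, refers to a function having a gradient $\nabla_G f$ whose L2 norm $\|\nabla_G f\|_2$ is below a predetermined threshold. For a smoothest functional orthonormal basis, it is typically desired to find a finite basis of, say, n functions $\{\varphi_i\}$, i=1, ..., n, that approximate any given smooth function. Formally, for any given function f on the manifold such that $\|\nabla_G f\|_2 < c$, the desired set of basis functions $\{\varphi_i\}$ allows approximating $f \approx \sum_{i=1}^{n}\langle f,\varphi_i\rangle_G\,\varphi_i$ such that the representation error, defined by
$$r_n = f - \sum_{i=1}^{n}\langle f,\varphi_i\rangle_G\,\varphi_i,$$
is as small as possible.
The present inventors found that for each i satisfying 1≤i≤n, each of the scalar products $\langle r_n,\varphi_i\rangle_G$ and $\langle\nabla_G r_n,\nabla_G\varphi_i\rangle_G$ equals zero. Using these properties, the norms of $r_n$ and its gradient are obtained:
$$\|r_n\|_G^2 = \sum_{i=n+1}^{\infty}\langle f,\varphi_i\rangle_G^2, \qquad \|\nabla_G r_n\|_G^2 = \sum_{i=n+1}^{\infty}\lambda_i\,\langle f,\varphi_i\rangle_G^2 \;\ge\; \lambda_{n+1}\,\|r_n\|_G^2,$$
where ordered eigenvalues $\lambda_1\le\lambda_2\le\dots$ have been assumed.
Since $\langle\nabla_G r_n,\nabla_G\varphi_i\rangle_G = 0$, the square of the norm of the gradient of f, $\|\nabla_G f\|_G^2$, can be written as:
$$\|\nabla_G f\|_G^2 = \Big\|\sum_{i=1}^{n}\langle f,\varphi_i\rangle_G\,\nabla_G\varphi_i\Big\|_G^2 + \|\nabla_G r_n\|_G^2 \;\ge\; \|\nabla_G r_n\|_G^2.$$
It follows that
$$\|r_n\|_G^2 \;\le\; \frac{\|\nabla_G r_n\|_G^2}{\lambda_{n+1}} \;\le\; \frac{\|\nabla_G f\|_G^2}{\lambda_{n+1}}.$$
For d-dimensional manifolds, the spectrum has a linear behavior in $n^{2/d}$, that is, $\lambda_n \approx C_1 n^{2/d}$ as n→∞. The residual function $r_n$ depends linearly on $\|\nabla_G f\|$, which is bounded by a constant. It follows that $r_n$ converges asymptotically to zero at a rate of $O(n^{-2/d})$. This convergence depends on $\|\nabla_G f\|_G^2$, so that there exists $C_2$ such that
$$\|r_n\|_G^2 \;\le\; \frac{C_2\,\|\nabla_G f\|_G^2}{n^{2/d}} \quad \text{for every } f: M\to\mathbb{R}. \qquad \text{EQ. A.1}$$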
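The residual bound discussed above can be checked numerically on a toy problem. The following is a minimal sketch, not the inventors' implementation: it assumes a path graph with unit weights as a stand-in for the discretized LBO (so that A is the identity), and an arbitrary smooth test function. The helper name `path_laplacian` and the test function are illustrative assumptions. In this discrete setting the squared residual after projecting onto the first m eigenvectors should not exceed the discrete gradient energy divided by the (m+1)-th eigenvalue.

```python
import numpy as np

def path_laplacian(n):
    """Positive semidefinite Laplacian W of a path graph with n nodes (unit weights)."""
    W = np.zeros((n, n))
    i = np.arange(n - 1)
    W[i, i + 1] = -1.0
    W[i + 1, i] = -1.0
    # Diagonal = negative sum of the off-diagonal entries of each row.
    W[np.arange(n), np.arange(n)] = -W.sum(axis=1)
    return W

n = 200
W = path_laplacian(n)
lam, phi = np.linalg.eigh(W)          # eigenvalues in ascending order

t = (np.arange(n) + 0.5) / n
f = np.exp(-t) * np.sin(3.0 * t)      # arbitrary smooth test function (assumption)
grad_energy = f @ W @ f               # discrete analogue of ||grad f||^2

results = []
for m in (5, 10, 20, 40):
    coeffs = phi[:, :m].T @ f         # <f, phi_i> for the first m eigenvectors
    r = f - phi[:, :m] @ coeffs       # residual r_m
    bound = grad_energy / lam[m]      # lam[m] is lambda_{m+1} (0-based indexing)
    results.append((m, float(r @ r), float(bound)))
    print(f"m={m:3d}  ||r||^2={r @ r:.3e}  bound={bound:.3e}")
```

The printed residuals are expected to stay below the corresponding bounds and to shrink as more eigenvectors are used.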
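A d-dimensional manifold $M$, which is sampled at n points $\{V_i\}$, and a subset $\mathcal{J}$ of {1, 2, ..., n} such that $|\mathcal{J}| = m_s \le n$, are now considered.
Given a smooth function f defined on $V_{\mathcal{J}} = \{V_j,\ j\in\mathcal{J}\}$, it is desired to interpolate f, and to construct a continuous function $\tilde f$ such that $\tilde f(V_j) = f(V_j)$ for each j in $\mathcal{J}$. One type of interpolation is linear interpolation. It is desired that the function $\tilde f$ be as smooth as possible in the L2 sense. The smoothness of a function is measured by:
$$E_{\mathrm{smooth}}(f) = \int_M \|\nabla f\|_2^2\,da.$$
The problem of smooth interpolation can be rewritten as:
$$\min_{h: M\to\mathbb{R}} E_{\mathrm{smooth}}(h) \quad \text{s.t.} \quad h(V_j) = f(V_j)\ \ \forall j\in\mathcal{J}.$$
Since the norm $\|\nabla h\|_2$ satisfies:
$$\int_M \|\nabla h\|_2^2\,da = \int_M \langle\Delta h, h\rangle\,da,$$
the interpolation problem can then be written as:
$$\min_{h: M\to\mathbb{R}} \int_M \langle\Delta h, h\rangle\,da \quad \text{s.t.} \quad h(V_j) = f(V_j)\ \ \forall j\in\mathcal{J}.$$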
Any discretization matrix of the LBO can be used for the eigenspace construction. In the present example the LBO general form $L = A^{-1}W$ is used, where $A^{-1}$ is a diagonal matrix whose entries are inversely proportional to the metric infinitesimal volume elements. One example is the cotangent weights approximation for triangulated surfaces. Another example is the generalized discrete LBO suggested in Ref. [17]. In the first case, each entry of W is zero if the two vertices indicated by the two matrix indices do not share a triangle's edge, and is otherwise the sum of the cotangents of the angles supported by the edge connecting the corresponding vertices. In the latter case, W is the graph Laplacian, where the graph is constructed by connecting nearest neighbors. The diagonal of the Laplacian matrix is the negative sum of the off-diagonal elements of the corresponding row. In the triangulated surface case, A is a diagonal matrix in which the element $A_{ii}$ is proportional to the area of the triangles about the vertex $V_i$. A similar normalization factor also applies in the general case.
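The nearest-neighbor construction described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the discretization used by the inventors: the Gaussian edge weights, the kernel-width heuristic, and the choice of the vertex degree as the diagonal of A are assumptions, and the function names `knn_graph_laplacian` and `laplacian_eigenbasis` are hypothetical. The eigenbasis is computed from the generalized problem W φ = λ A φ through a symmetric normalization, so that ΦᵀAΦ is the identity and ΦᵀWΦ is diagonal.

```python
import numpy as np

def knn_graph_laplacian(points, k=8):
    """Graph Laplacian W and diagonal of A from a k-nearest-neighbour graph."""
    n = points.shape[0]
    sq = np.sum(points**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * points @ points.T, 0.0)
    np.fill_diagonal(d2, np.inf)               # a point is not its own neighbour
    nbrs = np.argsort(d2, axis=1)[:, :k]       # indices of the k nearest neighbours
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    sigma2 = np.mean(d2[rows, cols])           # kernel width (assumed heuristic)
    K = np.zeros((n, n))
    K[rows, cols] = np.exp(-d2[rows, cols] / sigma2)
    K = np.maximum(K, K.T)                     # symmetrize the neighbourhood graph
    a = K.sum(axis=1)                          # assumed normalization: vertex degree
    W = np.diag(a) - K                         # diagonal = negative sum of off-diagonals
    return W, a

def laplacian_eigenbasis(W, a, m_e):
    """First m_e solutions of W phi = lambda A phi, scaled so that Phi^T A Phi = I."""
    s = 1.0 / np.sqrt(a)
    lam, U = np.linalg.eigh(s[:, None] * W * s[None, :])
    return lam[:m_e], s[:, None] * U[:, :m_e]

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=300)
points = np.column_stack([np.cos(theta), np.sin(theta)])
points += 0.01 * rng.normal(size=points.shape)   # noisy circle as toy data

W, a = knn_graph_laplacian(points)
lam, Phi = laplacian_eigenbasis(W, a, m_e=10)
```

For large datasets a sparse matrix format and a sparse eigensolver would normally replace the dense arrays used here; dense arrays are kept only to make the sketch self-contained.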
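A discrete version of the smooth interpolation EQ. A.2 can be rewritten as:
$$\min_{x}\ x^T W x \quad \text{s.t.} \quad Bx = f, \qquad \text{EQ. A.3}$$
where B is the projection matrix on the space spanned by the vectors $e_j$, $j\in\mathcal{J}$, where $e_j$ is the jth canonical basis vector, and f now represents the sampled vector $f(V_{\mathcal{J}})$.
The spectral projection of f to the set of eigenfunctions of the LBO $\{\varphi_i\}$, i=1, ..., $m_e$, is denoted $\hat f$. $\hat f$ can be written as:
$$\hat f = \sum_{i=1}^{m_e}\langle f,\varphi_i\rangle\,\varphi_i,$$
where $\Delta\varphi_i = \lambda_i\varphi_i$. The matrix whose ith column is $\varphi_i$ is denoted Φ. Thus, $\hat f$ can be written as $\hat f = \Phi\alpha$, where α is a vector such that $\alpha_i = \langle f,\varphi_i\rangle$. EQ. A.3 can now be approximated by:
$$\min_{\alpha\in\mathbb{R}^{m_e}}\ \alpha^T\Phi^T W\Phi\,\alpha \quad \text{s.t.} \quad B\Phi\alpha = f.$$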
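It is recognized that $\Phi^T W\Phi = \Lambda$, where Λ is the diagonal matrix whose elements are the eigenvalues of L. Alternatively, the constraint can be incorporated as a penalty in a target function, and the problem can be rewritten as:
$$\min_{\alpha\in\mathbb{R}^{m_e}}\ \alpha^T\Lambda\alpha + \mu\,\|B\Phi\alpha - f\|^2 \;=\; \min_{\alpha\in\mathbb{R}^{m_e}}\ \alpha^T\left(\Lambda + \mu\,\Phi^T B^T B\Phi\right)\alpha - 2\mu\,f^T B\Phi\alpha + \mu\,\|f\|^2.$$
Setting the gradient with respect to α to zero, the solution to this problem is given by:
$$\alpha = \mu\left(\Lambda + \mu\,\Phi^T B^T B\Phi\right)^{-1}\Phi^T B^T f = Mf. \qquad \text{EQ. A.5}$$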
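The penalized interpolation can be illustrated on a one-dimensional toy problem. The sketch below is an assumption-laden illustration, not the inventors' code: it takes the penalty objective to be αᵀΛα + μ‖BΦα − f‖², for which the stationarity condition gives α = μ(Λ + μΦᵀBᵀBΦ)⁻¹ΦᵀBᵀf, and it uses a path graph with unit weights (A equal to the identity) in place of a real discretized LBO. The helper name `interpolation_matrix`, the test function, the sampling pattern, and the value of μ are illustrative choices.

```python
import numpy as np

def interpolation_matrix(Phi, lam, sample_idx, mu):
    """M with alpha = M f_J minimizing alpha^T Lam alpha + mu * ||Phi_J alpha - f_J||^2."""
    Phi_J = Phi[sample_idx, :]                   # B @ Phi: rows of Phi at the samples
    lhs = np.diag(lam) + mu * Phi_J.T @ Phi_J
    return mu * np.linalg.solve(lhs, Phi_J.T)    # shape (m_e, m_s)

# Toy 1-D "manifold": a path graph with unit weights, so A = I and L = W.
n, m_e, mu = 300, 10, 1e4
W = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
W[0, 0] = W[-1, -1] = 1.0
lam, Phi = np.linalg.eigh(W)
lam, Phi = lam[:m_e], Phi[:, :m_e]               # first m_e eigenpairs

t = (np.arange(n) + 0.5) / n
f = np.cos(2.0 * np.pi * t) + 0.3 * np.cos(3.0 * np.pi * t)   # smooth test function
sample_idx = np.arange(7, n, 15)                 # m_s = 20 sampled points

M = interpolation_matrix(Phi, lam, sample_idx, mu)
alpha = M @ f[sample_idx]                        # spectral coefficients from the samples only
f_tilde = Phi @ alpha                            # interpolated function at all n points
max_err = float(np.max(np.abs(f_tilde - f)))
print("max interpolation error:", max_err)
```

Only the 20 sampled values of f enter the computation; the remaining 280 values are used solely to measure the interpolation error.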
It was found by the present inventors that the above interpolation expressions can be used to formulate a pairwise geodesics matrix in a compact and accurate manner. The smooth interpolation, EQ. A.2, for a pairwise distances function can be defined as follows.
Let $I = \mathcal{J}\times\mathcal{J}$ be the set of pairs of indices of data points, and $F(V_i,V_j)$ a value defined for each pair of points (or vertices in the surface example) $(V_i,V_j)$, where (i,j)∈I. The present inventors successfully interpolated the smooth function $D: M\times M\to\mathbb{R}$, whose values are given at $(V_i,V_j)$, (i,j)∈I, by $D(V_i,V_j) = F(V_i,V_j)$ for each (i,j)∈I. For that goal, a smooth-energy measure for such functions is defined, as follows:
$$E_{\mathrm{smooth}}(D) = \iint_M \|\nabla_x D(x,y)\|^2 + \|\nabla_y D(x,y)\|^2\,da(x)\,da(y).$$
The smooth interpolation problem can be written as:
$$\min_{D: M\times M\to\mathbb{R}} E_{\mathrm{smooth}}(D) \quad \text{s.t.} \quad D(V_i,V_j) = F(V_i,V_j)\ \ \forall (i,j)\in I.$$
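Any matrix D defined such that $D_{ij} = D(V_i,V_j)$ satisfies the following relation:
$$\int_M \|\nabla_x D(x,y_j)\|^2\,da(x) \approx D_j^T W D_j = \mathrm{trace}\left(W D_j D_j^T\right),$$
where $D_j$ represents the jth column of D.
Then,
$$\iint_M \|\nabla_x D(x,y)\|^2\,da(x)\,da(y) \approx \sum_j A_{jj}\,\mathrm{trace}\left(W D_j D_j^T\right) = \mathrm{trace}\left(W D A D^T\right) = \mathrm{trace}\left(D^T W D A\right).$$
A similar result applies to:
$$\iint_M \|\nabla_y D(x,y)\|^2\,da(x)\,da(y),$$
so that:
$$\iint_M \|\nabla_x D(x,y)\|^2\,da(x)\,da(y) \approx \mathrm{trace}\left(D^T W D A\right) \quad \text{and} \quad \iint_M \|\nabla_y D(x,y)\|^2\,da(x)\,da(y) \approx \mathrm{trace}\left(D W D^T A\right). \qquad \text{EQ. A.7}$$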
The smooth energy can be discretized for a matrix D by:
$$E_{\mathrm{smooth}}(D) = \mathrm{trace}\left(D^T W D A\right) + \mathrm{trace}\left(D W D^T A\right). \qquad \text{EQ. A.8}$$
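The spectral projection of D onto Φ is given by:
$$\tilde D(x,y) = \sum_{i=1}^{m_e}\sum_{j=1}^{m_e}\alpha_{ij}\,\varphi_i(x)\,\varphi_j(y),$$
where
$$\alpha_{ij} = \iint_{M\times M} D(x,y)\,\varphi_i(x)\,\varphi_j(y)\,da(x)\,da(y).$$
In matrix notation:
$$\tilde D = \Phi\alpha\Phi^T, \qquad \text{EQ. A.9}$$
where $\tilde D$ is a matrix whose elements are given by $\tilde D_{ij} = \tilde D(x_i,y_j)$.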
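Combining EQs. A.8 and A.9, and using $\Phi^T W\Phi = \Lambda$ and $\Phi^T A\Phi = I$, the discrete smooth energy of the spectral projection of a function can be written as:
$$E_{\mathrm{smooth}}(\tilde D) = \mathrm{trace}\left(\Phi\alpha^T\Phi^T W\Phi\alpha\Phi^T A\right) + \mathrm{trace}\left(\Phi\alpha\Phi^T W\Phi\alpha^T\Phi^T A\right) = \mathrm{trace}\left(\alpha^T\Lambda\alpha\right) + \mathrm{trace}\left(\alpha\Lambda\alpha^T\right). \qquad \text{EQ. A.10}$$
Using the spectral smooth representation introduced in EQ. A.9, the smooth spectral interpolation problem for a function from $M\times M$ to $\mathbb{R}$ can be written as:
$$\min_{\alpha\in\mathbb{R}^{m_e\times m_e}}\ \mathrm{trace}\left(\alpha^T\Lambda\alpha\right) + \mathrm{trace}\left(\alpha\Lambda\alpha^T\right) \quad \text{s.t.} \quad \left(\Phi\alpha\Phi^T\right)_{ij} = D(V_i,V_j)\ \ \forall (i,j)\in I. \qquad \text{EQ. A.11}$$
Expressing the constraint as a penalty function, the following optimization problem can be defined:
$$\min_{\alpha\in\mathbb{R}^{m_e\times m_e}}\ \mathrm{trace}\left(\alpha^T\Lambda\alpha\right) + \mathrm{trace}\left(\alpha\Lambda\alpha^T\right) + \mu\,\left\|B\Phi\alpha\Phi^T B^T - F\right\|_F^2, \qquad \text{EQ. A.12}$$
where $\|\cdot\|_F$ represents the Frobenius norm, F is the matrix of the given values $F_{ij} = F(V_i,V_j)$, (i,j)∈I, and $m_e$ is the number of eigenfunctions. EQ. A.12 describes a minimization problem of a quadratic function of α. Representing α as an ($m_e^2$×1) vector, the problem can be rewritten as a quadratic programming problem. Similarly to EQ. A.5, a matrix M that satisfies the relation α=MD can be found, where D is the row stack vector of the matrix $D(V_i,V_j)$.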
Another, more efficient but less accurate way to obtain an approximation of the matrix α is to compute
$$\alpha = M F M^T, \qquad \text{EQ. A.13}$$
where F is the matrix defined by $F_{i,j} = F(V_i,V_j)$ and M is the matrix introduced in EQ. A.5. Notice that spectral projection is a natural choice for the spectral interpolation of distances because the eigenfunctions encode the manifold geometry, as do distance functions. Moreover, the eikonal equation, which models geodesic distance functions on the manifold, is defined by $\|\nabla_G D\| = 1$. EQ. A.1, in this case, provides a clear asymptotic convergence rate by spectral projection of the function D, because $\|\nabla_G D\|$ is a constant equal to 1.
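The separable approximation α = M F Mᵀ can be sketched on the same kind of toy problem. The example below is an illustration under assumptions, not the inventors' implementation: the manifold is a unit interval discretized by a path graph (A equal to the identity), F holds squared geodesic distances between 20 sampled points, and M is formed with the closed form μ(Λ + μΦᵀBᵀBΦ)⁻¹ΦᵀBᵀ assumed for the quadratic penalty. The full distance matrix is computed only to measure the error of the interpolated matrix ΦαΦᵀ.

```python
import numpy as np

# Toy 1-D manifold: path graph with unit weights (A = I), n points on [0, 1].
n, m_e, mu = 300, 10, 1e4
W = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
W[0, 0] = W[-1, -1] = 1.0
lam, Phi = np.linalg.eigh(W)
lam, Phi = lam[:m_e], Phi[:, :m_e]

t = (np.arange(n) + 0.5) / n
sample_idx = np.arange(7, n, 15)                 # m_s = 20 sampled points
Phi_J = Phi[sample_idx, :]                       # B @ Phi

# Interpolation matrix M (assumed closed form of the penalized problem).
M = mu * np.linalg.solve(np.diag(lam) + mu * Phi_J.T @ Phi_J, Phi_J.T)

# F: squared geodesic distances between the sampled points only (m_s x m_s).
ts = t[sample_idx]
F = (ts[:, None] - ts[None, :]) ** 2

alpha = M @ F @ M.T                              # (m_e x m_e) spectral coefficients
D_tilde = Phi @ alpha @ Phi.T                    # interpolated (n x n) matrix, formed here only for checking

D_true = (t[:, None] - t[None, :]) ** 2
mean_abs_err = float(np.mean(np.abs(D_tilde - D_true)))
print("mean absolute error:", mean_abs_err)
```

Here 400 sampled distances and a 10×10 coefficient matrix stand in for the 90,000 entries of the full matrix, which is the storage saving the spectral representation is meant to provide.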
Following is a description of an exemplified suitable procedure for reducing the dimension of a dataset, which procedure can be employed according to some embodiments of the present invention. For simplicity only classical scaling is considered, but other types of scaling can be similarly applied.
Given a metric space $M$, such as a manifold, having a metric $D: M\times M\to\mathbb{R}$, and $V = \{V_1, \dots, V_n\}$ a finite set of elements of $M$, the multidimensional scaling of $V$ in $\mathbb{R}^k$ involves finding a set of points $X = \{X_1, \dots, X_n\}$ in $\mathbb{R}^k$ whose pairwise Euclidean distances $d(X_i,X_j) = \|X_i - X_j\|_2$ are as close as possible to $D(V_i,V_j)$ for all (i,j). For the MDS member known as classical scaling, such embedding can be realized by the following minimization procedure:
$$\min_X \left\|XX^T + \tfrac{1}{2}\,JDJ\right\|_F,$$
where $D_{ij} = D(V_i,V_j)^2$ and $J_{ij} = \delta_{ij} - 1/n$. Classical scaling (see EQ. A.12) finds the first k singular vectors and corresponding singular values of the matrix $-\tfrac{1}{2}\,JDJ$.
The classical scaling solver requires the computation of an (n×n) matrix of distances, which is a challenging task when dealing with more than, say, several thousands of data points. The quadratic size of the input data imposes severe time and space limitations.
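For reference, conventional classical scaling on a full matrix of squared distances can be sketched as follows. This is a generic textbook sketch, not code from the present embodiments; the function names are illustrative. Because the double-centered matrix is symmetric, its leading singular vectors are obtained here through a symmetric eigendecomposition. The full (n×n) input is exactly what becomes prohibitive for large n.

```python
import numpy as np

def classical_scaling(D_sq, k):
    """Embed n points in R^k from an (n x n) matrix of squared distances."""
    n = D_sq.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix, J_ij = delta_ij - 1/n
    B = -0.5 * J @ D_sq @ J
    vals, vecs = np.linalg.eigh(B)               # ascending eigenvalues
    order = np.argsort(vals)[::-1][:k]           # k largest eigenvalues
    vals_k = np.clip(vals[order], 0.0, None)     # guard against tiny negative values
    return vecs[:, order] * np.sqrt(vals_k)

def pairwise_sq_dists(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.sum(diff**2, axis=-1)

rng = np.random.default_rng(1)
V = rng.normal(size=(60, 2))                     # points that are already flat
D_sq = pairwise_sq_dists(V)
X = classical_scaling(D_sq, k=2)                 # recovers V up to a rigid motion
```

For data that is exactly Euclidean, as in this usage example, the embedding reproduces all pairwise distances.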
The present inventors successfully employed spectral interpolation to overcome these complexity limitations.
A subset of $m_s$ points of the data, with indices $\mathcal{J}$, is selected. For efficient covering of the data manifold, this set can be selected using the farthest-point sampling strategy, which is known to be 2-optimal in the sense of covering. The geodesic distances between every two points $(V_i,V_j)$, $(i,j)\in I=\mathcal{J}\times\mathcal{J}$, are then computed.
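The farthest-point sampling strategy mentioned above can be sketched as a greedy loop. The sketch below is illustrative only: it uses Euclidean distances for simplicity, whereas geodesic distances on the data manifold would be used in the setting described here, and the function name and the `start` parameter are assumptions.

```python
import numpy as np

def farthest_point_sampling(points, m_s, start=0):
    """Greedily pick m_s indices, each maximizing the distance to those already chosen."""
    chosen = [start]
    # Distance from every point to its nearest chosen point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(m_s - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        dist_to_new = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return np.array(chosen)

line = np.arange(11.0)[:, None]                  # points 0, 1, ..., 10 on a line
idx = farthest_point_sampling(line, 3)
print(idx)                                       # -> [ 0 10  5]
```

Starting from index 0, the procedure picks the far end of the line and then its midpoint, which is the covering behavior the strategy is chosen for.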
A Laplacian operator is then constructed from the local relations between data points. In the present example, the LBO is selected. Use of techniques other than Laplacian operators, such as the Finite Elements Method is also contemplated.
The LBO takes into consideration the differential geometry of the data manifold. The Laplacian's first $m_e$ eigenvectors are computed and are arranged in an eigenbasis matrix denoted by Φ. The spectral interpolation matrix α is then extracted from the computed geodesic distances and the eigenbasis Φ, using EQs. A.12 or A.13.
The interpolated distance matrix $\tilde D$ between every two points of $M$ can be computed by $\tilde D = \Phi\alpha\Phi^T$. It is noted that there is no need to compute this matrix explicitly. The choice of representation is advantageous for the reason that follows.
Denote by $D_y: M\to\mathbb{R}^+$ the geodesic distance function from a source point y to the rest of the points in the manifold. Geodesic distance functions are characterized by the eikonal equation $\|\nabla_G D_y\| = 1$. This property can be used in the bound provided by EQ. A.1. Specifically,
$$\|\nabla_G D_y^2\|^2 = \|2 D_y \nabla_G D_y\|^2 \le 4\,\|D_y\|^2.$$
Plugging this relation into EQ. A.1, the error of the squared geodesic distance function projected to the spectral basis is given by:
$$\frac{\|r_n\|_G^2}{\|D_y\|_G^2} \le \frac{4\,C_2}{n^{2/d}},$$
where $\|D_y\|_G$ is bounded in terms of the diameter of the manifold. Thus, a bound on the relative projection error has been obtained, which bound depends only on Weyl's constant and the number of eigenvectors used in the reconstruction by projection.
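The spectral MDS solution is given by
$$XX^T = -\tfrac{1}{2}\,J\tilde D J = -\tfrac{1}{2}\,J\Phi\alpha\Phi^T J.$$
This decomposition can be approximated using more than one technique.
In a first technique, the projection of X to the eigenbasis Φ is considered. This is equivalent to representing X by $\tilde X = \Phi\beta$, which provides the equation
$$\Phi\beta\beta^T\Phi^T = -\tfrac{1}{2}\,J\Phi\alpha\Phi^T J.$$
Since $\Phi^T A\Phi = I_{m_e}$, multiplying by $\Phi^T A$ on the left and by $A\Phi$ on the right gives
$$\beta\beta^T = -\tfrac{1}{2}\left(\Phi^T A J\Phi\right)\alpha\left(\Phi^T J A\Phi\right),$$
so that a singular value decomposition applied to the ($m_e$×$m_e$) matrix
$$-\tfrac{1}{2}\left(\Phi^T A J\Phi\right)\alpha\left(\Phi^T J A\Phi\right)$$
can provide the columns of β.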
In a second technique, the following decompositions are applied. An SVD procedure is applied to the matrix JΦ so as to express this matrix as $J\Phi = SUV^T$. An eigen-decomposition procedure is applied to the matrix $UV^T\alpha VU$, so as to express this matrix as $UV^T\alpha VU = P\Gamma P^T$. Once the matrices S, P and Γ are calculated from the decompositions, a matrix Q defined as $Q = SP\Gamma^{1/2}$ is calculated. $QQ^T$ satisfies the relation $QQ^T = J\tilde D J^T$, so that the MDS can be completed by calculating the first k columns of Q. The advantage of the second technique is that it at least partially overcomes potential distortions caused by ignoring the high-frequency components.
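The second technique can be sketched in a few lines. The sketch below is illustrative and makes two labeled assumptions that are not stated in the passage above: the factor −1/2 of classical scaling is folded into the matrix that is eigen-decomposed, and small negative eigenvalues are clipped to zero before taking the square root. The function name is hypothetical. The usage example builds an orthonormal basis (A equal to the identity) whose span contains the squared Euclidean distances exactly, so the embedding can be checked against the original points.

```python
import numpy as np

def spectral_mds_second_technique(Phi, alpha, k):
    """Embed in R^k from the eigenbasis Phi (n x m_e) and coefficients alpha (m_e x m_e)."""
    # J @ Phi without forming the (n x n) centering matrix J.
    J_Phi = Phi - Phi.mean(axis=0, keepdims=True)
    S, u, Vt = np.linalg.svd(J_Phi, full_matrices=False)   # J Phi = S diag(u) V^T
    UVt = u[:, None] * Vt                                  # diag(u) V^T, size (m_e x m_e)
    C = UVt @ (-0.5 * alpha) @ UVt.T                       # assumed -1/2 factor
    C = 0.5 * (C + C.T)                                    # enforce symmetry numerically
    gamma, P = np.linalg.eigh(C)
    order = np.argsort(gamma)[::-1][:k]                    # k largest eigenvalues
    gamma_k = np.clip(gamma[order], 0.0, None)             # assumed clipping
    return (S @ P[:, order]) * np.sqrt(gamma_k)            # first k columns of Q

# Usage: flat data whose squared distances lie exactly in the span of Phi.
rng = np.random.default_rng(2)
n = 80
V = rng.normal(size=(n, 2))
r = np.sum(V**2, axis=1)
D_sq = r[:, None] + r[None, :] - 2.0 * V @ V.T             # squared Euclidean distances
basis = np.column_stack([np.ones(n), V, r, rng.normal(size=(n, 4))])
Phi, _ = np.linalg.qr(basis)                               # orthonormal columns (A = I)
alpha = Phi.T @ D_sq @ Phi                                 # exact spectral coefficients here

X = spectral_mds_second_technique(Phi, alpha, k=2)
rx = np.sum(X**2, axis=1)
D_sq_embedded = rx[:, None] + rx[None, :] - 2.0 * X @ X.T
```

All decompositions act on matrices with $m_e$ columns, so the (n×n) distance matrix is never needed by the function itself; it appears in the usage example only to build α and to verify the result.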
Algorithm 1, below, is an exemplified algorithm, for reducing the dimensionality of the dataset.
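Given a kernel function K(λ), the diffusion distance between two points $x, y\in M$ is defined as:
$$D^2(x,y) = \sum_k \left(\varphi_k(x) - \varphi_k(y)\right)^2 K(\lambda_k), \qquad \text{EQ. A.15}$$
where $\varphi_k$ represents the kth eigenfunction of Δ, and $\lambda_k$ its associated eigenvalue.
Note that $D_{ij} = D^2(x_i,y_j)$. In a matrix form, EQ. A.15 reads:
$$D_{ij} = \sum_{k=1}^{m_e}\left(\Phi_{ik} - \Phi_{jk}\right)^2 K_{kk},$$
where Φ represents the eigen-decomposition matrix of the LBO and K is a diagonal matrix such that $K_{kk} = K(\lambda_k)$.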
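Denoting by Ψ a matrix such that $\Psi_{ij} = \Phi_{ij}^2$, it follows that:
$$D = \Psi K\mathbf{1}_{m_e}\mathbf{1}_n^T + \left(\Psi K\mathbf{1}_{m_e}\mathbf{1}_n^T\right)^T - 2\,\Phi K\Phi^T,$$
where $\mathbf{1}_x$ denotes a column vector with x elements, all of which are 1.
The dimensionality reduction of the present embodiments is now applied to D, which by itself defines a flat domain. Since $\mathbf{1}_n^T J = 0$, it follows that $\Psi K\mathbf{1}_{m_e}\mathbf{1}_n^T J = 0$, and therefore
$$JDJ = -2\,J\Phi K\Phi^T J = J\Phi(-2K)\Phi^T J.$$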
Thus, when the dimensionality reduction technique of the present embodiments is applied to diffusion distances, it is sufficient to set α=−2K, and the inventive dimensionality reduction can be directly applied without explicit computation of these distances. Moreover, using data-trained optimal diffusion kernels, such as those introduced in Ref. [34], a discriminatively enhanced flat domain can be obtained, which facilitates a robust and efficient classification between different classes as part of the construction of the flat target space.
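The identity JDJ = JΦ(−2K)ΦᵀJ can be verified numerically. The following sketch uses random stand-ins for the eigenbasis and eigenvalues, since the identity is purely algebraic, and assumes a heat kernel K(λ) = exp(−tλ) with t = 0.5 as an illustrative choice of kernel; none of these values come from the present disclosure.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m_e = 50, 6
Phi = rng.normal(size=(n, m_e))                  # stand-in for an LBO eigenbasis
lam = np.sort(rng.uniform(0.0, 5.0, size=m_e))   # stand-in eigenvalues
K = np.diag(np.exp(-0.5 * lam))                  # assumed heat kernel, t = 0.5

# Explicit diffusion distances: D_ij = sum_k (Phi_ik - Phi_jk)^2 K_kk.
diff = Phi[:, None, :] - Phi[None, :, :]
D = np.sum(diff**2 * np.diag(K), axis=-1)

J = np.eye(n) - np.ones((n, n)) / n              # centering matrix
lhs = J @ D @ J                                  # double-centered explicit distances
rhs = J @ Phi @ (-2.0 * K) @ Phi.T @ J           # spectral form with alpha = -2K
print(np.max(np.abs(lhs - rhs)))
```

The explicit (n×n) matrix D is built here only for the comparison; with α=−2K the spectral form can be used without it.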
The embedding of the intrinsic geometry of a shape into a Euclidean space is known as a canonical form. When the input to an MDS procedure is the set of geodesic distances between every two surface points, the output is such a form. These structures are invariant to isometric deformations of the shape.
In another experiment, two shapes containing 5,000 vertices were used. For each shape, all pairwise geodesic distances were computed. Then, a subset of 50 vertices was selected using the farthest-point sampling strategy, and 50 eigenvectors of the corresponding LBO were extracted. The surfaces were then flattened into their canonical forms using the dimensionality reduction procedure of the present embodiments. A qualitative evaluation is presented in
The relation of within classes and between classes for lionesses and dogs from the TOSCA database [43] is shown in
In an additional experiment, the spectral interpolation of the matrix {tilde over (D)} was computed for sampled points from Armadillo's surface. The same number of eigenvectors as that of sample points were used. The average geodesic error was plotted as a function of the points used. The mean relative error is computed by mean(|{tilde over (D)}−D|/∥D+ε∥).
In an additional experiment, the influence of the number of sampled points on the accuracy of the reconstruction was measured. Several triangulations of a sphere were generated, where each triangulation was at a different resolution reflected by the number of triangles. Next, for each such triangulation, the spectral interpolation of the present embodiments was used for calculating the geodesic distances using just √n points, where n represents the number of vertices of the sphere at a given resolution.
In an additional experiment, an $\mathbb{R}^4$ manifold embedded in $\mathbb{R}^{10,000}$ was extracted. A low-dimensional hyperplane structure from a given set of 100,000 points in $\mathbb{R}^{10,000}$ was obtained. The dimensionality reduction technique of the present embodiments handles 10⁵ points with very little effort, unlike conventional MDS techniques which are known to have difficulties already with 10⁴ points.
In an additional experiment, the dimensionality reduction technique of the present embodiments was applied to a subset of images taken from the MNIST database. Conventional MDS and the dimensionality reduction technique of the present embodiments were computed for 1000, 2000, . . . , 10000 samples of randomly chosen images, referred to as points. 200 eigenvectors of the LBO were evaluated on an i7-Intel® computer with 16 GB memory. Table 1, below, shows the computation times for the conventional and inventive techniques.
Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims.
All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.
This application claims the benefit of priority under 35 USC §119(e) of U.S. Provisional Patent Application No. 61/875,396 filed on Sep. 9, 2013, the contents of which are incorporated herein by reference in their entirety.