The present invention relates to a data processing method, a data processing apparatus, and a data processing program.
Matrix decomposition is a basic technology in data analysis and machine learning. In many matrix decompositions, such as singular value decomposition, a matrix is decomposed into a plurality of matrices to perform dimensionality reduction or low-rank approximation.
Among them, CUR matrix decomposition (see, for example, NPL 1) is drawing attention because of its high interpretability of the decomposed matrix. This is because in the CUR matrix decomposition, rows and columns of the decomposed matrix are subsets of rows and columns of an original matrix before decomposition. That is, in the CUR matrix decomposition, the decomposed matrix is a submatrix of the original matrix, and original data is preserved also after the decomposition, so that it is easy to interpret the matrix even for human eyes. This property of the CUR matrix decomposition is a property not found in other matrix decompositions such as singular value decomposition. The CUR matrix decomposition is often used to regard rows and columns of a decomposed matrix as important rows and columns and to extract important rows and columns from matrix data.
As described in NPL 1, in the CUR matrix decomposition, a solving approach using a randomized algorithm is common. However, in the method described in NPL 1, a result changes every time due to randomization, and an error tends to be large when a decomposed matrix is small. Thus, a deterministic algorithm has been proposed in order to deal with this tendency (see, for example, NPL 2).
In the method described in NPL 2, a problem of the CUR matrix decomposition is formulated as a convex optimization problem with sparse regularization, and a solution thereof is obtained by repeatedly updating parameters of an objective function thereof using an algorithm called coordinate descent.
Specifically, in the method described in NPL 2, a parameter vector corresponding to the rows and columns of the matrix is introduced into the objective function, and in the coordinate descent, this parameter vector is updated in order until the parameter vector converges for each row and column so that the objective function becomes smaller. In this case, the parameter vector tends to be a zero vector due to an effect of sparse regularization. Because rows and columns corresponding to the parameter vectors that have become zero vectors can be regarded as unimportant rows and columns in the objective function, important rows and columns can be extracted from the original matrix.
In other words, the coordinate descent updates the parameter vectors in order for respective rows and columns and repeats the update until all the parameter vectors converge. Ultimately, rows and columns that become zero vectors are unimportant rows and columns, and rows and columns in which parameter vectors become non-zero vectors can be said to be important rows and columns.
However, the coordinate descent of the CUR matrix decomposition is slow for large-scale data. One reason is that, when the number of rows of a matrix is n and the number of columns is p, a single updating calculation of a parameter vector requires a time complexity of $O(p^2)$ or $O(np)$. Another reason is that this updating calculation must be repeated until all parameter vectors converge. Thus, it is difficult to apply the CUR matrix decomposition to large-scale data.
There are not many studies dealing with an increase in speed of the coordinate descent of the CUR matrix decomposition, but it is possible to increase the speed by using safe screening (see, for example, NPL 3). With safe screening, it is possible to specify and delete rows and columns in which the parameter vectors become zero vectors before the coordinate descent is applied.
NPL 1: Michael W. Mahoney, and Petros Drineas, “CUR matrix decompositions for improved data analysis”, Proc. Natl. Acad. Sci. U.S.A., 106(3): 697-702, 2009.
NPL 2: J. Bien, Y. Xu, and M. W. Mahoney, “CUR from a Sparse Optimization Viewpoint”, In NeurIPS, pp. 217-225, 2010.
NPL 3: E. Ndiaye, O. Fercoq, A. Gramfort, and J. Salmon, “Gap Safe Screening Rules for Sparsity Enforcing Penalties”, Journal of Machine Learning Research, 18(1): 4671-4703, 2017.
However, when the number of rows and columns that can be deleted by safe screening is small, there is a problem that a speed of the coordinate descent is not increased. In particular, it is theoretically known in the safe screening that it is difficult to delete rows and columns when an initial value of the parameter vector is far from a solution.
The present invention has been made in view of the above, and an object of the present invention is to provide a data processing apparatus, a data processing method, and a data processing program for increasing a speed of coordinate descent in order to apply CUR matrix decomposition to large-scale data.
In order to solve the above-described problem and achieve the object, a data processing method according to the present invention is a data processing method executed by a data processing apparatus for extracting important rows or columns from matrix data, the data processing method including calculating norms of rows or columns of a gram matrix of given data, calculating, in search of a hyperparameter, based on a norm of a row or column to be processed, a lower bound of a determination value of an optimal condition when a solution of a parameter vector corresponding to the row or column to be processed is a zero vector, determining whether the row or column to be processed is important based on the lower bound, to extract the row or column that is determined to be important, updating a parameter corresponding to the row or column that is determined to be important, calculating an upper bound of the determination value of the optimal condition when the solution of the parameter vector corresponding to the row or column to be processed is the zero vector, and determining whether parameter updating for the row or column to be processed is necessary based on the upper bound, to perform the parameter updating when the parameter updating is determined to be necessary.
Further, a data processing apparatus according to the present invention is a data processing apparatus for extracting important rows or columns from matrix data, the data processing apparatus including a first calculation unit that calculates norms of rows or columns of a gram matrix of given data, a second calculation unit that calculates, in search of a hyperparameter, based on a norm of a row or column to be processed, a lower bound of a determination value of an optimal condition when a solution of a parameter vector corresponding to the row or column to be processed is a zero vector, a first determination unit that determines whether the row or column to be processed is important based on the lower bound, an extraction unit that extracts the row or column that is determined to be important by the first determination unit, a first updating unit that updates a parameter corresponding to the row or column that is determined to be important, a third calculation unit that calculates an upper bound of the determination value of the optimal condition when the solution of the parameter vector corresponding to the row or column to be processed is the zero vector, a second determination unit that determines whether parameter updating for the row or column to be processed is necessary based on the upper bound, and a second updating unit that performs the parameter updating when the parameter updating is determined to be necessary by the second determination unit.
Further, a data processing program according to the present invention causes a computer to execute calculating norms of rows or columns of a gram matrix of given data, calculating, in search of a hyperparameter, based on a norm of a row or column to be processed, a lower bound of a determination value of an optimal condition when a solution of a parameter vector corresponding to the row or column to be processed is a zero vector, determining whether the row or column to be processed is important based on the lower bound, to extract the row or column that is determined to be important, updating a parameter corresponding to the row or column that is determined to be important, calculating an upper bound of the determination value of the optimal condition when the solution of the parameter vector corresponding to the row or column to be processed is the zero vector, and determining whether parameter updating for the row or column to be processed is necessary based on the upper bound, to perform the parameter updating when the parameter updating is determined to be necessary.
According to the present invention, it is possible to increase a speed of coordinate descent in order to apply CUR matrix decomposition to large-scale data.
Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings. The present invention is not limited to this embodiment. Further, in description of the drawings, the same parts are denoted by the same reference signs.
Hereinafter, for A that is a vector, matrix, or scalar, “
First, the present embodiment will be described.
A data processing apparatus 10 according to the present embodiment illustrated in
The gram matrix calculation unit 11 calculates a gram matrix of given data. The norm calculation unit 12 calculates a norm for each row or column of the gram matrix. The parameter search unit 13 searches for hyperparameters. The lower bound calculation unit 14 calculates a lower bound of the determination value (optimal condition value) of the optimal condition when a solution of a parameter vector corresponding to the row or column to be processed is a zero vector, based on the norm for each row or column to be processed. The important matrix determination unit 15 determines whether the row or column is important based on the lower bound of the determination value of the optimal condition. The important matrix extraction unit 16 extracts the row or column determined to be important. The important matrix updating unit 17 mainly updates a parameter corresponding to the extracted important row or column.
The optimal condition value calculation unit 18 calculates the determination value of the optimal condition when the solution of the parameter vector corresponding to the row or column to be processed is a zero vector. The upper bound calculation unit 19 calculates an upper bound of the optimal condition value when the solution of the parameter vector corresponding to the row or column to be processed is the zero vector. The calculation omission determination unit 20 determines whether parameter updating is necessary for the row or column to be processed based on the upper bound of the determination value of the optimal condition. The updating calculation unit 21 performs updating when updating is necessary. The convergence determination unit 22 determines convergence of the parameter vector.
Because the data processing apparatus 10 omits unnecessary calculations in CUR matrix decomposition and preferentially performs important calculations, it can execute the CUR matrix decomposition at a high speed and extract important rows or columns from the matrix data at a high speed.
Mathematical Background
The CUR matrix decomposition and the coordinate descent will be described herein as background knowledge.
The CUR matrix decomposition is a scheme for decomposing data in a matrix format into a plurality of matrices. When n is the number of pieces of data and each piece of data is expressed by a p-dimensional feature quantity, the data in a matrix format can be expressed by a matrix $X \in \mathbb{R}^{n \times p}$.
The CUR matrix decomposition decomposes X into three matrices as shown in Expression (1). Sizes of the respective matrices are $C \in \mathbb{R}^{n \times c}$, $U \in \mathbb{R}^{c \times r}$, and $R \in \mathbb{R}^{r \times p}$.
[Math. 1]
X≈CUR (1)
Here, C consists of c column vectors of X, and R consists of r row vectors of X. C and R are therefore submatrices of X, and their columns and rows can be regarded as the column vectors and row vectors that are most important for approximating X.
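The structure of Expression (1) can be illustrated with a minimal NumPy sketch. The function name `cur_from_indices` and the index lists are hypothetical, and the choice U = pinv(C) X pinv(R) is an assumption (a common least-squares choice), since the text above does not state how U is obtained.

```python
import numpy as np

def cur_from_indices(X, col_idx, row_idx):
    """Build C, U, R from chosen column and row indices of X."""
    C = X[:, col_idx]   # n x c, actual columns of X
    R = X[row_idx, :]   # r x p, actual rows of X
    # Assumption: U is the least-squares middle factor pinv(C) X pinv(R).
    U = np.linalg.pinv(C) @ X @ np.linalg.pinv(R)
    return C, U, R

rng = np.random.default_rng(0)
# A rank-2 matrix, so two independent columns and rows can reproduce it.
X = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))
C, U, R = cur_from_indices(X, col_idx=[0, 3], row_idx=[1, 4])
rel_err = np.linalg.norm(X - C @ U @ R) / np.linalg.norm(X)
```

Because C and R are slices of X itself, each retained column or row can be read directly as an original feature or data point.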
Under this setting, a deterministic algorithm for CUR matrix decomposition solves an optimization problem with sparse regularization to extract C and R. Here, the optimization problem for extracting C is described as in Expression (2) for simplicity.
In Expression (2), $W \in \mathbb{R}^{p \times p}$ is the parameter to be optimized. $\|\cdot\|_F^2$ is the squared Frobenius norm. $\lambda \ge 0$ is a hyperparameter that is tuned manually. $W_{(i)}$ is the i-th row vector of W.
In Expression (2), $\|W_{(i)}\|_2$ is a norm (constraint term) that induces sparsity, and solving the optimization problem with this norm makes it easy for $W_{(i)}$ to become a zero vector. Further, the term shown in Expression (3), which appears in Expression (2), is an error function, and optimization is performed so that the error between X and XW becomes small.
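Expressions (2) and (3) themselves do not appear in this text. A form consistent with the definitions above would be the following, where the factor 1/2 is an assumption chosen so that the threshold in Equation (4) is exactly λ:

```latex
% Assumed form of Expression (2): error function plus row-wise sparse regularization
\min_{W \in \mathbb{R}^{p \times p}} \; \frac{1}{2}\,\| X - XW \|_F^2 \;+\; \lambda \sum_{i=1}^{p} \| W_{(i)} \|_2

% Assumed form of Expression (3): the error function inside Expression (2)
\frac{1}{2}\,\| X - XW \|_F^2
```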
As a result of optimization, many row vectors of W become zero vectors due to the influence of the constraint term. Let $I \subseteq \{1, \ldots, p\}$ be the index set of the rows that are non-zero vectors. Then, the part of XW that contributes to minimizing the error function is substantially $X_I W_I$. Here, $X_I$ is the matrix consisting of the column vectors of X corresponding to the index set I, and $W_I$ is the matrix consisting of the row vectors of W corresponding to the index set I. Setting $C = X_I$ allows C to be extracted. A method of simultaneously extracting C and R will be described below.
Next, coordinate descent will be described. Coordinate descent is an algorithm for solving the optimization problem of Expression (2). Specifically, $W_{(i)}$ is repeatedly updated row by row until the whole of W converges, which yields a solution of the optimization problem of Expression (2). When $\|X^{(i)}\|_2 = 1$, the updating equation for $W_{(i)}$ is given by Equation (4) below.
[Math. 4]
$$W_{(i)} = \left(1 - \lambda / \|z_i\|_2\right)_+ z_i \tag{4}$$
In Equation (4), (1−λ/∥zi∥2)+ is calculated as in Equation (5).
In Equation (4), $z_i \in \mathbb{R}^{1 \times p}$ is calculated as in Equation (6).
[Math. 6]
$$z_i = X^{(i)T}\Bigl(X - \sum_{j \ne i}^{p} X^{(j)} W_{(j)}\Bigr) \tag{6}$$
$X^{(i)}$ denotes the i-th column vector of X.
Here, a straightforward calculation of Equation (4) requires a large time complexity of $O(p^2 n)$. The calculation can be arranged so that it takes $O(p^2)$ or $O(pn)$, but the calculation cost is still large for a large data matrix X.
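The update of Equations (4) and (6) can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the function names are hypothetical, $(a)_+$ is taken to mean max(a, 0) (Equation (5) is not reproduced in this text), the columns of X are assumed to have unit 2-norm, and $z_i$ is evaluated through the Gram matrix $G = X^T X$, which gives the $O(p^2)$ cost per update mentioned above.

```python
import numpy as np

def soft_threshold_row(z, lam):
    """Equation (4): (1 - lam/||z||_2)_+ z, with (a)_+ assumed to be max(a, 0)."""
    norm = np.linalg.norm(z)
    if norm <= lam:
        return np.zeros_like(z)
    return (1.0 - lam / norm) * z

def coordinate_descent(X, lam, max_iter=200, tol=1e-8):
    """Plain row-wise coordinate descent; assumes unit-norm columns of X."""
    p = X.shape[1]
    G = X.T @ X                      # Gram matrix, p x p
    W = np.zeros((p, p))
    for _ in range(max_iter):
        max_change = 0.0
        for i in range(p):
            # Equation (6) in Gram form: z_i = G_(i) - sum_{j != i} G_ij W_(j)
            z = G[i] - G[i] @ W + G[i, i] * W[i]
            new_row = soft_threshold_row(z, lam)
            max_change = max(max_change, np.abs(new_row - W[i]).max())
            W[i] = new_row
        if max_change < tol:         # stop once no row moves any more
            break
    return W
```

After convergence, the important columns could be read off as `I = np.flatnonzero(np.linalg.norm(W, axis=1) > 0)` and `C = X[:, I]`.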
Mathematical Background in Embodiment
Next, a mathematical background in the embodiment will be described. The present embodiment is for increasing a speed of the coordinate descent of the CUR matrix decomposition, and includes the following two ideas.
The first idea is to specify the rows in which W(i) is a zero vector with low computational complexity (O(p)), and omit updating calculation of Equation (4), which is a bottleneck of the coordinate descent, for such rows.
The second idea is to specify rows in which W(i) is always a non-zero vector and to update such rows preferentially. In the present embodiment, an increase in speed is achieved by the first idea and the second idea.
Specifically, the condition (optimal condition) under which $W_{(i)} = 0$ is the optimal value is evaluated approximately so that the first idea and the second idea are achieved. This optimal condition is shown in Expression (7) below, using an optimal condition value $K_i = \|z_i\|_2$.
[Math. 7]
$$K_i \le \lambda \tag{7}$$
It can be said that W(i)=0 when the condition of Expression (7) is satisfied.
When a condition shown in Expression (7) is evaluated, it can be confirmed whether the row is a zero vector. That is, when the condition shown in Expression (7) is satisfied, the row can be said to be a zero vector without executing Equation (4), so that Equation (4) can be skipped and the first idea can be achieved. In addition, when the condition shown in Expression (7) is not satisfied, the row can be said to be a non-zero vector, so that the second idea can be achieved.
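Here, the time complexity required for evaluating the condition shown in Expression (7) is as large as $O(p^2)$ or $O(pn)$. Thus, in the present embodiment, the condition shown in Expression (7) is evaluated approximately so that the computational complexity is reduced. Specifically, in the present embodiment, an upper bound $\overline{K}_i$ and a lower bound $\underline{K}_i$ of the optimal condition value $K_i$ are used, as given by Equations (8) and (9).
[Math. 8]
$$\overline{K}_i = \tilde{K}_i + \|\Delta W_{(i)}\|_2 + \|G_{(i)}\|_2 \|\Delta W\|_F \tag{8}$$
[Math. 9]
$$\underline{K}_i = \tilde{K}_i - \|\Delta W_{(i)}\|_2 - \|G_{(i)}\|_2 \|\Delta W\|_F \tag{9}$$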
Here, $\tilde{K}_i$ is the optimal condition value immediately before entrance to the internal loop. In Equations (8) and (9), when $\tilde{W}$ is W immediately before entrance to the internal loop, $\Delta W_{(i)} = W_{(i)} - \tilde{W}_{(i)}$ and $\Delta W = W - \tilde{W}$. Further, $G_{(i)} \in \mathbb{R}^{1 \times p}$ denotes the i-th row vector of $G = X^T X \in \mathbb{R}^{p \times p}$. Expressions (10) and (11) are satisfied for Equations (8) and (9), respectively.
[Math. 10]
$$\overline{K}_i \ge K_i \tag{10}$$
[Math. 11]
$$\underline{K}_i \le K_i \tag{11}$$
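The bounding property can be checked numerically. The sketch below assumes unit-norm columns of X, takes the upper bound to be the mirror image of the lower bound in Equation (9), and uses hypothetical function names. Under these assumptions the bounds follow from the triangle inequality, because $z_i - \tilde{z}_i = \Delta W_{(i)} - G_{(i)} \Delta W$.

```python
import numpy as np

def condition_values(G, W):
    """K_i = ||z_i||_2 for every row i, with z_i written via the Gram matrix."""
    Z = G - G @ W + np.diag(G)[:, None] * W
    return np.linalg.norm(Z, axis=1)

def bounds(K_tilde, G, W, W_tilde):
    """Upper and lower bounds on K_i relative to a snapshot W_tilde."""
    dW = W - W_tilde
    slack = (np.linalg.norm(dW, axis=1)                              # ||dW_(i)||_2
             + np.linalg.norm(G, axis=1) * np.linalg.norm(dW))      # ||G_(i)||_2 ||dW||_F
    return K_tilde + slack, K_tilde - slack

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))
X /= np.linalg.norm(X, axis=0)            # unit-norm columns (assumption)
G = X.T @ X
W_tilde = 0.1 * rng.standard_normal((8, 8))
W = W_tilde + 0.05 * rng.standard_normal((8, 8))
upper, lower = bounds(condition_values(G, W_tilde), G, W, W_tilde)
K = condition_values(G, W)
bounds_hold = bool(np.all(lower <= K + 1e-12) and np.all(K <= upper + 1e-12))
```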
In the present embodiment, a determination is made whether the updating calculation of Equation (4) is omitted by using the upper bound in order to achieve the first idea described above. Further, in the present embodiment, the lower bound is used to specify rows that become non-zero vectors, and the updates are performed preferentially from such rows in order to achieve the second idea.
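Using the upper bound $\overline{K}_i$, rows that become zero vectors can be specified and the first idea can be achieved. Specifically, when Expression (12) is satisfied, $W_{(i)}$ is a zero vector.
[Math. 12]
$$\overline{K}_i \le \lambda \tag{12}$$
This is because, when Expression (12) holds, Expression (13) is satisfied and therefore the condition shown in Expression (7) is satisfied.
[Math. 13]
$$K_i \le \overline{K}_i \le \lambda \tag{13}$$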
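However, Equation (8) still requires a time complexity of $O(p^2)$. Thus, in the present embodiment, Equation (8) is modified so that the computational complexity is reduced. Specifically, the upper bound $\overline{K}_i$ is calculated as in Equation (14).
[Math. 14]
$$\overline{K}_i = \tilde{K}_i + \|\Delta W_{(i)}\|_2 + \|G_{(i)}\|_2\,\delta \tag{14}$$
In Equation (14), $\delta$ is given by Equation (15).
[Math. 15]
$$\delta = \sqrt{\|\Delta W\|_F^2 - \|\Delta W_{(j)}\|_2^2 + \|\Delta W'_{(j)}\|_2^2} \tag{15}$$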
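One reading of Equation (15) is an incremental refresh of the Frobenius norm of ΔW after a single row has changed. The sketch below assumes that ΔW'(j) denotes row j of ΔW after that row has been updated, which the text does not state explicitly. The function name `updated_delta` is hypothetical.

```python
import numpy as np

def updated_delta(fro_sq, old_row, new_row):
    """Equation (15): swap one row's contribution to ||dW||_F^2 in O(p).

    fro_sq  : current value of ||dW||_F^2
    old_row : row j of dW before the update
    new_row : row j of dW after the update (assumed meaning of dW'_(j))
    """
    fro_sq_new = fro_sq - old_row @ old_row + new_row @ new_row
    fro_sq_new = max(fro_sq_new, 0.0)     # guard against rounding below zero
    return np.sqrt(fro_sq_new), fro_sq_new
```

Keeping the squared norm as running state avoids recomputing the norm of the whole p × p matrix, which would cost $O(p^2)$ per row.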
the time complexity of Equation (14) is O(p), which is a sufficiently low computational complexity. Thus, when the upper bound
By using the lower bound $\underline{K}_i$, the row that becomes a non-zero vector can be specified and the second idea can be achieved. Specifically, when Expression (16) is satisfied, $W_{(i)}$ is a non-zero vector.
[Math. 16]
$$\underline{K}_i > \lambda \tag{16}$$
This is because, when Expression (16) holds, Expression (17) is satisfied and therefore the condition shown in Expression (7) is not satisfied.
[Math. 17]
$$K_i \ge \underline{K}_i > \lambda \tag{17}$$
The second idea is to preferentially update the parameters for such rows that become non-zero vectors. Thus, a set in which only rows that become non-zero vectors are collected is formed as shown in Equation (18) below in order to preferentially update rows that become non-zero vectors.
[Math. 18]
$$M = \{\, i \in \{1, \ldots, p\} \mid \underline{K}_i > \lambda \,\} \tag{18}$$
Further, it is possible to reduce the computational complexity of Equation (9) for the lower bound by reusing a term of the calculation of the upper bound by Equation (14). Specifically, the lower bound is calculated using Equation (19) below.
[Math. 19]
$$\underline{K}_i = 2\tilde{K}_i - \overline{K}_i \tag{19}$$
When the term calculated by Equation (14) is reused, the computational complexity in Equation (19) becomes O(1), which is a sufficiently low computational complexity. The second idea is achieved by first executing the coordinate descent by using only rows corresponding to a set M.
Thus, in the present embodiment, first, only the rows corresponding to the set M are used to execute the coordinate descent until it converges (second idea). Then, all the rows are used to execute the coordinate descent, and at this time the upper bound is used to perform updating while safely omitting unnecessary calculations (first idea). Because no necessary calculation is omitted, the objective function converges to the same value as with the original coordinate descent.
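The two stages can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of the embodiment: the columns of X are assumed to have unit 2-norm, $(a)_+$ is taken as max(a, 0), the snapshot for the bounds is refreshed every `n_inner` sweeps (a hypothetical parameter), and the upper bound is taken as the mirror image of Equation (9) with the norm of ΔW maintained incrementally.

```python
import numpy as np

def soft_threshold_row(z, lam):
    """Equation (4) with (a)_+ taken as max(a, 0)."""
    norm = np.linalg.norm(z)
    return np.zeros_like(z) if norm <= lam else (1.0 - lam / norm) * z

def fast_coordinate_descent(X, lam, n_outer=100, n_inner=5, tol=1e-10):
    """Two-stage sketch; assumes the columns of X have unit 2-norm."""
    p = X.shape[1]
    G = X.T @ X
    g_norm = np.linalg.norm(G, axis=1)        # ||G_(i)||_2, computed once
    W = np.zeros((p, p))

    def z_row(i):                             # Equation (6) in Gram form
        return G[i] - G[i] @ W + G[i, i] * W[i]

    # Stage 1 (second idea): iterate only over rows whose lower bound exceeds lam.
    # At the snapshot itself dW = 0, so the lower bound equals K_i.
    K_snap = np.array([np.linalg.norm(z_row(i)) for i in range(p)])
    M = np.flatnonzero(K_snap > lam)
    for _sweep in range(n_outer * n_inner):
        change = 0.0
        for i in M:
            new_row = soft_threshold_row(z_row(i), lam)
            change = max(change, np.abs(new_row - W[i]).max())
            W[i] = new_row
        if change < tol:
            break

    # Stage 2 (first idea): sweep all rows, skipping updates proven to give zero.
    skipped = 0
    for _outer in range(n_outer):
        W_snap = W.copy()
        K_snap = np.array([np.linalg.norm(z_row(i)) for i in range(p)])
        fro_sq = 0.0                          # ||dW||_F^2 relative to the snapshot
        change = 0.0
        for _inner in range(n_inner):
            for i in range(p):
                d_row = W[i] - W_snap[i]
                upper = (K_snap[i] + np.linalg.norm(d_row)
                         + g_norm[i] * np.sqrt(max(fro_sq, 0.0)))
                if upper <= lam:
                    new_row = np.zeros(p)     # K_i <= upper <= lam, so the row is zero
                    skipped += 1
                else:
                    new_row = soft_threshold_row(z_row(i), lam)
                change = max(change, np.abs(new_row - W[i]).max())
                d_new = new_row - W_snap[i]
                fro_sq += d_new @ d_new - d_row @ d_row   # O(p) norm refresh
                W[i] = new_row
        if change < tol:
            break
    return W, skipped
```

A skipped row is set to the zero vector, which is exactly what Equation (4) would have produced, so the second stage follows the same iterates as the plain coordinate descent while avoiding the $O(p^2)$ evaluation of $z_i$ for those rows.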
Processing Procedure of Data Processing
Next, a processing procedure for data processing executed by the data processing apparatus 10 according to the present embodiment will be described.
The gram matrix calculation unit 11 calculates a gram matrix G of given data (row number 1 in
Subsequently, the data processing apparatus 10 performs search for a hyperparameter λ (loop of row numbers 4 to 25 in
The lower bound calculation unit 14 calculates the lower bound _Ki of the determination value of the optimal condition when the solution of the parameter vector corresponding to the row to be processed is a zero vector by using Equation (19) (row number 7 in
After step S6 in
When step S5 is performed on all rows (step S7 in
Subsequently, the data processing apparatus 10 performs loop processing of coordinate descent by using the upper bound
The upper bound calculation unit 19 calculates the upper bound
When
On the other hand, when
On the other hand, when the upper bound is calculated for all rows (step S16 in
The data processing apparatus 10 ends the processing when all parameters have converged (step S17 in
As described above, in the present embodiment, a determination whether the parameter vector becomes a zero vector is made with low computational complexity before the updating calculation for the parameter vector, which is a bottleneck of the coordinate descent, is performed, and when the parameter vector becomes a zero vector, the updating calculation is omitted. Thus, in the present embodiment, it is possible to increase the speed of coordinate descent. Further, in the present embodiment, the parameter vectors that become non-zero vectors are specified in advance and are updated preferentially and intensively.
Thus, according to the present embodiment, the speed of the coordinate descent can be increased so that the speed of extraction of important rows and columns based on the CUR matrix decomposition can be increased. In the present embodiment, updating calculation for rows and columns that become a zero vector is safely omitted, and rows and columns that become a non-zero vector are mainly updated. Thus, in the present embodiment, because it can be guaranteed that the objective function value as a result of the optimization according to the present embodiment matches the original coordinate descent, it is possible to accurately execute the CUR matrix decomposition and extract important rows and columns.
Thus, according to the present embodiment, because the speed of coordinate descent can be accurately increased, the CUR matrix decomposition can be applied to large-scale data.
So far, description has been given using an example in which C is extracted. In the present modification example, an extension method for simultaneously extracting C and R will be described. In the extraction of C, the optimization problem of Expression (2) is solved, but in the simultaneous extraction of C and R, an optimization problem illustrated in Expression (20) below is solved.
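Here, $V \in \mathbb{R}^{p \times n}$, $H \in \mathbb{R}^{p \times n}$, and W is expressed by $W = V + H$. $\sum_{i=1}^{p} \|V_{(i)}\|_2$ and $\sum_{j=1}^{n} \|H^{(j)}\|_2$ are constraint terms that make it easy for a row vector of V and a column vector of H to become zero vectors, respectively. $\lambda_r$ and $\lambda_c$ are hyperparameters for controlling the strength of the constraints.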
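Expression (20) itself does not appear in this text. A form consistent with the definitions above and with the optimal condition values of Equations (21) and (22) would be the following, where the factor 1/2 and the pairing of $\lambda_r$ with V and $\lambda_c$ with H are assumptions inferred from the surrounding description:

```latex
% Assumed form of Expression (20): simultaneous selection of rows of V and columns of H
\min_{V,\,H \in \mathbb{R}^{p \times n}} \; \frac{1}{2}\,\| X - XWX \|_F^2
  \;+\; \lambda_r \sum_{i=1}^{p} \| V_{(i)} \|_2
  \;+\; \lambda_c \sum_{j=1}^{n} \| H^{(j)} \|_2,
\qquad W = V + H
```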
C and R are extracted by using the same scheme as described above in correspondence to indexes of non-zero vectors of V and H, respectively. Because there are two variables V and H, two types of coordinate descent are also executed in correspondence to V and H. The optimal condition values when V and H take zero vector can be calculated as in Equations (21) and (22) below, respectively.
[Math. 21]
$$R_i = \left\| X^{(i)T} \{ X - (XW - X^{(i)} V_{(i)}) X \} X^{T} \right\|_2 \tag{21}$$
[Math. 22]
$$C_j = \left\| X^{T} \{ X - X (WX - H^{(j)} X_{(j)}) \} X_{(j)}^{T} \right\|_2 \tag{22}$$
For the above, when Ri≤λr, V(i)=0 is satisfied, and when Cj≤λc, H(j)=0 is satisfied.
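Equations (21) and (22) translate directly into NumPy. In this sketch the function names are hypothetical, X(i) with a row vector V(i) is read as the i-th column of X, and X(j) paired with the column H(j) is read as the j-th row of X, which is the only reading under which the matrix shapes agree.

```python
import numpy as np

def row_condition_value(X, V, H, i):
    """Equation (21): R_i, the optimal condition value for V_(i) = 0."""
    W = V + H
    Xi = X[:, [i]]                                # i-th column of X, n x 1
    residual = X - (X @ W - Xi @ V[[i], :]) @ X   # n x p, with row i of V removed
    return np.linalg.norm(Xi.T @ residual @ X.T)

def col_condition_value(X, V, H, j):
    """Equation (22): C_j, the optimal condition value for H^(j) = 0."""
    W = V + H
    Xj = X[[j], :]                                # j-th row of X, 1 x p
    residual = X - X @ (W @ X - H[:, [j]] @ Xj)   # n x p, with column j of H removed
    return np.linalg.norm(X.T @ residual @ Xj.T)
```

A row i would then be declared zero when `row_condition_value(X, V, H, i) <= lam_r`, and a column j when `col_condition_value(X, V, H, j) <= lam_c`, with `lam_r` and `lam_c` standing for the two hyperparameters.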
If the upper and lower bounds can be calculated for the optimal condition values Ri and Cj, the data processing apparatus 10 described so far can be used for simultaneous extraction of C and R. An upper bound of Ri is expressed as in Equation (23), and a lower bound of Ri is expressed as in Equation (24).
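[Math. 23]
$$\overline{R}_i = \tilde{R}_i + G_{(i)}^{(i)} \|\Delta V_{(i)}\|_2 \|F\|_F + \|G_{(i)}\|_2 \|\Delta W\|_F \|F\|_F \tag{23}$$
[Math. 24]
$$\underline{R}_i = \tilde{R}_i - G_{(i)}^{(i)} \|\Delta V_{(i)}\|_2 \|F\|_F - \|G_{(i)}\|_2 \|\Delta W\|_F \|F\|_F \tag{24}$$
In the above, $F = XX^T$. $\tilde{R}_i$ is the optimal condition value immediately before entrance to the internal loop.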
An upper bound of Cj is expressed as in Equation (25), and a lower bound of Cj is expressed as in Equation (26).
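[Math. 25]
$$\overline{C}_j = \tilde{C}_j + \|G\|_F \|\Delta H^{(j)}\|_2 F_{(j)}^{(j)} + \|G\|_F \|\Delta W\|_F \|F^{(j)}\|_2 \tag{25}$$
[Math. 26]
$$\underline{C}_j = \tilde{C}_j - \|G\|_F \|\Delta H^{(j)}\|_2 F_{(j)}^{(j)} - \|G\|_F \|\Delta W\|_F \|F^{(j)}\|_2 \tag{26}$$
$\tilde{C}_j$ is the optimal condition value immediately before entrance to the internal loop.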
System Configuration of Embodiment
Each component of the data processing apparatus 10 illustrated in
Further, all or some of processing operations performed in the data processing apparatus 10 can be implemented by a CPU and a program analyzed and executed by the CPU. Further, each of the processing operations performed by the data processing apparatus 10 may be implemented as hardware by wired logic.
Further, all or some of the processing operations described as being performed automatically among the processing operations described in the embodiment can be performed manually. Alternatively, all or some of the processing operations described as being performed manually can be performed automatically using a known method. In addition, information including the processing procedures, control procedures, specific names, and various types of data or parameters described above and illustrated in the drawings can be appropriately changed unless otherwise specified.
Program
The memory 1010 includes a read only memory (ROM) 1011 and a random access memory (RAM) 1012. The ROM 1011 stores, for example, a boot program such as a basic input output system (BIOS). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. For example, a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.
The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and a program data 1094. That is, a program that defines each of the processing operations of the data processing apparatus 10 is implemented as the program module 1093 in which a code that can be executed by a computer 1000 has been described. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, the program module 1093 for executing the same processing as that of a functional configuration in the data processing apparatus 10 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced with a solid state drive (SSD).
Further, configuration data to be used in the processing of the embodiment described above is stored as the program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 and executes the program module 1093 and the program data 1094, as necessary.
The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090 and may be stored, for example, in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a local area network (LAN), a wide area network (WAN), or the like). The program module 1093 and the program data 1094 may be read by the CPU 1020 from another computer via the network interface 1070.
Although the embodiment to which the invention made by the present inventor has been applied has been described above, the present invention is not limited by the description and the drawings that form a part of the disclosure of the present invention according to the present embodiment. That is, other embodiments, examples, operational techniques, and the like made by those skilled in the art based on the present embodiment are all included in a category of the present invention.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/JP2020/018013 | 4/27/2020 | WO | |