The present disclosure relates to a technique of learning a neural network.
A neural network, in particular a convolutional neural network (hereinafter referred to as a “CNN”), which has been actively studied in recent years, has high recognition capability but requires a large number of parameters. Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus, Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation, Advances in Neural Information Processing Systems 27 (NIPS 2014) discloses a method for reducing the amount of memory required for a recognition device.
According to the method disclosed in Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus, Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation, Advances in Neural Information Processing Systems 27 (NIPS 2014), a weight parameter of a convolution calculation of the CNN is represented by a direct product of vectors along three axes, and a plurality of such direct products are added together so that approximation compression is performed (low-rank approximation). However, weight parameters, particularly those in higher layers of the CNN, are likely to be sparse or inconsecutive, and it is therefore difficult to achieve high accuracy with approximation based on direct products. Accordingly, there is a need in the art for a method that approximates sparse weights, such as the weight parameters in the higher layers of the CNN, with higher accuracy than general methods.
According to an embodiment of the present invention, an information processing apparatus includes a division unit configured to divide a weight parameter of a neural network into a plurality of groups, and an encoding unit configured to approximate the weight parameter in accordance with a codebook and encode the weight parameter for individual divided groups.
Further features of the present invention will become apparent from the following description of exemplary embodiments with reference to the attached drawings.
Hereinafter, a first embodiment of the present disclosure will be described with reference to the accompanying drawings. In this embodiment, basic patterns of a method for compressing weight parameters of the neural network and a recognition operation using compressed parameters are described.
The information processing apparatus further includes, as peripheral functions, a data input unit 106 which supplies data to be processed to the neural network and a result output unit 107 which outputs a result of a process performed in the neural network. The information processing apparatus further includes a neural network parameter storage 108 which stores parameters of the neural network before compression and which supplies the parameters to the parameter division unit 101 and a user instruction unit 109 which is used by a user to input various conditions when parameters are to be divided or encoded.
The information processing apparatus has a hardware configuration including a central processing unit (CPU), a read only memory (ROM), a random access memory (RAM), and a hard disk drive (HDD), and the various functional configurations and the processes in the flowcharts described below are realized when the CPU executes programs stored in the ROM or the hard disk (HD), for example. The RAM includes a storage region functioning as a work area in which the CPU loads and executes the programs. The ROM includes a storage region which stores the programs to be executed by the CPU. The HD includes a storage region which stores various programs and various data, including data on parameters to be used when the CPU executes the processes.
Note that the information processing apparatus of the present disclosure may process various data, such as audio, images, and text. However, the input data in this embodiment is a color still image of three channels (hereinafter the term “channel” is abbreviated as “ch”) as schematically illustrated in
Next, an operation of approximately compressing parameters of the neural network performed by the information processing apparatus will be described in detail with reference to a flowchart of
Hereinafter, the process of the alignment will be described in detail. The parameters of the convolution calculation of the CNN may generally be represented by a four-dimensional tensor. The size of the tensor is denoted by “W×H×D_IN×D_OUT”. Here, “W” and “H” denote a vertical pixel size and a horizontal pixel size of the convolution, and “D_IN” and “D_OUT” indicate the number of feature channels of the input data and the number of feature channels of the output data which is output as a result of the convolution.
When a first layer of the neural network of
f: R^(W×H×D_IN×D_OUT) → R^(W×H×D′)    Expression 1
Note that the following equation is satisfied: D′ = D_IN×D_OUT. As a concrete example of the calculation operation f, the calculation operation represented by Expression 2 below may be used.
c′[i, j, p+(q−1)×D_IN] := c[i, j, p, q]    Expression 2
Note that the following conditions are satisfied:
p = 1, . . . , D_IN
q = 1, . . . , D_OUT
c′ ∈ R^(W×H×D′), c ∈ R^(W×H×D_IN×D_OUT)
The calculation operation described above aligns the parameter in raster order. According to this calculation operation, the tensor having a size of 3×3×3×64 in the first layer is converted into a three-dimensional tensor having a size of 3×3×192.
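For illustration only, a minimal sketch of this alignment (assuming the weight is held as a numpy array; the helper name align_parameter is hypothetical) may realize Expression 2 as follows:

```python
import numpy as np

def align_parameter(c):
    """Align a 4-D convolution weight c of shape (W, H, D_IN, D_OUT) into a
    3-D tensor of shape (W, H, D_IN * D_OUT), following Expression 2:
    c'[i, j, p + (q - 1) * D_IN] := c[i, j, p, q] (1-based indices in the text)."""
    W, H, d_in, d_out = c.shape
    c_aligned = np.empty((W, H, d_in * d_out), dtype=c.dtype)
    for q in range(d_out):        # 0-based counterpart of q = 1, ..., D_OUT
        for p in range(d_in):     # 0-based counterpart of p = 1, ..., D_IN
            c_aligned[:, :, p + q * d_in] = c[:, :, p, q]
    return c_aligned

# Example: the first-layer 3x3x3x64 weight becomes a 3x3x192 tensor.
assert align_parameter(np.zeros((3, 3, 3, 64))).shape == (3, 3, 192)
```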
In step S104, the parameter division unit 101 divides the parameter aligned in the preceding step into a plurality of partial parameters. It is assumed here that a parameter having a size of 3×3×192 is divided into partial parameters having a size of 3×3×N as illustrated in
As a result of the division, the weight parameters of the individual layers are divided into partial parameters c(i, j) having the same size as illustrated in
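A minimal sketch of this division (assuming numpy, a channel unit N that divides the aligned channel dimension, and the hypothetical helper name divide_parameter; N = 4 below is an illustrative value):

```python
import numpy as np

def divide_parameter(c_aligned, N):
    """Divide an aligned weight of shape (W, H, D') into partial parameters
    c(i, j) of equal size (W, H, N) along the channel dimension. Assumes D'
    is divisible by N; the indivisible case is handled separately in the text,
    for example by leaving the remainder uncompressed or by zero-padding."""
    W, H, d_prime = c_aligned.shape
    assert d_prime % N == 0
    return [c_aligned[:, :, k * N:(k + 1) * N] for k in range(d_prime // N)]

# Example: a 3x3x192 parameter divided with N = 4 yields 48 blocks of 3x3x4.
parts = divide_parameter(np.random.randn(3, 3, 192), N=4)
assert len(parts) == 48 and parts[0].shape == (3, 3, 4)
```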
Next, the parameters divided in units of elements are subjected to approximation compression using a codebook which is provided independently. This process will be described in detail with reference to the flowchart of
The codebook vectors and the codebook coefficients are learnt by minimizing a loss function as illustrated in Expression 3 below.
min_{X, A} Σ_n ||c_n − A x_n||² + λ|x_n|,
subject to ||a_n|| ≤ 1, ∀ n = 1, 2, . . . , M    Expression 3
Here, “c_n” denotes the n-th one of the divided weight parameters; the three-dimensional data c(i, j) ∈ R^(W×H×D) is aligned as a column vector of length L (= W×H×D) so as to obtain c_n ∈ R^(L×1). A is a set of M codebook vectors a_i and is represented as A = [a_1, a_2, . . . , a_M], where each codebook vector satisfies a_i ∈ R^(L×1). x_n is the coefficient vector of the codebook used for reconstruction of the n-th weight parameter and satisfies x_n ∈ R^(M×1).
The first term of the formula in the first row of Expression 3 is a loss term for the approximation error, and the second term is a loss term referred to as a “sparse term”. “λ” is a hyperparameter which balances the two terms. The formula in the second row is a constraint condition for eliminating trivial solutions. When the learning calculations are performed, minimization over X and minimization over A in Expression 3 are performed alternately until convergence is reached or a predetermined number of iterations is reached (step S108 to step S113). Since the second term in the first row is an L1-norm cost term, a large number of the codebook coefficients x_n converge to 0, that is, the codebook coefficients x_n become sparse. Therefore, approximate reconstruction of a weight parameter c_n is enabled using only the K codebook coefficients which have the largest absolute values among the codebook coefficients x_n. Sparse coding is a general technique, as described in J. Yang, K. Yu, Y. Gong, and T. Huang, Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification, IEEE Conference on Computer Vision and Pattern Recognition, 2009, and therefore, a more detailed description is omitted.
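As one possible way to carry out this alternating minimization (a sketch only, assuming scikit-learn is available; the dimensions and hyperparameter values are illustrative, and scikit-learn constrains each atom to unit norm rather than norm ≤ 1):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# C: each divided weight parameter c_n flattened to a row vector of length
# L = W * H * D and stacked into an (n_parameters, L) matrix (random
# placeholder values for illustration).
C = np.random.randn(48, 3 * 3 * 4)

M = 16      # number of codebook vectors (hyperparameter)
lam = 0.1   # weight of the sparse term ("lambda" in Expression 3)

# DictionaryLearning alternates coefficient and dictionary updates, which
# corresponds to the alternating minimization of Expression 3.
learner = DictionaryLearning(n_components=M, alpha=lam, max_iter=100,
                             transform_algorithm='lasso_lars',
                             transform_alpha=lam)
X = learner.fit_transform(C)   # codebook coefficients, shape (48, M)
A = learner.components_        # codebook vectors,      shape (M, L)

reconstruction = X @ A         # approximation of the divided weight parameters
```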
In this way, the weight parameters of the convolutions of the layers are approximated using the codebook A including M codebook vectors and the codebook coefficients X used for reconstruction. The codebook storage 103 stores the codebook A and the codebook coefficients X obtained in step S114 to step S117, and thereafter, the approximation compression operation is terminated.
Note that the compression rate varies depending on the number M of codebook vectors and the number K of codebook coefficients used for reconstruction, both of which are hyperparameters. For example, a compression rate obtained when a general CNN referred to as “AlexNet” is compressed is illustrated in
c(i, j) = Σ_{m∈Top(K)} x(i, j, m) a_m    Expression 4
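A minimal sketch of this reconstruction (assuming numpy; the shapes and the helper name reconstruct_partial are illustrative):

```python
import numpy as np

def reconstruct_partial(x, A, K, shape=(3, 3, 4)):
    """Reconstruct one partial weight c(i, j) as in Expression 4: the weighted
    sum of the codebook vectors a_m whose coefficients x(i, j, m) have the K
    largest absolute values."""
    top_k = np.argsort(np.abs(x))[-K:]           # indices of the Top(K) coefficients
    c_flat = sum(x[m] * A[m] for m in top_k)     # sum over m of x(i, j, m) * a_m
    return c_flat.reshape(shape)

# Example with M = 16 codebook vectors of length L = 3 * 3 * 4 and K = 3.
A = np.random.randn(16, 3 * 3 * 4)
x = np.random.randn(16)
c_ij = reconstruct_partial(x, A, K=3)
```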
Thereafter, a processing operation of the neural network using the convolution process is performed similarly to the general CNNs (step S206). In this way, the recognition operation according to this embodiment is performed.
According to this embodiment, a weight parameter of the neural network is divided into a plurality of portions (groups) having the same size, and approximation is individually performed on the portions using a weighted sum of codebook vectors. Accordingly, sparse weights, such as the weight parameters in higher layers of the CNN, may be approximated with high accuracy.
Furthermore, various forms of parameter alignment and parameter division other than those described above may be employed. For example, a parameter may be aligned into a size of 9×3×64 and then divided into portions of a size of 9×3×4, or aligned in a two-dimensional manner into a size of 27×64 and then divided into portions of a size of 3×64. This embodiment is not limited to a specific form. However, since the convolution is performed for individual channels in the recognition operation of the CNN, it is preferable, in terms of implementation speed, that the dimension along which the division is performed is not the convolution space direction but the direction of the input/output channel dimension, as described above.
Furthermore, although the fully connected layers and the bias terms are not compressed in the foregoing description, they may be included in the targets of the compression. For example, although a weight parameter of a fully connected layer is an array of a size of D_i×D_(i+1), the weight parameter may be aligned and shaped into a three-dimensional parameter of a size of 3×3×[D_i×D_(i+1)/9]. As an alignment method, raster order may be employed; that is, any order may be employed as long as it is reproducible. If the weight parameter is shaped in this way, the parameter may be easily divided into element units of 3×3×N. Note that, if the value “D_i×D_(i+1)/9” is indivisible, or if a remainder is obtained in the division into N channels, the remainder is not compressed and the values of the original parameter are stored. Alternatively, a dummy value, such as 0, may be appended to the parameter so that a divisible size is obtained. Note that the dummy value is removed after parameter reconstruction in the recognition operation. Furthermore, in the recognition, only in the fully connected layer, the calculation process of the neural network is required to be performed after the weight parameter is reconstructed in the usual manner and the parameter is aligned again into an array of a size of D_i×D_(i+1). The bias values may be compressed by the same method.
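As an illustration of this shaping (a sketch only, assuming numpy; the helper name reshape_fc_weight and the zero-padding correspond to the dummy-value variant described above):

```python
import numpy as np

def reshape_fc_weight(w, N):
    """Reshape a fully connected weight of size (D_i, D_(i+1)) into a
    3 x 3 x C tensor in raster order, appending dummy zeros when the number
    of elements is not divisible by 9 * N. The zeros are removed again after
    reconstruction at recognition time."""
    flat = w.reshape(-1)                # raster order
    block = 9 * N
    pad = (-flat.size) % block          # elements needed to reach a divisible size
    flat = np.concatenate([flat, np.zeros(pad, dtype=w.dtype)])
    return flat.reshape(3, 3, flat.size // 9), pad

# Example: a 100 x 50 fully connected weight shaped for division in 3x3x4 units.
w3d, pad = reshape_fc_weight(np.random.randn(100, 50), N=4)
```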
Furthermore, codebook approximation compression of a neural network other than a CNN may be considered as another modification. In this case, the weight parameters of all the layers are two-dimensional parameters of a size of D_i×D_(i+1). The parameters of the size of D_i×D_(i+1) may be aligned into a size of [W]×[D_i×D_(i+1)/W] so as to have a predetermined size W. Note that each pair of square brackets corresponds to one dimension of the parameter. The parameters are aligned in raster order. Thereafter, each of the parameters is divided into element units of W×N channels, and the obtained partial parameters are approximated by a codebook. Note that, as with the case described above, a dummy value is added in the indivisible case.
Furthermore, as a further modification, consider a case where convolution layers having a convolution pixel size other than 3×3, for example layers having a size of 5×5 or 7×7, are mixed in the network. In this case, a codebook may be provided for each size, and encoding learning may be performed individually for each size.
Note that, as a learning method using a codebook, a method for approximating the already learnt weight parameters of the neural network using a codebook has been described above. However, various modifications of the approximation method may be made as described below, and the choice among them affects the final capability. Hereinafter, modifications of the learning operation will be described.
First Modification of Learning Operation
As a first modification, a method for gradually approximating the parameters of the individual layers starting from a lower layer, instead of performing approximation compression on all the layers at once, will be described. The procedure is as follows. First, codebooks and codebook coefficients are learnt so that the weight parameters in all the layers of the neural network are approximated. Thereafter, only the parameter of the first layer of the neural network is replaced by the value which has been approximated and reconstructed by the codebook.
Subsequently, learning data is supplied to the neural network, and the weights in the second layer onwards are learnt again using the error backpropagation method. This process is performed on the individual layers from the lowest layer to the highest layer, as sketched below. When all the layers are subjected to approximation compression at once, there is a high risk that approximation errors accumulate in the upper layers. However, the errors may be reduced if the approximation is performed on the layers one by one.
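A minimal structural sketch of this layer-by-layer procedure (the callbacks fit_codebook, reconstruct, and retrain are hypothetical stand-ins for the codebook learning, reconstruction, and re-learning operations described above):

```python
def progressive_approximation(layers, fit_codebook, reconstruct, retrain):
    """First modification: replace the weights one layer at a time, from the
    lowest layer to the highest, retraining the remaining layers after each
    replacement so that approximation errors do not accumulate in upper layers.

    layers       -- list of per-layer weight parameters (lowest layer first)
    fit_codebook -- learns a codebook and coefficients for all layers
    reconstruct  -- returns the approximated weight of one layer
    retrain      -- re-learns the layers above the given index by backpropagation
    """
    codebook, coeffs = fit_codebook(layers)
    for i in range(len(layers)):
        layers[i] = reconstruct(codebook, coeffs, i)   # fix this layer's weights
        if i + 1 < len(layers):
            retrain(layers, first_trainable=i + 1)     # re-learn upper layers only
    return layers
```

Second Modification of Learning Operation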
As a second modification, an embodiment in which learning of the codebook is performed simultaneously with learning of the neural network will be described. In the second modification, first, a codebook A and codebook coefficients X are initialized with random numbers, and a weight w of the neural network is expressed in advance by the approximation formula w := Σ_m x_m a_m. Then the codebook coefficients X are updated using a stochastic error backpropagation method. The formulae in Expression 5 below are used for the update.
Note that “E_NN” indicates an error amount relative to a target value at the time of learning of the neural network. “η” indicates a learning coefficient. “E” indicates an error amount obtained by adding the error of the neural network to the loss of the sparse term. “sign(x)” indicates an operator that returns the sign of x. “∂E/∂w” indicates the gradient of the error and may be obtained by the general error backpropagation method.
Furthermore, a variable A of the codebook is updated by the stochastic error backpropagation method in accordance with Expression 6.
Note that “ε” indicates a learning coefficient. By alternately performing the updates described above, learning of the neural network, the codebook, and the codebook coefficients may be performed simultaneously.
As a third modification, an embodiment in which the order of the channels is changed and learnt so that the weights of the neural network match an existing, already learnt codebook is considered. Although a weight parameter of the CNN may be aligned in raster order, a process of changing the order of the channels has not been performed so far. In the CNN, the order of the channels in the individual layers is not important, and therefore, changing the order of the channels does not affect the learning as long as consistency of the parameters is maintained among the layers. Therefore, in the third modification, the weight parameters of the CNN are sorted so as to be suitable for the learnt codebook.
Specifically, it is assumed that, as illustrated in
Making use of the characteristic described above, the following sorting method may be employed, for example. First, the pair of a convolution parameter and a codebook vector which has the lowest approximation accuracy in the approximation performed using the temporary codebook is determined. Subsequently, the feature channel having the lowest approximation accuracy within that parameter is determined. Thereafter, this channel is randomly swapped with another channel in the same layer, and if the overall approximation accuracy is improved as a result, the swap is adopted.
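A minimal hill-climbing sketch of this idea (assumptions: numpy, a least-squares top-K fit as the accuracy measure, and a simplified variant that swaps a randomly chosen pair of channels rather than starting from the worst-approximated one; the corresponding reordering of the adjacent layer needed for consistency is omitted, and all names are illustrative):

```python
import numpy as np

def part_error(part, codebook, K):
    """Squared error of approximating one 3x3xN part by its best least-squares
    top-K codebook reconstruction (an illustrative accuracy measure)."""
    flat = part.reshape(-1)
    coef, *_ = np.linalg.lstsq(codebook.T, flat, rcond=None)
    top_k = np.argsort(np.abs(coef))[-K:]
    recon = codebook[top_k].T @ coef[top_k]
    return float(np.sum((flat - recon) ** 2))

def total_error(weight, codebook, K, N):
    """Sum of the part errors over all 3x3xN parts of one layer's weight."""
    parts = [weight[:, :, k * N:(k + 1) * N] for k in range(weight.shape[2] // N)]
    return sum(part_error(p, codebook, K) for p in parts)

def try_channel_swap(weight, codebook, K, N, rng):
    """One hill-climbing step: randomly swap two feature channels of the layer
    and keep the swap only if the overall approximation error decreases."""
    before = total_error(weight, codebook, K, N)
    i, j = rng.choice(weight.shape[2], size=2, replace=False)
    weight[:, :, [i, j]] = weight[:, :, [j, i]]
    if total_error(weight, codebook, K, N) >= before:
        weight[:, :, [i, j]] = weight[:, :, [j, i]]   # revert: no improvement
    return weight
```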
The learning method for sorting the weights of the CNN to fit an existing codebook has been described above. However, various methods for learning the CNN in accordance with an existing codebook may be employed, and this embodiment is not limited to the method described herein.
In a fourth modification, a user sets a constraint condition on the parameters using the user instruction unit 109, and the learning is optimized within the constraint condition. For example, the fourth modification corresponds to the following method: a maximum value of a memory size or the like is input, and the parameter encoding unit 102 searches for the hyperparameters K and N so that the size after compression does not exceed the condition value. One example of such a method changes the values of the hyperparameters at certain intervals during learning and adopts the change which has the largest value of the evaluation formula represented by Expression 7 below and which satisfies the constraint condition.
Evaluation Value=Size Increasing Rate after Compression×Reduction Rate of Approximation Error Expression 7
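A minimal sketch of this constrained search (the callbacks compressed_size and evaluate are hypothetical; they are assumed to return the size after compression and the two rates of Expression 7 for a candidate change of K and N):

```python
def evaluation_value(size_increase_rate, error_reduction_rate):
    """Expression 7: product of the size increasing rate after compression
    and the reduction rate of the approximation error."""
    return size_increase_rate * error_reduction_rate

def search_hyperparameters(candidates, compressed_size, max_size, evaluate):
    """Among candidate (K, N) changes whose compressed size stays within the
    user-given limit, pick the one with the largest evaluation value."""
    feasible = [c for c in candidates if compressed_size(c) <= max_size]
    return max(feasible, key=lambda c: evaluation_value(*evaluate(c)), default=None)
```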
Next, a second embodiment of the present disclosure will be described. In the first embodiment, the weight parameters are compressed using a codebook which is common to all the layers. On the other hand, in this embodiment, a method for reading and using different codebook sets at different timings in different layers, so that the amount of memory used by the information processing apparatus is reduced, will be described. Note that descriptions of components which are the same as those of the first embodiment are omitted.
Lower layers of the CNN are likely to have weight distributions like Gabor filters, and higher layers are likely to have sparse weight parameters including a large number of zero values. Therefore, the layers are loosely divided into lower, middle, and higher layers, and different codebooks are used for each so that the approximation accuracy may be improved without increasing the amount of memory used. Note that, when different codebooks are used in different layers, the codebooks and the codebook coefficients are learnt for the individual layers at the time of learning of the codebooks.
On the other hand, as illustrated in
In a processing flow of this embodiment, first, sizes of codebook sets corresponding to the individual layers are set in step S301. This setting is performed by assigning predetermined values in advance or by causing the user to input values using the user instruction unit 109. In step S302, the parameter encoding unit 102 initializes all the codebook sets and values of codebook coefficients using a random number. Subsequently, in step S304 to step S309, learning update is successively performed on the codebook coefficients of the individual layers. Specifically, first, the parameter encoding unit 102 reads a weight parameter of a target layer and all codebook sets to be used (step S305). It is assumed here that the weight parameter has been divided.
Thereafter, the parameter encoding unit 102 updates the codebook coefficients in accordance with Expression 3 so that the weight parameter of the layer is approximated (step S307). In this case, only the codebook vectors included in the codebook set used in this layer are used for the approximation. In this way, the learning update is performed on the individual layers. When the update of all the layers in one iteration has been completed, the values of the codebook vectors of all the codebook sets are updated in accordance with Expression 3 (step S310). By repeating the process described above a certain number of times, the plurality of codebook sets, which are used at overlapping timings, are appropriately learnt. With this configuration, different codebook sets may be read and used at different timings in the individual layers.
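A structural sketch of steps S301 to S310 (the callbacks update_coefficients and update_codebook are hypothetical stand-ins for the coefficient and codebook-vector updates of Expression 3; shapes and names are illustrative):

```python
import numpy as np

def learn_shared_codebook_sets(layer_weights, set_of_layer, set_sizes, L,
                               update_coefficients, update_codebook,
                               iterations=100, seed=0):
    """Structural sketch of steps S301 to S310: several layers share a small
    number of codebook sets, the coefficients are updated layer by layer, and
    the codebook vectors of every set are updated once per iteration.

    layer_weights -- list of (n_parts, L) matrices of divided weights per layer
    set_of_layer  -- index of the codebook set used by each layer
    set_sizes     -- number of codebook vectors in each set (step S301)
    """
    rng = np.random.default_rng(seed)
    codebooks = [rng.standard_normal((m, L)) for m in set_sizes]    # step S302
    coeffs = [None] * len(layer_weights)
    for _ in range(iterations):
        for i, w in enumerate(layer_weights):                       # steps S304-S309
            coeffs[i] = update_coefficients(w, codebooks[set_of_layer[i]])
        for s, A in enumerate(codebooks):                           # step S310
            users = [i for i, k in enumerate(set_of_layer) if k == s]
            codebooks[s] = update_codebook([layer_weights[i] for i in users],
                                           [coeffs[i] for i in users], A)
    return codebooks, coeffs
```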
Although a codebook set for a plurality of layers has been described in the foregoing description, the different layers may independently have different codebook sets and the codebook sets may be read every time before calculation is performed in the layers. As described above, this embodiment relates to holding and reading timings of the codebook sets and is not limited to a specific embodiment.
Next, a third embodiment of the present invention will be described. In the first and second embodiments, the weight parameters of a CNN for image data are subjected to approximation compression. However, this embodiment is further generalized and is applicable to a CNN which processes higher dimensional data. Examples of the higher dimensional data include data with depth information, voxel images used in medical image diagnosis, and moving images. In the description below, an embodiment of the approximation compression of the parameters of a CNN which processes moving image data, for example, will be described. Note that descriptions of components which are the same as those of the first and second embodiments are omitted.
As described above, according to this embodiment, the approximation compression may be performed on data on depth information, voxel images, moving images, or the like, with high accuracy.
Next, a fourth embodiment of the present invention will be described. Although the codebook vectors of the parameters are real numbers in the foregoing embodiments, the vectors are binary in this embodiment. Since the codebook vector is binarized in this embodiment, accuracy of approximation may be lowered. However, reduction of a memory size or reduction of a calculation load amount may be expected. Note that descriptions of components the same as those of the first to third embodiments are omitted.
On the other hand,
(1) With reference to the 3×3×N elements of the codebook vector, when an element is 1, the corresponding value of the feature map is read and added to a feature map addition result 1201; when the element is 0, nothing is added.
(2) When the process has been performed on the K codebooks, K feature map addition results 1201a to 1201k are multiplied by corresponding codebook coefficients and a sum total is obtained as a result of the convolution.
In this way, a convolution calculation on the single portion is completed. The number of times multiplication is performed for the convolution is K, and the number of times addition is performed for the convolution is 3×3×N×K+K. In particular, when a space size of convolution is large, such as a size of 5×5 or 7×7, since the number of times multiplication is performed is small in this embodiment, this embodiment is advantageous in terms of a size of a circuit or the like.
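A minimal sketch of this addition-dominated convolution (assuming numpy; the function name, the {0, 1} encoding, and the example sizes are illustrative):

```python
import numpy as np

def binary_codebook_convolution(patch, binary_codebook, coefficients):
    """Convolution at one location with a {0, 1} codebook: for each of the K
    codebook vectors, feature values are only added where the corresponding
    bit is 1, and the K partial sums are then multiplied by the codebook
    coefficients, so only K multiplications are needed.

    patch           -- the 3 x 3 x N input feature-map patch under the kernel
    binary_codebook -- K codebook vectors of shape (K, 3, 3, N) with values 0/1
    coefficients    -- the K codebook coefficients of this partial parameter"""
    partial_sums = np.empty(len(binary_codebook))
    for k, a in enumerate(binary_codebook):
        partial_sums[k] = patch[a == 1].sum()         # additions only
    return float(np.dot(partial_sums, coefficients))  # K multiplications

# Example with K = 4 codebook vectors and N = 8 input channels.
rng = np.random.default_rng(0)
y = binary_codebook_convolution(rng.standard_normal((3, 3, 8)),
                                rng.integers(0, 2, size=(4, 3, 3, 8)),
                                rng.standard_normal(4))
```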
Next, a method for obtaining binary codebook vectors by learning will be described. In this method, a codebook is learnt in accordance with Expression 8 below.
min_{X, A} Σ_n ||c_n − A x_n||² + λ1|x_n| + λ2 Q(A),
Q(A) = Σ_{i,j} |a_ij − q_nearest|    Expression 8
Expression 8 is obtained by generalizing Expression 3 of the third embodiment and includes a binary constraint term Q(A) on the codebook. The term q_nearest in Q(A) is the value in the binary set {0, 1} that is closer to the value of a_ij. The following process is performed to obtain binary codebook vectors in accordance with Expression 8 by learning.
First, all the codebook vectors are initialized with random numbers before the learning is started. As the learning progresses, the value of λ2 is gradually increased so that the codebook values become close to binary. When the learning has converged, binarization is finally performed using a threshold value of 0 so that the values of all the elements of the codebook vectors are rounded to the binary set {0, 1}. In this way, a codebook whose elements take binary values is obtained.
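A minimal sketch of the binary constraint term and the final binarization (assuming numpy; here each element is rounded to the nearer of {0, 1}, which is one way of realizing the rounding described above):

```python
import numpy as np

def binary_penalty(A):
    """Q(A) in Expression 8: distance of every codebook element a_ij to the
    nearer of the binary values {0, 1}."""
    return float(np.sum(np.abs(A - np.rint(np.clip(A, 0.0, 1.0)))))

def binarize_codebook(A):
    """Final step after convergence: round every element of the codebook
    vectors to the nearer of {0, 1}."""
    return np.rint(np.clip(A, 0.0, 1.0)).astype(np.int8)

# During learning, the weight lambda_2 on binary_penalty(A) is gradually
# increased so the codebook values are pushed toward binary, and
# binarize_codebook is applied once the learning has converged.
```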
Note that, as a modification of this embodiment, an element of a codebook may be binary with a∈{−1, 1} or ternary with a∈{−1, 0, 1}. Furthermore, the discrete values may have arbitrary accuracy in a range from binary up to n bits, and the accuracy of the discrete values may be changed for every codebook vector. Moreover, a plurality of constant values may be set as the elements of a codebook vector. In this case, since a reference table is additionally used, the codebook vector may still be represented by a small number of bits.
Furthermore, in addition to the codebook vector, a codebook coefficient may be discretized in various methods described above.
Furthermore, as another modification, as disclosed in Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David, BinaryConnect: Training Deep Neural Networks with binary weights during propagations, NIPS 2015, an embodiment may be considered in which a special neural network whose weight values are constituted by the binary set {−1, 1} or the ternary set {−1, 0, 1} is approximated. In this case, the codebook vectors and the codebook coefficients may be binary or real valued. In the case where the weight values of the neural network are the binary set {−1, 1}, threshold-based processing is performed in accordance with Expression 9 below when a weight parameter is reconstructed.
c(i, j) = sign(Σ_{m∈Top(K)} x(i, j, m) a_m)    Expression 9
According to this embodiment, a memory size may be further reduced and a calculation load amount may be further reduced using a binary codebook. Note that, as described above, various modifications of a codebook vector, a codebook coefficient, and a weight parameter which is a target of reconstruction may be made. However, this embodiment is not limited to a specific embodiment and an appropriate configuration is employed based on a required compression rate and approximation accuracy, or the like.
The present invention is realized when software (programs) which realizes the functions in the foregoing embodiments is supplied to a system or an apparatus through a network and a computer (or a CPU) included in the system or the apparatus reads and executes the programs. Furthermore, the present invention may be applied to a system including a plurality of devices or to an apparatus including a single device. The present invention is not limited to the foregoing embodiments, and various modifications (including organic combinations of the embodiments) may be made based on the scope of the invention, and the modifications are also included in the scope of the present invention. Specifically, combinations of the foregoing embodiments and the modifications are also included in the present invention.
According to the present invention, sparse weights, such as the weight parameters in higher layers of the CNN, may be approximated with higher accuracy than with general methods.
Embodiment(s) of the present invention can also be realized by a computer of a system or apparatus that reads out and executes computer executable instructions (e.g., one or more programs) recorded on a storage medium (which may also be referred to more fully as a ‘non-transitory computer-readable storage medium’) to perform the functions of one or more of the above-described embodiment(s) and/or that includes one or more circuits (e.g., application specific integrated circuit (ASIC)) for performing the functions of one or more of the above-described embodiment(s), and by a method performed by the computer of the system or apparatus by, for example, reading out and executing the computer executable instructions from the storage medium to perform the functions of one or more of the above-described embodiment(s) and/or controlling the one or more circuits to perform the functions of one or more of the above-described embodiment(s). The computer may comprise one or more processors (e.g., central processing unit (CPU), micro processing unit (MPU)) and may include a network of separate computers or separate processors to read out and execute the computer executable instructions. The computer executable instructions may be provided to the computer, for example, from a network or the storage medium. The storage medium may include, for example, one or more of a hard disk, a random-access memory (RAM), a read only memory (ROM), a storage of distributed computing systems, an optical disk (such as a compact disc (CD), digital versatile disc (DVD), or Blu-ray Disc (BD)™), a flash memory device, a memory card, and the like.
While the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.
This application claims the benefit of Japanese Patent Application No. 2016-188412 filed Sep. 27, 2016 which is hereby incorporated by reference herein in its entirety.