This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2020-69144, filed on Apr. 7, 2020, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to an information processing apparatus and an information processing method.
In recent years, in order to improve the recognition performance of a deep neural network (DNN), the number of parameters used for deep learning and the number of pieces of learning data have been increasing. Here, the parameters include weights between nodes, data held by the nodes, filter elements, and the like. For this reason, the computation load and memory load of a parallel computer used for speeding up the deep learning have grown larger, and the learning time has increased. In re-learning during the service of the DNN, the increase in learning time imposes a heavy burden.
Thus, in order to lighten the DNN, the number of bits used to represent parameter data is reduced. For example, by using an 8-bit fixed-point number instead of a 32-bit floating-point number, both the amount of data and the computation time may be reduced.
However, using the 8-bit fixed-point number deteriorates the accuracy of operations. In view of this, a dynamic fixed-point number capable of dynamically modifying the fixed-point position of a variable used for learning is employed. When the dynamic fixed-point number is used, the parallel computer acquires statistical information on the variable during learning and automatically adjusts the fixed-point position of the variable. Furthermore, the parallel computer may reduce the overhead required for acquiring the statistical information by providing a statistical information acquisition circuit in each of the processing devices that perform operations in parallel.
Japanese Laid-open Patent Publication No. 2018-124681 is disclosed as related art.
According to an aspect of the embodiments, an information processing apparatus performs deep learning using a first number of processing devices that perform processes in parallel, the deep learning being performed using dynamic fixed-point numbers. The information processing apparatus includes a memory and a processor coupled to the memory and configured to: allocate, when allocating a propagation operation in a layer of the deep learning to the first number of processing devices, a second number of processing devices for every third number of pieces of input data, the third number being less than the first number, the second number of processing devices acquiring statistical information used for adjusting decimal point positions of the dynamic fixed-point numbers; and allocate output channels for every third number of pieces of input data while shifting the output channels by a fourth number.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
In the related art, if the statistical information acquisition circuits are provided in all the processing devices of the parallel computer, the circuit area of the parallel computer becomes large. Thus, in order to reduce the circuit area, it is conceivable to provide the statistical information acquisition circuit only in some of the processing devices. However, if the statistical information is acquired by only some of the processing devices, that is, thinned out, an error occurs as compared with a case where the statistical information is acquired by all the processing devices, and an appropriate decimal point position may not be set. For this reason, there is a problem in that the saturation and rounding of variable values increase during learning, and the learning accuracy deteriorates.
In one aspect, an object of the present embodiments is to suppress the deterioration of learning accuracy when a statistical information acquisition circuit is provided in some processing devices.
Embodiments of an information processing device and an information processing method disclosed by the present application will be described in detail below based on the drawings. Note that the embodiments do not limit the technology disclosed.
First, the information processing device (apparatus) according to an embodiment will be described.
The accelerator board 10 is a board equipped with a parallel computer that performs deep learning at high speed. The accelerator board 10 includes a controller 11, a plurality of processing elements (PEs) 12, a dynamic random access memory (DRAM) 13, and peripheral component interconnect express (PCIe) hardware 14. The number of PEs 12 is, for example, 2,048.
The controller 11 is a control device that controls the accelerator board 10. For example, the controller 11 instructs each PE 12 to execute an operation, based on an instruction from the host 20. The storage location of data input and output by each PE 12 is specified by the host 20.
The PE 12 executes an operation, based on the instruction from the controller 11. The PE 12 reads out and executes a program stored in the DRAM 13. Some of the PEs 12, denoted as PEs 12a, include a statistical information acquisition circuit and a statistical information storage circuit. The ratio of the PEs 12a to the number of all PEs 12 is, for example, 1/16. The number of PEs 12a is, for example, a divisor of the number of all PEs 12. Note that, in the following, the PEs 12a will be referred to as information acquisition PEs 12a.
The statistical information acquisition circuit acquires statistical information. Note that the statistical information will be described later. The statistical information storage circuit stores the statistical information acquired by the statistical information acquisition circuit. The statistical information stored in the statistical information storage circuit is read out by the controller 11 and sent to the host 20. Note that the statistical information may be stored in the DRAM 13 so as to be read out from the DRAM 13 and sent to the host 20.
Furthermore, the information acquisition PE 12a is not limited to the configuration including the dedicated statistical information acquisition circuit and statistical information storage circuit as long as the information acquisition PE 12a can acquire the statistical information and send the acquired statistical information to the host 20. For example, a program executed by the PE 12 described later may include an instruction sequence for acquiring the statistical information. The instruction sequence for acquiring the statistical information is such that, for example, the result of a multiply-add operation is stored in a register #1 as a 32-bit integer, information on the most significant digit position of the result stored in the register #1 is stored in a register #2, and 1 is added to the value in a table indexed by the value in the register #2.
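For illustration, the instruction sequence described above may be sketched in Python as follows. The register and table names are hypothetical stand-ins, since the actual PE instruction set is not specified here; this is a minimal model, not the embodiment itself.

```python
# Minimal sketch of the statistical-information acquisition sequence:
# the multiply-add result goes to register #1 as a 32-bit integer, the
# most significant digit position of the result goes to register #2,
# and 1 is added to the table entry indexed by register #2.

def msb_position(value: int) -> int:
    """Most significant digit position: the highest bit position that
    differs from the sign bit (0 when no bit differs)."""
    if value < 0:
        value = ~value  # for negatives, look for the leftmost zero bit
    return value.bit_length()

table = [0] * 33  # histogram indexed by the value of register #2

def multiply_add_with_stats(acc: int, a: int, b: int) -> int:
    reg1 = acc + a * b        # register #1 (a PE would truncate to 32
                              # bits; values here are assumed to fit)
    reg2 = msb_position(reg1) # register #2: most significant digit position
    table[reg2] += 1          # 1 is added to the indexed table entry
    return reg1
```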
The DRAM 13 is a volatile storage device that stores a program executed by the PE 12, data input by each PE 12, and data output by each PE 12. An address used by each PE 12 for data input and output is specified by the host 20. The PCIe hardware 14 is hardware that communicates with the host 20 by PCI Express (PCIe).
The host 20 is a device that controls the information processing device 1. The host 20 includes a central processing unit (CPU) 21, a DRAM 22, and PCIe hardware 23.
The CPU 21 is a central processing unit that reads out a program from the DRAM 22 and executes the read-out program. The CPU 21 instructs the accelerator board 10 to execute parallel operations and performs deep learning by executing a deep learning program. The deep learning program includes an allocation program that allocates operations in deep learning to each PE 12. The CPU 21 implements an allocation unit 40 by executing the allocation program. Note that the details of the allocation unit 40 will be described later.
The DRAM 22 is a volatile storage device that stores programs and data stored in the HDD 30, intermediate results of program execution by the CPU 21, and the like. The deep learning program is called from the HDD 30 to the DRAM 22 and executed by the CPU 21.
The PCIe hardware 23 is hardware that communicates with the accelerator board 10 by PCI Express.
The HDD 30 stores the deep learning program, input data used for deep learning, a model generated by deep learning, and the like. The information processing device 1 may include a solid state drive (SSD) instead of the HDD 30.
Next, deep learning according to the embodiment will be described.
The deep learning according to the embodiment is executed in processing units referred to as mini-batches. Here, a mini-batch is a combination of k pieces of data obtained by dividing a collection of input data to be learned {(Ini, Ti), i=1 to N} into plural sets (for example, M sets of k pieces of data, N=k*M). Furthermore, the mini-batch also refers to a processing unit of learning that is executed on every such input data set (k pieces of data). Here, Ini is input data (a vector) and Ti is correct data (a vector). The information processing device 1 acquires statistical information about some of the variables of each layer and updates the decimal point position of each variable of each layer for each mini-batch during the deep learning as follows. Here, a decimal point position e corresponds to an exponent part common to all the elements of a parameter X. When an element of the parameter X is denoted by x and its integer representation is denoted by n, the representation x = n*2^e holds. Note that the information processing device 1 may update the decimal point position every time the learning of the mini-batch is ended a predetermined number of times.
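As a toy illustration of the representation x = n*2^e (a sketch assuming an 8-bit integer part n; this is not the device's actual data path):

```python
import numpy as np

# Sketch of a dynamic fixed-point parameter: all elements share one
# exponent e (the decimal point position); each element is stored as
# an 8-bit integer n, with x = n * 2**e.

def quantize(x: np.ndarray, e: int) -> np.ndarray:
    """Round to integer mantissas and saturate values that do not fit."""
    n = np.round(x / 2.0 ** e)
    return np.clip(n, -128, 127).astype(np.int8)

def dequantize(n: np.ndarray, e: int) -> np.ndarray:
    return n.astype(np.float32) * 2.0 ** e

x = np.array([0.75, -1.5, 3.09], dtype=np.float32)
e = -5                       # decimal point position (shared exponent)
n = quantize(x, e)           # integer representation n
print(dequantize(n, e))      # [ 0.75  -1.5  3.09375]
```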
The information processing device 1, for example, determines the initial decimal point position of each variable by a trial run (for example, one mini-batch) with floating-point numbers or by user specification, and starts learning. Then, the information processing device 1 saves the statistical information about some variables in each layer during the learning of one mini-batch (k pieces of data) (t1). If overflow occurs while learning the mini-batch, the information processing device 1 performs a saturation process and continues learning. Then, the information processing device 1 updates the decimal point position of the fixed-point number in line with the statistical information after the learning of the mini-batch is ended (t2). Thereafter, the information processing device 1 repeats t1 and t2 until a predetermined learning end condition is satisfied.
The information processing device 1 may determine an appropriate fixed-point position by obtaining the distribution of the position of leftmost set bit for positive number and position of leftmost zero bit for negative number during learning execution. For example, the information processing device 1 can determine the fixed-point position such that the data to be saturated is equal to or less than a specified ratio. This means that, as an example, the information processing device 1 can determine the fixed-point position by giving priority to keeping the saturation of data within a predetermined degree over keeping the underflow of data within a predetermined degree.
Note that, as statistical information, instead of the distribution of the position of leftmost set bit for positive number and position of leftmost zero bit for negative number, the information processing device 1 may use the distribution of the non-sign least significant bit positions, the maximum value at the position of leftmost set bit for positive number and position of leftmost zero bit for negative number, or the minimum value at the non-sign least significant bit position.
Here, the distribution of the non-sign least significant bit positions means the distribution of the positions of the least significant bits whose values differ from the sign. For example, when the bits are placed in an array from the most significant bit being bit[39] to the least significant bit being bit[0], the non-sign least significant bit position is the position bit[k] with the smallest index k whose value differs from the sign bit bit[39]. The distribution of the non-sign least significant bit positions captures the least significant bit that includes valid data.
Furthermore, the maximum value at the position of leftmost set bit for positive number and position of leftmost zero bit for negative number is the maximum value among the values at the most significant bit positions that have values different from the value of the sign bit for one or more fixed-point numbers targeted for instruction execution from the time when the statistical information storage circuit was cleared by a clear instruction to the present time. The information processing device 1 can use the maximum value at the position of leftmost set bit for positive number and position of leftmost zero bit for negative number to determine an appropriate decimal point position of the dynamic fixed-point number.
The minimum value at the non-sign least significant bit position is the minimum value among the values at the least significant bit positions that have values different from the value of the signs for one or more fixed-point numbers from the time when the statistical information storage circuit was cleared by a clear instruction to the present time. The information processing device 1 can use the minimum value at the non-sign least significant bit position to determine an appropriate decimal point position of the dynamic fixed-point number.
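The statistics above can be illustrated with a short sketch, shown here for 40-bit two's-complement values as in the bit[39] example. The values and thresholds are illustrative, and the exponent selection at the end follows the saturation-ratio idea described earlier; it is a sketch, not the device's actual circuit.

```python
# Sketch of the statistical information for 40-bit two's-complement
# values (bit[39] is the sign bit), and of choosing a decimal point
# position so that at most a specified ratio of the data saturates.

def non_sign_msb(v: int) -> int:
    """Position of leftmost set bit for positive numbers / leftmost
    zero bit for negative numbers; -1 if every bit equals the sign."""
    return (~v if v < 0 else v).bit_length() - 1

def non_sign_lsb(v: int) -> int:
    """Position of the least significant bit that differs from the sign."""
    u = ~v if v < 0 else v
    return (u & -u).bit_length() - 1 if u else -1

values = [12345, -678, 3, 40000]
msb = [non_sign_msb(v) for v in values]          # e.g. [13, 9, 1, 15]
lsb = [non_sign_lsb(v) for v in values]          # e.g. [0, 0, 0, 6]

msb_distribution = {p: msb.count(p) for p in sorted(set(msb))}
lsb_distribution = {p: lsb.count(p) for p in sorted(set(lsb))}
msb_maximum, lsb_minimum = max(msb), min(lsb)    # the max/min statistics

def choose_exponent(positions, magnitude_bits=7, max_saturated=0.01):
    """Smallest decimal point position e such that the ratio of data
    saturating in an 8-bit mantissa does not exceed max_saturated."""
    for top in range(max(positions) + 1):
        if sum(p > top for p in positions) / len(positions) <= max_saturated:
            return top + 1 - magnitude_bits      # x = n * 2**e, |n| < 2**7
    return max(positions) + 1 - magnitude_bits
```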
Next, the allocation unit 40 will be described. The information processing device 1 executes the operations performed in deep learning in parallel as much as possible in order to effectively utilize the PEs 12. Here, the information processing device 1 collectively performs the operations of the mini-batches to proceed with the learning.
Taking the operation of the convolution layer as an example, it is assumed that the filter size is 3×3, the number of images in the mini-batch is N, the number of input channels is Cin, the number of output channels is Cout, the height of the image is H, and the width of the image is W. The number of pixels of data to be input is N*Cin*(H+2)*(W+2). Here, "*" indicates multiplication, and "2" is the number of paddings added at the two ends in the height direction or the width direction of the image. The number of pixels of the filter to be input is Cin*Cout*3*3. The number of results to be output is N*Cout*H*W. The operation content is indicated by the following expression (1).

output[n][co][h][w] = Σ_{ci=0..Cin−1} Σ_{p=0..2} Σ_{q=0..2} input[n][ci][h+p][w+q]*filter[ci][co][p][q] (1)
In expression (1), n = 0, 1, . . . , N−1, co = 0, 1, . . . , Cout−1, h = 0, 1, . . . , H−1, w = 0, 1, . . . , W−1, ci = 0, 1, . . . , Cin−1, p = 0, 1, 2, and q = 0, 1, 2 hold. Furthermore, output[n][co][h][w] indicates the value of a pixel of an n-th image in a co-th output channel at an h-th place in the height direction and a w-th place in the width direction, and input[n][ci][h+p][w+q] indicates the value of a pixel of the n-th image in a ci-th input channel at the (h+p)-th place in the height direction and the (w+q)-th place in the width direction. filter[ci][co][p][q] indicates the value of a pixel of a filter for the ci-th input channel and the co-th output channel at a p-th place in the height direction and a q-th place in the width direction.
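Expression (1) can be transcribed directly as nested loops over these indices; the following NumPy sketch (illustrative only) mirrors the index ranges above:

```python
import numpy as np

# Direct transcription of expression (1). The input is padded by one
# pixel at each end, so its spatial size is (H+2) x (W+2).

def conv_layer(inp, flt, N, Cin, Cout, H, W):
    out = np.zeros((N, Cout, H, W), dtype=inp.dtype)
    for n in range(N):                         # image
        for co in range(Cout):                 # output channel
            for h in range(H):                 # height position
                for w in range(W):             # width position
                    for ci in range(Cin):      # input channel
                        for p in range(3):     # filter height
                            for q in range(3): # filter width
                                out[n][co][h][w] += (
                                    inp[n][ci][h + p][w + q]
                                    * flt[ci][co][p][q]
                                )
    return out

inp = np.ones((2, 3, 6, 6), dtype=np.float32)  # N=2, Cin=3, H=W=4 (+2 padding)
flt = np.ones((3, 4, 3, 3), dtype=np.float32)  # Cin=3, Cout=4, 3x3 filter
print(conv_layer(inp, flt, 2, 3, 4, 4, 4).shape)  # (2, 4, 4, 4)
```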
As illustrated in expression (1), the operation of the convolution layer can be computed independently between each of the image (n), the output channel (co), and the pixel (h, w). In addition, since the input pixel data and filter data are used many times, it is efficient to achieve parallelization in an image direction and an output channel direction in this order, in order to enhance the efficiency of data transfer between the DRAM 13 and the PEs 12.
In this allocation, only the statistical information on a specific image such as an image #0 and a specific output channel such as an output channel #0 is acquired. The statistical information on an image #1, an image #(N−1), and the like, and the output channels such as an output channel #1 and an output channel #(Cout−1) is not acquired. For this reason, the statistical information will be different from that in a case where the statistical information is not thinned out.
In this manner, if the images and output channels are mechanically allocated to the PEs 12, the statistical information will be different from a case where the statistical information is not thinned out.
In view of this, the allocation unit 40 allocates the PEs 12 such that all images and all output channels are targeted for acquiring the statistical information.
For example, the allocation unit 40 rotates the output channels for each image when allocating the output channels to the PEs 12.
In this manner, since the allocation unit 40 rotates the output channels for each image to allocate the output channels to the PEs 12, even when the information acquisition PEs 12a are thinned out as a part of the whole PEs 12, a bias in the statistical information may be mitigated.
For example, the allocation unit 40 allocates the information acquisition PEs 12a to images #0, #4, #8, . . . , but does not allocate the information acquisition PEs 12a to images #1, #2, #3, #5, #6, #7, . . . . Then, when the remainder obtained by dividing the image number by 16 is 0, the allocation unit 40 allocates output channels #0, #4, #8, . . . to the information acquisition PEs 12a. Furthermore, when the remainder obtained by dividing the image number by 16 is 4, the allocation unit 40 allocates output channels #1, #5, #9, . . . to the information acquisition PEs 12a. Similarly, when the remainder obtained by dividing the image number by 16 is 12, the allocation unit 40 allocates output channels #3, #7, #11, . . . to the information acquisition PEs 12a.
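This example can be transcribed as follows (an illustrative sketch; the stride of 4 images and the channel stride of 4 follow from the 1/16 ratio of information acquisition PEs):

```python
# For the example above: every 4th image receives information
# acquisition PEs, and the sampled output channels are rotated
# according to the image number.

def sampled_channels(image_number: int, cout: int) -> list:
    if image_number % 4 != 0:
        return []                      # no information acquisition PE
    offset = (image_number % 16) // 4  # rotation offset: 0, 1, 2, 3
    return list(range(offset, cout, 4))

for img in [0, 4, 8, 12, 16]:
    print(img, sampled_channels(img, cout=16))
# 0  [0, 4, 8, 12]
# 4  [1, 5, 9, 13]
# 8  [2, 6, 10, 14]
# 12 [3, 7, 11, 15]
# 16 [0, 4, 8, 12]
```

Over 16 consecutive images, every output channel is therefore sampled exactly once, instead of the channels #0, #4, #8, and so on being sampled repeatedly.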
In this manner, since the allocation unit 40 rotates the output channels for each image to allocate the output channels to the PEs 12 in regard to the images to which the information acquisition PEs 12a are allocated, even when the information acquisition PEs 12a are thinned out as a part of the whole PEs 12, a bias in the statistical information may be mitigated.
Next, the flow of a learning process by the information processing device 1 will be described.
Then, the host 20 repeats the processes in steps S3 to S11 until an end condition for learning is satisfied. The end conditions for learning include, for example, reaching a prescribed number of times of learning or attaining a desired value. As repetitive processes performed with the accelerator board 10, the host 20 loads the learning data (step S3) and calls a layer's forward propagation operation (step S4) in the forward direction of the layers. The propagation operation is a convolution operation in the convolution layer, a pooling operation in the pooling layer, and a fully connected operation in the fully connected layer.
When called by the host 20, the accelerator board 10 executes the forward propagation operation (step S5). Then, the host 20 calls a layer's backpropagation operation (step S6) on the accelerator board 10 in a reverse direction of the layers. When called by the host 20, the accelerator board 10 executes the backpropagation operation (step S7).
Then, the host 20 instructs the accelerator board 10 to update the parameter (step S8). When instructed by the host 20, the accelerator board 10 executes the parameter update (step S9). Then, the host 20 determines the decimal point position of the dynamic fixed-point number based on the statistical information, and instructs the accelerator board 10 to update the decimal point position (step S10). When instructed by the host 20, the accelerator board 10 executes the decimal point position update (step S11).
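The host-side flow of steps S3 to S11 may be sketched as follows; the board interface below is a hypothetical stand-in for the actual PCIe command exchange, not the embodiment's API.

```python
# Sketch of the host-side control flow (steps S3 to S11). The Board
# methods are illustrative stubs.

class Board:
    def forward(self, layer, batch): pass       # S5: propagation operation
    def backward(self, layer): pass             # S7: backpropagation
    def update_parameters(self): pass           # S9: parameter update
    def read_statistics(self): return {}        # statistical information
    def update_decimal_points(self, pos): pass  # S11: decimal point update

def training_loop(board, layers, batches):
    for batch in batches:                  # repeat S3 to S11 until the end
        for layer in layers:               # forward direction of the layers
            board.forward(layer, batch)    # S4 -> S5
        for layer in reversed(layers):     # reverse direction of the layers
            board.backward(layer)          # S6 -> S7
        board.update_parameters()          # S8 -> S9
        stats = board.read_statistics()    # host reads the statistics,
        positions = {k: 0 for k in stats}  # determines the positions (S10),
        board.update_decimal_points(positions)  # and instructs the update

training_loop(Board(), layers=["conv1", "pool1", "fc1"], batches=range(2))
```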
In this manner, in the basic form, since the host 20 performs the PE allocation, the host 20 instructs the accelerator board 10 to execute the propagation operation together with the PE allocation information.
On the other hand, in the derivative form, the host 20 calls the propagation operation on the accelerator board 10 together with the input data address and the output data address (step S26).
In this manner, in the derivative form, since the controller 11 performs the PE allocation, the host 20 instructs the accelerator board 10 to execute the propagation operation without the PE allocation information.
Next, the flow of an allocation process will be described.
Note that NL denotes the number of images allocated at one time, and it is assumed that NL is a multiple of X and Cout is a multiple of Y. When N is assumed to be a multiple of NL and the number of PEs 12 is denoted by NP, the number of allocations made to all the PEs 12 at one time is NP, and the number of such allocation rounds is N/NL, so the total number of allocations is NP*(N/NL). Meanwhile, since the total number of allocations is N*Cout, NP*(N/NL)=N*Cout holds. Therefore, NP/NL=Cout holds, and NP/Cout=NL holds. For example, when NP=2,048 and Cout=128, NL=16 holds. CEIL(x) is a function that rounds up x to an integer.
The allocation unit 40 increments i by 1 from 0 to N/NL−1, and repeats the following allocation for every NL images.
The allocation unit 40 increments n by 1 from 0 to NL−1, and allocates the output channels of an image #n to the PEs 12. The allocation unit 40 computes the variables j and k, and sets k*Y+j*Cout in a variable p0 that represents the top PE number to which the image #n is allocated (step S32). The allocation unit 40 increments c by 1 from 0 to Cout−1, and repeats the process of allocating the output channel #c of the image #n to the PE 12 Cout times.
In each process of allocating one combination of an image and an output channel to a PE 12, the allocation unit 40 computes the variables l and m to set m+l*X*Y in a variable p1 that represents the relative value of the PE number to which the channel #c is allocated (step S33), and allocates an image #(n+i*NL) and the output channel #c to PE #(p0+p1) (step S34). The allocation unit 40 increments c by 1 from 0 to Cout−1, and repeats steps S33 and S34.
Furthermore, in
The allocation unit 40 increments i by 1 from 0 to N/NL−1, increments n by 1 from 0 to NL−1, and allocates the output channels of an image #n to the PEs 12 as follows.
The allocation unit 40 sets n*Cout in the variable p0 that represents the top PE number to which the image #n is allocated (step S42). The allocation unit 40 increments c by 1 from 0 to Cout−1, and repeats the process of allocating the output channel #c of the image #n to the PEs 12 Cout times.
In each process of allocating one combination of an image and an output channel to a PE 12, the allocation unit 40 sets (c−n+Cout) % Cout in a variable c′ for the channel #c, sets c′ in the variable p1 that represents the relative value of the PE number to which the channel #c is allocated (step S43), and allocates the image #(n+i*NL) and the output channel #c to PE #(p0+p1) (step S44). For example, the allocation unit 40 shifts the output channels using n in step S43. The allocation unit 40 increments c by 1 from 0 to Cout−1, and repeats steps S43 and S44.
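Steps S42 to S44 may be sketched as follows (illustrative Python; the mapping is exactly p0 = n*Cout and p1 = (c−n+Cout) % Cout as described above):

```python
# Sketch of the derivative-form allocation (steps S42 to S44): for the
# image in slot n, the output channels are rotated by n before being
# placed on that image's block of Cout consecutive PEs.

def allocate_round(i: int, NL: int, Cout: int) -> dict:
    """Return {PE number: (image number, output channel)} for round i."""
    allocation = {}
    for n in range(NL):                         # image slot in this round
        p0 = n * Cout                           # S42: top PE for image #n
        for c in range(Cout):                   # output channel #c
            p1 = (c - n + Cout) % Cout          # S43: shift channels using n
            allocation[p0 + p1] = (n + i * NL, c)  # S44: PE #(p0+p1)
    return allocation

# With information acquisition PEs at every 16th PE number (ratio 1/16),
# each of them now sees a different output channel:
alloc = allocate_round(i=0, NL=4, Cout=16)
for pe in range(0, 64, 16):
    print(pe, alloc[pe])   # (0, 0), (1, 1), (2, 2), (3, 3)
```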
In this manner, when allocating the combinations of the images and the output channels to the PEs 12, the allocation unit 40 shifts the output channels using n, which rotates the output channels for each image, such that a bias in the statistical information may be mitigated.
In this manner, when allocating the combinations of the images and the output channels to the PEs 12, the allocation unit 40 shifts the output channels using j, which rotates the output channels for each allocation of X images, such that a bias in the statistical information may be mitigated.
Next, the effect of allocation by the allocation unit 40 will be described.
As described above, in the embodiment, the accelerator board 10 includes the information acquisition PEs 12a as a part of the whole PEs 12. Furthermore, when allocating the layer's propagation operation of deep learning to the PEs 12, the allocation unit 40 of the host 20 evenly allocates the information acquisition PEs 12a for every certain number of images, and rotates the output channels for every certain number of images to allocate the output channels to the PEs 12. Therefore, the information processing device 1 may suppress a bias in the statistical information and may suppress the deterioration of the learning accuracy.
Furthermore, in the embodiment, the allocation unit 40 evenly allocates the information acquisition PEs 12a for each image, and rotates the output channels for each image to allocate the output channels to the PEs 12, such that a bias in the statistical information may be suppressed.
In addition, in the embodiment, when allocating the propagation operation in the convolution layer of deep learning to the PEs 12, the allocation unit 40 evenly allocates the information acquisition PEs 12a for every certain number of images, and rotates the output channels for every certain number of images to allocate the output channels to the PEs 12. Therefore, the information processing device 1 may suppress a bias in the statistical information acquired in the propagation operation in the convolution layer.
Besides, in the embodiment, the controller 11 of the accelerator board 10 may perform the allocation process instead of the allocation unit 40, such that the load on the host 20 may be lowered.
Additionally, in the embodiment, the case of learning images has been described, but the information processing device 1 may learn other data.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.