This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2020-207436, filed on Dec. 15, 2020, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to an arithmetic processing device, an arithmetic processing method, and a storage medium.
To improve the recognition performance of deep neural networks (hereinafter also referred to as DNN), the amount of learning data used for training a DNN tends to increase. As the learning data increases, the bandwidth required of the memory bus that connects the computer executing the learning and the memory storing the data used for learning also tends to increase. Therefore, a method of reducing the memory bus bandwidth by compressing the data used for learning has been proposed. In this method, a flag indicating “0” or “non-0” is provided for each byte of uncompressed data, and compression is performed by truncating predetermined bits of the “non-0” data so that the “non-0” data fits within a compressed data size.
Furthermore, a method of improving the accuracy of learning while reducing the data amount by updating a decimal point position on the basis of a distribution of bit positions of fixed-point number data obtained by an operation using the fixed-point number data for DNN learning or the like has been proposed. Moreover, a method of reducing the number of acquisition units and the number of signal wirings to reduce a circuit scale by acquiring an operation result from a set of operators in order in the case of calculating the distribution of bit positions of the fixed-point number data has been proposed.
Japanese National Publication of International Patent Application No. 2020-517014, Japanese Laid-open Patent Publication No. 2018-124681, and International Publication Pamphlet No. WO 2020/084723 are disclosed as related art.
According to an aspect of the embodiments, an arithmetic processing device includes one or more memories and one or more processors coupled to the one or more memories, the one or more processors being configured to execute an operation on fixed-point number data, acquire statistical information that indicates a distribution of positions of most significant bits of a plurality of fixed-point number data obtained by the operation, update, based on the statistical information, a range for restricting a bit width of the plurality of fixed-point number data to be used for the operation, estimate, based on the statistical information, respective data amounts after compression of the plurality of fixed-point number data by a plurality of compression methods, determine a compression method by which the data amount after compression of the plurality of fixed-point number data is minimum among the plurality of compression methods, transfer the plurality of fixed-point number data compressed by the compression method to the one or more memories, and execute deep neural network learning by using the plurality of fixed-point number data compressed by the compression method.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
In DNN learning, operations are executed on a large amount of data, and the characteristics (distribution, values, and the like) of the data used for learning, such as the data obtained by the operations, change as learning progresses. Even when the decimal point position is updated on the basis of the distribution of bit positions of the fixed-point number data obtained by the operations, the characteristics of the data used for learning continue to change. Therefore, in the case of compressing the data used for learning by a specific compression method, the compression efficiency may vary according to the characteristics of the data. When the compression efficiency decreases, the data transfer time to the memory increases and the learning time increases.
In one aspect, the present embodiment aims to reduce the learning time by improving the compression efficiency of data used for learning a deep neural network to be transferred to a memory.
By improving the compression efficiency of data used for learning a deep neural network to be transferred to a memory, the learning time can be reduced.
Hereinafter, embodiments will be described with reference to the drawings. Hereinafter, a signal line through which information such as a signal is transmitted is given the same reference sign as the name of the signal.
Furthermore, the signal line illustrated by a single line in the drawings may be a plurality of bits.
The instruction control unit 10 includes a program counter PC, an instruction decoder DEC, and the like. The instruction control unit 10 fetches an instruction from the instruction memory 216 on the basis of an address indicated by the program counter PC, and supplies the fetched instruction to the instruction decoder DEC. The instruction decoder DEC decodes the fetched instruction and issues a decode result to the register unit 20, the vector unit 30, and the scalar unit 40. The register unit 20, the vector unit 30, and the scalar unit 40 function as an arithmetic unit that executes the instruction decoded by the instruction decoder DEC. Note that the instruction control unit 10 may have an instruction buffer or an instruction cache for prefetching the instruction.
The register unit 20 includes a vector register file VRF including a plurality of vector registers used by the vector unit 30, and a plurality of vector accumulators VACC corresponding to a predetermined number of vector registers. Furthermore, the register unit 20 includes a scalar register file SRF including a plurality of scalar registers used by the scalar unit 40, and a scalar accumulator ACC. Hereinafter, various registers in the register unit 20 are also simply referred to as registers.
Moreover, the register unit 20 includes a statistical information storage unit 22. The statistical information storage unit 22 stores statistical information acquired by the statistical information aggregation unit 50. For example, the statistical information is frequency distribution data indicating a distribution of the positions of the most significant bits of pieces of operation result data (fixed-point number data) in the vector unit 30 or the scalar unit 40, and information indicating the positions of the most significant bits used to obtain the frequency distribution data.
The vector unit 30 includes, for example, an 8-element arithmetic unit. The vector unit 30 has a function to execute an integer operation, a product-sum operation using a vector accumulate register, and the like. Furthermore, the vector unit 30 executes clearing of the vector accumulate register, product-sum operation (multiply-accumulate (MAC)), cumulative addition, transfer of data to the vector register, and the like. Moreover, the vector unit 30 loads data from the data memory 218 and stores data in the data memory 218.
Each arithmetic unit of the vector unit 30 includes an integer operator (OP) 32, a data conversion unit 34, and a statistics acquisition unit 36. The data conversion unit 34 and the statistics acquisition unit 36 are provided for each integer operator 32. The integer operator 32 is an example of an arithmetic unit, and the data conversion unit 34 is an example of an update unit. Note that the function of the data conversion unit 34 may be included in the integer operator 32.
For example, the vector unit 30 inputs the data stored in the vector register and executes operations in parallel in the integer operators 32 of the 8-element arithmetic unit. Then, the vector unit 30 stores output data that is an operation result in the vector register. Furthermore, the vector unit 30 executes the product-sum operation in each of the 8-element integer operators 32, and stores each of the cumulative addition values of the product-sum operation results in the vector accumulator VACC.
For example, the integer operator 32 is an 8-bit operator. The integer operator 32 can execute not only the 8-bit data operation but also two 4-bit data parallel operations and four 2-bit data parallel operations. Since each data contains a sign bit, a bit number representing a data value used in an operation is one bit less than the bit number of the data. Note that the integer operator 32 may be a 16-bit operator. In this case, the integer operator 32 may cause the 16-bit operator to function as two 8-bit operators.
The scalar unit 40 includes an integer operator (OP) 42, a data conversion unit 44, and a statistics acquisition unit 46. The integer operator 42 is an example of an arithmetic unit, and the data conversion unit 44 is an example of an update unit. Note that the function of the data conversion unit 44 may be included in the integer operator 42. The scalar unit 40 has a function to execute the four basic arithmetic operations, a shift operation, a branch instruction, a load instruction, a store instruction, and the like. The scalar unit 40 executes operations using the scalar register and the scalar accumulator ACC.
For example, the integer operator 42 operates on input data stored in one of the scalar registers, and stores output data that is an operation result in the scalar register. In the case of executing a product-sum operation, the integer operator 42 stores a product-sum operation result in the scalar accumulator ACC. The operation result of the scalar unit 40 is stored in the scalar register, the scalar accumulator ACC, or the data memory 218. For example, the integer operator 42 may be an 8-bit arithmetic unit or a 16-bit arithmetic unit, like the integer operator 32. Note that the bit number of the integer operators 32 and 42 is not limited to 8 bits or 16 bits.
Each data conversion unit 34 receives fixed-point number data (operation result data) output from the integer operator 32 on the basis of an operation instruction. Each data conversion unit 34 extracts data having a predetermined bit number (bit width) from the received fixed-point number data on the basis of bit width information. At this time, each data conversion unit 34 executes saturation processing for upper-side bits that overflow and rounding processing for lower-side bits that underflow.
For example, each data conversion unit 34 converts the 24-bit fixed-point number data, which is the bit width of the output data of the integer operator 32, into 8-bit fixed-point number data, which is the bit width of the input data of the integer operator 32. Then, each data conversion unit 34 stores the fixed-point number data with a changed bit position in the register unit 20.
The function of the data conversion unit 44 is similar to the function of the data conversion unit 34. That is, the data conversion unit 44 changes the bit position (bit range) by selecting, on the basis of the bit width information, data of a predetermined bit number (bit width) from the fixed-point number data (operation result data) output from the integer operator 42 on the basis of the operation instruction. At this time, the data conversion unit 44 executes the saturation processing and the rounding processing.
For example, the data conversion unit 44 converts the 24-bit fixed-point number data, which is the bit width of the output data of the integer operator 42, into 8-bit fixed-point number data, which is the bit width of the input data of the integer operator 42. Then, the data conversion unit 44 stores the fixed-point number data with a changed bit position in the register unit 20.
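As a minimal illustration of this conversion, the following Python sketch extracts an 8-bit value from a 24-bit operation result with saturation and rounding; the function name, the round-half-up rule, and the plain-integer representation are assumptions made for illustration and do not necessarily match the hardware behavior.

```python
def convert_24_to_8(value: int, shift: int) -> int:
    """Sketch: extract an 8-bit signed fixed-point value from a 24-bit
    signed operation result.

    value : 24-bit signed integer operation result.
    shift : number of lower-side bits discarded, determined by the bit
            width information (decimal point position).
    """
    # Rounding processing for the lower-side bits that underflow
    # (round half up is assumed here).
    if shift > 0:
        value = (value + (1 << (shift - 1))) >> shift

    # Saturation processing for the upper-side bits that overflow.
    upper, lower = (1 << 7) - 1, -(1 << 7)   # +127 / -128 for 8-bit data
    return max(lower, min(upper, value))
```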
Each statistics acquisition unit 36 receives the fixed-point number data (operation result data) output from the integer operator 32 on the basis of the operation instruction. Each statistics acquisition unit 36 acquires, for example, the position of the most significant bit of the received fixed-point number data, and outputs position information indicating the acquired position of the most significant bit to the statistical information aggregation unit 50.
The statistics acquisition unit 46 receives the fixed-point number data (operation result data) output from the integer operator 42 on the basis of the operation instruction. The statistics acquisition unit 46 acquires the position of the most significant bit of the received fixed-point number data, and outputs position information indicating the acquired position of the most significant bit to the statistical information aggregation unit 50.
Note that each of the statistics acquisition units 36 and 46 may acquire the position information indicating the position of the most significant bit of the operation result data and output the acquired position information to the statistical information aggregation unit 50 only in the case where the decode result of the instruction by the instruction decoder DEC includes an instruction of acquisition of statistical information. Furthermore, one data conversion unit 34 and one statistics acquisition unit 36 may be provided in common to the plurality of integer operators 32. In this case, the statistics acquisition unit 36 acquires the position of the most significant bit of the fixed-point number data output from each of the plurality of integer operators 32.
Here, the position of the most significant bit acquired by each of the statistics acquisition units 36 and 46 is an upper-side bit position where “1” first appears in the case where the sign bit is “0” (data is a positive value). Furthermore, the position of the most significant bit is an upper-side bit position where “0” first appears in the case where the sign bit is “1” (data is a negative value).
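As a minimal illustration of this rule, the following Python sketch computes the position of the most significant bit of a two's-complement value; the function name and the default 24-bit width (taken from the conversion example above) are assumptions for illustration. A histogram of these positions over the operation results corresponds to the frequency distribution data stored in the statistical information storage unit 22.

```python
def msb_position(value: int, bit_width: int = 24) -> int:
    """Position of the most significant bit as defined above: the
    highest bit position holding '1' for a non-negative value, or the
    highest bit position holding '0' for a negative (two's-complement)
    value.  Returns -1 if no such bit exists below the sign bit."""
    sign = 0 if value >= 0 else 1
    for pos in range(bit_width - 2, -1, -1):   # scan below the sign bit
        if ((value >> pos) & 1) != sign:
            return pos
    return -1
```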
The statistical information aggregation unit 50 aggregates the position information indicating the position of the most significant bit received from the statistics acquisition units 36 and 46 to generate statistical information, and stores the generated statistical information in the statistical information storage unit 22. Then, as described above, the statistical information storage unit 22 stores the frequency distribution data indicating the distribution of the positions of the most significant bits of each of the operation result data (fixed-point number data). Examples of the statistical information aggregated by the statistical information aggregation unit 50 and stored in the statistical information storage unit 22 are described with reference to
In this embodiment, the compression/decompression unit 72 of the memory interface 70 compresses the fixed-point number data output from the register unit 20 on the basis of a compression method notified from a higher-level computer or the like that controls the arithmetic processing device 100, for example. Then, the compression/decompression unit 72 stores the compressed fixed-point number data in the data memory 218 (external memory). Here, the fixed-point number data output from the register unit 20 is the fixed-point number data with a changed bit position, which is output from the data conversion units 34 and 44 and stored in the register unit 20, and is, for example, middle layer data of training the deep neural network.
Furthermore, the compression/decompression unit 72 decompresses the compressed fixed-point number data read from the data memory 218, and stores the decompressed fixed-point number data in the register unit 20 or the like for use in DNN learning. By compressing the middle layer data of DNN training and storing the compressed data in the data memory 218, and decompressing the data read from the data memory 218 to the original data, the data transfer amount can be reduced as compared with a case where the data transferred to the data memory 218 is not compressed.
As a result, the time needed for data transfer between the register unit 20 and the data memory 218 can be shortened. Therefore, even in a case where a memory access speed is lower than an operation speed and a wait time occurs in the operator in DNN learning, the wait time can be reduced, the operation efficiency is improved, and the learning time can be shortened. An example in which the memory access speed is significantly lower than the operation speed in DNN learning is a layer that executes an operation for each element of a data array, or the like.
Moreover, by providing the compression/decompression unit 72 in the memory interface 70 located near the data memory 218, both the operation result data output from the vector unit 30 and the operation result data output from the scalar unit 40 can be compressed. Furthermore, the compression/decompression unit 72 can be mounted in the arithmetic processing device 100 without separating the unit into a compression unit and a decompression unit. As a result, a wiring region such as data lines related to data compression and decompression can be minimized, and a circuit scale of the arithmetic processing device 100 can be minimized.
The higher-level computer (computer) or the like, or the arithmetic processing device 100, may include a compression method determination unit that estimates data amounts after compression of the operation result data by a plurality of compression methods and determines the compression method with the minimum data amount on the basis of the statistical information stored in the statistical information storage unit 22. In this case, the compression/decompression unit 72 compresses the operation result data and decompresses the compressed operation result data by the compression method instructed by the compression method determination unit.
The server 200 includes an accelerator board 210 on which the arithmetic processing device 100 and a main memory 214 are mounted, a host 220, and a storage 230. The arithmetic processing device 100 and the host 220 are connected to each other by a communication bus such as a peripheral component interconnect express (PCIe) bus. Therefore, the arithmetic processing device 100 includes a PCIe interface (I/F) circuit 212, and the host 220 includes a PCIe interface (I/F) circuit 222.
The arithmetic processing device 100 includes a plurality of processing units PE (processing elements) arranged in a matrix. For example, each processing unit PE is an arithmetic unit including the integer operator 32, the data conversion unit 34, and the statistics acquisition unit 36 in
Although illustration is omitted, the arithmetic processing device 100 illustrated in
The host 220 includes a host CPU 224 and a memory 226 such as DRAM. The host CPU 224 is connected to the arithmetic processing device 100 via the PCIe interface circuit 222, and controls the arithmetic processing device 100 to cause the arithmetic processing device 100 to execute DNN learning.
For example, the host CPU 224 causes the arithmetic processing device 100 to execute DNN learning by executing an arithmetic processing program expanded in the memory 226. Furthermore, the host CPU 224 estimates the compression method that minimizes the data amount of the operation result data by executing the arithmetic processing program.
The host CPU 224 is connected to the hierarchically provided memory 226 and storage 230. For example, the storage 230 includes at least either a hard disk drive (HDD) or a solid state drive (SSD). Then, the host CPU 224 executes learning using learning data 232 stored in the storage 230 in DNN learning.
For example, DNN learning is executed for each mini-batch, which is a unit of processing. The mini-batch is an example of a batch. In
First, the host CPU 224 in
Then, the host CPU 224 executes forward processing from the Conv_1 layer to the fc2 layer using the divided input data in each mini-batch. Furthermore, the host CPU 224 executes backward processing from the fc2 layer to the Conv_1 layer using a forward processing result and correct answer data in each mini-batch. The host CPU 224 then updates the variable such as a weight using, for example, a gradient descent method.
In each mini-batch, the statistical information aggregation unit 50 of
Note that, in the first k-times of mini-batches, the compression/decompression unit 72 of
After the end of the k-times of mini-batches, the host CPU 224 determines the decimal point position of the fixed-point number data used in the next k-times of mini-batches, using the statistical information (frequency distribution data indicating the distribution of the positions of the most significant bits) stored in the statistical information storage unit 22. Furthermore, after the end of the k-times of mini-batches, the host CPU 224 determines the compression method to be used in the next k-times of mini-batches, using the statistical information stored in the statistical information storage unit 22. The host CPU 224 notifies the data conversion units 34 and 44 of the determined decimal point position, and notifies the compression/decompression unit 72 of the determined compression method. The data conversion units 34 and 44 update the decimal point position with the notified decimal point position. The compression/decompression unit 72 updates the compression method with the notified compression method.
Then, on or after the third round, the learning processing by the next k-times of mini-batches is repeatedly executed using the decimal point position and the compression method updated in the previous k-times of mini-batches. The learning processing is repeated until a difference from the correct answer data becomes equal to or less than a preset value.
By determining the compression method of the data to be stored in the data memory 218 using the statistical information stored in the statistical information storage unit 22 for each k-times of mini-batches, the transfer time of data read from or written to the data memory 218 at the time of DNN learning can be reduced. In other words, the compression method that minimizes the transfer time of data read from or written to the data memory 218 in the next k-times of mini-batches can be predicted using the statistical information stored in the statistical information storage unit 22.
The left side of
On the left side of
Then, the host CPU 224 notifies the data conversion units 34 and 44 of the bit precision (Q3.12). For example, the host CPU 224 determines the decimal point position of the fixed-point number data such that the 16 bits indicated by the bit precision are located at the center of the distribution. Here, the bit precisions (Q5.10) and (Q3.12) indicate notation of the fixed-point number data in a Q format.
The host CPU 224 may update the decimal point position such that (the number of overflowing data)/(the total number of data) becomes smaller than a predetermined value. Alternatively, the host CPU 224 may update the decimal point position according to (the number of underflowing data)/(the total number of data), or may update the decimal point position on the basis of the number of overflowing data and the number of underflowing data or a ratio thereof.
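As one possible realization of the overflow-ratio rule above, the following Python sketch selects the upper end of the bit window from the frequency distribution of most significant bit positions; the function name, the 16-bit window width, the threshold value, and the return convention are assumptions made for illustration.

```python
def update_decimal_point(histogram: dict, width: int = 16,
                         max_overflow_ratio: float = 0.01) -> int:
    """Return the upper bit position of a 'width'-bit window such that
    (number of overflowing data) / (total number of data) is smaller
    than max_overflow_ratio.  The lowest acceptable window is chosen so
    that as many lower-side bits as possible are preserved."""
    total = sum(histogram.values())
    for upper in sorted(histogram):            # try the lowest windows first
        overflow = sum(c for pos, c in histogram.items() if pos > upper)
        if overflow / total < max_overflow_ratio:
            return upper                       # window covers upper-width+1 .. upper
    return max(histogram)
```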
Meanwhile, on the right side of
For example, the data conversion unit 34 is notified in advance by the host CPU 224 of the bit position (Q-2.9) of the 8-bit fixed-point numbers. The lower left of
Note that the frequency distribution in
The right side of
The data string includes only 8-bit “non-0” operation result data with a sign bit. Hereinafter, the compression method illustrated in
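A minimal Python sketch of this 0-skip compression format follows; representing the flag string and the data string as Python lists (rather than packed bit streams) is a simplification made for illustration.

```python
def zero_skip_compress(values):
    """0-skip compression: one 1-bit flag per value ('0' or 'non-0');
    only the non-zero 8-bit values are kept in the data string."""
    flags = [0 if v == 0 else 1 for v in values]
    data = [v for v in values if v != 0]
    return flags, data            # size in bits: len(flags)*1 + len(data)*8

def zero_skip_decompress(flags, data):
    it = iter(data)
    return [next(it) if flag else 0 for flag in flags]
```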
The host CPU 224 in
The host CPU 224 estimates a compressed data amount for each region, which is the data amount after compression, for each of an upper-side range on the upper bit side of the bit range (expressible region), the bit range, and a lower-side range on the lower bit side of the bit range. Then, the host CPU 224 sets a sum total of three compressed data amounts for each region as a compression data amount.
The host CPU 224 calculates the compressed data amount for each region in the upper-side range by adding up the products of the frequency at each bit position in the upper-side range and the data size (8 bits). Similarly, the host CPU 224 calculates the compressed data amount for each region in the bit range by multiplying the sum total of the frequencies at the bit positions in the bit range by the data size (8 bits).
The host CPU 224 calculates the compressed data amount for each region in the lower-side range according to the sum total of the products of the probability that data at each bit position (digit position) becomes “1” or “−1” by the rounding processing and the frequency at that bit position. Here, having “1” or “−1” by the rounding processing means that the data is rounded up into the bit range by the rounding processing.
For example, data whose most significant bit is at the bit position 2^−10 becomes “1” or “−1” by the rounding processing with a probability of 50% to 100% (exclusive of 100%), so the host CPU 224 estimates that the data becomes “1” or “−1” with a probability of 100%. Since the data is rounded up with the probability of 100%, the host CPU 224 calculates the compression data amount at the bit position 2^−10 as the product of the frequency and the data size (8 bits).
Data whose most significant bit is at the bit position 2^−11 becomes “1” or “−1” with a probability of 25% to 50% (exclusive of 50%), so the host CPU 224 estimates that the data becomes “1” or “−1” with a probability of 50%. Therefore, the host CPU 224 calculates the compression data amount at the bit position 2^−11 as the product of the frequency, the data size (8 bits), and “0.5”.
Data whose most significant bit is at the bit position 2^−12 becomes “1” or “−1” with a probability of 12.5% to 25% (exclusive of 25%), so the host CPU 224 estimates that the data becomes “1” or “−1” with a probability of 25%. Therefore, the host CPU 224 calculates the compression data amount at the bit position 2^−12 as the product of the frequency, the data size (8 bits), and “0.25”.
The host CPU 224 calculates the data amount at the bit positions of 2^−13 and below, and sets the sum total of the data amounts at all the bit positions of the operation result data in the lower-side range as the compressed data amount for each region in the lower-side range. Then, the host CPU 224 calculates the compression data amount in the 0-skip compression method by adding the size of the flag string (the product of the total number of data and 1 bit) to the sum total of the compressed data amounts for each region in the upper-side range, the bit range, and the lower-side range. In the 0-skip compression method, the higher the ratio of “0” in the operation result data, the higher the compression efficiency.
The data string includes 8-bit operation result data with a sign bit that is none of “0”, “1”, and “−1”. Hereinafter, the compression method illustrated in
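A comparable Python sketch of this 01 compression format follows; the 2-bit flag codes distinguishing “0”, “1”, “−1”, and other values are an assumption made for illustration, since only the contents of the data string are stated above.

```python
def zero_one_compress(values):
    """01 compression sketch: 0, 1 and -1 are represented by the flag
    alone (assumed 2-bit codes); only other 8-bit values are kept in
    the data string."""
    codes = {0: 0, 1: 1, -1: 2}                # assumed flag codes; 3 = other value
    flags = [codes.get(v, 3) for v in values]
    data = [v for v in values if v not in codes]
    return flags, data

def zero_one_decompress(flags, data):
    inverse = {0: 0, 1: 1, 2: -1}
    it = iter(data)
    return [inverse[f] if f in inverse else next(it) for f in flags]
```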
The host CPU 224 estimates the data amount of the compression data in the 01 compression method on the basis of the statistical information (the information illustrated by the frequency distribution in
The host CPU 224 estimates the compressed data amount for each region, which is the data amount after compression, for each of the upper-side range, the bit range, and the lower-side range. Then, the host CPU 224 sets the sum total of three compressed data amounts for each region as a compression data amount.
The host CPU 224 calculates the compressed data amount for each region in the upper-side range by adding the product of the frequency at each bit position and the data size (8 bits), similarly to
In
The number of flags is the same as the number of data in the original data string before compression. Each flag is 3 bits because each flag indicates the bit width of data from “0” to “127 (absolute value)” that can be expressed in the bit range (7 bits).
The data string includes a pair of data of the bit width indicated by the flag and the sign bit of each data. For example, the data “93” is represented by a sign bit S of “0” and the 7-bit “93”. The data “0” is represented only by the sign bit S of “0”. The data “−1” is represented by the sign bit S of “1” and the 1-bit “1”. The data “1” is represented by the sign bit S of “0” and the 1-bit “1”. The data “42” is represented by the sign bit S of “0” and the 6-bit “42”. Hereinafter, the compression method illustrated in
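The following Python sketch illustrates this variable length compression format; representing the flag string and the data bit string as Python lists is a simplification made for illustration. With this representation, the data “93”, “0”, “−1”, “1”, and “42” in the example above occupy 8, 1, 2, 2, and 7 bits in the data string, respectively, plus a 3-bit flag each.

```python
def variable_length_compress(values):
    """Variable length compression sketch: each 3-bit flag holds the bit
    width of the absolute value; the data string holds the sign bit S
    followed by that many magnitude bits ('0' is the sign bit only)."""
    flags, bits = [], []
    for v in values:
        width = abs(v).bit_length()            # 0..7 for 8-bit data with sign
        flags.append(width)                    # 3-bit flag
        bits.append(0 if v >= 0 else 1)        # sign bit S
        for pos in range(width - 1, -1, -1):   # magnitude bits, MSB first
            bits.append((abs(v) >> pos) & 1)
    return flags, bits                         # size: 3*len(flags) + len(bits) bits
```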
The host CPU 224 estimates the data amount of the compression data in the variable length compression method on the basis of the statistical information (the information illustrated by the frequency distribution in
The host CPU 224 estimates the data amount after compression for each of the upper-side range, the bit range, and the lower-side range. Then, the host CPU 224 sets the sum total of three compressed data amounts for each region as a compression data amount.
The host CPU 224 calculates the compressed data amount for each region in the upper-side range by adding the product of the frequency at each bit position and the data size (8 bits), similarly to
In the frequency distribution illustrated in
The host CPU 224 calculates the compressed data amount for each region in the lower-side range on the basis of the probability of having “−1” or “1” by the rounding processing (round up) and the probability of having “0” by the rounding processing (round down) at each bit position. For example, the host CPU 224 calculates the sum of twice the probability of having “1” or “−1” and one time the probability of having “0” for each bit position. Then, the host CPU 224 calculates the compressed data amount for each region in the lower-side range by adding the product of the calculated sum for each bit position and the frequency at each bit position.
Here, in the data string after compression, “1” or “−1” is expressed by 2 bits including the sign bit S, so the probability of having “1” or “−1” is weighted by 2 bits. In the data string after compression, “0” is expressed by 1 bit of the sign bit S only, so the probability of having “0” is weighted by 1 bit. In the variable length compression method, the higher the ratio of values that are none of “0”, “1”, and “−1” in the operation result data, the higher the compression efficiency.
Note that the compression methods illustrated in
Note that the processing flow illustrated in
First, in step S100, the host CPU 224 determines the initial decimal point position, which is the initial value of the decimal point position. The host CPU 224 may determine the initial decimal point position of each variable by past experimental values, actual values, or user specification.
Next, in step S200, the host CPU 224 initializes the number of repetitions k of the mini-batch to “0”. Furthermore, the host CPU 224 initializes the variables that store the statistical information in the arithmetic processing program.
Next, in step S300, the host CPU 224 determines whether a condition for terminating learning is satisfied. The host CPU 224 terminates learning when, for example, an error in the fully connected layer (fc2) illustrated in
In step S400, the host CPU 224 causes the arithmetic processing device 100 to execute mini-batch learning, and accumulates the statistical information of each variable of each layer in the statistical information storage unit 22. Then, the host CPU 224 increases the number of repetitions k by “1” on the basis of completion of mini-batch learning, and executes step S500.
In step S500, the host CPU 224 determines whether the number of repetitions k has reached an update interval for the decimal point position of the fixed-point number data and the compression method. In the case where the number of repetitions k has not reached the update interval, the host CPU 224 returns to the processing of step S300, and in the case where the number of repetitions k has reached the update interval, the host CPU 224 executes step S600.
In step S600, the host CPU 224 reads the statistical information accumulated in the statistical information storage unit 22 by executing the mini-batch. Then, the host CPU 224 updates the decimal point position of each variable of each layer as described with reference to
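In outline, the flow of steps S100 to S600 can be written as the following Python sketch; all helper callables are hypothetical stand-ins for the processing described above, and resetting the counter and the accumulated statistics after each update is assumed.

```python
def train(update_interval_k, run_mini_batch, should_terminate,
          update_decimal_points, determine_compression_method,
          initial_decimal_points):
    """Sketch of the learning flow of steps S100 to S600."""
    decimal_points = initial_decimal_points          # step S100
    compression_method = None                        # non-compression at first
    k, statistics = 0, []                            # step S200
    while not should_terminate():                    # step S300
        statistics += run_mini_batch(decimal_points, compression_method)  # step S400
        k += 1
        if k >= update_interval_k:                   # step S500
            decimal_points = update_decimal_points(statistics)            # step S600
            compression_method = determine_compression_method(statistics)
            k, statistics = 0, []
```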
In step S410, the memory interface 70 reads the data (compressed operation result data) from the data memory 218. The compression/decompression unit 72 decompresses the data read from the data memory 218 and transfers the data to the register unit 20. For example, the memory interface 70 reads method information indicating the compression method from the data memory 218 together with the compression data. Then, the compression/decompression unit 72 decompresses the data corresponding to the compression method indicated by the method information read from the data memory 218. Next, in step S420, the integer operator 32 (or 42) executes the product-sum operation using the data stored in the register unit 20.
Next, in step S430, the data conversion unit 34 (or 44) changes the bit precision, which is the effective bit range of the data obtained by the product-sum operation, and executes the saturation processing and the rounding processing for the bit values outside the expressible effective range. For example, the data conversion unit 34 (or 44) changes the bit precision by using the operation result data stored in the register or the like.
Furthermore, the statistical information aggregation unit 50 acquires the statistical information (the position information indicating the position of the most significant bit) of the data obtained in the product-sum operation and acquired by the statistics acquisition unit 36 (or 46). The data to be processed by the data conversion unit 34 (or 44) and the statistics acquisition unit 36 (or 46) is result data of the product-sum operation for each of the output channels for all the input channels.
Next, in step S440, the memory interface 70 compresses the data with the effective bit range changed by the data conversion unit 34 (or 44), using the compression/decompression unit 72. For example, the compression/decompression unit 72 compresses the data using the compression method instructed by the host CPU 224. The host CPU 224 notifies the compression/decompression unit 72 of the compression method determined by the previous k-times of mini-batches in advance. For example, the memory interface 70 writes the compression data to the data memory 218 together with the method information indicating the compression method. Then, the product-sum operation processing for each input channel in steps S410 and S420, the data conversion processing for each output channel in steps S430 and S440, the statistical information acquisition processing, and the data compression processing are repeatedly executed.
First, in step S610, the host CPU 224 sets the data amount of the operation result data (uncompressed) in the k-times of mini-batch learning to the initial data amount, and selects non-compression as a candidate for the initial compression method. Then, the host CPU 224 sequentially predicts the compression data amount by all the compression methods by executing the processing of steps S620, S630, and S640, and determines the compression method with the smallest compression data amount as the compression method to be used for the next k-times of mini-batch learning. For example, a compression method a is one of the 0-skip compression method, the 01 compression method, or the variable length compression method illustrated in
In step S620, the host CPU 224 predicts the compression data amount in the case of using one of the compression methods. Next, in step S630, the host CPU 224 determines whether the compression data amount predicted in step S620 is smaller than the data amount. Here, the data amount to be compared is the data amount of non-compression set in step S610 in the first processing loop, and is the data amount determined in step S640 in the second or subsequent processing loop. In the case where the compression data amount is smaller than the data amount, the host CPU 224 executes the processing of step S640, and in the case where the compression data amount is equal to or larger than the data amount, the host CPU 224 returns to step S620 and predicts the compression data amount in the next compression method.
In step S640, the host CPU 224 sets the compression data amount predicted in step S620 to the data amount to be compared in the subsequent processing loop. Furthermore, the host CPU 224 sets the compression method predicted in step S620 as a candidate for the compression method. Then, the host CPU 224 determines the compression method remaining as a candidate for the compression method as the method to be used in the next k-times of mini-batches after completion of prediction of the compression data amount by all the compression methods, and terminates the processing in
By repeatedly executing steps S620 to S640 for each compression method, the compression method with the minimum compression data amount in all the compression methods is selected. Note that, in the case where the compression data amount in each compression method is equal to or larger than the data amount of non-compression, step S640 is never executed and non-compression is selected as a candidate for the compression method. In the case of compressing the operation result data by a fixed compression method, the operation result data is always compressed regardless of high or low compression efficiency. In this embodiment, since one of the plurality of compression methods or non-compression can be selected according to the compression data amount, a method having a small data amount, including non-compression, can be selected. That is, it is possible to avoid selection of a compression method having lower compression efficiency than non-compression.
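A Python sketch of this selection loop (steps S610 to S640) follows; the dictionary of predicted sizes stands in for the per-method prediction of step S620, whose calculation is described next, and the method names are illustrative.

```python
def select_compression_method(uncompressed_bits: int, predicted_bits: dict) -> str:
    """Steps S610 to S640: start from non-compression as the candidate
    and keep the method whose predicted compressed data amount is
    smallest.

    predicted_bits maps a method name (e.g. "0-skip", "01",
    "variable-length") to its predicted compressed size in bits."""
    best_method, best_bits = "none", uncompressed_bits       # step S610
    for method, bits in predicted_bits.items():              # step S620
        if bits < best_bits:                                  # step S630
            best_method, best_bits = method, bits             # step S640
    return best_method

# If every method is predicted to be larger than the uncompressed data,
# non-compression ("none") remains selected.
```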
First, in step S622, the host CPU 224 calculates the product of the number of data of the operation result data in the k times of mini-batch learning and the number of flag bits, and sets the product as the initial data amount to which the subsequent data amount is to be added. The number of flag bits is 1 bit in the 0-skip compression method of
In step S624, the host CPU 224 estimates the compressed data amount for each region in the upper-side range by calculating and accumulating the data amount for each digit from the most significant digit to the x+7 digit (=−2). The processing in step S624 is common to the 0-skip compression method, the 01 compression method, and the variable length compression method. The host CPU 224 sequentially adds, as the data amount, the product of the statistical information (d), which is the frequency (number of data) at the digit d, and 8 bits, while updating the digit d.
In step S626, the host CPU 224 calculates the data amount for each digit from x+6 digit (=−3) to x digit (=−9) and adds the data amount to the data amount accumulated in step S624. In step S626, the sum of the compressed data amount for each region in the upper-side range and the compressed data amount for each region in the bit range is estimated.
The host CPU 224 sequentially adds the product of the statistical information (d), which is the frequency of the digit d, and f(d, x) in each of the 0-skip compression method, the 01 compression method, and the variable length compression method while updating the digit d. f(d, x) is common to the 0-skip compression method and the 01 compression method, and is different in the variable length compression method from the 0-skip compression method and the 01 compression method.
In the 0-skip compression method and the 01 compression method, f(d, x) is set to “8”. In the variable length compression method, f(d, x) is set to “d−x+1+1”. In “d−x+1+1”, “d−x+1” indicates the number of digits of data, and the last “1” indicates the sign bit. Therefore, in the variable length compression method, for example, f(d, x) is set to “8” at the digit d=−3, “6” at the digit d=−5, and “2” at the digit d=−9.
In step S628, the host CPU 224 calculates the data amount for each digit from x−1 digit (=−10) to the least significant digit and adds the data amount to the data amount accumulated in step S626. In step S628, the host CPU 224 sequentially adds the product of the statistical information (d), which is the frequency of the digit d, and g(d, x) in each of the 0-skip compression method, the 01 compression method, and the variable length compression method while updating the digit d. Note that g(d, x) is different in the 0-skip compression method, the 01 compression method, and the variable length compression method.
In the 0-skip compression method, g(d, x) is set to “8*2^(d−x+1)”. The sign “^” represents a power. “8” in g(d, x) indicates that each data is 8 bits, and “2^(d−x+1)” indicates the probability of data having “1” or “−1”. The probability is set to “2^0=1 (=100%)” at the digit d=−10, “2^−1=0.5 (=50%)” at the digit d=−11, and “2^−2=0.25 (=25%)” at the digit d=−12, as illustrated in
In the 01 compression method, as illustrated in
In the variable length compression method, g(d, x) is set to “2*2^(d−x+1)+1*(1−2^(d−x+1))”. The leading “2” indicates 2 bits, and “2^(d−x+1)” indicates the probability of the data having “1” or “−1”. The leading “1” of “1*(1−2^(d−x+1))” indicates 1 bit, and “(1−2^(d−x+1))” indicates the probability of having “0”.
Note that the order of processing steps S624, S626, and S628 is arbitrary. Furthermore, only the compressed data amount for each region in the upper-side range may be accumulated in step S624, only the compressed data amount for each region in the bit range may be accumulated in step S626, and only the compressed data amount for each region in the lower-side range may be accumulated in step S628. Then, after step S628, the data amount in the flag string calculated in step S622 and the compressed data amount for each region accumulated in each of steps S624, S626, and S628 may be added to each other.
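The following Python sketch combines steps S622 to S628 for the three compression methods; the 2-bit flag and the zero lower-side contribution assumed for the 01 compression method, as well as the function signature and method names, are assumptions made for illustration because those details are not fully reproduced above. The value returned for each method can then be compared by the selection loop sketched earlier.

```python
def estimate_compressed_bits(histogram: dict, x: int, method: str) -> float:
    """Sketch of steps S622 to S628: estimate the compressed data amount
    in bits from the frequency distribution of most significant bit
    positions (histogram: digit d -> number of data).

    x is the lowest digit of the bit range (x = -9 in the example above),
    so the bit range is digits x .. x+6 and the upper-side range starts
    at digit x+7."""
    flag_bits = {"0-skip": 1, "01": 2, "variable-length": 3}[method]
    total = sum(histogram.values()) * flag_bits                  # step S622

    def f(d):                                  # bit range (step S626)
        return 8 if method in ("0-skip", "01") else (d - x + 1) + 1

    def g(d):                                  # lower-side range (step S628)
        p_up = 2.0 ** (d - x + 1)              # probability of rounding up to '1'/'-1'
        if method == "0-skip":
            return 8 * p_up
        if method == "01":
            return 0.0                         # assumed: '0'/'1'/'-1' carried by the flag
        return 2 * p_up + 1 * (1 - p_up)       # variable length: 2 bits or 1 bit

    for d, count in histogram.items():
        if d >= x + 7:                                           # step S624
            total += count * 8
        elif d >= x:
            total += count * f(d)
        else:
            total += count * g(d)
    return total
```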
By compressing the middle layer data of DNN training and storing the compressed data in the data memory 218, and decompressing the data read from the data memory 218 to the original data, the data transfer amount to the data memory 218 can be reduced. As a result, in DNN learning, the wait time can be reduced in the operator, the operation efficiency is improved, and the learning time can be shortened. That is, by improving the compression efficiency of data used for learning the deep neural network to be transferred to the memory, the learning time can be reduced.
By providing the compression/decompression unit 72 in the memory interface 70 located near the data memory 218, both the operation result data output from the vector unit 30 and the operation result data output from the scalar unit 40 can be compressed. Furthermore, the compression/decompression unit 72 can be mounted in the arithmetic processing device 100 without separating the unit into a compression unit and a decompression unit. As a result, a wiring region such as data lines related to data compression and decompression can be minimized, and a circuit scale of the arithmetic processing device 100 can be minimized.
By determining the compression method for the data to be stored in the data memory 218 using the statistical information stored in the statistical information storage unit 22 for each predetermined amount of learning, the compression method that minimizes the transfer time of data to the data memory 218 in the next predetermined amount of learning can be predicted. Since one of the plurality of compression methods or non-compression can be selected according to the compression data amount, it is possible to select a method having a small data amount, including non-compression. That is, it is possible to avoid selection of a compression method having lower compression efficiency than non-compression.
Since the compression method that is expected to minimize the compression data amount can be selected from the plurality of compression methods, the optimum compression method estimated to minimize the compression data amount can be adopted in the next k-times of mini-batches according to the characteristics of the compression data that changes for each k-times of mini-batches.
For example, in the 0-skip compression method, the higher the ratio of “0” in the operation result data, the higher the compression efficiency. In the 01 compression method, the higher the ratio of “0”, “1”, and “−1” in the operation result data, the higher the compression efficiency. In the variable length compression method, the higher the ratio of a value that is none of “0”, “1”, and “−1” in the operation result data, the higher the compression efficiency.
By mounting the compression/decompression unit 72 on the memory interface 70, the common compression/decompression unit 72 can efficiently compress and decompress the operation result data even in the case of providing the plurality of integer operators 32 and 42.
The instruction control unit 10A has similar configuration and function to the instruction control unit 10 of
The scalar unit 40A has similar configuration and function to the scalar unit 40 of
The compression method determination unit 12A executes the processing of determining a compression method illustrated in
A server on which the arithmetic processing device 100A is mounted has similar configuration and function to the server 200 illustrated in
Note that the compression method determination unit 12A may be provided at another location in the arithmetic processing device 100A as long as statistical information stored in a statistical information storage unit 22 can be referred to. Furthermore, in
As described above, in this embodiment, effects similar to those of the above-described embodiment can be obtained. For example, by compressing middle layer data of DNN training and reading/writing the data to/from a data memory 218, the data transfer amount to the data memory 218 can be reduced and the learning time of the DNN can be shortened. That is, by improving the compression efficiency of data used for learning the deep neural network to be transferred to the memory, the learning time can be reduced.
Moreover, in this embodiment, the compression method can be determined in the arithmetic processing device 100A by providing the compression method determination unit 12A in the arithmetic processing device 100A. As a result, the communication amount and the communication time between the arithmetic processing device 100A and the host 220 (
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.