This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2019-178727, filed on Sep. 30, 2019, the entire contents of which are incorporated herein by reference.
The embodiments relate to an information processing apparatus, an information processing method, and an information processing program.
A neural network (hereinafter referred to as NN), which is an example of machine learning, is a network in which an input layer, a plurality of hidden layers, and an output layer are arranged in order. Each layer has one or more nodes, and each node has a value such as input data. Then, nodes between one layer and the next layer are connected by edges, and each edge has parameters such as weight and bias.
Japanese Laid-open Patent Publication No. 07-084975, Japanese Laid-open Patent Publication No. 2012-203566, Japanese Laid-open Patent Publication No. 2009-271598, and Japanese Laid-open Patent Publication No. 2018-124681 are disclosed as related art.
According to an aspect of the embodiments, an information processing apparatus includes: a memory; and a processor coupled to the memory and configured to: execute a predetermined operation on each of a plurality of pieces of input data so as to generate a plurality of pieces of first operation result data that is a result of the predetermined operation; acquire statistical information regarding a distribution of digits of most significant bits that are unsigned for each of the plurality of pieces of first operation result data; store the plurality of pieces of first operation result data based on a predetermined data type in a register; execute a saturation process or a rounding process on the plurality of pieces of first operation result data based on, out of a first data type and a second data type that represent operation result data with a predetermined bit width, the second data type having a narrower bit width than the first data type, so as to generate a plurality of pieces of second operation result data; calculate a first sum total based on the statistical information by adding up a value acquired for every one of the digits by multiplying a number of data in which the most significant bits are distributed to the digits in the plurality of pieces of first operation result data by a value of the digit; calculate a second sum total based on the statistical information by adding up a value acquired for every one of the digits by multiplying a number of data in which the most significant bits are distributed to the digits in the plurality of pieces of second operation result data by a value of the digit; calculate a first quantization difference that is a difference between the first sum total and the second sum total; and store the plurality of pieces of second operation result data in the register when the calculated first quantization difference is less than a predetermined threshold value.
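The quantization-difference criterion described above can be illustrated with a short sketch. The following Python fragment is a minimal, illustrative reading of this summary, assuming the statistical information is already available as per-digit counts of most-significant-bit positions and that the "value of the digit" is the weight 2^digit; the function names are hypothetical.

```python
def weighted_sum(msb_histogram):
    # Sum over all digits of (number of data whose unsigned most significant
    # bit falls at that digit) x (value of the digit, assumed here to be 2**digit).
    return sum(count * (2 ** digit) for digit, count in enumerate(msb_histogram))

def keep_narrow_type(hist_first, hist_second, threshold):
    # First and second sum totals for the operation result data before and
    # after the saturation/rounding into the narrower second data type.
    first_total = weighted_sum(hist_first)
    second_total = weighted_sum(hist_second)
    quantization_difference = first_total - second_total
    # Store the second (narrow) operation result data only when the
    # quantization difference stays below the predetermined threshold.
    return quantization_difference < threshold
```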
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
In the NN, the value of a node of each layer is acquired by executing a predetermined operation based on the value of a node of the preceding layer and an edge weight, and the like. Then, when input data is input to a node of the input layer, the value of a node of the next layer is acquired by a predetermined operation, and moreover, using data acquired by the operation as input data, the value of the node of the next layer is acquired by a predetermined operation of the layer. Then, the value of a node of the output layer, which is the last layer, becomes output data for the input data.
When inputting or outputting data, a value is represented in a predetermined data type and read from or written to a storage device. At this time, the wider the range of representable values of the data type representing the value, for example, the representation range, the larger the required bit width. For example, when a data type using a floating-point number is used, the required bit width becomes large in exchange for the wide representation range, and the used capacity of the storage device and the amount of operation increase.
In order to reduce the amount of operation of the NN, a method called quantization is used, which uses a data type whose bit width required for representing a value is narrow. For example, a data type that uses a fixed-point number fixes the decimal point position and thereby reduces the bit width required for the representation as compared to a floating-point number, which needs representation of a mantissa and an exponent. However, because the fixed-point data type has a narrower representable range than floating-point numbers, if the number of digits in a value increases due to an operation, an overflow may occur in which the value falls outside the representation range and the high-order bits of the operation result are saturated, or an underflow may occur and the low-order bits are rounded. In this case, the accuracy of the operation result may decrease.
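As an illustration of the saturation and rounding described above, the following sketch quantizes a value into a signed fixed-point format with n integer bits and m fractional bits (the Qn.m notation used later). It is a schematic example under these assumptions, not the processor's actual conversion circuit.

```python
import numpy as np

def to_fixed_point(x, n, m):
    # Quantize to the fixed-point format Qn.m: one sign bit, n integer bits,
    # m fractional bits.  Low-order bits are rounded, and values outside the
    # representable range are saturated.
    scale = 2 ** m
    q = np.round(np.asarray(x, dtype=float) * scale)   # rounding of lower bits
    q_max = 2 ** (n + m) - 1
    q_min = -(2 ** (n + m))
    q = np.clip(q, q_min, q_max)                        # saturation of higher bits
    return q / scale

# For example, to_fixed_point(3.14159, n=2, m=5) gives 3.15625, while
# to_fixed_point(10.0, n=2, m=5) saturates to 3.96875.
```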
Therefore, in the operation of the NN, a dynamic fixed point has been proposed which dynamically adjusts the decimal point position of operation result data acquired by the operation. Furthermore, as a method for determining an appropriate decimal point position, there is known a method of acquiring statistical information of a most significant bit that is unsigned and setting a decimal point position that satisfies a condition using a predetermined threshold value based on the statistical information.
In the conventional quantization method of the NN, a user specifies a variable to be quantized before starting learning and inference. It is difficult to identify, for a specific layer or a specific variable, a variable that causes little deterioration in the recognition rate of the NN even when it is quantized. This is because the variable changes non-linearly depending on design conditions of the NN, such as the number and size of data input to the NN and the connection relation of layers. It is conceivable that the user determines, from an empirical rule, a variable as a quantization target by selecting a specific variable whose accuracy does not significantly decrease even when it is quantized.
Whether or not quantization is possible for a given variable depends on whether or not the distribution of values of elements included in a tensor representing operation result data, for example, the distribution of values of the operation result data can be covered even in a narrow representation range, and whether or not recognition accuracy of the NN can be maintained. If the distribution of values of the operation result data is narrow, the quantization is possible, but when the distribution is too wide, the error due to the quantization becomes large and the accuracy is significantly lowered, and thus the quantization may not be performed. For example, in an early stage of learning, the value of operation result data may change greatly and the value distribution of the operation result data may become wide. Thus, even if an optimum decimal point position is determined when a value represented by a floating-point number is represented by a fixed-point number, it is not possible to prevent recognition accuracy of the NN from decreasing.
In one aspect, an information processing apparatus, information processing method, and information processing program that reduce the amount of operation while maintaining recognition accuracy of the NN may be provided.
The first convolutional layer Conv_1 performs, for example, a product-sum operation with weights between nodes and the like on pixel data of an image input to the plurality of nodes in the input layer INPUT, and outputs pixel data of an output image having features of the image to the plurality of nodes in the first convolutional layer Conv_1. The same applies to the second convolutional layer Conv_2.
The first pooling layer Pool_1 is a layer whose node is a value determined from the local node of the first convolutional layer Conv_1, which is a previous layer, and absorbs a slight change in the image by, for example, taking the maximum value of a local node as a value of its own node.
The output layer OUTPUT finds a probability of belonging to each category from the value of the node using a softmax function or the like.
In the NN, a plurality of layers may be configured by hardware circuits, and the hardware circuits may execute the operations of the respective layers. Alternatively, the NN may cause a processor that executes an operation of each layer of the NN to execute a program that causes the operation of each layer to be executed. The NN process described in
As illustrated in
Furthermore, instead of repeating the processes S1 to S9 with the same combination of the input data and the teacher data until the predetermined number of times is reached, the learning processing may also be terminated when an evaluation value of a learning result, for example, an error between the output data and the teacher data, falls within a certain range.
In an example of the learning processing of the NN, the determination of a quantization target in S2 is performed by setting a variable specified as the quantization target by a user prior to learning. Furthermore, for S2, the variable as the quantization target may be changed according to the progress of repeated execution of the learning.
In the quantization process S4, the quantization process is performed on the variable determined as a quantization target in S2. For example, the input layer and the hidden layer use a data type of FP32 that represents a floating-point number in 32 bits, and the output layer uses a data type of INT8 that represents an integer in 8 bits to perform quantization.
In the forward propagation process S5, operations of the respective layers are sequentially executed from the input layer of the NN toward the output layer. Describing with the example of
Next, in error evaluation S6, the error between the teacher data and the output data of the NN is calculated. Then, the back propagation process S7 for propagating the error calculated in S6 from the output layer of the NN to the input layer is executed. In the back propagation process S7, the error is partially differentiated with respect to a variable such as the weight of each layer by propagating the error from the output layer to the input layer. Then, in the variable update S8, the current variable is updated using the partial differential result of the error with respect to the variable acquired in S7, and the weight or the like of each layer is updated toward an optimum value.
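The variable update in S8 corresponds to an ordinary gradient step. The following is a minimal sketch assuming plain stochastic gradient descent; the update rule itself is not specified here, so the learning-rate form is an assumption for illustration only.

```python
def update_variables(variables, gradients, learning_rate=0.01):
    # S8: move each variable toward the optimum using the partial differential
    # of the error obtained in the back propagation process S7.
    return {name: value - learning_rate * gradients[name]
            for name, value in variables.items()}
```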
The host processor 31 of the host machine 30 executes a program that is stored in the auxiliary storage device 35 and expanded in the main memory 33. The high-speed input-output interface 32 is an interface, such as PCI Express, that connects the host processor 31 and the NN execution machine 40, for example. The main memory 33 stores programs and data executed by the processor.
The internal bus 34 connects a peripheral device, which is slower than the processor, to the processor and relays communication between them. The low-speed input-output interface 36 is connected to a keyboard and a mouse of the user terminal 50 via a USB or the like, or is connected to an Ethernet (registered trademark) network, for example.
The auxiliary storage device 35 stores an NN learning program, input data, and teacher data. The host processor 31 executes the NN learning program and, for example, transmits the learning program, input data, and teacher data to the NN execution machine 40, and causes the NN execution machine 40 to execute the learning program.
The NN processor 43 executes a program based on the program and data transmitted from the host machine 30, and executes a learning process. The NN processor 43 has an NN processor 43_1 that executes fixed-point arithmetic and an NN processor 43_2 that executes floating-point arithmetic. However, the NN processor 43_2 that executes the floating-point arithmetic may be omitted.
The NN processor 43_1, which executes fixed-point arithmetic, has a statistical information acquisition circuit for acquiring statistical information, such as a valid most significant bit and a valid least significant bit, regarding operation result data such as operation results calculated in the NN, variables updated by learning, and data in the memory, and the like. The NN processor 43_1, which executes fixed-point arithmetic, acquires statistical information of operation result data acquired by operation while performing learning, and adjusts a fixed-point position of operation result data to an optimum position based on the statistical information.
The high-speed input-output interface 41 is, for example, PCI Express and relays communication with the host machine 30.
The control unit 42 stores the program and data transmitted from the host machine 30 in the internal memory 45 and, in response to a command from the host machine 30, instructs the NN processor 43 to execute the program. The memory access controller 44 controls an access process to the internal memory 45 in response to an access request from the control unit 42 and an access request from the NN processor 43.
The internal memory 45 stores a program executed by the NN processor 43, processing target data, processing result data, and the like. The internal memory 45 is, for example, an SDRAM, a faster GDDR5, a wide-bandwidth HBM2, or the like.
In response to these transmissions, the NN execution machine 40 stores the input data and the learning program in the internal memory 45, and executes the learning program for the input data stored in the internal memory 45 in response to the learning program execution instruction (S40_1). The learning program is executed by the NN processor 43. The host machine 30 transmits input data for next one mini-batch (S32_2) and then waits until the execution of the learning program by the NN execution machine 40 is completed. In this case, two areas for storing input data are prepared in the NN execution machine 40.
When the execution of the learning program is completed, the NN execution machine 40 transmits a notification of end of the learning program execution to the host machine 30 (S41_1). The host machine 30 switches an input data area referenced by the learning program and transmits the learning program execution instruction (S33_2). Then, the NN execution machine 40 executes the learning program (S40_2) and transmits an end notification (S41_2). This process is repeated to proceed with the NN learning.
The learning of the NN has a process to execute an operation of each layer in a forward direction of the NN (forward propagation process), propagate an error between output data of the output layer and teacher data in a reverse direction of the NN and calculate a partial differential of the error with respect to the variable of each layer (back propagation process), and update the variable according to the partial differential result of the error with respect to the variable of each layer (variable update). The whole learning processing of the NN may be executed by the NN execution machine 40, or a part of the processing may be executed by the host machine 30.
First, the NN processor 43 determines an initial decimal point position of each operation result data (operation result of each layer, variable, and the like) (S60). The determination of the initial decimal point position is performed by pre-learning with a floating-point number or by specification by the user. When performing pre-learning with a floating-point number, the operation result data in the NN is a floating-point number. Thus, an exponent part corresponding to the size of the operation result data is generated, and the decimal point position does not need to be adjusted like a fixed-point number. Then, an optimum decimal point position of the fixed-point number of each operation result data is determined based on the operation result data of the floating-point number.
Next, the NN processor 43 acquires and stores statistical information regarding the distribution of values of each operation result data while executing mini-batch learning (S61). The NN processor 43_1 that executes fixed-point arithmetic included in the NN processor 43 has a statistical information acquisition circuit that acquires statistical information such as a distribution of effective bits of operation results of the fixed-point arithmetic unit, or the like. By causing the NN processor 43 to execute an operation instruction with a statistical information acquisition process, the statistical information of operation result data may be acquired and stored during the mini-batch learning. S61 is repeated until the mini-batch learning is executed K times (S62: NO). When the mini-batch learning is executed K times (S62: YES), the fixed-point position of each operation result data in the NN is adjusted based on the statistical information of each layer regarding the distribution of values of operation result data (S63).
The statistical information acquisition circuit in the NN processor 43 described above and a method of adjusting the fixed-point position based on the statistical information of each layer regarding the distribution will be described in detail later.
Then, the NN processor 43 repeats S61, S62, and S63 until the learning of all the mini-batches is completed (S64: NO). When the learning of all the mini-batches is completed (S64: YES), the process returns to the first S60 and repeats the learning of all mini-batches until a predetermined number of times is reached (S65: NO).
With the example of learning described in
Furthermore, in S63, the NN processor 43 determines and updates the optimum decimal point position of each operation result data of each layer based on the distribution of effective bits of the plurality of pieces of operation result data included in the stored statistical information.
On the other hand, in the back propagation process, the fixed-point arithmetic unit in the NN processor 43 calculates partial differentials δ0(5) to δj(5) . . . of layer L5 close to the input layer from partial differential results δ0(6) to δi(6) to δn(6) of an error between output data of the output layer and teacher data with respect to a variable of layer L6 close to the output layer. Then, update data ΔWij of the weight is calculated according to the value acquired by partially differentiating the partial differentials δ0(5) to δj(5) . . . of the layer L5 with respect to a variable such as the weight Wij. The operations in the layers L6, L5 are repeated from the output layer to the input layer.
Moreover, in the process of updating the variable in each layer in order, the update data ΔWij is subtracted from the existing weight Wij to calculate the updated weight Wij.
Input data Z0 to Zj . . . to the layer L2, the output data U0 to Uj . . . of the activation function, partial differential results δ0(6) to δi(6) to δn(6), and δ0(5) to δj(5) . . . in the layers L6, L5, and the weight update data ΔWij and the updated weight Wij illustrated in
The statistical information regarding the distribution of effective bits of the operation result data is as follows, for example.
(1) Distribution of positions of the most significant bits that are unsigned
(2) Distribution of positions of the least significant bits that are non-zero
(3) Maximum value of positions of the most significant bits that are unsigned
(4) Minimum value of positions of the least significant bits that are non-zero
(1) Positions of the most significant bits that are unsigned are positions of the most significant bits of effective bits of the operation result data. The unsigned bit is “1” when the sign bit is 0 (positive) and “0” when the sign bit is 1 (negative). (2) Positions of the least significant bits that are non-zero are positions of the least significant bits of effective bits of the operation result data. If the sign bit is 0 (positive), it is the position of the least significant bit of “1”, and if the sign bit is 1 (negative), it is also the position of the least significant bit of “1”. When the sign bit is 1, bits other than the sign bit are represented by the two's complement, and the process of converting the two's complement to the original number includes a process of subtracting 1 and then inverting 1 to 0 and 0 to 1. Therefore, the least significant bit of “1” becomes “0” by subtracting 1 and becomes “1” by bit inversion, which is the position of the least significant bit of the effective bits.
(3) Maximum value of positions of the most significant bits that are unsigned is the maximum position out of positions of the most significant bits of the effective bits of each of the plurality of pieces of operation result data. Similarly, (4) Minimum value of positions of the least significant bits that are non-zero is the minimum position out of positions of the least significant bits of the effective bits of each of the plurality of pieces of operation result data.
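As a concrete illustration of statistics (1) to (4), the following sketch computes the bit positions for 40-bit two's-complement values held as Python integers; the helper names are illustrative and not part of the processor described here.

```python
def msb_unsigned_position(value, num_bits=40):
    # (1) Position of the most significant bit that differs from the sign bit,
    # i.e. the most significant effective bit.  Returns -1 if every bit
    # equals the sign bit.
    sign = (value >> (num_bits - 1)) & 1
    for pos in range(num_bits - 2, -1, -1):
        if ((value >> pos) & 1) != sign:
            return pos
    return -1

def lsb_nonzero_position(value, num_bits=40):
    # (2) Position of the least significant "1", which for both positive and
    # negative (two's complement) values marks the least significant
    # effective bit, as explained above.  Returns -1 for zero.
    for pos in range(num_bits):
        if (value >> pos) & 1:
            return pos
    return -1

# (3) and (4) are simply the maximum of (1) and the minimum of (2) taken over
# a plurality of pieces of operation result data:
#   ub = max(msb_unsigned_position(v) for v in data)
#   lb = min(lsb_nonzero_position(v) for v in data)
```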
As an example,
On the other hand, the spread of the distribution of positions of the most significant bits that are unsigned (the number of histogram bins) changes depending on the plurality of pieces of operation result data. The spread of the distribution of the histogram in
On the other hand, in the histogram of
Accordingly, a method of determining the decimal point position based on the statistical information, which is a histogram, differs between a case where the width (number of bins) of the histogram exceeds 15 bits and does not fit in the representable area (15 bits) (
If the horizontal width (number of bins) 33 of the histogram in
In the example in
In the example of
The process is started upon completion of S62, and a maximum value ub of statistical information is acquired from the statistical information of each layer stored in S61 (S631). The maximum value ub of the statistical information corresponds to, for example, the maximum value of the positions of the above-mentioned most significant bits that are unsigned. Next, a minimum value lb of the statistical information is acquired from the statistical information of each layer stored in S61 (S632). The minimum value lb of the statistical information corresponds to, for example, the minimum value of the positions of the most significant bits that are unsigned. Next, the spread ub−lb+1 of the distribution is acquired (S633). The spread ub−lb+1 indicates the width between the maximum value and the minimum value of the statistical information. Next, it is determined whether or not the spread ub−lb+1 of the distribution is larger than a bit width N excluding the sign bit (S634). This determination corresponds to case classifications in a case where the width (number of bins) of the histogram does not fit in the representable area (
If the spread ub−lb+1 of the distribution is not larger than the bit width N excluding the sign bit (S634: NO), the number n of digits in the integer part is determined based on the distribution center (ub−lb+1)/2 and the bit width center N/2 (S635). The number n of digits in the integer part corresponds to the n-bit integer part represented by the fixed-point number format Qn.m. When the spread of the distribution is larger than the bit width N excluding the sign bit (S634: YES), the number n of digits in the integer part is determined based on the function that acquires a digit whose overflow rate exceeds the default value r_max (S636). Next, the number m of digits in the fractional part is determined based on the number n of digits in the integer part and the bit width N acquired in S635 or S636 (S637). The number m of digits in the fractional part corresponds to the m-bit fractional part represented in the fixed-point number format Qn.m.
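A software sketch of the flow S631 to S637 might look as follows. It involves assumptions: the digit index is taken relative to the binary point, S635 is read as aligning the centre of the distribution with the centre of the N-bit window, and S636 is read as choosing the smallest number of integer digits whose overflow rate does not exceed r_max; the actual processor may implement these steps differently.

```python
import numpy as np

def adjust_decimal_point(histogram, bit_width=16, r_max=0.01):
    # histogram[d] = number of data whose most significant unsigned bit is at digit d.
    histogram = np.asarray(histogram)
    digits = np.nonzero(histogram)[0]
    ub = int(digits.max())        # S631: maximum value of the statistical information
    lb = int(digits.min())        # S632: minimum value of the statistical information
    spread = ub - lb + 1          # S633: spread of the distribution
    n_bits = bit_width - 1        # bit width N excluding the sign bit

    if spread <= n_bits:
        # S635 (assumption): centre the N-bit window on the centre of the distribution.
        top = int(round((ub + lb) / 2 + n_bits / 2))
    else:
        # S636 (assumption): smallest top digit whose overflow rate is at most r_max.
        total = histogram.sum()
        top = ub
        for d in range(lb, ub + 1):
            if histogram[d + 1:].sum() / total <= r_max:
                top = d
                break
    n = top + 1                   # digits of the integer part of Qn.m
    m = n_bits - n                # S637: remaining digits form the fractional part
    return n, m
```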
[Determination of Quantization Target in Learning According to First Embodiment]
A method of determining the data type of a variable as a quantization target in learning according to the first embodiment will be described. In the learning according to the first embodiment, it is determined whether or not quantization is performed for each variable in each layer of the NN or, for example, whether or not to use a data type having a narrow bit width for expressing a value. The learning according to the first embodiment has an effect of reducing the amount of operation of the NN while maintaining recognition accuracy of the NN.
The process is started upon completion of S1, and the host processor 31 determines a predetermined quantization range for the variable (S203). The quantization range may be determined by the method based on the statistical information of the distribution described in
Next, the host processor 31 calculates quantization errors of all variables when the quantization process is performed with the data type of narrow bit width and the quantization range determined in S203 based on the stored statistical information (S205). The quantization process includes performing the quantization process based on the quantization range determined in S203. The host processor 31 selects the data type of narrow bit width from candidates of data types used when outputting data of variables. The candidates of data types are, for example, an INT8 data type that represents an integer in 8 bits and an FP32 data type that represents a floating-point number in 32 bits.
Next, the host processor 31 determines the predetermined threshold value (S206). The predetermined threshold value may be designated by the user or may be determined based on the statistical information stored in S61. When the predetermined threshold value is determined based on the statistical information, it is determined based on changes in the quantization errors calculated based on the statistical information. The predetermined threshold value may be determined based on, for example, the average value of all quantization errors. By determining based on the changes in the quantization errors calculated based on the statistical information, the threshold value for determining the variable as a quantization target corresponding to the input data may be adjusted, and thus it is possible to determine the quantization target with higher accuracy.
Next, the host processor 31 determines whether or not the quantization error calculated in S205 is less than the predetermined threshold value (S207). When the quantization error is less than the predetermined threshold value (S207: YES), it is determined to use the data type of narrow bit width used for the calculation of the quantization error for outputting the variable (S209). When the quantization error is not less than the predetermined threshold value (S207: NO), it is determined to use a data type having a wider bit width than the data type of narrow bit width for outputting the variable (S211).
Then, S206 to S211 are repeated until the data types of all variables are determined (S213: NO). When the data types of all variables are determined (S213: YES), the process proceeds to S3.
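The decision loop S205 to S213 can be summarized by the following sketch, where the quantization errors are assumed to have already been computed from the stored statistical information; the candidate type names INT8 and FP32 follow the example above, and deriving the threshold from the average of all quantization errors is one of the options mentioned in S206.

```python
import numpy as np

def choose_data_types(quantization_errors, threshold=None, narrow="INT8", wide="FP32"):
    # S206 (assumption): when no threshold is given, derive it from the
    # average of all quantization errors computed from the statistics.
    if threshold is None:
        threshold = float(np.mean(list(quantization_errors.values())))
    # S207-S211: use the narrow data type only when the quantization error
    # of the variable is less than the threshold.
    return {name: (narrow if err < threshold else wide)
            for name, err in quantization_errors.items()}
```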
The process is started upon completion of S1, and a quantization range candidate when a variable is quantized with a data type of narrow bit width is determined (S2031).
Next, the quantization error of the variable when the quantization process is performed with the quantization range candidate determined in S2031 is calculated based on the statistical information stored in S61 (S2033). The method of calculating the quantization error is similar to S205.
S2031 to S2033 are repeated until quantization errors are calculated for all the quantization range candidates (S2035: NO). When quantization errors have been calculated for all the quantization range candidates (S2035: YES), the process proceeds to S2037.
Then, the quantization range candidate for which the calculated quantization error becomes a minimum value is determined as the quantization range (S2037).
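A sketch of S2031 to S2037, assuming a helper quantization_error_for(candidate) that evaluates the error for one candidate range from the stored statistics (a hypothetical name; candidates could be, for example, pairs of lower and upper digits):

```python
def choose_quantization_range(range_candidates, quantization_error_for):
    # S2031-S2035: calculate the quantization error for every candidate range.
    errors = {candidate: quantization_error_for(candidate)
              for candidate in range_candidates}
    # S2037: the candidate with the minimum quantization error becomes the range.
    return min(errors, key=errors.get)
```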
From
Here, the variable that can be quantized is a variable that does not cause a significantly large quantization error even when quantized with a data type of a narrow representation range. When the variable as the quantization target is determined by empirical rules or pre-learning, the variable that may be quantized is limited to a specific variable whose data value distribution is not too wide from the beginning of learning. On the other hand, for example, there is a variable having a tendency such that a change in values is large and the distribution of data values is wide in the initial stage of learning, but the spread of the distribution of the data values decreases as the learning proceeds. For example, in a layer that executes a multiplication of two variables, variations in the distribution may not change significantly before and after the operation.
By determining the quantization target in the learning according to the first embodiment, for example, it is possible to increase the variables to be the quantization target in accordance with the progress of learning, and both maintaining the recognition accuracy of the NN and reducing the amount of operation may be achieved.
Here, a case where the quantization is possible based on the distribution of values of the variable data and a case where the quantization is not possible will be described with reference to
[Mathematical Formula 1]
∥w−wQ∥=(a1b1+a2b2+a3b3+a4b4+ . . . +a8b8+a9b9+a10b10+a11b11)−(a3b1+a3b2+a3b3+a4b4+ . . . +a8b8+a9b9+a9b10+a9b11) (1)
Furthermore, the quantization error may be represented by approximation by the following formula (2) by calculating only the data of a1, a2, a10, a11 which are out of the quantization range.
[Mathematical Formula 2]
∥w−wQ∥≅(a1b1+a2b2)−(a3b1+a3b2)+(a10b10+a11b11)−(a9b10+a9b11) (2)
Since an error within the representation range is sufficiently smaller than the error for data of a1, a2, a10, a11 which are out of the quantization range, by using an approximated quantization error, the amount of operation for quantization error operation may be reduced while maintaining the recognition accuracy.
Furthermore, a squared error may be used as the quantization error, which is represented by the following formula (3).
[Mathematical Formula 3]
∥w−wQ∥2 (3)
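The approximated error of formula (2) can be evaluated directly from a histogram of data values. In the sketch below, a[i] are the representative values of the histogram bins, b[i] the corresponding numbers of data, and bins lo through hi are the ones covered by the quantization range, matching the roles of a3 through a9 in the formulas; this reading of the symbols is an assumption for illustration.

```python
def approx_quantization_error(a, b, lo, hi):
    # Formula (2): only bins outside the quantization range contribute; each
    # out-of-range value is clamped to the nearest end of the range.
    err = 0.0
    for i, (ai, bi) in enumerate(zip(a, b)):
        if i < lo:
            err += (ai - a[lo]) * bi
        elif i > hi:
            err += (ai - a[hi]) * bi
    return err

# The squared error of formula (3) is then simply
#   approx_quantization_error(a, b, lo, hi) ** 2
```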
The magnitude of the quantization error illustrated in
Although the learning of the NN according to the first embodiment has been described, it is not limited to the learning processing, and determining the data type based on the quantization error calculated based on the statistical information may also be applied to inference of the NN.
[Configuration of Fixed-Point NN Processor and Acquisition of Statistical Information]
Next, a configuration of the NN processor 43 according to the first embodiment and acquisition of statistical information will be described.
The NN processor 43 has an integer arithmetic unit INT that calculates a fixed-point number and a floating-point arithmetic unit FP that calculates a floating-point number in the vector operation unit VC_AR_UNIT. For example, the NN processor 43 has the NN processor 43_1 that executes a fixed-point arithmetic and the NN processor 43_2 that executes a floating-point arithmetic.
Furthermore, the NN processor 43 is connected to an instruction memory 45_1 and a data memory 45_2 via the memory access controller 44. The memory access controller 44 has an instruction memory access controller 44_1 and a data memory access controller 44_2.
The instruction control unit INST_CON has, for example, a program counter PC and an instruction decoder DEC. The instruction control unit INST_CON fetches an instruction from the instruction memory 45_1 based on an address of the program counter PC, and the instruction decoder DEC decodes the fetched instruction and issues it to an operation unit.
The register file REG_FL has a scalar register file SC_REG_FL and a scalar accumulation register SC_ACC used by the scalar operation unit SC_AR_UNIT. Moreover, the register file REG_FL has a vector register file VC_REG_FL and a vector accumulation register VC_ACC used by the vector operation unit VC_AR_UNIT.
The scalar register file SC_REG_FL includes scalar registers SR0 to SR31 each of which is 32-bit for example, and scalar accumulation registers SC_ACC each of which is 32-bit+α-bit for example.
The vector register file VC_REG_FL has, for example, eight sets of REGn0 to REGn7, each having 32-bit registers by the number of eight elements. Furthermore, the vector accumulation register VC_ACC has, for example, A_REG0 to A_REG7 each having a 32-bit+α-bit register by the number of eight elements.
The scalar operation unit SC_AR_UNIT has a set of an integer arithmetic unit INT, a data converter D_CNV, and a statistical information acquisition unit ST_AC. The data converter D_CNV converts output data of a fixed-point number output by the integer arithmetic unit INT into a floating-point number. The scalar operation unit SC_AR_UNIT uses the scalar registers SR0 to SR31 and the scalar accumulation register SC_ACC in the scalar register file SC_REG_FL to execute an operation. For example, the integer arithmetic unit INT calculates the input data stored in any of the scalar registers SR0 to SR31 and stores output data thereof in another register. Furthermore, when executing a product-sum operation, the integer arithmetic unit INT stores the result of the product-sum operation in the scalar accumulation register SC_ACC. The operation result of the scalar operation unit SC_AR_UNIT is stored in any of the scalar register file SC_REG_FL, the scalar accumulation register SC_ACC, and the data memory 45_2.
The vector operation unit VC_AR_UNIT has eight elements of operation units EL0 to EL7. Each of the elements EL0 to EL7 has an integer arithmetic unit INT, a floating-point arithmetic unit FP, and a data converter D_CNV. The vector operation unit VC_AR_UNIT inputs, for example, any set of the eight-element registers REGn0 to REGn7 in the vector register file VC_REG_FL, executes operations in parallel by the eight-element arithmetic units, and stores operation results in another set of the eight-element registers REGn0 to REGn7.
Furthermore, the data converter D_CNV shifts fixed-point number data acquired as a result of an operation, read from the data memory 45_2, or the like. The data converter D_CNV shifts the fixed-point number data by a shift amount S specified in the instruction fetched by the instruction decoder DEC. The shift by the data converter D_CNV corresponds to adjusting the decimal point position corresponding to the fixed-point number format. Furthermore, the data converter D_CNV executes the saturation process of high-order bits and the rounding process of low-order bits of the fixed-point number data along with the shift. The data converter D_CNV, for example, inputs an operation result of 40 bits and includes a rounding processing unit that performs the rounding process with a low-order bit as a fractional part, a shifter that performs arithmetic shift, and a saturation processing unit that performs the saturation process.
Then, the data converter D_CNV maintains the sign of the high-order bit at the time of left shift, performs a saturation process of other than the sign bit or, for example, discards the high-order bit, and embeds 0 in the low-order bit. Furthermore, at the time of right shift, the data converter D_CNV embeds the sign bit in the high-order bits (bits lower than the sign bit). Then, the data converter D_CNV outputs the data acquired by the rounding process, the shift, and the saturation process as described above with the same bit width as the register of the register file REG_FL. The data converter is an example of a circuit that adjusts the decimal point position of fixed-point number data.
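A software model of the conversion performed by the data converter D_CNV might look as follows; the round-to-nearest behaviour and the 16-bit output width are assumptions for illustration, since only the presence of rounding, shift, and saturation is stated above.

```python
def data_convert(acc, shift, out_bits=16):
    # Model of D_CNV: 'acc' is a signed operation result (e.g. a 40-bit
    # accumulator value), 'shift' is the shift amount S from the instruction.
    if shift > 0:
        # Right shift: round the discarded low-order bits (assumption: round
        # to nearest) and let the arithmetic shift copy the sign bit into the
        # vacated high-order bits.
        acc = (acc + (1 << (shift - 1))) >> shift
    elif shift < 0:
        # Left shift: low-order bits are filled with 0.
        acc = acc << (-shift)
    # Saturation to the signed output register width.
    max_val = (1 << (out_bits - 1)) - 1
    min_val = -(1 << (out_bits - 1))
    return max(min(acc, max_val), min_val)
```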
Furthermore, the vector operation unit VC_AR_UNIT executes a product-sum operation by each of the 8-element arithmetic units, and stores cumulative addition values of product-sum operation results in the 8-element registers A_REG0 to A_REG7 of the vector accumulation register VC_ACC.
In the vector registers REGn0 to REGn7 and the vector accumulation registers A_REG0 to A_REG7, the number of operation elements increases to 8, 16, 32 depending on whether the bit width of operation target data is 32 bits, 16 bits or 8 bits.
The vector operation unit VC_AR_UNIT has eight statistical information acquisition units ST_AC for respectively acquiring statistical information of output data of the 8-element integer arithmetic unit INT. The statistical information is position information of the most significant bits that are unsigned of output data of the integer arithmetic unit INT. The statistical information is acquired as a bit pattern BP described later with reference to
The statistical information register file ST_REG_FL has, for example, eight sets of statistical information registers STRn_0 to STRn_39 each having, for example, 32 bits×40 elements, as illustrated in
The scalar registers SR0 to SR31 store, for example, addresses and variables of NNs, or the like. Furthermore, the vector registers REG00 to REG77 store input data and output data of the vector operation unit VC_AR_UNIT. Then, the vector accumulation register VC_ACC stores a multiplication result and an addition result of the vector registers with each other.
The statistical information registers STR0_0 to STR0_39 . . . STR7_0 to STR7_39 store the number of pieces of data belonging to a plurality of bins of a maximum of eight types of histograms. When the output data of the integer arithmetic unit INT is 40 bits, the number of data having the unsigned most significant bit in each of 40 bits is stored in, for example, the statistical information registers STR0_0 to STR0_39.
The scalar operation unit SC_AR_UNIT executes the four arithmetic operations, shift operations, branches, loads and stores, and the like. As described above, the scalar operation unit SC_AR_UNIT has the statistical information acquisition unit ST_AC that acquires the statistical information having the positions of the most significant bits that are unsigned from the output data of the integer arithmetic unit INT.
The vector operation unit VC_AR_UNIT executes a floating-point arithmetic, an integer operation, a product-sum operation using the vector accumulation register VC_ACC, and the like. Furthermore, the vector operation unit VC_AR_UNIT executes clearing of the vector accumulation register VC_ACC, product-sum operation, cumulative addition, transfer to the vector register file VC_REG_FL, and the like. Moreover, the vector operation unit VC_AR_UNIT also performs load and store. As described above, the vector operation unit VC_AR_UNIT has the statistical information acquisition unit ST_AC that acquires the statistical information having positions of the most significant bits that are unsigned from the output data of the integer arithmetic unit INT of each of the eight elements.
[Acquisition, Aggregation, and Storage of Statistical Information]
Next, acquisition, aggregation, and storage of the statistical information of operation result data by the NN processor 43 will be described. The acquisition, aggregation, and storage of the statistical information are triggered by instructions that are transmitted from the host processor 31 and executed by the NN processor 43. Therefore, the host processor 31 transmits, to the NN processor 43, an instruction to acquire, aggregate, and store the statistical information, in addition to an operation instruction of each layer of the NN. Alternatively, the host processor 31 transmits, to the NN processor 43, an operation instruction with a process of acquiring, aggregating, and storing statistical information for the operation of each layer.
Next, the statistical information aggregator ST_AGR_1 adds up the “1”s in each bit of the eight bit patterns and aggregates them (S171).
Moreover, the statistical information aggregator ST_AGR_2 adds the value added and aggregated in S171 to the value in the statistical information register in the statistical information register file ST_REG_FL, and stores it in the statistical information register (S172).
The above processes S170, S171, S172 are repeated every time the operation result data that is the result of operation of each layer by the eight elements EL0 to EL7 in the vector operation unit VC_AR_UNIT is generated.
In the learning process, when the acquisition, aggregation, and storage process of the statistical information described above is completed for a plurality of operation result data in K mini-batches, the statistical information register file ST_REG_FL generates statistical information that is the number of respective bins of the histogram of the most significant bits that are unsigned of a plurality of pieces of operation result data in the K mini-batches. Consequently, the sum of positions of the most significant bits that are unsigned of the operation result data in the K mini-batches is aggregated for each bit. The decimal point position of each operation result data is adjusted based on this statistical information.
The adjustment of the decimal point position of the operation result data of each layer is performed by the host processor 31 of the host machine 30, for example. The statistical information of each layer stored in the statistical information registers STR0_0 to STR0_39 is written in the data memory 45_2 of the host machine 30, and the host processor 31 performs an operation to execute the process described in
[Acquisition of Statistical Information]
As illustrated in
According to this truth table, the first two rows are examples in which all bits of the input in[39: 0] match the sign bits “1” and “0”, and the most significant bit out[39] of the output out[39: 0] is “1” (0x8000000000). The next two rows are examples in which bit in[38] of the input in[39: 0] is different from the sign bit “1” or “0”, and bit out[38] of the output out[39: 0] is “1” and the others are “0”. The bottom two rows are examples in which bit in[0] of the input in[39: 0] is different from the sign bit “1” or “0”, bit out[0] of the output out[39: 0] is “1”, and the others are “0”.
In the logic circuit diagram illustrated in
Furthermore, if the sign bit in[39] matches in[38] and does not match in[37], the output of the EOR 38 becomes “0”, the output of an EOR 37 becomes “1”, and the output out[37] becomes “1”. When the output of the EOR 37 becomes “1”, the other outputs out[39: 38] and out[36: 0] become “0” due to the logical sums OR36 to OR0, the logical products AND36 to AND0, and the inverting gate INV. The same applies below.
As can be understood from
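The truth table above can be modelled in software as follows: the output is a one-hot pattern marking the most significant bit that differs from the sign bit, with out[39] set when every bit matches the sign bit, as in the first two rows. This is a behavioural sketch of ST_AC, not the circuit itself.

```python
def st_ac_bit_pattern(value, num_bits=40):
    # Behavioural model of the statistical information acquisition unit ST_AC.
    sign = (value >> (num_bits - 1)) & 1
    for pos in range(num_bits - 2, -1, -1):
        if ((value >> pos) & 1) != sign:
            return 1 << pos            # out[pos] = 1, all other bits 0
    # All bits match the sign bit: out[39] = 1 (0x8000000000 for 40 bits).
    return 1 << (num_bits - 1)
```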
[Aggregation of Statistical Information]
As illustrated in the logic circuit of
The statistical information aggregator ST_AGR_1 can also directly output one bit pattern BP as it is acquired by the statistical information acquisition unit ST_AC in the scalar operation unit SC_AR_UNIT. For this purpose, it has a selector SEL that selects either the outputs of the addition circuits SGM_0 to SGM_39 or the bit pattern BP of the scalar operation unit SC_AR_UNIT.
The statistical information register file ST_REG_FL has, for example, eight sets of 40 32-bit registers STRn_39 to STRn_0 (n=0 to 7). Therefore, it is possible to store the number of 40 bins for each of eight types of histograms. Now, let us suppose that the statistical information to be aggregated is stored in the 40 32-bit registers STR0_39 to STR0_0 with n=0. The second statistical information aggregator ST_AGR_2 has adders ADD_39 to ADD_0 for adding the value of each of the aggregated values in[39: 0] aggregated by the first statistical information aggregator ST_AGR_1 for each of cumulative addition values stored in the 40 32-bit registers STR0_39 to STR0_0. Then, the outputs of the adders ADD_39 to ADD_0 are stored again in the 40 32-bit registers STR0_39 to STR0_0. Thus, the number of samples in each bin of the target histogram is stored in the 40 32-bit registers STR0_39 to STR0_0.
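The two aggregation stages can be modelled as follows: ST_AGR_1 adds up the “1”s of each bit position across the eight bit patterns of one operation, and ST_AGR_2 accumulates the result into the 40 statistical information registers. This is a sketch, with the registers represented as a NumPy array.

```python
import numpy as np

def st_agr_1(bit_patterns, num_bits=40):
    # ST_AGR_1: per-bit sum of the '1's of up to eight bit patterns
    # produced by the elements EL0 to EL7.
    counts = np.zeros(num_bits, dtype=np.int64)
    for bp in bit_patterns:
        for pos in range(num_bits):
            counts[pos] += (bp >> pos) & 1
    return counts

def st_agr_2(statistics_registers, counts):
    # ST_AGR_2: add the aggregated counts to the cumulative values held in
    # the statistical information registers STRn_0 to STRn_39.
    statistics_registers += counts
    return statistics_registers

# Repeating st_agr_1/st_agr_2 for every operation result of the K mini-batches
# leaves the histogram of most-significant-bit positions in the registers.
```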
By the hardware circuits of the statistical information acquisition unit ST_AC and the statistical information aggregators ST_AGR_1 and ST_AGR_2 provided in the operation unit illustrated in
In addition to the distribution of positions of the most significant bits that are unsigned, the distribution of the least significant bits that are non-zero may be acquired by a hardware circuit of the NN processor 43 in a manner similar to the above. Moreover, the maximum value of positions of the most significant bits that are unsigned and the minimum value of the positions of the least significant bits that are non-zero may be acquired similarly.
Since the statistical information can be acquired by the hardware circuit of the NN processor 43, adjustment of the fixed-point position of operation result data in learning can be implemented with a slight increase in the number of steps.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.