The invention relates to the technologies of artificial intelligence accelerators, and more specifically, to an artificial intelligence accelerator that includes split input bits and split weight blocks.
Applications of an artificial intelligence accelerator include, for example, functioning as a filter that identifies a matching degree between a pattern represented by input data and a known pattern. For example, the artificial intelligence accelerator identifies whether a photographed image includes an eye, a nose, a face, or other information.
Data to be processed by the artificial intelligence accelerator is, for example, data of all pixels of an image. To be specific, its input data includes a large number of bits. After the data is input in parallel, a comparative operation is performed against various patterns stored in the artificial intelligence accelerator. The patterns are stored in a large number of memory cells in a weighted manner. The architecture of the memory cells is a 3D architecture that includes a plurality of 2D memory cell array layers. Each layer represents a characteristic pattern, which is stored in the memory cell array layer as weight values. The memory cell array layer to be processed is selected sequentially under control of a word line. The data is input through bit lines. A convolution operation is performed on the input data and a memory cell array layer to obtain a matching degree of the characteristic pattern corresponding to this memory cell array layer.
The artificial intelligence accelerator needs to handle a large amount of computation. If a plurality of memory cell array layers are integrated in one unit and processed on a per-bit basis, the overall circuit becomes very large, so that the operation speed is lower and more energy is consumed. Because the artificial intelligence accelerator must filter and recognize the content of an input image at high speed, the operation speed of a single-circuit chip generally needs to be further improved.
Embodiments of the invention provide an artificial intelligence accelerator. The artificial intelligence accelerator includes split input bits and split weight blocks. Through a shifting and adding operation, parallel operated values are combined to restore an expected operation result of a single chip, thereby effectively improving a processing speed of the artificial intelligence accelerator and reducing power consumption.
In an embodiment, the invention provides an artificial intelligence accelerator, configured to receive a binary input data set and a selected layer of a plurality of layers of an overall weight pattern to perform a convolution operation. The input data set is divided into a plurality of data subsets. The artificial intelligence accelerator includes a plurality of processing tiles and a summation output circuit. Each of the processing tiles includes a receive-end component, a weight storage unit, and a block-wise output circuit. The receive-end component is configured to receive one of the data subsets. The weight storage unit is configured to store a part of the overall weight pattern, where the partial weight storage unit includes a plurality of weight blocks, and each of the weight blocks stores a block part of the partial weight pattern in order of bits. A cell array structure of the weight storage unit, with respect to a corresponding one of the data subsets, is configured to perform a convolution operation on the data subset with each block part respectively to obtain a plurality of sequential weight operation values. The block-wise output circuit includes a plurality of shifters and a plurality of adders, and is configured to sum up the plurality of weight operation values through a multistage shifting and adding operation, so as to obtain a weight output value expected from a direct convolution operation performed on the data subset with the partial weight pattern. The summation output circuit includes a plurality of shifters and a plurality of adders, and is configured to sum up the plurality of weight output values through a multistage shifting and adding operation, so as to obtain a sum value expected from a direct convolution operation performed on the input data set with the overall weight pattern.
In an embodiment, for the artificial intelligence accelerator, the input data set includes i bits, and is divided into p data subsets, i and p are integers, and each of the data subsets includes i/p bits.
In an embodiment, for the artificial intelligence accelerator, the input data set includes i bits, and the quantity of the plurality of processing tiles is p, the input data set is divided into p data subsets, i and p are integers greater than or equal to 2, i is greater than p, and each of the data subsets includes i/p bits.
In an embodiment, for the artificial intelligence accelerator, the quantity of the plurality of weight blocks included in the partial weight storage unit is q, q is an integer greater than or equal to 2, the partial weight storage unit includes j bits, j and q are integers greater than or equal to 2, j is greater than q, and each of the weight blocks includes j/q memory cells.
In an embodiment, for the artificial intelligence accelerator, the block-wise output circuit includes at least one shifter and at least one adder in each stage of the shifting and adding operation. In each stage, every two adjacent input values form one processing unit: the input value in the higher bit location passes through the shifter and is then added by the adder to the input value in the lower bit location, and the result is output to the next stage. In the last stage, a single value is output and used as the weight output value corresponding to the processing tile.
In an embodiment, for the artificial intelligence accelerator, a shift amount of the shifter in a first stage is j/q memory cells, and a shift amount of the shifter in a next stage is twice that of the shifter in a previous stage.
In an embodiment, for the artificial intelligence accelerator, the summation output circuit includes at least one shifter and at least one adder in each stage of the shifting and adding operation. In each stage, every two adjacent input values form one processing unit: the input value in the higher bit location passes through the shifter and is then added by the adder to the input value in the lower bit location, and the result is output to the next stage. In the last stage, a single value is output and used as the sum value.
In an embodiment, for the artificial intelligence accelerator, a shift amount of the shifter in a first stage is i/p bits, and a shift amount of the shifter in a next stage is twice that of the shifter in a previous stage.
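The multistage shifting and adding operation described in the preceding embodiments can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the circuit itself: the function name shift_add_tree is hypothetical, the number of input values is assumed to be a power of two so that every stage pairs up cleanly, and values[0] is assumed to hold the lowest bit location.

```python
def shift_add_tree(values, first_shift):
    """Combine partial values pairwise, stage by stage.

    values[0] is assumed to be the lowest bit location (an assumption of
    this sketch). first_shift is j/q for the block-wise output circuit or
    i/p for the summation output circuit.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of inputs")
    shift = first_shift
    while len(values) > 1:
        # Each pair is one processing unit: the higher value is shifted,
        # then added to the lower value.
        values = [values[k] + (values[k + 1] << shift)
                  for k in range(0, len(values), 2)]
        shift *= 2  # the next stage shifts twice as far
    return values[0]


# Four 4-bit parts, lowest first, recombined into one 16-bit value.
combined = shift_add_tree([0xA, 0x5, 0x9, 0x1], 4)
```

With a first-stage shift of 4, the first stage produces 0x5A and 0x19, and the second stage, shifting by 8, produces 0x195A.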
In an embodiment, the artificial intelligence accelerator further includes: a normalization processing circuit, configured to normalize the sum value to obtain a normalization sum value; and a quantization processing circuit, configured to quantize the normalization sum value into an integer value by using a base number.
In an embodiment, for the artificial intelligence accelerator, the processing circuit includes a plurality of sense amplifiers, respectively configured to sense each block part to perform a convolution operation to obtain a plurality of sensed values as the plurality of weight operation values.
In an embodiment, the invention further provides a processing method applied to an artificial intelligence accelerator. The artificial intelligence accelerator receives a binary input data set and a selected layer of a plurality of layers of an overall weight pattern to perform a convolution operation, where the input data set is divided into a plurality of data subsets. The processing method includes: using a plurality of processing tiles, where each of the processing tiles includes operations of: using a receive-end component to receive one of the data subsets; using a weight storage unit to store a part of the overall weight pattern, where the partial weight storage unit includes a plurality of weight blocks, and each of the weight blocks stores a block part of the partial weight pattern in order of bits, wherein a cell array structure of the weight storage unit, with respect to a corresponding one of the data sets, is configured to perform a convolution operation on the data subset with each block part respectively to obtain a plurality of sequential weight operation values; using a block-wise output circuit that includes a plurality of shifters and a plurality of adders to sum up the plurality of weight operation values through a multistage shifting and adding operation, so as to obtain a weight output value expected from a direct convolution operation performed on the data subset with the partial weight pattern; and using a summation output circuit that includes a plurality of shifters and a plurality of adders to sum up the plurality of weight output values through a multistage shifting and adding operation, so as to obtain a sum value expected from a direct convolution operation performed on the input data set with the overall weight pattern.
In an embodiment, for the processing method of the artificial intelligence accelerator, the input data set includes i bits, and is divided into p data subsets, i and p are integers, and each of the data subsets includes i/p bits.
In an embodiment, for the processing method of the artificial intelligence accelerator, the input data set includes i bits, and the quantity of the plurality of processing tiles is p, the input data set is divided into p data subsets, i and p are integers greater than or equal to 2, i is greater than p, and each of the data subsets includes i/p bits.
In an embodiment, for the processing method of the artificial intelligence accelerator, the quantity of the plurality of weight blocks included in the partial weight storage unit is q, q is an integer greater than or equal to 2, the partial weight storage unit includes j bits, j and q are integers greater than or equal to 2, j is greater than q, and each of the weight blocks includes j/q memory cells.
In an embodiment, for the processing method of the artificial intelligence accelerator, an operation of the block-wise output circuit includes using at least one shifter and at least one adder in each stage of the shifting and adding operation. In each stage, every two adjacent input values form one processing unit: the input value in the higher bit location passes through the shifter and is then added by the adder to the input value in the lower bit location, and the result is output to the next stage. In the last stage, a single value is output and used as the weight output value corresponding to the processing tile.
In an embodiment, for the processing method of the artificial intelligence accelerator, a shift amount of the shifter in a first stage is j/q memory cells, and a shift amount of the shifter in a next stage is twice that of the shifter in a previous stage.
In an embodiment, for the processing method of the artificial intelligence accelerator, an operation of the summation output circuit includes using at least one shifter and at least one adder in each stage of the shifting and adding operation. In each stage, every two adjacent input values form one processing unit: the input value in the higher bit location passes through the shifter and is then added by the adder to the input value in the lower bit location, and the result is output to the next stage. In the last stage, a single value is output and used as the sum value.
In an embodiment, for the processing method of the artificial intelligence accelerator, a shift amount of the shifter in a first stage is i/p bits, and a shift amount of the shifter in a next stage is twice that of the shifter in a previous stage.
In an embodiment, the processing method of the artificial intelligence accelerator further includes: using a normalization processing circuit to normalize the sum value to obtain a normalization sum value; and using a quantization processing circuit to quantize the normalization sum value into an integer value by using a base number.
In an embodiment, for the processing method of the artificial intelligence accelerator, the processing circuit includes a plurality of sense amplifiers, respectively configured to sense each block part to perform a convolution operation to obtain a plurality of sensed values as the plurality of weight operation values.
To make the features and advantages of the invention clear and easy to understand, the following gives a detailed description of embodiments with reference to accompanying drawings.
Embodiments of the invention provide an artificial intelligence accelerator that includes split input bits and split weight blocks. With the split input bits being parallel to the split weight blocks, parallel operated values are combined through a shifting and adding operation to restore an expected operation result of a single chip, thereby effectively improving a processing speed of the artificial intelligence accelerator and reducing power consumption.
Several embodiments are provided below to describe the invention, but the invention is not limited to the embodiments.
Through a routing arrangement of a cell array structure 56 with respect to the input data of the artificial intelligence accelerator 20, a weight pattern stored in the memory cells may be subjected to a convolution operation together with input data 50 received and converted by a receive-end component 52. For example, the convolution operation is generally a multiplication operation on a matrix to obtain an output value. Output data 58 is obtained by performing a convolution operation on a weight pattern layer through the cell array structure 56. The convolution operation may be performed in any manner usual in the art, without specific limitation, and is not described in further detail in the embodiments. The output data 58 may represent a matching degree between the input data 50 and the weight pattern. In terms of function, each weight pattern layer is similar to a filtering layer for an object, and implements recognition by determining the matching degree between the input data 50 and the weight pattern.
A direct convolution operation may be performed by using a single bit and a single weight one by one. However, because the amount of data to be processed is very large, an overall memory unit is very large and constitutes a considerably large processing chip. The speed of operation may be relatively slow. In addition, power (heat) consumption generated by operation of a large-sized chip is also relatively large. Expected functions of the artificial intelligence accelerator require a relatively high recognition speed and lower power consumption of operation.
According to the architecture in
With respect to a splitting manner in
Each of the input data subsets 102_1, 102_2, . . . is subjected to a convolution operation performed by a corresponding one of the processing tiles 100_1, 100_2, . . . . The convolution operation of the processing tiles 100_1, 100_2, . . . is a part of the overall convolution operation. Each of the input data subsets 102_1, 102_2, . . . received by each corresponding processing tile 100_1, 100_2, . . . is processed respectively in parallel. Through the receive-end component 66, the input data subsets 102_1, 102_2, . . . enter memory cells associated with a memory unit 90.
In an embodiment, the quantity of memory cells storing weight values in a row is, for example, j, where j is a large integer. That is to say, there are j memory cells corresponding to one bit line. Each memory cell stores a weight value. Herein, a memory cell row may also be referred to as a selection line. In an embodiment, the j memory cells may be split into, for example, q weight blocks 92. In an embodiment where j is divisible by q, one weight block includes j/q memory cells. From an output-side perspective, each memory cell is equivalent to one bit of a binary string. In order of weights, splitting the memory cells numbered 0 to j−1 yields q weight blocks 92.
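The splitting of one row of j memory cells into q weight blocks can be illustrated with a short Python sketch. The helper name split_cells is hypothetical, and the sketch assumes that each cell holds a single bit, that cell 0 is the least significant bit, and that j is divisible by q.

```python
def split_cells(cells, q):
    """Split a row of single-bit cells into q weight blocks.

    cells[0] is assumed to be the least significant bit (an assumption of
    this sketch). Each block is returned as the integer it represents.
    """
    j = len(cells)
    if j % q:
        raise ValueError("this sketch assumes j is divisible by q")
    width = j // q  # j/q memory cells per weight block
    blocks = []
    for k in range(q):
        chunk = cells[k * width:(k + 1) * width]
        blocks.append(sum(bit << n for n, bit in enumerate(chunk)))
    return blocks


# j = 8 cells split into q = 4 blocks of 2 cells each.
row = [1, 0, 1, 1, 0, 0, 1, 0]
blocks = split_cells(row, 4)  # [1, 3, 0, 1]
```

Weighting block k by 2^(k*j/q) and adding the results recovers the value of the whole row, here 1 + 3*4 + 0*16 + 1*64 = 77.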
From the overall convolution operation, a sum value needs to be obtained. The sum value is denoted by Sum, as shown in a formula (1):
Sum=Σa*W (1)
where “a” represents an input data set, and W represents a two-dimensional array of a selected layer of weight in the memory unit.
If the input data set includes, for example, eight bits, it is denoted by a binary string [a_0 a_1 . . . a_7]. The binary string is, for example, [10011010], and corresponds to a decimal value. Similarly, a weight block is also denoted by a bit string. For example, the first weight block includes [W_0 . . . W_{j/q-1}]. Sequentially, the last weight block is denoted by [W_{(q-1)*j/q} . . . W_{j-1}]. Each weight block also represents a decimal value.
In this way, the overall convolution operation is denoted by a formula (2):
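$$\mathrm{Sum} = \left([W_0 \ldots W_{j/q-1}] \cdot 2^{0} + \cdots + [W_{(q-1)j/q} \ldots W_{j-1}] \cdot 2^{j(q-1)/q}\right) \cdot 2^{0} \cdot [a_0 \ldots a_{i/p-1}] + \left([W_0 \ldots W_{j/q-1}] \cdot 2^{0} + \cdots + [W_{(q-1)j/q} \ldots W_{j-1}] \cdot 2^{j(q-1)/q}\right) \cdot 2^{i/p} \cdot [a_{i/p} \ldots a_{2i/p-1}] + \cdots + \left([W_0 \ldots W_{j/q-1}] \cdot 2^{0} + \cdots + [W_{(q-1)j/q} \ldots W_{j-1}] \cdot 2^{j(q-1)/q}\right) \cdot 2^{i(p-1)/p} \cdot [a_{(p-1)i/p} \ldots a_{i-1}] \quad (2)$$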
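Formula (2) can be checked numerically with a small Python sketch. It treats the input data set and the weight row as unsigned integers of i and j bits, which is an assumption made for illustration, and the names split_int and split_convolution are hypothetical. Each pass of the outer loop plays the role of one processing tile, and each pass of the inner loop handles one weight block.

```python
def split_int(value, total_bits, parts):
    """Split an unsigned integer into equal-width parts, lowest part first."""
    if total_bits % parts:
        raise ValueError("this sketch assumes an even split")
    width = total_bits // parts
    mask = (1 << width) - 1
    return [(value >> (k * width)) & mask for k in range(parts)]


def split_convolution(a, w, i, j, p, q):
    """Evaluate formula (2): partial products recombined by shift and add."""
    total = 0
    for m, a_sub in enumerate(split_int(a, i, p)):        # one tile per data subset
        tile_value = 0
        for k, w_blk in enumerate(split_int(w, j, q)):    # one weight block at a time
            tile_value += (a_sub * w_blk) << (k * j // q)  # block-wise shift by k*j/q
        total += tile_value << (m * i // p)                # tile-level shift by m*i/p
    return total


a, w = 154, 183  # two 8-bit example values
result = split_convolution(a, w, i=8, j=8, p=2, q=4)
```

For these example values the split evaluation returns 28182, the same as the direct product 154 * 183, which is the property that formula (2) expresses.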
For a weight pattern stored in a two-dimensional array of i*j shown in
A processing circuit 70 is also disposed for each of the processing tiles 100_1, 100_2, . . . to perform a convolution operation. In addition, a block-wise output circuit 80 is also disposed for the processing tiles 100_1, 100_2, . . . and includes a multistage shifting and adding operation. For parallel zero-stage output data, corresponding data such as [W0 . . . Wj/q-1], . . . is obtained in order of bits (memory cells). A final overall convolution operation result is obtained also by performing a shifting and adding operation between the processing tiles.
In the configuration above, the operation on one weight block in one processing tile needs a storage amount of 2^(i/p+j/q). The whole operation involves p processing tiles, each of which includes q weight blocks, so the total storage amount needed may be reduced to p*q*2^(i/p+j/q).
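The effect of the split on the storage amount can be illustrated with a few lines of Python. The sketch reads the expression as p*q*2^(i/p+j/q), which is an assumption about how the flattened exponent should be interpreted, and the function name storage_amount is hypothetical.

```python
def storage_amount(i, j, p, q):
    """Total storage amount p*q*2^(i/p + j/q) for p tiles of q weight blocks."""
    if i % p or j % q:
        raise ValueError("this sketch assumes i/p and j/q are integers")
    return p * q * 2 ** (i // p + j // q)


split = storage_amount(8, 8, 2, 4)    # 2 tiles, 4 weight blocks each
unsplit = storage_amount(8, 8, 1, 1)  # same expression with p = q = 1
```

For i = j = 8, the split case with p = 2 and q = 4 gives 512, while setting p = q = 1 in the same expression gives 65536.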
The following describes in detail how to obtain an overall operation result based on split weight blocks and split processing tiles.
It should be noted that weight blocks of one weight pattern layer may also be distributed onto a plurality of different processing tiles based on planning and combination of the weight blocks. To be specific, the weight blocks stored in one processing tile need not belong to the same layer of weight data, and the weight blocks of one weight data layer may be distributed to a plurality of processing tiles. Therefore, the processing tiles may be operated in parallel. That is, each of the plurality of processing tiles performs operations only for the block layers to be processed, and the operation data of the same layer is then combined.
The following describes a shifting and adding operation in which a plurality of processing tiles is integrated.
Similar to the scenario in
The sum value (Sum) at this stage is a preliminary value. In practical applications, the sum value needs to be normalized. For example, a normalization circuit 400 normalizes the sum value to obtain a normalization sum value. The normalization circuit includes, for example, an operation of a formula (3):
where a constant α 404 is a scaling value that is first applied to the sum value (Sum) through a multiplier 402, after which an offset β 408 is added through an adder 406.
The normalization sum value is processed by a quantization circuit 500, where the normalization sum value is quantized by a divider 502 that divides it by a base number d 504, as shown in a formula (4):
where 0.5 represents a rounding-off operation. Generally, the more the input data set matches a characteristic pattern of this layer, the larger the quantization value a′ thereof will be.
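The normalization and quantization steps can be sketched in Python as follows. Because formulas (3) and (4) are not reproduced in the text here, the forms used below are assumptions inferred from the description: scaling by α followed by adding β for normalization, and division by d followed by adding 0.5 and truncating for quantization. The function names normalize and quantize are hypothetical.

```python
import math


def normalize(sum_value, alpha, beta):
    """Assumed form of formula (3): scale by alpha, then add the offset beta."""
    return alpha * sum_value + beta


def quantize(normalized, d):
    """Assumed form of formula (4): divide by the base number d and round off.

    Adding 0.5 before taking the floor is the assumed rounding-off step.
    """
    return math.floor(normalized / d + 0.5)


normalized = normalize(100, 0.5, 4)  # 54.0
quantized = quantize(normalized, 8)  # floor(6.75 + 0.5) = 7
```

The example values for α, β, and d are arbitrary and chosen only to make the arithmetic easy to follow.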
After completion of the convolution operation for one weight pattern layer, a convolution operation for a next weight pattern layer is selected by using a word line.
An embodiment of the invention further provides a processing method of an artificial intelligence accelerator.
Referring to
Based on the foregoing, in the embodiment of the invention, the weight data of the memory unit is split and subjected to a convolution operation performed by a plurality of processing tiles. In addition, the memory unit of each processing tile is also split into a plurality of weight blocks that are processed respectively. Thereafter, a final overall value may be obtained through a shifting and adding operation. Because the circuit of each processing tile is relatively small, the operation speed can be increased, and the energy consumed (for example, heat generated) during the processing of the processing tile can be reduced.
Although the invention has been described with reference to the above embodiments, the embodiments are not intended to limit the invention. A person of ordinary skill in the art may make variations and improvements without departing from the spirit and scope of the invention. Therefore, the protection scope of the invention should be subject to the appended claims.