The present disclosure relates to the technical field of data processing, and in particular, to a data processing device, a data processing method, and a chip.
In various scientific computations and engineering applications, the multiply-accumulate (MAC) operation is fundamental. It multiplies two operands and adds the product to an accumulated value or a further operand to obtain the final result. In the traditional technology, MAC operations are primarily implemented using multiplying accumulators within data processing devices. One widely used form of MAC operation is convolution, which involves input features, weights, biases, and output features. In traditional convolution, the input features are multiplied by a set of weights, the products are accumulated, and a bias term is added to produce the output features.
The present disclosure provides a data processing device, a data processing method, and a chip to enhance the computational power of data processing devices when handling MAC operations.
A first aspect of the present disclosure provides a data processing device. The data processing device includes a multiplying accumulator. The multiplying accumulator is configured to obtain an input tensor and a sparse weight tensor. The sparse weight tensor is obtained by performing a sparse processing on at least one of a first dimension and a second dimension of an original weight tensor. The multiplying accumulator is further configured to perform a multiply-accumulate (MAC) operation on the sparse weight tensor and the input tensor.
In one embodiment of the first aspect, the input tensor includes data from an original input tensor corresponding to reserved positions of the sparse processing.
In one embodiment of the first aspect, the data processing device further includes an input circuit. The input circuit is configured to obtain a position index that contains reserved positions of the sparse processing; and read, based on the position index, data from an original input tensor corresponding to the reserved positions as the input tensor.
In one embodiment of the first aspect, the data processing device further includes an input circuit. The input circuit is configured to read an original input tensor as the input tensor. The multiplying accumulator is configured to obtain a position index that contains reserved positions of the sparse processing; and perform, based on the position index, a MAC operation on data from the input tensor corresponding to the reserved positions and the sparse weight tensor.
In one embodiment of the first aspect, the data processing device further includes a controller. The controller is configured to perform the sparse processing on the original weight tensor to obtain the sparse weight tensor.
In one embodiment of the first aspect, the input circuit is configured to read the sparse weight tensor, and the sparse weight tensor is obtained by performing the sparse processing on the original weight tensor.
In one embodiment of the first aspect, the sparse weight tensor is obtained by performing a first sparse processing on the original weight tensor in the first dimension, and wherein the first sparse processing adopts a first sparse granularity.
In one embodiment of the first aspect, the first sparse granularity is m:n, and every set of n data points of the original weight tensor in the first dimension is reduced into m data points through sparsification, wherein m and n are both positive integers and n>m.
In one embodiment of the first aspect, the first sparse processing uses m×ceil(log2 n) bits of position indexes to represent sparse positions, and each data point obtained through sparsification uses ceil(log2 n) bits of position indexes to represent its position, wherein ceil is a round-up function.
In one embodiment of the first aspect, the first sparse processing uses ceil(log2(C(n, m))) bits to represent C(n, m) sparsification manners, and selects one of the C(n, m) sparsification manners for execution, wherein ceil is a round-up function and C(n, m) is the number of combinations of choosing m reserved positions out of n positions.
In one embodiment of the first aspect, the first sparse processing uses ceil(log2 N) bits to represent N sparsification manners, and selects one of the N sparsification manners for execution, wherein N is an integer smaller than C(n, m), ceil is a round-up function, and the N sparsification manners are selected from the C(n, m) sparsification manners.
In one embodiment of the first aspect, the sparse weight tensor is obtained by performing a second sparse processing on the original weight tensor in the second dimension, and wherein the second sparse processing adopts a second sparse granularity.
In one embodiment of the first aspect, the second sparse granularity is s:r, and every set of r data points of the original weight tensor in the second dimension is reduced into s data points through sparsification, wherein r and s are both positive integers and r>s.
In one embodiment of the first aspect, an intermediate tensor is obtained by performing one of a first sparse processing and a second sparse processing on the original weight tensor in one of the first dimension and the second dimension, and the sparse weight tensor is obtained by performing the other of the first sparse processing and the second sparse processing on the intermediate tensor in the other of the first dimension and the second dimension, wherein the first sparse processing adopts a first sparse granularity, and the second sparse processing adopts a second sparse granularity.
In one embodiment of the first aspect, the first dimension is a channel direction of the original weight tensor, and the second dimension is a kernel direction of the original weight tensor.
In one embodiment of the first aspect, the sparse weight tensor is obtained by further performing a third sparse processing on the original weight tensor in a third dimension, and the third sparse processing adopts a third sparse granularity.
A second aspect of the present disclosure provides a data processing method. The data processing method includes obtaining an input tensor and a sparse weight tensor, wherein the sparse weight tensor is obtained by performing a sparse processing on at least one of a first dimension and a second dimension of an original weight tensor; and performing a MAC operation on the sparse weight tensor and the input tensor.
A third aspect of the present disclosure provides a chip. The chip includes a data processing device. The data processing device comprises: a multiplying accumulator, configured to: obtain an input tensor and a sparse weight tensor, wherein the sparse weight tensor is obtained by performing a sparse processing on at least one of a first dimension and a second dimension of an original weight tensor; and perform a MAC operation on the sparse weight tensor and the input tensor.
In the present disclosure, the multiplying accumulator performs the MAC operation using the sparse weight tensor and the input tensor, obtaining a MAC result of the original input tensor and the original weight tensor. The sparse weight tensor is obtained by performing the sparse processing on the original weight tensor, resulting in fewer data points within the sparse weight tensor compared to the original weight tensor. Compared to directly using the original input tensor and original weight tensor for MAC operations, the present disclosure reduces the computational workload. As a result, the data processing device's overall computational power is enhanced without increasing the number of multiplying accumulators.
Furthermore, the data processing device of the present disclosure achieves a smaller design footprint and lower power consumption compared to related technologies while maintaining consistent computational performance.
Additionally, in the present disclosure, the multiplying accumulator only needs to read the sparse weight tensor and index, reducing the weight read bandwidth.
Lastly, the flexibility to configure different sparse parameters makes the data processing device of the present disclosure suitable for various application scenarios.
The embodiments of the present disclosure will be described below. Those skilled in the art can easily understand the advantages and effects of the present disclosure from the contents disclosed in this specification. The present disclosure can also be implemented or applied through other different specific embodiments. Various details in this specification can also be modified or changed based on different viewpoints and applications without departing from the spirit of the present disclosure. It should be noted that the following embodiments and the features of the following embodiments can be combined with each other if no conflict will result.
It should be noted that the drawings provided in this disclosure only illustrate the basic concept of the present disclosure in a schematic way, so the drawings only show the components closely related to the present disclosure. The drawings are not necessarily drawn according to the number, shape and size of the components in actual implementation; during the actual implementation, the type, quantity and proportion of each component can be changed as needed, and the components' layout may also be more complicated.
In practical applications, a class of widely used operations known as multiply-accumulate, or MAC, operations multiplies tensors by various weights and then sums the products. These operations include matrix multiplication, convolution, and deconvolution. Taking convolution as an example, a traditional convolution operation involves input features, weights, biases, and output features. During this operation, the input features are multiplied by a set of weights, the products are accumulated, and a bias term is added to produce the output features. The specific formula can be expressed as follows.
out(Ni, Cout_j) = bias(Cout_j) + Σ_{k=0}^{Cin−1} weight(Cout_j, k) ⋆ input(Ni, k),

wherein out(Ni, Cout_j) denotes the output feature of the j-th output channel for the i-th sample, bias(Cout_j) denotes the bias of the j-th output channel, weight(Cout_j, k) denotes the weight kernel connecting the k-th input channel to the j-th output channel, input(Ni, k) denotes the input feature of the k-th input channel for the i-th sample, Cin denotes the number of input channels, and ⋆ denotes the sliding-window multiply-accumulate (cross-correlation) operation.
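For illustration only, the formula above can be modelled with the following minimal NumPy sketch (the function name, shapes, and stride handling are assumptions made for this example and are not part of the disclosure):

```python
import numpy as np

def conv2d_mac(inp, weight, bias, stride=1):
    """Direct multiply-accumulate convolution.

    inp:    (N, C_in, H, W)          input features
    weight: (C_out, C_in, K_h, K_w)  weights
    bias:   (C_out,)                 bias per output kernel
    """
    N, C_in, H, W = inp.shape
    C_out, _, K_h, K_w = weight.shape
    H_out = (H - K_h) // stride + 1
    W_out = (W - K_w) // stride + 1
    out = np.zeros((N, C_out, H_out, W_out))
    for n in range(N):
        for co in range(C_out):
            acc = np.full((H_out, W_out), float(bias[co]))
            for ci in range(C_in):                      # accumulate over input channels
                for i in range(K_h):
                    for j in range(K_w):
                        patch = inp[n, ci,
                                    i:i + stride * H_out:stride,
                                    j:j + stride * W_out:stride]
                        acc += weight[co, ci, i, j] * patch   # multiply-accumulate
            out[n, co] = acc
    return out
```

Every output feature is therefore a long chain of multiply-accumulate steps, which is why the number of MAC operations grows quickly with the tensor sizes.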
In the field of artificial intelligence (AI), convolutional operations are typically implemented using the multiplying accumulators within neural-network processing units (NPUs). However, as AI technology becomes more widespread and algorithms continue to evolve, the computational demands on NPUs are increasing. The area and power consumption required by NPUs are also growing.
In view of the above-mentioned shortcomings, the present disclosure provides a data processing solution for NPUs. Within the data processing device, the multiplying accumulator performs a MAC operation using a sparse weight tensor and an input tensor, obtaining a MAC result of an original input tensor and an original weight tensor. The input tensor can include various types of data, including image tensors, video tensors, audio signal tensors, and/or text tensors used in natural language processing. The corresponding output tensor is a feature map or feature representation obtained after processing the input tensor. The sparse weight tensor is obtained by performing sparse processing on the original weight tensor, resulting in fewer data points within the sparse weight tensor compared to the original weight tensor. Consequently, compared to directly using the original input tensor and original weight tensor for MAC operations, the present disclosure reduces the computational workload. As a result, the data processing device's overall computational power is enhanced without increasing the number of multiplying accumulators.
In an embodiment of the present disclosure, in order to increase a computational power of the multiplying accumulator 11, the multiplying accumulator 11 is configured to perform a data processing method including the following steps S21 and S22.
Step S21 includes obtaining an input tensor and a sparse weight tensor. The sparse weight tensor is obtained by performing a sparse processing on at least one of a first dimension and a second dimension of an original weight tensor, resulting in fewer data points within the sparse weight tensor compared to the original weight tensor. The first dimension and the second dimension are different.
Step S22 includes performing a MAC operation on the sparse weight tensor and the input tensor to obtain a MAC result of an original input tensor and the original weight tensor.
In some possible embodiments, the input tensor includes data from the original input tensor corresponding to reserved positions of the sparse processing. Specifically, the data from the original input tensor corresponding to the reserved positions of the sparse processing is reserved, and the data at the other positions is discarded, thereby obtaining the input tensor.
In some possible embodiments, the data processing device 1 may further include an output circuit 14 and a cache 15. The cache 15 is configured to store the input tensor, the sparse weight tensor, and the MAC result. The multiplying accumulator 11 is configured to read the input tensor and the sparse weight tensor from the cache 15 to perform a MAC operation, and store the MAC result into the cache 15. The output circuit 14 is configured to output the MAC result in the cache 15.
According to an embodiment of the present disclosure, the data processing device 1 may further include an input circuit 12. For example, the input circuit 12 may perform data reading by read direct memory access (RDMA) technologies. In an embodiment of the present disclosure, the input circuit 12 is configured to perform the following steps S31 and S32.
Step S31 includes obtaining a position index. The position index includes reserved positions of the sparse processing. In some possible embodiments, the position index may be represented by a position index matrix. In the position index matrix, element 1 represents a reserved position, and element 0 represents a sparse position.
Step S32 includes reading, based on the position index, the data from the original input tensor corresponding to the reserved positions as the input tensor.
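As a simple illustration of steps S31 and S32 (the 2:4 grouping, tensor shapes, and variable names below are assumptions made for this sketch):

```python
import numpy as np

# Step S31: obtain a position index for one group of n = 4 input channels;
# element 1 marks a reserved position, element 0 marks a sparse position.
position_index = np.array([0, 1, 1, 0])

# Original input tensor for this group: (C_in, H, W) with C_in = 4.
original_input = np.arange(4 * 2 * 2).reshape(4, 2, 2)

# Step S32: read only the data at the reserved positions as the input tensor.
reserved = np.flatnonzero(position_index)   # -> array([1, 2])
input_tensor = original_input[reserved]     # shape (2, 2, 2)

print(input_tensor.shape)   # only half of the input data has to be read and cached
```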
Next, the above process will be described in detail through a specific embodiment.
In some possible embodiments, the data processing device 1 may be configured to implement a convolution operation.
It can be seen from the above description that, in the embodiments of the present disclosure, the input tensor contains less data than the original input tensor, which helps to reduce the volume of data stored in the cache 15, improve the utilization rate of the cache 15, and lessen the read bandwidth of the input tensor.
According to an embodiment of the present disclosure, the input circuit 12 is configured to read the original input tensor as the input tensor. The multiplying accumulator 11 is configured to perform the following steps S41 and S42.
Step S41 includes obtaining a position index. The position index includes reserved positions of the sparse processing.
Step S42 includes performing, based on the position index, a MAC operation on data from the input tensor corresponding to the reserved positions and the sparse weight tensor. Specifically, the multiplying accumulator 11 may obtain the sparse positions based on the position index. Based on this, in step S42, the multiplying accumulator 11 may read the data from the input tensor corresponding to the reserved positions, skip the data corresponding to the sparse positions based on the position index, and perform a MAC operation on the data read by the multiplying accumulator 11 and the sparse weight tensor. In this way, the multiplying accumulator 11 equivalently completes the amount of computation required before sparsification while performing fewer operations, which increases the effective computational power of the multiplying accumulator 11.
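A minimal sketch of step S42 is given below; the 2:4 group layout, the function name, and the concrete numbers are assumptions made for illustration:

```python
import numpy as np

def sparse_mac(full_input, sparse_weight, position_index):
    """Accumulate weight[k] * input[pos_k] over the reserved positions only.

    full_input:     (n,)  dense data read by the input circuit (here n = 4)
    sparse_weight:  (m,)  weights kept after sparsification (here m = 2)
    position_index: (m,)  position of each kept weight inside the group
    """
    acc = 0.0
    for w, pos in zip(sparse_weight, position_index):
        acc += w * full_input[pos]          # data at sparse positions is skipped
    return acc

# One 2:4 group: the weights at positions 1 and 2 were reserved.
result = sparse_mac(np.array([5.0, 6.0, 7.0, 8.0]),
                    np.array([0.5, -1.0]),
                    np.array([1, 2]))
print(result)   # 0.5 * 6.0 + (-1.0) * 7.0 = -4.0
```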
According to an embodiment of the present disclosure, the input circuit 12 is further configured to read the sparse weight tensor, and write the sparse weight tensor into the cache 15. The sparse weight tensor is obtained by performing the sparse processing on the original weight tensor. For example, in some possible embodiments, the sparse weight tensor may be obtained by performing an off-line sparse processing on the original weight tensor. Specifically, when the MAC operation is performed by using the original input tensor and the original weight tensor, redundant data exists in the original weight tensor. In the embodiments of the present disclosure, off-line tools and sparse algorithms can be used to perform a de-redundancy operation on the original weight tensor based on the pre-trained model, so as to obtain the sparse weight tensor and ensure that a MAC result obtained by using the sparse weight tensor and the input tensor is substantially the same as an operation result of the original weight tensor and the original input tensor. In addition, the off-line tools can be further used to obtain the position index.
In some possible embodiments, off-line tools can be used to determine a sparse granularity through the sparse algorithms. Weights that need to be sparse and weights that need to be reserved in the original weight tensor can be determined based on the sparse granularity. The sparse weight tensor is obtained by performing the sparse processing on the original weight tensor.
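One common sparse algorithm is magnitude-based pruning. The following sketch is an assumed illustration of such an off-line flow (the disclosure does not fix a particular algorithm): it reserves the m = 2 largest-magnitude weights in every group of n = 4 along one dimension and records their positions as the position index.

```python
import numpy as np

def prune_m_of_n(weights, m=2, n=4):
    """Off-line m:n sparsification along the last axis.

    weights: (..., C) with C divisible by n.
    Returns the sparse weight tensor (..., C * m // n) and the position
    index of each reserved weight inside its group of n.
    """
    groups = weights.reshape(*weights.shape[:-1], -1, n)     # (..., C//n, n)
    order = np.argsort(-np.abs(groups), axis=-1)[..., :m]    # largest m by magnitude
    keep = np.sort(order, axis=-1)                           # keep original ordering
    sparse = np.take_along_axis(groups, keep, axis=-1)
    return sparse.reshape(*weights.shape[:-1], -1), keep

w = np.array([[0.1, -0.9, 0.05, 0.4, 0.7, 0.0, -0.2, 0.3]])
sparse_w, index = prune_m_of_n(w)
print(sparse_w)   # [[-0.9  0.4  0.7  0.3]]
print(index)      # positions of the reserved weights inside each group of 4
```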
It should be noted that, the obtaining of the sparse weight tensor through off-line tools is merely illustrative, and the present disclosure is not limited thereto. In some other embodiments, the sparse weight tensor may also be obtained in other manners.
According to an embodiment of the present disclosure, the data processing device 1 further includes a controller 13. The controller 13 is configured to perform the sparse processing on the original weight tensor to obtain the sparse weight tensor.
According to an embodiment of the present disclosure, the sparse weight tensor is obtained by performing a first sparse processing on the original weight tensor in the first dimension, and the first sparse processing adopts a first sparse granularity. In the convolution operation, the first dimension may be an output kernel quantity, an input channel, a kernel height, or a kernel width. For example, for an original weight tensor whose key features are evenly distributed across the input channels, the first sparse processing may be performed in the input-channel dimension, and the first sparse granularity may be determined based on the channel size and the distribution of information within the channels.
In some possible embodiments, the first sparse granularity is m:n. Every set of n data points of the original weight tensor in the first dimension is reduced into m data points through sparsification, and m and n are both positive integers and n>m.
Next, the first sparse processing will be described by a specific embodiment. The channel direction (also called channel dimension) is configured as the first dimension, and the first sparse granularity is configured to be 2:4 (i.e., n=4, m=2).
In some possible embodiments, the first sparse processing uses m×ceil(log2 n) bits of position indexes to represent sparse positions, and each data point obtained through sparsification uses ceil(log2 n) bits of position indexes to represent its position, wherein ceil is a round-up function. For example, if n=4, ceil(log2 n)=2, and if n=5, ceil(log2 n)=3. Specifically, in order to restore the sparse positions, the position of each reserved weight will be stored. Each position of the reserved weights needs to be stored by using ceil(log2 n) bits of position indexes, and the first sparse processing therefore needs to store m×ceil(log2 n) bits of position indexes per group. Taking the 2:4 sparse processing described above as an example, each reserved weight uses ceil(log2 4)=2 bits to represent its position within its group of 4, and each group requires 2×2=4 bits of position indexes in total.
In some other possible embodiments, the first sparse processing uses ceil(log2(C(n, m))) bits to represent C(n, m) sparsification manners, and selects one of the C(n, m) sparsification manners for execution, wherein ceil is a round-up function. Specifically, every set of n data points is reduced into m data points through sparsification, there are a total of C(n, m) sparsification manners, and each of the C(n, m) sparsification manners corresponds to one combination of reserved positions and sparse positions. Taking the 2:4 sparse processing described above as an example, there are C(4, 2)=6 sparsification manners, so ceil(log2 6)=3 bits per group are sufficient to indicate which manner is used.
In some other possible embodiments, the first sparse processing uses ceil(log2 N) bits to represent N sparsification manners, and selects one of the N sparsification manners for execution. N is an integer smaller than C(n, m), ceil is a round-up function, and the N sparsification manners are selected from the C(n, m) sparsification manners. Specifically, every set of n data points is reduced into m data points through sparsification, there are a total of C(n, m) sparsification manners, and each of the C(n, m) sparsification manners corresponds to one combination of reserved positions and sparse positions. In these embodiments, the C(n, m) sparsification manners are further screened to remove the C(n, m)−N sparsification manners that account for a low proportion of occurrences, thereby obtaining the N sparsification manners. The screening of the C(n, m) sparsification manners may be performed through off-line analysis manners, but the present disclosure is not limited thereto. The value of N may be an integer power of 2. Taking the 2:4 sparse processing described above as an example, if N=4 of the C(4, 2)=6 sparsification manners are selected, only ceil(log2 4)=2 bits per group are needed to indicate which manner is used.
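The storage cost of the three index encodings described above can be compared with a short computation (the value N = 4 is an assumed example of a power-of-two subset of the C(4, 2) = 6 manners):

```python
from math import ceil, comb, log2

n, m = 4, 2                                  # 2:4 sparse granularity
N = 4                                        # assumed subset of frequently used manners

per_position = m * ceil(log2(n))             # 2 * 2 = 4 bits per group
per_combination = ceil(log2(comb(n, m)))     # ceil(log2 6) = 3 bits per group
per_selected = ceil(log2(N))                 # 2 bits per group

print(per_position, per_combination, per_selected)   # 4 3 2
```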
It can be learned from the foregoing description that in the embodiments of the present disclosure, the sparse weight tensor is obtained by performing the sparse processing on the original weight tensor, and the sparse weight tensor is smaller in size compared to the original weight tensor. For example, in the convolution operation, for a weight of 8×64×3×3 (the four dimensions represent the output kernel quantity, the input channel, the kernel height, and the kernel width, respectively), if the sparse processing is performed in the dimension of the input channel through a sparse granularity of 2:4 (i.e., n=4, m=2), the sparse weight tensor input into the multiplying accumulator 11 is 8×32×3×3 and is halved in size compared to the original weight tensor. For another example, for a weight of 8×64×3×3, if the sparse processing is performed in the dimension of the output kernel quantity (i.e., the kernel direction) with a sparse granularity of 2:4, the sparse weight tensor input into the multiplying accumulator 11 is 4×64×3×3 and is halved in size compared to the original weight tensor. In this way, the computational workload executed by the multiplying accumulator 11 can be reduced, the utilization rate of the multiplying accumulator 11 can be improved, and the weight read bandwidth can be reduced.
According to an embodiment of the present disclosure, the sparse weight tensor is obtained by performing a second sparse processing on the original weight tensor in the second dimension, and the second sparse processing adopts a second sparse granularity. In the convolution operation, the second dimension may be the output kernel quantity, input channel, kernel height, or kernel width.
In some possible embodiments, the second sparse granularity is s:r. Every set of r data points of the original weight tensor in the second dimension is reduced into s data points through sparsification, and r and s are both positive integers and r>s.
It should be noted that the second sparse processing is similar to the first sparse processing, and will not be repeated here.
According to an embodiment of the present disclosure, an intermediate tensor may be obtained by performing the first sparse processing on the original weight tensor in the first dimension, and the sparse weight tensor may be obtained by performing the second sparse processing on the intermediate tensor in the second dimension. In another embodiment, an intermediate tensor may be obtained by performing the first sparse processing on the original weight tensor in the second dimension, and the sparse weight tensor may be obtained by performing the second sparse processing on the intermediate tensor in the first dimension. In yet another embodiment, an intermediate tensor may be obtained by performing the second sparse processing on the original weight tensor in the second dimension, and the sparse weight tensor may be obtained by performing the first sparse processing on the intermediate tensor in the first dimension. In still another embodiment, an intermediate tensor may be obtained by performing the second sparse processing on the original weight tensor in the first dimension, and the sparse weight tensor may be obtained by performing the first sparse processing on the intermediate tensor in the second dimension. The first sparse processing adopts the first sparse granularity, and the second sparse processing adopts the second sparse granularity. The first sparse granularity and the second sparse granularity may be the same as or different from each other.
Next, the above process will be described in detail by a specific embodiment. This specific embodiment takes a convolution operation as an example, the channel dimension is configured as the first dimension, and the dimension of the output kernel quantity is configured as the second dimension. For an original weight tensor of 8×64×3×3, the obtaining of the sparse weight tensor includes: the sparse processing (i.e., the first sparse processing) is performed on the original weight tensor in the channel dimension (i.e., the dimension of the input channel) with a sparse granularity of 2:4 to obtain the intermediate tensor, and then the sparse processing (i.e., the second sparse processing) is performed on the intermediate tensor in the dimension of the output kernel quantity with a sparse granularity of 2:4 to obtain the sparse weight tensor. The sparse weight tensor is 4×32×3×3 and is only one quarter in size compared to the original weight tensor. In this way, the computational workload executed by the multiplying accumulator 11 can be further reduced. In addition, in order to restore the sparse positions, ceil(log2 4)+ceil(log2 4)=4 bits of position indexes need to be stored for each of the reserved weights.
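The shape arithmetic of this two-step sparsification can be checked with the sketch below; the magnitude-based pruning criterion is again an assumption made only so the example runs, and the random tensor stands in for a trained weight:

```python
import numpy as np

rng = np.random.default_rng(0)
original = rng.standard_normal((8, 64, 3, 3))    # (C_out, C_in, K_h, K_w)

def prune_axis(w, axis, m=2, n=4):
    """Keep the m largest-magnitude weights of every n along `axis`."""
    w = np.moveaxis(w, axis, -1)
    g = w.reshape(*w.shape[:-1], -1, n)
    keep = np.sort(np.argsort(-np.abs(g), axis=-1)[..., :m], axis=-1)
    g = np.take_along_axis(g, keep, axis=-1).reshape(*w.shape[:-1], -1)
    return np.moveaxis(g, -1, axis)

intermediate = prune_axis(original, axis=1)      # first sparse processing: 2:4 along input channels
sparse_w = prune_axis(intermediate, axis=0)      # second sparse processing: 2:4 along output kernels

print(intermediate.shape)   # (8, 32, 3, 3)
print(sparse_w.shape)       # (4, 32, 3, 3) -> one quarter of the original size
```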
In some possible embodiments, the second sparse processing may use s×ceil(log2 r) bits of position indexes to represent sparse positions, and each data point obtained through sparsification uses ceil(log2 r) bits of position indexes to represent its position.
In some other possible embodiments, the second sparse processing may use ceil(log2(C(r, s))) bits to represent C(r, s) sparsification manners, and select one of the C(r, s) sparsification manners for execution.
In some other possible embodiments, the second sparse processing may use ceil(log2 M) bits to represent M sparsification manners, and select one of the M sparsification manners for execution, wherein M is an integer smaller than C(r, s), and the M sparsification manners are selected from the C(r, s) sparsification manners.

According to an embodiment of the present disclosure, the channel direction of the original weight tensor may be configured as the first dimension, and the kernel direction of the original weight tensor may be configured as the second dimension.
According to an embodiment of the present disclosure, the sparse weight tensor is obtained by further performing a third sparse processing on the original weight tensor in a third dimension, and the third sparse processing adopts a third sparse granularity. The third dimension is a dimension other than the first dimension and the second dimension. For example, in the convolution operation, the third dimension may be the output kernel quantity, input channel, kernel height, or kernel width.
The embodiments of the present disclosure further provide a non-transitory computer-readable storage medium storing a computer program. When executed, the computer program implements the data processing method described above.
In some possible embodiments, one or more storage media, or any combination thereof, may be used. The storage medium may be a computer-readable signal medium or a computer-readable storage medium. Examples of non-transitory computer-readable storage media include systems, devices, or components associated with electricity, magnetism, light, electromagnetism, infrared, or semiconductors, and any suitable combination thereof. More specific examples of non-transitory computer-readable storage media include electrical connections having one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, and any suitable combination thereof. In the present disclosure, the non-transitory computer-readable storage medium may be any tangible medium containing or storing programs that may be used by or in combination with instruction executing systems, devices, or components.
The embodiments of the present disclosure further provide a chip. The chip includes the data processing device described above. In some embodiments, the chip may be a neural-network processing unit (NPU) chip.
In summary, the multiplying accumulator in the data processing device of the present disclosure performs the MAC operation using the sparse weight tensor and the input tensor, obtaining the MAC result of the original input tensor and the original weight tensor. The sparse weight tensor is obtained by performing the sparse processing on the original weight tensor, resulting in fewer data points within the sparse weight tensor compared to the original weight tensor. Compared to directly using the original input tensor and original weight tensor for MAC operations, the present disclosure reduces the computational workload. As a result, the data processing device's overall computational power is enhanced without increasing the number of multiplying accumulators. Furthermore, the data processing device of the present disclosure achieves a smaller design footprint and lower power consumption compared to related technologies while maintaining consistent computational performance. Additionally, in the present disclosure, the multiplying accumulator only needs to read the sparse weight tensor and index, reducing the weight read bandwidth. Lastly, the flexibility to configure different sparse parameters makes the data processing device of the present disclosure suitable for various application scenarios. Therefore, the present disclosure overcomes various shortcomings of the prior art and has a high industrial value.
The above-mentioned embodiments are merely illustrative of the principles and effects of the present disclosure and are not intended to limit the present disclosure. Those skilled in the art can make modifications or changes to the above-mentioned embodiments without departing from the spirit and scope of the present disclosure. Therefore, all equivalent modifications or changes made by those having common knowledge in the art without departing from the spirit and technical concept disclosed by the present disclosure shall still be covered by the claims of the present disclosure.
Foreign application priority data: Chinese Patent Application No. 2023106920957, filed in June 2023 (CN, national).