This non-provisional application claims priority under 35 U.S.C. § 119(a) on Patent Application No(s). 202011293554.7 filed in China on Nov. 18, 2020, the entire contents of which are hereby incorporated by reference.
This disclosure relates to a convolutional neural network accelerator, and more particularly to a method of dividing data for transmission and merging in a convolution operation using tiled processing.
Convolutional neural networks (CNNs) are now considered one of the most widely used machine learning techniques in computer vision and image processing. Their primary operation is the convolution between kernels (weights) and feature maps (activations), which can consume considerable power through multiply-accumulate (MAC) operations and memory accesses.
Compared with the energy wasted on redundant operations, data access power is arguably more critical for future accelerator designs because memory bandwidth has been growing more slowly than the speed of processing elements (PEs). That is, an algorithm can become increasingly memory bound on future architectures. Newer networks tend to adopt smaller convolution kernels with deeper layers, which further reduces the operation count at the cost of increased memory usage. According to published statistics, as neural network models evolve, accessing the feature map in Dynamic Random Access Memory (DRAM) consumes more power than the other operations.
Current CNN accelerators generally adopt tiled processing; that is, a processing element loads one block from an external storage device for each operation. For example, a data block stored in the external DRAM may be loaded directly, without compression, into a Static Random Access Memory (SRAM) near the processing element as cache data. However, this approach consumes considerable power and memory bandwidth when accessing DRAM to switch the processed data block. Alternatively, the data stored in DRAM may be divided into multiple subtensors of the same size. These subtensors are compressed and then transmitted to the SRAM, where they are decompressed, and the processing element loads the required data block from the SRAM for computation. Compressing the data blocks saves power and memory bandwidth during data transmission. However, if the size of a subtensor is too large, the SRAM may store data that is not used in the current processing, so SRAM storage space is wasted; in addition, extra time is spent decompressing a large subtensor to obtain the complete block data even though only a small amount of data is required. On the other hand, if the size of a subtensor is too small, additional memory bandwidth is required to load a large number of pointers so that the original data block can be decompressed in the correct order.
Accordingly, the present disclosure provides an efficient, hardware-friendly data storage scheme for sparse CNN feature maps. The present disclosure divides data into uneven-sized subtensors and stores them in a compressed yet randomly accessible format using few pointers. This design enables modern CNN accelerators to fetch and decompress subtensors on the fly in a tiled processing manner. The present disclosure is suitable for architectures that favor aligned, coalesced data access, and requires only minimal changes to the overall architectural design.
According to one or more embodiments of this disclosure, a method of transmitting and merging data is adapted to a sender and a receiver that are in communication with each other, wherein the method of transmitting and merging data comprises: a sending stage comprising: transmitting a first block data, a second block data and a third block data to the receiver by the sender; obtaining a fourth block data and a fifth block data by the sender; and transmitting the third block data, the fourth block data and the fifth block data to the receiver by the sender; and a receiving stage comprising: receiving the first block data, the second block data and the third block data by the receiver; merging the first block data, the second block data and the third block data to perform a convolution operation by the receiver; receiving the third block data, the fourth block data and the fifth block data by the receiver; and merging the third block data, the fourth block data and the fifth block data to perform another convolution operation.
In view of the above, the present disclosure proposes an efficient storage scheme for sparse feature maps to reduce external memory bandwidth, which is aligned with the memory access patterns of modern CNN accelerator architectures. Given a specific CNN layer and an accelerator configuration, the present disclosure may convert a sparse tensor into multiple subtensors with different sizes. Existing accelerators can be integrated with the present disclosure to improve overall performance with minimal hardware modification and overhead.
The present disclosure will become more fully understood from the detailed description given hereinbelow and the accompanying drawings which are given by way of illustration only and thus are not limitative of the present disclosure and wherein:
In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the disclosed embodiments. It will be apparent, however, that one or more embodiments may be practiced without these specific details. In other instances, well-known structures and devices are schematically shown in order to simplify the drawings.
The present disclosure is applicable to any field involving convolution. The present disclosure proposes a method of transmitting and merging data, including a method of dividing an input feature map, which prevents accessing partially compressed subtensors and minimizes the number of subtensors to prevent data fragmentation.
As shown in
Step S1 shows that “transmitting a first pointer, a first block data B1, a second block data B2 and a third block data B3 to the receiver by the sender”. In practice, before transmitting the first block data B1, the second block data B2 and the third block data B3 to the receiver by the sender, the method further comprises the step of compressing the first block data B1, the second block data B2 and the third block data B3 by a compressor, so as to reduce the bandwidth occupied during the transmission of the first block data B1, the second block data B2 and the third block data B3. The first pointer is configured to indicate a starting address of the first block data B1, a size of the first block data B1, a size of the second block data B2, and a size of the third block data B3.
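For illustration only, the content of such a pointer can be sketched as a small metadata record; the field names below are hypothetical and not part of the disclosed hardware.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical sketch of the metadata carried by the first pointer:
# one starting address plus the compressed size of each block in the group,
# so that a single pointer is enough to locate B1, B2 and B3.
@dataclass
class GroupPointer:
    start_addr: int                  # starting address of the first block (B1)
    sizes: Tuple[int, int, int]      # compressed sizes of B1, B2 and B3
```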
Step S2 shows that “obtaining a fourth block data B4 and a fifth block data B5 by the sender”. Specifically, the control circuit of the sender can divide the three consecutive feature maps Fi−1, Fi and Fi+1, which are adjacent to each other in space, into the first block data B1, the second block data B2, the third block data B3, the fourth block data B4 and the fifth block data B5 according to an arrangement method described below.
Step S3 shows that “transmitting a second pointer, the third block data B3, the fourth block data B4 and the fifth block data B5 to the receiver by the sender”. In practice, before transmitting the third block data B3, the fourth block data B4 and the fifth block data B5 to the receiver by the sender, the method further comprises the step of compressing the third block data B3, the fourth block data B4 and the fifth block data B5 by a compressor, so as to reduce the bandwidth occupied during the transmission of the third block data B3, the fourth block data B4 and the fifth block data B5. The second pointer is configured to indicate a starting address of the third block data B3, the size of the third block data B3, a size of the fourth block data B4, and a size of the fifth block data B5.
Step S4 shows that “receiving the first pointer, the first block data B1, the second block data B2 and the third block data B3 by the receiver”. As shown in
Step S5 shows that “merging the first block data B1, the second block data B2 and the third block data B3 to perform a convolution operation by the receiver according to the first pointer”. In practice, the method further comprises the step of decompressing the first block data B1, the second block data B2 and the third block data B3 by a decompressor before performing the convolution operation. These three pieces of block data B1-B3 are stored in SRAM after being decompressed. The processing element can obtain the first starting address of the first block data B1 in SRAM according to the first pointer, calculate the second starting address of the second block data B2 in SRAM according to the first starting address and the size of the first block data B1, and calculate the third starting address of the third block data B3 in SRAM according to the second starting address and the size of the second block data B2.
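A minimal sketch of this address calculation, assuming byte addresses and the hypothetical names below:

```python
# The receiver derives every block's SRAM address from the first pointer:
# the second address is the first address plus the size of B1, and the
# third address is the second address plus the size of B2.
def block_addresses(first_start, size_b1, size_b2):
    addr_b1 = first_start            # given directly by the first pointer
    addr_b2 = addr_b1 + size_b1      # B2 is stored right after B1
    addr_b3 = addr_b2 + size_b2      # B3 is stored right after B2
    return addr_b1, addr_b2, addr_b3

# Example: B1 starts at address 0 and occupies 96 bytes, and B2 occupies 32 bytes.
print(block_addresses(0, 96, 32))    # -> (0, 96, 128)
```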
Step S6 shows that “receiving the second pointer, the fourth block data B4 and the fifth block data B5 by the receiver”. As shown in
Step S7 shows that “merging the third block data B3, the fourth block data B4 and the fifth block data B5 to perform another convolution operation according to the second pointer by the receiver”. In practice, the method further comprises the step of decompressing the third block data B3, the fourth block data B4 and the fifth block data B5 by the decompressor before performing the convolution operation. These three pieces of block data B3-B5 are stored in SRAM after being decompressed. The processing element may obtain the third starting address of the third block data B3 in SRAM according to the second pointer, calculate the fourth starting address of the fourth block data B4 in SRAM according to the third starting address and the size of the third block data B3, and calculate the fifth starting address of the fifth block data B5 in SRAM according to the fourth starting address and the size of the fourth block data B4.
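The two transmissions of steps S1 to S7 can be mirrored by the following purely illustrative simulation; zlib stands in for the unspecified compressor, and all payloads and names are hypothetical.

```python
import zlib

# Toy payloads standing in for the five pieces of block data.
blocks = {name: f"<{name} payload>".encode() for name in ("B1", "B2", "B3", "B4", "B5")}

def send(group):
    """Sender: compress each block and build a pointer (start address + block sizes)."""
    compressed = [zlib.compress(blocks[name]) for name in group]
    pointer = {"start": 0, "sizes": [len(c) for c in compressed]}
    return pointer, compressed

def receive_and_merge(pointer, compressed):
    """Receiver: decompress the blocks and merge them into one tile for convolution."""
    return b"".join(zlib.decompress(c) for c in compressed)

# First processing: B1, B2 and B3 are transmitted, merged and convolved.
tile1 = receive_and_merge(*send(("B1", "B2", "B3")))
# Second processing: B3, B4 and B5 are transmitted, merged and convolved.
tile2 = receive_and_merge(*send(("B3", "B4", "B5")))
```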
As shown in
In the first-time processing, the proposed method fetches a 10×10 input tile from the left corner of the input feature map. As shown in
In the second-time processing, the proposed method steps 8 elements to the right on the feature map to fetch the next input tile.
Since the step size is constant within one layer of CNN processing, the left boundaries and the right boundaries of the input tiles fetched each time form two arithmetic progressions, denoted as Bl={−1, 7, 15, . . . } and Br={9, 17, 25, . . . }, wherein Bl represents the left boundaries and Br represents the right boundaries. The arrangement proposed by the present disclosure is the division formed by these two sets of boundaries, namely the union G=Bl∪Br. In this example, G={1, 7} (mod 8).
Because 7−1=6 (mod 8) and 1−7=2 (mod 8), and because the arrangement described above applies to division in either the horizontal or the vertical direction, each input feature map may be divided into strips of two uneven sizes, 2 and 6, which results in four subtensor shapes: 6×6, 2×6, 6×2, and 2×2.
A 10×10 window is then composed of one 6×6 subtensor, two 2×6 subtensors, two 6×2 subtensors, and four 2×2 subtensors.
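The boundary and subtensor computation for this example (3×3 kernel, stride 1, 8×8 output tile, 10×10 input tile) can be sketched as follows; the variable names are illustrative only.

```python
# A worked sketch of the example above: 3x3 kernel, stride 1, 8x8 output tile,
# 10x10 input tile.
N = 8                                    # s * tw = 1 * 8
left  = [-1 + 8 * i for i in range(3)]   # left boundaries  Bl = {-1, 7, 15, ...}
right = [ 9 + 8 * i for i in range(3)]   # right boundaries Br = {9, 17, 25, ...}
G = sorted({b % N for b in left + right})
print(G)                                 # -> [1, 7], i.e. G = {1, 7} (mod 8)

# The gaps between consecutive cut positions give the two uneven strip widths:
# 7 - 1 = 6 (mod 8) and 1 - 7 = 2 (mod 8).
widths = [(G[1] - G[0]) % N, (G[0] - G[1]) % N]
print(widths)                            # -> [6, 2]

# The first 10x10 input tile spans columns -1..8 and is cut at 1 and 7, so each
# axis splits into segments of widths 2, 6 and 2. The tile therefore contains
# one 6x6, two 2x6, two 6x2 and four 2x2 subtensors.
segments = [2, 6, 2]
shapes = [(h, w) for h in segments for w in segments]
print(shapes)
```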
In addition, since the halo only appears in the spatial dimensions, this division process is not necessary along the channel dimension.
The second embodiment will be described below. In this embodiment, computation of every layer of CNN may be defined with the following parameters:
Kernel size is denoted as 2k+1, since kernel sizes tend to be odd integers.
Two adjacent output elements are convolved from two input windows that are offset by a stride of s. When s>1, the output feature map is smaller and the computation cost is thus lower.
A dilated CNN convolves strided input elements for one output element to enlarge the equivalent window size, and the present disclosure denotes this dilation stride as d.
The output tile size is denoted as th×tw.
Based on the above parameter definitions,
For the dilated CNN shown in
G = {−kd, kd − s + 1} (mod s·tw)
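The formula can be evaluated directly, as in the following sketch; k, s, d and tw follow the parameter definitions above, and the helper name is hypothetical.

```python
# A minimal sketch of the boundary-set formula G = {-k*d, k*d - s + 1} (mod s*tw).
def boundary_set(k, s, d, tw):
    n = s * tw
    return sorted({(-k * d) % n, (k * d - s + 1) % n}), n

# 3x3 kernel (k=1), stride 1, no dilation (d=1), 8-wide output tile:
print(boundary_set(1, 1, 1, 8))   # -> ([1, 7], 8), i.e. G = {1, 7} (mod 8)

# AlexNet CONV1: 11x11 kernel (k=5), stride 4, d=1, tw=8:
print(boundary_set(5, 4, 1, 8))   # -> ([2, 27], 32), i.e. G1 = {27, 2} (mod 32)
```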
According to the aforementioned arrangements proposed by the present disclosure, it can be noticed that the configuration for mod N is also valid for mod N′ when N is divisible by N′ (N′|N).
Take AlexNet CONV1 as an example: its configuration, (k, s, tw)=(5, 4, 8), corresponds to the configuration G1={27, 2} (mod 32) of the present disclosure. Since 8 divides 32, another configuration, G2={3, 2} (mod 8), is also valid for AlexNet CONV1.
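A short sketch of this reduction from a mod-N configuration to a mod-N′ configuration when N′ divides N (the helper name is hypothetical):

```python
# Reduce a configuration that is valid for mod N to one valid for mod N',
# where N' must divide N.
def reduce_config(G, N, N_prime):
    assert N % N_prime == 0
    return sorted({g % N_prime for g in G})

print(reduce_config([27, 2], 32, 8))  # -> [2, 3], i.e. G2 = {3, 2} (mod 8)
```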
It is thus possible to use a single N across all CNN layers to keep the hardware implementation simple, and in an embodiment of the present disclosure, N=8 can be a suitable choice for most cases.
Multiple subtensors divided according to a given configuration of the present disclosure have to be stored in a data structure that complies with the memory alignment requirement to maximize the benefits of compression. Since subtensors may have different compressed sizes, the present disclosure has to store the extra pointers separately from the compressed subtensors.
Because a pointer of the present disclosure does not have to correspond to each individual subtensor, the total size of the pointers can be effectively reduced.
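As a non-authoritative sketch of such a storage layout, the compressed subtensors of each fetch group can be packed back-to-back and indexed by a single per-group pointer rather than one pointer per subtensor; the alignment value and names below are assumptions for illustration.

```python
# Pack the compressed subtensors of each fetch group contiguously, keeping only
# one pointer (base offset) per group plus the per-subtensor compressed sizes.
def pack_groups(groups, align=32):
    """groups: list of groups, each a list of compressed subtensors (bytes).
    Returns the packed buffer and one pointer record per group."""
    buf = bytearray()
    pointers = []
    for group in groups:
        pad = (-len(buf)) % align            # align each group to the access granularity
        buf.extend(b"\x00" * pad)
        pointers.append({"offset": len(buf), "sizes": [len(s) for s in group]})
        for s in group:
            buf.extend(s)
    return bytes(buf), pointers

# Example: two groups of three compressed subtensors each; only two pointers are kept.
packed, ptrs = pack_groups([[b"a" * 10, b"b" * 3, b"c" * 7], [b"d" * 5, b"e" * 9, b"f" * 2]])
print(ptrs)  # -> [{'offset': 0, 'sizes': [10, 3, 7]}, {'offset': 32, 'sizes': [5, 9, 2]}]
```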
The present disclosure proposes a hardware-friendly method for storing and accessing compressed, sparse feature maps. The present disclosure divides the feature maps into uneven subtensors, and in the process, avoids wasteful fetches of partial subtensors and partial cache lines. Furthermore, the present disclosure only requires a small metadata indexing overhead to keep track of the locations of the compressed subtensors. The present disclosure can be a simple yet effective modification for existing CNN accelerators since it is mostly independent of the compression algorithm and requires changes only to the existing feature map division method. The present disclosure can save a large amount of memory bandwidth during data transmission.
In view of the above, the present disclosure proposes an efficient storage scheme for sparse feature maps to reduce external memory bandwidth, which is aligned with the memory access patterns of modern CNN accelerator architectures. Given a specific CNN layer and an accelerator configuration, the present disclosure may convert a sparse tensor into multiple subtensors with different sizes. Existing accelerators can be integrated with the present disclosure to improve overall performance with minimal hardware modification and overhead.