The invention relates to deep neural networks (DNNs), and more particularly to a method and an integrated circuit for convolution calculation in a deep neural network that achieve high energy efficiency and low area complexity.
A deep neural network is a neural network with a certain level of complexity, i.e., a neural network with more than two layers. DNNs use sophisticated mathematical modeling to process data in complex ways. Recently, there has been an escalating trend to deploy DNNs on mobile or wearable devices, the so-called AI-on-the-edge or AI-on-the-sensor, for versatile real-time applications such as automatic speech recognition, object detection, feature extraction, etc. MobileNet, an efficient network aimed at mobile and embedded vision applications, achieves a significant reduction in convolution loading by combining depthwise convolutions with a large number of 1*1*M pointwise convolutions, when compared to a network performing normal/regular convolutions at the same depth. This results in lightweight deep neural networks. However, the massive data movements to/from external DRAM still cause huge power consumption when realizing MobileNet, because a 32-bit DRAM read consumes 640 pico-Joules (pJ), which is much higher than the cost of MAC operations (e.g., 3.1 pJ for a 32-bit multiplication).
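The loading reduction can be checked with a short, self-contained Python sketch; the layer sizes below (112*112 input, 3*3 kernels, M=32 input channels, N=64 filters) are illustrative assumptions, not values taken from the MobileNet specification.

```python
# Illustrative MAC-count comparison: a regular convolution versus the
# depthwise + 1*1 pointwise pair used by MobileNet. 'same' padding and
# stride 1 are assumed so the output keeps the input's spatial size.

def regular_conv_macs(h, w, k, m, n):
    """MACs for N regular k*k*M filters applied to an h*w*M input."""
    return h * w * n * k * k * m

def separable_conv_macs(h, w, k, m, n):
    """MACs for a k*k depthwise pass plus N 1*1*M pointwise filters."""
    return h * w * m * k * k + h * w * m * n

h, w, k, m, n = 112, 112, 3, 32, 64   # assumed, illustrative layer sizes
print(regular_conv_macs(h, w, k, m, n) / separable_conv_macs(h, w, k, m, n))
# -> about 7.9, i.e. roughly 8x fewer MACs for the separable form
```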
SoCs (systems on chip) generally integrate many functions and are thus space- and power-consuming. Considering the limited battery power and space on edge/mobile devices, a power-efficient and memory-space-efficient integrated circuit, as well as a method, for convolution calculation in DNNs is indispensable.
In view of the above-mentioned problems, an object of the invention is to provide an integrated circuit applied in a deep neural network, in order to reduce the size and the power consumption of the integrated circuit and to eliminate the use of external DRAM.
One embodiment of the invention provides an integrated circuit applied in a deep neural network. The integrated circuit comprises at least one processor, a first internal memory, a second internal memory, at least one MAC circuit, a compressor and a decompressor. The at least one processor is configured to perform a cuboid convolution over decompressed data for each cuboid of a first input image fed to any one of multiple convolution layers. The first internal memory is coupled to the at least one processor. The at least one MAC circuit is coupled to the at least one processor and the first internal memory and configured to perform the multiplication and accumulation operations associated with the cuboid convolution to output a convoluted cuboid. The second internal memory is used to store multiple compressed segments only. The compressor, coupled to the at least one processor, the at least one MAC circuit and the first and the second internal memories, is configured to compress the convoluted cuboid into one compressed segment and store it in the second internal memory. The decompressor, coupled to the at least one processor, the first internal memory and the second internal memory, is configured to decompress data from the second internal memory on a compressed segment by compressed segment basis and store the decompressed data in the first internal memory. The first input image is horizontally divided into multiple cuboids with an overlap of at least one row per channel between any two adjacent cuboids. The cuboid convolution comprises a depthwise convolution followed by a pointwise convolution.
Another embodiment of the invention provides a method applied in an integrated circuit for use in a deep neural network. The integrated circuit comprises a first internal memory and a second internal memory. The method comprises: (a) decompressing a first compressed segment, associated with a current cuboid of a first input image and outputted from the first internal memory, to store decompressed data in the second internal memory; (b) performing a cuboid convolution over the decompressed data to generate a 3D pointwise output array; (c) compressing the 3D pointwise output array into a second compressed segment to store it in the first internal memory; (d) repeating steps (a) to (c) until all the cuboids associated with a target convolution layer are processed; and (e) repeating steps (a) to (d) until all of multiple convolution layers are completed. The first input image is fed to any one of the convolution layers and horizontally divided into multiple cuboids with an overlap of at least one row per channel between any two adjacent cuboids. The cuboid convolution comprises a depthwise convolution followed by a pointwise convolution.
Further scope of the applicability of the present invention will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.
The present invention will become more fully understood from the detailed description given hereinbelow and the accompanying drawings which are given by way of illustration only, and thus are not limitative of the present invention, and wherein:
As used herein and in the claims, the term “and/or” includes any and all combinations of one or more of the associated listed items. The use of the terms “a” and “an” and “the” and similar referents in the context of describing the invention are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context.
In deep learning, a convolutional neural network (CNN) is a class of deep neural networks most commonly applied to analyzing visual imagery. In general, a CNN has three types of layers: convolutional layers, pooling layers and fully connected layers. A CNN usually includes multiple convolutional layers. For each convolutional layer, multiple filters (or kernels) are used to convolute over an input image to obtain an output feature map. The depths (or numbers of channels) of the input image and of each filter are the same. The depth (or number of channels) of the output feature map is equal to the number of filters. Each filter may have the same (or a different) width and height, which are less than or equal to the width and height of the input image.
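As a minimal sketch of this shape bookkeeping (all sizes below are assumed for illustration), the following NumPy loop shows that each filter spans the full input depth M and that the output depth equals the number of filters N.

```python
import numpy as np

H, W, M, N, K = 8, 8, 3, 5, 3           # assumed input size, filter count, kernel size
image = np.random.rand(H, W, M)         # input image of depth M
filters = np.random.rand(N, K, K, M)    # every filter has the same depth M

out = np.zeros((H - K + 1, W - K + 1, N))   # 'valid' convolution, stride 1
for n in range(N):
    for y in range(H - K + 1):
        for x in range(W - K + 1):
            out[y, x, n] = np.sum(image[y:y+K, x:x+K, :] * filters[n])

print(out.shape)   # (6, 6, 5): output depth equals the number of filters N
```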
A feature of the invention is to horizontally split the output feature map for each convolution layer into multiple cuboids of the same dimension, sequentially compress the data for each cuboid into an individual compressed segment and store the compressed segments in a first internal memory (e.g., ZRAM 115) of an integrated circuit for a mobile/edge device. Another feature of the invention is to fetch the compressed segments from the first internal memory on a compressed segment by compressed segment basis for each convolution layer, decompress one compressed segment into decompressed data in a second internal memory (e.g., HRAM 120), perform a cuboid convolution over the decompressed data to produce a 3D pointwise output array, compress the 3D pointwise output array into an updated compressed segment and store the updated compressed segment back in the ZRAM 115. Accordingly, with proper cuboid size selection, only the decompressed data for a single cuboid of an input image for each convolution layer are temporarily stored in the HRAM 120 for cuboid convolution, while the compressed segments for the other cuboids remain in the ZRAM 115. Consequently, the use of external DRAM is eliminated; besides, not only the sizes of the HRAM 120 and the ZRAM 115 but also the size and power consumption of the integrated circuit 100 are reduced.
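The horizontal split can be sketched as follows; the helper name split_into_cuboids, the cuboid count K=3 and the exact one-row overlap are assumptions for illustration, since the invention only requires an overlap of at least one row per channel between adjacent cuboids.

```python
import numpy as np

def split_into_cuboids(feature_map, k, overlap=1):
    """Horizontally split an H*W*M map into k cuboids sharing `overlap` rows."""
    h = feature_map.shape[0]
    step = (h - overlap) // k               # rows owned by each cuboid
    cuboids = []
    for i in range(k):
        top = i * step
        bottom = min(top + step + overlap, h)
        cuboids.append(feature_map[top:bottom, :, :])
    return cuboids

fmap = np.arange(16 * 4 * 2).reshape(16, 4, 2)   # toy 16*4*2 feature map
for c in split_into_cuboids(fmap, k=3):
    print(c.shape)   # three (6, 4, 2) cuboids, adjacent ones sharing one row
```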
Another feature of the invention is to use the cuboid convolution, instead of the conventional depthwise separable convolution, over the decompressed data with filters to produce a 3D pointwise output array for each cuboid of an input image fed to any one of the convolution layers of a lightweight deep neural network (e.g., MobileNet). The cuboid convolution of the invention is split into a depthwise convolution and a pointwise convolution. Another feature of the invention is to apply a row repetitive value compression (RRVC) scheme to each channel of each cuboid in the output feature map for MobileNet layer 1 and to each 2D pointwise output array (p(1)˜p(N)) for each convolution layer.
For purposes of clarity and ease of description, the following embodiments and examples are described in terms of MobileNet (including multiple convolution layers); however, it should be understood that the invention is not so limited, but is generally applicable to any type of deep neural network that performs the conventional depthwise separable convolution.
Throughout the specification and claims, the following terms take the meanings explicitly associated herein, unless the context clearly dictates otherwise. The term “input image” refers to the total data input fed to either the first layer or each convolution layer of MobileNet. The term “output feature map” refers to the total data output generated from either the normal/regular convolution for the first layer or the cuboid convolutions of all cuboids for each convolution layer in MobileNet.
According to the programs in the data/program internal memory 141, the DSPs 140 are configured to perform all operations associated with the convolution calculations, which include the regular/normal convolutions and the cuboid convolutions, and to enable/disable the MAC circuits 111, the neural function unit 112, the compressor 113 and the decompressor 114 via a control bus 142. The DSPs 140 are further configured to control the input/output operations of the HRAM 120 and the ZRAM 115 via the control bus 142. An original input image from an image/sound acquisition device (e.g., a camera) (not shown) is stored into the HRAM 120 via the sensor interface 170. The original input image may be a normal/general image with multiple channels or a spectrogram with a single channel derived from an audio signal (will be described below). The flash memory 150 pre-stores the coefficients forming the filters for layer 1 and each convolution layer in MobileNet. Prior to any convolution calculation for layer 1 and each convolution layer in MobileNet, the DSPs 140 read the corresponding coefficients from the flash memory 150 via the flash control interface 130 and temporarily store them in HRAM 120. During the convolution operation, the DSPs 140 instruct the MAC circuits 111 via the control bus 142, according to the programs in the data/program internal memory 141, to perform the related multiplications and accumulations over the image data and coefficients in HRAM 120.
The neural function unit 112 is enabled by the DSPs 140 via the control bus 142 to apply a selected activation function over each element from the MAC circuits 111.
After the selected activation function is applied to the outputs of the MAC circuits 111, the DSPs 140 instruct the compressor 113 via the control bus 142 to compress the data from the neural function unit 112, cuboid by cuboid, into multiple compressed segments with any compression method, e.g., the row repetitive value compression (RRVC) scheme (will be described below). The ZRAM 115 is used to store the compressed segments associated with the output feature map for the first layer and each convolution layer in MobileNet. The decompressor 114 is enabled/instructed by the DSPs 140 via the control bus 142 to decompress the compressed segments on a compressed segment by compressed segment basis for the following cuboid convolution with any decompression method, e.g., the row repetitive value (RRV) decompression scheme (will be described below). The control bus 142 is used by the DSPs 140 to control the operations of the MAC circuits 111, the neural function unit 112, the compressor 113, the decompressor 114, the ZRAM 115 and the HRAM 120. In one embodiment, the control bus 142 includes six control lines that originate from the DSPs 140 and are respectively connected to the MAC circuits 111, the neural function unit 112, the compressor 113, the decompressor 114, the ZRAM 115 and the HRAM 120.
Step S202: Perform a regular/standard convolution over the input image using corresponding filters to generate an output feature map. In one embodiment, according to the MobileNet specification, the DSPs 140 and the MAC circuits 111 apply a regular convolution to the input image in HRAM 120 with corresponding filters to generate the output feature map for layer 1 in MobileNet (which also serves as the input image for the following convolution layer). Here, the input image has at least one channel.
Step S204: Divide the output feature map into multiple cuboids of the same dimension, compress the data for each cuboid into a compressed segment and sequentially store the compressed segments in ZRAM 115.
Please also note that the data for the M channels of the output feature map for MobileNet layer 1 are likewise compressed channel by channel for each cuboid, using the RRVC scheme described below.
Step S206: Read the compressed segment i for cuboid i from ZRAM 115 and then decompress the compressed segment i for the following cuboid convolution in convolution layer j, storing the decompressed data in HRAM 120. In an embodiment, the DSPs 140 instruct the decompressor 114 via the control bus 142 to read the compressed segments in ZRAM 115 on a compressed segment by compressed segment basis and to decompress the compressed segment i with a decompression method, such as the RRV decompression scheme (will be described below), corresponding to the compression method in step S204, storing the decompressed data for cuboid i in HRAM 120. Without using any external DRAM, a small storage space in HRAM 120 is sufficient to hold the decompressed data of a single cuboid for its cuboid convolution operation, since the compressed segments for the other cuboids are stored in ZRAM 115 at the same time.
In an alternative embodiment, the regular convolution (steps S202˜S204) and the cuboid convolution operations (steps S206˜S212) are performed in a pipelined manner. In other words, performing the cuboid convolution operations (steps S206˜S212) does not need to wait for all the image data of the output feature map for layer 1 in MobileNet to be compressed and stored in ZRAM 115 (steps S202˜S204). Instead, as soon as all the data for cuboid 1 in the output feature map for layer 1 are produced, the DSPs 140 directly perform the cuboid convolution over the data of cuboid 1 for layer 2 (or convolution layer 1) without instructing the compressor 113 to compress cuboid 1. In the meantime, the compressor 113 proceeds to compress the data of the following cuboids in the output feature map for layer 1 into compressed segments and sequentially stores the compressed segments in ZRAM 115 (step S204). After the cuboid convolution associated with cuboid 1 for layer 2 (or convolution layer 1) is completed, the compressed segments of the other cuboids are read from ZRAM 115 on a compressed segment by compressed segment basis and decompressed for cuboid convolution (step S206).
Step S208: Perform a depthwise convolution over the decompressed data in HRAM 120 for cuboid i using M filters (Kd(1)˜Kd(M)). According to the invention, the cuboid convolution is a depthwise convolution followed by a pointwise convolution.
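A minimal NumPy sketch of this step is given below, assuming 3*3 depthwise kernels, stride 1 and no padding; the function name depthwise_conv and the array sizes are illustrative stand-ins for the MAC operations performed by the circuit, not the patented implementation itself.

```python
import numpy as np

def depthwise_conv(cuboid, kd):
    """Depthwise pass: kd holds one k*k kernel per channel, shape (k, k, M)."""
    h, w, m = cuboid.shape
    k = kd.shape[0]
    out = np.zeros((h - k + 1, w - k + 1, m))
    for c in range(m):                       # each channel filtered independently
        for y in range(h - k + 1):
            for x in range(w - k + 1):
                out[y, x, c] = np.sum(cuboid[y:y+k, x:x+k, c] * kd[:, :, c])
    return out                               # the 3D depthwise output array

cuboid = np.random.rand(6, 8, 4)             # decompressed data for cuboid i, M=4
print(depthwise_conv(cuboid, np.random.rand(3, 3, 4)).shape)   # (4, 6, 4)
```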
Throughout the specification and claims, the following terms take the meanings explicitly associated herein, unless the context clearly dictates otherwise. The term “input array” refers to a channel of a cuboid in an input image.
Step S210: Perform a pointwise convolution over the 3D depthwise output array in HRAM 120 using N filters (Kp(1)˜Kp(N)) to generate a 3D pointwise output array.
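Because each pointwise filter is 1*1*M, the pointwise pass reduces to a per-pixel dot product across channels; the sketch below (with an assumed kp matrix of shape (M, N), one column per filter Kp(n)) illustrates this.

```python
import numpy as np

def pointwise_conv(dw_out, kp):
    """Pointwise pass: kp has shape (M, N), one column per filter Kp(n)."""
    h, w, m = dw_out.shape
    return dw_out.reshape(h * w, m).dot(kp).reshape(h, w, -1)

dw_out = np.random.rand(4, 6, 4)                 # a 3D depthwise output array, M=4
print(pointwise_conv(dw_out, np.random.rand(4, 5)).shape)
# (4, 6, 5): the 3D pointwise output array, depth N=5
```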
Please note that the result of each convolution operation is always applied with a corresponding activation function in the method for convolution calculation of the invention. For purposes of clarity and ease of illustration, only the convolution operations are described, and their corresponding activation functions are omitted in the drawings.
Step S212: Compress the 3D pointwise output array for cuboid i into a compressed segment and store the compressed segment in ZRAM 115. In one embodiment, the DSPs 140 instruct the compressor 113 via the control bus 142 to compress the 3D pointwise output array for cuboid i with RRVC into a compressed segment and store the compressed segment in ZRAM 115. At the end of this step, increase i by one.
Step S214: Determine whether i is greater than K (the number of cuboids). If Yes, the flow goes to step S216; otherwise, the flow returns to step S206.
Step S216: Increase j by one.
Step S218: Determine whether j is greater than T (the number of convolution layers). If Yes, the flow is terminated; otherwise, the flow returns to step S206.
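The control flow of steps S206 through S218 can be summarized by the following self-contained sketch, in which zlib stands in for the RRVC compressor/decompressor and a trivial scaling stands in for the cuboid convolution; only the loop structure mirrors the method, and all names and sizes are assumptions.

```python
import pickle
import zlib
import numpy as np

def compress(cuboid):                        # stand-in for step S212
    return zlib.compress(pickle.dumps(cuboid))

def decompress(segment):                     # stand-in for step S206
    return pickle.loads(zlib.decompress(segment))

def cuboid_conv(cuboid):                     # stand-in for steps S208-S210
    return cuboid * 0.5

zram = {i: compress(np.ones((4, 4, 2))) for i in range(1, 4)}   # K=3 segments
for j in range(1, 3):                        # T=2 convolution layers (S216/S218)
    for i in sorted(zram):                   # cuboid loop (S206-S214)
        hram = decompress(zram[i])           # only one cuboid resides in HRAM
        zram[i] = compress(cuboid_conv(hram))
print(decompress(zram[1])[0, 0, 0])          # 0.25 after two layers
```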
Due to spatial coherence, there exist repetitive values between adjacent rows of each 2D pointwise output array p(n) for each convolution layer in MobileNet and of each channel of each cuboid in the output feature map for layer 1, where 1<=n<=N. Thus, the invention provides the RRVC scheme, which mainly performs bitwise XOR operations on adjacent rows of each 2D pointwise output array p(n) or of each channel of a target cuboid to reduce the number of bits for storage.
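A one-line experiment illustrates the premise (the byte values are made up for illustration): XOR-ing two nearly identical adjacent rows yields mostly zero entries, which the NZ bitmap and search queue described below encode compactly.

```python
import numpy as np

row_a = np.array([222, 222, 222, 231], dtype=np.uint8)   # made-up adjacent rows
row_b = np.array([222, 222, 222, 222], dtype=np.uint8)
print(np.bitwise_xor(row_a, row_b))   # [ 0  0  0 57]: one non-zero entry to store
```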
Step S402: Set parameters i and j to 1 for initialization.
Step S404: Divide a 2D pointwise output array p(i) of a 3D pointwise output array associated with cuboid f into a number R of a*b working subarrays A(j), where R>1, a>1 and b>1.
Step S406: Form a reference row 51 according to a reference phase and the first to the third elements of row 1 of working subarray A(j). In an embodiment, the compressor 113 sets the first element 51a (i.e., the reference phase) of the reference row 51 to 128 and copies the values of the first to the third elements of row 1 of working subarray A(j) to the second to the fourth elements of the reference row 51.
Step S408: Perform bitwise XOR operations according to the reference row and the working subarray A(j). Specifically, perform bitwise XOR operations on pairs of elements sequentially outputted either from the reference row 51 and the first row of the working subarray A(j), or from any two adjacent rows of the working subarray A(j), to produce the corresponding rows of a result map 53.
Step S410: Replace the non-zero (NZ) values of the result map 53 with 1 to form an NZ bitmap 55, and sequentially store the original values that reside at the same locations in the subarray A(j) as the NZ values in the result map 53 into the search queue 54. The original values in the working subarray A(j) are fetched in a top-down, left-to-right manner and then stored in the search queue 54 by the compressor 113. The search queue 54 and the NZ bitmap 55 associated with the working subarray A(j) form a part of the above-mentioned compressed segment to be stored in ZRAM 115.
Step S412: Increase j by one.
Step S414: Determine whether j is greater than R. If Yes, the flow goes to step S416; otherwise, the flow returns to step S406 for processing the next working subarray.
Step S416: Increase i by one.
Step S418: Determine whether i is greater than N. If Yes, the flow goes to step S420; otherwise, the flow returns to step S404 for processing the next 2D pointwise output array.
Step S420: Assemble the above NZ bitmaps 55 and search queues 54 into a compressed segment for cuboid f. The flow is terminated.
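The compression flow of steps S406 to S410 can be expressed compactly in NumPy, as sketched below for a single working subarray; the function name and the 3*4 example subarray are assumptions for illustration, while the reference phase of 128, the XOR of adjacent rows, the NZ bitmap and the search queue follow the steps above.

```python
import numpy as np

REFERENCE_PHASE = 128   # value of element 51a per step S406

def rrvc_compress_subarray(a_j):
    """Compress one a*b working subarray A(j) into an NZ bitmap and search queue."""
    rows, cols = a_j.shape
    ref = np.empty(cols, dtype=np.uint8)
    ref[0] = REFERENCE_PHASE                    # reference phase
    ref[1:] = a_j[0, :cols - 1]                 # copies of row-1 elements 1..3
    stacked = np.vstack([ref, a_j])             # reference row on top of A(j)
    result_map = np.bitwise_xor(stacked[:-1], stacked[1:])   # step S408
    nz_bitmap = (result_map != 0).astype(np.uint8)           # step S410
    search_queue = a_j[nz_bitmap.astype(bool)]  # originals at NZ locations,
    return nz_bitmap, search_queue              # fetched top-down, left-to-right

A = np.array([[128, 128, 222, 222],             # assumed 3*4 working subarray
              [128, 128, 222, 222],
              [128, 128, 222, 231]], dtype=np.uint8)
bitmap, queue = rrvc_compress_subarray(A)
print(bitmap)   # [[0 0 1 0] [0 0 0 0] [0 0 0 1]]: 1 marks each NZ result value
print(queue)    # [222 231]: the original values at those two locations
```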
Step S462: Set parameters i and j to 1 for initialization.
Step S464: Fetch a search queue 54′ and an NZ bitmap 55′ from a compressed segment for cuboid f stored in ZRAM 115. The search queue 54′ and the NZ bitmap 55′ correspond to a restored working subarray A′(j) of a 2D restored pointwise output array p′(i) of a 3D restored pointwise output array associated with cuboid f. Assume that the restored working subarray A′(j) has a size of a*b, that there are a number R of restored working subarrays A′(j) for each 2D restored pointwise output array p′(i), and that there are a number N of 2D restored pointwise output arrays for each 3D restored pointwise output array, where R>1, a>1 and b>1.
Step S466: Restore the NZ elements residing at the same locations in the restored working subarray A′(j) as the NZ values in the NZ bitmap 55′, according to the values in the search queue 54′ and the NZ bitmap 55′.
Step S468: Form a restored reference row 57 according to a reference phase and the first to the third elements of row 1 of the restored working subarray A′(j). In an embodiment, the decompressor 114 sets the first element 57a (i.e., the reference phase) to 128 and copies the values of the first to the third elements of row 1 of the restored working subarray A′(j) to the second to the fourth elements of the restored reference row 57. Assume that b1˜b3 denote the blanks in the first row of the restored working subarray A′(j).
Step S470: Write zeros at the same locations in the restored result map 58 as the zeros in the NZ bitmap 55′. Set x equal to 2.
Step S472: Fill in blanks in the first row of the restored working subarray A′(j) according to the known elements in the restored reference row 57 and the first row of the restored working subarray A′(j), the zeros in the first row of the restored result map 58 and the bitwise XOR operations over the restored reference row 57 and the first row of A′(j). Thus, we obtain b1=128, b2=222, b3=b2=222 in sequence.
Step S474: Fill in blanks in row x of the restored working subarray A′(j) according to the known elements in row (x−1) and row x of the restored working subarray A′(j), the zeros in row x of the restored result map 58 and the bitwise XOR operations over rows (x−1) and row x of A′(j). For example, if b4˜b5 denote blanks in the second row of the restored working subarray A′(j), we obtain b4=222, b5=b3=222 in sequence.
Step S476: Increase x by one.
Step S478: Determine whether x is greater than a. If Yes, the restored working subarray A′(j) is completed and the flow goes to step S480; otherwise, the flow returns to step S474.
Step S480: Increase j by one.
Step S482: Determine whether j is greater than R. If Yes, the 2D restored pointwise output array p′(i) is completed and the flow goes to step S484; otherwise, the flow returns to step S464.
Step S484: Increase i by one.
Step S486: Determine whether i is greater than N. If Yes, the flow goes to step S488; otherwise, the flow returns to step S464 for the next 2D restored pointwise output array.
Step S488: Form a 3D restored pointwise output array associated with cuboid f according to the 2D restored pointwise output arrays p′(i), where 1<=i<=N. The flow is terminated.
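The inverse flow of steps S464 to S478 is sketched below under the same assumptions as the compression sketch above: the NZ elements are restored first, after which every zero position in the result map is filled from the reference phase, its left neighbor (row 1) or its upper neighbor (rows 2 to a). The bitmap and queue fed in are the outputs the compression sketch would produce for the same 3*4 subarray.

```python
import numpy as np

REFERENCE_PHASE = 128

def rrv_decompress_subarray(nz_bitmap, search_queue):
    """Rebuild a working subarray from its NZ bitmap and search queue."""
    rows, cols = nz_bitmap.shape
    a_j = np.zeros((rows, cols), dtype=np.uint8)
    a_j[nz_bitmap.astype(bool)] = search_queue   # step S466: NZ elements first
    for y in range(rows):                        # steps S472/S474: fill blanks
        for x in range(cols):
            if nz_bitmap[y, x]:
                continue                         # already restored above
            if y == 0:
                # a zero result in row 1 means the element equals the
                # reference-row entry: the reference phase for column 1,
                # the left neighbor otherwise (step S472)
                a_j[y, x] = REFERENCE_PHASE if x == 0 else a_j[y, x - 1]
            else:
                a_j[y, x] = a_j[y - 1, x]        # zero result: copy the upper row (S474)
    return a_j

bitmap = np.array([[0, 0, 1, 0],                 # outputs of the compression
                   [0, 0, 0, 0],                 # sketch above for the same
                   [0, 0, 0, 1]], dtype=np.uint8)   # 3*4 subarray
queue = np.array([222, 231], dtype=np.uint8)
print(rrv_decompress_subarray(bitmap, queue))    # restores the original A(j)
```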
As is well known in the art, an audio signal can be transformed into a spectrogram by an optical spectrometer, a bank of band-pass filters, a Fourier transform or a wavelet transform. The spectrogram is a visual representation of the spectrum of frequencies of the audio signal as it varies with time. Spectrograms are used extensively in the fields of music, sonar, radar, speech processing, seismology and others. Spectrograms of audio signals can be used to identify spoken words phonetically and to analyze the various calls of animals. Since the formats of spectrograms are the same as those of grayscale images, a spectrogram of an audio signal can be regarded as an input image with a single channel in the invention. Thus, the above embodiments and examples are applicable not only to general grayscale/color images, but also to spectrograms of audio signals. Like a normal grayscale or color image, a spectrogram of an audio signal is transmitted into HRAM 120 via the sensor interface 170 in advance for the above regular convolution and cuboid convolutions.
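For illustration, such a spectrogram can be produced with an off-the-shelf short-time Fourier transform; the sample rate and the test tone below are assumptions, and scipy.signal.spectrogram merely stands in for the transform options listed above.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000                                  # assumed sample rate
t = np.arange(fs) / fs
audio = np.sin(2 * np.pi * 440 * t)         # one second of a 440 Hz test tone
freqs, times, sxx = spectrogram(audio, fs=fs)
image = sxx[..., np.newaxis]                # append the single channel axis
print(image.shape)                          # (frequency bins, time frames, 1)
```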
The above embodiments and functional operations can be implemented in digital electronic circuitry, or in computer hardware, firmware or software, or in combinations of them.
In an alternative embodiment, the DSPs 140, the MAC circuits 111, the neural function unit 112, the compressor 113 and the decompressor 114 are implemented with a general-purpose processor and a program memory (e.g., the data/program internal memory 141). The program memory is separate from the HRAM 120 and the ZRAM 115 and stores a processor-executable program. When the processor-executable program is executed by the general-purpose processor, the general-purpose processor is configured to function as the DSPs 140, the MAC circuits 111, the neural function unit 112, the compressor 113 and the decompressor 114.
While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the broad invention, and that this invention should not be limited to the specific construction and arrangement shown and described, since various other modifications may occur to those ordinarily skilled in the art.
This application claims priority under 35 USC 119(e) to U.S. provisional application No. 62/733,083, filed on Sep. 19, 2018, the content of which is incorporated herein by reference in its entirety.