The present disclosure relates to a configurable convolution neural network processor.
Neuro-inspired coding algorithms have been applied to various types of sensory inputs, including audio, image, and video, for dictionary learning and feature extraction in a wide range of applications including compression, denoising, super-resolution, and classification tasks. Sparse coding implemented as a spiking recurrent neural network can be readily mapped to hardware to achieve high performance. However, as the input dimensionality increases, the number of parameters becomes impractically large, necessitating a convolutional approach to reduce the number of parameters by exploiting translational invariance.
In this disclosure, a configurable convolution neural network processor is presented. The configurable convolution processor has several advantages: 1) the configurable convolution processor is more versatile than fixed architectures for specialized accelerators; 2) the configurable convolution processor employs sparse coding which produces sparse spikes, presenting opportunities for significant complexity and power reduction; 3) the configurable convolution processor preserves structural information in dictionary-based encoding, allowing downstream processing to be done directly in the encoded, i.e., compressed, domain; and 4) the configurable convolution processor uses unsupervised learning, enabling truly autonomous modules that adapt to inputs.
This section provides background information related to the present disclosure which is not necessarily prior art.
This section provides a general summary of the disclosure, and is not a comprehensive disclosure of its full scope or all of its features.
A configurable convolution processor is presented. The configurable convolution processor includes a front-end processor and a plurality of neurons. The front-end processor is configured to receive an input having an array of values and a convolutional kernel of a specified size to be applied to the input. The plurality of neurons are interfaced with the front-end processor. Each neuron includes a physical convolution module with a fixed size. Each neuron is configured to receive a portion of the input and the convolutional kernel from the front-end processor, and operates to convolve the portion of the input with the convolutional kernel in accordance with a set of instructions for convolving the input with the convolutional kernel, where each instruction in the set of instructions identifies individual elements of the input and a particular portion of the convolutional kernel to convolve using the physical convolution module.
In one embodiment, the front-end processor determines the set of instructions for convolving the input with the convolutional kernel and passes the set of instructions to the plurality of neurons. The front-end processor further defines a fixed block size for the input based on the specified size of the convolutional kernel and the size of the physical convolution module, divides the input into segments using the fixed block size, and cooperates with the plurality of neurons to convolve each segment with the convolutional kernel. Convolving each segment with the convolutional kernel includes: determining a walking path for scanning the physical convolution module in relation to a given input segment, where the walking path passes through the center of each pixel of the convolutional kernel when visually overlaid onto the convolutional kernel and aligns with the center of the input segment when visually overlaid onto the given input segment; and, at each step of the walking path, computing a dot product between a portion of the convolutional kernel and a portion of the given input segment and accumulating the result of the dot product into an output buffer.
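By way of illustration only, each instruction in the set of instructions may be viewed as a record that names the input elements, the corresponding portion of the convolutional kernel, and the output-buffer entry that accumulates their dot product. The sketch below is a hypothetical software model of such a record; the field names and the execute helper are assumptions made for explanatory purposes, not the claimed instruction format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ConvInstruction:
    # Hypothetical instruction record: which elements of the input segment and
    # which portion of the kernel the physical convolution module multiplies,
    # and which output-buffer entry receives the accumulated dot product.
    input_elements: List[Tuple[int, int]]   # (row, col) indices into the input segment
    kernel_elements: List[Tuple[int, int]]  # (row, col) indices into the convolutional kernel
    output_index: Tuple[int, int]           # output-buffer location to accumulate into

def execute(instr, segment, kernel, out):
    """Behavioral model only: dot product of the named input elements and
    kernel portion, accumulated into the named output-buffer entry."""
    acc = 0
    for (ir, ic), (kr, kc) in zip(instr.input_elements, instr.kernel_elements):
        acc += segment[ir][ic] * kernel[kr][kc]
    out[instr.output_index[0]][instr.output_index[1]] += acc
```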
In some embodiments, the front-end processor implements a recurrent neural network with feedforward operations and feedback operations performed by the plurality of neurons.
In some embodiments, neurons in the plurality of neurons are configured to receive a portion of the input during a first iteration and to receive a reconstruction error during subsequent iterations, where the reconstruction error is the difference between the portion of the input and a reconstructed input from a previous iteration. The neurons in the plurality of neurons may generate a spike when a convolution result exceeds a threshold, accumulate spikes in a spike matrix, and create the reconstructed input by convolving the spike matrix with the convolutional kernel. The reconstructed input may be accompanied by a non-zero map, such that non-zero entries are represented by a one and zero entries are represented by a zero in the non-zero map. The non-zero maps of multiple reconstructed input segments may in turn be accompanied by another non-zero map, forming a hierarchical non-zero map.
Further areas of applicability will become apparent from the description provided herein. The description and specific examples in this summary are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
The drawings described herein are for illustrative purposes only of selected embodiments and not all possible implementations, and are not intended to limit the scope of the present disclosure.
Corresponding reference numerals indicate corresponding parts throughout the several views of the drawings.
Example embodiments will now be described more fully with reference to the accompanying drawings.
In the above arrangement, the configurable convolution processor 10 implements recurrent networks by iterative feedforward and feedback operations. In a feedforward operation, each neuron convolves its input or the reconstruction errors 12, i.e., the differences between the input 13 and its reconstruction 14, with a kernel 15. The convolution results are accumulated, and spikes are generated and stored in a spike map 16 when the accumulated potentials exceed a threshold. In a feedback operation, neuron spikes are convolved with the kernel 15 to reconstruct the input. Depending on the application, 10 to 50 iterations are required to complete one inference. The inference output, in the form of neuron spikes, is passed to a downstream post-processor 18 to complete various tasks.
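The iterative operation can be summarized with the behavioral sketch below. The sketch assumes a simple thresholded spiking update and uses a software convolution routine in place of the on-chip convolvers; the threshold value, iteration count, and potential-reset rule are illustrative assumptions rather than the circuit's actual arithmetic.

```python
import numpy as np
from scipy.signal import convolve2d  # software stand-in for the on-chip convolvers

def sparse_code(image, kernels, iterations=10, threshold=1.0):
    """Behavioral sketch of the feedforward/feedback loop: each neuron
    integrates the convolution of the residual with its kernel, fires binary
    spikes above a threshold, and the accumulated spikes are convolved back
    with the kernels to reconstruct the input."""
    potentials = [np.zeros(image.shape) for _ in kernels]
    spike_maps = [np.zeros(image.shape) for _ in kernels]
    residual = image.copy()                      # the first iteration sees the raw input
    for _ in range(iterations):
        # Feedforward: convolve the residual (input or reconstruction error) with each kernel.
        for n, k in enumerate(kernels):
            potentials[n] += convolve2d(residual, k, mode="same")
            spikes = potentials[n] > threshold   # binary spikes
            spike_maps[n] += spikes
            potentials[n][spikes] = 0.0          # reset fired potentials (a modeling choice)
        # Feedback: reconstruct the input from the accumulated spikes.
        reconstruction = sum(convolve2d(s, k, mode="same")
                             for s, k in zip(spike_maps, kernels))
        residual = image - reconstruction        # reconstruction error for the next iteration
    return spike_maps, reconstruction
```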
For demonstration, a configurable convolution processor chip is built in 40 nm CMOS. The configurable convolution architecture is more versatile than fixed architectures for specialized accelerators. The design exploits the inherent sparsity using zero-patch skipping to make convolution up to 40% more efficient than state-of-the-art constant-throughput zero-masking convolution. A sparse spike-driven approach is adopted in feedback operations to minimize the cost of implementing recurrence by eliminating multipliers. In this example, the configurable convolution processor contains 48 convolutional neurons with a configurable kernel size of up to 15×15, which are equivalent to 10,800 non-convolutional neurons in classic implementations. Each neuron operates at an independent clock frequency and communicates using asynchronous interfaces, enabling each neuron to run at its optimal frequency to achieve load balancing. Going beyond conventional feature extraction tasks, the configurable convolution processor 10 is applied to stereo images to extract depth information as illustrated in
To implement a recurrent neural network for sparse coding, a modular hardware architecture is designed as shown in
In an example embodiment, the modular hardware architecture 30 for the configurable convolution processor comprises a front-end processor, or hub, 31 and a plurality of neurons 32. The front-end processor 31 is configured to receive an input and a convolution kernel of a specified size to be applied to the input. In one example, the input is an image having an array of values, although other types of inputs are contemplated by this disclosure. Upon receipt of the input, the front-end processor 31 determines a set of instructions for convolving the input with the convolution kernel and passes the set of instructions to the plurality of neurons.
A plurality of neurons 32 are interfaced with the front-end processor 31. Each neuron 32 includes a physical convolution module implemented in hardware. The physical convolution module can perform a 2-dimensional (2D) convolution of a fixed size Sp×Sp. In the example embodiment, the physical convolution size is 4×4. It follows that the physical convolution module includes 16 multipliers, 16 output buffers and a group of configurable adders as seen in
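A purely behavioral sketch of such a module is given below: sixteen products are formed in parallel and a configurable adder stage groups them into one or more partial sums, mirroring the different accumulation patterns used along the walking path described later. The grouping options and the function name are illustrative assumptions.

```python
def physical_conv_step(inputs, weights, group_size=16):
    """Behavioral model of a 4x4 physical convolution module: 16 parallel
    multiplies followed by a configurable adder stage.  group_size selects
    how the 16 products are summed, e.g. one 16-input sum, two 8-input sums,
    or four 4-input sums."""
    assert len(inputs) == len(weights) == 16
    products = [a * w for a, w in zip(inputs, weights)]
    return [sum(products[i:i + group_size]) for i in range(0, 16, group_size)]
```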
Each neuron 32 is configured to receive a portion of the input and a convolution kernel of a specified size from the front-end processor 31. Each neuron in turn operates to convolve the portion of the input with the convolution kernel in accordance with the received set of instructions for convolving the input with the convolution kernel, where each instruction in the set of instructions identifies particular pixels or elements of the input and a particular portion of the convolution kernel to convolve using the physical convolution module.
In performing a feedforward operation, a neuron convolves a typically non-sparse input image (in the first iteration) or sparse reconstruction errors (in subsequent iterations) with its kernel. The feedforward convolution is optimized in three ways: 1) highest throughput for sparse input by exploiting sparsity, 2) highest throughput for non-sparse input by fully utilizing the hardware, and 3) efficient support of variable kernel size. To achieve high throughput and efficiency, a sparse convolver can be used to support zero-patch skipping as will be described in more detail below. To achieve configurability, variable-sized convolution is divided into smaller fixed-sized sections and a traverse path is designed for the physical convolution module to assemble the complete convolution result. The design of the configurable sparse convolution is described further below.
In one embodiment, each neuron supports a configurable kernel of size up to 15×15 using a compact latch-based kernel buffer, and variable image patch size up to 32×32. An input image larger than 32×32 is divided into 32×32 sub-images that share overlaps to minimize edge artifacts.
In a feedback operation, neuron spikes are convolved with their kernels to reconstruct the input image. A direct implementation of this feedback convolution is computationally expensive and would become a performance bottleneck. Taking advantage of the binary spikes, all multiplications in this convolution are replaced by additions. The design also makes use of the high sparsity of the spikes (typically >90% sparsity) to design a sparsely activated spike-driven reconstruction to save computation and power. This design is also detailed below.
With continued reference to
Next, the input block size is defined at 53 based on the kernel size and the size of the physical convolution module. Specifically, the input block size is set to (Sk+Sp−1)×(Sk+Sp−1). In the example embodiment, this equates to an input block size of 8×8.
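For instance, assuming a 5×5 kernel (Sk=5) and the 4×4 physical convolution module (Sp=4) of the example embodiment, the input block size is (5+4−1)×(5+4−1) = 8×8.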
Lastly, the input is convolved with the convolutional kernel. In most instances, the size of the overall input is much larger than the input block size. When the size of the overall input is greater than the input block size, the input is divided into segments at 54, such that each segment matches the input block size or a set of segments can be combined to match the input block size, and each segment or set of segments is convolved with the convolutional kernel at 55. The segments may or may not overlap with each other. For example, starting from the top-left corner, convolve a first segment with the convolutional kernel. Next, move Sp columns to the right (e.g., 4) and convolve this second segment with the convolutional kernel as shown in
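A simplified software sketch of this segmentation is shown below: a window of the block size is stepped across the input with a stride equal to the physical convolution size, so that adjacent segments overlap by Sk−1 rows or columns. The function name and the edge handling are illustrative assumptions.

```python
import numpy as np

def iter_segments(image, s_k, s_p):
    """Yield (row, col) offsets and input segments of size (s_k + s_p - 1),
    stepped across the input with a stride of s_p.  Each segment, convolved
    with an s_k x s_k kernel in 'valid' mode, produces an s_p x s_p tile of
    the output; adjacent segments overlap by s_k - 1 rows/columns."""
    block = s_k + s_p - 1
    rows, cols = image.shape
    for r in range(0, rows - block + 1, s_p):
        for c in range(0, cols - block + 1, s_p):
            yield (r, c), image[r:r + block, c:c + block]

# Example: a 32x32 input with a 5x5 kernel and a 4x4 physical module yields 8x8 segments.
segments = list(iter_segments(np.zeros((32, 32)), s_k=5, s_p=4))
```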
Convolving a given segment of the input with the convolutional kernel is described in relation to
At each step of the walking path, a dot product is computed between a portion of the convolutional kernel and a portion of the given input segment. The result of the dot product is then accumulated into an output buffer. For ease of explanation, this convolution process is described using a 4×4 image convolved with a 3×3 kernel to produce a 2×2 output as seen in
In this example, the input segment is scanned in nine steps starting with the top-left portion of the input segment. In step 1, the dot product is computed for a 2×2 sub-kernel and a 2×2 block of the input segment as seen in
For steps 2 through 4, the instructions sent by the front-end processor are as follows: 1*E+2*F+4*I+5*J, down one row; 1*F+2*G+4*J+5*K, right one column; and 1*B+2*C+4*F+5*G, up one row.
Referring to the figures, in steps 5 and 6, a 2×1 sub-kernel is applied to a set of two 2×1 input column segments.
In step 7, a 1×1 sub-kernel is applied to a set of four 1×1 input segments as seen in
Lastly, in steps 8 and 9, a 1×2 sub-kernel is applied to a set of two 1×2 input row segments as seen in
To maximize throughput, the multipliers in the physical convolution module should be fully utilized whenever possible, so the two 2×1 input column segments are processed together by the physical convolution module in steps 5 and 6. Similarly, four 1×1 input segments are processed together in step 7, and two 1×2 input row segments are processed together in steps 8 and 9. The physical convolution module is preferably equipped with a configurable adder tree to handle the various forms of accumulation in the different steps.
To maximize locality of reference, kernel sections are fetched once and reused until done, and image segments are shifted by one row or column between steps. Such a carefully arranged sequence results in a maze-walking path that maximizes hardware utilization and data locality. An optimal path exists for every kernel size; yet, to minimize storage, paths for larger kernels are composed of multiple smaller paths, for example as described above in relation to
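The nine-step example above can be checked with the short model below, which decomposes the 3×3 kernel into its 2×2, 2×1, 1×1, and 1×2 sections and accumulates each section's dot products into the 2×2 output buffer. The model visits the sections in a simple nested loop rather than the exact maze-walking order, and so illustrates the decomposition only.

```python
import numpy as np
from scipy.signal import correlate2d

# 4x4 input segment (entries A..P) and 3x3 kernel (entries 1..9).
segment = np.arange(1, 17, dtype=float).reshape(4, 4)
kernel = np.arange(1, 10, dtype=float).reshape(3, 3)

out = np.zeros((2, 2))
# Kernel sections: top-left corner (kr, kc) and size (h, w) of each section.
sections = [((0, 0), (2, 2)),   # 2x2 sub-kernel {1, 2, 4, 5}: steps 1-4
            ((0, 2), (2, 1)),   # 2x1 sub-kernel {3, 6}:       steps 5-6
            ((2, 2), (1, 1)),   # 1x1 sub-kernel {9}:          step 7
            ((2, 0), (1, 2))]   # 1x2 sub-kernel {7, 8}:       steps 8-9
for (kr, kc), (h, w) in sections:
    sub_kernel = kernel[kr:kr + h, kc:kc + w]
    for i in range(2):          # visit every position of the 2x2 output
        for j in range(2):
            patch = segment[i + kr:i + kr + h, j + kc:j + kc + w]
            out[i, j] += np.sum(patch * sub_kernel)

# The assembled result equals a direct sliding-window (correlation-style) convolution.
assert np.allclose(out, correlate2d(segment, kernel, mode="valid"))
```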
In one aspect of this disclosure, the configurable convolution processor supports sparse convolution for a sparse input to increase throughput and efficiency. It has been observed that a patch of zeros is more likely to occur in the input than a line of zeros, so skipping zero patches is more effective. The configurable convolution processor readily supports zero-patch skipping with the help of an input non-zero (NZ) map, wherein an NZ bit is 1 if at least one non-zero entry is detected in the area covered by a patch of the same size as the physical convolution module.
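The NZ map and the resulting skip decision can be modeled as follows; the patch granularity (equal to the physical convolution size) matches the description above, while the function name and the example residual are illustrative assumptions.

```python
import numpy as np

def build_nz_map(image, patch=4):
    """One NZ bit per patch-sized tile of the input: the bit is 1 if at
    least one non-zero entry is detected in that tile (behavioral model)."""
    rows, cols = image.shape
    nz = np.zeros((rows // patch, cols // patch), dtype=bool)
    for r in range(0, rows, patch):
        for c in range(0, cols, patch):
            nz[r // patch, c // patch] = np.any(image[r:r + patch, c:c + patch] != 0)
    return nz

# Example: a sparse residual in which most 4x4 tiles are entirely zero.
residual = np.zeros((32, 32))
residual[5, 7] = 1.0
residual[20, 3] = -0.5
nz_map = build_nz_map(residual, patch=4)
# Only tiles whose NZ bit is set are sent to the convolver; the rest are skipped.
print(f"{np.count_nonzero(nz_map)} of {nz_map.size} patches need processing")
```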
Triggered by a neuron's spike, the front-end processor performs reconstruction by retrieving the neuron's kernel from the kernel memory and accumulating the kernel in the image memory, with the kernel's center aligned to the spike location. As in the configurable convolution, a kernel is also divided into sections to support variable kernel size in the spike-driven reconstruction. The NZ map of the reconstructed image is computed by OR'ing the NZ maps of the retrieved kernels, saving both computation and latency compared to the naïve approach of scanning the reconstructed image. The spike-driven reconstruction eliminates the need to store spike maps. In one embodiment of the design, a 16-entry FIFO is sufficient for buffering spikes, cutting the storage by 2.5×.
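A behavioral sketch of the spike-driven reconstruction is given below: each buffered spike causes the corresponding kernel to be added into the image memory centered on the spike location, with no multiplications because the spikes are binary. The spike tuple format, the boundary clipping, and the tile-level NZ update are simplifying assumptions; in particular, the sketch marks every tile touched by the kernel footprint rather than OR'ing the kernel's own NZ map.

```python
import numpy as np

def spike_driven_reconstruction(spikes, kernels, image_shape, patch=4):
    """Behavioral sketch: each spike (neuron index, row, col) adds that
    neuron's kernel into the reconstructed image, centered at the spike
    location, and updates the reconstruction's NZ map."""
    recon = np.zeros(image_shape)
    nz_map = np.zeros((image_shape[0] // patch, image_shape[1] // patch), dtype=bool)
    for n, r, c in spikes:
        k = kernels[n]
        half = k.shape[0] // 2
        r0, c0 = r - half, c - half            # top-left corner of the kernel footprint
        # Clip the kernel footprint to the image bounds (boundary handling simplified).
        rs, cs = max(r0, 0), max(c0, 0)
        re = min(r0 + k.shape[0], image_shape[0])
        ce = min(c0 + k.shape[1], image_shape[1])
        recon[rs:re, cs:ce] += k[rs - r0:re - r0, cs - c0:ce - c0]
        # Mark the tiles covered by the kernel footprint in the NZ map,
        # rather than rescanning the reconstructed image.
        nz_map[rs // patch:(re - 1) // patch + 1, cs // patch:(ce - 1) // patch + 1] = True
    return recon, nz_map
```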
In the example embodiment, the configurable convolution processor 10 implements globally asynchronous communication between the front-end processor and the neurons to achieve scalability by breaking a single clock network with stringent timing constraints into small ones with relaxed constraints. The globally asynchronous scheme further enables load balancing by allowing the front-end processor and individual neurons to run at optimal clock frequencies based on workload. Following feedforward operations, neurons send 10-bit messages identifying neuron spikes to the hub via a token-based asynchronous FIFO. Following a feedback operation, the hub sends 128-bit messages containing the reconstructed image and NZ map to the neurons. To avoid routing congestion from the hub to the neurons, a broadcast asynchronous FIFO is designed, which is identical to the token-based asynchronous FIFO except for the FIFO-full condition check logic.
The asynchronous FIFO design is shown in
As a proof of concept, a 4.1 mm² test chip is implemented in 40 nm CMOS, and the configurable convolution processor 10 occupies 2.56 mm². A mixture of 80.5% high-VT and 19.5% low-VT cells is used to reduce the chip leakage power by 33%. Dynamic clock gating is applied to reduce the dynamic power by 24%. A balanced clock frequency setting for the hub and neurons further reduces the overall power by an average of 22%. A total of 49 VCOs are instantiated, with each VCO occupying an area of only 250 µm². The test chip achieves 718 GOPS at 380 MHz with a nominal 0.9 V supply at room temperature. An OP is defined as an 8-bit multiply or a 16-bit add.
Two sample applications are used to demonstrate the configurable convolution processor: extracting sparse feature representations of images and extracting depth information from stereo images. The feature extraction task is performed entirely by the front-end processor and neurons, while the depth extraction task requires an additional local-matching post-processing step programmed on the on-chip OpenRISC processor. When performing feature extraction using 7×7 kernels, 10 recurrent iterations, and a target sparsity of approximately 90%, the configurable convolution processor 10 achieves 24.6 Mpixel/s (equivalent to 375 frames per second at 256×256 resolution), while consuming 195 mW (shown in dashed lines in
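The quoted frame rate follows directly from the pixel throughput: 24.6 Mpixel/s ÷ (256×256 pixels per frame) ≈ 375 frames per second.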
Compared to state-of-the-art inference processors based on feedforward-only networks, the configurable convolution processor 10 realizes a recurrent network, supports unsupervised learning, and demonstrates expanded functionalities, including depth extraction from stereo images, while still achieving competitive performance and efficiency in power and area.
In this application, including the definitions below, the term “module” or the term “controller” may be replaced with the term “circuit.” The term “module” may refer to, be part of, or include: an Application Specific Integrated Circuit (ASIC); a digital, analog, or mixed analog/digital discrete circuit; a digital, analog, or mixed analog/digital integrated circuit; a combinational logic circuit; a field programmable gate array (FPGA); a processor circuit (shared, dedicated, or group) that executes code; a memory circuit (shared, dedicated, or group) that stores code executed by the processor circuit; other suitable hardware components that provide the described functionality; or a combination of some or all of the above, such as in a system-on-chip.
The module may include one or more interface circuits. In some examples, the interface circuits may include wired or wireless interfaces that are connected to a local area network (LAN), the Internet, a wide area network (WAN), or combinations thereof. The functionality of any given module of the present disclosure may be distributed among multiple modules that are connected via interface circuits. For example, multiple modules may allow load balancing. In a further example, a server (also known as remote, or cloud) module may accomplish some functionality on behalf of a client module.
The term code, as used above, may include software, firmware, and/or microcode, and may refer to programs, routines, functions, classes, data structures, and/or objects. The term shared processor circuit encompasses a single processor circuit that executes some or all code from multiple modules. The term group processor circuit encompasses a processor circuit that, in combination with additional processor circuits, executes some or all code from one or more modules. References to multiple processor circuits encompass multiple processor circuits on discrete dies, multiple processor circuits on a single die, multiple cores of a single processor circuit, multiple threads of a single processor circuit, or a combination of the above. The term shared memory circuit encompasses a single memory circuit that stores some or all code from multiple modules. The term group memory circuit encompasses a memory circuit that, in combination with additional memories, stores some or all code from one or more modules.
The term memory circuit is a subset of the term computer-readable medium. The term computer-readable medium, as used herein, does not encompass transitory electrical or electromagnetic signals propagating through a medium (such as on a carrier wave); the term computer-readable medium may therefore be considered tangible and non-transitory. Non-limiting examples of a non-transitory, tangible computer-readable medium are nonvolatile memory circuits (such as a flash memory circuit, an erasable programmable read-only memory circuit, or a mask read-only memory circuit), volatile memory circuits (such as a static random access memory circuit or a dynamic random access memory circuit), magnetic storage media (such as an analog or digital magnetic tape or a hard disk drive), and optical storage media (such as a CD, a DVD, or a Blu-ray Disc).
The apparatuses and methods described in this application may be partially or fully implemented by a special purpose computer created by configuring a general purpose computer to execute one or more particular functions embodied in computer programs. The functional blocks, flowchart components, and other elements described above serve as software specifications, which can be translated into the computer programs by the routine work of a skilled technician or programmer.
The computer programs include processor-executable instructions that are stored on at least one non-transitory, tangible computer-readable medium. The computer programs may also include or rely on stored data. The computer programs may encompass a basic input/output system (BIOS) that interacts with hardware of the special purpose computer, device drivers that interact with particular devices of the special purpose computer, one or more operating systems, user applications, background services, background applications, etc.
The computer programs may include: (i) descriptive text to be parsed, such as HTML (hypertext markup language) or XML (extensible markup language), (ii) assembly code, (iii) object code generated from source code by a compiler, (iv) source code for execution by an interpreter, (v) source code for compilation and execution by a just-in-time compiler, etc. As examples only, source code may be written using syntax from languages including C, C++, C#, Objective C, Haskell, Go, SQL, R, Lisp, Java®, Fortran, Perl, Pascal, Curl, OCaml, Javascript®, HTML5, Ada, ASP (active server pages), PHP, Scala, Eiffel, Smalltalk, Erlang, Ruby, Flash®, Visual Basic®, Lua, and Python®.
The foregoing description of the embodiments has been provided for purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure. Individual elements or features of a particular embodiment are generally not limited to that particular embodiment, but, where applicable, are interchangeable and can be used in a selected embodiment, even if not specifically shown or described. The same may also be varied in many ways. Such variations are not to be regarded as a departure from the disclosure, and all such modifications are intended to be included within the scope of the disclosure.
This invention was made with government support under grants HR0011-13-3-0002 and HR0011-13-2-0015 awarded by the U.S. Department of Defense/Defense Advanced Research Projects Agency (DARPA). The government has certain rights in this invention.