The present disclosure relates to a sparse video inference processor for action classification and motion tracking.
Sparse coding is a class of unsupervised machine learning algorithms that learn and extract the unknown features present in an input dataset, under the assumption that any given input can be described by a sparse subset of the learned features. Sparse coding helps reduce the search space of downstream classifiers by modeling high-dimensional data as a combination of only a small number of active features and, hence, can reduce the computation required for classification. Sparse coding can be implemented in a recurrent network of spiking leaky integrate-and-fire neurons, where a neuron's potential increases due to input excitation, known as potentiation, and decreases due to inhibition by neighboring neurons. In this disclosure, a sparse coding algorithm called the locally competitive algorithm (LCA) is considered for inference. The LCA is described by equation (1).
\Delta u = \eta\left[\Phi^{T} x - (\Phi^{T}\Phi - I)\,a - u\right]
a = T_{\lambda}(u) \qquad (1)
where u is the neuron potential; Δu is the potential update; η is the update step size; Φ is the matrix of receptive fields (RFs) of the neurons, also known as the dictionary; x is the input; a is the neuron activation; and I is the identity matrix. T_λ(·) is a binary threshold function that outputs 1 if its input exceeds λ and 0 otherwise. The threshold λ is learned from training data using an optimization method, such as stochastic gradient descent, to maximize encoding accuracy and the sparsity of the neuron activations, i.e., the number of zeros in the neuron activations. While reference is made to the LCA, other sparse coding algorithms are also contemplated by this disclosure.
In performing inference on video inputs, an input is divided into 3D segments for processing. For example, x is a series of T consecutive, overlapping video segments, each of size X×Y×D, as shown in
The inference described by equation (1) consists of four functional steps: charge, compete, leak and activate. In the charge step, the input x is projected onto the feature space as described by Φᵀx. The projection can be understood as encoding the input x in spatiotemporal receptive fields (STRFs), i.e., extracting STRFs from the input. The projection increases, or charges, the neuron potential.
To maintain sparse activation, active neurons suppress other neurons in the compete step. The inhibition weight between a pair of neurons is computed by correlating their STRFs, i.e., ΦᵀΦ. Self-inhibition is removed by subtracting I. The closer two neurons' STRFs, the stronger the inhibition between them. Neuron activations trigger inhibitions as described by −(ΦᵀΦ − I)a.
In the leak step, neuron potential decreases over time, and the leakage is proportional to the potential. In the activate step, neuron potential is thresholded to generate binary spikes.
The four steps above constitute one iteration of inference. Given an input x, the inference is preferably done by iterating the four steps until convergence. It is common to use a fixed number of iterations I. The baseline implementation is illustrated in
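For concreteness, a minimal software sketch of the baseline inference loop of equation (1) is given below. It is purely illustrative: it assumes floating-point NumPy arithmetic and placeholder values for the step size η, threshold λ and iteration count, rather than the fixed-point hardware described later in this disclosure.

```python
import numpy as np

def lca_inference(Phi, x, eta=0.1, lam=0.5, num_iters=32):
    """Baseline LCA inference per equation (1).

    Phi: (V, N) dictionary of receptive fields
    x:   (V, T) input, T segments of dimension V each
    Returns binary activations a of shape (N, T).
    """
    N = Phi.shape[1]
    T = x.shape[1]
    u = np.zeros((N, T))               # neuron potentials
    a = np.zeros((N, T))               # binary activations
    b = Phi.T @ x                      # charge: computed once per inference
    G = Phi.T @ Phi - np.eye(N)        # inhibition weights, precomputed
    for _ in range(num_iters):
        u += eta * (b - G @ a - u)     # charge - compete - leak
        a = (u > lam).astype(u.dtype)  # activate: binary threshold T_lambda
    return a
```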
The implementation complexity of one iteration of inference is analyzed and the results are listed below in Table 1. The dictionary storage requires VN entries. The inhibitory weights are computed as ΦᵀΦ − I. The N² weights can be computed once and stored in memory.
The charge step requires NVT MACs; it is done once per inference, and the result is accumulated in subsequent iterations of the inference. The compete step is driven by neuron activations, requiring N²T MACs per iteration over I iterations.
The number of neurons N typically ranges from hundreds upward for practical applications, and video inference can be particularly challenging due to the large dimensionality of the inputs and STRFs. A realistic implementation calls for a large chip size and high processing power.
This section provides background information related to the present disclosure which is not necessarily prior art.
This section provides a general summary of the disclosure, and is not a comprehensive disclosure of its full scope or all of its features.
A video inference processor is presented that extracts features from a video. The video inference processor includes a residual layer, a charge layer and an activation layer as well as a plurality of neurons. The plurality of neurons are interconnected to form a recurrent neural network implemented in hardware, such that each neuron is configured to store in memory a receptive field. The residual layer is configured to receive a video input and output from an activation layer. During operation, the residual layer reconstructs the video input from the output from the activation layer, subtracts the reconstructed input from the video input to yield a residual and quantizes values of the residual. The charge layer is configured to receive the quantized values of the residual from the residual layer and operates to project the quantized values of the residual onto the plurality of receptive fields, thereby yielding potential update values for the plurality of neurons. The activation layer is configured to receive the potential update values for the plurality of neurons from the charge layer and operates to accumulate the potential update values and threshold potential values for the plurality of neurons to generate a set of binary outputs, wherein the set of binary outputs is fed back to the residual layer.
In one aspect, the residual layer reconstructs the input video using only select accumulate operations and without multiplication operations. For example, the residual layer is implemented in hardware using multiplexers, adders and registers. Likewise, the charge layer projects the quantized values of the residual onto the plurality of receptive fields using only select accumulate operations and without multiplication operations. The charge layer may also be implemented in hardware using multiplexers, adders and registers.
Further areas of applicability will become apparent from the description provided herein. The description and specific examples in this summary are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
The drawings described herein are for illustrative purposes only of selected embodiments and not all possible implementations, and are not intended to limit the scope of the present disclosure.
Corresponding reference numerals indicate corresponding parts throughout the several views of the drawings.
Example embodiments will now be described more fully with reference to the accompanying drawings.
Video data is large, but it also contains high redundancy, especially from frame to frame. The redundancy offers opportunities for significant complexity reduction in storage and compute through compression and rectification. While reference is made throughout this disclosure to video input, the broader aspects of the classification scheme are applicable to other types of inputs as well.
In this disclosure, the LCA equation is reformulated by factoring out the term Φᵀ in (1).
\Delta u = \eta\left[\Phi^{T}(x - \Phi a) + a - u\right]
a = T_{\lambda}(u) \qquad (2)
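The equivalence of (2) and (1) follows by distributing Φᵀ and noting that Ia = a:

\Phi^{T}x - (\Phi^{T}\Phi - I)\,a - u \;=\; \Phi^{T}x - \Phi^{T}\Phi\,a + a - u \;=\; \Phi^{T}(x - \Phi a) + a - u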
The reformulated inference can be interpreted as having four steps: residual, charge, leak and activate. The leak and activate steps are identical to the original formulation. The residual and charge steps are described below.
To reduce the compute complexity, a min/max rectification is applied to the residuals to quantize them to ternary spikes, as shown below.
The residual rectification is done by applying thresholds of λᵣ and −λᵣ to quantize the residuals to 1, 0 and −1. Similar to how the threshold λ is learned from training, the threshold λᵣ can also be learned from training using the same optimization method. The optimization is formulated to maximize sparse encoding accuracy and the sparsity of the residuals, i.e., the number of zeros in the residuals.
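A minimal sketch of the residual rectification follows; the parameter name lam_r is illustrative.

```python
import numpy as np

def rectify_residual(r, lam_r):
    """Quantize residuals to ternary spikes in {-1, 0, 1}.

    Entries above lam_r map to 1, entries below -lam_r map to -1,
    and everything in between is zeroed out (sparsified away).
    """
    return np.where(r > lam_r, 1, np.where(r < -lam_r, -1, 0))
```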
A key advantage of quantizing the residuals to binary or ternary spikes is that multiplication by these quantized values followed by accumulation, which is needed in the charge step, no longer requires an expensive multiplier. Instead, a simpler select-accumulate (SA) can be used. For example, suppose a is binary (0 or 1); multiplying a by b and accumulating can then be done using an SA that is implemented as a select-add, as shown in
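The sketch below illustrates the select-accumulate idea in software terms: because each spike is restricted to {−1, 0, 1}, every term of the dot product reduces to a conditional add or subtract, so no multiplier is required. The function is a behavioral model of the hardware mux-adder, not a description of the circuit itself.

```python
def select_accumulate(spikes, weights):
    """Dot product of ternary spikes and weights without multiplication.

    A multiplexer selects +w, -w or nothing, and a single adder
    accumulates the selection, mirroring the SA hardware unit.
    """
    acc = 0
    for s, w in zip(spikes, weights):
        if s == 1:       # select: add the weight
            acc += w
        elif s == -1:    # select: subtract the weight
            acc -= w
        # s == 0: skipped entirely, exploiting sparsity
    return acc
```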
The residuals tend to reduce in magnitude over iterations, resulting in increasing sparsity over the course of an inference. By appropriately choosing λᵣ, the residuals can be further sparsified. By designing hardware to take advantage of this sparsity, significant performance improvement and power savings can be achieved.
Similar to the residual rectification, neuron activation can be viewed as the rectification of neuron potentials to produce sparse, binary spikes. Binary spikes allow the reconstruction in the residual step to be implemented using SAs, presenting another opportunity for significant complexity and power reduction.
Taking advantage of both residual rectification and neuron activation, the sparse, all-spiking approach can be implemented as shown in
With reference to
The video inference processor 50 includes a plurality of neurons interconnected to form a recurrent neural network. Each neuron in the plurality of neurons is implemented in hardware. Each neuron is configured to store in memory a receptive field which represents a possible feature in a video. The plurality of receptive fields are collectively referred to as a dictionary. In the example embodiment, each receptive field represents a time series of video segments of similar size to the video inputs. The video inference processor 50 uses a dictionary of 192 STRFs (N=192), each of size 6×6×8, to encode video segments using the STRFs. STRF weights are quantized to 8 bits. Based on the STRFs extracted from the video, classification tasks, such as action classification, can be performed.
In one example embodiment, 54 KB of memory is needed on chip to store the dictionary. The densities of neuron activations and residuals can be optimally set to Sₐ = 1% and Sᵣ = 3%, respectively, in processing the KTH Human Action Dataset, to maximize sparsity without sacrificing action classification accuracy. The number of iterations is tunable up to 32. The sparse, all-spiking approach reduces the number of operations per inference from approximately 200M MACs to 4M SAs, which translates to a significant reduction in complexity and power consumption.
The video inference processor 50 is further comprised of three layers: a residual layer 54, a charge layer 55 and an activation (activate) layer 56. That is, the residual step is mapped to the residual layer; the charge step is mapped to the charge layer (the leak step is absorbed into the charge layer); and the activate step is mapped to the activation layer. The residual layer and the charge layer are the workhorses of the video inference processor 50. Each layer is nonblocking, and data is streamed through the residual layer, the charge layer and the activation layer, and back to the residual layer for the next iteration.
The residual layer 54 is configured to receive a video input (in the initial iteration of an inference) as well as output from the activation layer 56 (in subsequent iterations of an inference). Briefly, the residual layer 54 reconstructs the video input from the output from the activation layer 56. More specifically, the residual layer 54 reconstructs the video input by summing the receptive fields that are activated in the output from the activation layer. It is noted that the residual layer 54 reconstructs the input video using only select accumulate operations and without multiplication operations. The residual layer 54 then subtracts the reconstructed input from the video input to yield a residual and quantizes values of the residual.
The charge layer 55 is configured to receive the quantized values of the residual from the residual layer 54. The charge layer 55 operates to project the quantized values of the residual onto the plurality of receptive fields and thereby yield potential update values for the plurality of neurons. Likewise, the charge layer 55 projects quantized values of the residual onto the plurality of receptive fields using only select accumulate operations and without multiplication operations. In one embodiment, the charge layer 55 compresses the quantized values of the residual by aggregating quantized values of a given pixel across video segments as further described below.
The activation layer 56 is configured to receive the potential update values for the plurality of neurons from the charge layer 55. The activation layer 56 operates to accumulate the potential update values and threshold the potential values for the plurality of neurons to generate a set of binary outputs. The set of binary outputs is in turn fed back to the residual layer for the next iteration of processing. Further details for the example embodiment are provided below.
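A functional sketch of one iteration through the three layers is given below, using the notation of equation (2). It is a behavioral model under assumed dense NumPy arrays; the actual layers stream data, exploit sparsity and use select-accumulates rather than matrix products.

```python
import numpy as np

def spiking_iteration(Phi, x, u, a, eta, lam, lam_r):
    """One iteration of the all-spiking inference.

    Phi: (V, N) dictionary; x: (V, T) input;
    u, a: (N, T) neuron potentials and binary activations.
    """
    # Residual layer: a is binary, so Phi @ a is a sum of selected
    # receptive fields (select-accumulates); subtract and rectify.
    res = x - Phi @ a
    r = np.where(res > lam_r, 1, np.where(res < -lam_r, -1, 0))
    # Charge layer: project ternary spikes onto the receptive fields;
    # with r in {-1, 0, 1} this is again multiplier-free in hardware.
    du = eta * (Phi.T @ r + a - u)
    # Activation layer: accumulate updates and threshold to binary spikes.
    u = u + du
    a = (u > lam).astype(u.dtype)
    return u, a, r
```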
The dictionary Φ and its transpose Φᵀ are accessed by the residual layer and the charge layer, respectively. Since the residual layer and the charge layer operate concurrently in a streaming pipeline and access the dictionary elements in different orders, both Φ and Φᵀ are stored on chip, requiring 108 KB of memory in the example embodiment. Due to the high access bandwidth needed for highly parallel processing, the dictionary memory is divided into banks, sacrificing some storage efficiency. In the example embodiment, the dictionary memory occupies 2.5 mm² of chip area in a 40 nm CMOS technology.
In the example embodiment, each dictionary element is a 6×6×8 8-bit STRF, essentially a sequence of eight 6×6 frames. Redundancy exists between consecutive frames, making it possible to compress each STRF to save memory, chip area and power. In
The similarity between consecutive frames makes it possible to delta encode the STRFs by storing the first 6×6 8-bit frame as the anchor frame and each subsequent frame as 4-bit pixel-by-pixel deltas relative to the previous frame. The delta encoding reduces the dictionary storage by 43%.
Although 4 bits are sufficient to cover 95% of the deltas, a better result requires a larger range. To keep deltas to 4 bits while increasing the range of coverage, non-uniform quantization of deltas is proposed as shown in
The delta-encoded dictionary elements need to be decompressed before being used in computations with reference to
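The following sketch illustrates the encode/decode flow. The actual non-uniform quantization levels are given in the referenced figure and are not reproduced here; as a stand-in, this sketch simply clamps each delta to a 4-bit two's-complement range, which loses the wide-delta coverage the non-uniform scheme provides.

```python
import numpy as np

def delta_encode_strf(strf):
    """Delta-encode an STRF of shape (8, 6, 6): the first frame is kept
    as an 8-bit anchor, and each later frame is stored as pixel-by-pixel
    deltas to the previous frame, clamped here to 4 bits."""
    anchor = strf[0]
    deltas = np.clip(np.diff(strf.astype(np.int16), axis=0), -8, 7)
    return anchor, deltas.astype(np.int8)

def delta_decode_strf(anchor, deltas):
    """Reconstruct the frame sequence by cumulatively applying deltas."""
    frames = [anchor.astype(np.int16)]
    for d in deltas:
        frames.append(frames[-1] + d)
    return np.stack(frames)
```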
In architecting the residual layer, an array of V SAs (V=288 in the example embodiment) is employed as the compute engine, as illustrated in
As illustrated in
The spike detector skips zeros to enable improvements in both performance and power. If the entire column j of a is zero, column j of the reconstruction is also zero, and the majority of the residual layer processing can be skipped. This approach is called layer skipping. Experiments with the KTH Dataset show that layer skipping reduces the residual layer processing latency by 6.3× and its power consumption by 3.5× in the example embodiment.
Clock gating can be used in conjunction with spike detection to save additional power. When no spikes are present, the compute is idle and the clock is gated to save clocking power. Clock gating is especially effective when processing sparse data, as evidenced in the example embodiment, where power is reduced by 4.2×.
In architecting the charge layer, an array of N SAs (N=192 in the prototype design) is employed, as illustrated in
Note that r is a collection of ternary spikes {−1, 0, 1}, and the majority of the entries are 0. Each column of r represents an X×Y×D segment, i.e., D frames in the XY-plane. To increase performance, we pool the D frames into one. If at least one pixel among all the pixels at the same location in the XY-plane across the D frames is nonzero, pooling outputs 1 for that pixel. After pooling, each entry of rₐ represents an "aggregated" pixel i (in the XY-plane) across the D frames, as shown in
A key benefit of pooling is that it enables aggregated processing to increase performance. As shown in
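A functional sketch of the pooling step follows, assuming one column of r has been reshaped to a (D, X, Y) array.

```python
import numpy as np

def pool_residual_frames(r_col):
    """OR-pool D ternary frames into one aggregated XY frame.

    r_col: ternary spikes of shape (D, X, Y). A pooled pixel is 1 if
    any of the D frames has a nonzero spike at that XY location.
    """
    return (r_col != 0).any(axis=0).astype(np.int8)
```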
Potential updates Δu (N×T) are accumulated in the activation layer to compute new neuron potentials. Δu is received column by column, and the activation layer uses N accumulators (N=192 in the prototype design) to update one column of potentials at a time. The potentials are thresholded to obtain the binary activations a.
The activations a (N×T) are binary and sparse. As described above, a is fed to a spike detector to locate the nonzero entries for processing in the residual layer. The spike detector can also be used to encode a in compressed column storage (CCS), i.e., storing only the addresses of the nonzero entries in every column, as illustrated in
In the example embodiment, we limit the storage to 8 nonzero entries of a per column (based on the average density Sₐ = 1%, N = 192, and a 4× margin). Additional nonzero entries are dropped with negligible impact on accuracy due to the extremely low likelihood of occurrence. CCS effectively reduces the storage by 64% to 84% in the example embodiment when processing the KTH Dataset.
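A sketch of the CCS encoding of a single activation column, assuming the 8-entry limit of the example embodiment:

```python
import numpy as np

def ccs_encode_column(a_col, max_entries=8):
    """Compressed column storage: keep only the addresses (indices) of
    the nonzero entries of a binary activation column. Entries beyond
    the limit are dropped, which is rare at ~1% density with N = 192."""
    addrs = np.flatnonzero(a_col)
    return addrs[:max_entries]
```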
As proof of concept, a prototype chip was designed to demonstrate the efficient designs for video inference applications. The system level design is shown in
Through the OpenRISC processor, the video inference processor is configurable with several settings: 64, 128 or 192 neurons (N), frame size (X×Y) from 1 to 36 and depth (D) from 1 to 8. Inputs are streamed into the frame load queue, and dictionary elements are recovered from their compressed storage prior to performing compute.
The video inference chip is implemented in 40 nm CMOS, occupying 3.98 mm². The chip microphotograph is shown in
The 6-class KTH Human Action Dataset is used for action classification testing (600 samples with a train/test split ratio of 5:1). With the core extracting the activated STRFs, a soft-max classifier implemented on the OpenRISC processor achieves a 76.7% classification accuracy. Using the same outputs, an off-chip support vector machine (SVM) classifier achieves an 82.8% accuracy, as shown below in Table IV.
Motion tracking is also prototyped using a simple bounding box regression method based on the core outputs. Compared to state-of-the-art vision processors, this design offers enhanced capabilities of action classification and motion tracking using a recurrent network. The design exploits sparse spikes to effectively reduce workload, demonstrating competitive performance and efficiency. The sparse video inference processor is suitable for a range of cognitive processing tasks.
The foregoing description of the embodiments has been provided for purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure. Individual elements or features of a particular embodiment are generally not limited to that particular embodiment, but, where applicable, are interchangeable and can be used in a selected embodiment, even if not specifically shown or described. The same may also be varied in many ways. Such variations are not to be regarded as a departure from the disclosure, and all such modifications are intended to be included within the scope of the disclosure.
This application claims the benefit of U.S. Provisional Application No. 62/515,683, filed on Jun. 6, 2017. The entire disclosure of the above application is incorporated herein by reference.
This invention was made with government support under Grant No. HR0011-13-2-0015 awarded by the Defense Advanced Research Projects Agency. The Government has certain rights in this invention.