The present disclosure relates generally to sparse coding algorithms, and more particularly, to a hardware architecture for implementing a sparse coding algorithm.
A key component in many classification algorithms used in computer- or machine-based object or speech recognition/classification systems involves developing and identifying relevant features from raw data. For some raw data types, e.g., image pixels and audio amplitudes, there is often a set of features that more naturally describe the data than other features. Sparse feature coding/encoding helps reduce the search space of the classifiers by modeling high-dimensional data as a combination of only a few active features, and hence, can reduce the computation required for classification.
Sparse coding is a class of unsupervised learning algorithms that attempt to both learn and extract unknown features that exist within an input dataset under the assumption that any given input can be described by a sparse set of learned features. For example, a Sparsenet algorithm, which is an early sparse coding algorithm, attempts to find sparse linear codes for natural images by developing a complete family of features that are similar to those found in the primary visual cortex of primates. (“Features” are also known as “receptive fields,” and may be used herein interchangeably.) A sparse-set coding (SSC) algorithm forms efficient visual representations using a small number of active features. A locally competitive algorithm (LCA) implements sparse coding based on neuron-like elements that compete to represent the received input. A sparse and independent local network (SAILnet) algorithm implements sparse coding using biologically realistic rules involving only local updates, and has been demonstrated to learn the receptive fields or features that closely resemble those of the primary visual cortex simple cells.
Recently developed sparse coding algorithms are capable of extracting biologically relevant features through unsupervised learning and of using inference to encode an input (e.g., an image) using a sparse set of features. More particularly, the algorithm learns features by training a biologically-inspired network of computational elements or model neurons that mimic the activity of neurons of a mammalian brain (e.g., neurons of the visual cortex), and infers the sparse representation of the input using the most salient features. Inference based on the learned features enables the efficient encoding of the input and the detection of features and/or objects thereof using, for example, a weighted sum of features of the model neurons.
Implementation of an energy-efficient, high-throughput sparse coding algorithm on a single chip may be necessary and/or advantageous for low-power and real-time cognitive processing of images, videos, and audio in applications from, for example, mobile telephones or other like electronic devices to unmanned aerial vehicles (UAVs). Such an implementation is not without its challenges, however. For example, the number of on-chip interconnects and the amount of memory bandwidth required to support parallel operations of hundreds (or more) model neurons/computational elements are such that conventional hardware designs often resort to costly and slow off-chip memory and processing.
Accordingly, an objective of the present disclosure is to provide a hardware architecture that, in at least one embodiment, is contained on a single chip (e.g., an application specific integrated circuit (ASIC)) that implements a sparse coding algorithm for an application in learning and extracting features from, for example, images, videos, or audio inputs, and does so in a high-performance and low-power consumption manner. Such an architecture may have a number of applications, for example, emerging embedded vision applications ranging from personal mobile devices to micro unmanned aerial vehicles, to cite only a few possibilities; and may be used in image encoding, feature detection, and as a front end to an object recognition system. Other applications may include non-visual classification tasks such as speech recognition.
According to one embodiment, there is provided a sparse coding system. The system comprises a neural network including a plurality of neurons each having a respective feature associated therewith and each being configured to be electrically connected to every other neuron in the network and to a portion of an input dataset. The plurality of neurons are arranged in a plurality of neuron clusters each comprising a respective subset of the plurality of neurons, and the neurons in each cluster are electrically connected to one another in a bus structure, and the plurality of clusters are electrically connected together in a ring structure.
According to another embodiment, there is provided a sparse coding system. The system comprises an inference module configured to extract features from an input image containing an object, wherein the inference module comprises an implementation of a sparse coding algorithm. The system further comprises a classifier configured to classify the object in the input image based on the extracted features. In an embodiment, the inference module and classifier are integrated on a single chip.
Preferred exemplary embodiments will hereinafter be described in conjunction with the appended drawings, wherein like designations denote like elements, and wherein:
In accordance with one aspect of the present disclosure, an architecture implementing a sparse coding algorithm is provided. More particularly, in an embodiment, the present disclosure relates to a sparse coding neural network implemented in a hardware architecture that may comprise a single-chip architecture (e.g., a “system-on-a-chip”) that includes on-chip learning for feature extraction and encoding. For purposes of illustration, the description below will be primarily with respect to the implementation of the sparse and independent local network (SAILnet) algorithm, and with respect to the use of the algorithm/architecture for visual object classification. It will be appreciated, however, that the same or similar architecture may be useful in implementing other sparse coding algorithms, for example, a locally competitive algorithm (LCA), and/or for applications other than visual object classification. As such, the present disclosure is not intended to be limited to any particular sparse coding algorithm(s) and/or application(s).
More particularly, and as at least briefly described above, a sparse coding algorithm tries to find a sparse set of vectors, known as receptive fields or features, to represent an input dataset, for example and without limitation, an input image. The sparse coding algorithm maps naturally to the neural network 12 of system 10, with one feature associated with each neuron 14. The sparse coding hardware system 10 is capable of performing two primary operations: learning and inference. In the learning operation, the features associated with the neurons 14 are first initialized to random values, and, through iterative stochastic gradient descent over a plurality of training images, the algorithm converges to a dictionary of features that allows images similar to those used in the training/learning process to be accurately represented using a small number of the learned dictionary elements. Learning is done in the beginning to set up the weight values, and occasionally afterwards to update the weights if new input data is not modeled with satisfactory quality using the learned dictionary; accordingly, no real-time constraint is placed on learning.
However, inference needs to be done in real time. In inference, the algorithm generates neuron spikes to indicate the features activated by an input. Generally, the library size (i.e., the number of neurons 14 needed by this algorithm) is no less than the number of pixels in the input image, as an over-complete library tends to capture more intrinsic features and improves the sparsity of model neuron activity.
With reference to
Neural network 12 develops the Q and W weights through learning. After learning converges, the Q weights of the feed-forward connections from a particular neuron 14 represent one feature in the dictionary. The W weights represent the strength of directional inhibitions between neurons 14, which allow neurons 14 to dampen the response of other neurons if their features are all highly correlated with each other. The lateral inhibition forces the neurons 14 to diversify and differentiate their features and minimizes the number of neurons 14 that are active at once.
To illustrate the architecture and implementation of a sparse coding algorithm, the description below will be with respect to the SAILnet algorithm. In an embodiment, the SAILnet algorithm is based on, and thus the neurons 14 comprise, leaky integrate-and-fire neurons. A depiction of an illustrative model of a neuron 14 is shown in
where Xk denotes an input pixel value, sj(t) represents the spike train generated by neuron j (i.e., sj(t)=1 if neuron j fires at time t, otherwise, sj(t)=0). As shown in
Neuron voltage (i.e., voltage Vi(t) across the capacitor C in
When the voltage of the neuron (i.e., Vi(t)) exceeds a threshold voltage θ set by the diode illustrated
The neuron activity within network 12 with respect to an input image is represented by the firing rate of the neurons 14. The synchronous digital description of a neuron's operation is given by the following equation (4):
Vi[n+1] = Vi[n] + η(Σk=1NP QikXk − Vi[n] − Σj=1,j≠iN Wijyj[n])   (4)
where: Vi is the voltage of neuron i; n is a time index; η is an update step size; NP is the number of pixels in the input image or particular patch thereof; N is the number of neurons 14 in network 12; Xk is the value of pixel k in the input image; and yj is the binary output of neuron j. Again, Q is an N×NP matrix that stores the feed-forward connection weights, and Qik stores the weight of the feed-forward connection between neuron i and pixel k; and W is an N×N matrix that stores feedback connection weights, and Wij stores the weight of the feedback connection from neuron j to neuron i (directional).
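For purposes of illustration only, equation (4) may be expressed as the following Python sketch. This is a behavioral software model, not the hardware implementation; it assumes NumPy arrays of the indicated shapes and further assumes that a neuron's voltage resets to zero after it fires, a detail that follows the usual integrate-and-fire convention rather than any statement above.

```python
import numpy as np

def inference_step(V, y, X, Q, W, theta, eta=1.0 / 32):
    """One synchronous leaky integrate-and-fire update per equation (4).

    V:     (N,)    neuron voltages at step n
    y:     (N,)    binary spike outputs at step n
    X:     (NP,)   input pixel values
    Q:     (N, NP) feed-forward connection weights
    W:     (N, N)  lateral inhibition weights (diagonal taken as zero)
    theta: (N,)    firing thresholds
    """
    excitation = Q @ X                  # sum over k of Qik * Xk
    inhibition = W @ y                  # sum over j != i of Wij * yj[n]
    V = V + eta * (excitation - V - inhibition)
    y = (V > theta).astype(V.dtype)     # a neuron fires when V exceeds theta
    V = np.where(y > 0, 0.0, V)         # reset after a spike (assumed behavior)
    return V, y
```

Iterating this step over the inference period and accumulating y yields each neuron's spike count, which serves as the sparse representation of the input.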
In an embodiment, Q weights, W weights, and the voltage threshold θ for each neuron 14 are learned parameters. In practice, a batch of training images are given as inputs to generate neuron spikes. The spike counts, si, where i is an index of the neuron, are then used in parameter updates. For example, using the SAILnet sparse coding algorithm, the updates may be calculated using equations (5)-(7):
Qik(m+1) = Qik(m) + γsi(Xk − siQik(m))   (5)
Wij(m+1) = Wij(m) + β(sisj − p2)   (6)
θi(m+1)=θi(m)+α(si−p) (7)
where m is the update iteration number; α, β, and γ are tuning parameters that adjust the learning speed and convergence; and p is the target firing rate, in units of spikes per input image per neuron, used to adjust the sparsity of neuron spikes. An advantage of the learning rules of the SAILnet algorithm is their locality: Q and θ updates for any particular neuron involve only the spike count and firing threshold of that particular neuron, and W updates involve only the pair of neurons that are part of the relevant lateral connection. Note that the locality of the SAILnet learning rules is a unique feature of the SAILnet algorithm and is not shared by all sparse coding algorithms.
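The locality of the learning rules may be illustrated with the following Python sketch of equations (5)-(7). It is a software illustration rather than the on-chip update logic; s holds the per-neuron spike counts for one training image, the zeroed diagonal of W is an assumption (self-inhibition is not used), and α, β, γ, and p are placeholder tuning values.

```python
import numpy as np

def sailnet_update(Q, W, theta, s, X, alpha, beta, gamma, p):
    """One iteration of the learning rules (5)-(7).

    Q:     (N, NP) feed-forward weights      s: (N,)  spike counts
    W:     (N, N)  lateral weights           X: (NP,) input pixels
    theta: (N,)    firing thresholds
    """
    # (5): each row of Q depends only on that neuron's own spike count.
    Q = Q + gamma * s[:, None] * (X[None, :] - s[:, None] * Q)
    # (6): each Wij depends only on the spike counts of the pair (i, j).
    W = W + beta * (np.outer(s, s) - p ** 2)
    np.fill_diagonal(W, 0.0)            # no self-inhibition (assumed)
    # (7): each threshold is nudged toward the target firing rate p.
    theta = theta + alpha * (s - p)
    return Q, W, theta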
As discussed above, the SAILnet algorithm can be mapped to a fully connected neural network consisting of homogeneous neurons, and a binary spiking and discrete-time neuron model such as that described above makes it possible to design a simple digital neuron, such as, for example, that illustrated in
One way in which the neurons of a neural network may be connected together and communicate is through a bus structure. In a conventional bus structure, communication is advantageously a one-to-many broadcast, and the bus structure has low latency for small networks. However, a bus structure does not scale well with network size, and the high fan-out and wire loading of a bus structure may lead to relatively large RC delays. Larger neural networks also produce more spikes, and thus bus structures may have a relatively high spike collision probability. To account for this, spike collisions must be arbitrated with an arbiter, and, to serve many simultaneous spikes in a large network, the bus needs to run at a higher speed than the neurons, resulting in increased power consumption.
Another way in which the neurons of a network may be connected together and communicate is through a ring structure. In conventional ring structures, the on-chip interconnects are all local, spikes generated by the neurons propagate serially, and spike collisions are eliminated. Since there are no spike collisions, no arbitration is needed; furthermore, fan-out in such structures is low, and the local wire capacitance does not grow with the network size. Therefore, a ring structure is highly scalable. However, the serial communication along a ring incurs high latency and may alter the algorithm dynamics. Significant communication latency degrades the image encoding quality and may yield unacceptable results.
With reference to
At a first or lower layer, the plurality of neurons 14 of network 12 are grouped or divided into a plurality of different clusters 16 (i.e., 16a, 16b, 16c, etc.), with each cluster 16 containing a respective subset of the total number of neurons 14 in network 12, and the neurons 14 of each cluster 16 being connected together by a bus structure. As will be described below, the bus structure may comprise a single bus (e.g., a flat bus) or may comprise a multi-dimensional (e.g., two-dimensional) bus or grid comprised of a plurality of buses. The number of neurons 14 in each cluster 16 (N1)—and thus the size of the bus—may be chosen to keep the fan-out and wire loading of the bus structure low so that a low-latency broadcast bus structure can be achieved. A smaller number of neurons and a smaller bus size also keep the spike collision probability low so that spike collisions can be discarded and arbitration removed with minimal to no impact on the image reconstruction error. In other words, the bus structure in this arrangement comprises an arbitration-free bus structure.
At a second or upper layer, a ring structure 20, for example, a systolic ring, is used to connect and facilitate communication between the plurality of clusters 16. The length of ring structure 20 (N2)—and thus the number of clusters 16—is chosen to keep a low communication latency.
In an embodiment, the sizes of the first and second layers, N1 and N2, need to meet the requirement that N1N2=N, where N is the size of network 12 (i.e., the number of neurons 14 in network 12). There is a tradeoff between N1 and N2: a large N1 and small N2 increases the reconstruction error due to spike collisions, while a large N2 and small N1 increases the communication latency. In an illustrative embodiment wherein network 12 includes or contains 256 neurons (i.e., N=256), it was found through empirical software simulation that the tradeoff may be balanced when N1=64 and N2=4; in other words, when network 12 includes four (4) neuron clusters each containing 64 neurons.
In an embodiment, the bus structure of each cluster 16 is further optimized into a multi-dimensional bus structure or grid structure (e.g., an A×B grid structure comprised of A rows and B columns, wherein in an embodiment A=B). In at least some implementations, the grid structure comprises A horizontal buses each connecting B neurons in a row, and B vertical buses each connecting A neurons in a column. For example, in an embodiment, wherein network 12 includes 256 neurons that are grouped into four (4) clusters of 64 neurons each, the grid structure for one cluster may comprise an 8×8 grid structure having eight (8) horizontal buses each connecting eight (8) neurons in a row, and eight (8) vertical buses each connecting eight (8) neurons in a column. In any event, in an embodiment wherein each cluster 16 is arranged as a grid structure, the fan-out and wire loading seen by each neuron 14 may be substantially (e.g., quadratically) reduced compared to a flat bus structure. Even though there are more buses, the buses are shorter with fewer neurons connected to each bus. A shorter bus has a lower capacitance, and so the delay in transmitting spikes between neurons is shorter.
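The two-layer organization may be made concrete with a simple address decomposition. The Python sketch below assumes the 256-neuron example (N2=4 clusters on the ring, 64 neurons per cluster arranged as an 8×8 grid) and a purely hypothetical NID encoding; the actual chip may map neuron IDs to clusters and buses differently.

```python
def decode_nid(nid, n2=4, rows=8, cols=8):
    """Split a neuron ID into (cluster, row, col) for the two-layer network.

    Hypothetical encoding: the upper bits select one of the n2 ring stages
    (clusters); the remaining bits select the row and column buses within
    the cluster's rows x cols grid.
    """
    per_cluster = rows * cols                  # 64 neurons per cluster
    assert 0 <= nid < n2 * per_cluster         # 256 neurons in total
    cluster, local = divmod(nid, per_cluster)  # which ring stage owns the neuron
    row, col = divmod(local, cols)             # which horizontal/vertical bus
    return cluster, row, col

print(decode_nid(200))   # -> (3, 1, 0): cluster 3, row bus 1, column bus 0
```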
In another embodiment, rather than each grid being constructed of discrete buses, such as that illustrated in
In addition to the aforementioned complications relating to the interconnection of the neurons in a network, another complication is that the memory required to store the Q weights grows at O(NPN) and that required to store the W weights grows at O(N2), where, again, NP is the number of pixels in the input image or at least a portion thereof, and N is the number of neurons 14 in the network 12. As a result, the memory costs significant area and power for a sufficiently large neural network. To account for this, the word length of the weights is optimized to reduce the memory storage, and a memory device of system 10 (e.g., memory 18) is partitioned into, for example, two (2) parts so that, during real-time inference, only one part of memory 18 is powered "on" to reduce power consumption.
More particularly, in an embodiment wherein network 12 comprises 256 neurons, the network may require a 64 K-word Q memory to store the Q weights and a 64 K-word W memory to store the W weights. To reduce the word length, the weights of a sparse coding algorithm can be quantized to fixed point. For example, an empirical analysis of the fixed-point quantization effects on the image reconstruction error for the SAILnet algorithm was performed using software simulations. Given that the input pixels are quantized to 8-bits, the results showed that the word length could be reduced to 13-bits per Q weight and 8-bits per W weight for a good performance, as shown in
Through software simulations, it was also found that the word lengths required by learning and inference differ significantly for sparse coding algorithms. Learning requires a relatively long word length, e.g., for the particular implementation of the SAILnet algorithm, 13-bits per Q weight and 8-bits per W weight, to allow for a small enough incremental weight update to ensure convergence, whereas the word length for inference can be reduced to 4-bits per Q weight and 4-bits per W weight for a good image reconstruction error, as shown in
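In the 256-neuron example, the Q memory holds N×NP=256×256=64 K words and the W memory holds N×N=64 K words, which is why the word-length reduction matters. For purposes of illustration only, the Python sketch below shows one way a 13-bit learned Q weight could be split into a 4-bit core field (the most significant bits, sufficient for inference) and a 9-bit auxiliary field (the least significant bits, needed only for the small learning updates); the unsigned fixed-point scaling is an assumption, not the chip's actual number format.

```python
def split_weight(q, total_bits=13, core_bits=4, q_max=1.0):
    """Quantize a weight to total_bits and split it into core/auxiliary fields.

    The core field holds the core_bits MSBs used during inference; the
    auxiliary field holds the remaining LSBs that only matter for the small
    incremental updates applied during learning.
    """
    aux_bits = total_bits - core_bits                       # 9 LSBs for a Q weight
    levels = (1 << total_bits) - 1
    code = max(0, min(levels, round(q / q_max * levels)))   # unsigned fixed point (assumed)
    core = code >> aux_bits                                 # stored in core memory
    aux = code & ((1 << aux_bits) - 1)                      # stored in auxiliary memory
    return core, aux

print(split_weight(0.3))   # -> (4, 409) under this illustrative scaling
```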
The access bandwidths of the core and auxiliary memories also differ. The core memory is needed for both inference and learning. In every inference step, a neuron spike triggers simultaneous core memory access by all neurons to the same address, corresponding to the NID of the spike. Therefore, the core memory of all neurons in a local grid is consolidated to support the wide parallel memory access by all neurons.
The auxiliary memory is powered “on” only during learning. Since learning does not need to be in real time, it is implemented in a serial way. Moreover, approximate learning may be used to update weights and thresholds only for the most active neurons, so the fully parallel random access to the auxiliary memory is unnecessary. Hence, the auxiliary memory of all neurons in a local grid is consolidated into a larger address space to improve area utilization.
As described elsewhere above, network 12 may be used to perform an inference function, and as such, may be considered to be part of an inference module. More specifically, in an illustrative embodiment, neurons 14 of network 12 (e.g., 256 neurons in the examples described above) are used to perform parallel leaky integrate and fire to generate spikes for inference. Inference is done over a number of inference steps ns that is chosen based on the neuron time constant τ and the inference step size η, i.e., ns=w/(ητ), where w is the inference period. For a low image reconstruction error, w is chosen to be sufficiently long, e.g., w=2τ, and the inference step size is chosen to be sufficiently small, e.g., η=1/32 (in an instance wherein network 12 includes 256 neurons), resulting in the number of inference steps being ns=64.
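As a quick check of that arithmetic (τ is given in arbitrary units, since only the ratio w/τ matters):

```python
tau = 1.0             # neuron time constant (arbitrary units)
w = 2 * tau           # inference period chosen as 2*tau
eta = 1.0 / 32        # inference step size
ns = w / (eta * tau)  # number of inference steps
print(int(ns))        # -> 64
```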
The leaky integrate and fire operation of neurons 14 described by equation (4) above has two main parts: excitation, Σk=1NP QikXk, and inhibition, Σj=1,j≠iN Wijyj[n].
The inhibition computation is driven by spike events over the inference steps. Since the yi[n] term in equation (4) is binary, the inhibition computation is implemented with an accumulator, requiring no multiplication. The inhibition computation is triggered by neuron spikes, i.e., after receiving a spike NID. In an embodiment, it may take up to (N2−1) clock cycles (e.g., 3 clock cycles in an embodiment such as that described above wherein the ring comprises N2=4 stages) for an NID to travel along an N2-size ring to be received by every neuron 14, so a cycle-accurate implementation halts the inference for (N2−1) cycles after an NID is transmitted. In this way, the inhibition computation over the 64-step inference described above requires up to 4×64=256 cycles, assuming one spike per inference step. To reduce the latency, the halt may be removed to implement approximate inference. In approximate inference, an NID will be received by neurons 14 in different grids/clusters 16 at different times, triggering inhibition computations at different times. Excessive spike latency may worsen the image encoding quality. However, since the latency is limited to (N2−1) cycles, the fidelity is maintained as shown in
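The ring latency and the event-driven inhibition may be modeled with the following Python sketch. It is a behavioral approximation of the approximate-inference scheme, assuming a unidirectional ring that advances one cluster per clock cycle; the placement of the η scaling inside the accumulation is an illustrative choice rather than a description of the actual accumulator.

```python
import numpy as np

def ring_delay(src_cluster, dst_cluster, n2=4):
    """Cycles for a spike NID to reach dst_cluster on a unidirectional
    n2-stage ring, i.e., 0 to (n2 - 1) cycles."""
    return (dst_cluster - src_cluster) % n2

def apply_inhibition(V, W, nid, eta=1.0 / 32):
    """Event-driven inhibition triggered by a received NID: because y[nid] is
    binary, each neuron simply subtracts its (scaled) W entry for that NID,
    so an accumulator suffices and no multiplier is needed."""
    V -= eta * W[:, nid]
    return V

# In approximate inference, a spike from cluster 0 triggers apply_inhibition()
# in cluster 1 after 1 cycle but in cluster 3 only after 3 cycles.
print(ring_delay(0, 1), ring_delay(0, 3))   # -> 1 3
```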
The inference operation of system 10 is divided into two phases: loading and inference. In an embodiment, loading for a 16×16 (pixel) still image may take 256 cycles and inference may take 64 cycles. In the case of streaming video, however, consecutive 16×16 frames may be well approximated by only updating a subset (e.g., 64) of the 256 pixels. Accordingly, each phase may be done in 64 cycles, so that the two phases can be interleaved.
With reference to
Of the three types of parameter updates done in learning (Q, W, and θ), the Q update is the most costly computationally, as it involves updating the Q weights of all feed-forward connections from the active spiking neurons. To simplify the control of parameter updates, a message-passing approach may be used. In the Q update phase, the snooping core 22 sends a Q update message for each of the most active neurons 14 recorded in the cache 24. The message may take the form of {[1-bit REQ] [8-bit NID] [4-bit SC]}, where REQ acts as a message valid signal and SC is the spike count. Messages are passed around the ring structure 20 and broadcast through the grids/clusters 16. A small Q update logic is placed inside each grid/cluster 16 to calculate the Q weight update based on equation (5) above when the NID of the message belongs to the grid. The updated weight is saved in, for example, the 9-bit wide auxiliary memory. An occasional carry-out bit from the update will result in an update of, for example, the 4-bit wide Q core memory. The Q updates in all of the grids/clusters 16 can execute in parallel to speed up the updates.
W weight update involves calculating the correlation of spike counts between pairs of the active spiking neurons. The snooping core 22 implements W update by generating a W update message for each active spiking neuron pair. The W update message may be in the form of {[1-bit REQ] [8-bit NID1] [8-bit NID2] [4-bit SC1] [4-bit SC2]}, where NID1 and NID2 are the pair of active spiking neurons, and SC1 and SC2 are the respective spike counts. A small W update logic in the snooping core 22 calculates the W weight update based on equation (6) above. The updated weight is saved in, for example, the 4-bit wide W auxiliary memory, and the carry out bit is written to the 4-bit wide W core memory. Similarly, θ update is implemented by passing a θ update message in the form of {[1-bit REQ] [8-bit NID] [4-bit SC]}. θ updates are done by the respective neurons in parallel.
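The message formats lend themselves to a simple pack/unpack illustration. The Python sketch below packs the Q update message ({1-bit REQ, 8-bit NID, 4-bit SC}) into a 13-bit word and applies the equation (5) update when the NID belongs to the local grid; the bit ordering, field positions, and the example parameter values are illustrative assumptions rather than the chip's actual encoding.

```python
import numpy as np

def pack_q_msg(nid, sc):
    """Pack a Q update message as {REQ[12] | NID[11:4] | SC[3:0]} (assumed layout)."""
    assert 0 <= nid < 256 and 0 <= sc < 16
    return (1 << 12) | (nid << 4) | sc

def unpack_q_msg(msg):
    return (msg >> 12) & 0x1, (msg >> 4) & 0xFF, msg & 0xF   # REQ, NID, SC

def q_update_in_grid(Q, msg, X, gamma, grid_nids):
    """Apply the equation (5) update for the message's neuron if it is local."""
    req, nid, sc = unpack_q_msg(msg)
    if req and nid in grid_nids:
        Q[nid, :] += gamma * sc * (X - sc * Q[nid, :])
    return Q

# Example with hypothetical sizes: 256 neurons, 256-pixel patch, grid 0 holds NIDs 0-63.
Q = np.zeros((256, 256))
X = np.random.rand(256)
Q = q_update_in_grid(Q, pack_q_msg(nid=7, sc=3), X, gamma=0.01, grid_nids=set(range(64)))
```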
To demonstrate the operation and functionality of system 10, the architectural and algorithmic features described above were incorporated in an ASIC test chip implemented in TSMC 65 nm CMOS technology.
The test chip was limited in the number of input and output pads; therefore, the input image was scanned bit-by-bit into the SRAM. After the scan was complete, the chip operated at its full speed. It is envisaged that this ASIC chip would be integrated with an imager so that the image input can be provided directly on-chip, and not be limited by expensive off-chip input and output.
Bench testing of the hardware prototype demonstrated that the test chip was fully functional. The measured inference power consumption is plotted in
The measured learning power is shown in
It will be appreciated that values of the performance characteristics and parameters set forth above comprise test data relating to the specific implementation of the test chip. Because one or more of the performance characteristics and parameter values may differ depending on the particular implementation of the chip, it will be further appreciated that the present disclosure is not intended to be limited to any particular values for the performance characteristics/parameters.
An interesting aspect of sparse coding algorithms is their resilience to errors in the stored memory weights. This resilience stems from the inherent redundancy of the neural network and the ability to correct errors through on-line learning. The benefit of this error tolerance was explored by over-scaling the supply voltage of the core memory during inference to evaluate the trade-off between reliability and potential energy savings. To do so, the memory bit error rate was measured using the scan chain interface by first writing and verifying the correct known values at the nominal 1.0 V supply voltage, and then lowering the supply voltage, running inference, and reading out the values for comparison.
In any event, in an illustrative embodiment, system 10 comprises a 256-neuron architecture or system for sparse coding. A two-layer network is provided that links four (4) 64-neuron grids/clusters in a ring to balance capacitive loading and communication latency. The sparse neuron spikes and the relatively small grid keep the spike collision probability low enough that collisions are discarded with only a slight effect on the image reconstruction error. To reduce memory area and power, a memory is provided that is partitioned or divided into a core memory and an auxiliary memory, the latter of which is powered down during inference to save power. The parallel neural network 12 permits a high inference throughput. Parameter updates in learning are serialized to save implementation overhead, and the number of updates is reduced by an approximate approach that considers only the most active neurons. A message-passing mechanism is used to run parameter updates without costly controls.
In an embodiment, the functionality and operation of system 10 that has thus far been described may comprise a sparse feature extraction inference module (IM) that may be integrated with a classifier 26 (e.g., a task-driven dictionary classifier) to form an object recognition processor. In other words, the IM may comprise a front end, and the classifier a back end, of an end-to-end object recognition processor disposed on a single chip. Accordingly, in an embodiment, system 10 may further include a classifier for performing an object classification/recognition function in addition to the components described above.
As at least briefly described above and as illustrated in
To demonstrate the operation and functionality of system 10 having both an IM and a classifier integrated thereon, a test chip was fabricated in TSMC 65 nm CMOS technology and bench tested.
It is to be understood that the foregoing description is of one or more embodiments of the invention. The invention is not limited to the particular embodiment(s) disclosed herein, but rather is defined solely by the claims below. Furthermore, the statements contained in the foregoing description relate to the disclosed embodiment(s) and are not to be construed as limitations on the scope of the invention or on the definition of terms used in the claims, except where a term or phrase is expressly defined above. Various other embodiments and various changes and modifications to the disclosed embodiment(s) will become apparent to those skilled in the art.
As used in this specification and claims, the terms “e.g.,” “for example,” “for instance,” “such as,” and “like,” and the verbs “comprising,” “having,” “including,” and their other verb forms, when used in conjunction with a listing of one or more components or other items, are each to be construed as open-ended, meaning that the listing is not to be considered as excluding other, additional components or items. Further, the term “electrically connected” and the variations thereof is intended to encompass both wireless electrical connections and electrical connections made via one or more wires, cables, or conductors (wired connections). Other terms are to be construed using their broadest reasonable meaning unless they are used in a context that requires a different interpretation.
This application claims the benefit of U.S. Provisional Application No. 62/172,527 filed Jun. 8, 2015, the entire contents of which are hereby incorporated by reference.
This invention was made with government support under HR0011-13-2-0015 awarded by the Department of Defense/DARPA. The government has certain rights in the invention.