The present disclosure relates to improving deep neural network architectures through patterned filter clustering and computation reuse.
Hyperdimensional Computing (HDC) is a brain-inspired learning paradigm based on the observation that brains perform cognitive tasks by mapping sensory inputs to high-dimensional neural representations. The paradigm enables the brain to carry out simple, low-power, error-resilient, and parallelizable operations entirely in the hyperspace. Such characteristics of HDC make it appealing for a wide variety of applications, such as the IoT domain, which generates an increasing amount of data under tight resource and energy constraints. Conventional processing platforms such as CPUs and GPUs may not take full advantage of the highly parallel bit-level operations of HDC. Furthermore, existing HDC encoding techniques often do not cover a broad enough range of applications to make a custom design worthwhile.
The increasing effectiveness of Deep Neural Networks (DNNs) across various application areas is paralleled by an expansion in both the size and computational requirements of their models. To address the challenges related to the memory and computational demands of DNNs, considerable research efforts have been dedicated to developing compression techniques. These techniques include weight quantization, pruning, clustering, and filter pruning, with particular emphasis on enhancing hardware efficiency through hardware-aware quantization and structured pruning. In the context of weight quantization, it can include the assignment of network parameters to a predefined set of values, such as in uniform quantization.
Weight clustering is an effective technique for compressing deep neural network (DNN) memory by using a limited number of unique weights and low-bit weight indexes to store clustering information. Weight clustering consolidates weights into clusters, assigning a single value to all weights within a cluster. This allows for the storage of just the cluster index or ID for each weight in an index table, accompanied by a smaller table mapping these indexes to actual weight values. Prior studies have demonstrated that maintaining approximately 16 unique weights can preserve model accuracy, effectively doubling memory efficiency by replacing 8-bit weight representations with 4-bit index values.
Some embodiments of the present disclosure relate to encoding techniques that can enhance accuracy for a wide array of applications. Disclosed herein is an Application-Specific Integrated Circuit (ASIC) accelerator system that leverages the encoding techniques and can be optimized for edge computing environments. The ASIC accelerator system can support classification (e.g., encompassing both training and inference) and clustering for unsupervised learning, demonstrating an adaptability to various application requirements and hypervector dimensionalities. Such adaptability can enable the ASIC accelerator system to dynamically adjust between accuracy and energy/performance efficiency on demand. In some cases, the ASIC accelerator system can be augmented with application-opportunistic power-gating and voltage over-scaling strategies, exploiting the inherent error resilience of Hyperdimensional Computing (HDC) for further reductions in energy consumption. The encoding techniques described herein can significantly improve prediction accuracy over existing HDC and machine learning techniques, setting a new standard in the field. Further, the ASIC accelerator system can offer substantial improvements in energy efficiency over previous solutions, marking a significant advancement in ASIC accelerator technology for edge computing applications.
Some embodiments of the present disclosure relate to techniques and architectures for encoding data within a hyperdimensional computing (HDC) framework, enabling the transformation of input data into high-dimensional vector space representations. Embodiments herein facilitate the segmentation of data into multiple windows, selection of level hypervectors corresponding to data elements, application of permutation operations for positional encoding, and execution of binary operations to synthesize window hypervectors. The aggregation of such window hypervectors yields an encoded hypervector that encapsulates a representation of the original data in HDC space. This process can include the use of exclusive OR (XOR) operations for binary execution, predefined sets of level hypervectors for quantization, or unique identifier hypervectors for incorporating global sequence information. The disclosed embodiments are adept at handling various data types, including textual, image, voice, or sensor data, providing for broad applicability and adaptability in encoding for hyperdimensional computing applications.
Some embodiments of the present disclosure relate to a pattern clustering system, which can be designed to enforce shared clustering topologies on filters, thereby leading to a significant reduction in memory usage through the reuse of index information. The pattern clustering system can effectively factorize input activations and post-process unique weights, substantially decreasing the requirement for multiplication operations. In some cases, the pattern clustering system can reduce the number of addition operations by leveraging the fact that filters sharing a clustering pattern have identical factorized terms. Some embodiments of the present disclosure relate to techniques for determining and assigning clustering patterns, as well as for training a network to adhere to these target patterns. Some embodiments of the present disclosure relate to an efficient accelerator based on the patterned filters. The pattern clustering system can reduce both the memory footprint and the operation count, while maintaining accuracy comparable to that of baseline models. Furthermore, the accelerator for the pattern clustering system can significantly enhance energy efficiency, surpassing the performance of conventional technologies and setting a new benchmark in the field.
Throughout the drawings, reference numbers can be re-used to indicate correspondence between referenced elements. The drawings are provided to illustrate embodiments of the present disclosure and are not intended to limit the scope thereof.
Hyperdimensional Computing (HDC) often uses algorithms to encode raw inputs to a high-dimensional representation of hypervectors with Dhυ≈2-5 K dimensions. The encoding can take place by deterministically associating each element of an input with a binary or bipolar (±1) hypervector and bundling (element-wise addition) the hypervectors of all elements to create the encoded hypervector. Training can involve bundling all encoded hypervectors of the same category. For inference, the query input can be encoded to a hypervector in the same or similar fashion and compared with all class hypervectors using a simple similarity metric, such as cosine.
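For illustration only, the following Python sketch mirrors this encode/bundle/compare flow in software; the dimensionality, the random bipolar level hypervectors, and the helper names are assumptions made for the example and are not part of the disclosed hardware design.

```python
import numpy as np

D = 4096          # hypervector dimensionality (Dhv)
LEVELS = 64       # number of quantization bins
rng = np.random.default_rng(0)

# Bipolar (+1/-1) level hypervectors, one per quantization bin (illustrative).
level_hvs = rng.choice([-1, 1], size=(LEVELS, D))

def encode(features, lo=0.0, hi=1.0):
    """Map each feature to a level hypervector and bundle (element-wise add) them."""
    bins = np.clip(((features - lo) / (hi - lo) * (LEVELS - 1)).astype(int), 0, LEVELS - 1)
    return level_hvs[bins].sum(axis=0)          # encoded hypervector

def train(samples, labels, n_classes):
    """Class hypervectors = bundle of all encoded inputs sharing the same label."""
    classes = np.zeros((n_classes, D))
    for x, y in zip(samples, labels):
        classes[y] += encode(x)
    return classes

def infer(query, classes):
    """Compare the encoded query with class hypervectors using cosine similarity."""
    h = encode(query)
    scores = classes @ h / (np.linalg.norm(classes, axis=1) * np.linalg.norm(h) + 1e-12)
    return int(np.argmax(scores))
```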
In some cases, the bit-level massively parallel operations of HDC do not map well onto conventional CPUs/GPUs due to, e.g., memory latency and data movement of large vectors, or the fact that these devices are over-provisioned for the mostly binary operations of HDC. Furthermore, solutions for custom HDC accelerators often suffer from limitations such as supporting only a narrow range of applications, achieving lower accuracy compared to baseline ML algorithms, or consuming significantly more energy.
Disclosed herein are inventive concepts that address these or other problems. Some inventive concepts herein relate to an ASIC accelerator system (sometimes referred to as a highly efficient learning engine on edge using hyperdimensional computing or GENERIC) for efficient and accurate trainable classification and clustering. The ASIC accelerator system can be compact and low-power (e.g., to meet year-long battery-powered operation) and/or can be fast during training and burst inference, e.g., when it serves as an IoT gateway.
Some inventive concepts herein relate to an HDC encoding that yields high accuracy in various benchmarks. Some inventive concepts herein relate to an ASIC accelerator system that can implement accurate HDC-based trainable classification and clustering. The ASIC accelerator system can benefit from extreme energy reduction techniques such as, but not limited to, application-opportunistic power gating, on-demand dimension reduction, and error-resilient voltage over-scaling. The ASIC accelerator system can improve the classification accuracy (e.g., by 3.5% over previous HDC techniques and 6.5% over ML techniques). The ASIC accelerator system can improve energy consumption (e.g., by 4.1× and 15.7× compared to previous HDC accelerators).
The similarity of hypervectors indicates their proximity, which can be used to cluster data in the hyperspace. Initially, k encoded hypervectors are selected as cluster centroids. At each iteration, all encoded inputs are compared with the centroids and added to the closest (highest-score) centroid hypervector. In classification, the model is updated right away. However, in clustering, the model is fixed and used for finding the similarities, and a new model is created from scratch, which replaces the current model in the next iteration.
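A minimal software sketch of this clustering loop, under the same assumptions as the previous example (Python/NumPy, illustrative names), is shown below; note that the live centroids are only read for scoring, while a fresh copy is accumulated and swapped in at the end of each epoch.

```python
import numpy as np

def hdc_cluster(encoded, k, epochs=10):
    """Illustrative HDC clustering: compare each encoding with the fixed centroids,
    accumulate it into a copy of the closest centroid, and swap the copy in per epoch."""
    centroids = encoded[:k].astype(float)              # first k encodings as initial centroids
    for _ in range(epochs):
        new_centroids = np.zeros_like(centroids)
        norms = np.linalg.norm(centroids, axis=1) + 1e-12
        for h in encoded:
            scores = centroids @ h / norms             # closest (highest-score) centroid
            best = int(np.argmax(scores))
            new_centroids[best] += h                   # update the copy, not the live model
        centroids = new_centroids                      # copy replaces the model next epoch
    return centroids
```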
Encoding can be an important step of HDC. Some encoding techniques map the inputs to high-dimensional space. Most encodings associate hypervectors with the raw input features (elements), called level hypervectors (see
Encoding of an input can be accomplished by aggregating the level hypervectors of its elements. To handle the positional order of elements, which can be important in most datasets such as image or voice, HDC can use variants of binding. The permutation encoding of
Conventional encoding techniques can achieve low accuracy for certain datasets, such as language identification, which generally require extracting local subsequences of consecutive features without considering the global order of these subsequences. Some previous studies use ngram encoding for such datasets. Ngram encoding extracts all subsequences of length n (usually n∈{3-5}) in a given input, encodes all these subsequences, and aggregates them to produce the encoded hypervector. However, ngram encoding may achieve very low accuracy for datasets such as images or voices in which the spatio-temporal information should be taken into account. Disclosed herein is a new encoding that can advantageously cover a more versatile set of applications.
Equation (1) outlines an example encoding process, in accordance with aspects of the inventive concept. In Equation (1), ρ(j) denotes permutation by j indexes, Π multiplies (XOR in binary) the level hypervectors of the ith window, idi applies the binding id of that window, and Σ adds up the window hypervectors over all windows of the d-element input.
In this example, n=3 as it achieved the highest accuracy (on average) for the examined benchmarks. However, the value of n can vary across embodiments. In some cases, the ASIC accelerator system can adjust the value of n for every application.
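A software sketch of the window-based encoding of Equation (1) follows; it assumes binary level hypervectors, circular shifts for both the permutation ρ and the window ids (consistent with the id generation described later), and n=3. These choices are illustrative, not a definitive implementation.

```python
import numpy as np

D, LEVELS, N = 4096, 64, 3          # dimensions, level bins, window size n
rng = np.random.default_rng(0)
level_hvs = rng.integers(0, 2, size=(LEVELS, D), dtype=np.uint8)  # binary level hypervectors
seed_id = rng.integers(0, 2, size=D, dtype=np.uint8)              # seed id hypervector

def window_encode(bins):
    """Encode quantized features per the window scheme: permute the j-th level of a
    window by j positions, XOR the window's levels together, bind with the window id,
    and accumulate the window hypervectors over all windows of the d-element input."""
    d = len(bins)
    acc = np.zeros(D, dtype=np.int32)
    for i in range(d - N + 1):
        win = np.zeros(D, dtype=np.uint8)
        for j in range(N):
            win ^= np.roll(level_hvs[bins[i + j]], j)   # rho(j): permutation by j indexes
        id_i = np.roll(seed_id, i)                      # id of the i-th window (rotated seed)
        acc += win ^ id_i                               # bind with id, then bundle
    return acc
```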
As shown in Table 1, eleven datasets were compiled from different domains, including certain benchmarks, seizure detection by skull-surface EEG signals, and user activity recognition by motion sensors. In this example, the HDC algorithms were implemented using an optimized Python implementation that leverages SIMD operations. For the ML techniques, the Python scikit-learn library was used. The results of logistic regression and k-nearest neighbors were discarded, as they achieved lower accuracy. For the DNN models of the benchmarks, the AutoKeras library for automated model exploration was used.
Table 1 summarizes the accuracy results (RP: random projection, MLP: multi-layer perceptron, SVM: support vector machine, RF: random forest). As shown, in this example, the disclosed ASIC accelerator system encoding achieves 3.5% higher accuracy than the best baseline HDC (level-id), 6.5% higher than best baseline ML (SVM), and 1.0% higher than DNN. The RP encoding fails in time-series datasets that require temporal information (e.g., EEG). In some cases, the ngram encoding does not capture the global relation of the features, so it fails in datasets such as speech (ISOLET) and image recognition (MNIST). In some cases, except for the ngram and the disclosed ASIC accelerator system, other HDC techniques fail in the LANG (text classification) as they enforce capturing sequential information and ignore subsequences.
HDC's operations can be simple and highly parallelizable. However, conventional processors may not be optimized for binary operations such as one-bit accumulation. Also, the size of hypervectors in most settings can become larger than the cache size of low-end edge processors, which may impose significant performance overhead. The HDC and ML algorithms were implemented on the datasets on a Raspberry Pi 3 embedded processor and an NVIDIA Jetson TX2 low-power edge GPU, as well as a desktop CPU (Intel Core i7-8700 at 3.2 GHz) with a larger cache. A Hioki 3334 power meter was used to measure the power of the Raspberry Pi.
The controller 510, e.g., by using spec data, handles the programmability of the ASIC accelerator system 500 and orchestrates the operations. For instance, the encoder generates m=16 (an architectural constant) partial dimensions after each iteration over the stored input, where the variable hυ signals the end of encoding to finalize the search result, d denotes the number of input memory rows to be processed to fetch features (i.e., the exit condition for the counter), nC indicates the number of class memory rows that need to be read for the dot-product, and so on. The class memory layout of the ASIC accelerator system 500 can allow a tradeoff between the hypervector length hυ and the number of supported classes nC. By default, the ASIC accelerator system class memories can store hυ=4K dimensions for up to nC=32 classes. For an application with fewer than 32 classes, a higher number of dimensions can be used (e.g., 8K dimensions for 16 classes). These application-specific input parameters give the ASIC accelerator system 500 the flexibility to implement various applications without requiring a complex instruction set or reconfigurable logic.
Features can be fetched one by one from the input memory 520 and quantized to obtain the level bin, and accordingly, m (16) bits of the proper level hypervector are read. The levels are stored as m-bit rows in the level memory 530. The stacked registers (reg n to 1) facilitate storing and on-the-fly sliding of level hypervectors of a window. Each pass over the input features generates m encoding dimensions, which can be used for dot-product with the classes. The class hypervectors are distributed into m memories (CM 1 to CM m) to enable reading m consecutive dimensions at once. The dot-product of partial encoding with each class can be summed up in the pipelined adder 516, and accumulated with the dot-product result of previous/next m dimensions in the score memory 517.
After hυ/m iterations, all dimensions are generated, and the dot-product scores are finalized. The system 500 can use a cosine similarity metric between the encoding vector H and class Ci, i.e., δ(H, Ci) = H·Ci/(∥H∥₂∥Ci∥₂). The system 500 can normalize the dot-product result with the L2 norms. The ∥H∥₂ can be removed from the denominator as it is a constant and does not affect the rank of classes. In addition, to eliminate the square root of ∥Ci∥₂, the system 500 can modify the metric to (H·Ci)²/∥Ci∥₂² without affecting the predictions. The norm2 memory 518 stores the squared L2 norms of classes, and similarly, the squared score is passed to the divider 519. The system 500 can use an approximate log-based division.
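The effect of dropping the constant ∥H∥₂ and squaring the score can be checked with the short Python comparison below; it assumes non-negative dot-product scores and is a numerical illustration only, not a description of the divider hardware.

```python
import numpy as np

def predict_cosine(h, classes):
    scores = classes @ h / (np.linalg.norm(classes, axis=1) * np.linalg.norm(h))
    return int(np.argmax(scores))

def predict_no_sqrt(h, classes, sq_norms):
    # ||H||2 is constant across classes, so it is dropped; squaring the dot product
    # removes the square root of ||Ci||2. Assumes non-negative dot-product scores,
    # so squaring does not change which class ranks highest.
    dots = classes @ h
    return int(np.argmax(dots * dots / sq_norms))

# sq_norms plays the role of the norm2 memory 518: precomputed squared L2 norms, e.g.,
# sq_norms = np.sum(classes.astype(np.int64) ** 2, axis=1)
```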
In the first round of training, e.g., model initialization, encoded inputs of the same class/label are accumulated. It can be done through the adder 514 and mux 513 of all class memories. The controller 510 uses the input label and the iteration counter to activate the proper memory row. In the next retraining epochs, the model is examined and updated in case of misprediction (see
cycles. Training may also require calculating the squared L2 norm of classes in the norm2 memory 518. As it can be seen in
The ASIC accelerator system 500 selects the first k encoded inputs as the initial cluster centroids and initializes k centroids in the class memories. The system allocates two sets of memory rows for temporary data: one for the incoming encoding generated in the encoding module and another for the copy centroids (clustering generates a new copy instead of directly updating). Similarity checking of the encoding dimensions with the centroids is pipelined similarly to inference, but the encoded dimensions are stored so they can be added to the copy centroid after the similarity checking is finalized. After finding the most similar centroid, the copy centroid is updated by adding the stored hypervector (similar to retraining). The copy centroids serve as the new centroids in the next epoch.
The ASIC accelerator system 500 can enable energy efficiency. The following elaborates energy-saving techniques that benefit the ASIC accelerator system 500. These techniques can also be applied to other HDC accelerators.
The id memory naturally needs 1K×4K bits = 512 KB (for up to 1K features per input and hυ=4K dimensions), which occupies a large area and consumes substantial power. However, the ASIC accelerator system 500 generates ids on the fly using a seed id vector, where the kth id is generated by permuting the seed id by k indexes. Therefore, the id memory shrinks to 4 Kbit, i.e., a 1024× reduction. Permutation preserves the orthogonality. It is implemented by the tmp register 512, by which, for a new window, the reg id is right-shifted and one bit of tmp is shifted in. The tmp register helps to avoid frequent access to the id memory by reading m (16) bits at once and feeding them in over the next m cycles.
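A software analogue of the on-the-fly id generation is sketched below; the rotation-based permutation and the sizes are illustrative, and the hardware instead uses the tmp register 512 and shift logic.

```python
import numpy as np

def make_ids(seed_id, num_features):
    """Generate per-feature ids by permuting (rotating) a single seed id hypervector
    instead of storing a full id memory; rotation preserves (near-)orthogonality."""
    return np.stack([np.roll(seed_id, k) for k in range(num_features)])

# Example: a 4 Kbit seed replaces a 1K x 4K id table (illustrative sizes).
rng = np.random.default_rng(0)
seed = rng.integers(0, 2, size=4096, dtype=np.uint8)
ids = make_ids(seed, 8)        # ids[k] is the seed rotated by k positions
```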
For an application with nC classes and using hυ dimensions, the ASIC accelerator system 500 stripes dimensions 1 to m (16) of its 1st class vector across the 1st row of the m class memories, the 2nd class vector across the 2nd row, and so on. The next m dimensions of the 1st class vector are therefore written into the (nC+1)th row, followed by the other classes. Thus, in some cases, the ASIC accelerator system 500 always uses the first nC×hυ/m rows of the class memories. The examined applications fill 28% of the class memories (minimum 6% for EEG/FACE, and maximum 81% for ISOLET) using hυ=4K dimensions. Accordingly, the ASIC accelerator system 500 can partition each class memory into four banks and power-gate the unused banks. With four banks, 1.6 out of four banks are activated on average, leading to 59% power saving. With more fine-grained eight banks, 2.7 banks (out of eight) become active, saving 66% power. However, eight banks impose a 55% area overhead compared to 20% for four banks. In some cases, the four-bank configuration yields the minimum area×power cost. Since the power gating is static (permanent) for an application, no wake-up latency or energy is involved.
The ASIC accelerator system 500 can trade off energy consumption and performance with accuracy. Recall that the ASIC accelerator system 500 generates m dimensions of the encoding per iteration over the features. By feeding a new hυ value as input, the ASIC accelerator system 500 can seamlessly use the new dimension count by updating the counter exit condition, so smaller encoding and class hypervectors will be used. Nevertheless, the ASIC accelerator system 500 stores the squared L2 norms of the whole class hypervectors for the similarity metric (H·Ci)²/∥Ci∥₂², while for arbitrary reduced encoding dimensions, only the corresponding elements (and their L2 norms) of the classes are needed.
The ASIC accelerator system 500 can use 16-bit class dimensions to support training. As a result, the large class memories consume ˜80% of the total power. HDC exhibits notable tolerance to the bit-flip of vectors, which can be leveraged to over-scale the memory voltage without performance loss.
Voltage over-scaling also depends on the application's sensitivity to dimension reduction and its workload. For instance, FACE has a higher tolerance to voltage scaling than dimension reduction (see
The ASIC accelerator system 500 was implemented at the RTL level in SystemVerilog, and its functionality was verified in Modelsim. Synopsys Design Compiler was used to synthesize the ASIC accelerator system 500 targeting a 500 MHz clock with a 14 nm standard cell library from GlobalFoundries. The Artisan memory compiler was used to generate the SRAM memories. The level memory 530 has a total size of 64×4K bits = 32 KB for 64 bins, the feature memory is 1024×8b, and the class memories are 8K×16b (16 KB each). The power consumption was obtained using Synopsys Power Compiler. The ASIC accelerator system 500 occupies an area of 0.30 mm² and consumes a worst-case static power of 0.25 mW when all memory banks are active. For the datasets of Section 3.2, the ASIC accelerator system 500 consumes a static and dynamic power of 0.09 mW and 1.79 mW, respectively (without voltage scaling).
Since previous HDC ASICs have not reported training energy and performance, in this example, we compared the per-input energy and execution time of the ASIC accelerator system training with RF (random forest, most efficient baseline) and SVM (most accurate conventional ML) on CPU, and DNN and HDC on eGPU.
We compare the energy consumption of the ASIC accelerator system inference with previous HDC platforms, including tiny-HD. We scale their reported numbers to 14 nm for a fair comparison. We also include RF (most efficient ML), SVM (most accurate ML), and DNN, as well as HDC on eGPU (most efficient HDC baseline).
Table 2 compares the normalized mutual information score of K-means and HDC for the FCPS benchmarks and the Iris flower dataset. On average, K-means achieves a slightly higher score (by 0.031), but for datasets with more features, the disclosed ASIC accelerator system can better benefit from using windows (windows become less effective with a smaller number of features).
Disclosed herein is an ASIC accelerator system, a highly efficient HDC accelerator that supports classification (inference and training) and clustering using a novel encoding technique that achieves 3.5% (6.5%) better accuracy compared to other HDC (ML) algorithms. The ASIC accelerator system 500 benefits from power-gating, voltage over-scaling, and dimension reduction for utmost energy saving. The results described herein show that the ASIC accelerator system 500 improves the classification energy by 15.1× over a previous trainable HDC accelerator, and 4.1× over an inference-only accelerator. The ASIC accelerator system's HDC-based clustering consumes 17,523× lower energy with 41× higher performance than a Raspberry Pi running K-means with similar accuracy, facilitating ultra-efficient continuous learning on edge.
Enhancing Deep Neural Network Efficiency through Patterned Filter Clustering and Computation Reuse
The ever-increasing efficacy of Deep Neural Networks (DNNs) in diverse application domains is coupled with the increase in the size and computations of their models. Extensive research has been done to alleviate the memory and computational burden of DNNs. Primary compression techniques include weight quantization, pruning, clustering, and filter pruning, especially with a slant toward hardware efficiency such as hardware-aware quantization and structured pruning.
In weight quantization, the network parameters take values from a set of predetermined values (e.g., −2^(k−1) to 2^(k−1)−1 in k-bit uniform quantization), while weight clustering groups the weights into abstract clusters, where all weights of a cluster share the same value. Thus, by clustering, one can store the cluster index/id of each weight (in an index table), along with a small table that maps the indexes to weight values. Previous works show that ~16 unique weights can retain the accuracy, which results in 2× memory compression by storing log₂(16) = 4-bit indexes instead of the primary 8-bit weights.
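The storage layout implied by weight clustering can be sketched as follows; the quantile-based centroid choice is a simplified, hypothetical stand-in for an actual clustering step.

```python
import numpy as np

G = 16  # unique weights (clusters) per filter/layer

def cluster_weights(weights):
    """Illustrative weight clustering: quantize weights to G representative values and
    store 4-bit indexes plus a small table of the G actual values."""
    flat = weights.ravel()
    # Quantile-based centroids as a simple stand-in for a real clustering algorithm.
    centroids = np.quantile(flat, np.linspace(0, 1, G))
    indexes = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1).astype(np.uint8)
    return indexes.reshape(weights.shape), centroids   # log2(16)=4-bit ids + value table

def reconstruct(indexes, centroids):
    """Recover the (clustered) weights from the index table and the value table."""
    return centroids[indexes]
```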
The accompanying figures illustrate an example convolution operation in CNNs.
Described herein are techniques for enhancing computation reuse and minimizing memory usage through the implementation of shared clustering patterns among filters. Filters f1 and f2 in
Described herein, the potential of patterned filters is explored, introducing a mathematical formulation to identify the patterns and a training strategy to enforce these patterns while maintaining model accuracy. Such an approach represents a novel contribution to the field, marking the introduction of patterned filters to save memory and computation in DNNs. Furthermore, as described herein, discussion includes the dataflow, architecture, and processing units of the pattern clustering system accelerator, designed to support networks utilizing both patterned and conventional weight clustering. Given that weight quantization is a form of clustering, the architecture can also be compatible with quantized networks. The efficiency of the pattern clustering system is evaluated across various datasets and networks, focusing on computation and memory reduction, and comparisons are made with previously established works.
where S is the stride size (i.e., the sliding step of the filters).
Assuming every subset of nf filters of a layer shares the same clustering pattern, the total parameter memory consists of C×k×k×log₂G bits to store the common index table (i.e., cluster indexes of weights instead of values), and nf×G×8 bits to store the actual weights of the nf filters, assuming 8-bit weights. The total number of operations includes C×k×k ADDs (accumulated into G groups/clusters), accompanied by G MULs and G ADDs for each filter to generate an output.
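The operation count above corresponds to the factorized computation sketched below (Python, with illustrative shapes and names): the C×k×k additions into G groups are performed once per shared pattern, after which each of the nf filters needs only G multiplications and G additions.

```python
import numpy as np

def factorized_window(acts, pattern, weight_tables):
    """Compute one output pixel per filter for nf filters that share a clustering pattern.
    acts:          flattened activations of the window, shape (C*k*k,)
    pattern:       cluster index (0..G-1) of each weight position, shared by the filters
    weight_tables: per-filter arrays of the G unique weight values, shape (nf, G)
    """
    G = weight_tables.shape[1]
    groups = np.zeros(G)
    for a, g in zip(acts, pattern):      # C*k*k ADDs, done once for the shared pattern
        groups[g] += a
    return weight_tables @ groups        # per filter: G MULs + G ADDs
```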
Pattern selection can include determining the number of clustering patterns, the patterns themselves, and the assignment of patterns to filters. Exploring inter-filter structural similarities is a proper starting point in determining the common patterns and the filters that share these patterns. Patterning is more complicated than other problems such as filter pruning that considers the filters exclusively (e.g., pruning based on l1 norms or ranks of filters).
We formulate the "similarity finding" as a Hungarian matching problem. For each pair of filters fi and fj, we create the table of longest common subsequences between all groups, resulting in a G×G table. For instance, in
We obtain the similarity scores between all pairs of filters and create an F×F distance matrix (distance defined as 1/score). Finally, we use the distance matrix to find P (the number of patterns) collections of filters, where the filters of a collection have smaller distances to each other than to other collections. To this end, we use the k-medoids algorithm to cluster the F filters into P collections. Unlike k-means, which calculates the Euclidean distance between data points, k-medoids works with custom cost functions, e.g., a distance matrix. In addition, unlike k-means, k-medoids returns actual data points of the collection as the center points, leading to greater interpretability of the centers. This is essential in pattern selection as the returned centers will be the filters whose clustering pattern is selected to be shared. Note that the number of filters in each of the P pattern collections can be different.
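A possible software sketch of this selection step is shown below. It assumes SciPy's linear_sum_assignment for the Hungarian matching and the KMedoids implementation from the scikit-learn-extra package, and it uses a simple group-overlap score as a stand-in for the longest-common-subsequence table described above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn_extra.cluster import KMedoids   # assumes scikit-learn-extra is available

def pattern_similarity(pat_i, pat_j, G):
    """Hungarian matching between the G cluster groups of two filters.
    The pairwise score here is the overlap of member positions, a simplified
    stand-in for the longest-common-subsequence table described above."""
    table = np.zeros((G, G))
    for a in range(G):
        for b in range(G):
            table[a, b] = np.sum((pat_i == a) & (pat_j == b))
    rows, cols = linear_sum_assignment(-table)       # maximize total matched score
    return table[rows, cols].sum()

def select_patterns(patterns, G, P):
    """patterns: (F, C*k*k) cluster indexes of F filters. Returns P medoid filters."""
    F = len(patterns)
    score = np.array([[pattern_similarity(patterns[i], patterns[j], G)
                       for j in range(F)] for i in range(F)])
    dist = 1.0 / (score + 1e-9)                      # distance defined as 1/score
    np.fill_diagonal(dist, 0.0)
    km = KMedoids(n_clusters=P, metric='precomputed', random_state=0).fit(dist)
    return km.medoid_indices_, km.labels_            # shared patterns and assignments
```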
Although imposing a limited number of patterns among all the filters works for simpler datasets such as Fashion-MNIST, in more complex datasets such as CIFAR100, there is often an accuracy degradation. This is a result of failing to extract certain pixel patterns because of the cluster-sharing constraint between the filters. Therefore, we relax the constraint of pattern sharing on certain filters in a layer, dubbed as free filters. Free filters still comply with weight clustering (hence they still benefit from factorization) but do not follow an enforced pattern.
To select the free filters, in the original pretrained model, we sort the filters based on the singular value decomposition (SVD) of their output feature maps using the training data. The SVD indicates how many rows of a feature map are linearly independent (i.e., its rank). The overall rank score of a filter is the mean rank of its generated feature maps. Filters with a rank score higher than a threshold are deemed more informative filters and are selected as pattern-free (or, indeed, single-pattern) filters.
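One way to compute such rank scores is sketched below in PyTorch; the feature_extractor callable and the thresholding helper are illustrative assumptions rather than the exact procedure.

```python
import torch

@torch.no_grad()
def rank_scores(feature_extractor, data_loader, device="cpu"):
    """Mean rank of each filter's output feature maps over the training data.
    feature_extractor is assumed to return the layer's output of shape (B, F, H, W),
    e.g., a truncated model or a forward-hook wrapper (illustrative)."""
    scores, batches = None, 0
    for x, _ in data_loader:
        fmaps = feature_extractor(x.to(device))            # (B, F, H, W)
        ranks = torch.linalg.matrix_rank(fmaps).float()    # rank of each (H, W) map
        ranks = ranks.mean(dim=0)                          # average over the batch -> (F,)
        scores = ranks if scores is None else scores + ranks
        batches += 1
    return scores / batches

def pick_free_filters(scores, threshold):
    """Filters whose mean rank exceeds the threshold are kept as free (single-pattern) filters."""
    return (scores > threshold).nonzero(as_tuple=True)[0].tolist()
```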
After identifying the patterns associated with each filter, we use projected gradient descent (PGD) to calibrate the model toward the determined patterns. PGD solves constrained optimization problems, which in this example is "the solution W of the DNN must belong to the pattern constraints Q", formally, minimizing f({Wi}, i=1..L, X) subject to W∈Q, where L is the number of layers and X is the input data. Starting from an initial W0∈Q (e.g., obtained by cluster-wise averaging of pre-trained weights), PGD proceeds as follows: Wk+1 = PQ(Wk − α∇f(Wk)), where α is the learning rate.

PQ projects the gradients such that Wk+1∈Q as well. The projection of the gradients is itself an optimization problem: PQ(Wk) = argmin over W∈Q of ∥W−Wk∥₂². (4)

Meaning that the new weights need to minimize ∥W−Wk∥₂² while also adhering to Q. Since the weights of the solution W are clustered, i.e., all weights of a cluster get the same value, the solution of Equation (4) translates to minimizing Σ(x−wi)² for each cluster, in which the wi are the post-gradient weights and x is the new weight of the cluster. Thus, x is the mean of the wi in that cluster, i.e., the cluster-wise average of the post-gradient weights.
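A compact PyTorch sketch of the projection and of one PGD step follows; the tensor shapes, learning rate, and helper names are illustrative.

```python
import torch

def project_to_pattern(w, pattern, G):
    """Projection P_Q: for each cluster of the (shared) pattern, replace all of the
    filter's post-gradient weights in that cluster with their mean, which minimizes
    ||W - Wk||^2 subject to the clustering constraint."""
    w_proj = w.clone()
    for g in range(G):
        mask = pattern == g                     # pattern has the same shape as w
        if mask.any():
            w_proj[mask] = w_proj[mask].mean()  # cluster-wise average
    return w_proj

def pgd_step(w, grad, pattern, lr=0.01, G=16):
    """One projected-gradient-descent step: gradient update, then projection onto Q."""
    return project_to_pattern(w - lr * grad, pattern, G)
```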
To reduce memory accesses, the pattern clustering system 1700 uses a pattern-stationary data flow while also trying to maximize data reuse. To this end, the PE array is logically split into row-groups, each made up of two consecutive rows (a total of Ra/2 row-groups in our architecture). All PEs in a row-group operate on the same inputs (intra row-group data sharing), but each PE possesses a different pattern. Thus, a row-group generates multiple channels of an output. The corresponding PEs in all row-groups (e.g., PE1, PE33, etc.) possess the same pattern (inter row-group data sharing), but use different inputs. Therefore, at a given time, the same channels of Ra/2 outputs are in progress. Once all the channels associated with the running patterns are produced, the pattern clustering system 1700 scans another input window to generate the next Ra/2 outputs. After scanning all input rows, the pattern clustering system 1700 starts over with the next set of patterns (if any) and repeats the same procedure to generate all the channels.
The data flow of the pattern clustering system 1700 can be elaborated using the 3×3 example convolution of
and the associated filter generates the right-most pixel of the output feature map. To do this, fetching of inputs starts from the bottom-right brick toward the top-left in a column-wise fashion (i.e., 1→6→ . . . →13), fetching all sub-bricks of a brick before commencing the next brick. This facilitates a great degree of data reuse. Once a sub-brick is fetched, it is broadcast to all PEs in a row-group. Along with the inputs, each PE receives the pattern index corresponding to the fetched activations.
To recap, we first create activation sub-groups by adding cluster-specific activations, before multiplying with the cluster's weight value. To implement this, in every cycle, a PE processes one activation and adds it to the corresponding cluster group (out of G). After fetching and accumulating all the input bricks of an input window, each PE fetches the actual weights associated with the processed pattern. For each filter that shares the current pattern, the PE fetches its G unique weights cycle by cycle and multiplies them with the accumulated values of group-1 to group-G. The aforementioned window w1 produces the output pixels associated with the 32 patterns of PE1 to PE32 of output brick 1 (i.e., at least 32 channels of the output feature map). The convolution window is then shifted left. Hence, row-group 1 will generate the same channels of output brick 2 as it did for output brick 1.
Multiple row-groups generate multiple output rows simultaneously. As row-group 1 processes input window w1, row-group 2 processes window w4 to generate the 2nd output row. All row-groups generate the same channels since they use the same patterns (hence, filters). Once the row-groups finish scanning the current input rows (i.e., the windows reach the left edge), each input window moves up by Ra/2 (the number of row-groups) rows. After scanning all the rows, the pattern clustering system 1700 starts over from the first row with a new set of patterns until all output channels are created.
The pattern clustering system 1700 can take advantage of multiple levels of data sharing. The input activations are shared among all PEs of a row-group, and cluster index data are shared between all corresponding PEs in the row-groups (e.g., PE1, PE33, PE65, etc.). In addition, except at the edge of the image, in a 3×3 convolution window, an input brick is shared between three windows of the same row. For example, in
Processing Elements:
cycles.
Register Files: An input brick may participate in several adjacent windows. The register files RF1 to RF4 receive one input activation as data, along with several cluster indexes as the address to accumulate the input with the proper group. One of the RFs is spare to avoid stalls, as explained below. The reg idx (index register) continuously fetches these index data from the Index Lane buffer. Since the windows sharing an input are adjacent (i.e., an activation only differs in its x dimension within the windows), the index data of these windows can be aligned in one memory row. Note that since corresponding PEs of the row-groups process the same pattern, the fetched index data is broadcast to all Ra/2 corresponding PEs of all row-groups using the common index bus of a column.
Accumulator: Once all inputs of a window are accumulated in an RF, the PE loads unique weights w1 to wG one by one from the Weight Lane to the reg w, reads the accumulated sums of group-1 to group-G from that RF, accumulates the multiplications in the reg out, and finally transfers the output to the Out Lane. Since each filter sharing a pattern has its own unique weights, these multiplications are repeated for all filters sharing the pattern. A benefit of the pattern clustering system 1700 is that, once the input sub-groups are computed for a pattern, producing new output channels (of the shared filters) takes just G cycles per filter. Since the first window (of the horizontally adjacent windows) is several input bricks ahead of the other two, at a given time, the results of only one window become ready in a PE. A PE contains one extra RF, so when an RF is stuck finalizing the multiplications, the fourth RF replaces it to process new input bricks and avoid a stall.
Output Lanes: PEs in a column time-multiplex the same output bus to transfer output activations to the Out Lane. The bus is granted in a round-robin fashion, but this does not cause performance overhead, as the outputs of all PEs of a column can be transferred to the Out Lane before the outputs of the next window are generated. The Out Lane temporarily stores a few adjacent horizontal outputs (from the same PE), or adjacent vertical outputs (from the corresponding PEs of different row-groups), for the pooling operation before writing to DRAM. The output data layout written into the DRAM is the same as that of the input bricks, i.e., continuous pixels of an output brick are written in the same DRAM row.
The pattern clustering system concepts disclosed herein (e.g., pattern selection, rank-based free filter selection, and training) were implemented using PyTorch. For training, an SGD optimizer was used with a momentum of 0.9 and weight decay, and a learning rate annealed from 0.1 down to 0.0008 over 100 epochs. For the parameter G (number of unique weights or clusters per pattern), we found G=16 sufficient to retain accuracy by sweeping across a spectrum of values. Similarly, we tried a range of values for P (number of patterns) and found P=16 sufficient for accuracy.
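For reference, these settings roughly correspond to the PyTorch configuration sketched below; the cosine annealing schedule and the 5e-4 weight-decay coefficient are assumptions, as the exact schedule shape and decay value are not specified above.

```python
import torch

def make_optimizer(model, epochs=100):
    """Illustrative optimizer/schedule: SGD with momentum 0.9 and weight decay, with the
    learning rate annealed from 0.1 down to ~0.0008 over 100 epochs."""
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs, eta_min=0.0008)
    return opt, sched
```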
We implemented the pattern clustering system accelerator in SystemVerilog and verified its functionality with Modelsim. We synthesized it using TSMC 40 nm standard cell library at 0.9 V using Synopsys Design Compiler for a target frequency of 500 MHz. We used Artisan memory compiler with the same technology to generate SRAM buffers and register files. Power consumption of all elements is obtained using Synopsys Power Compiler. For DRAM access energy model, we used Destiny. Our primary architecture includes Ra=8 rows (four row-groups) and Ca=16 (32 PEs per row-group).
We evaluate the effectiveness of the pattern clustering system 1700 by comparing it with a filter pruning approach dubbed HRank. We use VGG16, ResNet18, and ResNet50 networks with the CIFAR10 and CIFAR100 datasets, and a 200-class subset of ImageNet (Tiny ImageNet). The patterned filters run ADDs to accumulate the input activations for the P patterns, followed by MULs of their unique weights on the resulting groups. The free filters are special cases of patterned filters, where a free filter has one independent pattern. Thus, free filters also benefit from factorization to reduce the number of MULs, as well as weight clustering to reduce memory.
Table 3 summarizes the accuracy, operation count (ADD and MUL), and memory for the aforementioned models and datasets. The Base column indicates the baseline 8-bit model, and the HRank column is the filter pruning approach. We selected the pruning ratios of the HRank layers according to its original work.
CIFAR10: As compared to the baseline VGG16 network, while HRank provides 56.1% reduction in operation count and 62.2% reduction in parameters, the disclosed techniques offer 72.4% reduction in operation count and 77.9% reduction in parameters, with 0.3% better accuracy. For residual networks such as ResNet18, while the operation reduction in HRank is 54.4%, the disclosed techniques offer a 69.4% reduction. We observe a similar trend for ResNet50: 68% operation reduction in the pattern clustering system 1700 as compared to the 46% reduction of HRank. The pattern clustering system 1700 shrinks the parameter size significantly (80.2% vs HRank's 66.8%) for ResNet18 and (64.1% vs HRank's 45.7%) for ResNet50, along with better accuracy metrics as compared to HRank.
CIFAR100: For CIFAR100, we achieve 73.1% operation count reduction using VGG16, 61.5% using ResNet18 and 68.6% using ResNet50. The reduction in parameters is considerably better than HRank's reductions (77.4% vs 61.1%, 71% vs 48.8% and 64% vs 46.2%) for VGG16, ResNet18 and ResNet50 respectively.
Tiny ImageNet: We observe a similar trend with the Tiny ImageNet dataset. Along with an improved operation reduction (up to 72%) and parameter reduction (up to 70.7%) as compared to the baseline, the improvements disclosed herein are better than HRank's while achieving improved accuracy metrics (1-2%) over HRank.
In summary, among other things, the pattern clustering system 1700 shrinks the model memory up to 80.2% and operation count up to 73.1%, with a similar accuracy as compared to the 8-bit baseline models.
The architecture of the pattern clustering system 1700 can include four row-groups (Ra=8) and 16 columns (Ca=16). Table 4 reports the size of the pattern clustering system memories. The input buffer stores the entire brick of a row-group for reuse by the preceding row-group. The image depth goes up to 2048 channels in ResNet50; thus, the input buffer should store 2048×4 input activations for the four row-groups, packed as 2048×32b (the four inputs of a brick are packed in a row and fetched at once to a row-group). The index memory stores all 4-bit indexes, which is 512×3×3 for the largest filter. Since three indexes per pattern are read in a column (and there are two patterns in a column), the memory has a 768×(6×4b) layout. The weight memory supplies the unique weights of a column's filters. Each pattern is shared with up to 32 filters; thus, it stores up to 64 weights. Similarly, the out lane stores all outputs generated by a column (four row-groups and 64 filters). In addition, it stores the adjacent pixels for pooling, requiring a total of 512 rows and 20 bits per row for each output pixel. Finally, each RF has 16 rows for accumulation of the G=16 groups.
Table 5 shows the per-component area and delay of the pattern clustering system 1700. The 8×16 architecture of the pattern clustering system 1700 occupies an area of 1.84 mm2 (at 40 nm). The compact area is mainly due to sharing a weight index lane and an output lane within an entire column, and a small input activation memory that buffers the inputs for reuse so the pattern clustering system 1700 uses only 70 KB on-chip memory. The design consumes a peak (worst-case) power of 145.7 mW: 29.4 mW leakage, and maximum dynamic power of 116.3 mW (at 500 MHz), 34% of which is the DRAM access power. The data reuse of the pattern clustering system 1700 makes an effective DRAM access rate of ˜1 Byte/cycle, the same rate as PEs consume inputs in a shared fashion.
Comparison with Previous Work
We compare performance-per-watt of the pattern clustering system with the FuseKNA, which also reuses the overlapping ADDs among kernels in a bit-serial accelerator, and with SCNN, which is a MAC-based sparse (zero-skipping) accelerator (results compiled from [13]).
Described herein, the introduction of the concept of patterned cluster sharing among DNN filters is highlighted, demonstrating significant advancements in memory and operation efficiency through the reuse of clustering indexes and weight factorization. Techniques for the determination and assignment of patterns across filters, coupled with a strategic training approach to achieve desired patterns, are elaborated. The effectiveness of filter patterning was assessed using a variety of datasets and networks, showcasing substantial reductions in memory and operational demands, with improvements exceeding traditional filter pruning methods in terms of both efficiency and accuracy. Furthermore, the development of the pattern clustering system accelerator, embodying the principles discussed, is revealed to have achieved enhanced energy efficiency, outperforming contemporary accelerators by a notable margin.
Various examples of the methods and systems described in the present disclosure can be found in the following clauses:
Clause 1. A method for encoding within a hyperdimensional computing framework, comprising:
Clause 2. The method of clause 1, wherein the obtained data comprises at least one of textual data, image data, voice data, or sensor data.
Clause 3. The method of clause 1, wherein the binary operation executed on the set of permuted level hypervectors is an exclusive OR (XOR) operation.
Clause 4. The method of clause 1, wherein the permutation operation applied to each selected level hypervector is based on a predetermined number of positions reflective of an order of the sequence of data elements within the particular window.
Clause 5. The method of clause 1, wherein the set of level hypervectors is predefined, each representing a distinct quantized value corresponding to possible values of data elements.
Clause 6. The method of clause 1, further comprising associating each window hypervector with a unique identifier hypervector through an XOR operation to incorporate global sequence information into the encoding.
Clause 7. The method of clause 6, wherein decoding the encoded hypervector includes utilizing the unique identifier hypervector to reconstruct the sequence of data elements from the encoded hypervector based on the global sequence information encoded by the unique identifiers.
Clause 8. The method of clause 1, wherein aggregating the window hypervectors includes a weighted aggregation based on a predetermined importance criterion assigned to each window.
Clause 9. The method of clause 1, further comprising normalizing the aggregated encoded hypervector to obtain a uniform vector magnitude across different instances of encoded data.
Clause 10. The method of clause 1, wherein adjacent windows of the plurality of windows have a shared subset of data elements at their interface so as to define an overlap of one or more final data elements from a first window and one or more beginning data elements of a subsequent window.
Clause 11. The method of clause 10, wherein a size of an overlapping portion between consecutive windows is adjusted according to a predetermined criterion related to sequential dependencies inherent in the obtained data.
Clause 12. Non-transitory physical computer storage comprising computer-executable instructions stored thereon that, when executed by one or more processors of a mobile device, are configured to implement a process comprising:
Clause 13. An ASIC accelerator system for hyperdimensional computing (HDC) encoding, comprising:
Clause 14. The system of clause 13, further comprising a memory module communicatively coupled to the processor, wherein the memory module stores the set of level hypervectors.
Clause 15. The system of clause 13, further comprising computer-readable instructions stored on a non-transitory computer-readable medium, wherein the instructions, when executed by the processor, cause the processor to perform the tasks of receiving the data; segmenting the received data; selecting the corresponding level hypervector; applying permutation operations; executing the binary operation; aggregating the window hypervectors; and outputting the encoded hypervector.
Clause 16. The system of clause 13, wherein the received data comprises at least one of textual data, image data, voice data, or sensor data.
Clause 17. The system of clause 13, wherein the binary operation executed on the set of permuted level hypervectors is an exclusive OR (XOR) operation.
Clause 18. An ASIC accelerator system for hyperdimensional computing (HDC) encoding, comprising:
Clause 19. The system of clause 18, further comprising:
Clause 20. The system of clause 18, wherein at least one of the data segmentation unit, the level hypervector selection unit, the permutation unit, the binary operation unit, or the aggregation unit is implemented by at least one processor configured to execute instructions for performing respective functions of that unit.
Clause 21. The system of clause 18, wherein adjacent windows of the plurality of windows have a shared subset of data elements at their interface so as to define an overlap of one or more final data elements from a first window and one or more beginning data elements of a subsequent window.
Clause 22. A method for enhancing computational efficiency in Deep Neural Networks (DNNs) through use of shared clustering patterns, the method comprising:
Clause 23. The method of clause 22, further comprising:
Clause 24. The method of clause 23, wherein no additional computational operations are required for processing similar activation patterns across different filters within the plurality of filters due to reuse of activation groups.
Clause 25. The method of clause 23, wherein the reusing activation groups leads to a reduction in a total number of computational operations required by the DNNs and enhances an operational efficiency of the DNNs by eliminating computational redundancy incurred in processing similar activation patterns across different filters of the plurality of filters.
Clause 26. The method of clause 22, wherein establishing the plurality of shared clustering patterns includes analyzing structural characteristics of the filters to determine pattern similarities and variances, utilizing a clustering algorithm to categorize the filters based on their operational similarities.
Clause 27. The method of clause 22, further comprising generating shared cluster-index information for the plurality of filters to minimize multiplication operations by leveraging pre-computed activations common to filters associated with the same clustering pattern.
Clause 28. The method of clause 22, wherein iteratively adjusting the weights involves applying a targeted training strategy, the targeted training strategy incorporating backpropagation and gradient descent techniques to align the weights with the shared clustering patterns.
Clause 29. The method of clause 28, wherein the targeted training strategy includes employing projected gradient descent to ensure the weights of the filters conform to the shared clustering patterns while maintaining or improving an accuracy of the DNNs.
Clause 30. The method of clause 22, further comprising analyzing a performance of the DNNs before and after enforcement of the shared clustering patterns to quantify improvements in computational efficiency and memory usage.
Clause 31. The method of clause 30, further comprising optimizing the shared clustering patterns based on the analyzing to further enhance the computational efficiency and memory usage of the DNNs, wherein the optimizing includes selecting optimal clustering patterns that maximize computation reuse while minimizing memory footprint.
Clause 32. The method of clause 31, further comprising applying the optimized shared clustering patterns to the plurality of filters in a deployment phase of the DNNs, ensuring that the computational efficiency and memory usage improvements are realized in actual operating conditions.
Clause 33. The method of clause 22, further comprising generating a mapping of input activations to the shared clustering patterns, the mapping facilitating efficient computation by identifying common activations across the plurality of filters and reducing redundant computations.
Clause 34. The method of clause 22, further comprising employing a gradient descent algorithm to iteratively refine the weights of the filters in accordance with the shared clustering patterns, the refinement being guided by an objective function that quantifies a performance of the DNNs.
Clause 35. Non-transitory physical computer storage comprising computer-executable instructions stored thereon that, when executed by one or more processors of a mobile device, are configured to implement a process comprising:
Clause 36. A non-transitory computer-readable storage medium storing computer-executable instructions that, when executed by one or more processors, cause the one or more processors to:
Clause 37. The non-transitory computer-readable storage medium of clause 36, wherein the computer-executable instructions further cause the one or more processors to:
Clause 38. The non-transitory computer-readable storage medium of clause 36, wherein the reuse of activation groups eliminates the need for additional computational operations for processing similar activation patterns across different filters within the plurality, leading to a reduction in a total number of computational operations required by the DNNs.
Clause 39. A system for enhancing efficiency in deep neural networks (DNNs) through implementation of shared clustering patterns, the system comprising:
Clause 40. The system of clause 39, wherein the computer-executable instructions further cause the one or more processors to:
Clause 41. The system of clause 39, wherein the computer-executable instructions further cause the one or more processors to:
Clause 42. The system of clause 39, wherein the computer-executable instructions further cause the one or more processors to:
Conjunctive language such as the phrase “at least one of X, Y and Z,” unless specifically stated otherwise, is otherwise understood with the context as used in general to convey that an item, term, etc. may be either X, Y or Z. Thus, such conjunctive language is not generally intended to imply that certain embodiments require at least one of X, at least one of Y and at least one of Z to each be present.
Conditional language, such as, among others, "can," "could," "might," or "may," unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements, and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without user input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment.
Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise,” “comprising,” and the like are to be construed in an inclusive sense, as opposed to an exclusive or exhaustive sense; that is to say, in the sense of “including, but not limited to.” As used herein, the terms “connected,” “coupled,” or any variant thereof means any connection or coupling, either direct or indirect, between two or more elements; the coupling or connection between the elements can be physical, logical, or a combination thereof. Additionally, the words “herein,” “above,” “below,” and words of similar import, when used in this application, refer to this application as a whole and not to any particular portions of this application. Where the context permits, words in the above detailed description using the singular or plural number can also include the plural or singular number respectively. The word “or” in reference to a list of two or more items, covers all of the following interpretations of the word: any one of the items in the list, all of the items in the list, and any combination of the items in the list. Likewise, the term “and/or” in reference to a list of two or more items, covers all of the following interpretations of the word: any one of the items in the list, all of the items in the list, and any combination of the items in the list.
Depending on the embodiment, certain operations, acts, events, or functions of any of the algorithms described herein can be performed in a different sequence, can be added, merged, or left out altogether (for example, not all are necessary for the practice of the algorithms). Moreover, in certain embodiments, operations, acts, functions, or events can be performed concurrently, for example, through multi-threaded processing, interrupt processing, or multiple processors or processor cores or on other parallel architectures, rather than sequentially.
Systems and modules described herein can comprise software, firmware, hardware, or any combination(s) of software, firmware, or hardware suitable for the purposes described herein. Software and other modules can reside and execute on servers, workstations, personal computers, computerized tablets, PDAs, and other computing devices suitable for the purposes described herein. Software and other modules can be accessible via local memory, via a network, via a browser, or via other means suitable for the purposes described herein. Data structures described herein can comprise computer files, variables, programming arrays, programming structures, or any electronic information storage schemes or methods, or any combinations thereof, suitable for the purposes described herein. User interface elements described herein can comprise elements from graphical user interfaces, interactive voice response, command line interfaces, and other suitable interfaces.
Further, the processing of the various components of the illustrated systems can be distributed across multiple machines, networks, and other computing resources. In addition, two or more components of a system can be combined into fewer components. Various components of the illustrated systems can be implemented in one or more virtual machines, rather than in dedicated computer hardware systems and/or computing devices. Likewise, the data storage devices shown can represent physical and/or logical data storage, including, for example, storage area networks or other distributed storage systems. Moreover, the connections between the components shown can represent possible paths of data flow, rather than actual connections between hardware. While some examples of possible connections are shown, any of the subset of the components shown can communicate with any other subset of components in various implementations.
Embodiments are also described above with reference to flow chart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products. Each block of the flow chart illustrations and/or block diagrams, and combinations of blocks in the flow chart illustrations and/or block diagrams, can be implemented by computer program instructions. Such instructions can be provided to a processor of a general purpose computer, special purpose computer, specially-equipped computer (for example, comprising a high-performance database server, a graphics subsystem, etc.) or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor(s) of the computer or other programmable data processing apparatus, create means for implementing the acts specified in the flow chart and/or block diagram block or blocks.
These computer program instructions can also be stored in a non-transitory computer-readable memory that can direct a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the acts specified in the flow chart and/or block diagram block or blocks. The computer program instructions can also be loaded onto a computing device or other programmable data processing apparatus to cause a series of operations to be performed on the computing device or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the acts specified in the flow chart and/or block diagram block or blocks.
Any patents and applications and other references noted above, including any that can be listed in accompanying filing papers, are incorporated herein by reference. Aspects of the disclosure can be modified, if necessary, to employ the systems, functions, and concepts of the various references described above to provide yet further implementations of the disclosure.
These and other changes can be made in light of the above detailed description. While the above description describes certain examples of the disclosure, and describes the best mode contemplated, no matter how detailed the above appears in text, the disclosure can be practiced in many ways. Details of the system can vary considerably in its specific implementation, while still being encompassed by the disclosure disclosed herein. As noted above, particular terminology used when describing certain features or aspects of the disclosure should not be taken to imply that the terminology is being redefined herein to be restricted to any specific characteristics, features, or aspects of the disclosure with which that terminology is associated. In general, the terms used in the following claims should not be construed to limit the disclosure to the specific examples disclosed in the specification, unless the above detailed description section explicitly defines such terms. Accordingly, the actual scope of the disclosure encompasses not only the disclosed examples, but also all equivalent ways of practicing or implementing the disclosure under the claims.
This application claims the benefit of U.S. Provisional Application No. 63/485,128, entitled "Methods, Circuits, And Systems Including Efficient Learning Engine On Edge Using Hyperdimensional Computing And Efficient Deep Neural Network Acceleration With Filter Sharing," filed on Feb. 15, 2023, the disclosure of which is hereby incorporated by reference in its entirety. This application is being filed on Feb. 14, 2024, concurrently with the following U.S. Patent Application, which is incorporated by reference herein in its entirety:

Attorney Docket No. | Patent Application Title | Filing Date
---|---|---
170964-00078A1 | High-Dimensional Vector Space Encoding Techniques For Hyperdimensional Computing Systems | Feb. 14, 2024
This invention was made with government support under Grant No. HR0011-18-3-0004 awarded by the Department of Defense Advanced Research Projects Agency (DARPA). The government has certain rights in the invention.
Number | Date | Country
---|---|---
63/485,128 | Feb 2023 | US