Some embodiments herein relate generally to activation sparsity removal in neural network processing units.
Unless otherwise indicated herein, the materials described herein are not prior art to the claims in the present application and are not admitted to be prior art by inclusion in this section.
Cloud computing and edge computing of artificial intelligence (AI)/machine learning (ML) applications, edge devices (for example, smartphones and smart cameras), and other real-time applications that require ML are computation-intensive and often require multi-core and multi-device solutions to meet the very high processing throughput required by the system.
Therefore, size-efficient and power-efficient multi-core architectures are highly desirable to reduce solution cost and power consumption. Available solutions are currently based on graphics processing units (GPUs), central processing units (CPUs), field programmable gate arrays (FPGAs), and some dedicated Application-Specific Integrated Circuits (ASICs) and/or Application-Specific Standard Products (ASSPs). These implementation methods are typically memory-size inefficient and have larger-than-needed processing units, or, in the case of dedicated ASICs/ASSPs, lack the flexibility to adapt to evolving machine learning models.
The subject matter claimed herein is not limited to embodiments that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one example technology area where some embodiments described herein may be practiced.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential characteristics of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
In an example embodiment, a method of activation sparsity removal includes implementing a non-zero Activation jump algorithm. Alternatively, the method includes using multiple first in first out (FIFO) memories to store non-zero activations for each vector multiplication.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The features and advantages of the invention may be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. These and other features of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth hereinafter.
To further clarify the above and other advantages and features of the present invention, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:
Some embodiments herein relate generally to performance optimization of neural network processing units and may include parallel processors that operate as DNN accelerators. Target devices for the implementation can be programmable logic devices, ASICs, CPUs, GPUs, tensor processing units (TPUs), digital signal processors (DSPs) and/or ASSPs. More particularly, some example embodiments relate to scalable, adaptable, hardware-programmable, optimized-size, low-power parallel processors that target DNNs for ML and/or AI applications. The sDNA architecture and the sDNA algorithm described herein may support common ML design flows (TensorFlow, Caffe, and others) with full transparency.
Real-time ML solutions have become very common for many AI applications. The implementation of ML is based on many layers of neural networks called deep neural networks or DNNs. There are many different DNN models: EfficientNet, ResNet, MobileNet, GoogleNet, SqueezeNet, AlexNet, Vgg, and many others. A common challenge of DNN systems in many real-time applications is the very high throughput, which can reach an order of many TeraOps/second (i.e., many 10^12 operations per second).
Therefore, the efficiency of the DNN system may be critical to make these applications feasible, low-cost and low-power.
Some calculations that are done in DNN systems include multi-dimensional matrix multiplications between weight and activation multi-dimensional matrixes. A characteristic of such matrixes is that most of the weights and a significant number of the activations are zero or could be forced to zero without a significant effect on accuracy (quality of results). This high sparsity presents an opportunity to increase DNN efficiency. In an example implementation, the DNN acceleration is achieved by removal of this sparsity (multiplication by zero).
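For illustration only, the following Python sketch (not the hardware described herein) contrasts a dense dot product with one that skips zero operands, which is the arithmetic opportunity that sparsity removal targets. The function names and data are hypothetical.

```python
def dense_dot(weights, activations):
    """Multiply-accumulate over every position, including zeros."""
    acc = 0
    for w, a in zip(weights, activations):
        acc += w * a          # many of these are multiplications by zero
    return acc

def sparse_dot(weights, activations):
    """Multiply-accumulate only where both operands are non-zero."""
    acc = 0
    for w, a in zip(weights, activations):
        if w != 0 and a != 0: # skip the "multiplication by zero" operations
            acc += w * a
    return acc

w = [0, 0.5, 0, 0, -1.25, 0, 0, 0]
a = [3, 0,   0, 2,  4,    0, 1, 0]
assert dense_dot(w, a) == sparse_dot(w, a)  # same result, far fewer multiplies
```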
In contrast to Nvidia’s sparsity removal method, some embodiments herein enable full removal of all zero weights and all zero activations, regardless of structure, sparsity percentage, or distribution, with no performance degradation. This DNN acceleration may be achieved by using a silicon size-efficient and power-efficient implementation, as described herein, that enables non-structured sparsity removal.
Another advantage of the sDNA architecture as described herein is that, in some configurations, the sDNA processing units are able to independently start a new DNN calculation, and there is no requirement to wait for the other sDNA processing units to finish their calculations before starting a new DNN calculation (no synchronization requirement). This is a major efficiency and acceleration issue for other prior art competing DNN architectures as demonstrated in
The sDNA of
The control logic block may also calculate the number of multiplications contained in each DNA word. This information may be used to control a routing multiplexer (mux). The routing mux may balance the calculation load of the different multiplier-accumulators of the different NPUs. The multiplier-accumulators may calculate nodes (neurons’ activation functions) of the neural networks. After each vector multiplication calculation is completed, a non-linear operation such as ReLU or another non-linear function may optionally be applied before the non-zero results are stored in the compressed A memory, with their bit map representation stored at the A Map Memory or at Mem. (as will be described later).
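As a minimal sketch of one possible load-balancing policy the routing mux could apply (an assumption for illustration, not the claimed design), each incoming word carries its count of multiplications and is routed to the multiplier-accumulator (MAC) with the least pending work. All names are hypothetical.

```python
import heapq

def balance(multiplication_counts, num_macs):
    """Return a list mapping each word index to the MAC chosen for it."""
    heap = [(0, mac) for mac in range(num_macs)]  # (pending work, MAC index)
    heapq.heapify(heap)
    assignment = []
    for count in multiplication_counts:
        pending, mac = heapq.heappop(heap)        # least-loaded MAC so far
        assignment.append(mac)
        heapq.heappush(heap, (pending + count, mac))
    return assignment

print(balance([5, 1, 3, 4, 2, 2], num_macs=2))    # -> [0, 1, 1, 1, 0, 0]; loads 9 vs 8
```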
In the case of neural networks that use ReLU, ReLU6, or similar non-linear output functions, if the order of the accumulations in the MAC is to first accumulate all the W*A results that have positive weights and then to accumulate the W*A results that have negative weights, then it is possible to achieve additional acceleration by suspending the MAC operation in case the accumulator value reaches a negative value. In order to achieve this desired sequence of operations, the Address Generator of
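A minimal sketch of this early-termination idea follows, assuming non-negative (post-ReLU) activations: positive-weight products are accumulated first, and during the negative-weight phase the running sum can only decrease, so once it drops below zero the final ReLU output is already known to be zero and the MAC can stop. Function and variable names are illustrative assumptions.

```python
def relu_mac_with_early_exit(weights, activations):
    assert all(a >= 0 for a in activations), "assumes post-ReLU activations"
    pos = [(w, a) for w, a in zip(weights, activations) if w > 0]
    neg = [(w, a) for w, a in zip(weights, activations) if w < 0]

    acc = 0
    for w, a in pos:            # phase 1: only non-negative contributions
        acc += w * a
    for w, a in neg:            # phase 2: only non-positive contributions
        acc += w * a
        if acc < 0:             # the sum can never recover above zero
            return 0            # ReLU(final sum) is guaranteed to be 0
    return max(acc, 0)          # ReLU of the completed sum
```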
DNNs are used in ML for AI applications. The majority of the calculations that are required for DNN implementations are multi-dimensional matrix multiplications. The multiplications are done between tensors (multi-dimensional matrixes) of weights and tensors of activations of the internal DNN layers, or the sensor inputs for the first DNN layer. The multi-dimensional matrix could be a combination of a two-dimensional convolution kernel and the number of channels of the current DNN layer, and the results dimension could be the number of channels of the next DNN layer. The majority of the weights and the activations may be zero or very close to zero (and could be forced to zero). Therefore, without removing these zero multiplication operands (which is also called sparsity removal), there is a lot of inefficiency in these DNN implementations that increases the power consumption and cost of these AI/ML systems.
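For illustration, the sketch below shows how one output activation of a convolution layer reduces to a single vector multiplication over (kernel height x kernel width x input channels) operand pairs; the data layout, stride-1/no-padding simplification, and names are assumptions rather than the claimed implementation.

```python
def conv_output_element(W, A, y, x, out_ch):
    """W: [out_ch][ky][kx][in_ch] weights; A: [y][x][in_ch] activations.

    Computes one output pixel of one output channel (stride 1, no padding).
    """
    acc = 0
    for ky, kernel_row in enumerate(W[out_ch]):
        for kx, kernel_col in enumerate(kernel_row):
            for ic, w in enumerate(kernel_col):
                acc += w * A[y + ky][x + kx][ic]  # one multiply-accumulate
    return acc
```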
The sDNA of
As illustrated in
The specific weight and activation function operands may be fetched from their respective memory locations in pruned W memory and compressed A memory and then fed to the MAC of the specific NPU after being routed through the routing mux. As indicated in
The Control Logic block in
The ReLU, which is a non-linear post-processing function common to many ML models, or another non-linear post-processing function, may be attached after the MAC and may be executed after the final result of the MAC tensor-multiplication is completed. If the ReLU result is non-zero, then a 1 is stored in the A Map Memory and its actual value is stored in the Compressed A Memory. Alternatively, this information can be stored in Mem.
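A minimal sketch of this bit-map/compressed storage follows: a 1 bit is recorded per non-zero ReLU result and only the non-zero values themselves are kept. The popcount-based read-back is one possible lookup scheme shown purely for illustration; all names are hypothetical.

```python
def relu(x):
    return x if x > 0 else 0

def store_layer_outputs(mac_results):
    a_map = []          # 1 bit per result: 1 = non-zero, 0 = zero
    compressed_a = []   # only the non-zero values, in order
    for r in mac_results:
        value = relu(r)
        if value != 0:
            a_map.append(1)
            compressed_a.append(value)
        else:
            a_map.append(0)
    return a_map, compressed_a

def read_activation(a_map, compressed_a, index):
    """Recover the activation at `index` from the compressed representation."""
    if a_map[index] == 0:
        return 0
    return compressed_a[sum(a_map[:index])]  # popcount of the preceding 1 bits

a_map, comp = store_layer_outputs([2.0, -1.0, 0.0, 3.5])
assert [read_activation(a_map, comp, i) for i in range(4)] == [2.0, 0, 0, 3.5]
```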
Mem. could be an input/output (I/O) interface, internal memory, or external memory (DDR, for example). In the Mem., an image of the activation function data or of the weights could be stored for later use by the sDNA algorithm.
Some embodiments herein implement a DNN with sparsity removal, as generally described above. Some embodiments herein implement a DNN with multiplier acceleration (MA), as generally described below. Alternatively or additionally, embodiments herein may implement a DNN with both sparsity removal and multiplier acceleration.
In some embodiments that implement multiplier acceleration, it may be possible to take one of the operands of the multiplier, for example, pruned W, which is the output of the Pruned W Memory, as described in the
Some of the foregoing embodiments relate to a hardware implementation of algorithms for, e.g., DNN with sparsity removal and/or a multiplier accelerator. Embodiments described herein may also be relevant and/or may be extended to software implementations of the algorithms. Alternatively or additionally, some embodiments herein implement a parallel expansion of one or more of the foregoing serial mode NPUs, as described with respect to, e.g.,
Referring to
Based on the AMM parallel scheme implemented, each group of multiple AMM outputs may be used as input to a relevant Activation Sparsity Removal (ASR) block. Each Activation Sparsity Removal block may implement a non-zero Activation jump algorithm similar or identical to the RNA (one bitstream of the DNA) algorithm/architecture as described herein, and/or may use multiple first in first out (FIFO) memories to store only the non-zero activations read from the AMM (additional FIFO read control logic is used to balance the used capacity of the different FIFOs with the MAC operations), and/or may use an adder tree for design simplification, or may bypass the ASR to support weights-only sparsity removal (i.e., weight sparsity removal without ASR). Example embodiments of the Activation jump algorithm (alternatively referred to and/or described as activation sparsity removal, removal of zero weights, or the like) are described in U.S. patent application Ser. No. 17/457,623, filed Dec. 3, 2021 (published as U.S. Patent Application Publication No. 2022/0188611), which is incorporated herein by reference in its entirety.
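A minimal sketch of the FIFO-based alternative described above follows: each FIFO is loaded only with the non-zero activations (and their indices) read from the AMM, and a simple read-control policy drains the fullest FIFO first to keep the MACs busy. The structure, the specific balancing policy, and the names are illustrative assumptions, not the claimed read control logic.

```python
from collections import deque

def fill_fifos(amm_rows):
    """One FIFO per AMM row; only non-zero activations are pushed."""
    fifos = []
    for row in amm_rows:
        fifo = deque((idx, a) for idx, a in enumerate(row) if a != 0)
        fifos.append(fifo)
    return fifos

def next_operand(fifos):
    """Read control: pop from the FIFO with the highest occupancy."""
    fullest = max(fifos, key=len)
    return fullest.popleft() if fullest else None

fifos = fill_fifos([[0, 7, 0, 0, 3], [5, 0, 0, 0, 0]])
print(next_operand(fifos))  # (1, 7): index/value pair of a non-zero activation
```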
The outputs of the ASR blocks may optionally feed the Redundancy Removal (RR) blocks to achieve additional MAC acceleration as described in detail in and with respect to
Outputs of the Redundancy Removal block, the outputs of the ASR blocks in the event the RR is not implemented or is bypassed, or the outputs of the AMM blocks in the event the ASR and the RR are not implemented or are bypassed may be used as inputs to the MAC blocks that implement machine learning tensor multiplications.
If there is more than one activation that needs to be multiplied by the same weight in the same vector-multiplication calculation, then it is possible to arrange the activations in pairs and to use the 2 pointers of each Address-Generator of
In order to increase the probability of finding activation pairs that are multiplied by the same weight, it is possible to use the symmetry property. It is possible to gather all the activations that need to be multiplied by the weight W or by the weight -W and pair them together to achieve additional acceleration, as was described earlier (use of the 2 pointers of each Address-Generator of
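The arithmetic behind this pairing can be sketched as follows: two activations sharing the weight W (or sharing W and -W) cost a single multiplication, W*(A1 + A2) or W*(A1 - A2). This is an illustration of the arithmetic only; in hardware the pairing would be driven by the Address-Generator pointers, and the function name is hypothetical.

```python
def paired_contribution(w, a1, a2, a2_uses_negative_weight=False):
    """Contribution of two activations sharing weight w (or w and -w)."""
    if a2_uses_negative_weight:
        return w * (a1 - a2)   # replaces w*a1 + (-w)*a2 with one multiply
    return w * (a1 + a2)       # replaces w*a1 + w*a2 with one multiply

w, a1, a2 = 0.75, 4.0, 6.0
assert paired_contribution(w, a1, a2) == w * a1 + w * a2
assert paired_contribution(w, a1, a2, True) == w * a1 + (-w) * a2
```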
Outputs of the MAC blocks may be followed by machine learning non-linear blocks such as ReLUs or other non-linear blocks.
Each non-linear block output is stored in the next layer’s AMM or used as feedback to the current layer’s AMM (in the case of a Sequential Execution NPUs architecture), to support different DNN architecture implementations.
Each Activation data read from the AMM may be used multiple times for different vector-multiplication operations.
Referring to
In order to implement larger convolution kernels and/or to support strides larger than one, it is possible to fold together neighboring bins (memory sections in the AMM that store the activation input channels of a pixel or a group of pixels) of activations so that the parallel operations of neighboring MAC units will be fully synchronized and there will be no loss of clock cycles.
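A minimal sketch of one possible realization of this folding (an assumption for illustration only) concatenates the per-pixel channel bins of a kernel window into a single folded bin, so that one MAC can stream through all operands of a larger kernel or a stride greater than one without waiting on neighboring units. Names and layout are hypothetical.

```python
def fold_bins(bins, y, x, kernel=3, stride=1):
    """bins[row][col] is the list of activation channels stored for that pixel.

    Returns one contiguous operand stream for the output position (y, x).
    """
    folded = []
    for ky in range(kernel):
        for kx in range(kernel):
            folded.extend(bins[y * stride + ky][x * stride + kx])
    return folded
```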
Implementation of embodiments herein on convolutional neural networks may benefit from support of different-size convolution operations such as 1*1 convolution, 3*3 convolution, 5*5 convolution, 7*7 convolution, and the like. Referring to
Some portions of the detailed description refer to different modules, components, etc. configured to perform operations. One or more of the modules may include code and routines configured to enable a computing system to perform one or more of the operations described therewith. Additionally or alternatively, one or more of the modules may be implemented using hardware including any number of processors, microprocessors (e.g., to perform or control performance of one or more operations), DSPs, FPGAs, ASICs or any suitable combination of two or more thereof. Alternatively or additionally, one or more of the modules may be implemented using a combination of hardware and software. In the present disclosure, operations described as being performed by a particular module may include operations that the particular module may direct a corresponding system (e.g., a corresponding computing system) to perform. Further, the delineating between the different modules is to facilitate explanation of concepts described in the present disclosure. Further, one or more of the modules may be configured to perform more, fewer, and/or different operations than those described such that the modules may be combined or delineated differently than as described.
In general, all embodiments described herein can be freely combined, as applicable and if compatible. Further, the invention is not limited to the described embodiments, but can be varied within the scope of the enclosed claims.
Unless specific arrangements described herein are mutually exclusive with one another, the various implementations described herein can be combined in whole or in part to enhance system functionality or to produce complementary functions. Likewise, aspects of the implementations may be implemented in standalone arrangements. Thus, the above description has been given by way of example only and modification in detail may be made within the scope of the present invention.
With respect to the use of substantially any plural or singular terms herein, those having skill in the art can translate from the plural to the singular or from the singular to the plural as is appropriate to the context or application. The various singular/plural permutations may be expressly set forth herein for sake of clarity. A reference to an element in the singular is not intended to mean “one and only one” unless specifically stated, but rather “one or more.” Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the above description.
In general, terms used herein, and especially in the appended claims (e.g., bodies of the appended claims) are generally intended as “open” terms (e.g., the term “including” should be interpreted as “including but not limited to,” the term “having” should be interpreted as “having at least,” the term “includes” should be interpreted as “includes but is not limited to,” etc.). Furthermore, in those instances where a convention analogous to “at least one of A, B, and C, etc.” is used, in general, such a construction is intended in the sense one having skill in the art would understand the convention (e.g., “a system having at least one of A, B, and C” would include but not be limited to systems that include A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B, and C together, etc.). Also, a phrase presenting two or more alternative terms, whether in the description, claims, or drawings, should be understood to include one of the terms, either of the terms, or both terms. For example, the phrase “A or B” will be understood to include the possibilities of “A” or “B” or “A and B.”
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
This application is a continuation-in-part of U.S. patent application Ser. No. 17/457,623, filed on Dec. 3, 2021, which claims the benefit of and priority to U.S. Provisional Patent Application No. 63/123,784, filed on Dec. 10, 2020. The 17/457,623 application and the 63/123,784 application are each incorporated herein by reference in its entirety.
Number | Date | Country
--- | --- | ---
63/123,784 | Dec. 2020 | US

Relation | Number | Date | Country
--- | --- | --- | ---
Parent | 17/457,623 | Dec. 2021 | US
Child | 18/328,631 | | US