The examples and non-limiting example embodiments relate generally to communications and, more particularly, to a dynamic energy saving controller for machine learning hardware accelerators.
It is known to implement communication networks with various hardware components.
In accordance with an aspect, an apparatus includes means for determining a zero status bit of a first element that indicates whether the first element is zero; means for determining a zero status bit of a second element that indicates whether the second element is zero; means for determining at least one combined status bit corresponding to an index of the first element and an index of the second element, wherein the at least one combined status bit indicates whether a product of a next first element and a next second element is non-zero; means for determining at least one pointer that points to a memory address of a next set of elements comprising the next first element and the next second element, based on the at least one combined status bit; means for retrieving the next set of elements from a location in the at least one memory given by the memory address; and means for performing a computation to determine the product of the next first element and the next second element.
In accordance with an aspect, an apparatus includes means for receiving as input a sequential bitstream; means for determining, using a cache, a distribution of zero-inputs from the sequential bitstream; means for determining a zero-input from the distribution of zero-inputs; and means for determining whether to skip reading the zero-input, or to generate an on-chip zero as a zero-gating zero to replace the zero-input.
In accordance with an aspect, an apparatus includes means for determining to skip multiply and accumulate computations and memory accesses for combinations of operands that produce zero when multiplied; means for maintaining an accumulation value comprising an accumulation of products; means for maintaining a positive queue corresponding to products with positive results, and a positive pointer that points to the products with positive results; means for maintaining a negative queue corresponding to products with negative results, and a negative pointer that points to the products with negative results; means for determining to stop multiply and accumulate computations corresponding to an output once the accumulation value is greater than or equal to a first threshold, remaining products to compute being positive, and the negative pointer pointing to null; means for determining to process products from the negative queue to decrease the accumulation value, in response to the accumulation value being greater than or equal to the first threshold and the negative pointer not pointing to null; means for determining to stop multiply and accumulate computations corresponding to the output once the accumulation value is less than or equal to a second threshold, remaining products to compute being negative, and the positive pointer pointing to null; and means for determining to process products from the positive queue to increase the accumulation value, in response to the accumulation value being less than or equal to the second threshold and the positive pointer not pointing to null.
The foregoing aspects and other features are explained in the following description, taken in connection with the accompanying drawings.
There are many machine learning hardware accelerators to execute neural network models. Depending on the workload, these hardware accelerators are power hungry. Moreover, most NN models are latency critical, i.e., they come with strict time limits within which the calculations need to be completed.
As a visual example,
A NN model is a set of neurons connected to each other in a systematic format. A basic neuron can be mathematically modelled as:

y = f(w_1·x_1 + w_2·x_2 + . . . + w_N·x_N)
where x_i and y are the inputs and the output of a neuron, respectively, while w_i are the weights of the NN model. The activation function is denoted by f( ), and can be ReLU, sigmoid, etc. Different activation functions are interleaved to perform feature extraction (in other examples the different activation functions are interleaved with feature extractors) to add non-linearity in neural networks, allowing them to fit an arbitrary N-dimensional decision boundary for the target application (classification, regression, etc.). One of the most frequently used activation functions in real-world inference deployments is ReLU (Rectified Linear Unit), which discards negative values by clipping them to 0, while preserving positive ones as is.
The ReLU activation function is given by the mathematical formula:

ReLU(x) = max(0, x)
This activation function is very simple to realize in hardware (HW); thus, much effort is put into using it as much as possible when the circuit implementation of a NN model is envisioned.
Each layer of a neural network learns a particular set of features from the input, prioritizing some activations while suppressing others. The deeper the layer, the more specific the learned features are. Suppressed activations do not propagate information further through the network, thus creating sparsity (zero data) for the input of the next layer.
The combination of feature learning and activation functions is the source of compounding sparsity throughout the network.
In general, for each input set to a neuron, there is the same number (labeled N) of corresponding weights, which can be unique or not. At the HW level, in this example, the neuron would require in a classical approach N multiplications and N additions. These are usually coupled in sequence as one multiplication followed by one addition, commonly referred to as one Multiply and ACcumulate (MAC) unit. The MAC sequence, of length N in this example, is followed by a single activation function, like ReLU, in order to get the value of the output y.
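As a minimal sketch of the neuron model just described (N MAC operations followed by a ReLU activation), the following Python snippet may help; the function names and the integer operands are illustrative assumptions, not the actual circuit implementation.

```python
# Minimal sketch of the neuron model described above: N multiplications and
# N additions (a MAC sequence) followed by a ReLU activation.

def relu(v: int) -> int:
    return v if v > 0 else 0

def neuron(activations: list[int], weights: list[int]) -> int:
    assert len(activations) == len(weights)
    acc = 0
    for x, w in zip(activations, weights):  # N MAC operations
        acc += x * w
    return relu(acc)

print(neuron([3, 0, -2, 5], [1, 4, 2, -1]))  # -> 0 (negative sum clipped by ReLU)
```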
On the numeric side, HW has each numeric value represented in a finite number of bits. In particular, for inference the numbers are usually represented as integers or fixed-point numbers. In an example, an 8-bit integer is assumed as the numeric format for convenience of explanation. This means the representable numbers lie in the range −128 to +127. The lower bound is usually referred to as MinInt and the upper bound as MaxInt. Decreasing the number of bits decreases the area and power requirements of a circuit implementation, but a large number of MAC operations plus memory loads and stores are still needed.
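A tiny illustrative helper for the assumed 8-bit integer format follows; MinInt and MaxInt bound the representable range, and an accumulation leaving that range saturates. This is a sketch of the numeric assumption only, not of any specific accelerator arithmetic.

```python
# Illustrative int8 range and saturation (MinInt/MaxInt) used in the examples.

MIN_INT8, MAX_INT8 = -128, 127

def saturate(value: int) -> int:
    return max(MIN_INT8, min(MAX_INT8, value))

print(saturate(200))   # -> 127 (MaxInt)
print(saturate(-300))  # -> -128 (MinInt)
```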
Sparsity can be classified into two main types: structured sparsity (block sparsity) and non-structured sparsity. Structured sparsity is typically achieved through pruning methods that involve incorporating regularization terms during the training phase of CNN models. Such pruning methods allow for control over the level of sparsity, ranging from regular filter-level sparsity to fine-grained and irregular sparsity.
In addition to sparsity controlled by software, non-structured sparsity is an inherent property of CNN architectures themselves, primarily due to the prevalent use of the ReLU activation function. The ReLU function introduces input-dependent sparsity (also referred to in this text as activation sparsity) across the layers of the CNN, making it impossible to predict the locations of zeros. Consequently, it becomes crucial to tackle this non-structured sparsity during the runtime of the CNN model.
Effectively handling sparsity brings multiple advantages. By reducing the processing of unnecessary zero values, power consumption at both the processing unit and the DRAM memory can be significantly reduced. Furthermore, there is potential for reducing the latency of CNN accelerators, which is particularly desirable for applications in datacenters and user-facing scenarios where latency is of utmost importance.
Sparsity exploitation is based on the fundamental observation that zeros in operands do not affect the final multiply-accumulate (MAC) results. Consequently, sparse DNN accelerators adopt two main approaches: zero-gating and zero-skipping.
Zero-gating processing elements (510) selectively deactivate arithmetic units within the PE (processing element) when an operand is zero, while data is still being read from the DRAM. This eliminates the switching activity within the combinational logic by turning off the toggling in the specific PE, in this case a MAC. The latency of the computations is, however, not affected by zero-gating (8 time moments to process 8 sets of inputs). While zero-gating improves energy efficiency by avoiding unnecessary hardware toggling, it does not provide improvements in terms of latency. Additionally, the data reading limits the potential for substantial energy savings due to the underlying power bottleneck in the input-output operations.
Zero-skipping processing elements (520) actively skip zero operands to reduce latency, only feeding the PE synchronized non-zero elements. When the input operand (A) is zero, only the next non-zero element is fed into the MAC logic for calculation, utilizing only the relevant operands necessary for the final outcome. Weight memory access is based on the index of the non-zero element (e.g., A0, A2, A6 and A7 in this example), requiring additional control logic for efficient non-zero searching and routing. Zero-skipping, although occupying more area compared to zero-gating, has become the preferred method for achieving latency reduction (from 8 to 4 in the example of
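The contrast between the two schemes can be sketched as follows for a stream of 8 activations; the cycle and multiplication counts mirror the example above, while the variable names and the unit weights are illustrative assumptions.

```python
# Sketch contrasting zero-gating and zero-skipping on one stream of 8 activations.
# Zero-gating still spends a cycle (and a read) per element; zero-skipping only
# clocks through non-zero activations.

def zero_gating(acts, weights):
    acc, cycles, mults = 0, 0, 0
    for a, w in zip(acts, weights):
        cycles += 1                 # element is still fetched and clocked through
        if a != 0:                  # multiplier is gated off for zero operands
            acc += a * w
            mults += 1
    return acc, cycles, mults

def zero_skipping(acts, weights):
    acc, cycles, mults = 0, 0, 0
    for i, a in enumerate(acts):
        if a == 0:                  # zero operands are never fed to the PE
            continue
        cycles += 1                 # weight fetched via the index of the non-zero activation
        acc += a * weights[i]
        mults += 1
    return acc, cycles, mults

acts = [4, 0, 7, 0, 0, 0, 2, 1]     # A0, A2, A6, A7 are the non-zero elements
weights = [1] * 8
print(zero_gating(acts, weights))   # (14, 8, 4): latency unchanged, multiplications saved
print(zero_skipping(acts, weights)) # (14, 4, 4): latency reduced from 8 to 4
```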
There are several main techniques to avoid arithmetic operations on zero elements. The main approach is to load the sparse data involved in the current convolution step into local memory and to compress it into a dense array by removing zero elements before providing the data to the ALU, or by masking elements of vector registers (SIMD vector processors) to overwrite corresponding results with zeros. Different variations of the concept are used by IP developers. For example, a SIMD array and vector processor and a sparse matrix accelerator may use dedicated instructions which preload data elements into an intermediate staging area and then mask non-zero elements before supplying them to the ALU.
The majority of other existing hardware like TPUs (systolic arrays), GPUs (single instruction multi-threading, SIMT), and NPUs (neural processing units) either perform NOOP (no operations) on operands after checking at runtime if either of the operands is zero, or perform the unnecessary computations nonetheless and gain performance through massive parallelization. Other important ML architectures for hardware implementation, including hardware accelerators for deep neural networks, do not usually have power-saving or neural network model execution schemes such as the ones described herein.
Taking the example of a trained ResNet50, we observe that from the total number of operations, only 52% of them are necessary and the rest are operations that do not impact the output of the NN. To be more precise, 41% of the total operations involve activation sparsity and 7% on top of this involve feature sparsity. For the latter category, both inputs and weights are still non-zero, but the output of the neurons becomes zero after the follow-up activation layer.
In convolutional networks, a huge number of operations with zeros (sparsity in either operand, be it activations or weights) are performed, or operations with non-zeros are performed that, in retrospect, do not influence the value of a feature map later in the network (sparsity in feature maps, also referred to further in this disclosure as feature sparsity). If an element of a feature map after activation is zero, it did not have to be (accurately) computed in the preceding feature extraction layer. These characteristics are the exploitation points of the examples described herein. Even more so than area and power in the compute units, memory is the bottleneck of current computing systems, requiring optimizations of data orchestration techniques to obtain maximum hardware performance.
It is not possible in the current state of the art to tell, without foresight, whether an element in the feature map will be useless after a follow-up activation layer, and the mechanism to exploit this eventual uselessness is not as clear.
For these unnecessary operations (the remaining 48% in the ResNet50 example above), operands are fetched through the memory hierarchy, operations on them are performed, and results are written back to memory. This is a huge cost and an inefficient use of resources that results in the loss of time and energy.
Described herein are control mechanisms that determine which operands (activations, weights) effectively get into the processing elements (PEs), and thus which multiplications and additions in a neural network (NN) model should be performed within each neuron and which can be skipped. Described herein are two mechanisms for leveraging non-structural sparsity, one for processors, and the second for more parallel hardwired architectures. The main use case is AI/ML, but the mechanism can be applied in other contexts for efficiently performing MAC operations when a part of the operands are zero. The examples described herein may also be used by implementations of an early termination of the computation pipeline, when the (final) result is not affected by continuing the specific computation flow. All this is done with a limited increase in HW complexity.
For processor architectures, described herein is a smart controller or address generation unit (AGU) that dynamically yields to the processor the memory addresses of the next “necessary” elements, based on the state of the computation. All this is done without the need for the processor to load each element and afterwards check whether it is zero or not. The mechanism avoids reading zeros from memory, computing with them, and writing them back to memory. An extension of the mechanism limits the number of MACs that will result in zero feature maps (outputs of the neuron) later in the network.
The examples described herein focus on avoiding computations and especially memory accesses by intelligently exploiting sparsity in the inputs of each layer and predicted sparsity at the output of the neuron, after applying the activation. The first is facilitated by providing the CPU with the addresses of the next non-zero elements (positive or negative), and the second is facilitated by keeping track of the rolling accumulator value and the contents of each of the two HW pointers.
For hardwired implementations (such as a convolution accelerator implementation), the controller and methodology described herein utilize a combination of the zero-gating and zero-skipping schemes, and incorporate a control flow mechanism that effectively handles non-structural sparsity. This combination leads to a simpler controller than when using only zero-skipping, and to lower energy consumption than when using solely zero-gating.
Basically, to efficiently leverage sparsity, both at the input and the output of a neuron, the mechanisms described here are based on how and where the status maps, such as the zero map, sign map, and magnitude map, are introduced, how and when those are built, and how they can be used to produce highly efficient HW.
Described herein is how the activation and feature sparsity mechanism works and how the pointer lists (see points 1.1 and 1.2) define the movement of data, together with the additional control mechanisms.
At a high level, the herein described mechanism uses hardware pointers to indicate to the processor the memory location of the next useful element to be read and used in the computation. The pointers are generated based on status bits (values specifying properties of the word contained at the corresponding memory location, i.e. sign, zero, etc.) of a group of elements in focus, which can be stored in memory, that are involved in the current computation (e.g. a 3-by-3-by-C subset of a bigger image to be convolved with a filter of the same size). Hardware pointers are constructed by prefetching these status bits and applying specific Boolean logic on them to yield the index of the next useful element in the group to be loaded from memory.
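The following sketch illustrates how a group of prefetched zero-status bits might be mapped to the index, and then the address, of the next useful element, without loading the data words themselves. The status-bit convention (1 = non-zero) and the address arithmetic are illustrative assumptions rather than the actual controller logic.

```python
# Sketch: map prefetched zero-status bits of a focus group to the index of the
# next useful element, then to a memory address for the hardware pointer.

def next_useful_index(status_bits: int, current: int) -> int | None:
    """Return the index of the next set status bit after `current`, or None."""
    idx = current + 1
    remaining = status_bits >> idx
    while remaining:
        if remaining & 1:
            return idx
        remaining >>= 1
        idx += 1
    return None  # corresponds to a NULL hardware pointer

def to_address(base: int, index: int, word_bytes: int = 1) -> int:
    return base + index * word_bytes

# 3x3 focus group flattened to 9 bits; bits set for non-zero elements 0, 4, 8.
status = 0b100010001
ptr = next_useful_index(status, current=0)
print(ptr, hex(to_address(0x1000, ptr)))  # 4 0x1004
```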
The method can be used to exploit both structural and non-structural sparsity types. To avoid sparsity in activations/weights (multiplying with zeros), the aforementioned prefetching of zero-status bits is enough. To avoid sparsity in feature maps (outputs of the neurons), the value of the accumulation at any point in time is tracked, and the convolution for the corresponding feature map cell is terminated once the accumulation becomes negative while no remaining products of activations and weights will result in positive numbers. This is not a foresight that completely avoids computations for feature maps that will become zero later in the network, but it terminates, in a timely manner, any further MAC operations on a cell once the controller is sure the result of the convolution is negative and will become zero later in the network. The same can be said for accumulation values that are above MaxInt when only positive products are left in this specific computation pipeline.
Constructing hardware pointers instead of loading and checking the content of each cell significantly reduces power use associated with DRAM/SRAM accesses and unnecessary computations. The overall approach at runtime controls and reduces the number of operations and memory accesses, thereby increasing speed, hardware utilization, and reducing power.
The examples described herein differ from existing solutions by their use of dynamically updated hardware pointers (registers storing memory addresses) to decrease the number of unnecessary data loads from memory. The pointers point to the next set of valid operands (pairs of activation and filter elements (also referred to as weights in parts of this text)) involved in the current step of computation (a step of convolution being the sum of elementwise products of activation and filter elements producing a single feature map element). Instead of loading all elements from memory and then performing computations based on the data, the examples described herein suggest the address of the next valid data element to load, thereby decreasing unnecessary data movement.
The herein described smart memory controller and the CPU share a finite state machine (FSM) that keeps track of the state of the convolutions. The memory controller prefetches status bits from memory for the group of activation and weight elements in focus, indicated by the FSM, and stores them into a fast local scratchpad or a cache. Based on the convolution stage, the controller maps the loaded status bits into an index of the next valid operand within the group and then decodes it into a physical memory address, storing it into the hardware pointer register for the CPU to consume. The FSM of the convolution indicates to the CPU to move on to computing the next element once the hardware pointer register content becomes zero (points to NULL). The FSM's next state is controlled by signals from the memory controller that indicate when no more valid operands for the current operation are to be expected. The FSM's next state is also controlled by the CPU, depending on the status maps and intermediate results.
The pointers are generated by the controller from the loaded status bits using Boolean logic, hardwired or computed. Since a MAC operation involves two operands (activation and weight), the status bits of both must be combined to validate the MAC operation as “useful”. The status bits are bitwise compared (i.e., bitwise AND of zero bits, bitwise XOR of sign bits, etc.), resulting in a single status word with set bits corresponding to indices of the operands in the focused group that will produce a non-zero product or a non-zero feature map later in the network, respectively. The status word is consumed by the herein described AGU to map the status word into the memory address of the next “valid” operand element pair. Similarly, the operands could be ordered based on the magnitude map entries to feed the AGU in the correct order for the intended target. While in some cases it might be beneficial to start the computations from the larger numbers, in others starting with the lowest numbers is worthwhile; this choice is application specific.
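A possible form of this status-word combination is sketched below: zero bits of the two operand groups are bitwise AND-ed to mark non-zero products, and sign bits are bitwise XOR-ed to split them into positive and negative products. The bit conventions used (1 = non-zero in the zero map, 1 = negative in the sign map) are illustrative assumptions.

```python
# Sketch of combining per-operand status bits into positive/negative status words.

def combine_status(act_nz: int, w_nz: int, act_sign: int, w_sign: int, width: int):
    nonzero = act_nz & w_nz                  # product is non-zero only if both operands are
    negative = act_sign ^ w_sign             # signs differ -> negative product
    mask = (1 << width) - 1
    pos_word = nonzero & ~negative & mask    # indices of upcoming positive products
    neg_word = nonzero & negative & mask     # indices of upcoming negative products
    return pos_word, neg_word

# 4-element focus group, LSB = index 0.
pos, neg = combine_status(act_nz=0b1111, w_nz=0b0111,
                          act_sign=0b0010, w_sign=0b0000, width=4)
print(f"{pos:04b} {neg:04b}")  # 0101 0010 -> indices 0, 2 positive; index 1 negative; index 3 zero product
```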
Use of status bits increases area (memory and logic), but substantially reduces the energy use associated with computations and, even more so, with useless and power-hungry memory accesses, because reads and writes of actual data (words of potentially multiple bytes, depending on the used accuracy) are not performed. Instead, groups of status bits are fetched (and could also be cached for reuse), resulting in a cost reduction per memory access of:
where n is the word width of the actual data, k the number of status bits per word, and f the fraction of sparsity (~0.5 for regular sparsity in some residual network implementations). As can be seen from the formula, there is a tipping point where the controller mechanism, even if very small, is no longer beneficial.
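As a hedged numerical illustration only, a cost model consistent with these variables can be sketched as below; the exact formula of the original is not reproduced here, and the model simply assumes that the k status bits are always fetched while the full n-bit word is fetched only for the non-sparse fraction (1 − f) of elements.

```python
# Hedged sketch of a per-access cost model using n (word width), k (status bits
# per word) and f (fraction of sparsity). The specific formula is an assumption.

def relative_cost(n_bits: int, k_bits: int, f_sparsity: float) -> float:
    baseline = n_bits                               # every word read in full
    with_status = k_bits + (1.0 - f_sparsity) * n_bits
    return with_status / baseline

print(relative_cost(n_bits=8, k_bits=2, f_sparsity=0.5))   # 0.75 -> fewer bits moved
print(relative_cost(n_bits=8, k_bits=8, f_sparsity=0.5))   # 1.5  -> past the tipping point
```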
Use of status bits also removes the need to write zero feature map elements (neuron outputs that are zero) back to SRAM after a computation, because only the corresponding zero status bit needs to be set. The content of the underlying memory location is irrelevant and can be disregarded when all computation decisions are based on the status bits of the corresponding memory cells. Reading and writing back to memory the status bits represents a fraction of reading and writing the actual value, offering a significant gain in energy (efficiency) and time (performance). An underlying memory cell content gets overwritten only when valid (non-zero) data needs to be stored after a computation, while the old activation is no longer needed.
The backward slash pattern in
Address generation 676 uses the status maps of active cell activations 655 and weights 656 to provide pointers 677, 678 to the next set of activations and weights that, when multiplied, produce a non-zero result. Zero status maps 655 and 656 can be bit-wise OR-ed to determine which of the pairs of associated activations 653 and weights would produce a non-zero product. The CPU 679 pops pointers 677, 678 and uses the activation and weight for computation. When a computation is done, the CPU checks whether the result is zero and, if so, sets the corresponding output status bit. When storing each of the outputs 660, CPU 679 can skip storing a zero output and just store its corresponding set output status bit like 651. Additionally, if the weights are known not to be zero, as in the case of CNNs, only the activation status map of the active cells 655 can be used to generate pointers 677 and 678.
Address generation unit 750 provides two sets of addresses 708 of: 1) operands producing the next positive activation-weight product (752), and 2) operands producing the next negative activation-weight product (754). The zero and sign maps are fed to the AGU, and Boolean logic and the status register 709 are in turn used to generate status registers 726 and 728.
The smart memory controller maps 750 the status register 709 into the offset of the next useful operand pair (pointed to by pointers 752, 754) in the focus group, which is then decoded based on the FSM state and used by the CPU 756 to load the next operand pair for computation. Based on the current value in the accumulation, the CPU 756 can decide whether to pop pointer 752 or 754, until both are NULL or the computation is terminated early. This indicates to the FSM to move to processing the next output element. The CPU 756 determines the appropriate setting of status bits 715 and 717 in status bit maps 736 and 738 and writes result 712 to outputs 735. Instead of writing zeros to the output memory, the corresponding bit in Z status map 736 for the corresponding cell is simply set.
Since the CPU 756 has no foresight to predict which feature map will become zero later in the network, feature map sparsity can be exploited only partially. The CPU 756 first consumes addresses generated by the positive hardware pointer 752 before starting to use the negative hardware pointer 754. Once the value of the accumulation for the current output element falls below zero and all remaining pointers are in the negative list, further computations can be prematurely terminated because they will only make the result more negative, which will be clipped to zero by the follow-up ReLU activation. Similarly, in the case of overflow in fixed-point arithmetic, the reverse can apply using similar reasoning. Another implementation can be more dynamic, when the extra logic is balanced by the benefit. In such an implementation the FSM would switch between the pointers indicating the next positive and negative pairs of activation/weight as needed, to attempt to keep the result in the available dynamic range. This approach could also be used to increase the accuracy of the result, avoiding situations like cancellation. The mechanism indicates to the CPU 756 at which point it can stop the current convolution step, knowing that no further operations will influence the value of the current output later in the network. The gain is then due to early termination and is equivalent to the number of skipped operations and the associated memory reads.
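The dynamic variant described above could be sketched as follows: the positive or negative product queue is popped depending on the running accumulation, keeping the partial sum inside the representable range and allowing early termination. The queue contents, thresholds and control policy in this snippet are illustrative assumptions, not the exact FSM behavior.

```python
# Sketch of range-aware accumulation with positive/negative product queues and
# early termination once only negative products remain for a negative result.

from collections import deque

MIN_INT, MAX_INT = -128, 127

def accumulate(pos_products: deque, neg_products: deque) -> int:
    acc = 0
    while pos_products or neg_products:
        # Early termination: result already negative and only negative products
        # remain -> the follow-up ReLU will clip the output to zero anyway.
        if acc < 0 and not pos_products:
            return 0
        # Prefer the queue that steers the accumulation back toward mid-range.
        if acc >= MAX_INT and neg_products:
            acc += neg_products.popleft()
        elif acc <= MIN_INT and pos_products:
            acc += pos_products.popleft()
        elif pos_products:
            acc += pos_products.popleft()
        else:
            acc += neg_products.popleft()
    return max(acc, 0)  # follow-up ReLU

print(accumulate(deque([30, 20]), deque([-90, -15])))  # -> 0, terminated early
```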
The FSM keeps track of the convolution state, and data is stored in a dense tensor format, where the FSM can easily retrieve elements via appropriate stride access. Each word is associated with its personal status bits describing its contents (i.e. zero and sign, or other status parameters for more sophisticated status register generation, for example based on magnitude, such that the largest absolute products are accumulated first or last, depending on the need). Status bits are updated and written through to memory after each computation alongside the result, unless the result is zero, in which case only the zero status bit is written. This writing back can be done smoothly or in batches.
The working principle of the herein described solution is as follows (1-4), with an illustrative sketch following the list:
1. Skipping MAC computations and memory accesses for combinations of operands (weights and activations in AI/ML use cases) that do not affect the specific MAC output (e.g., produce zero when multiplied, or products too small to make an impact).
2. Stopping further MAC computations corresponding to the same output once the accumulator value saturates below MinInt (or below zero in the case of ReLU) or above MaxInt and no further accumulation will shift the value back into the permitted, non-saturated range.
3. Not overwriting output memory locations with zeros, and only setting the corresponding zero status bit.
4. Computing and storing the status maps related to output.
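The sketch below ties points (1), (3) and (4) together for one output neuron: zero-producing operand pairs are skipped, a zero result is never written to output memory (only its zero status bit is set), and the output status map is produced alongside the outputs. Point (2), the early stop, is sketched separately above. The data layout and names are illustrative assumptions.

```python
# Sketch of working-principle points (1), (3) and (4) for a single output neuron.

def compute_output(acts, weights, out_mem, out_zero_map, out_index):
    acc = 0
    for a, w in zip(acts, weights):
        if a == 0 or w == 0:          # (1) skip MACs and reads for zero products
            continue
        acc += a * w
    result = max(acc, 0)              # follow-up ReLU
    if result == 0:
        out_zero_map[out_index] = 1   # (3) set zero status bit, leave memory untouched
    else:
        out_mem[out_index] = result   # overwrite memory only for valid data
        out_zero_map[out_index] = 0   # (4) status map stored alongside the outputs
    return result

out_mem, out_zero_map = [None] * 4, [0] * 4
compute_output([2, 0, -3], [4, 5, 0], out_mem, out_zero_map, out_index=0)
print(out_mem, out_zero_map)          # [8, None, None, None] [0, 0, 0, 0]
```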
The mechanism saves on the number of computations performed (increased speed and hardware utilization) and reduces useless read and write memory accesses (increased energy efficiency).
The slashes in
At 1430, the status bit is set based on the MAC output value. At 1432, one or more output values and one or more corresponding status bits are stored. At 1434, it is determined whether the computation is finished (for example, whether other neurons are to be computed). If at 1434 it is determined that the computation is not finished, for example “No”, the method transitions to 1404. If at 1434 it is determined that no more outputs are left, for example “Yes”, the method ends at 1440.
The mechanism exploiting regular sparsity, using only the zero-status bit, is generic in nature and can be applied to other signal processing or computing domains, such as data analysis, and is not restricted only to AI/ML.
Based on the area/power/latency budget, one or both sparsity types, i.e., the activation and feature sparsity mechanisms, can be exploited by using appropriate status bits. Activation sparsity has the least hardware cost and the most impact on convolutional networks, and is thus the easiest to leverage.
Conceptually, instead of the CPU checking at run-time which weight or activation is a zero, the memory controller provides the CPU an address at which the next non-zero element is located, if any. By computing and classifying separately the addresses for operands that produce positive and negative multiplication results, the mechanism can also exploit sparsity in feature maps.
Additionally, the same or a slightly modified controller or AGU described above can be used where a specific scheme of synchronization between activations and weights is required (as in the case of a convolutional accelerator hardwired architecture or a systolic array). Restricting to activation and weight sparsity in a HW accelerator with increased parallelism, it is possible to consider the synchronization of the activations with the weights, so that both operands are fed correctly to the MAC units. An example here is when either the weights or the inputs are kept local to the HW accelerator.
An idea driving the herein described method is based on the observation of “divide and conquer”, with the aim of simplifying the control mechanisms/HW. Rather than solely relying on the pure zero-skipping method, the complex control policy is divided into simpler basic blocks by strategically interleaving special zero-gating “0” values. In other words, instead of treating all “0” values as candidates for zero-skipping, certain “0” values are deliberately selected for zero-gating. These intentionally chosen zero-gating zeros that propagate through the data computation pipeline enable the HW designer to break down the dependency chain of skipping computations into these simple basic blocks. Within each basic block, the control policy becomes simplified, requiring minimal hardware resources. Additionally, the method includes provisioning outputs for the early terminated computations, either in local or off-chip storage.
The herein described method requires having the zero distribution (see zero map) prior to the actual computation of the current layer and scheduling the zeros as either zero-skipping zeros or zero-gating zeros. The zero-scheduler is implemented in conjunction with the zero-map. The input to the zero-scheduler is the sequential bitstream derived from the zero-map, allowing for runtime assignment of zeros without waiting for all computations to be completed. The zero-scheduler assists the HW controller in determining whether to skip reading the zero-input as a zero-skipping zero or generate an on-chip zero (rather than reading a full precision “zero” from the off-chip DRAM) as a zero-gating zero.
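A minimal sketch of the zero-scheduler decision is given below: for each bit of the sequential zero-map bitstream, a zero is either skipped (zero-skipping zero, never read from DRAM) or replaced by an on-chip generated zero (zero-gating zero). The scheduling rule used here is only a placeholder assumption; the real policy depends on the basic blocks of the specific accelerator.

```python
# Sketch of scheduling each zero in the zero-map bitstream as a zero-skipping
# or a zero-gating zero.

def schedule_zeros(zero_map_bits, can_skip):
    """zero_map_bits: 1 = the corresponding input is zero.
    can_skip(position) -> True if the control flow allows skipping here."""
    schedule = []
    for pos, is_zero in enumerate(zero_map_bits):
        if not is_zero:
            schedule.append(("read", pos))   # normal DRAM read of a non-zero input
        elif can_skip(pos):
            schedule.append(("skip", pos))   # zero-skipping zero: no read, no cycle
        else:
            schedule.append(("gate", pos))   # zero-gating zero generated on chip
    return schedule

# Placeholder policy: a zero immediately followed by another zero is skipped,
# otherwise it is gated.
bits = [0, 1, 0, 1, 1, 0]
print(schedule_zeros(bits, can_skip=lambda p: p + 1 < len(bits) and bits[p + 1] == 1))
```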
The problem is split in basic blocks, where a basic block is defined as the computational flows that could be scheduled independently from the other basic blocks. The form of basic block is dependent on the actual HW implementation and the SW data flow (different spatial parallelism and temporal stationary methods). The herein described concept could be best illustrated by a systolic array architecture implementing a tensor processing unit (TPU). Another example is a convolutional accelerator, which may be implemented as a one-dimensional systolic array specialized for a residual CNN and has an efficient IO (one input per clock cycle (CC)) and weights changed at a slower rate.
In a general case, activations and weights are interchangeable in the examples and mechanisms below. Sparsity in the slow-changing input to the PEs (usually the weights) can also be accounted for by zero-gating the corresponding PE or by optimally scheduling the operations. In addition, the zero status maps of weights and activations can be combined and used just as well in the explanations and mechanisms below.
The HW that leverages the activation sparsity in this way is focused on two aspects. First, the zero cache 1606 determines and/or obtains the distribution of zeros to tackle the problem of expensive I/O. Secondly, the zero controller 1602, together with the routing 1630 (from Acc2 1611 to the input 1632 of PE_0 1612) and multiplexing (at M0 1640, M1 1641, M2 1642), could be used to control the independent computational flow.
To better visualize the data flow over multiple PEs and CCs, the behavior described above is captured for an example HW implementation into a table format in
To generalize, consider n as the number of PEs in a CU (for example PE_0 1612, PE_1 1614, and PE_2 1616 in CU_1 1603). Examine the case of having two zero activations and identify the basic blocks that can decouple these zeros from each other. This enables shrinking the number of cases to a limited set based on the influence field of the specific CNN accelerator (mainly n). In the convolutional accelerator architecture depicted in
The goal of designing this control policy is to find the basic blocks of zero-skipping, like “xx 0 . . . 0 xxx 0 . . . 0 xx” and observe which cases can be treated alike.
A Single “0” activation in between non-zero activations
Referring to
In
In the example of 3 PEs, when the zeros are separated by at least 2 (generalizing n-1) non-zero consecutive activations, the two zeros can be considered independently, as two times a single “0” (see above).
In this example consider 3 PEs. In a more general case, the number of independent scenarios will increase (possibly exponentially) with the number of parallel PEs working on the same activation value.
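A small sketch of the independence check discussed above follows: with n PEs working on the same activation stream, two zero activations can be scheduled as two independent single-zero basic blocks when at least n − 1 non-zero activations separate them. The representation of the activation stream is an illustrative assumption.

```python
# Sketch: can consecutive zero activations be treated as independent single-zero
# basic blocks, given n parallel PEs?

def zeros_independent(activations, n_pes):
    zero_positions = [i for i, a in enumerate(activations) if a == 0]
    pairs = zip(zero_positions, zero_positions[1:])
    # Distance between consecutive zeros minus one = non-zero elements in between.
    return all(second - first - 1 >= n_pes - 1 for first, second in pairs)

print(zeros_independent([5, 0, 7, 3, 0, 2], n_pes=3))  # True: 2 non-zeros between the zeros
print(zeros_independent([5, 0, 7, 0, 2, 6], n_pes=3))  # False: only 1 non-zero in between
```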
In
The computational flow could be shown as in
1. Structured Approach to Tackle Fine-Grained Sparsity: The herein described method offers a structured approach to identify basic blocks and effectively handle fine-grained structured and non-structured sparsity. This approach allows for more efficient utilization of hardware resources.
2. Addressing Implementation Problems of the Zero-skipping method: The examples described herein address the implementation challenges associated with the zero-skipping method in the case of activation sparsity by reducing the complex computational flow into a limited number of simple schedulable flows. The same mechanism could be applied to the fast changing PE input or to the combined zero status maps of both weights and activations. This leads to significant power savings and improved latency, enhancing the overall efficiency of the CNN accelerator.
3. Dynamic Operation at Run-Time: The bit-map and zero-scheduler components of the herein described method can operate dynamically at run-time. This means that the distribution of zeros and their assignment can be determined during the inference process, allowing for real-time optimization.
1. Dependency on CNN Architecture and Computational Flow: The herein described method is dependent on the specific CNN architecture and its computational flow. Some architectures may not have well-defined basic blocks, limiting the applicability of the herein described approach in such cases. Still, in the worst case, zero-gating (with zeros generated on chip instead of read from memory) can be used to save power.
2. Latency Improvement Dependent on Sparsity: The extent of latency improvement achieved through the herein described method is dependent on the operand data itself, i.e., the number of activation zeros. Different input patterns may yield varying levels of improvement, and there may be cases where the latency improvement is not as significant.
These advantages and further considerations help provide a comprehensive understanding of the strengths and potential constraints associated with the herein described methods.
Next generation 6G communication technology and standards aim to be faster, bigger, simply better than 5G, and will leverage many antennas, many users, huge bandwidths, low latency, faster moving channel conditions, etc. Thus, everything should, in principle, scale up except the cost, i.e., except area, power, and price. In this new endeavor, besides the classical signal processing algorithms, the main new addition is AI/ML (artificial intelligence/machine learning). But while there are indications that AI/ML assisted receivers and transmitters, even at L1, can outperform the classical approaches, the problem is the cost of a corresponding implementation. Thus, to gain the flexibility that AI/ML is bringing to the table, one has to pay a huge cost when looking at the hardware implementation. The good part is that the HW implementation of AI/ML algorithms is starting to gain traction and results. In this context, the hardware implementation methodology described herein may be used, as well as the address generation unit/controller. The advantage of this hardware implementation is its simplicity and decoupling from the specific AI/ML algorithm; thus changes in the NN structure may, if at all, only require small adaptations. This makes controllers based on the herein described methodology very flexible and future proof.
The examples described here have the greatest advantage in that they are orthogonal to the algorithmic work and can be applied to a large range of computations, not limited to the CNN discussed above. And they can do so without imposing any constraints on the NN designer, unlike other methods, including structural sparsity methods. Hence, described herein is a general methodology and implementation of hardware that can very efficiently leverage sparsity to offer ~2× lower energy consumption and about one-half (½) the computation latency, at the low cost of the custom address generation unit and controller described here. This is not a general number, but is based on one of the CNN examples; actual benefits depend on the level of sparsity in the weights or activations. An important bottleneck in AI/ML that can be eased is data movement. The amount of data that is moved to and from external memory for weights and partial results is decreased to a minimum, by only reading and writing the useful data. This methodology with the associated implementation increases the chance of AI/ML being implemented into 5G.
The herein described hardware implementation involves efficient HW running at all layers of a telecommunication system, as discussed herein, with limited disruptions of a classical HW architecture, be it CPU-like or a full-blown hardwired solution.
The herein described implementation, though initially conceived as an ML accelerator, is a suitable general-purpose mechanism for accelerating sparse matrix arithmetic operations in a DSP, linear algebra, etc. and for reducing power use corresponding to unnecessary memory accesses.
These mechanisms and controllers do not compute when operands (be they activations and weights in the CPU example, or inputs in the HW accelerator example) are zero; by not performing some of the computations, the computation time changes with changing activation sparsity. Higher levels of activation sparsity will correspond to lower computation time. Also, the pattern of activation sparsity will impact the computation time, since the ZG-ZS mechanism can increase the computation time. Separate smaller memories may be implemented next to the main scratchpads/caches or DRAM to keep track of the status bits. The examples described herein result in a decrease in memory energy use, proportional to the amount of sparsity in the network, and may implement special instructions for fetching status bits and mapping them into memory addresses in the case of a processor like a CPU.
In
The base station 70, as a network element of the cellular network 1, provides the UE 10 access to cellular network 1 and to the data network 91 via the core network 90 (e.g., via a user plane function (UPF) of the core network 90). The base station 70 is illustrated as having one or more antennas 58. In general, the base station 70 is referred to as RAN node 70 herein. An example of a RAN node 70 is a gNB. There are, however, many other examples of RAN nodes including an eNB (LTE base station) or transmission reception point (TRP). The base station 70 includes one or more processors 73, one or more memories 75, and other circuitry 76. The other circuitry 76 includes one or more receivers (Rx(s)) 77 and one or more transmitters (Tx(s)) 78. A program 72 is used to cause the base station 70 to perform the operations described herein.
It is noted that the base station 70 may instead be implemented via other wireless technologies, such as Wi-Fi (a wireless networking protocol that devices use to communicate without direct cable connections). In the case of Wi-Fi, the link 11 could be characterized as a wireless link.
Two or more base stations 70 communicate using, e.g., link(s) 79. The link(s) 79 may be wired or wireless or both and may implement, e.g., an Xn interface for fifth generation (5G), an X2 interface for LTE, or other suitable interface for other standards.
The cellular network 1 may include a core network 90, as a third illustrated element or elements, that may include core network functionality, and which provide connectivity via a link or links 81 with a data network 91, such as a telephone network and/or a data communications network (e.g., the Internet). The core network 90 includes one or more processors 93, one or more memories 95, and other circuitry 96. The other circuitry 96 includes one or more receivers (Rx(s)) 97 and one or more transmitters (Tx(s)) 98. A program 92 is used to cause the core network 90 to perform the operations described herein.
The core network 90 could be a 5GC (5G core network). The core network 90 can implement or comprise multiple network functions (NF(s)) 99, and the program 92 may comprise one or more of the NFs 99. A 5G core network may use hardware such as memory and processors and a virtualization layer. It could be a single standalone computing system, a distributed computing system, or a cloud computing system. The NFs 99, as network elements, of the core network could be containers or virtual machines running on the hardware of the computing system(s) making up the core network 90.
Core network functionality for 5G may include access and mobility management functionality that is provided by a network function 99 such as an access and mobility management function (AMF(s)), and session management functionality that is provided by a network function such as a session management function (SMF). Core network functionality for access and mobility management in an LTE network may be provided by an MME (Mobility Management Entity) and/or SGW (Serving Gateway) functionality, which routes data to the data network. Many others are possible, as illustrated by the examples in
In the data network 91, there is a computer-readable medium 94. The computer-readable medium 94 contains instructions that, when downloaded and installed into the memories 15, 75, or 95 of the corresponding UE 10, base station 70, and/or core network element(s) 90, and executed by processor(s) 13, 73, or 93, cause the respective device to perform corresponding actions described herein. The computer-readable medium 94 may be implemented in other forms, such as via a compact disc or memory stick.
The programs 12, 72, and 92 contain instructions stored by corresponding one or more memories 15, 75, or 95. These instructions, when executed by the corresponding one or more processors 13, 73, or 93, cause the corresponding apparatus 10, 70, or 90, to perform the operations described herein. The computer readable memories 15, 75, or 95 may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, flash memory, firmware, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The computer readable memories 15, 75, and 95 may be means for performing storage functions. The processors 13, 73, and 93, may be of any type suitable to the local technical environment, and may include one or more of general-purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs) and processors based on a multi-core processor architecture, as non-limiting examples. The processors 13, 73, and 93 may be means for causing their respective apparatus to perform functions, such as those described herein.
The receivers 17, 77, and 97, and the transmitters 18, 78, and 98 may implement wired or wireless interfaces. The receivers and transmitters may be grouped together as transceivers.
The apparatus 2700 includes a display and/or I/O interface 2708, which includes user interface (UI) circuitry and elements, that may be used to display aspects or a status of the methods described herein (e.g., as one of the methods is being performed or at a subsequent time), or to receive input from a user such as by using a keypad, camera, touchscreen, touch area, microphone, biometric recognition, one or more sensors, etc. The apparatus 2700 includes one or more communication, e.g. network (N/W), interfaces (I/F(s)) 2710. The communication I/F(s) 2710 may be wired and/or wireless and communicate over the Internet/other network(s) via any communication technique including via one or more links 2724. The link(s) 2724 may be the link(s) 11 and/or 79 and/or 31 and/or 81 from
The transceiver 2716 comprises one or more transmitters 2718 and one or more receivers 2720. The transceiver 2716 and/or communication I/F(s) 2710 may comprise standard well-known components such as an amplifier, filter, frequency-converter, (de) modulator, and encoder/decoder circuitries and one or more antennas, such as antennas 2714 used for communication over wireless link 2726.
The control module 2706 of the apparatus 2700 comprises one of or both parts 2706-1 and/or 2706-2, which may be implemented in a number of ways. The control module 2706 may be implemented in hardware as control module 2706-1, such as being implemented as part of the one or more processors 2702. The control module 2706-1 may be implemented also as an integrated circuit or through other hardware such as a programmable gate array. In another example, the control module 2706 may be implemented as control module 2706-2, which is implemented as computer program code (having corresponding instructions) 2705 and is executed by the one or more processors 2702. For instance, the one or more memories 2704 store instructions that, when executed by the one or more processors 2702, cause the apparatus 2700 to perform one or more of the operations as described herein. Furthermore, the one or more processors 2702, the one or more memories 2704, and example algorithms (e.g., as flowcharts and/or signaling diagrams), encoded as instructions, programs, or code, are means for causing performance of the operations described herein.
The apparatus 2700 to implement the functionality of control 2706 may be UE 10, base station 70 (e.g. gNB 70), or core network 90 including any of the network functions 99, which network functions 99 may be implemented with a network entity. Thus, processor 2702 may correspond to processor(s) 13, processor(s) 73 and/or processor(s) 93, memory 2704 may correspond to one or more memories 15, one or more memories 75 and/or one or more memories 95, computer program code 2705 may correspond to program 12, program 72, or program 92, communication I/F(s) 2710 and/or transceiver 2716 may correspond to other circuitry 16, other circuitry 76, or other circuitry 96, and antennas 2714 may correspond to antennas 28 or antennas 58.
Alternatively, apparatus 2700 and its elements may not correspond to either of UE 10, base station 70, or core network and their respective elements, as apparatus 2700 may be part of a self-organizing/optimizing network (SON) node or other node, such as a node in a cloud.
Apparatus 2700 may correspond to the apparatuses depicted in
The apparatus 2700 may also be distributed throughout the network (e.g. 91) including within and between apparatus 2700 and any network element (such as core network 90 and/or the base station 70 and/or the UE 10).
Interface 2712 enables data communication and signaling between the various items of apparatus 2700, as shown in
In
The following examples are provided and described herein.
Example 1. An apparatus including: means for determining a zero status bit of a first element that indicates whether the first element is zero; means for determining a zero status bit of a second element that indicates whether the second element is zero; means for determining at least one combined status bit corresponding to an index of the first element and an index of the second element, wherein the at least one combined status bit indicates whether a product of a next first element and a next second element is non-zero; means for determining at least one pointer that points to a memory address of a next set of elements comprising the next first element and the next second element, based on the at least one combined status bit; means for retrieving the next set of elements from a location in the at least one memory given by the memory address; and means for performing a computation to determine the product of the next first element and the next second element.
Example 2. The apparatus of example 1, wherein the first element and the next first element are weights of a neural network, and the second element and the next second element are activation inputs of the neural network.
Example 3. The apparatus of any of examples 1 to 2, further including: means for determining a sign status bit of the first element that indicates whether the first element is positive or negative; means for determining a sign status bit of the second element that indicates whether the second element is positive or negative; wherein the at least one combined status bit comprises a positive combined status bit corresponding to the index of the first element and the index of the second element, wherein the positive combined status bit indicates whether the product of the next first element and the next second element is non-zero and positive; wherein the at least one combined status bit comprises a negative combined status bit corresponding to the index of the first element and the index of the second element, wherein the negative combined status bit indicates whether the product of the next first element and the next second element is non-zero and negative; and means for determining the at least one pointer that points to the memory address of the next set of elements comprising the next first element and the next second element, based on the positive combined status bit and the negative combined status bit.
Example 4. The apparatus of example 3, further including: means for determining the positive combined status bit and the negative combined status bit based on a bitwise and operation of the zero status bit of the first element and the zero status bit of the second element, and a bitwise xor operation of the sign status bit of the first element and the sign status bit of the second element.
Example 5. The apparatus of any of examples 3 to 4, further including: means for determining a first pointer that points to a memory address of a next set of elements comprising a first element and a second element that when multiplied is positive, based at least on the positive combined status bit; means for retrieving the next set of elements that when multiplied is positive from a location in the at least one memory given by the memory address; means for performing a computation to determine a product of the first element and the second element that when multiplied is positive; means for determining a second pointer that points to a memory address of a next set of elements comprising a first element and a second element that when multiplied is negative, based at least on the negative combined status bit; means for retrieving the next set of elements that when multiplied is negative from a location in the at least one memory given by the memory address; and means for performing a computation to determine a product of the first element and the second element that when multiplied is negative.
Example 6. The apparatus of example 5, further including: means for dereferencing the first pointer; means for determining whether the first pointer points to null, wherein when the first pointer points to null, there are no more pairs of the next set of elements that when multiplied is positive; means for performing the computation to determine the product of the first element and the second element that when multiplied is positive, in response to the first pointer not pointing to null; and means for determining that there are no more pairs of the next set of elements that when multiplied is positive, in response to the first pointer pointing to null.
Example 7. The apparatus of any of examples 5 to 6, further including: means for dereferencing the second pointer; means for determining whether the second pointer points to null, wherein when the second pointer points to null, there are no more pairs of the next set of elements that when multiplied is negative; means for performing the computation to determine the product of the first element and the second element that when multiplied is negative, in response to the second pointer not pointing to null; and means for determining that there are no more pairs of the next set of elements that when multiplied is negative, in response to the second pointer pointing to null.
Example 8. The apparatus of any of examples 1 to 7, further including: means for determining whether to leverage activation sparsity of a model for which the computation is performed; wherein the at least one combined status bit is determined in response to determining to leverage the activation sparsity of the model for which the computation is performed.
Example 9. The apparatus of any of examples 3 to 8, further including: means for determining whether to leverage activation sparsity and feature sparsity of a model for which the computation is performed; wherein the at least one combined status bit is determined in response to determining to leverage activation sparsity of the model for which the computation is performed; wherein the positive combined status bit and the negative combined status bit are determined in response to determining to leverage feature sparsity of the model for which the computation is performed.
Example 10. The apparatus of any of examples 1 to 9, further including: means for determining whether a status register comprising the at least one combined status bit is empty or below a threshold; means for performing the computation to determine the product of the next first element and the next second element, in response to the status register comprising the at least one combined status bit not being empty or being above the threshold; and means for determining to not perform the computation to determine the product of the next first element and the next second element, in response to the status register comprising the at least one combined status bit being empty or being below the threshold.
Example 11. The apparatus of any of examples 1 to 10, further including: means for clearing the at least one combined status bit from a status register.
Example 12. The apparatus of any of examples 1 to 11, further including: means for determining the at least one combined status bit, in response to both the first element being non-zero and the second element being non-zero.
Example 13. The apparatus of any of examples 1 to 12, further including: means for determining to skip multiply and accumulate computations and memory accesses for combinations of operands that produce zero when multiplied; means for maintaining an accumulation value comprising an accumulation of products; means for maintaining a positive queue corresponding to products that, when computed, give positive results, and a positive pointer that points to the products with positive results; means for maintaining a negative queue corresponding to products that, when computed, give negative results, and a negative pointer that points to the products with negative results; means for determining to stop multiply and accumulate computations corresponding to an output once the accumulation value is greater than or equal to a first threshold, remaining products to compute being positive, and the negative pointer pointing to null; means for determining to process products from the negative queue to decrease the accumulation value, in response to the accumulation value being greater than or equal to the first threshold and the negative pointer not pointing to null; means for determining to stop multiply and accumulate computations corresponding to the output once the accumulation value is less than or equal to a second threshold, remaining products to compute being negative, and the positive pointer pointing to null; and means for determining to process products from the positive queue to increase the accumulation value, in response to the accumulation value being less than or equal to the second threshold and the positive pointer not pointing to null.
Example 14. An apparatus including: means for receiving as input a sequential bitstream; means for determining, using a cache, a distribution of zero-inputs from the sequential bitstream; means for determining a zero-input from the distribution of zero-inputs; and means for determining whether to skip reading the zero-input, or to generate an on-chip zero as a zero-gating zero to replace the zero-input.
Example 15. The apparatus of example 14, wherein generating the on-chip zero is performed without reading a full precision zero from an off-chip dynamic random access memory.
Example 16. The apparatus of any of examples 14 to 15, further including: means for determining to generate the on-chip zero as a zero-gating zero to replace the zero-input, in response to the zero-input following a non-zero input and preceding a non-zero input; and means for determining to generate a plurality of on-chip zeros for more parallelism.
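As a non-limiting illustration of examples 14 to 16 only, the sketch below plans, per zero-input, whether to skip it or to substitute an on-chip zero so that the full precision zero is never fetched from off-chip memory; the specific rule shown (an isolated zero between two non-zero inputs is replaced on chip, other zeros are skipped) and the name `plan_zero_handling` are assumptions.

```python
# Hypothetical illustration of examples 14-16: deciding, per zero-input, whether
# to skip it or substitute an on-chip zero so the full-precision zero is never
# fetched from off-chip DRAM.

def plan_zero_handling(values):
    """Return a per-index plan: 'read', 'skip', or 'onchip_zero'."""
    plan = []
    for i, v in enumerate(values):
        if v != 0:
            plan.append("read")            # non-zero input: fetch from memory
            continue
        prev_nonzero = i > 0 and values[i - 1] != 0
        next_nonzero = i + 1 < len(values) and values[i + 1] != 0
        if prev_nonzero and next_nonzero:
            plan.append("onchip_zero")     # keep the pipeline fed, no DRAM read
        else:
            plan.append("skip")            # part of a zero run: skip entirely
    return plan

stream = [3, 0, 5, 0, 0, 7]
print(plan_zero_handling(stream))
# ['read', 'onchip_zero', 'read', 'skip', 'skip', 'read']
```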
Example 17. The apparatus of any of examples 14 to 16, further including: means for selecting, using a multiplexer, an output of a first processing element; and means for processing, using a memory of a convolution accelerator, an output of an accumulator of a second processing element and the selected output of the first processing element to generate a looped input to the first processing element.
Example 18. The apparatus of any of examples 14 to 17, wherein the distribution of zero-inputs comprises weight and activation inputs of a neural network.
Example 19. An apparatus including: means for determining to skip multiply and accumulate computations and memory accesses for combinations of operands that produce zero when multiplied; means for maintaining an accumulation value comprising an accumulation of products; means for maintaining a positive queue corresponding to products with positive results, and a positive pointer that points to the products with positive results; means for maintaining a negative queue corresponding to products with negative results, and a negative pointer that points to the products with negative results; means for determining to stop multiply and accumulate computations corresponding to an output once the accumulation value is greater than or equal to a first threshold, remaining products to compute being positive, and the negative pointer pointing to null; means for determining to process products from the negative queue to decrease the accumulation value, in response to the accumulation value being greater than or equal to the first threshold and the negative pointer not pointing to null; means for determining to stop multiply and accumulate computations corresponding to the output once the accumulation value is less than or equal to a second threshold, remaining products to compute being negative, and the positive pointer pointing to null; and means for determining to process products from the positive queue to increase the accumulation value, in response to the accumulation value being less than or equal to the second threshold and the positive pointer not pointing to null.
Example 20. The apparatus of example 19, further including: means for determining to not overwrite output memory locations with zeros, and instead to set a corresponding zero status bit used to indicate whether an operand is zero.
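As a non-limiting illustration of example 20 only, the sketch below leaves output memory untouched when a result is zero and records the zero in a status bitmap instead; the buffer layout and all names are assumptions.

```python
# Hypothetical illustration of example 20: rather than writing a zero result
# back to output memory, only the corresponding zero status bit is set, so a
# later pass can treat that operand as zero without a memory write or read.

class OutputBuffer:
    def __init__(self, size):
        self.values = [None] * size   # stale contents are simply left in place
        self.zero_bits = [1] * size   # 1 marks "this output is zero"

    def write(self, index, value):
        if value == 0:
            self.zero_bits[index] = 1  # no memory write for a zero result
        else:
            self.values[index] = value
            self.zero_bits[index] = 0

    def read(self, index):
        return 0 if self.zero_bits[index] else self.values[index]

buf = OutputBuffer(4)
buf.write(0, 2.5)
buf.write(1, 0.0)   # only the status bit changes; values[1] is untouched
print(buf.read(0), buf.read(1))  # 2.5 0
```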
Example 21. The apparatus of any of examples 19 to 20, wherein: the first threshold comprises a maximum representable number; and the second threshold comprises a minimum representable number, or the second threshold comprises zero when the multiply and accumulate computations are associated with a rectified linear unit.
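As a non-limiting illustration of example 21 only, the sketch below selects the first threshold as the maximum representable value of a signed accumulator and the second threshold as either the minimum representable value or zero when the output feeds a rectified linear unit; the 8-bit width and the name `select_thresholds` are assumptions. Thresholds chosen this way could be passed to an early-termination routine such as the `saturating_mac` sketch above.

```python
# Hypothetical illustration of example 21: choosing the early-termination
# thresholds.  When the output feeds a ReLU, the lower threshold can be raised
# to zero, since ReLU clamps every negative accumulation to zero anyway.

def select_thresholds(bits=8, followed_by_relu=False):
    upper = 2 ** (bits - 1) - 1                             # e.g. 127 for 8-bit signed
    lower = 0 if followed_by_relu else -(2 ** (bits - 1))   # 0, or e.g. -128
    return upper, lower

print(select_thresholds(bits=8))                         # (127, -128)
print(select_thresholds(bits=8, followed_by_relu=True))  # (127, 0)
```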
Example 22. An apparatus including: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: determine a zero status bit of a first element that indicates whether the first element is zero; determine a zero status bit of a second element that indicates whether the second element is zero; determine at least one combined status bit corresponding to an index of the first element and an index of the second element, wherein the at least one combined status bit indicates whether a product of a next first element and a next second element is non-zero; determine at least one pointer that points to a memory address of a next set of elements comprising the next first element and the next second element, based on the at least one combined status bit; retrieve the next set of elements from a location in the at least one memory given by the memory address; and perform a computation to determine the product of the next first element and the next second element.
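As a non-limiting illustration of example 22 only, the sketch below derives combined status bits, forms a pointer to the index of the next pair of elements whose product is non-zero, retrieves only that pair, and multiplies it; the flat lists standing in for the at least one memory and the names `next_pointer` and `sparse_dot` are assumptions.

```python
# Hypothetical illustration of example 22: use combined status bits to point at
# the next pair of elements with a non-zero product, fetch only that pair, and
# multiply it, skipping every pair whose product would be zero.

def next_pointer(combined_bits, start):
    """Index of the next set combined status bit at or after `start`, or None."""
    for i in range(start, len(combined_bits)):
        if combined_bits[i]:
            return i
    return None

def sparse_dot(first, second):
    combined = [1 if a != 0 and b != 0 else 0 for a, b in zip(first, second)]
    acc, ptr = 0.0, next_pointer(combined, 0)
    while ptr is not None:
        a, b = first[ptr], second[ptr]   # retrieve only the addressed pair
        acc += a * b                     # compute the guaranteed non-zero product
        ptr = next_pointer(combined, ptr + 1)
    return acc

print(sparse_dot([0.0, 2.0, 0.0, 3.0], [1.0, 0.0, 4.0, 5.0]))  # 15.0
```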
Example 23. An apparatus including: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: receive as input a sequential bitstream; determine, using a cache, a distribution of zero-inputs from the sequential bitstream; determine a zero-input from the distribution of zero-inputs; and determine whether to skip reading the zero-input, or to generate an on-chip zero as a zero-gating zero to replace the zero-input.
Example 24. An apparatus including: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: determine to skip multiply and accumulate computations and memory accesses for combinations of operands that produce zero when multiplied; maintain an accumulation value comprising an accumulation of products; maintain a positive queue corresponding to products with positive results, and a positive pointer that points to the products with positive results; maintain a negative queue corresponding to products with negative results, and a negative pointer that points to the products with negative results; determine to stop multiply and accumulate computations corresponding to an output once the accumulation value is greater than or equal to a first threshold, remaining products to compute being positive, and the negative pointer pointing to null; determine to process products from the negative queue to decrease the accumulation value, in response to the accumulation value being greater than or equal to the first threshold and the negative pointer not pointing to null; determine to stop multiply and accumulate computations corresponding to the output once the accumulation value is less than or equal to a second threshold, remaining products to compute being negative, and the positive pointer pointing to null; and determine to process products from the positive queue to increase the accumulation value, in response to the accumulation value being less than or equal to the second threshold and the positive pointer not pointing to null.
Example 25. A method including: determining a zero status bit of a first element that indicates whether the first element is zero; determining a zero status bit of a second element that indicates whether the second element is zero; determining at least one combined status bit corresponding to an index of the first element and an index of the second element, wherein the at least one combined status bit indicates whether a product of a next first element and a next second element is non-zero; determining at least one pointer that points to a memory address of a next set of elements comprising the next first element and the next second element, based on the at least one combined status bit; retrieving the next set of elements from a location in the at least one memory given by the memory address; and performing a computation to determine the product of the next first element and the next second element.
Example 26. A method including: receiving as input a sequential bitstream; determining, using a cache, a distribution of zero-inputs from the sequential bitstream; determining a zero-input from the distribution of zero-inputs; and determining whether to skip reading the zero-input, or to generate an on-chip zero as a zero-gating zero to replace the zero-input.
Example 27. A method including: determining to skip multiply and accumulate computations and memory accesses for combinations of operands that produce zero when multiplied; maintaining an accumulation value comprising an accumulation of products; maintaining a positive queue corresponding to products with positive results, and a positive pointer that points to the products with positive results; maintaining a negative queue corresponding to products with negative results, and a negative pointer that points to the products with negative results; determining to stop multiply and accumulate computations corresponding to an output once the accumulation value is greater than or equal to a first threshold, remaining products to compute being positive, and the negative pointer pointing to null; determining to process products from the negative queue to decrease the accumulation value, in response to the accumulation value being greater than or equal to the first threshold and the negative pointer not pointing to null; determining to stop multiply and accumulate computations corresponding to the output once the accumulation value is less than or equal to a second threshold, remaining products to compute being negative, and the positive pointer pointing to null; and determining to process products from the positive queue to increase the accumulation value, in response to the accumulation value being less than or equal to the second threshold and the positive pointer not pointing to null.
Example 28. A computer readable medium including instructions stored thereon for performing at least the following: determining a zero status bit of a first element that indicates whether the first element is zero; determining a zero status bit of a second element that indicates whether the second element is zero; determining at least one combined status bit corresponding to an index of the first element and an index of the second element, wherein the at least one combined status bit indicates whether a product of a next first element and a next second element is non-zero; determining at least one pointer that points to a memory address of a next set of elements comprising the next first element and the next second element, based on the at least one combined status bit; retrieving the next set of elements from a location in the at least one memory given by the memory address; and performing a computation to determine the product of the next first element and the next second element.
Example 29. A computer readable medium including instructions stored thereon for performing at least the following: receiving as input a sequential bitstream; determining, using a cache, a distribution of zero-inputs from the sequential bitstream; determining a zero-input from the distribution of zero-inputs; and determining whether to skip reading the zero-input, or to generate an on-chip zero as a zero-gating zero to replace the zero-input.
Example 30. A computer readable medium including instructions stored thereon for performing at least the following: determining to skip multiply and accumulate computations and memory accesses for combinations of operands that produce zero when multiplied; maintaining an accumulation value comprising an accumulation of products; maintaining a positive queue corresponding to products with positive results, and a positive pointer that points to the products with positive results; maintaining a negative queue corresponding to products with negative results, and a negative pointer that points to the products with negative results; determining to stop multiply and accumulate computations corresponding to an output once the accumulation value is greater than or equal to a first threshold, remaining products to compute being positive, and the negative pointer pointing to null; determining to process products from the negative queue to decrease the accumulation value, in response to the accumulation value being greater than or equal to the first threshold and the negative pointer not pointing to null; determining to stop multiply and accumulate computations corresponding to the output once the accumulation value is less than or equal to a second threshold, remaining products to compute being negative, and the positive pointer pointing to null; and determining to process products from the positive queue to increase the accumulation value, in response to the accumulation value being less than or equal to the second threshold and the positive pointer not pointing to null.
References to a ‘computer’, ‘processor’, etc. should be understood to encompass not only computers having different architectures such as single/multi-processor architectures and sequential or parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), signal processing devices and other processing circuitry. References to computer program, instructions, code, etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device, whether instructions for a processor or configuration settings for a fixed-function device, gate array or programmable logic device, etc.
The memories as described herein may be implemented using any suitable data storage technology, such as semiconductor based memory devices, flash memory, magnetic memory devices and systems, optical memory devices and systems, non-transitory memory, transitory memory, fixed memory and removable memory. The memories may comprise a database for storing data.
As used herein, the term ‘circuitry’ may refer to the following: (a) hardware circuit implementations, such as implementations in analog and/or digital circuitry, and (b) combinations of circuits and software (and/or firmware), such as (as applicable): (i) a combination of processor(s) or (ii) portions of processor(s)/software including digital signal processor(s), software, and memories that work together to cause an apparatus to perform various functions, and (c) circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present. As a further example, as used herein, the term ‘circuitry’ would also cover an implementation of merely a processor (or multiple processors) or a portion of a processor and its (or their) accompanying software and/or firmware. The term ‘circuitry’ would also cover, for example and if applicable to the particular element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, or another network device.
It should be understood that the foregoing description is only illustrative. Various alternatives and modifications may be devised by those skilled in the art. For example, features recited in the various dependent claims could be combined with each other in any suitable combination(s). In addition, features from different example embodiments described above could be selectively combined into a new example embodiment. Accordingly, this description is intended to embrace all such alternatives, modifications and variances which fall within the scope of the appended claims.
The following acronyms and abbreviations that may be found in the specification and/or the drawing figures are given as follows (the abbreviations and acronyms may be appended with each other or with other characters using e.g. a dash, hyphen, slash, or number, and may be case insensitive):