This disclosure generally relates to hardware architecture and, more particularly, to a processor architecture for machine-learning computations.
Neural networks are increasingly being used to implement machine learning (ML) techniques to solve a wide variety of problems including, but not limited to, object identification, feature classification, or content-driven image processing. Some neural networks, which may be referred to as convolutional neural networks, include one or more convolutional layers. In a convolutional neural network (CNN), the convolutional layers typically account for the vast majority of the computations performed and the data movement within the CNN and/or between the CNN and other elements of an ML model, making them a performance bottleneck. Existing ML accelerators focus on using high compute parallelism along with an optimized data orchestration throughout the memory hierarchy to speed up the processing of convolutional layers. However, existing ML accelerators may not perform well when implemented within edge devices that have strict power consumption constraints and that run inference exercises using previously trained models in real time. For example, existing ML accelerators may not perform well within artificial reality systems for virtual reality (VR), augmented reality (AR), mixed reality (MR), or hybrid reality implemented on standalone head-mounted displays (e.g., on AR/VR headsets), mobile devices or other edge computing devices.
In particular embodiments, a system may comprise at least a processor and an external memory coupled to the processor. The external memory may be a non-transitory memory. The processor may comprise an internal memory, a Multiply-Accumulate (MAC) array, a first vector register array communicatively connected to the MAC array through a multiplexer (MUX), a second vector register array communicatively connected to the MAC array, and a third vector register array communicatively connected to the MAC array. The non-transitory memory may comprise instructions executable by the processor. In particular embodiments, the processor may be a very long instruction word (VLIW) processor comprising a plurality of function units. The instructions may be performed simultaneously in the plurality of function units. In particular embodiments, an instruction set architecture (ISA) of the processor may support hardware instructions associated with hardware components of the processor including the internal memory, the MAC array, the first vector register array, the second vector register array, and the third vector register array. In particular embodiments, the instructions are single instruction multiple data (SIMD) instructions, each of which may process fixed-size vector data.
In particular embodiments, the processor may configure one or more banks of the internal memory with configuration information including one or more bank sizes and one or more alignment sizes. The processor may allocate buffers within the internal memory for input activation tensor, output activation tensor, weights, or biases.
In particular embodiments, the processor may transfer weights for M filters and an input activation tensor from an external memory to the internal memory. In particular embodiments, the processor may insert paddings into the input activation tensor in the internal memory based on first configuration parameters. The first configuration parameters may be determined based at least on an activation tensor width, an activation tensor height, an output activation tensor width, an output activation tensor height, and a stride. The paddings may increase the width or height of the input activation tensor. To insert the paddings into the activation tensor, the processor may allocate a second memory space for an activation matrix corresponding to each channel of the activation tensor along with a configured number of paddings in the internal memory. The processor may initialize the configured number of paddings in the second memory space. Then, the processor may copy data for each row of the activation matrix from a first memory space to a respective memory chunk in the second memory space. In the first memory space, the activation matrix may be stored as a contiguous sequence.
In particular embodiments, a shape of the MAC array may be dynamically configured at runtime to a required shape for convolution operations based on hardware configuration parameters stored in a register array designated for the hardware configuration parameters. The hardware configuration parameters may be determined based at least on C, the number of input channels, and M, the number of filters used for the convolution operations. In particular embodiments, the shape of the MAC array may be configured to C-by-M. An output activation tensor generated as a result of the convolution operations may have M channels. In particular embodiments, possible combinations of C and M may include (64, 8), (32, 16), or (16, 32).
In particular embodiments, the processor may perform convolution operations between an input activation tensor and M filters to generate an output activation tensor. To generate the output activation tensor, the processor may calculate each row e of the output activation tensor of the convolution operations by performing the convolution operations on corresponding R rows {h0e, . . . , hR-1e} of the input activation tensor with the M filters, where R is a filter height. The processor may calculate the output activation tensor row-by-row. A pixel element on row e of the output activation tensor may comprise M channels. To calculate row e of the output activation tensor, the processor may calculate multiplications for each row r among R rows of the filters. For calculating multiplications of weight elements in row r of the filters to generate row e of the output activation tensor, the processor may first determine the row hre of the input activation tensor that corresponds to row r of the filters for calculating row e of the output activation tensor by hre=e×Uy+r, where Uy is a vertical stride for the convolution operations, and where e, hre, and r are zero-based indices. For each column s, where s begins at zero, of the filters, the processor may optimize convolution operations between a filter element at coordinates (r, s) from each of the M filters and valid pixel elements in row hre of the input activation tensor. For the optimized convolution operations, the processor may load P pixel elements starting from pixel element s of row hre of the input activation tensor to the first vector register array, where P=W−S+1, W is an input tensor width, and S is a filter width. Each pixel element may comprise an input activation vector of size C. In particular embodiments, the processor may be operable when executing a first instruction among the instructions in the external memory. The first instruction, a SIMD instruction, may cause the processor to load an input activation vector from the internal memory to a vector register indicated by the first instruction among the first vector register array. The input activation vector may comprise C input activation values corresponding to a pixel element within an input activation tensor, where C is a number of input channels. A source location of the input activation vector in the internal memory may be indicated by the first instruction. The processor may execute the first instruction a number of times with different parameters to load the P pixel elements starting from pixel element s of row hre of the input activation tensor to the first vector register array.
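Purely as an illustration of the indexing just described, and not as part of the processor's instruction set, the following C sketch shows the row-by-row schedule; the loop bounds and variable names such as E for the output height are assumptions introduced here for clarity.

    /* Row-by-row schedule sketch: E is the output height, R and S the filter
     * height and width, W the input width, and Uy the vertical stride.  All
     * indices are zero-based, matching the text above. */
    void convolution_row_schedule(int E, int R, int S, int W, int Uy)
    {
        int P = W - S + 1;                    /* pixel elements loaded per row */
        for (int e = 0; e < E; e++) {         /* one output row at a time      */
            for (int r = 0; r < R; r++) {     /* one filter row at a time      */
                int h = e * Uy + r;           /* input row h_r^e = e * Uy + r  */
                for (int s = 0; s < S; s++) {
                    /* Load P pixel elements of input row h, starting at column
                     * s, into the first vector register array, then multiply
                     * them with filter element (r, s) of each of the M filters
                     * and accumulate the partial outputs. */
                    (void)h; (void)P;
                }
            }
        }
    }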
In particular embodiments, for the optimized convolution operations between filter elements at coordinates (r, s) from the M filters and valid pixel elements in row hre of the input activation tensor, the processor may load a filter element at coordinates (r, s) from each of the M filters to the second vector register array. Each filter element may comprise a weight vector of size C. Each of the M weight vectors in the weight vector array may comprise C weight values. The weight vector array may comprise the filter elements at a given position from each of the M filters. The processor may execute a second instruction, a SIMD instruction that loads a weight vector from the internal memory to a vector register of the second vector register array, a number of times with various parameters to load the filter elements at coordinates (r, s) of the M filters.
In particular embodiments, the processor may be further operable to execute a third instruction among the instructions. The third instruction, a single SIMD instruction, may cause the processor to feed a weight vector array from the second vector register array to the MAC array. The third instruction may also cause the processor to broadcast an input activation vector to the MAC array. The input activation vector may be selected by the MUX from the first vector register array based on the third instruction. The third instruction may further cause the processor to multiply an input activation value broadcast to each MAC unit of the MAC array from the input activation vector and a weight value fed to the MAC unit from the weight vector array. The third instruction may cause the processor to generate a partial output activation value for each of the M filters by accumulating outputs of MAC units corresponding to the filter. A partial output activation vector may comprise M partial output activation values. The third instruction may further cause the processor to store the partial output activation vector to a vector register of the third vector register array. The storing may either overwrite residual values of the vector register with the values of the partial output activation vector or accumulate the values of the partial output activation vector onto the residual values of the vector register.
In particular embodiments, the processor may calculate a partial output activation vector for each valid pixel element k among the P pixel elements in the first vector register array by executing the third instruction, where the valid pixel element k at an iteration l is determined as k=l*Ux, where Ux is a horizontal stride for the convolution operations, wherein P=W−S+1, and where k and l are zero-based indices. For calculating a partial output activation vector, the processor may: (a) perform M*C element-wise multiplications between the pixel element k and the filter element at coordinates (r, s) of the M filters; (b) generate a partial output activation vector having M output channels by summing results of multiplications belonging to a respective filter; and (c) accumulate the partial output activation vector to a corresponding vector register among the third vector register array. To perform the M*C element-wise multiplications between the pixel element k and the filter element at coordinates (r, s) of the M filters, the processor may feed each of the M weight vectors in the second vector register array to a corresponding column of the MAC array. Each of the M weight vectors may be a filter element at coordinates (r, s) from a respective filter among the M filters. Then, the processor may broadcast an input activation vector in the first vector register array to the columns of the MAC array. The input activation vector may correspond to the pixel element k among the W−S+1 pixel elements in the first vector register array. The MUX may select a vector register containing the pixel element k among the first vector register array containing the W−S+1 pixel elements. Finally, the processor may perform a multiplication at each MAC unit in the MAC array between a respective activation value of the pixel element k corresponding to a channel and a respective weight value of the filter element at coordinates (r, s) corresponding to the channel from one of the M filters.
In particular embodiments, the processor may be further operable when executing a fourth instruction among the instructions. The fourth instruction may cause the processor to quantize n-bit numbers in a vector register among the third vector register array to m-bit numbers based on quantization parameters stored in a corresponding vector register designated for the quantization parameters. The fourth instruction may further cause the processor to perform a non-linear operation on the quantized m-bit numbers. In particular embodiments, the n-bit numbers may be 32-bit numbers. In particular embodiments, m may be 8, 16, or 32. In particular embodiments, the processor may perform quantization operations on the third vector register array based on quantization parameters by executing the fourth instruction. A quantization operation comprises a non-linear activation operation. Parameters associated with the non-linear activation operation may be in the quantization parameters stored in the corresponding vector register. The quantization operation may further comprise adding a bias. Parameters associated with the bias may be stored in a corresponding vector register. The third vector register array may comprise 32-bit elements. Quantization may quantize the 32-bit numbers in the third vector register array into 8-, 16-, or 32-bit numbers. In particular embodiments, the processor may store row e of the output activation tensor in the third vector register array to the internal memory.
The embodiments disclosed herein are only examples, and the scope of this disclosure is not limited to them. Particular embodiments may include all, some, or none of the components, elements, functions, operations, or steps of the embodiments disclosed above. Embodiments according to the invention are in particular disclosed in the attached claims directed to a method, a storage medium, a system and a computer program product, wherein any element mentioned in one claim category, e.g., method, can be claimed in another claim category, e.g., system, as well. The dependencies or references back in the attached claims are chosen for formal reasons only. However, any subject matter resulting from a deliberate reference back to any previous claims (in particular multiple dependencies) can be claimed as well, so that any combination of claims and the elements thereof are disclosed and can be claimed regardless of the dependencies chosen in the attached claims. The subject-matter which can be claimed comprises not only the combinations of elements as set out in the attached claims but also any other combination of elements in the claims, wherein each element mentioned in the claims can be combined with any other element or combination of other elements in the claims. Furthermore, any of the embodiments and elements thereof described or depicted herein can be claimed in a separate claim and/or in any combination with any embodiment or element described or depicted herein or with any of the elements of the attached claims.
Embodiments of the invention may include or be implemented in conjunction with an artificial reality system. Artificial reality is a form of reality that has been adjusted in some manner before presentation to a user, which may include, e.g., a virtual reality (VR), an augmented reality (AR), a mixed reality (MR), a hybrid reality, or some combination and/or derivatives thereof. Artificial reality content may include completely generated content or generated content combined with captured content (e.g., real-world photographs). The artificial reality content may include video, audio, haptic feedback, or some combination thereof, and any of which may be presented in a single channel or in multiple channels (such as stereo video that produces a three-dimensional effect to the viewer). Additionally, in some embodiments, artificial reality may be associated with applications, products, accessories, services, or some combination thereof, that are, e.g., used to create content in an artificial reality and/or used in (e.g., perform activities in) an artificial reality. The artificial reality system that provides the artificial reality content may be implemented on various platforms, including a head-mounted display (HMD) connected to a host computer system, a standalone HMD, a mobile device or computing system, or any other hardware platform capable of providing artificial reality content to one or more viewers.
Before discussing the present embodiments in detail, it may be beneficial to first provide some background information regarding neural networks and machine learning (ML) models in general. Machine-learning techniques have been used in a number of domains such as computer vision, natural language processing, video context understanding, self-driving cars, etc. Neural Network (NN)/Deep Learning (DL) algorithms are the most popular of these techniques and are the focus of this disclosure. These algorithms learn from massive datasets during a compute-intensive process called training, by repeatedly adjusting the parameters of the NN (weights and biases) to minimize the error between the NN output and a pre-recorded ground truth. Once training is complete, these network parameters are fixed and the network is deployed in the real world, a phase called inference. The focus of this disclosure is on inference use-cases, but many of the disclosed techniques may be applied to training as well.
Convolutional Neural Networks (CNNs) are a class of NNs used widely in computer vision and image processing. A CNN may comprise a stack of convolutional layers, each followed by a non-linear function such as a rectified linear unit (ReLU), a leaky ReLU, or a Sigmoid, which may be grouped together as a singular block. A CNN may also include other operators, such as pooling and fully-connected (FC) layers to change the dimensionality of intermediate data, and a SoftMax layer to normalize the outputs to a probability distribution. These components are stacked in different combinations to represent unique NN architectures. A network's learning capacity/representative power is roughly proportional to its compute complexity, which may be measured by the number of multiply-accumulate operations and the number of parameters. Typically, convolutional layers and FC layers are the most compute-intensive and benefit significantly from hardware acceleration.
A Very Long Instruction Word (VLIW) processor may issue and complete more than one operation at a time. In VLIW processors, the long instruction word encodes concurrent operations, resulting in dramatically reduced hardware complexity. A typical VLIW processor has multiple function units, and multiple independent operations are grouped together to form a single VLIW instruction. These operations are initiated in the same clock cycle and each operation is dispatched to an independent function unit. All the function units share the same register file.
Single Instruction Multiple Data (SIMD) refers to cases in which a single instruction operates on multiple data elements simultaneously. A typical example of a SIMD instruction is a vector operation, such as an element-wise addition of two fixed-size vectors, performed by a single instruction.
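Purely as an illustrative sketch, and not as an instruction of the processor described below, the following C code shows the element-wise view of such a SIMD vector addition; the vector length of 16 and the 8-bit element type are arbitrary assumptions.

    #include <stdint.h>

    #define VLEN 16   /* example fixed vector length */

    /* Conceptual model of a SIMD vector add: the hardware applies the same
     * operation to every element of two fixed-size vectors in response to a
     * single instruction; the loop below is what that one instruction does. */
    void simd_vector_add(const int8_t a[VLEN], const int8_t b[VLEN],
                         int8_t out[VLEN])
    {
        for (int i = 0; i < VLEN; i++)
            out[i] = (int8_t)(a[i] + b[i]);
    }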
Mapping ML algorithms with high compute complexity onto scalar processors is slow and inefficient. Some types of operations in a CNN may be optimized by mapping them to dedicated hardware processing units. For example, matrix-multiply and convolution functions can be accelerated by mapping them to a 2D multiply-accumulate hardware array. Such implementations may benefit from additional computational resources and from a dataflow and hardware microarchitecture that promote higher memory reuse, higher compute utilization, and better area efficiency.
The second approach, shown in (B) of the figure, maps such operations to a purpose-built ML accelerator.
ML workloads, specifically deep-learning algorithms such as CNNs, have imposed a significant increase in compute and memory throughput requirements on modern processors. These algorithms have high parallelism and predictable memory access, and they gain significant hardware efficiency from optimizing for constrained operation sets, data types, and dataflows. Generic Central Processing Units (CPUs) and Digital Signal Processors (DSPs) are highly programmable and optimized for a wide range of general workloads. However, generic CPUs/DSPs are suboptimal in terms of compute performance, power, and area efficiency for such modern workloads. Modern processor concepts like SIMD and VLIW have significantly improved the compute throughput of such processors for many data-parallel algorithms such as signal processing, but the processors still suffer from significant inefficiencies when handling ML computations such as neural networks. Purposefully built ML accelerators, such as the accelerator shown in (B) of the figure, have been developed to address these inefficiencies.
In particular embodiments, a system may comprise at least a processor and an external memory coupled to the processor. The external memory may be a non-transitory memory. The processor may comprise one or more cores, one or more function units, an integrated Direct Memory Access (iDMA), and an internal memory. The processor may also comprise a Multiply-Accumulate (MAC) array, a quantizer, a first vector register array communicatively connected to the MAC array through a multiplexer (MUX), a second vector register array communicatively connected to the MAC array, and a third vector register array communicatively connected to the MAC array. The non-transitory memory may comprise instructions executable by the processor. In particular embodiments, the processor may be a very long instruction word (VLIW) processor comprising a plurality of function units. The instructions may be performed simultaneously in the plurality of function units. In particular embodiments, an instruction set architecture (ISA) of the processor may support hardware instructions associated with hardware components of the processor including the internal memory, the MAC array, the quantizer, the first vector register array, the second vector register array, and the third vector register array. In particular embodiments, the instructions are single instruction multiple data (SIMD) instructions, each of which may process fixed-size vector data.
Typical ML computations are mapped to optimal processor binaries which are either automatically generated using ML compilers or hand-written. Each node of the ML compute graph may be mapped to an optimized hardware component. Further, these nodes may be connected to each other using the nearest level of memory, preferably the internal memory 440. Higher-level graph optimizations may be performed by breaking down each node into smaller tiles, fusing nodes, and reordering the sequence of operations to yield maximum performance for the end-to-end ML deployment. The system may implement each ML computation graph operation in dedicated hardware components. Operations such as convolution and matrix multiply may leverage the ML extensions, including the MAC array 480, the weight vector register array 460, and the output vector register array 470. Operations that are not supported by the ML extensions, such as pooling, may fall back to the baseline function units 420 in the processor 400. The internal memory 440 and the general-purpose registers 450 may be central to passing data across nodes and to sharing information between the components within the processor 400. The iDMA engine 430 may be employed to transfer data between the external memory 405 and the internal memory 440. The data transferred may include the ML parameters, including weights and biases of a given layer, and spills of intermediate activations across layer nodes which cannot be held stationary in the internal memory 440.
The iDMA 430 may be used during step 1 to load, from the external memory 405 into the internal memory 440, all data required to perform the convolution operation. Multiple LD instructions may be processed by the iDMA 430 during this step.
Steps 2-3 represent the software-pipelined implementation of the CNN pseudocode presented in the figure.
Remaining instructions of the CNN pseudocode, which cannot fit inside the inner loop, may be performed during the Epilog stage. Step 4, post process, is a hypothetical step which applies certain post processing to the outputs of the CNN.
In particular embodiments, the registers may hold vectors of variable lengths, encoded in the ISA. For example, an instruction LD_W_16 (source-reg, offset-src, dest-reg, offset-dest) may load 16 elements to the weight vector register array 460 from a specified offset of source-reg, while LD_W_8 (source-reg, offset-src, dest-reg, offset-dest) may access only 8 elements.
In particular embodiments, the MAC array 480 may consist of a plurality of MAC units, each computing a product of a weight fed from the weight vector register array 460 and an input activation fed from an input activation vector register array. The MAC array 480 may broadcast an input activation vector of size C into M lanes (corresponding to M output channels), each fed with a weight vector of size C, to compute C*M multiply operations. The input activation vector may be selected by the MUX 610 among the input activation vectors in the input activation vector register array. The C and M parameters may be configurable at runtime to enable efficient processing of various shapes of convolution operation. The configuration may be referred to as cnn_mode. In particular embodiments, cnn_mode may be configured explicitly as an encoding in the ISA instruction. In particular embodiments, cnn_mode may be configured globally in a control register.
In particular embodiments, the numeric format could be variable-precision fixed point or floating point (e.g., int4, int8, fp32, etc.), configurable either at runtime using ISA encoding or statically at design time.
In particular embodiments, the hardware components in the ML extensions may be configured to handle various shapes, process different data types, implement different activation functions and support different rounding modes. The configuration may be stored in a control register. The control register may comprise fields including activation type, group size, group/channel configuration, output rounding mode, output data type, weight data type, and input data type. The data type fields may specify which data type is used for respective data. The output rounding mode may determine one of round half up, round to nearest even, or no rounding (truncate). Since the MAC array 480 comprises 512 multipliers (MAC units), a value of the group size field and a channel size indicated in the group/channel configuration field may determine an output channel size. When the group size=1, if input channel=64, then the output channel should be 8. When the group size=1, if input channel=32, then the output channel should be 16. When the group size=1, if input channel=16, then the output channel should be 32. When group size=8 and input channel=8, then the output channel should be 8. When group size=4 and input channel=16, the output channel should be 8. When group size=4 and input channel=8, then the output channel should be 16. When group size=2 and input channel=16, then the output channel should be 16.
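These combinations all follow from the 512-multiplier budget of the MAC array 480. The following minimal C sketch is an inference from the values listed above, not a documented formula of the control register, and is included only to make the relationship explicit.

    #include <assert.h>

    /* With 512 MAC units, group_size * input_channels * output_channels = 512,
     * so the output channel size follows from the other two fields. */
    static int output_channels(int group_size, int input_channels)
    {
        return 512 / (group_size * input_channels);
    }

    void check_listed_configurations(void)
    {
        assert(output_channels(1, 64) == 8);
        assert(output_channels(1, 32) == 16);
        assert(output_channels(1, 16) == 32);
        assert(output_channels(8, 8)  == 8);
        assert(output_channels(4, 16) == 8);
        assert(output_channels(4, 8)  == 16);
        assert(output_channels(2, 16) == 16);
    }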
In particular embodiments, the MAC array 480 may have 512 MAC units that can carry out 512 8×8-bit multiplications with accumulations in a single cycle. Depending on the input/output channel dimensions of the CNN workload, the MAC array 480 can support 3 working modes when group=1:
A first mode is input channel=64 and output channel=8. In this mode, the MAC array 480 has 8 output lanes. The MAC array 480 may take 64-byte input activation data indicated by IA0-IA63 and broadcast it along all 8 output lanes M0-M7. Then, in each lane, a dot product will be carried out where each of the 64-byte input activation data is multiplied with the 64-byte weight data of that lane. The outputs of the 64 multiplications may be accumulated. An output y0 of the first lane M0 may be calculated as y0=IA0*M0[0]+IA1*M0[1]+IA2*M0[2]+ . . . +IA63*M0[63]. The final sum may be stored in the output vector register array 470. An example instruction for a CNN operation in this mode may be CNN_0 (v0, 0, 1), in which v0 may be a vector register containing 64-byte input activation data. The second parameter is not used in this mode. The third parameter decides whether this CNN instruction overwrites the corresponding output vector register (if it is 1) or accumulates into the corresponding output vector register (if it is 0).
A second mode is input channel=32 and output channel=16. In this mode, the MAC array 480 has 16 output lanes. The MAC array 480 may take half of the elements of a 64-byte input activation vector register, indicated by IA0-IA31, and broadcast them along all 16 output lanes M0-M15. Then, in each lane, a dot product will be carried out where each of the 32-byte input activation data is multiplied with the 32-byte weight data of that lane. The outputs of the 32 multiplications may be accumulated. An output y0 of the first lane M0 may be calculated as y0=IA0*M0[0]+IA1*M0[1]+IA2*M0[2]+ . . . +IA31*M0[31]. The final sum may be stored in the output vector register array 470. In this mode, a 64-byte input activation vector register is processed twice, and a CNN instruction can decide which half to process based on the second parameter. Example instructions for the two CNN operations may be CNN_0 (v0, 0, 1) and CNN_1 (v0, 1, 1). In this mode, 2 CNN instructions are needed to process the v0 register.
A third mode is input channel=16 and output channel=32. In this mode, the MAC array 480 has 32 output lanes. The MAC array 480 may take a quarter of the elements of a 64-byte input activation vector register, indicated by IA0-IA15, and broadcast them along all 32 output lanes M0-M31. Then, in each lane, a dot product will be carried out where each of the 16-byte input activation data is multiplied with the 16-byte weight data of that lane. The outputs of the 16 multiplications may be accumulated. An output y0 of the first lane M0 may be calculated as y0=IA0*M0[0]+IA1*M0[1]+IA2*M0[2]+ . . . +IA15*M0[15]. The final sum may be stored in the output vector register array 470. In this mode, a 64-byte input activation vector register is processed four times, and a CNN instruction can decide which quarter to process based on the second parameter. Example instructions for the four CNN operations may be CNN_0 (v0, 0, 1), CNN_1 (v0, 1, 1), CNN_2 (v0, 2, 1), and CNN_3 (v0, 3, 1). In this mode, 4 CNN instructions are needed to process the v0 register.
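The three modes can be summarized by a single behavioral model. The C sketch below is a software description of one group=1 CNN operation, not the hardware datapath; the flat weight layout and the part parameter (standing in for the second instruction parameter that selects a half or quarter of the 64-byte register) are assumptions made for illustration.

    #include <stdint.h>

    /* Behavioral model of one group=1 CNN operation.  ia is the 64-byte input
     * activation register, w holds out_ch weight vectors of in_ch bytes each,
     * in_ch is 64, 32, or 16 (out_ch = 512 / in_ch lanes), part selects which
     * in_ch-sized slice of ia is broadcast (0 for in_ch=64, 0-1 for 32, 0-3
     * for 16), and overwrite mirrors the third instruction parameter. */
    void cnn_group1(const int8_t ia[64], const int8_t *w /* [out_ch][in_ch] */,
                    int32_t *out /* [out_ch] */, int in_ch, int part,
                    int overwrite)
    {
        int out_ch = 512 / in_ch;
        const int8_t *slice = &ia[part * in_ch];   /* broadcast to every lane */
        for (int m = 0; m < out_ch; m++) {
            int32_t acc = 0;
            for (int c = 0; c < in_ch; c++)        /* per-lane dot product    */
                acc += (int32_t)slice[c] * (int32_t)w[m * in_ch + c];
            out[m] = overwrite ? acc : out[m] + acc;
        }
    }

In the second mode, for instance, two calls with part=0 and part=1 correspond to the two CNN instructions needed to consume the v0 register; in the third mode, four calls are needed.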
When the group size is larger than one, the 512 MAC units in the MAC array 480 may be divided into smaller groups and work accordingly depending on input/output channel sizes.
In particular embodiments, the output of the MAC units may be fed through a hierarchical adder tree 620, whose configuration may be changed at runtime as determined by the cnn_mode. The cnn_mode may specify how the different MAC units are grouped to perform the summation operation.
In particular embodiments, the numeric precision of the intermediate adder tree 620 and the accumulator 630 may be maintained at a higher precision than the inputs to the MAC units so that accuracy is not lost.
Example Instructions Extended with ML Extensions
A set of SIMD instructions has been introduced to utilize the ML extensions. The SIMD instructions may be parallelized in a VLIW processor at runtime. In particular embodiments, the processor 400 may be operable when executing a first instruction among the instructions in the external memory 405. The first instruction, a SIMD instruction, may cause the processor 400 to load an input activation vector from the internal memory 440 to a vector register indicated by the first instruction among the first vector register array. The input activation vector may comprise C input activation values corresponding to a pixel element within an input activation tensor, where C is a number of input channels. A source location of the input activation vector in the internal memory may be indicated by the first instruction. For example, LD_ATEMP0, LD_ATEMP1, . . . , LD_ATEMPN may be used to load the vector registers ATEMP0, ATEMP1, . . . , ATEMPN with respective activation vectors. ATEMP0, ATEMP1, . . . , ATEMPN may belong to an input activation vector register array. In particular embodiments, ATEMP0, ATEMP1, . . . , ATEMPN may be a part of the general-purpose registers 450. Basic usage may be LD_ATEMP<0-N>(mem_addr, offset), wherein the activation vector is loaded from mem_addr+offset.
In particular embodiments, the processor 400 may be further operable when executing a second instruction among the instructions in the external memory. The second instruction, a SIMD instruction, may cause the processor 400 to load a weight vector from the internal memory 440 to a vector register indicated by the second instruction among the second vector register array, wherein the second vector register array may be the weight vector register array 460. A source location of the weight vector in the internal memory 440 may be indicated by the second instruction. For example, LD_WTEMP0, LD_WTEMP1, . . . , LD_WTEMP7 can be used to load WTEMP registers, an instance of the weight vector register array 460. Basic usage may be LD_WTEMP<0-7>(mem_addr, offset), where the weight vector may be loaded from a location mem_addr+offset of the internal memory 440.
In particular embodiments, the processor 400 may be further operable to execute a third instruction among the instructions. The third instruction, a single SIMD instruction, may cause the processor 400 to feed a weight vector array from the second vector register array to the MAC array 480, wherein the second vector register array may be the weight vector register array 460. Feeding a weight vector array to the MAC array 480 may comprise providing a weight value of each element to a corresponding MAC unit of the MAC array 480. In particular embodiments, each 8-bit element in the weight vector register array 460 may be pre-associated with a MAC unit in the MAC array 480. The third instruction may also cause the processor 400 to broadcast an input activation vector to the MAC array 480. The input activation vector may be selected by the MUX 610 from the first vector register array based on the third instruction. The third instruction may further cause the processor 400 to multiply an input activation value broadcast to each MAC unit of the MAC array 480 from the input activation vector and a weight value fed to the MAC unit from the weight vector array. The third instruction may cause the processor to generate a partial output activation value for each of the M filters by accumulating outputs of MAC units corresponding to the filter. A partial output activation vector may comprise M partial output activation values. The third instruction may further cause the processor 400 to store the partial output activation vector to a vector register of the third vector register array, wherein the third vector register array may be the output vector register array 470. The storing may either overwrite residual values of the vector register with the values of the partial output activation vector or accumulate the values of the partial output activation vector onto the residual values of the vector register. For example, CNN_<0-15>(vec_reg, param0, param1) can be used for feeding a weight vector array from the weight vector register array 460 to the MAC array 480; broadcasting an input activation vector among the input activation vectors stored in a register array to the MAC array 480; performing multiplications between the weights and the broadcast input activation vector, where vec_reg indicates a vector register among a vector register array and param0 indicates which portion of the vector register is to be used; generating a partial output activation value for each of the M filters by accumulating outputs of MAC units corresponding to the filter; and storing the partial output activation vector to a vector register of the output vector register array 470. param1 may indicate whether the storing is overwriting or accumulating. The vector register among the output vector register array 470 may be indicated by the value in <0-15> in the opcode. Instructions for other types of CNN operations, including single-channel CNN, depth-wise CNN, and group-wise CNN, are also available.
In particular embodiments, the processor 400 may be further operable to execute a fourth instruction among the instructions. The fourth instruction, a single SIMD instruction, may cause the processor 400 to modify a part of a vector register. For example, LD_OTMP128_<0-15>(mem_addr, offset) can be used to partially modify one of the OTMP registers, an instance of the output vector register array 470. The instruction loads only the first 128 bits of the 512-bit register from mem_addr+offset in the internal memory 440. The register within the register array may be indicated by <0-15> in the opcode. In particular embodiments, the processor 400 may be further operable to execute a fifth instruction among the instructions. The fifth instruction, a single SIMD instruction, may cause the processor 400 to store a part of a vector register to the internal memory 440. For example, ST_OTMP128_<0-15>(mem_addr, offset) can be used to partially store the first 128 bits of a 512-bit register indicated by <0-15> in the opcode to mem_addr+offset in the internal memory 440.
In particular embodiments, the processor 400 may be further operable when executing a sixth instruction among the instructions. The sixth instruction may cause the processor 400 to quantize n-bit numbers in a vector register among the third vector register array to m-bit numbers based on quantization parameters stored in a corresponding vector register designated for the quantization parameters. The sixth instruction may further cause the processor to perform a non-linear operation on the quantized m-bit numbers. In particular embodiments, the n-bit numbers may be 32-bit numbers. In particular embodiments, m may be 8, 16, 32, or any suitable number. For example, QUANT_<0-15>(reg1, reg0) can be used to quantize a vector register among the output vector register array 470. The vector register may be indicated by <0-15> in the opcode. The reg0 and reg1 operands are wide vector registers of 1,536 bits (192 bytes) each, containing the quantization data (e.g., per-channel quantization parameters, biases, and non-linear activation parameters).
When a per-layer quantization is needed, G_QUANT_0 (qindex, lqindex, qvar_select) can be used, where qindex and lqindex range from 0-7 and are used to select an OTEMP register indexed from 0 through 7. qvar_select indicates whether Relu or LeakyRelu is used for a non-linear activation.
First, the weight vector register array 460 may be loaded from the internal memory 440. The weight tensors can be arranged a priori in a hardware-efficient layout, such as M×C, and loaded into the weight vector register array of size C*M. Loading weight vectors may consume multiple cycles/steps based on the bandwidth available to the internal memory 440. Second, activation registers may be loaded from the internal memory 440. The activation registers may be a dedicated input activation vector register array, or a part of the general-purpose registers 450. Each activation vector is in the C-dimension. Loading activation vectors may take multiple cycles. In the illustrated example, the loaded vector is placed in the IA0 activation register.
In particular embodiments, the processor 400, at step 810, may configure one or more banks of the internal memory 440 with configuration information including one or more bank sizes and one or more alignment sizes. The internal memory 440 may be a tightly coupled memory that might be software managed. The system may have the software-managed memory filled from, and evicted to, higher levels of memory using DMA rather than relying on an L1 cache, which is typical in processor workloads. This approach may help achieve better performance for the ML extension hardware components, for which the memory access sequence can be known a priori at compile time. Multiple banks of the internal memory 440 may be present in hardware to allow simultaneous access to these banks without bank conflicts. Bank configuration for the internal memory 440 may be a first step before executing the CNN workloads, in which the desired internal memory sizes for bank0 and bank1 are specified. An example configuration operation may be MEM_init(bank0_size, alignment_size, bank1_size, alignment_size). The MEM_init( ) function may allocate bank0_size and bank1_size for bank0 and bank1, respectively, with the initial addresses of bank0 and bank1 aligned by alignment_size. The processor may allocate buffers within the internal memory for the input activation tensor, the output activation tensor, weights, or biases. An example buffer configuration may be done by MEM_buffer_alloc(buffer, bank_index, allocation_size, alignment_size).
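A hypothetical configuration sequence using the calls named above might look as follows; the function signatures are inferred from the parameter lists in the text, and the buffer handle type, bank sizes, and alignments are illustrative assumptions only.

    /* Assumed declarations for the configuration calls described above. */
    extern void MEM_init(unsigned bank0_size, unsigned bank0_alignment,
                         unsigned bank1_size, unsigned bank1_alignment);
    extern void MEM_buffer_alloc(void **buffer, int bank_index,
                                 unsigned allocation_size, unsigned alignment_size);

    void configure_internal_memory(void **ia_buf, void **oa_buf, void **w_buf)
    {
        MEM_init(128 * 1024, 64, 128 * 1024, 64);   /* two example banks        */
        MEM_buffer_alloc(ia_buf, 0, 64 * 1024, 64); /* input activation buffer  */
        MEM_buffer_alloc(oa_buf, 0, 32 * 1024, 64); /* output activation buffer */
        MEM_buffer_alloc(w_buf,  1, 32 * 1024, 64); /* weights and biases       */
    }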
In particular embodiments, the processor 400, at step 815, may transfer weights for M filters, an input activation tensor, biases, and information associated with configurations from an external memory 405 to the internal memory 440. Example data transfer may be done by dma_smem2MEM(smem_addr, buffer_addr, transfer_size). Usually, iDMA may be used for the data transfer. However, load/store instructions may be used for the transfer as well.
In particular embodiments, the processor 400 may insert paddings into the input activation tensor in the internal memory based on first configuration parameters at step 820. The first configuration parameters may be determined based at least on an activation tensor width, an activation tensor height, an output activation tensor width, an output activation tensor height, and a stride. The paddings may increase the width or height of the input activation tensor. For convolutions with a filter size larger than one, the input activation data may need to be converted from a continuous 1D layout to a 2D layout in order to accommodate the paddings along the width and height dimensions. To insert the paddings into the activation tensor, the processor 400 may allocate a second memory space for an activation matrix corresponding to each channel of the activation tensor along with a configured number of paddings in the internal memory 440. The processor 400 may initialize the configured number of paddings in the second memory space. Then, the processor may copy data for each row of the activation matrix from a first memory space to a respective memory chunk in the second memory space. In the first memory space, the activation matrix may be stored as a contiguous sequence.
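A simplified C sketch of this 1D-to-2D conversion is given below for a single channel, assuming 8-bit activations, the same padding amount on every side, and zero-valued padding; none of these assumptions is mandated by the hardware.

    #include <stdint.h>
    #include <string.h>

    /* Copy one channel of the activation tensor from its contiguous W x H
     * layout (first memory space) into a padded (W + 2*pad) x (H + 2*pad)
     * layout (second memory space), row by row, as described above. */
    void insert_padding(const int8_t *src, int8_t *dst, int W, int H, int pad)
    {
        int Wp = W + 2 * pad;
        int Hp = H + 2 * pad;
        memset(dst, 0, (size_t)Wp * (size_t)Hp);          /* initialize paddings */
        for (int row = 0; row < H; row++)                 /* copy each row       */
            memcpy(&dst[(size_t)(row + pad) * Wp + pad],
                   &src[(size_t)row * W], (size_t)W);
    }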
The ML extension hardware components, including the MAC array 480, may be reconfigurable to calculate various types and dimensions of CNN workloads efficiently. For example, the MAC array 480 may calculate 1×1 CNNs, 3×3 CNNs, depth-wise CNNs, and group-wise CNNs. In particular embodiments, at step 825, a shape of the MAC array may be dynamically configured at runtime to a required shape for convolution operations based on hardware configuration parameters stored in a register array designated for the hardware configuration parameters. The hardware configuration parameters may be determined based at least on C, the number of input channels, and M, the number of filters used for the convolution operations. In particular embodiments, the shape of the MAC array may be configured to C-by-M. An output activation tensor generated as a result of the convolution operations may have M channels. In particular embodiments, possible combinations of C and M may include (64, 8), (32, 16), (16, 32), or any suitable combination.
In particular embodiments, the processor 400 may, at step 830, load the quantization parameters of the CNN workloads into wide-vector registers from the internal memory 440 if per-channel quantization configurations are needed. The loading to wide-vector registers may be a 2-step operation: first, the quantization parameters may be loaded from the internal memory 440 into vector registers. Then the quantization parameters may be loaded into wide-vector registers from vector registers. If a layer quantization is used, a single state register may be used for quantization parameters.
In particular embodiments, the processor 400 may, at step 835, load a row of the input activation tensor involved in the convolution to an input activation vector register array. When performing R×S convolution operations between the input activation tensor and M filters, the processor 400 may calculate each row e of an output activation tensor of the convolution operations by performing the convolution operations on corresponding R rows {h0e, . . . , hR-1e} of the input activation tensor with the M filters, where R is a filter height. The processor may calculate the output activation tensor row-by-row. A pixel element on row e of the output activation tensor may comprise M channels. To calculate row e of the output activation tensor, the processor may calculate multiplications for each row r among R rows of the filters. For calculating multiplications of weight elements in row r of the filters to generate row e of the output activation tensor, the processor may first determine the row hre of the input activation tensor that corresponds to row r of the filters for calculating row e of the output activation tensor by hre=e×Uy+r, where Uy is a vertical stride for the convolution operations, and where e, r, and hre are zero-based indices. Once the row hre of the input activation tensor is determined, the processor 400 may load the row hre into the input activation vector register array. In particular embodiments, the input activation vector register array may be a vector register array dedicated to input activation data. In particular embodiments, the input activation vector register array may be the general-purpose registers 450.
In particular embodiments, the processor 400 may, at step 840, preprocess the loaded row of the input activation tensor in the input activation vector register array to optimize the performance of the convolution. For each column s, where s begins at zero, of the filters, the processor may optimize convolution operations between a filter element at coordinates (r, s) from each of the M filters and valid pixel elements in row hre of the input activation tensor. For the optimized convolution operations, the processor may load P pixel elements starting from pixel element s of row hre of the input activation tensor to the first vector register array, where P=W−S+1, where W is an input tensor width, and S is a filter width. Each pixel element may comprise an input activation vector of size C. The processor may execute the first instruction a number of times with different parameters to load P pixel elements starting from pixel element s of row hre of the input activation tensor to the first vector register array.
In particular embodiments, for the optimized convolution operations between filter elements at coordinates (r, s) from the M filters and valid pixel elements in row hre of the input activation tensor, at step 845, the processor 400 may load a filter element at coordinates (r, s) from each of the M filters to the second vector register array. The processor 400 may load a filter element at coordinates (r, s) of the M filters from the internal memory 440 to the weight vector register array 460. A filter element of each filter may comprise a weight vector of size C. In particular embodiments, the processor 400 may be further operable when executing the second instruction. The second instruction may cause the processor 400 to load a weight vector from the internal memory 440 to a vector register indicated by the second instruction among the second vector register array. A source location of the weight vector in the internal memory 440 may be indicated by the second instruction. Each weight vector in the weight vector array may comprise C weight values. The weight vector array may comprise the filter elements at a given position from each of the M filters. The processor may execute the second instruction a number of times with various parameters to load the filter elements at coordinates (r, s) of the M filters.
In particular embodiments, the processor may calculate a partial output activation vector for each valid pixel element k among the P pixel elements in the first vector register array by executing the third instruction, where the valid pixel element k at an iteration l is determined as k=l*Ux, where Ux is a horizontal stride for the convolution operations, wherein P=W−S+1, and where k and l are zero-based indices.
In particular embodiments, for calculating a partial output activation vector, the processor may (a) perform M*C element-wise multiplications between the pixel element k and the filter element at coordinates (r, s) of the M filters; (b) generate a partial output activation vector having M output channels by summing results of multiplications belonging to a respective filter; and (c) accumulate the partial output activation vector to a corresponding vector register among the third vector register array. To perform the M*C element-wise multiplications between the pixel element k and the filter element at coordinates (r, s) of the M filters, the processor may feed each of the M weight vectors in the second vector register array to a corresponding column of the MAC array. Each of the M weight vectors may be a filter element at coordinates (r, s) from a respective filter among the M filters. Then, the processor may broadcast an input activation vector in the first vector register array to the columns of the MAC array. The input activation vector may correspond to the pixel element k among the W−S+1 pixel elements in the first vector register array. The MUX may select a vector register containing the pixel element k among the first vector register array containing the W−S+1 pixel elements. Finally, the processor may perform a multiplication at each MAC unit in the MAC array between a respective activation value of the pixel element k corresponding to a channel and a respective weight value of the filter element at coordinates (r, s) corresponding to the channel from one of the M filters.
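As a software-level sketch (a behavioral model only, not the MAC-array datapath), the accumulation performed for one filter column s over the valid pixel elements of one input row may be written as follows; the flat array layouts and the num_out bound are assumptions made for illustration.

    #include <stdint.h>

    /* ia_row holds the P = W-S+1 loaded pixel elements of input row h_r^e,
     * each a C-vector; w holds filter element (r, s) of each of the M filters,
     * each a C-vector; acc accumulates M partial output channels for each of
     * num_out output pixels; Ux is the horizontal stride.  Indices are
     * zero-based, matching the text. */
    void accumulate_filter_column(const int8_t *ia_row /* [P][C] */,
                                  const int8_t *w      /* [M][C] */,
                                  int32_t *acc         /* [num_out][M] */,
                                  int P, int C, int M, int Ux, int num_out)
    {
        for (int l = 0; l < num_out; l++) {
            int k = l * Ux;                       /* valid pixel element k     */
            if (k >= P)
                break;                            /* stay within loaded pixels */
            const int8_t *pix = &ia_row[k * C];   /* broadcast pixel element k */
            for (int m = 0; m < M; m++) {         /* one lane per filter       */
                int32_t sum = 0;
                for (int c = 0; c < C; c++)       /* M*C multiplications total */
                    sum += (int32_t)pix[c] * (int32_t)w[m * C + c];
                acc[l * M + m] += sum;            /* accumulate partial output */
            }
        }
    }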
In particular embodiments, the processor 400 may perform quantization operations on the third vector register array using the quantizer 490 based on quantization parameters by executing the fourth instruction. A quantization operation may comprise a non-linear activation operation. Parameters associated with the non-linear activation operation may be included in the quantization parameters stored in the corresponding vector register. The quantization operation may further comprise adding a bias. Parameters associated with the bias may be stored in a corresponding vector register. The third vector register array may comprise 32-bit elements. In particular embodiments, the processor may store row e of the output activation tensor in the third vector register array to the internal memory.
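The following C sketch is a simplified software model of the quantization step: it adds a bias, requantizes the 32-bit accumulators, applies a ReLU or LeakyReLU, and saturates to 8 bits. The per-channel multiplier/shift form of the requantization and the LeakyReLU slope are assumptions for illustration; the actual fixed-point behavior of the quantizer 490 is not specified here.

    #include <stdint.h>

    static int8_t saturate_s8(int64_t v)
    {
        if (v > 127)  return 127;
        if (v < -128) return -128;
        return (int8_t)v;
    }

    /* Quantize one M-element vector of 32-bit accumulators to 8-bit outputs
     * using per-channel bias, multiplier, and shift; apply ReLU or LeakyReLU. */
    void quantize_vector(const int32_t *acc, const int32_t *bias,
                         const int32_t *mult, const int *shift,
                         int8_t *out, int M, int use_leaky_relu)
    {
        for (int m = 0; m < M; m++) {
            int64_t v = (int64_t)acc[m] + bias[m];   /* add bias                    */
            v = (v * mult[m]) >> shift[m];           /* requantize (assumed form)   */
            if (v < 0)                               /* non-linear activation       */
                v = use_leaky_relu ? v / 8 : 0;      /* example slope of 1/8        */
            out[m] = saturate_s8(v);
        }
    }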
Standard convolution may have too many parameters, which may increase the chance of over-fitting. To avoid such scenarios, different approaches, including depth-wise (DW) convolution, have been considered.
To enable vectorization in DW-conv, accumulation may be performed in the spatial dimension instead of the channel dimension. An example equation for an R×S (3×3) convolution, assuming M output channels, may be: Out[0][0][M]=IA[0][0][M]*W[M][0][0]+IA[0][1][M]*W[M][0][1]+IA[0][2][M]*W[M][0][2]+IA[1][0][M]*W[M][1][0]+IA[1][1][M]*W[M][1][1]+IA[1][2][M]*W[M][1][2]+IA[2][0][M]*W[M][2][0]+IA[2][1][M]*W[M][2][1]+IA[2][2][M]*W[M][2][2]. For a 3×3 convolution, 9 MAC units may be used in each lane. Also, maintaining the same layout for the input activation tensor and the output activation tensor for DW-conv as for regular convolutions may be preferable, to avoid cycles spent transforming the layout between regular convolutions and DW convolutions.
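A short C sketch of this spatial accumulation is given below for a single output position; the IA[h][w][channel] and W[channel][r][s] layouts follow the equation above, while unit stride and the absence of padding are assumptions.

    #include <stdint.h>

    /* Depth-wise convolution at output position (0, 0): each channel m is
     * convolved independently, accumulating over the R x S spatial window,
     * exactly as in the 3x3 example equation above. */
    void dw_conv_point(const int8_t *ia /* [H][W][M] */,
                       const int8_t *w  /* [M][R][S] */,
                       int32_t *out     /* [M] */,
                       int W, int M, int R, int S)
    {
        for (int m = 0; m < M; m++) {
            int32_t acc = 0;
            for (int r = 0; r < R; r++)
                for (int s = 0; s < S; s++)
                    acc += (int32_t)ia[(r * W + s) * M + m]
                         * (int32_t)w[(m * R + r) * S + s];
            out[m] = acc;                         /* Out[0][0][m] */
        }
    }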
The mapping of DW convolutions onto the MAC array architecture described above proceeds in phases, as illustrated in the figures.
In Phase 2, the filter is moved along the height dimension by a stride of 1, as shown in the figure.
Accelerating Convolutions with a Single Input Channel
Mapping convolutions with a single-channel input activation tensor onto the MAC array presented above proceeds in a sequence of compute steps, as illustrated in the figure.
In step 2, the filter is moved to the right by its stride (1 in the illustrated example).
These steps are repeated multiple times until all the values of IA0, IA1, and IA2 are exhausted. At this point, a new set of IA values is loaded (possibly in parallel), representing another spatial tile of the input activation tensor, and multiple compute steps are performed. This is repeated until the full problem size of the convolution is completed.
Group-wise convolutions were introduced to reduce computational complexity while dividing features into groups. Each group is expected to extract specific features from the network. These group features may then be combined to form a single, denser feature map. In group-wise convolution, many convolutions are done independently in groups of G1 input channels and G2 output channels, across a larger input tensor of C input channels and M output channels. Typically, G1=G2.
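As a behavioral sketch only (a 1×1 case with assumed flat array layouts, not the hardware mapping discussed next), group-wise convolution at a single pixel can be written as follows.

    #include <stdint.h>

    /* Group-wise 1x1 convolution at one pixel: the C input channels are split
     * into C/G1 groups, and each group of G1 input channels independently
     * produces G2 output channels (typically G1 == G2, as noted above). */
    void group_conv_point(const int8_t *ia /* [C] */,
                          const int8_t *w  /* [C/G1][G2][G1] */,
                          int32_t *out     /* [(C/G1) * G2] */,
                          int C, int G1, int G2)
    {
        int num_groups = C / G1;
        for (int g = 0; g < num_groups; g++)
            for (int m = 0; m < G2; m++) {
                int32_t acc = 0;
                for (int c = 0; c < G1; c++)
                    acc += (int32_t)ia[g * G1 + c]
                         * (int32_t)w[(g * G2 + m) * G1 + c];
                out[g * G2 + m] = acc;            /* output channel g*G2 + m */
            }
    }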
The group-wise convolutions can be mapped to the MAC array illustrated above.
In particular embodiments, a large MAC array may be fractured into smaller MAC arrays which can operate on independent data. All of the MAC sub-arrays may be fed from the same IA register and output to the output vector register array. However, the dataflow may be different from the architecture described above.
The compute utilization of a convolution sub-group of size (8×8) would have been (8/32)*(8/16)=1/8 in the MAC array presented above.
A MAC array could be modified to accommodate multiple group configurations, selectable at runtime.
Accelerating CNN with Sparse Tensors
Many ML computations may comprise matrix multiplications with sparse matrices or matrices comprising zeros. A sparse matrix is a matrix that is composed mostly of zero values. These zeros can be compressed to save memory space and the bandwidth required to fetch from higher levels of memory. Additionally, these zero values may be skipped during a matrix multiplication calculation to accelerate the computation and improve the hardware utilization, defined as effective Multiply-Add operations per cycle. Accelerating the computation may also result in decreased latency and improved throughput.
A non-lossy optimization with a special hardware ISA and microarchitecture is proposed herein to opportunistically accelerate matrix multiplication/CNN computation with tensors having sparsity (zero numerical values) in the weight array. The proposed optimization may be a further enhancement to the fully programmable SIMD VLIW CPUs/DSPs illustrated above.
ML tensors may be optimized to have only N non-zero values out of every M consecutive dense values. Those tensors may be referred to as M:N structured sparse tensors.
In particular embodiments, hardware architectures and methods may be used to skip computations involving zero values by leveraging sparsity in a weight tensor stored in a modified weight vector register array. The input activation tensor in the input activation vector register array may be assumed to be dense. Thus, sparsity or zero-skipping may not be leveraged even if the input activation tensor is sparse. The supported sparse structure may be hardened; that is, a given hardware architecture may be capable of handling one of many possible M:N configurations. In particular embodiments, an algorithm may map an input activation tensor to the weight vector register array 460 and map the weight tensor to the input activation vector register array if the input activation tensor is a structured sparse tensor. Special ISA instructions may be introduced to support the hardware architectures for structured sparse matrix multiplication acceleration.
In the base architecture illustrated above, the ML extensions may be augmented with a sparse weight vector register array 2060 and a sparse MUX array 2010 to support structured sparse weight tensors.
In particular embodiments, the sparse weight vector register array 2060 may be used to store the structured sparse encodings. The number of encoding bits required per MAC-cell or weight register may be N*log2(M) bits. The sparse weight vector register array 2060 may be used to feed the control lines of the sparse MUX array 2010 to provide pointers to the input activation elements that need to be routed to each MAC unit in the MAC array 480.
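As an illustrative software model only (not the register-transfer behavior of the sparse MUX array 2010), the following C sketch shows an M:N = 4:2 structured sparse dot product: only the two non-zero weights of every group of four are stored, together with their positions, and the position lookup plays the role of the MUX routing the matching input activation to each MAC unit.

    #include <stdint.h>

    /* M:N = 4:2 structured sparse dot product: of every 4 consecutive dense
     * weights, only 2 non-zero values are kept in w_vals, and w_idx stores
     * each kept value's position within its group of 4 (log2(4) = 2 bits of
     * encoding per kept weight).  The zero weights are never multiplied. */
    int32_t sparse_dot_4to2(const int8_t *ia,      /* dense activations, length len */
                            const int8_t *w_vals,  /* kept weights, length len/2    */
                            const uint8_t *w_idx,  /* positions 0-3, length len/2   */
                            int len)               /* dense length, multiple of 4   */
    {
        int32_t acc = 0;
        for (int g = 0; g < len / 4; g++)          /* one group of 4 at a time      */
            for (int n = 0; n < 2; n++) {          /* only the 2 kept weights       */
                int j = g * 2 + n;
                acc += (int32_t)ia[g * 4 + w_idx[j]] * (int32_t)w_vals[j];
            }
        return acc;
    }

With this 4:2 structure, half of the dense multiplications are skipped, consistent with the M/N speedup noted below.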
In particular embodiments, the sparse MUX array 2010 may be an array of MUXs to route a corresponding input activation scalar value to each MAC unit. The number of MUXs may be equal to the number of MAC units.
In particular embodiments, selecting between the various sparse modes and the regular dense modes may be accomplished by setting global state registers in the processor ISA. The sparse weight states and the logic in the sparse MUX array 2010 may be bypassed in the regular dense modes.
These optimized MAC array extensions for structured sparsity may be incorporated into a VLIW SIMD DSP/CPU as a specialized ISA. With the aid of these optimizations, sparse matrices can be accelerated by a factor of M/N, in addition to the benefits of memory and bandwidth savings due to structured sparse compression.
This disclosure contemplates any suitable number of computer systems 2200. This disclosure contemplates computer system 2200 taking any suitable physical form. As an example and not by way of limitation, computer system 2200 may be an embedded computer system, a system-on-chip (SOC), a single-board computer system (SBC) (such as, for example, a computer-on-module (COM) or system-on-module (SOM)), a desktop computer system, a laptop or notebook computer system, an interactive kiosk, a mainframe, a mesh of computer systems, a mobile telephone, a personal digital assistant (PDA), a server, a tablet computer system, or a combination of two or more of these. Where appropriate, computer system 2200 may include one or more computer systems 2200; be unitary or distributed; span multiple locations; span multiple machines; span multiple data centers; or reside in a cloud, which may include one or more cloud components in one or more networks. Where appropriate, one or more computer systems 2200 may perform without substantial spatial or temporal limitation one or more steps of one or more methods described or illustrated herein. As an example and not by way of limitation, one or more computer systems 2200 may perform in real time or in batch mode one or more steps of one or more methods described or illustrated herein. One or more computer systems 2200 may perform at different times or at different locations one or more steps of one or more methods described or illustrated herein, where appropriate.
In particular embodiments, computer system 2200 includes a processor 2202, memory 2204, storage 2206, an input/output (I/O) interface 2208, a communication interface 2210, and a bus 2212. Although this disclosure describes and illustrates a particular computer system having a particular number of particular components in a particular arrangement, this disclosure contemplates any suitable computer system having any suitable number of any suitable components in any suitable arrangement.
In particular embodiments, processor 2202 includes hardware for executing instructions, such as those making up a computer program. As an example and not by way of limitation, to execute instructions, processor 2202 may retrieve (or fetch) the instructions from an internal register, an internal cache, memory 2204, or storage 2206; decode and execute them; and then write one or more results to an internal register, an internal cache, memory 2204, or storage 2206. In particular embodiments, processor 2202 may include one or more internal caches for data, instructions, or addresses. This disclosure contemplates processor 2202 including any suitable number of any suitable internal caches, where appropriate. As an example and not by way of limitation, processor 2202 may include one or more instruction caches, one or more data caches, and one or more translation lookaside buffers (TLBs). Instructions in the instruction caches may be copies of instructions in memory 2204 or storage 2206, and the instruction caches may speed up retrieval of those instructions by processor 2202. Data in the data caches may be copies of data in memory 2204 or storage 2206 for instructions executing at processor 2202 to operate on; the results of previous instructions executed at processor 2202 for access by subsequent instructions executing at processor 2202 or for writing to memory 2204 or storage 2206; or other suitable data. The data caches may speed up read or write operations by processor 2202. The TLBs may speed up virtual-address translation for processor 2202. In particular embodiments, processor 2202 may include one or more internal registers for data, instructions, or addresses. This disclosure contemplates processor 2202 including any suitable number of any suitable internal registers, where appropriate. Where appropriate, processor 2202 may include one or more arithmetic logic units (ALUs); be a multi-core processor; or include one or more processors 2202. Although this disclosure describes and illustrates a particular processor, this disclosure contemplates any suitable processor.
In particular embodiments, memory 2204 includes main memory for storing instructions for processor 2202 to execute or data for processor 2202 to operate on. As an example and not by way of limitation, computer system 2200 may load instructions from storage 2206 or another source (such as, for example, another computer system 2200) to memory 2204. Processor 2202 may then load the instructions from memory 2204 to an internal register or internal cache.
To execute the instructions, processor 2202 may retrieve the instructions from the internal register or internal cache and decode them. During or after execution of the instructions, processor 2202 may write one or more results (which may be intermediate or final results) to the internal register or internal cache. Processor 2202 may then write one or more of those results to memory 2204. In particular embodiments, processor 2202 executes only instructions in one or more internal registers or internal caches or in memory 2204 (as opposed to storage 2206 or elsewhere) and operates only on data in one or more internal registers or internal caches or in memory 2204 (as opposed to storage 2206 or elsewhere). One or more memory buses (which may each include an address bus and a data bus) may couple processor 2202 to memory 2204. Bus 2212 may include one or more memory buses, as described below. In particular embodiments, one or more memory management units (MMUs) reside between processor 2202 and memory 2204 and facilitate accesses to memory 2204 requested by processor 2202. In particular embodiments, memory 2204 includes random access memory (RAM). This RAM may be volatile memory, where appropriate. Where appropriate, this RAM may be dynamic RAM (DRAM) or static RAM (SRAM). Moreover, where appropriate, this RAM may be single-ported or multi-ported RAM. This disclosure contemplates any suitable RAM. Memory 2204 may include one or more memories 2204, where appropriate. Although this disclosure describes and illustrates particular memory, this disclosure contemplates any suitable memory.
In particular embodiments, storage 2206 includes mass storage for data or instructions. As an example and not by way of limitation, storage 2206 may include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disc, a magneto-optical disc, magnetic tape, or a Universal Serial Bus (USB) drive or a combination of two or more of these. Storage 2206 may include removable or non-removable (or fixed) media, where appropriate. Storage 2206 may be internal or external to computer system 2200, where appropriate. In particular embodiments, storage 2206 is non-volatile, solid-state memory. In particular embodiments, storage 2206 includes read-only memory (ROM). Where appropriate, this ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), or flash memory or a combination of two or more of these. This disclosure contemplates mass storage 2206 taking any suitable physical form. Storage 2206 may include one or more storage control units facilitating communication between processor 2202 and storage 2206, where appropriate. Where appropriate, storage 2206 may include one or more storages 2206. Although this disclosure describes and illustrates particular storage, this disclosure contemplates any suitable storage.
In particular embodiments, I/O interface 2208 includes hardware, software, or both, providing one or more interfaces for communication between computer system 2200 and one or more I/O devices. Computer system 2200 may include one or more of these I/O devices, where appropriate. One or more of these I/O devices may enable communication between a person and computer system 2200. As an example and not by way of limitation, an I/O device may include a keyboard, keypad, microphone, monitor, mouse, printer, scanner, speaker, still camera, stylus, tablet, touch screen, trackball, video camera, another suitable I/O device or a combination of two or more of these. An I/O device may include one or more sensors. This disclosure contemplates any suitable I/O devices and any suitable I/O interfaces 2208 for them. Where appropriate, I/O interface 2208 may include one or more device or software drivers enabling processor 2202 to drive one or more of these I/O devices. I/O interface 2208 may include one or more I/O interfaces 2208, where appropriate. Although this disclosure describes and illustrates a particular I/O interface, this disclosure contemplates any suitable I/O interface.
In particular embodiments, communication interface 2210 includes hardware, software, or both providing one or more interfaces for communication (such as, for example, packet-based communication) between computer system 2200 and one or more other computer systems 2200 or one or more networks. As an example and not by way of limitation, communication interface 2210 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as a WI-FI network. This disclosure contemplates any suitable network and any suitable communication interface 2210 for it. As an example and not by way of limitation, computer system 2200 may communicate with an ad hoc network, a personal area network (PAN), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), or one or more portions of the Internet or a combination of two or more of these. One or more portions of one or more of these networks may be wired or wireless. As an example, computer system 2200 may communicate with a wireless PAN (WPAN) (such as, for example, a BLUETOOTH WPAN), a WI-FI network, a WI-MAX network, a cellular telephone network (such as, for example, a Global System for Mobile Communications (GSM) network), or other suitable wireless network or a combination of two or more of these. Computer system 2200 may include any suitable communication interface 2210 for any of these networks, where appropriate. Communication interface 2210 may include one or more communication interfaces 2210, where appropriate. Although this disclosure describes and illustrates a particular communication interface, this disclosure contemplates any suitable communication interface.
In particular embodiments, bus 2212 includes hardware, software, or both coupling components of computer system 2200 to each other. As an example and not by way of limitation, bus 2212 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a front-side bus (FSB), a HYPERTRANSPORT (HT) interconnect, an Industry Standard Architecture (ISA) bus, an INFINIBAND interconnect, a low-pin-count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a serial advanced technology attachment (SATA) bus, a Video Electronics Standards Association local (VLB) bus, or another suitable bus or a combination of two or more of these. Bus 2212 may include one or more buses 2212, where appropriate. Although this disclosure describes and illustrates a particular bus, this disclosure contemplates any suitable bus or interconnect.
Herein, a computer-readable non-transitory storage medium or media may include one or more semiconductor-based or other integrated circuits (ICs) (such as, for example, field-programmable gate arrays (FPGAs) or application-specific ICs (ASICs)), hard disk drives (HDDs), hybrid hard drives (HHDs), optical discs, optical disc drives (ODDs), magneto-optical discs, magneto-optical drives, floppy diskettes, floppy disk drives (FDDs), magnetic tapes, solid-state drives (SSDs), RAM-drives, SECURE DIGITAL cards or drives, any other suitable computer-readable non-transitory storage media, or any suitable combination of two or more of these, where appropriate. A computer-readable non-transitory storage medium may be volatile, non-volatile, or a combination of volatile and non-volatile, where appropriate.
Herein, “or” is inclusive and not exclusive, unless expressly indicated otherwise or indicated otherwise by context. Therefore, herein, “A or B” means “A, B, or both,” unless expressly indicated otherwise or indicated otherwise by context. Moreover, “and” is both joint and several, unless expressly indicated otherwise or indicated otherwise by context. Therefore, herein, “A and B” means “A and B, jointly or severally,” unless expressly indicated otherwise or indicated otherwise by context.
The scope of this disclosure encompasses all changes, substitutions, variations, alterations, and modifications to the example embodiments described or illustrated herein that a person having ordinary skill in the art would comprehend. The scope of this disclosure is not limited to the example embodiments described or illustrated herein. Moreover, although this disclosure describes and illustrates respective embodiments herein as including particular components, elements, features, functions, operations, or steps, any of these embodiments may include any combination or permutation of any of the components, elements, features, functions, operations, or steps described or illustrated anywhere herein that a person having ordinary skill in the art would comprehend. Furthermore, reference in the appended claims to an apparatus or system or a component of an apparatus or system being adapted to, arranged to, capable of, configured to, enabled to, operable to, or operative to perform a particular function encompasses that apparatus, system, component, whether or not it or that particular function is activated, turned on, or unlocked, as long as that apparatus, system, or component is so adapted, arranged, capable, configured, enabled, operable, or operative. Additionally, although this disclosure describes or illustrates particular embodiments as providing particular advantages, particular embodiments may provide none, some, or all of these advantages.
This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Patent Application No. 63/477,527, filed 28 Dec. 2022, which is incorporated herein by reference.