Embodiments generally relate to multiply-accumulate (MAC) processors. More particularly, embodiments relate to area and energy efficient multi-precision MAC (“MultiMAC”) unit-based processors.
Deep neural networks (DNNs) may be useful for a host of applications in the domains of computer vision, speech recognition, and image and video processing, primarily due to the ability of DNNs to achieve high levels of accuracy relative to human-based computations. The improvements in accuracy, however, may come at the expense of significant computational cost. For example, the underlying deep neural networks typically have extremely high computing demands, as each test input involves on the order of hundreds of millions of MAC operations as well as hundreds of millions of filter weights to be processed for classification or detection.
As a result, high-end graphics processing units (GPUs) may be suitable to execute these types of workloads because GPUs typically contain thousands of parallel MAC units that can operate simultaneously and produce the output in much less time. GPUs, however, may have very high power consumption that makes them unsuitable for deployment in energy-constrained mobile/embedded systems where energy and area budgets are extremely limited.
The various advantages of the embodiments will become apparent to one skilled in the art by reading the following specification and appended claims, and by referencing the following drawings, in which:
The process of quantization can be effective in making relatively large DNN models compact enough to be deployed on area and energy constrained mobile and other edge devices. Quantization reduces the precision of weights, feature maps, and intermediate gradients from baseline sixteen-bit floating point/Brain floating point (FP16/BF16) to eight/four/two/one-bit integer (INT8/4/2/1). Not only does this approach reduce storage requirements by 2×/4×/8×/16×, but the approach also reduces computation complexity by a similar degree, which results in a proportional improvement in throughput. As a result, some of the most advanced state-of-the-art DNN accelerators are built with the ability to perform MAC operations of multiple precisions (INT8/4/2/1). A multi-precision MAC can therefore significantly improve two measurable metrics for DNN accelerators: i) performance per unit area, measured using the TOPS/mm2 (tera (10^12) operations per second per mm2) metric, and ii) performance per unit energy, measured using the TOPS/W (tera (10^12) operations per second per Watt) metric.
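For illustration only, the storage reduction can be sketched in a few lines of software (the function names and the symmetric quantization scheme here are illustrative assumptions, not part of the embodiments): two INT4 weights pack into each byte, versus two bytes per FP16 weight, which yields the 4× figure noted above.

```python
import numpy as np

def quantize_to_int4(weights_fp):
    """Symmetric linear quantization of float weights to INT4 [-8, 7]."""
    scale = np.abs(weights_fp).max() / 7.0
    q = np.clip(np.round(weights_fp / scale), -8, 7).astype(np.int8)
    return q, scale

def pack_int4_pairs(q):
    """Pack two signed 4-bit values per byte (low nibble first)."""
    u = q.astype(np.uint8) & 0xF            # two's complement nibbles
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

w = np.random.randn(1024).astype(np.float16)
q, scale = quantize_to_int4(w)
packed = pack_int4_pairs(q)
print(w.nbytes, packed.nbytes)              # 2048 -> 512 bytes (4x smaller)
```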
Embodiments provide for a DNN processing engine (PE) that can support MAC operations of different precisions (INT8/4/2/1 to FP16/BF16) while using a common low overhead sparsity acceleration logic. The primary contributions presented by embodiments relate to the ways that data is arranged and fed to the multi-precision MAC unit based on sparsity. Towards that end, “MultiMAC” is an area-efficient, multi-precision multiply and accumulate unit-based processing element in DNN accelerators, where embodiments intelligently share circuit elements among various precision circuits to reduce the area overhead and the energy consumption of the multi-precision MAC unit. The sharing of the circuit elements is enabled by using a data flow that allows input channel-based accumulation (e.g., common in most tensor operations in a DNN accelerator).
As will be discussed in greater detail, the processor 40 provides an area-efficient, multi-precision multiply and accumulate unit-based processing element for DNN accelerators, where circuit elements are shared among various precision circuits to reduce the area overhead and the energy consumption of the multi-precision MAC processor 40. For example, only four 5-bit (5b) multipliers 46 are sufficient to support eight different precision modes for MAC operations such as INT8, UINT8, INT4, UINT4, U4_I4, I4_U4, INT2, and INT1. Here, INT1 is effectively binary (BIN) mode with values −1 and 1 represented by 0 and 1. Integration of the processor 40 enables a dense TOPS/mm2 of an accelerator to be increased by almost 2×, 4×, and 8×, respectively, when executing quantized inferences in INT4/2/1 precision modes.
Additionally, by recoding activations and weights in groups of 4-bits, a binarized convolution is realized using INT2 hardware (signed 5b multipliers in this case). By contrast, other approaches typically require separate hardware and data paths to support both binarized and integer convolutions in a single MAC. Embodiments therefore reduce the area of multi-precision MACs that must support both integer and binarized convolutions. Integration of the processor 40 also enables a dense TOPS/W of an accelerator to be increased by 1.87×, 3.75×, and 7.5×, respectively, when running quantized inferences in INT4/2/1 precision modes.
Due to the innovative way of sharing logic for different precisions, the processor 40 may reduce area by 32% at 1.8 GHz compared to existing multi-precision MAC designs. Indeed, the processor 40 also works seamlessly, without any additional overhead, in coordination with find-first sparsity acceleration techniques via block sparsity. Using this strategy, TOPS/W improvements of 1.06×, 1.19×, and 1.44× and TOPS/mm2 improvements of 1.08×, 1.15×, and 1.31× are obtained, respectively, over the baseline case (INT8), where the sparsity logic complexity varies proportionally to the number of operands. For 50% sparsity, the proposed accelerator achieves 1.88×, 3.75×, 7.5×, and 15× higher TOPS/W and 1.95×, 3.98×, 7.97×, and 15.93× higher TOPS/mm2 for INT8/4/2/1 modes compared to accelerators without any MultiMAC support.
Embodiments implement a compute near memory microarchitecture where each PE includes one or more of the enhanced MAC units 52 along with local memory or register files for storing the activations (IF RF) and the filter weights (FL RF). The output activations are stored within the OF RF. In this particular mode of operation, the IF and FL RFs are arranged sequentially in the IC (input channel) dimension so that the MAC unit within each PE can be fed with these values one after another, where the values are then multiplied, accumulated over time, and stored in the OF RF. In this mode of operation, PEs can accumulate over ICs every round, which enables the current MultiMAC implementation where each INT8/byte operand within the RFs can be assumed to be either a single IC or multiple ICs of bitwidth 4, 2, or 1. For IC bitwidths 1, 2, 4, and 8, each byte represents 8, 4, 2, and 1 ICs, respectively. For the sake of simplicity, an IC is consistently split into multiple smaller precision ICs so that it can be accumulated with the help of the MultiMAC. At lower precision modes, the enhanced MAC unit 52 enables the accumulation of multiple ICs (1/2/4/8) in a single clock period, as illustrated in the sketch below. Since the enhanced MAC unit 52 is designed to operate at the same high frequency as that of the single precision MAC unit 50, the accumulation of multiple ICs in a single clock cycle leads to significantly higher throughput (TOPS). Note that this fixed way of grouping or concatenating lower precision ICs into a single operand involves the fewest changes or additions in the load, compute, and drain of a typical DNN accelerator to support multi-precision convolutions.
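As a purely functional illustration of this packing convention (a sketch under the stated assumptions; the names and per-step framing are illustrative, not the PE's actual circuits), the following shows how one activation byte and one weight byte contribute 1, 2, 4, or 8 IC products to the accumulator in a single step, including the BIN convention in which 0 encodes −1 and 1 encodes +1:

```python
def split_byte(byte, bits):
    """Split an unsigned byte into 8//bits fields of the given width."""
    mask = (1 << bits) - 1
    return [(byte >> (i * bits)) & mask for i in range(8 // bits)]

def to_signed(v, bits):
    """Interpret a small unsigned field as two's complement."""
    return v - (1 << bits) if v >= (1 << (bits - 1)) else v

def multimac_byte(act_byte, wt_byte, bits, acc=0):
    """Accumulate every IC packed in one activation byte and one weight byte."""
    for a, w in zip(split_byte(act_byte, bits), split_byte(wt_byte, bits)):
        if bits == 1:                     # BIN mode: 0 -> -1, 1 -> +1
            acc += (2 * a - 1) * (2 * w - 1)
        else:
            acc += to_signed(a, bits) * to_signed(w, bits)
    return acc

# One INT4-mode step accumulates two ICs at once:
print(multimac_byte(0x7E, 0x12, bits=4))  # (-2 * 2) + (7 * 1) = 3
```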
The application of MultiMAC is not limited to the domain of machine learning accelerators. Multi-precision multipliers are also useful for many GPU-specific as well as various image and video processing hardware accelerators. To be more specific, any application that uses multiply-accumulate operations and can tolerate quantization errors such as most applications in the domains of multimedia processing (graphics, video, image, audio, speech, etc.) may benefit from the MultiMAC technology described herein.
By contrast, the enhanced data path 72 includes a multi-precision MAC unit with no explicit 8b multiplier. Instead, the 8b mode is implemented using four separate 4b quadrant multipliers 71. Furthermore, rather than completing the product computations per channel, sub-products are summed across channels to obtain the same result with lower area. This optimization takes advantage of the fact that intermediate results may be disregarded, with only the final dot-product being relevant. The enhanced data path 72 also ensures minimal reconfiguration multiplexers in the final adder tree. As a result of this dot-product structure, 4b and 2b dot-products are shifted left by 4b. Instead of correcting that shift by reconfiguring the carry-save tree, a non-critical accumulator input is shifted left by 4b and the final result is shifted right by 4b in modes lower than 8b. This approach causes the adder-tree width to grow by 4b, but lower area is obtained overall. Per-channel computations are completed before summation across channels in the illustrated enhanced data path 72.
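The quadrant decomposition may be pictured functionally as follows (an illustrative sketch, not the gate-level data path 72): an 8b × 8b product is the rank-weighted sum of four 4b × 4b sub-products, which is why per-channel sub-products can be merged in a shared summation tree before any complete per-channel product is formed.

```python
def mul8_from_quadrants(a, b):
    """Unsigned 8b multiply built from four 4b quadrant multipliers."""
    a_lo, a_hi = a & 0xF, a >> 4
    b_lo, b_hi = b & 0xF, b >> 4
    ll = a_lo * b_lo              # rank 0
    lh = a_lo * b_hi              # rank 1 (weighted by 2**4)
    hl = a_hi * b_lo              # rank 1 (weighted by 2**4)
    hh = a_hi * b_hi              # rank 2 (weighted by 2**8)
    # Sub-products, possibly from different channels, can be summed in a
    # shared tree; only the final dot-product needs to be correct.
    return ll + ((lh + hl) << 4) + (hh << 8)

assert all(mul8_from_quadrants(a, b) == a * b
           for a in range(256) for b in range(0, 256, 7))
```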
The conventional convolution architecture 92 shows an example circuit to perform the MAC operation for a byte of activations and a byte of weights, where each byte contains eight binary activations/weights. First, the activations and weights are XOR-ed together and the number of ones in the result is counted. The count of ones in the XOR-ed result is used to generate an index into a LUT (lookup table) and the LUT returns the sum of the binarized products in the range {−8, −6, −4, −2, 0, 2, 4, 6, 8}. The LUT output is added to the accumulator to produce the MAC result. This approach requires a separate binary datapath and a separate integer datapath in the MAC.
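Functionally, the conventional architecture 92 may be modeled as follows (an illustrative software sketch; the LUT contents follow directly from counting mismatches, since a popcount of p mismatches among eight ±1 products yields a dot-product of 8 − 2p):

```python
# popcount p of (act XOR wt) = number of -1 products among eight +/-1
# products, so the byte's dot-product is (8 - p) - p = 8 - 2p.
LUT = [8 - 2 * p for p in range(9)]    # {8, 6, 4, 2, 0, -2, -4, -6, -8}

def bin_mac(act_byte, wt_byte, acc=0):
    """Binarized MAC for one byte pair: XOR, popcount, LUT, accumulate."""
    p = bin(act_byte ^ wt_byte).count("1")
    return acc + LUT[p]

print(bin_mac(0b10110010, 0b10110010))  # identical bytes -> +8
print(bin_mac(0b10110010, 0b01001101))  # fully mismatched -> -8
```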
Thus, the architecture 130 and/or the architecture 120 may decode subsets of weights and activations as a multiplier pre-process operation and add the multiplier outputs in rank order across a plurality of arithmetic blocks.
If the product is negative (e.g., one but not both inputs are negative), the Booth encoding for every partial product is inverted (along with the 1's-to-2's complement bit). In addition, the multiplicand is added to the tree reduction, shifted two bits to the left of the highest Booth R4 coded partial product, but only if the most significant bit of the multiplier operand is “1” (e.g., the operand value is negative). This approach is taken because an unsigned multiplier (which is negated or not by the signed magnitude signs) is used via a mixed radix extension. All of the partial products of the multiplier are in Booth R4 format, but the uppermost partial product is in radix-2 (R2) format. Another way of looking at this condition is that the uppermost partial product is coded as a subset of Booth R4, where only the least significant bit (lsb) of the uppermost tri-bit (which is the most significant bit/msb of the penultimate tri-bit) is considered. A different way of explaining this approach is that if the msb of the multiplier is “1”, then the next Booth tri-bit would be “001” (+1× multiplicand), and if the msb were “0”, then the tri-bit would be “000” (0× multiplicand).
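For reference, Booth radix-4 recoding with the unsigned-operand correction described above may be modeled as follows (an illustrative functional sketch, not the architecture's mixed-radix circuit):

```python
BOOTH_R4 = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
            0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}

def booth_r4_digits(multiplier, bits):
    """Yield Booth R4 digits from overlapping tri-bits, lsb first."""
    x = multiplier << 1                   # implicit 0 below the lsb
    for i in range(0, bits, 2):
        yield BOOTH_R4[(x >> i) & 0b111]

def booth_r4_mul(a, b, bits=8):
    """Unsigned multiply via rank-shifted Booth R4 partial products."""
    total = 0
    for rank, digit in enumerate(booth_r4_digits(b, bits)):
        total += (digit * a) << (2 * rank)
    # Mixed-radix correction for unsigned operands: if the multiplier's msb
    # is 1, add the multiplicand two bits above the highest R4 partial
    # product (the uppermost tri-bit acting as "001", i.e. +1x multiplicand).
    if (b >> (bits - 1)) & 1:
        total += a << bits
    return total

assert all(booth_r4_mul(a, b) == a * b
           for a in range(0, 256, 3) for b in range(256))
```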
Thus, the architecture 150 and/or the architecture 140 may invert individual partial products to operate one or more multipliers as a signed magnitude multiplier and add a single mixed radix partial product, wherein a final partial product of a lower radix operates as a subset of possibilities of a higher radix.
In addition, the architecture 150 may, for a group of multipliers, invert individual partial products to operate one or more multipliers as a signed magnitude multiplier and sum ranks of partial products. In an embodiment, the architecture 150 sums a group of partial products in a different radix separately from the ranks of partial products. In an embodiment, the group of multipliers provides unsigned multiplication. The group of multipliers may also be in a signed magnitude format.
In an embodiment, the architecture 180 calculates, via multipliers, signed magnitude values in a first precision and a second precision. Additionally, the architecture 180 may calculate/determine a first set of additional partial products in the first precision and calculate/determine a second set of additional partial products in the second precision.
In an embodiment, the architecture 190 sorts individual exponents of floating point representations to identify a largest exponent. Additionally, the architecture 190 denormalizes multiplier products to the largest exponent and sums the denormalized multiplier products to obtain a product sum. In one example, the architecture 190 normalizes the product sum to a single floating point value.
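A software stand-in for this sort/denormalize/sum flow might look as follows (the fixed-point mantissa-and-exponent encoding here is an illustrative assumption, not the FP16/BF16 field layout):

```python
def align_and_sum(products):
    """Sum (mantissa, exponent) pairs by denormalizing to the max exponent."""
    max_e = max(e for _, e in products)     # sort/compare the exponents
    acc = 0
    for m, e in products:
        acc += m >> (max_e - e)             # denormalize: shift right
    return acc, max_e                       # normalize once, at the end

# Values below represent m * 2**(e - 7), i.e. 10.0, 3.0 and 4.5:
prods = [(0b1010_0000, 3), (0b1100_0000, 1), (0b1001_0000, 2)]
total, exp = align_and_sum(prods)
print(total * 2.0 ** (exp - 7))             # 17.5
```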
Sometimes, this approach will not be accurate enough. Thus, the sum at each block can be normalized (which is a larger circuit than not normalizing), but the exception handling and rounding may be bypassed (e.g., only the exception and rounding information is forwarded). This approach may reduce the cost of each floating point adder by 10%-20% over a typical solution (until the final block, which uses a full FP adder).
In an embodiment, the architecture 200 denormalizes, at each subsequent arithmetic block, a smaller of two values to a larger value. Additionally, the architecture 200 may sum the two values.
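A corresponding sketch of the cascaded variant (again with an illustrative mantissa/exponent encoding) denormalizes the smaller addend at each stage and defers full normalization:

```python
def cascade_add(values):
    """Fold (mantissa, exponent) pairs left to right, denormalizing the
    smaller addend to the larger one's exponent at each stage."""
    acc_m, acc_e = values[0]
    for m, e in values[1:]:
        if e >= acc_e:
            acc_m, acc_e = (acc_m >> (e - acc_e)) + m, e
        else:
            acc_m += m >> (acc_e - e)
    return acc_m, acc_e

# Same operands as above, summed pairwise instead of against a global max:
print(cascade_add([(0b1100_0000, 1), (0b1001_0000, 2), (0b1010_0000, 3)]))
```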
Sparsity logic 212 (e.g., find-first sparsity logic) works with compressed data (e.g., zero-value compressed). In a compressed mode, the zero and non-zero positions in the activation and weight data are each represented by a bit in a bitmap. The non-zero values are compressed and kept adjacent to one another in an IF RF 214. In the single precision MAC, each byte represents one activation or filter point and is represented by one bit in the bitmap. The same logic can be kept intact and easily applied to MultiMAC by introducing the concept of block sparsity, where each bit in the bitmap can represent 1, 2, 4, or 8 ICs based on whether UINT8/INT8, UINT4/INT4, UINT2/INT2, or binary mode (BIN), respectively, is active. Only when all ICs in the byte (e.g., the entire byte) are 0 will a 0 be placed in the bitmap (e.g., otherwise the value will be a 1). This coarse-granular approach to maintaining sparsity information for lower precision modes may have pros and cons. For example, one advantage is that the same sparsity encoder that operates at a byte-level may be used, which decreases the overall impact on DNN accelerator area and energy. Another advantage is that the storage and processing overhead of the bitmap for each IC is also reduced at lower precisions. A downside of block sparsity, however, may be that it keeps track of sparsity at a much coarser granularity and therefore reduces the maximum potential speedup that can be achieved through fine-granular tracking.
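Block sparsity may be illustrated with a small zero-value-compression sketch (the function and variable names are illustrative, not the accelerator's interfaces); note how, in INT2 mode, one bitmap bit summarizes four ICs at once, which is exactly the coarse granularity discussed above:

```python
def zvc_compress(byte_stream):
    """Zero-value compression: a bitmap of non-zero bytes plus the
    non-zero values kept adjacent to one another."""
    bitmap = [1 if b != 0 else 0 for b in byte_stream]
    values = [b for b in byte_stream if b != 0]
    return bitmap, values

# INT2 mode: each byte packs 4 ICs, so one bitmap bit covers 4 channels,
# and a byte is marked 0 only when all four of its ICs are zero.
activation_bytes = [0x00, 0x1B, 0x00, 0x04]
bitmap, values = zvc_compress(activation_bytes)
print(bitmap, [hex(v) for v in values])   # [0, 1, 0, 1] ['0x1b', '0x4']
```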
In addition to the integer-based MultiMAC, support may be provided for floating point execution within the PE. Although this support may involve a completely separate floating point MAC (FPMAC) that is separate from, and not shared with, the MultiMAC, the existing sparsity logic may be readily used for floating point execution.
For example, computer program code to carry out operations shown in the method 240 may be written in any combination of one or more programming languages, including an object oriented programming language such as JAVA, SMALLTALK, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. Additionally, logic instructions might include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, state-setting data, configuration data for integrated circuitry, state information that personalizes electronic circuitry and/or other structural components that are native to hardware (e.g., host processor, central processing unit/CPU, microcontroller, etc.).
Illustrated processing block 242 provides one or more substrates such as, for example, silicon, sapphire, gallium arsenide, etc. Processing block 244 couples logic (e.g., transistor array and other integrated circuit/IC components) to the substrate(s). In the illustrated example, the logic is implemented at least partly in one or more of configurable or fixed-functionality hardware. Moreover, the logic includes a plurality of arithmetic blocks (e.g., ALUs), wherein the plurality of arithmetic blocks share a single multiplier size that is uniform across the plurality of arithmetic blocks. Additionally, the single multiplier size is less than the maximum precision size supported by the plurality of arithmetic blocks. In an embodiment, the maximum precision size is eight bits and the single multiplier size is five bits. In one example, block 244 includes arranging sparsity information for activations and weights in accordance with a bitmap format that is common to multiple precisions. The method 240 therefore enhances performance at least to the extent that the single multiplier size renders the MAC processor more area and/or energy efficient.
Illustrated processing block 252 compresses same ranks of partial products before combining the same ranks with sums of partial products of other ranks, wherein block 254 adds ones and twos complement bits independently before the ones and twos complement bits are summed with combinations of partial products. In an embodiment, block 256 adds partial products within a plurality of arithmetic blocks in rank order.
Illustrated processing block 262 decodes subsets of weights and activations as a multiplier pre-process operation. Additionally, block 264 adds multiplier outputs in rank order across a plurality of arithmetic blocks.
Illustrated processing block 272 inverts individual partial products to operate one or more multipliers as a signed magnitude multiplier. Additionally, block 274 may add a single mixed radix partial product, wherein a final partial product of a lower radix operates as a subset of possibilities of a higher radix.
Illustrated processing block 282 inverts individual partial products to operate one or more multipliers as a signed magnitude multiplier, wherein block 284 sums ranks of partial products. Additionally, block 286 sums a group of partial products in a different radix separately from the ranks of partial products. In an embodiment, the group of multipliers provides unsigned multiplication. The group of multipliers may also be in a signed magnitude format.
Illustrated processing block 292 zeroes out a top portion of partial products, wherein block 294 zeroes out a bottom portion of partial products. In one example, block 296 compresses ranks of each set of original partial products independently. Block 298 may shift groups of ranks into an alignment of a smaller precision.
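One way to picture blocks 292-298 in software (a loose functional analogy, not the actual partial-product array) is a wide multiply whose cross-term partial products are zeroed so that the remaining bottom and top groups of ranks form two independent smaller-precision products:

```python
def dual_4b_products(a, b):
    """Two independent 4b x 4b products from one 8b-style PP array."""
    # Zero the lh/hl cross terms; keep only the "bottom" (lo*lo) and
    # "top" (hi*hi) groups of partial-product ranks.
    bottom = (a & 0xF) * (b & 0xF)       # ranks aligned to bit 0
    top = ((a >> 4) * (b >> 4)) << 8     # ranks aligned to bit 8
    combined = bottom + top              # each group compressed independently
    # Shift the top group of ranks back into 4b-precision alignment:
    return combined & 0xFF, combined >> 8

a = (5 << 4) | 3    # packs two 4b operands: hi=5, lo=3
b = (2 << 4) | 7    # packs two 4b operands: hi=2, lo=7
print(dual_4b_products(a, b))            # (21, 10) = (3*7, 5*2)
```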
Illustrated processing block 302 calculates, via multipliers, signed magnitude values in a first precision and a second precision. Additionally, block 304 calculates/determines a first set of additional partial products in the first precision, wherein block 306 calculates/determines a second set of additional partial products in the second precision.
Illustrated processing block 312 sorts individual exponents of floating point representations to identify a largest exponent. Additionally, block 314 denormalizes multiplier products to the largest exponent, wherein block 316 sums the denormalized multiplier products to obtain a product sum. In an embodiment, block 318 normalizes the product sum to a single floating point value.
Illustrated processing block 322 denormalizes, at each subsequent arithmetic block, a smaller of two values to a larger value. Additionally, block 324 sums the two values.
Illustrated processing block 303 provides one or more substrates such as, for example, silicon, sapphire, gallium arsenide, etc. Processing block 305 couples logic (e.g., transistor array and other integrated circuit/IC components) to the substrate(s). In the illustrated example, the logic is implemented at least partly in one or more of configurable or fixed-functionality hardware. Moreover, the logic includes a plurality of arithmetic blocks (e.g., ALUs), wherein the plurality of arithmetic blocks each contain multiple multipliers. Additionally, one or more intermediate multipliers may be of a size that is less than the precisions supported by the arithmetic blocks containing the one or more intermediate multipliers.
Illustrated processing block 313 combines multipliers one or more of 1) within each arithmetic block, or 2) across multiple arithmetic blocks. Additionally, block 315 may map one or more smaller multipliers to partial products of the intermediate multiplier(s), wherein the smaller multiplier(s) are of a size that is less than the size of the intermediate multiplier(s). In addition, block 317 may combine the intermediate multiplier(s) to obtain one or more larger multipliers, wherein the larger multiplier(s) are of a size that is greater than the size of the intermediate multiplier(s). In an embodiment, block 319 sums partial products in rank order, wherein block 321 shifts the summed partial products to obtain shifted partial products. In such a case, block 323 adds the shifted partial products to obtain one or more of larger multipliers, sums of larger multipliers or sums of smaller multipliers. Moreover, block 325 may pre-code groups of smaller multiplier products, wherein block 327 adds the pre-coded groups of smaller multiplier products.
Illustrated processing block 332 sources one or more arithmetic blocks by a plurality of input channels. In an embodiment, block 334 decomposes each of the plurality of input channels into smaller input channels.
The illustrated system 340 also includes an input output (IO) module 348 implemented together with the host processor 342, an AI accelerator 351 (e.g., DNN processing engine) and a graphics processor 350 (e.g., graphics processing unit/GPU) on a semiconductor die 352 as a system on chip (SoC). The illustrated IO module 348 communicates with, for example, a display 354 (e.g., touch screen, liquid crystal display/LCD, light emitting diode/LED display), a network controller 356 (e.g., wired and/or wireless), and mass storage 358 (e.g., hard disk drive/HDD, optical disk, solid state drive/SSD, flash memory).
In an embodiment, the AI accelerator 351 includes a multi-precision MAC processor such as, for example, the MAC processor 40 (
In one example, the logic 364 includes transistor channel regions that are positioned (e.g., embedded) within the substrate(s) 362. Thus, the interface between the logic 364 and the substrate(s) 362 may not be an abrupt junction. The logic 364 may also be considered to include an epitaxial layer that is grown on an initial wafer of the substrate(s) 362.
The processor core 400 is shown including execution logic 450 having a set of execution units 455-1 through 455-N. Some embodiments may include a number of execution units dedicated to specific functions or sets of functions. Other embodiments may include only one execution unit or one execution unit that can perform a particular function. The illustrated execution logic 450 performs the operations specified by code instructions.
After completion of execution of the operations specified by the code instructions, back end logic 460 retires the instructions of the code 413. In one embodiment, the processor core 400 allows out of order execution but requires in order retirement of instructions. Retirement logic 465 may take a variety of forms as known to those of skill in the art (e.g., re-order buffers or the like). In this manner, the processor core 400 is transformed during execution of the code 413, at least in terms of the output generated by the decoder, the hardware registers and tables utilized by the register renaming logic 425, and any registers (not shown) modified by the execution logic 450.
The system 1000 is illustrated as a point-to-point interconnect system, wherein the first processing element 1070 and the second processing element 1080 are coupled via a point-to-point interconnect 1050. It should be understood that any or all of the illustrated interconnects may be implemented as a multi-drop bus rather than as point-to-point interconnects.
Each of the processing elements 1070 and 1080 may be a multicore processor, including first and second processor cores (e.g., processor cores 1074a and 1074b and processor cores 1084a and 1084b).
Each processing element 1070, 1080 may include at least one shared cache 1896a, 1896b. The shared cache 1896a, 1896b may store data (e.g., instructions) that are utilized by one or more components of the processor, such as the cores 1074a, 1074b and 1084a, 1084b, respectively. For example, the shared cache 1896a, 1896b may locally cache data stored in a memory 1032, 1034 for faster access by components of the processor. In one or more embodiments, the shared cache 1896a, 1896b may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof.
While shown with only two processing elements 1070, 1080, it is to be understood that the scope of the embodiments is not so limited. In other embodiments, one or more additional processing elements may be present in a given processor. Alternatively, one or more of the processing elements 1070, 1080 may be an element other than a processor, such as an accelerator or a field programmable gate array. For example, additional processing element(s) may include additional processor(s) that are the same as a first processor 1070, additional processor(s) that are heterogeneous or asymmetric to the first processor 1070, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processing element. There can be a variety of differences between the processing elements 1070, 1080 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like. These differences may effectively manifest themselves as asymmetry and heterogeneity amongst the processing elements 1070, 1080. For at least one embodiment, the various processing elements 1070, 1080 may reside in the same die package.
The first processing element 1070 may further include memory controller logic (MC) 1072 and point-to-point (P-P) interfaces 1076 and 1078. Similarly, the second processing element 1080 may include a MC 1082 and P-P interfaces 1086 and 1088. The MC 1072 and the MC 1082 couple the processing elements to respective memories, namely the memory 1032 and the memory 1034, which may be portions of main memory locally attached to the respective processors.
The first processing element 1070 and the second processing element 1080 may be coupled to an I/O subsystem 1090 via P-P interconnects 1076 and 1086, respectively.
In turn, the I/O subsystem 1090 may be coupled to a first bus 1016 via an interface 1096. In one embodiment, the first bus 1016 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the embodiments is not so limited.
Note that other embodiments are contemplated. For example, instead of the point-to-point architecture of the system 1000, a system may implement a multi-drop bus or another such communication topology.
Example 1 includes a multiply-accumulate (MAC) processor comprising one or more substrates, and logic coupled to the one or more substrates, wherein the logic is implemented at least partly in one or more of configurable or fixed-functionality hardware, the logic including a plurality of arithmetic blocks, wherein the plurality of arithmetic blocks each contain multiple multipliers, and wherein the logic is to combine multipliers one or more of within each arithmetic block or across multiple arithmetic blocks.
Example 2 includes the MAC processor of Example 1, wherein one or more intermediate multipliers are of a size that is less than precisions supported by arithmetic blocks containing the one or more intermediate multipliers.
Example 3 includes the MAC processor of Example 2, wherein the logic is to map one or more smaller multipliers to partial products of the one or more intermediate multipliers, and wherein the one or more smaller multipliers are of a size that is less than the size of the one or more intermediate multipliers.
Example 4 includes the MAC processor of Example 2, wherein the logic is to combine the one or more intermediate multipliers to obtain one or more larger multipliers, and wherein the one or more larger multipliers are of a size that is greater than the size of the one or more intermediate multipliers.
Example 5 includes the MAC processor of Example 2, wherein the logic is to sum partial products in rank order, and shift the summed partial products to obtain shifted partial products, and add the shifted partial products to obtain one or more of larger multipliers, sums of larger multipliers or sums of smaller multipliers.
Example 6 includes the MAC processor of Example 2, wherein the logic is to pre-code groups of smaller multiplier products, and add the pre-coded groups of smaller multiplier products.
Example 7 includes the MAC processor of Example 6, wherein the logic is to multiply pre-coded combinations of smaller multiplier products by a constant to obtain a sum.
Example 8 includes the MAC processor of Example 1, wherein all of the multiple multipliers are of a same precision.
Example 9 includes the MAC processor of Example 1, wherein the logic is to source one or more arithmetic blocks by a plurality of input channels, and decompose each of the plurality of input channels into smaller input channels.
Example 10 includes the MAC processor of Example 1, wherein the logic is to add multiplier outputs in rank order across the plurality of arithmetic blocks.
Example 11 includes the MAC processor of Example 1, wherein the logic is to decode subsets of weights and activations as a multiplier pre-process operation.
Example 12 includes the MAC processor of Example 1, wherein the logic is to invert individual partial products to operate one or more multipliers as a signed magnitude multiplier.
Example 13 includes the MAC processor of Example 12, wherein the logic is to add a single mixed radix partial product, and wherein a final partial product of a lower radix operates as a subset of possibilities of a higher radix.
Example 14 includes the MAC processor of Example 12, wherein, for a group of multipliers, the logic is to sum ranks of partial products, and sum a group of partial products in a different radix separately from the ranks of partial products.
Example 15 includes the MAC processor of Example 14, wherein the group of multipliers one or more of provide unsigned multiplication or are in signed magnitude format.
Example 16 includes the MAC processor of Example 1, wherein the logic is to zero out a top portion of partial products, zero out a bottom portion of the partial products, compress ranks of each set of original partial products independently, and shift groups of ranks into an alignment of a smaller precision.
Example 17 includes the MAC processor of Example 16, wherein the logic is to calculate, via multipliers, signed magnitude values in a first precision and a second precision, calculate a first set of additional partial products in the first precision, and calculate a second set of additional partial products in the second precision.
Example 18 includes the MAC processor of Example 1, wherein the logic is to sort individual exponents of floating point representations to identify a largest exponent, denormalize multiplier products to the largest exponent, sum the denormalized multiplier products to obtain a product sum, and normalize the product sum to a single floating point value.
Example 19 includes the MAC processor of Example 1, wherein the plurality of arithmetic blocks are cascaded in a sequence, and wherein the logic is to denormalize, at each subsequent arithmetic block, a smaller of two values to a larger value, and sum the two values.
Example 20 includes the MAC processor of any one of Examples 1 to 19, wherein the logic is to arrange sparsity information for activations and weights in accordance with a bitmap format that is common to multiple precisions.
Example 21 includes the MAC processor of any one of Examples 1 to 19, wherein the logic coupled to the one or more substrates includes transistor channel regions that are positioned within the one or more substrates.
Example 22 includes a computing system comprising a network controller, and a multiply-accumulate (MAC) processor coupled to the network controller, wherein the MAC processor includes logic coupled to one or more substrates, wherein the logic includes a plurality of arithmetic blocks, wherein the plurality of arithmetic blocks each contain multiple multipliers, and wherein the logic is to combine multipliers one or more of within each arithmetic block or across multiple arithmetic blocks.
Example 23 includes the computing system of Example 22, wherein one or more intermediate multipliers are of a size that is less than precisions supported by arithmetic blocks containing the one or more intermediate multipliers.
Example 24 includes a method comprising providing one or more substrates, and coupling logic to the one or more substrates, wherein the logic is implemented at least partly in one or more of configurable or fixed-functionality hardware, the logic including a plurality of arithmetic blocks, wherein the plurality of arithmetic blocks each contain multiple multipliers, and wherein the logic is to combine multipliers one or more of within each arithmetic block or across multiple arithmetic blocks.
Example 25 includes the method of Example 24, wherein one or more intermediate multipliers are of a size that is less than precisions supported by arithmetic blocks containing the one or more intermediate multipliers.
Example 26 includes means for performing the method of any one of Examples 24 to 25.
Technology described herein therefore delivers high performance at a fraction of the area and energy costs in DNN accelerators, which may be key to efficient edge inference for various AI applications including imaging, video, and speech applications. The technology also provides a design that achieves high performance, has a low silicon footprint and energy consumption, and can provide a unique edge in terms of better performance while taking advantage of transistor scaling.
Embodiments are applicable for use with all types of semiconductor integrated circuit (“IC”) chips. Examples of these IC chips include but are not limited to processors, controllers, chipset components, programmable logic arrays (PLAs), memory chips, network chips, systems on chip (SoCs), SSD/NAND controller ASICs, and the like. In addition, in some of the drawings, signal conductor lines are represented with lines. Some may be different, to indicate more constituent signal paths, have a number label, to indicate a number of constituent signal paths, and/or have arrows at one or more ends, to indicate primary information flow direction. This, however, should not be construed in a limiting manner. Rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit. Any represented signal lines, whether or not having additional information, may actually comprise one or more signals that may travel in multiple directions and may be implemented with any suitable type of signal scheme, e.g., digital or analog lines implemented with differential pairs, optical fiber lines, and/or single-ended lines.
Example sizes/models/values/ranges may have been given, although embodiments are not limited to the same. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured. In addition, well known power/ground connections to IC chips and other components may or may not be shown within the figures, for simplicity of illustration and discussion, and so as not to obscure certain aspects of the embodiments. Further, arrangements may be shown in block diagram form in order to avoid obscuring embodiments, and also in view of the fact that specifics with respect to implementation of such block diagram arrangements are highly dependent upon the computing system within which the embodiment is to be implemented, i.e., such specifics should be well within purview of one skilled in the art. Where specific details (e.g., circuits) are set forth in order to describe example embodiments, it should be apparent to one skilled in the art that embodiments can be practiced without, or with variation of, these specific details. The description is thus to be regarded as illustrative instead of limiting.
The term “coupled” may be used herein to refer to any type of relationship, direct or indirect, between the components in question, and may apply to electrical, mechanical, fluid, optical, electromagnetic, electromechanical or other connections. In addition, the terms “first”, “second”, etc. may be used herein only to facilitate discussion, and carry no particular temporal or chronological significance unless otherwise indicated.
As used in this application and in the claims, a list of items joined by the term “one or more of” may mean any combination of the listed terms. For example, the phrases “one or more of A, B or C” may mean A; B; C; A and B; A and C; B and C; or A, B and C.
Those skilled in the art will appreciate from the foregoing description that the broad techniques of the embodiments can be implemented in a variety of forms. Therefore, while the embodiments have been described in connection with particular examples thereof, the true scope of the embodiments should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, specification, and following claims.