Embodiments generally relate to artificial intelligence (AI) computing. More particularly, embodiments relate to reconfigurable multibit analog in-memory computing with compact computation for AI applications.
A neural network (NN) can be represented as a graph of several neuron layers flowing from one layer to the next. The outputs of one layer of neurons can be based on calculations and are the inputs of the next layer. To perform these calculations, a variety of matrix-vector, matrix-matrix, and tensor operations may be required, which are themselves composed of many multiply-accumulate (MAC) operations. Indeed, there are so many of these MAC operations in a neural network that such operations may dominate other types of computations (e.g., activation and pooling functions). Neural network operation may be enhanced by reducing data fetches from long-term storage and distal memories separated from the MAC unit.
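As a rough illustration of this dominance, the short sketch below (with hypothetical layer dimensions that are not drawn from this disclosure) counts the MAC operations required by a single convolutional layer:

```python
# Rough MAC count for one convolutional layer (hypothetical dimensions).
# Each output element requires c_in * k * k multiply-accumulates.
def conv_mac_count(h_out, w_out, c_out, c_in, k):
    return h_out * w_out * c_out * c_in * k * k

# Example: a 56x56 output feature map, 64 input and 64 output channels,
# and a 3x3 kernel.
macs = conv_mac_count(56, 56, 64, 64, 3)
print(f"{macs:,} MACs for a single layer")  # ~115.6 million
```

Even one modest layer thus requires on the order of a hundred million MAC operations, each of which would otherwise involve fetching a weight from memory.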
Compute-in-memory (CiM) static random-access memory (SRAM) architectures (e.g., merged memory and MAC units) may deliver increased efficiency to convolutional neural network (CNN) models as compared to near-memory computing architectures due to reduced latencies associated with data movement. A notable trend in CiM processor architectures may be to use analog mixed-signal (AMS) hardware when performing MAC operations (e.g., multiplying analog input activations by digital weights and accumulating the result) in a CNN model. In such a case, a C-2C capacitor ladder network may be integrated (e.g., embedded, incorporated) within the SRAM to perform the MAC operations. Integrating the C-2C capacitor ladder network within the SRAM may increase circuit area, and in turn reduce memory density. Additionally, conventional C-2C capacitor ladder network solutions are typically limited to a fixed data format for the weights, which may have a negative impact on flexibility and/or performance.
The various advantages of the embodiments will become apparent to one skilled in the art by reading the following specification and appended claims, and by referencing the following drawings, in which:
Compute-in-memory (CiM), a computation method that departs from the classical von Neumann architecture, is a promising candidate for convolutional neural network (CNN) and deep neural network (DNN) applications. CiM architectures are, however, more difficult to realize in purely digital systems, since conventional multiply-accumulate (MAC) operation units are too large to fit into high-density Manhattan-style memory arrays.
Currently, most practical CiM designs are developed with static random access memory (SRAM) technologies. Among them, solutions that primarily use digital computation can utilize only a small fraction of the entire SRAM memory array for simultaneous computation with a multibit data format. This limitation arises because the digital computational circuit size for multibit data increases quadratically with the number of bits, whereas the memory circuit size within the SRAM array increases linearly. Accordingly, there is a substantial mismatch between unit computational circuit size and unit memory circuit size for multibit implementations. As a result, only a small number of computational circuit units can be implemented in all-digital solutions, which causes a significant bottleneck in the overall throughput of in-memory computing.
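The mismatch can be illustrated with a back-of-the-envelope sketch (the unit areas are assumed purely for illustration and are not taken from this disclosure):

```python
# Illustrative scaling comparison (assumed unit areas, not measured values).
# A digital multiplier for b-bit operands grows roughly as b^2 gate units,
# while SRAM storage for a b-bit weight grows as b bitcells.
for b in (1, 2, 4, 8):
    mult_area = b * b   # ~quadratic growth of digital MAC logic
    mem_area = b        # ~linear growth of SRAM storage
    print(f"{b}-bit: MAC-to-memory area ratio ~ {mult_area / mem_area:.0f}x")
# The ratio grows linearly with bit width, reaching ~8x at 8 bits, so far
# fewer multibit digital MAC units fit alongside a given memory array.
```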
To achieve efficient and high-throughput multibit in-memory computing, a C-2C-ladder-based analog MAC unit can be used for SRAM-based multibit CiM schemes. Additionally, an improved SRAM design with multiplexing capability may be used to better support weight-stationary machine learning (ML) operations. Moreover, an analog in-memory computing macro may be used that can be built from standard SRAM macros.
Turning now to
By contrast, an enhanced architecture 40 includes a capacitor ladder network 42 that is external to a memory array 44 (e.g., SRAM cluster) and generates an out-of-SRAM C-2C multibit combination 46, which substantially reduces circuit area overhead and increases memory density. More particularly, moving the capacitor ladder network 42 for multi-bit combination out of the memory array 44 enables each SRAM cluster to perform 1-bit weight and input activation (IA) multiplication with only one unit capacitor Cu, rather than a one unit (C) capacitor plus a two unit (2C) capacitor as in the conventional architecture 20. Such an approach significantly reduces the capacitor circuit area overhead while increasing memory cell density for weight storage.
For example, there are several differences between the enhanced architecture 40 and the conventional architecture 20. First, whereas each 1-bit-8-bank SRAM cluster 22, 26 in the conventional architecture 20 (e.g., which contains 1-bit weight data with N sub-banks for weight data multiplexing) contains one C and one 2C capacitor, the corresponding cluster in the enhanced architecture 40 has only one unit capacitor Cu. The compactness from this single Cu capacitor alone enables the out-of-SRAM multibit combination scheme to reduce the analog MAC circuit overhead (i.e., the capacitors in the SRAM cluster). As a result, more SRAM cells can fit within each SRAM cluster of the same size by providing even more sub-banks for multiplexing, or the size of the SRAM cluster can be reduced if the number of sub-banks is kept the same. In either case, the weight storage density within the SRAM array is increased, while in the latter case the MAC computation unit density is also increased (e.g., since more MAC units can fit within an SRAM array as the SRAM cluster size is reduced).
Another difference is that the partial product of 1-bit weight and input activation (IA) within each SRAM cluster connects to a partial output activation (pOA) line for summation and averaging, achieving the MAC operation. By comparison, in the conventional architecture 20, there may be no such pOA line for summation. Instead, the multibit combination of the 1-bit weight and IA multiplication products is carried out locally between neighboring SRAM clusters 22, 26, and only the SRAM cluster 22, 26 corresponding to the most significant bit (MSB) connects to an output activation (OA) line.
Yet another difference is that each pOA line is connected through the capacitor ladder network 42 outside the memory array 44 for multi-bit combination, which results in an OA line at the MSB output of the capacitor ladder that corresponds to the multibit, multi-dimensional (64-dimensional/64-D) MAC computation. The enhanced architecture 40 has only one C-2C ladder for generating the MAC result on the OA line, whereas in the conventional architecture 20 the number of C-2C ladders involved is the same as the number of summations within the MAC operation (e.g., sixty-four).
at the bottom plate. Thus, a 64-D MAC operation has been achieved for sixty-four sets of 1-bit weights and IA inputs. It can further be assumed that the unit capacitors within the C-2C ladder are CC and C2C, and that the equivalent capacitance Ceq, including CC and 64Cu, is
In order to maintain the C-2C ratio for binary multi-bit combination, the following relationship can be enforced:
$C_{term} = C_{2C} - C_{eq} = C_{eq}$    Eq. 2
The value at the OA line output 60 becomes
Thus, a 64-D MAC operation has been achieved for 8-bit weights and IA inputs using an out-of-SRAM C-2C-ladder-based multi-bit combination scheme with a fixed weight data format.
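For illustration, this combination scheme can be checked numerically with a simplified behavioral model. The sketch below ignores parasitics, charge injection, and the Cterm sizing above, and assumes an ideal halve-and-add behavior per ladder stage (with unsigned weights for simplicity); it models each pOA line as the average of sixty-four 1-bit-weight-times-IA products and confirms that the ladder output equals a scaled 64-D MAC result:

```python
import numpy as np

rng = np.random.default_rng(0)
D, BITS = 64, 8

ia = rng.uniform(0.0, 1.0, size=D)            # analog input activations
bits = rng.integers(0, 2, size=(D, BITS))     # 1-bit weight planes, LSB first
weights = bits @ (2 ** np.arange(BITS))       # unsigned 8-bit weights

# Each pOA line holds the averaged 1-bit partial products of one bit plane.
poa = (ia[:, None] * bits).mean(axis=0)       # shape (BITS,)

# Idealized C-2C ladder behavior: each stage adds the next bit plane and
# halves the running sum, yielding binary weighting at the MSB output.
oa = 0.0
for v in poa:                                 # LSB to MSB
    oa = (oa + v) / 2.0

# Cross-check against the exact digital MAC, scaled by 1/(D * 2^BITS).
expected = float(ia @ weights) / (D * 2 ** BITS)
assert np.isclose(oa, expected), (oa, expected)
print(f"OA = {oa:.6f}, matches the scaled 64-D MAC result")
```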
The technology described herein is also the first analog CiM solution with a uniform MAC unit design as well as a uniform multibit recombination structure that resides outside the SRAM array. Accordingly, the technology described herein is more scalable and reconfigurable. Thus, embodiments deliver a significant computation density improvement while keeping the uniformity of the structures, offering both scalability and reconfigurability in the CiM array.
Reconfigurable Out-of-SRAM C-2C-Ladder-Based Multi-Bit Combination for Analog MAC
Additionally, the conventional architecture 20 (
More particularly, placing the capacitor ladder network external to the memory array provides the ability to selectively activate a plurality of switches (not shown) based on the data format of the multibit weight data (e.g., after manufacture) because the circuit overhead for providing reconfigurability may now also reside outside the memory array (e.g., avoiding any negative impact on weight storage density). Indeed, different weight data formats may be used during inference when switching between neural network layers.
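A behavioral sketch of such format-dependent switching is shown below (the switch naming and grouping scheme here are hypothetical; the disclosure does not fix a particular control encoding). Opening a switch pair at a group boundary terminates one sub-ladder, so the eight pOA lines of a single ladder can form either one 8-bit combination or two independent 4-bit combinations:

```python
import numpy as np

def configure_switches(num_lines=8, fmt_bits=8):
    """Hypothetical control: s[i] = True chains stage i to stage i+1;
    s[i] = False terminates the group (Cterm engaged) at stage i."""
    assert num_lines % fmt_bits == 0
    return [(i + 1) % fmt_bits != 0 for i in range(num_lines)]

def combine(poa, switches):
    """Binary-weighted combination per switch-delimited group."""
    results, acc = [], 0.0
    for v, chained in zip(poa, switches):
        acc = (acc + v) / 2.0        # idealized C-2C halve-and-add stage
        if not chained:              # group boundary: emit an OA, reset
            results.append(acc)
            acc = 0.0
    return results

poa = np.linspace(0.1, 0.8, 8)       # example pOA line values
print(combine(poa, configure_switches(fmt_bits=8)))  # one 8-bit OA value
print(combine(poa, configure_switches(fmt_bits=4)))  # two 4-bit OA values
```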
First, a unit C-2C cell 82 that is associated with the ith pOA line (pOAi) now has a termination capacitor Cterm and a pair of switches controlled by the complementary signals Si and S̄i.
Embodiments provide an analog multiplexer (MUX) for multiplexing multiple OA lines onto muxed OA (mOA) lines, such that the analog value on each mOA line can be digitized by a subsequent analog-to-digital converter (ADC). The following discussion provides examples of the analog MUX associating OA lines with mOA lines.
The added reconfigurability across various data formats is made practical by moving the C-2C ladder out of the SRAM array and reducing the number of C-2C ladders to only one per MAC operation. With the C-2C ladder moved out of the SRAM array and the number of C-2C ladders consolidated, the reconfigurability can be added efficiently with minimal circuit overhead, since the reconfigurability is now only applied to the one out-of-SRAM C-2C ladder that covers an entire 64-D MAC operation.
Turning now to
Likewise, the termination capacitor, Cterm, may now be:
$C_{term} = C_{2C} - C_{eq} - 4C_p = C_{eq} + 2C_p$    Eq. 5
Due to the charge sharing from parasitic capacitance, the OA line values would also have another scaling factor
and the resulting OA line voltage is shown below:
Although the OA line voltage is attenuated as compared to Eq. 3, this attenuation is a linear operation and the effect of this scaling can be digitally reversed once the OA line 120 is digitized through an ADC (not shown).
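A minimal sketch of that digital correction (assuming an idealized ADC model and an example Cp/Ceq ratio; the attenuation factor follows the Ceq/(Ceq + 3Cp) form of the scaling discussed above) might look as follows:

```python
# Digitally reversing the linear attenuation caused by parasitics.
C_eq, C_p = 1.0, 0.0667          # assumed values: 3*Cp is ~20% of Ceq
alpha = C_eq / (C_eq + 3 * C_p)  # linear attenuation factor

def adc(v, full_scale=1.0, bits=10):
    """Idealized ADC model: uniform quantization, no noise."""
    code = round(v / full_scale * (2 ** bits - 1))
    return code / (2 ** bits - 1) * full_scale

v_true = 0.4200                  # ideal (unattenuated) OA value
v_line = alpha * v_true          # what actually appears on the OA line
recovered = adc(v_line) / alpha  # multiply by 1/alpha after digitizing
print(f"true={v_true:.4f}  recovered={recovered:.4f}")
```

Because the correction is a single multiply per digitized sample, it adds negligible digital overhead.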
Similarly, this reconfigurability may have a very minor penalty with respect to noise and mismatch. For noise, the capacitor itself does not generate noise. Rather, the sampling process through a resistor adds so-called KT/C noise (e.g., Johnson-Nyquist noise, which is a function of the Boltzmann constant (K), temperature (T), and capacitance (C)) to the voltage value stored on the capacitor, and the KT/C noise voltage is inversely proportional to the square root of the capacitance C. Accordingly, the added reconfigurability, which incurs additional parasitic capacitance, would only decrease the absolute KT/C noise value. It can be shown that the overall KT/C noise has a scaling factor of
after accounting for the parasitic capacitance. Also, as shown above, the OA values, which constitute the signal here, have a linear scaling factor of
As a result, the signal-to-noise ratio (SNR) is then scaled by
which is a very minor negative impact on SNR. For example, if 3Cp adds up to 20% of Ceq, then the SNR on the OA lines would only degrade by about 10%, or 0.8 dB, which translates to 0.13 bits. As for mismatch concerns, the overall capacitance, including the parasitic capacitance, may be most relevant. Therefore, the overall mismatch is not degraded by the added reconfigurability.
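The arithmetic behind the 0.8 dB figure can be verified directly (using the same assumed 3Cp = 0.2·Ceq ratio as in the example above):

```python
import math

# Assumed ratio from the example: 3*Cp equals 20% of Ceq.
ratio = 0.2
signal_scale = 1 / (1 + ratio)            # linear attenuation of OA values
noise_scale = 1 / math.sqrt(1 + ratio)    # KT/C noise ~ 1/sqrt(total C)
snr_scale = signal_scale / noise_scale    # = 1/sqrt(1.2) ~ 0.913

snr_db = 20 * math.log10(snr_scale)       # ~ -0.79 dB
bits = abs(snr_db) / 6.02                 # ~ 0.13 effective bits
print(f"SNR scale {snr_scale:.3f} ({snr_db:.2f} dB, {bits:.2f} bits)")
```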
Illustrated processing block 132 provides for storing multibit weight data to a memory array. In one example, the memory array includes an SRAM. Block 134 conducts, by a capacitor ladder network, MAC operations on first analog (e.g., input activation) signals and the multibit weight data. Additionally, block 136 outputs, by the capacitor ladder network, second analog (e.g., output activation) signals based on the MAC operations, wherein the capacitor ladder network is external to the memory array. The capacitor ladder network may include a C-2C capacitor ladder network.
In an embodiment, the capacitor ladder network includes a plurality of switches and block 134 includes selectively activating, by a controller, the plurality of switches based on a data format of the multibit weight data. In such a case, the plurality of switches may include a plurality of switch pairs (e.g., S1 and S̄1).
Illustrated processing block 142 carries, by a plurality of output activation (OA) lines, the second analog signals. Block 144 combines, by one or more multiplexers coupled to the plurality of OA lines, the second analog signals. In an embodiment, block 144 combines only the OA lines that are valid for a given weight data format onto mOA lines in a time-division multiplexed manner. The method 140 therefore further enhances performance at least to the extent that combining the second analog signals as shown improves computational throughput and/or storage density.
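One possible behavioral sketch of this time-division multiplexing is shown below (the line counts and round-robin schedule are hypothetical and not mandated by the disclosure):

```python
def schedule_mux(valid_oa, num_moa):
    """Round-robin, time-division assignment of valid OA lines to mOA
    lines; returns a list of (time_slot, moa_line, oa_line) tuples."""
    plan = []
    for idx, oa in enumerate(valid_oa):
        plan.append((idx // num_moa, idx % num_moa, oa))
    return plan

# Example: in a 4-bit mode, two ladders might each expose two valid OA
# lines, which are time-shared over two mOA lines feeding the ADCs.
valid_int4 = ["OA0", "OA1", "OA2", "OA3"]
for slot, moa, oa in schedule_mux(valid_int4, num_moa=2):
    print(f"t{slot}: mOA{moa} <- {oa} -> ADC")
```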
Turning now to
In the illustrated example, the system 280 includes a host processor 282 (e.g., central processing unit/CPU) having an integrated memory controller (IMC) 284 that is coupled to a system memory 286 (e.g., dual inline memory module/DIMM). In an embodiment, an IO (input/output) module 288 is coupled to the host processor 282. The illustrated IO module 288 communicates with, for example, a display 290 (e.g., touch screen, liquid crystal display/LCD, light emitting diode/LED display), mass storage 302 (e.g., hard disk drive/HDD, optical disc, solid state drive/SSD) and a network controller 292 (e.g., wired and/or wireless). The host processor 282 may be combined with the IO module 288, a graphics processor 294, and an AI accelerator 296 into a system on chip (SoC) 298.
In an embodiment, the AI accelerator 296 includes the enhanced architecture 40 (
The logic 354 may be implemented at least partly in configurable or fixed-functionality hardware. In one example, the logic 354 includes transistor channel regions that are positioned (e.g., embedded) within the substrate(s) 352. Thus, the interface between the logic 354 and the substrate(s) 352 may not be an abrupt junction. The logic 354 may also be considered to include an epitaxial layer that is grown on an initial wafer of the substrate(s) 352.
Example 1 includes a performance-enhanced computing system comprising a network controller and a processor coupled to the network controller, the processor including logic coupled to one or more substrates, wherein the logic includes a memory array to store multibit weight data and a capacitor ladder network to conduct multiply-accumulate (MAC) operations on first analog signals and the multibit weight data, the capacitor ladder network further to output second analog signals based on the MAC operations, wherein the capacitor ladder network is external to the memory array.
Example 2 includes the computing system of Example 1, wherein the capacitor ladder network includes a plurality of switches and the logic includes a controller to selectively activate the plurality of switches based on a data format of the multibit weight data.
Example 3 includes the computing system of Example 2, wherein the plurality of switches includes a plurality of switch pairs, and wherein each switch pair corresponds to one of the second analog signals.
Example 4 includes the computing system of Example 2, wherein the data format includes one of an eight-bit integer format or a four-bit integer format.
Example 5 includes the computing system of Example 1, wherein the capacitor ladder network includes a plurality of partial output activation lines to carry the second analog signals, and one or more multiplexers coupled to the plurality of partial output activation lines, the one or more multiplexers to combine the second analog signals.
Example 6 includes the computing system of any one of Examples 1 to 5, wherein the memory array includes a static random access memory and the capacitor ladder network includes a C-2C capacitor ladder network.
Example 7 includes a semiconductor apparatus comprising one or more substrates, and logic coupled to the one or more substrates, wherein the logic is implemented at least partly in one or more of configurable or fixed-functionality hardware, the logic including a memory array to store multibit weight data, and a capacitor ladder network to conduct multiply-accumulate (MAC) operations on first analog signals and the multibit weight data, the capacitor ladder network further to output second analog signals based on the MAC operations, wherein the capacitor ladder network is external to the memory array.
Example 8 includes the semiconductor apparatus of Example 7, wherein the capacitor ladder network includes a plurality of switches and the logic includes a controller to selectively activate the plurality of switches based on a data format of the multibit weight data.
Example 9 includes the semiconductor apparatus of Example 8, wherein the plurality of switches includes a plurality of switch pairs, and wherein each switch pair corresponds to one of the second analog signals.
Example 10 includes the semiconductor apparatus of Example 8, wherein the data format includes one of an eight-bit integer format or a four-bit integer format.
Example 11 includes the semiconductor apparatus of Example 7, wherein the capacitor ladder network includes a plurality of partial output activation lines to carry the second analog signals, and one or more multiplexers coupled to the plurality of partial output activation lines, the one or more multiplexers to combine the second analog signals.
Example 12 includes the semiconductor apparatus of any one of Examples 7 to 11, wherein the memory array includes a static random access memory.
Example 13 includes the semiconductor apparatus of any one of Examples 7 to 12, wherein the capacitor ladder network includes a C-2C capacitor ladder network.
Example 14 includes the semiconductor apparatus of any one of Examples 7 to 12, wherein the logic coupled to the one or more substrates includes transistor regions that are positioned within the one or more substrates.
Example 15 includes a method of operating a performance-enhanced computing system, the method comprising storing multibit weight data to a memory array, conducting, by a capacitor ladder network, multiply-accumulate (MAC) operations on first analog signals and the multibit weight data, and outputting, by the capacitor ladder network, second analog signals based on the MAC operations, wherein the capacitor ladder network is external to the memory array.
Example 16 includes the method of Example 15, further including selectively activating, by a controller, a plurality of switches in the capacitor ladder network based on a data format of the multibit weight data.
Example 17 includes the method of Example 16, wherein the plurality of switches includes a plurality of switch pairs, and wherein each switch pair corresponds to one of the second analog signals.
Example 18 includes the method of Example 16, wherein the data format includes one of an eight-bit integer format or a four-bit integer format.
Example 19 includes the method of Example 15, further including carrying, by a plurality of partial output activation lines, the second analog signals, and combining, by one or more multiplexers coupled to the plurality of partial output activation lines, the second analog signals.
Example 20 includes the method of any one of Examples 15 to 19, wherein the memory array includes a static random access memory and the capacitor ladder network includes a C-2C capacitor ladder network.
Example 21 includes an apparatus comprising means for performing the method of any one of Examples 15 to 20.
The analog in-memory computing technology described herein therefore provides performance advantages over other in-memory computing solutions. For example, the technology described herein provides edge AI platforms with both high throughput and high efficiency. Embodiments address two major technical problems associated with analog CiM: analog MAC computation circuit overhead and a lack of reconfigurability in data format. With these challenges alleviated, a potential analog CiM accelerator based on the technology described herein can significantly outperform conventional offerings (e.g., via reconfigurable weight data formats during inference when switching between layers of a neural network). The resulting performance advantages are particularly beneficial in edge AI applications in which computing throughput and memory density are issues of concern. The technology described herein also obviates any need to under-utilize existing multibit weight data formats, or to restrict analog CiM arrays to a single-bit weight data format (e.g., combined with performing bit-serial operations digitally outside the CiM arrays), in an effort to achieve reconfigurable weight data formats in analog CiM.
Embodiments are applicable for use with all types of semiconductor integrated circuit (“IC”) chips. Examples of these IC chips include but are not limited to processors, controllers, chipset components, programmable logic arrays (PLAs), memory chips, network chips, systems on chip (SoCs), SSD/NAND controller ASICs, and the like. In addition, in some of the drawings, signal conductor lines are represented with lines. Some may be different, to indicate more constituent signal paths, have a number label, to indicate a number of constituent signal paths, and/or have arrows at one or more ends, to indicate primary information flow direction. This, however, should not be construed in a limiting manner. Rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit. Any represented signal lines, whether or not having additional information, may actually comprise one or more signals that may travel in multiple directions and may be implemented with any suitable type of signal scheme, e.g., digital or analog lines implemented with differential pairs, optical fiber lines, and/or single-ended lines.
Example sizes/models/values/ranges may have been given, although embodiments are not limited to the same. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured. In addition, well known power/ground connections to IC chips and other components may or may not be shown within the figures, for simplicity of illustration and discussion, and so as not to obscure certain aspects of the embodiments. Further, arrangements may be shown in block diagram form in order to avoid obscuring embodiments, and also in view of the fact that specifics with respect to implementation of such block diagram arrangements are highly dependent upon the computing system within which the embodiment is to be implemented, i.e., such specifics should be well within purview of one skilled in the art. Where specific details (e.g., circuits) are set forth in order to describe example embodiments, it should be apparent to one skilled in the art that embodiments can be practiced without, or with variation of, these specific details. The description is thus to be regarded as illustrative instead of limiting.
The term “coupled” may be used herein to refer to any type of relationship, direct or indirect, between the components in question, and may apply to electrical, mechanical, fluid, optical, electromagnetic, electromechanical or other connections. In addition, the terms “first”, “second”, etc. may be used herein only to facilitate discussion, and carry no particular temporal or chronological significance unless otherwise indicated.
As used in this application and in the claims, a list of items joined by the term “one or more of” may mean any combination of the listed terms. For example, the phrases “one or more of A, B or C” may mean A; B; C; A and B; A and C; B and C; or A, B and C.
Those skilled in the art will appreciate from the foregoing description that the broad techniques of the embodiments can be implemented in a variety of forms. Therefore, while the embodiments have been described in connection with particular examples thereof, the true scope of the embodiments should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, specification, and following claims.