Embodiments generally relate to artificial intelligence (AI) computing. More particularly, embodiments relate to technology to realize signed multiply-accumulate (MAC) operation in the analog domain with a differential signal path and intrinsic process, voltage, and temperature (PVT) variation tolerance.
Compute-in-memory (CiM) static random-access memory (SRAM) architectures may deliver increased efficiency to convolutional neural network (CNN) models. A notable trend in CiM processor architectures may be to use analog mixed-signal (AMS) hardware when performing multiply-accumulate (MAC) operations in a CNN model. Most AMS CiM processors, however, have relatively low process, voltage, and temperature (PVT) variation tolerance. Additionally, AMS CiM processors may have increased memory requirements depending on the input data format.
The various advantages of the embodiments will become apparent to one skilled in the art by reading the following specification and appended claims, and by referencing the following drawings, in which:
As already noted, analog mixed-signal (AMS) compute-in-memory (CiM) processors may have increased memory requirements depending on the input data format and/or relatively low process, voltage, and temperature (PVT) variation tolerance. For example, most AMS CiM processors have two main challenges: 1) support for signed multi-bit data and 2) PVT variation tolerance.
Signed data format is advantageous in many machine learning (ML) and neural network (NN) applications (e.g., a mixture of positive and negative weight values may be helpful in identifying edges in images). Signed data format may be relatively straightforward in the digital domain because the overhead to support signed formats in digital is merely reserving a single bit, the sign bit, to represent the polarity of the data (e.g., the value “0” represents positive numbers, and the value “1” represents negative numbers). One extra bit of overhead can easily be ignored compared to the remaining 7, 15, 31 or 63 bits. The situation is quite different, however, in the analog domain, since the sign bit is also treated as the most-significant-bit (MSB) of the data, which doubles the required operations and normally doubles the number of memory cells.
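By way of illustration only, the digital sign-bit convention described above (“0” for positive, “1” for negative) may be sketched as follows. The function names and the 8-bit word width are assumptions chosen for the example and are not part of the embodiments.

```python
def to_signed_magnitude(value: int, bits: int = 8) -> int:
    """Encode an integer as a signed-magnitude word (MSB is the sign bit)."""
    max_magnitude = (1 << (bits - 1)) - 1
    if abs(value) > max_magnitude:
        raise ValueError(f"{value} does not fit in {bits}-bit signed magnitude")
    sign_bit = 1 if value < 0 else 0  # 0 = positive, 1 = negative
    return (sign_bit << (bits - 1)) | abs(value)


def from_signed_magnitude(word: int, bits: int = 8) -> int:
    """Decode a signed-magnitude word back into an integer."""
    sign_bit = (word >> (bits - 1)) & 1
    magnitude = word & ((1 << (bits - 1)) - 1)
    return -magnitude if sign_bit else magnitude


if __name__ == "__main__":
    word = to_signed_magnitude(-5)
    print(f"{word:08b}")                # sign bit followed by 7 magnitude bits
    print(from_signed_magnitude(word))  # recovers the original value
```

In this sketch the sign bit costs one bit out of eight, which reflects the low digital overhead noted above.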
AMS hardware output is also susceptible to PVT variations, limiting the computing precision and, ultimately, the inference accuracy of a CNN model. Computing at the edge also has substantial constraints such as, for example, power limitations (e.g., most edge devices, such as wireless sensors, mobile devices, etc., have only a very limited power budget). Thus, intensive operations can quickly drain the battery or other power source.
To save power, the most practical and straightforward solution may be lowering the supply voltage of the circuit. The dynamic power consumption is given by P = CV²f, where P is the power consumption, C is the loading capacitance of the circuit, V is the supply voltage, and f is the operating frequency. As shown in the equation, the power consumption is proportional to the square of the supply voltage. With a lower supply voltage, however, hardware and circuits are more sensitive to noise and have larger delay, which can cause errors during computation and lead to classification failures.
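As a worked illustration of the quadratic dependence on supply voltage, the following minimal sketch evaluates the dynamic power equation at two supply voltages. The capacitance, voltage, and frequency values are arbitrary assumptions chosen only to show the scaling.

```python
def dynamic_power(c_load: float, v_supply: float, freq: float) -> float:
    """Dynamic power P = C * V^2 * f, in watts."""
    return c_load * v_supply ** 2 * freq


# Hypothetical operating points: 1 pF load, 1 GHz clock.
p_nominal = dynamic_power(c_load=1e-12, v_supply=1.0, freq=1e9)
p_scaled = dynamic_power(c_load=1e-12, v_supply=0.7, freq=1e9)

# Lowering the supply from 1.0 V to 0.7 V leaves about 49% of the power.
ratio = p_scaled / p_nominal

if __name__ == "__main__":
    print(f"power ratio: {ratio:.2f}")
```

A 30% reduction in supply voltage thus yields roughly a 51% reduction in dynamic power, at the cost of the noise sensitivity and delay noted above.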
As the PVT variation is a significant issue of AMS MAC implementations, calibration solutions are typically used to guarantee a robust operation and an acceptable computing result. The hardware and power of those variation compensation approaches could be acceptable for low-end precision reduced AMS CNN processors, due to the relaxed SNR requirement. For high precision processors, however, calibration overhead could negate the benefits gained by the AMS implementation.
There are two commonly used low-cost methods to achieve signed data format in analog CiM for NN applications: 1) reducing the number of bits to support only binary (0, 1) or ternary (−1, 0, +1) formats, and 2) using unsigned hardware with code remapping. Binary/ternary NN hardware has become very popular in recent years. In CiM implementations especially, a substantial number of recently reported CiMs are binary or ternary based, as such CiM implementations can demonstrate the highest throughput and power efficiency. Although binary/ternary neural networks have shown high power efficiency, performance and supported applications are severely limited by one-bit data. With only one meaningful bit, this kind of hardware implementation can only deal with some very basic datasets, such as MNIST (Modified National Institute of Standards and Technology database) or CIFAR-10 (Canadian Institute for Advanced Research-Ten database). The accuracy drop may be unacceptable when classifying more complicated datasets, such as CIFAR-100 or ImageNet.
With continuing reference to
Digital computing is robust because of sufficient design redundancy. Analog computing, on the other hand, sacrifices the extra robustness for higher power efficiency. Consequently, analog computing typically suffers from the impact of PVT variations and hardware mismatch. One approach to mitigate those negative effects may be to directly lower the supply voltage and accept the resulting errors. Although neural networks, especially deep neural networks (e.g., “ResNet”), may be robust to errors, designers who choose to accept errors normally face a trade-off dilemma: 1) prioritizing efficiency, in which case the classification accuracy cannot be guaranteed, or 2) prioritizing performance at the expense of power consumption. Neither of these two options is optimal. Other solutions may include static mismatch error compensation or dynamic operating condition adjustment.
For example, another approach may be to focus on statically correcting the error by either adding extra correction hardware or by evolving data coding. With the aid of such hardware, designers may be able to lower the supply voltage without causing a significant negative impact on the overall neural network performance. Although error correction coding (ECC) may detect and even correct errors during data reads and writes in memory, ECC cannot protect the data during computation. Hardware-based error correction, on the other hand, is complicated and difficult to implement because it requires substituting the basic computing elements. The substituted cells need additional control and support. In addition, the corresponding layout shape and size also differ from those of the basic computing standard cells. There are also downsides to noise aware training: 1) mismatch between the model and the actual noise sources on-chip, 2) extra training requirements, 3) a need to conduct the training separately for different chip architectures (e.g., lacking portability when migrating networks from one design to another), etc.
Dynamically adjusting the supply voltage by continuously monitoring the classification failure rate may be another option. Based on the observed failure rate, a control system tuning the voltage regulator may enable the workload to stay at a comfortable condition. Noise aware training is another common approach to improving network tolerance to PVT. To constantly track the ambient environment, however, traditional dynamic supply voltage adjustment solutions normally are based on sensing the classification failure rate, which presents at least four technical problems: 1) classification failure has two causes, computing faults and input corruption, and the two cannot be distinguished by simply monitoring the classification failure rate; 2) calculating the classification failure rate may require data from a data center, because the edge device cannot determine on its own whether a failure occurred, and therefore additional data transmission is required; 3) because the solution needs to wait for data and processing results from the data center, the delay in the voltage control loop is unbounded, which can easily cause instability and oscillation in the loop; and 4) voltage tuning cannot alleviate the impact of temperature and process variations. As will be discussed in greater detail, the technology described herein uses a butterfly switching based differential format in the CiM signal path to compensate for the aforementioned problems without employing complicated calibration blocks.
More particularly, most CiM implementations may traditionally use single-ended signaling in their respective processing structures. As a result, these solutions suffer from a higher error rate in edge deployment, where operating conditions may change severely. By contrast, differential signals provide inherent first order cancellation of coherent noise, crosstalk, and PVT variations, which may be a common occurrence in analog, RF (radio frequency), mixed-signal, and high-speed digital links.
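The first order cancellation described above may be illustrated with a simple behavioral sketch. The model below is an assumption made for illustration only: it treats the disturbance as a single common-mode term that couples identically onto both rails, which is an idealization of coherent noise and not a circuit-level simulation.

```python
import random


def single_ended(signal: float, common_noise: float) -> float:
    """Single-ended path: the disturbance adds directly to the signal."""
    return signal + common_noise


def differential(signal: float, common_noise: float) -> float:
    """Differential path: the same disturbance appears on both rails
    and cancels when the two rails are subtracted."""
    v_p = +signal / 2 + common_noise
    v_n = -signal / 2 + common_noise
    return v_p - v_n


if __name__ == "__main__":
    random.seed(0)
    signal = 0.3
    for _ in range(3):
        noise = random.uniform(-0.1, 0.1)  # hypothetical supply disturbance
        print(single_ended(signal, noise), differential(signal, noise))
```

In this idealized model the differential output equals the signal regardless of the common-mode term, whereas the single-ended output carries the full disturbance. Noise that is not common to both rails would not cancel in this manner.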
As shown in
More particularly, a CiM processor 30 includes an input data buffer 32 that provides digital activation signals (e.g., input activations/IAs) to a plurality of DACs 34 (34a-34n), which convert the digital activation signals into first analog signals 35. A symmetric differential signal path 36 uses MAC hardware 38 to conduct signed MAC operations on the first analog signals 35 and multibit weight data (e.g., “W” obtained from weight RAM accesses). In an embodiment, the multibit weight data is in a signed magnitude format. The MAC hardware 38 also outputs second analog signals 37 based on the signed MAC operations, wherein a plurality of ADCs 40 (40a-40n) convert the second analog signals 37 into digital accumulation signals (e.g., output activations/OAs). The digital accumulation signals may be sent to an output data buffer 42. In an embodiment, the DACs 34, the MAC hardware 38, and the ADCs 40 are adjusted to accept differential signals.
Of particular note is that conventional calibration modules 44 may be eliminated from the CiM processor 30 due to the intrinsic PVT and noise tolerance provided by the differential signals. Additionally, the differential signals result in a voltage output range 46 of the CiM processor 30 that is twice that of a conventional single-ended output range 48. Moreover, a noise profile 50 of the CiM processor 30 is symmetric around the value of zero.
With continuing reference to
With continuing reference to
More particularly, the two-rail capacitor ladder network 62, 64 includes two C-2C ladders placed side-by-side (e.g., implemented as passive metal-oxide-metal/MOM capacitors above a standard memory cell active region), because the differential structure uses two standalone signals to form the differential output. The two-rail ladder network 62, 64 may execute multiplication operations, and is a capacitor network of the type used in digital-to-analog converter (DAC) designs to provide analog voltage outputs. As best shown in
The switches are controlled by digital bits and connected to either a fixed reference voltage VREF or one of VIN,P or VIN,N. Ratioed by the serial capacitors 2C, the contributions of the branches 61, 63 are binary weighted along the two-rail ladder network 62, 64 and superimposed onto the output node of the two-rail ladder network 62, 64.
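The binary weighting of the branch contributions may be sketched with an idealized behavioral model of a single rail. The model below is an assumption made for illustration: it ignores parasitic capacitance and mismatch, orders the bits least-significant first, and treats the ladder termination as tied to the reference voltage. The function name and argument names are hypothetical.

```python
def ladder_output(bits, v_in: float, v_ref: float) -> float:
    """Idealized output of one N-bit C-2C ladder rail.

    bits[0] is the LSB. A switch whose bit is 1 selects v_in,
    and a switch whose bit is 0 selects v_ref.
    """
    n = len(bits)
    v_out = 0.0
    for i, bit in enumerate(bits):
        branch_voltage = v_in if bit else v_ref
        # Each step toward the MSB doubles the branch contribution.
        v_out += branch_voltage * 2.0 ** (i - n)
    # Assumed termination branch carries the remaining 2^-N weight at v_ref.
    v_out += v_ref * 2.0 ** (-n)
    return v_out


if __name__ == "__main__":
    # Code 0b101 (LSB first: 1, 0, 1) with v_ref = 0 scales v_in by 5/8.
    print(ladder_output([1, 0, 1], v_in=1.0, v_ref=0.0))
```

Under these assumptions the output reduces to v_ref + (v_in − v_ref) × code / 2^N, so the stored digital code multiplies the input voltage, which is the multiplication performed by the ladder.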
The data stored in the memory cells are shared by both rails to control those switches, except for the MSB in the word. The MSB, assigned as the sign bit (one for negative values, zero for positive values), controls transmission gate based butterfly switch circuitry 66 that steers between VIN,P and VIN,N. The GND node in the single-ended C-2C ladder is replaced by a reference node with a voltage level of half VIN,P (VIN,P/2). The input data is arranged in the “signed magnitude” format, while the final output of the ladder network, VOD, is formed by the difference between VOUT,P and VOUT,N, in a range between −1 and +1. As a result, the equation of the differential output VOD for an N-bit ladder is given below:
With continuing reference to
Turning now to
Turning now to
Illustrated processing block 92 provides for generating, by a plurality of DACs coupled to a differential signal path, first analog signals based on digital activation signals. In an embodiment, the plurality of DACs include one or more of current steering DACs or differential resistive DACs. Block 94 conducts, by the differential signal path, signed MAC operations on first analog signals and multibit weight data stored in the differential signal path. In one example, the multibit weight data is in a signed magnitude format. Moreover, block 94 may involve bypassing a remapping of the multibit weight data. Block 96 outputs, by the differential signal path, second analog signals based on the signed MAC operations. In an embodiment, block 96 also involves steering, by butterfly switch circuitry of the differential signal path, the second analog signals between a positive voltage and a negative voltage based on MSBs in the multibit weight data. Additionally, blocks 94 and 96 may bypass, by the differential signal path, a calibration of the first analog signals and the second analog signals. Block 98 generates, by a plurality of ADCs coupled to the differential signal path, digital accumulation signals based on the second analog signals. In an embodiment, the plurality of ADCs include differential SAR converters.
The method 90 therefore enhances performance at least to the extent that supporting positive/negative signals and signed multiplication with differential signals enables negative values to be represented in the analog domain (e.g., which in turn facilitates ML and NN applications). Additionally, the differential signal doubles the dynamic range of the CiM processor, which further enhances performance. Moreover, conducting the signed MAC operations in the differential signal path enables PVT robust computations and the elimination of costly calibration units. Indeed, the differential signal path provides immunity to supply noise (e.g., common mode random error), which cannot be calibrated with a single-ended signal.
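The signed MAC operation of the method 90 may be sketched behaviorally as follows. This is a minimal sketch under stated assumptions and not a description of the circuit: the butterfly switch is modeled as swapping the two rails when the sign bit is set, the weight magnitude is modeled as an ideal scaling factor, and the DAC and ADC stages are omitted. The function names and the 8-bit weight width are hypothetical.

```python
def butterfly(v_p: float, v_n: float, sign_bit: int):
    """Swap the positive and negative rails when the sign bit is 1."""
    return (v_n, v_p) if sign_bit else (v_p, v_n)


def signed_mac(activations, weights, bits: int = 8) -> float:
    """Accumulate differential products of activations and signed weights.

    Weights are integers interpreted in signed-magnitude form:
    the sign selects the rail steering, the magnitude scales the product.
    """
    full_scale = (1 << (bits - 1)) - 1
    accumulated = 0.0
    for x, w in zip(activations, weights):
        sign_bit = 1 if w < 0 else 0
        magnitude = abs(w) / full_scale  # normalized to the range 0..1
        # Differential activation: +x/2 on one rail, -x/2 on the other.
        v_p, v_n = butterfly(+x / 2, -x / 2, sign_bit)
        accumulated += magnitude * (v_p - v_n)
    return accumulated


if __name__ == "__main__":
    # One full-scale positive weight and one full-scale negative weight.
    print(signed_mac([1.0, 0.5], [127, -127]))
```

In this model a negative weight reverses the polarity of the differential product without any remapping of the weight code, consistent with the bypassing of remapping described above.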
Illustrated processing block 102 performs, by a first capacitor ladder network coupled to butterfly switch circuitry of the differential signal path, multiplication operations with respect to a positive voltage. Additionally, block 104 performs, by a second capacitor ladder network coupled to the butterfly switch circuitry, multiplication operations with respect to a negative voltage. The method 100 therefore further enhances performance at least to the extent that the first and second capacitor ladder networks obviate the need for a separate mid-rail voltage reference (e.g., enabling the use of reference-less ADCs).
Turning now to
In the illustrated example, the system 280 includes a host processor 282 (e.g., central processing unit/CPU) having an integrated memory controller (IMC) 284 that is coupled to a system memory 286 (e.g., dual inline memory module/DIMM). In an embodiment, an IO (input/output) module 288 is coupled to the host processor 282. The illustrated IO module 288 communicates with, for example, a display 290 (e.g., touch screen, liquid crystal display/LCD, light emitting diode/LED display), mass storage 302 (e.g., hard disk drive/HDD, optical disc, solid state drive/SSD) and a network controller 292 (e.g., wired and/or wireless). In one example, the network controller 292 obtains an input data stream associated with an AI, ML or NN application. The host processor 282 may be combined with the IO module 288, a graphics processor 294, and an AI accelerator 296 (e.g., CiM processor) into a system on chip (SoC) 298.
In an embodiment, the AI accelerator 296 includes logic 300 having a differential signal path that performs one or more aspects of the method 90 (
The logic 354 may be implemented at least partly in configurable or fixed-functionality hardware. In one example, the logic 354 includes transistor channel regions that are positioned (e.g., embedded) within the substrate(s) 352. Thus, the interface between the logic 354 and the substrate(s) 352 may not be an abrupt junction. The logic 354 may also be considered to include an epitaxial layer that is grown on an initial wafer of the substrate(s) 352.
Example 1 includes a performance-enhanced computing system comprising a network controller and a processor coupled to the network controller, wherein the processor includes logic coupled to one or more substrates, the logic including a differential signal path to conduct signed multiply-accumulate (MAC) operations on first analog signals and multibit weight data stored in the differential signal path, and output second analog signals based on the signed MAC operations.
Example 2 includes the computing system of Example 1, wherein the multibit weight data is in a signed magnitude format.
Example 3 includes the computing system of Example 1, wherein the differential signal path includes butterfly switch circuitry to steer the second analog signals between a positive voltage and a negative voltage based on most significant bits in the multibit weight data.
Example 4 includes the computing system of Example 3, wherein the differential signal path further includes a first capacitor ladder network coupled to the butterfly switch circuitry, wherein the first capacitor ladder network is to perform multiplication operations with respect to the positive voltage, and a second capacitor ladder network coupled to the butterfly switch circuitry, wherein the second capacitor ladder network is to perform multiplication operations with respect to the negative voltage.
Example 5 includes the computing system of Example 1, wherein the differential signal path is to bypass a remapping of the multibit weight data.
Example 6 includes the computing system of Example 1, wherein the differential signal path is to bypass a calibration of the first analog signals and the second analog signals.
Example 7 includes the computing system of any one of Examples 1 to 6, wherein the logic further includes a plurality of digital to analog converters (DACs) coupled to the differential signal path, the plurality of DACs to generate the first analog signals based on digital activation signals, and wherein the plurality of DACs include one or more of current steering DACs or differential resistive DACs.
Example 8 includes the computing system of any one of Examples 1 to 7, wherein the logic further includes a plurality of analog to digital converters (ADCs) coupled to the differential signal path, the plurality of ADCs to generate digital accumulation signals based on the second analog signals, and wherein the plurality of ADCs include differential successive approximation register converters.
Example 9 includes a semiconductor apparatus comprising one or more substrates, and logic coupled to the one or more substrates, wherein the logic includes a differential signal path and is implemented at least partly in one or more of configurable or fixed-functionality hardware, the differential signal path to conduct signed multiply-accumulate (MAC) operations on first analog signals and multibit weight data stored in the differential signal path, and output second analog signals based on the signed MAC operations.
Example 10 includes the semiconductor apparatus of Example 9, wherein the multibit weight data is in a signed magnitude format.
Example 11 includes the semiconductor apparatus of Example 9, wherein the differential signal path includes butterfly switch circuitry to steer the second analog signals between a positive voltage and a negative voltage based on most significant bits in the multibit weight data.
Example 12 includes the semiconductor apparatus of Example 11, wherein the differential signal path further includes a first capacitor ladder network coupled to the butterfly switch circuitry, wherein the first capacitor ladder network is to perform multiplication operations with respect to the positive voltage, and a second capacitor ladder network coupled to the butterfly switch circuitry, wherein the second capacitor ladder network is to perform multiplication operations with respect to the negative voltage.
Example 13 includes the semiconductor apparatus of Example 9, wherein the differential signal path is to bypass a remapping of the multibit weight data.
Example 14 includes the semiconductor apparatus of Example 9, wherein the differential signal path is to bypass a calibration of the first analog signals and the second analog signals.
Example 15 includes the semiconductor apparatus of any one of Examples 9 to 14, wherein the logic further includes a plurality of digital to analog converters (DACs) coupled to the differential signal path, the plurality of DACs to generate the first analog signals based on digital activation signals, and wherein the plurality of DACs include one or more of current steering DACs or differential resistive DACs.
Example 16 includes the semiconductor apparatus of any one of Examples 9 to 15, wherein the logic further includes a plurality of analog to digital converters (ADCs) coupled to the differential signal path, the plurality of ADCs to generate digital accumulation signals based on the second analog signals, and wherein the plurality of ADCs include differential successive approximation register converters.
Example 17 includes the semiconductor apparatus of any one of Examples 9 to 15, wherein the logic coupled to the one or more substrates includes transistor channel regions that are positioned within the one or more substrates.
Example 18 includes a method of operating a compute in memory (CiM) processor, the method comprising conducting, by a differential signal path, signed multiply-accumulate (MAC) operations on first analog signals and multibit weight data stored in the differential signal path, and outputting, by the differential signal path, second analog signals based on the signed MAC operations.
Example 19 includes the method of Example 18, wherein the multibit weight data is in a signed magnitude format.
Example 20 includes the method of Example 18, further including steering, by butterfly switch circuitry of the differential signal path, the second analog signals between a positive voltage and a negative voltage based on most significant bits in the multibit weight data.
Example 21 includes the method of Example 20, further including performing, by a first capacitor ladder network coupled to the butterfly switch circuitry, multiplication operations with respect to the positive voltage, and performing, by a second capacitor ladder network coupled to the butterfly switch circuitry, multiplication operations with respect to the negative voltage.
Example 22 includes the method of Example 18, further including bypassing, by the differential signal path, a remapping of the multibit weight data.
Example 23 includes the method of Example 18, further including bypassing, by the differential signal path, a calibration of the first analog signals and the second analog signals.
Example 24 includes the method of any one of Examples 18 to 23, further including generating, by a plurality of digital to analog converters (DACs) coupled to the differential signal path, the first analog signals based on digital activation signals, wherein the plurality of DACs include one or more of current steering DACs or differential resistive DACs.
Example 25 includes the method of any one of Examples 18 to 23, further including generating, by a plurality of analog to digital converters (ADCs) coupled to the differential signal path, digital accumulation signals based on the second analog signals, wherein the plurality of ADCs include differential successive approximation register converters.
Example 26 includes an apparatus comprising means for performing the method of any one of Examples 18 to 25.
Embodiments are applicable for use with all types of semiconductor integrated circuit (“IC”) chips. Examples of these IC chips include but are not limited to processors, controllers, chipset components, programmable logic arrays (PLAs), memory chips, network chips, systems on chip (SoCs), SSD/NAND controller ASICs, and the like. In addition, in some of the drawings, signal conductor lines are represented with lines. Some may be different, to indicate more constituent signal paths, have a number label, to indicate a number of constituent signal paths, and/or have arrows at one or more ends, to indicate primary information flow direction. This, however, should not be construed in a limiting manner. Rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit. Any represented signal lines, whether or not having additional information, may actually comprise one or more signals that may travel in multiple directions and may be implemented with any suitable type of signal scheme, e.g., digital or analog lines implemented with differential pairs, optical fiber lines, and/or single-ended lines.
Example sizes/models/values/ranges may have been given, although embodiments are not limited to the same. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured. In addition, well known power/ground connections to IC chips and other components may or may not be shown within the figures, for simplicity of illustration and discussion, and so as not to obscure certain aspects of the embodiments. Further, arrangements may be shown in block diagram form in order to avoid obscuring embodiments, and also in view of the fact that specifics with respect to implementation of such block diagram arrangements are highly dependent upon the computing system within which the embodiment is to be implemented, i.e., such specifics should be well within purview of one skilled in the art. Where specific details (e.g., circuits) are set forth in order to describe example embodiments, it should be apparent to one skilled in the art that embodiments can be practiced without, or with variation of, these specific details. The description is thus to be regarded as illustrative instead of limiting.
The term “coupled” may be used herein to refer to any type of relationship, direct or indirect, between the components in question, and may apply to electrical, mechanical, fluid, optical, electromagnetic, electromechanical or other connections. In addition, the terms “first”, “second”, etc. may be used herein only to facilitate discussion, and carry no particular temporal or chronological significance unless otherwise indicated.
As used in this application and in the claims, a list of items joined by the term “one or more of” may mean any combination of the listed terms. For example, the phrase “one or more of A, B or C” may mean A; B; C; A and B; A and C; B and C; or A, B and C.
Those skilled in the art will appreciate from the foregoing description that the broad techniques of the embodiments can be implemented in a variety of forms. Therefore, while the embodiments have been described in connection with particular examples thereof, the true scope of the embodiments should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, specification, and following claims.