Multiply-and-accumulate (MAC) units are building blocks of digital processing units used in many applications, including artificial intelligence (AI) for edge devices, signal/image processing, convolution, and filtering. Recently, the focus on AI implementation on edge devices has been increasing as edge devices improve and AI techniques advance. AI on edge devices is capable of addressing difficult machine learning problems using deep neural network (DNN) architectures. However, DNN algorithms are computationally intensive, requiring large data sets and high memory bandwidth. This results in a memory access bottleneck that introduces considerable energy and performance overheads.
The following presents a simplified summary of some embodiments of the invention in order to provide a basic understanding of the invention. This summary is not an extensive overview of the invention. It is not intended to identify key/critical elements of the invention or to delineate the scope of the invention. Its sole purpose is to present some embodiments of the invention in a simplified form as a prelude to the more detailed description that is presented later.
In many embodiments, a cross-coupling capacitor processing unit (C3PU) supports analog mixed signal in-memory computing to perform multiply-and-accumulate (MAC) operations. In embodiments, the C3PU includes a capacitive unit, a CMOS transistor, and a voltage-to-time converter (VTC). The capacitive unit can serve as a computational element that holds a multiplier operand and performs multiplication once an input voltage corresponding to a multiplicand is applied to an input terminal of the VTC. The input voltage is converted by the VTC to a pulse width signal. The CMOS transistor transfers the multiplication result into a current. A demonstrator including a 5×4 array of the C3PUs is presented. The demonstrator is capable of implementing 4 MACs in a single cycle. The demonstrator was verified using Monte Carlo simulation in 65 nm technology. The 5×4 C3PU demonstrator consumed an energy of 66.4 fJ/MAC at a 0.3 V voltage supply and exhibited an error of 5.4%. Compared to a digital-based 8×4-bit fixed point MAC unit with a similar error value, the demonstrator consumed 3.4 times less energy and occupied 2.4 times less area. The 5×4 C3PU demonstrator was used to implement an artificial neural network (ANN) for performing iris flower classification and achieved a 90% classification accuracy, compared to an ideal accuracy of 96.67% obtained using MATLAB.
Deep neural networks (DNNs) are approximate in nature and many AI applications can tolerate lower accuracy. This opens the opportunity for potential tradeoffs between energy efficiency, accuracy, and latency.
One direction to eliminate the need for explicit memory access is to utilize in-memory computing (IMC) architectures, which have significant advantages in energy efficiency and throughput over conventional counterparts based on the von Neumann architecture. Both digital and analog approaches for IMC have been proposed. An artificial neural network (ANN) using an analog implementation has the potential to outperform digital-based neural networks in energy efficiency and speed. One key component in an analog implemented ANN is a synaptic memory that is utilized for weight storage. Several weight storage approaches have been proposed, including: 1) traditional volatile memory, including SRAM and DRAM; 2) non-volatile memory, including CMOS-based flash memory, emerging technologies, and resistive RAM (RRAM) such as the memristor; and 3) analog mixed signal (AMS) circuits using capacitors and transistors. Both SRAM and DRAM are limited to high power devices and are not suitable for duty-cycled edge devices. Flash memory traps the weight charges in a floating gate that is electrically isolated from the control gate. The emerging memristor technology, on the other hand, stores the weight as a conductance value. Memristors, however, suffer from low endurance and sneak paths, which result in state disturbance. AMS circuits using capacitors and transistors have been demonstrated for storing weights as charges and for controlling the conductance of the transistors. AMS, however, requires a relatively large and complex biasing circuit to control the charges on the capacitor, in addition to suffering non-linearity due to variations of the drain-to-source voltage of the transistor. SRAM has also been used as memory together with a cross-coupling capacitor as a computational element to perform binary MAC operations using bitwise XNOR gates.
The advantage of cross-coupling computation is that it helps reduce the inaccuracy of AMS circuits, since the capacitor exhibits lower power consumption and process variation.
A cross-coupling capacitor (C3) processing unit, hence named the C3PU, coupled with voltage-to-time converter (VTC) circuitry, is described herein to implement the AMS MAC operation. The C3PU utilizes a cross-coupling capacitor for IMC as both a memory and a computational element to perform the AMS MAC operation. The C3PU can be utilized in applications that rely heavily on vector-matrix multiplications, including but not limited to ANNs, CNNs, and DSP. The C3PU is suitable for applications with fixed coefficients, such as the weights of a pre-trained CNN or image compression.
In many embodiments, a 5.7 μW low power voltage-to-time converter (VTC) is implemented at the input voltage terminal of the C3PU to generate a modulated pulse width signal. In many embodiments, the VTC is used to produce a linear multiplication operation.
A 5×4 crossbar architecture based on C3PU was designed and simulated in 65 nm technology to employ 4 MACs where each MAC performs 5 multiplications and 4 additions. Simulation results show that the energy efficiency of the 5×4 C3PU is 66.4 fJ/MAC at 0.3 V voltage supply with an error compared to computation in MATLAB of less than 5.4%.
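The vector-matrix multiplication performed by the crossbar can be sketched behaviorally. The following Python model is illustrative only: the function name, the capacitance values, and the inputs are assumptions, not values from the design; it shows how a 5×4 array realizes 4 MACs (each comprising 5 multiplications and 4 additions) in a single cycle.

```python
def c3pu_mac(v_in, c_c, c_b, c_g):
    """Behavioral model of one computation cycle of a 5x4 C3PU crossbar.

    Each cell multiplies its input by the capacitance ratio
    Cc / (Cc + Cb + Cg); each bitline (column) sums its products, so the
    5x4 array yields 4 MAC outputs per cycle.
    """
    n, m = len(v_in), len(c_c[0])
    out = [0.0] * m
    for j in range(m):                       # one bitline per column
        for i in range(n):
            ratio = c_c[i][j] / (c_c[i][j] + c_b[i][j] + c_g[i][j])
            out[j] += v_in[i] * ratio        # multiply, then accumulate
    return out

# Illustrative values only: 5 inputs and 5x4 capacitor banks (fF)
v_in = [0.1, 0.4, 0.7, 0.9, 1.0]
c_c = [[10 + i + j for j in range(4)] for i in range(5)]   # coupling caps
c_b = [[2.0] * 4 for _ in range(5)]                        # bias caps
c_g = [[1.5] * 4 for _ in range(5)]                        # gate caps

print(c3pu_mac(v_in, c_c, c_b, c_g))        # 4 MAC outputs in one cycle
```

In hardware the multiplications occur simultaneously via charge sharing; the explicit loops above only serve to make the arithmetic visible.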
A 5×4 crossbar architecture was used to implement a two-layer ANN for performing iris flower classification. The synaptic weights were trained offline and then mapped into capacitance ratio values for the inference phase. The ANN classifier circuit was designed and simulated in 65 nm CMOS technology. It achieved a high inference accuracy of 90% compared to a baseline accuracy of 96.67% obtained from MATLAB.
In the following description, various embodiments of the present invention will be described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the embodiments. However, it will also be apparent to one skilled in the art that the present invention may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the embodiment being described.
According to various embodiments of the present disclosure, techniques for in-memory computing (IMC) can include implementations of synaptic memory that is utilized for weight storage in an artificial neural network in an analog system. According to certain specific embodiments, to implement analog MAC operation, a cross-coupling capacitor processing unit (C3PU) is provided having a circuit design using a crossbar architecture.
Example C3PU Circuit and Operation
The following sections discuss the design details and operation of an example C3PU. A coupling capacitance is used to apply a voltage to the gate of the transistor. Current passes through the transistor based on the voltage applied to its gate.
Turning now to the drawing figures in which similar reference identifiers refer to similar elements,
The value of Vg determines the operational mode of the transistor 102 and affects its trans-conductance value and hence its linearity.
To overcome the foregoing issues that significantly affect the functionality of the C3PU multiplier, the analog input voltage can be processed in the time domain rather than the voltage domain. This can be achieved using a voltage-to-time converter (VTC) 106 as shown in
Presenting the data Vin in the time domain has several advantages: both time and capacitance scale better with technology than voltage. In addition, the time domain exhibits fewer variations and provides better noise immunity compared to the voltage domain, where the signal-to-noise ratio is degraded due to voltage scaling.
During the sampling phase as shown in
as given in Eq. 5. The Iavg value depends on the amount of charge stored in the capacitors, which varies linearly with Vin given that VDDvtc is fixed. Thus, td has a linear relationship with Vin. Eq. 6 shows the time delay when Vin=VDDvtc, which depends on the difference between VDDvtc and Vsp.
The VTC circuit 106 was designed, implemented, and simulated in 65 nm industry standard CMOS technology. The input voltage is set between 0.1 V and 1.0 V at VDDvtc=1.0 V so that linear voltage-to-time conversion is achieved. The capacitors C1 and C2 and the transistor M4 are sized to support a minimum time delay of 165 ps at the minimum Vin of 0.1 V. Metal-insulator-metal (MIM) capacitors of C1=27 fF and C2=10 fF are utilized. The M4 size of 500 nm/140 nm, controlled by its gate voltage of Vb=0.5 V, provides a current source of 14 μA. The inverter is carefully sized to provide the desired Vsp; hence, the aspect ratio of M9 is 5 times the aspect ratio of M8 such that Vsp=0.35 V. Table 1 summarizes the specifications of the VTC design.
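The linear voltage-to-time conversion can be modeled behaviorally. In the sketch below, the endpoint values follow the description herein (165 ps at the minimum Vin of 0.1 V, and a maximum pulse width of approximately 2 ns at Vin=1.0 V); the straight-line interpolation between these endpoints and the function name are modeling assumptions.

```python
def vtc_pulse_width(v_in, td_min=165e-12, td_max=2e-9,
                    v_min=0.1, v_max=1.0):
    """Behavioral model of the VTC: linear voltage-to-time conversion.

    Maps an input voltage in [v_min, v_max] onto a pulse width in
    [td_min, td_max] seconds. The linearity holds only within the
    VTC's designed input range, so out-of-range inputs are rejected.
    """
    if not v_min <= v_in <= v_max:
        raise ValueError("input voltage outside the VTC's linear range")
    frac = (v_in - v_min) / (v_max - v_min)
    return td_min + frac * (td_max - td_min)

print(vtc_pulse_width(0.1))   # minimum delay: 1.65e-10 s (165 ps)
print(vtc_pulse_width(1.0))   # maximum delay: 2e-09 s (2 ns)
```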
To quantify the impact of process variation on the pulse width value, a Monte Carlo Spice simulation with 200 samples and a mismatch model was performed.
Example C3PU Crossbar Architecture for IMC Applications
The operation of the example 5×4 C3PU crossbar architecture 200 proceeds in two phases: computation and isolation. In the computation phase, when the clock signal Vclk=1, the MAC operation is achieved by multiplying the Vpw,i pulse widths by the capacitance ratios Cc,ij/(Cc,ij+Cb,ij+Cg,ij). The transistors then transfer this multiplication into currents that are summed on each bitline. The summed currents are integrated over a period of time t1-t2 using a virtual ground current integrator op-amp in order to provide the outputs as voltage levels V1-4, as given in Eq. 7.
The value of the output voltages depends on two main parameters: a) the time t1-t2 over which the current is accumulated and b) the capacitor size Cj. The time t1-t2 is fixed and represents the pulse width of the clock. This time is set to be greater than the maximum pulse width of Vpw,i. The maximum pulse width of Vpw is approximately 2 ns when the maximum input voltage Vin=1.0 V. Thus, the pulse width of the clock can be set to 3 ns to ensure the computation and accumulation of the currents. In addition, the Cj size plays an important role in determining the scaling factor that is required to allow V1-4 to approximately reach the expected output levels. The scaling factor is calculated by dividing the obtained MAC output voltages V1-4 by the expected values, and the Cj size is set accordingly. Once the approximate voltages are achieved, the C3PU elements are isolated from the outputs by setting Vclk=0 to enter the isolation phase. The isolation phase is essential to allow the functionality of the VTC and to initialize the output stage of a virtual ground op-amp 203. The total period T for the MAC calculation, including the computation and isolation times, is 6 ns. Table 2 shows the specifications of the C3PU crossbar architecture 200.
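The scaling-factor calibration described above can be sketched numerically. The voltage values below are illustrative, not simulation results; averaging the per-output ratios is one reasonable choice (a least-squares fit would work equally well).

```python
def calibrate_scale(observed, expected):
    """Estimate a single scale factor relating observed MAC output
    voltages V1-4 to their expected values, as used to size Cj.

    Returns the mean of the per-output ratios observed/expected.
    """
    ratios = [o / e for o, e in zip(observed, expected) if e != 0]
    return sum(ratios) / len(ratios)

observed = [0.12, 0.25, 0.37, 0.50]   # integrator outputs (V), illustrative
expected = [0.24, 0.50, 0.74, 1.00]   # reference outputs (V), illustrative
scale = calibrate_scale(observed, expected)
corrected = [v / scale for v in observed]   # outputs after rescaling
print(scale)        # 0.5
print(corrected)
```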
The 5×4 C3PU crossbar architecture 200 can be implemented employing 65 nm technology. The input voltages can be fed to the C3PU crossbar architecture 200 for 30 continuous clock cycles. Each cycle can have different sets of input voltage levels that are converted into modulated pulse width signals.
In order to evaluate the 5×4 C3PU crossbar architecture 200, fixed point (FXP) crossbar units were implemented using an ASIC design flow in 65 nm CMOS. Table 3 compares the performance of 3×3-bit, 4×4-bit, 8×4-bit, and 8×8-bit FXP crossbars to the 5×4 C3PU crossbar 200. The error of the C3PU crossbar 200, 5.6%, is close to the error of the 8×4-bit MAC unit, 6.52%. However, the C3PU crossbar 200 has the advantage of consuming 3.4 times less energy and 2.4 times less area than the 8×4-bit MAC unit.
C3PU Demonstrator For ANN Applications
The advantage of the C3PU 100 is demonstrated by accelerating the MAC operations found in an ANN using the iris flower database. The iris flower data set consists of 150 samples in total, divided equally among the three classes of iris flower, namely Setosa, Versicolour, and Virginica. Each sample holds the following features, all in cm: sepal length, sepal width, petal length, and petal width. The architecture of the ANN consists of two layers: four nodes in the input layer, each representing one of the input features, followed by three hidden neurons and, lastly, three output neurons, one per class. In order to implement the MAC operations in the ANN, the iris features are treated as the first operands and are mapped into voltage values. The weights are treated as the second operands and are stored as capacitance ratios in the capacitive unit of the C3PU. A simple linear mapping algorithm is used between the neural weights and the capacitance ratios.
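The 4-3-3 topology described above can be summarized as a plain software forward pass. The weight values and the input sample below are illustrative placeholders, not the trained weights; the ReLU hidden activation and softmax output follow the activation functions named later in this description.

```python
import math

def forward(x, w_hidden, w_out):
    """Forward pass of a 4-3-3 iris ANN: 4 input features, 3 hidden
    neurons with ReLU, 3 output neurons with softmax."""
    h = [max(0.0, sum(xi * wij for xi, wij in zip(x, row)))
         for row in w_hidden]                       # hidden layer + ReLU
    z = [sum(hi * wij for hi, wij in zip(h, row)) for row in w_out]
    m = max(z)
    e = [math.exp(v - m) for v in z]                # numerically stable softmax
    s = sum(e)
    return [v / s for v in e]                       # class probabilities

x = [5.1, 3.5, 1.4, 0.2]   # sepal length/width, petal length/width (cm)
w_hidden = [[0.2, -0.1, 0.4, 0.3],                 # illustrative weights
            [0.1, 0.3, -0.2, 0.5],
            [-0.3, 0.2, 0.1, 0.1]]
w_out = [[0.5, -0.2, 0.1],
         [0.2, 0.4, -0.1],
         [-0.1, 0.3, 0.6]]
probs = forward(x, w_hidden, w_out)
print(probs)               # three class probabilities summing to 1
```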
The training phase is performed offline using MATLAB by dividing the data set between training and testing as 80% and 20%, respectively. Post-training weights can have both positive and negative polarities. Hence, before mapping these weights into capacitance ratio values, they need to be shifted by the minimum weight value wmin. After performing the multiplication between the inputs and the shifted weights, the effect of the shifting operation must be removed by subtracting the term Σi=1n INi×|wmin| from each output, where INi is the i-th input to the hidden/output layer and n is the number of input nodes. Mapping such an operation onto the C3PU architecture requires adding an additional column to the hidden and output crossbars to store the wmin value for each layer.
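The shift-and-correct scheme described above can be verified with a small numerical sketch. The function names and the weight values are illustrative; the point is that computing with non-negative shifted weights and then subtracting Σi INi×|wmin| recovers the signed result exactly.

```python
def shift_weights(weights):
    """Shift signed weights into the non-negative range required for
    capacitance-ratio storage; return the shifted matrix and |w_min|."""
    w_min = min(min(row) for row in weights)
    shifted = [[w - w_min for w in row] for row in weights]
    return shifted, abs(w_min)

def mac_with_shift_correction(inputs, weights):
    """Compute inputs . weights using only shifted (non-negative) weights,
    then remove the shift by subtracting sum(inputs) * |w_min| from every
    output - the term the extra crossbar column implements in hardware."""
    shifted, w_min_abs = shift_weights(weights)
    correction = sum(inputs) * w_min_abs
    raw = [sum(x * w for x, w in zip(inputs, col)) for col in zip(*shifted)]
    return [r - correction for r in raw]

inputs = [0.3, 0.7, 0.5]                           # illustrative inputs
weights = [[0.4, -0.6], [-0.2, 0.1], [0.8, -0.3]]  # signed weights
print(mac_with_shift_correction(inputs, weights))
# reference: the direct signed vector-matrix product
print([sum(x * w for x, w in zip(inputs, col)) for col in zip(*weights)])
```

Both printed vectors agree, confirming that the added column exactly cancels the effect of the weight shift.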
Once V1-4 are generated, the classifier switches to phase 2 in order to pass them to the second layer. Before that, the impact of the shift operation applied to the weights needs to be removed by subtracting V4 from V1-3. The subtracted outputs are then passed through a ReLU activation function. In the ANN classifier, the subtraction operation and the ReLU function are implemented in the time domain. To achieve such an implementation, V1-4 are first converted to pulse width modulated signals using VTCs and then passed to the time domain subtractor and ReLU activation function to generate Vo-pw1-3. These output signals may have small pulse widths due to the subtraction operation, which do not correspond to the expected subtraction outputs. Therefore, the pulse widths of Vo-pw1-3 are scaled by a constant factor determined from the expected subtraction outputs of the ANN in MATLAB and the observed outputs of the ANN using the C3PU. After that, the scaled pulse width signals Vo-pw1-3-s are fed to the 4×4 C3PU weight matrix. The output voltages from the weight matrix, Vo1-4, are passed to the subtractor and then to a softmax function in order to generate the proper class based on the input features.
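The time-domain subtraction, ReLU, and rescaling steps described above can be sketched as a single function. The pulse width values and the scale factor below are illustrative assumptions, not design values; subtracting the reference pulse width and clamping negative results to zero is what realizes the ReLU in the time domain.

```python
def subtract_and_relu(pulse_widths, ref_pw, scale):
    """Model the time-domain shift removal and ReLU between ANN layers:
    subtract the reference pulse width (carrying the w_min column) from
    each output pulse width, clamp negative differences to zero (ReLU),
    then rescale the narrow results by a calibration constant."""
    return [max(0.0, pw - ref_pw) * scale for pw in pulse_widths]

pw = [3.2e-9, 2.1e-9, 4.0e-9]   # V1-3 as pulse widths (s), illustrative
ref = 2.5e-9                    # V4's pulse width: the shift term (s)
print(subtract_and_relu(pw, ref, scale=4.0))
```

The second input falls below the reference, so its output pulse width is clamped to zero, exactly as a ReLU would discard a negative pre-activation.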
The ANN classifier has been designed and simulated in 65 nm CMOS technology with a supply voltage of 1 V, except for the 5×4 and 4×4 weight matrices, which operate at a supply voltage of 0.3 V. The input voltages Vin1-4 have a range of 0.0 V to 1.0 V, in addition to Vbias=1.0 V. The five input voltages are converted into modulated pulse width signals Vpw1-5 that have pulse widths in the range of 165 ps to 2 ns. The modulated pulse width input signals Vo1-4 of the second weight matrix have pulse widths in the range of 1.6 ns to 7.5 ns. The pulse width T1 of Vclk is set to 3 ns and the pulse width T2 of ˜Vclk-d is set to 9 ns. The example ANN classifier using the C3PU shown in
The advantage of utilizing a cross-coupling capacitor as both a storage and a processing element is that it can simultaneously serve as high density and low energy storage. One operand of the C3PU can be stored in the capacitive unit, while the second operand can be provided as a pulse width signal modulated using the voltage-to-time converter. The multiplication outputs can be transferred into output currents using CMOS transistors and then integrated using a current integrator op-amp. The 5×4 C3PU crossbar 200 was developed to run all data simultaneously, realizing fully parallel vector-matrix multiplication in one cycle. The energy consumption of the 5×4 C3PU is 66.4 fJ/MAC at a 0.3 V voltage supply, with an error of 5.4%, in 65 nm technology. The inference accuracy of the ANN architecture has been evaluated using the example C3PU on the iris flower data set, achieving a 90% classification accuracy.
Other variations are within the spirit of the present invention. Thus, while the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific form or forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention, as defined in the appended claims.
The use of the terms “a” and “an” and “the” and similar referents in the context of describing the invention (especially in the context of the following claims) are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The terms “comprising,” “having,” “including,” and “containing” are to be construed as open-ended terms (i.e., meaning “including, but not limited to,”) unless otherwise noted. The term “connected” is to be construed as partly or wholly contained within, attached to, or joined together, even if there is something intervening. Recitation of ranges of values herein are merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention.
Preferred embodiments of this invention are described herein, including the best mode known to the inventors for carrying out the invention. Variations of those preferred embodiments may become apparent to those of ordinary skill in the art upon reading the foregoing description. The inventors expect skilled artisans to employ such variations as appropriate, and the inventors intend for the invention to be practiced otherwise than as specifically described herein. Accordingly, this invention includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the invention unless otherwise indicated herein or otherwise clearly contradicted by context.
All references, including publications, patent applications, and patents, cited herein are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and were set forth in its entirety herein.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IB2021/054330 | 5/19/2021 | WO |
Number | Date | Country | |
---|---|---|---|
63027681 | May 2020 | US |