The present invention relates generally to a method, system, and apparatus for a resistive processing unit, and more particularly relates to a method, system, and apparatus for a circuit methodology for a highly linear and symmetric resistive processing unit.
Deep Neural Networks (DNNs) have demonstrated significant commercial success in recent years, with performance exceeding sophisticated prior methods in speech and object recognition. However, training DNNs is an extremely computationally intensive task that requires massive computational resources and enormous training time, which hinders their further application. By storing and processing the weight values locally on nodes connected into an array, the time complexity of the problem can be reduced to a constant time independent of the array size. However, the addressable problem size is limited to the number of nodes in the array, which is challenging to scale up to billions even with the most advanced CMOS (complementary metal-oxide-semiconductor) technologies.
Recent implementations have the problem that the estimated acceleration factors are limited by device specifications intrinsic to their application as NVM (non-volatile memory) cells.
Device characteristics usually considered beneficial or irrelevant for memory applications, such as high on/off ratio, digital bit-wise storage, and asymmetrical set and reset operations, become limitations for the acceleration of DNN training. These non-ideal device characteristics can potentially be compensated for by proper design of the peripheral circuits and the overall system, but only partially and at the cost of significantly increased operational time.
There is a need to provide an RPU circuit which can be highly linear and symmetric in order to implement practical ANNs (artificial neural networks).
In view of the foregoing and other problems, disadvantages, and drawbacks of the aforementioned background art, an exemplary aspect of the present invention provides a system, apparatus, and method for a circuit methodology for a highly linear and symmetric resistive processing unit.
One aspect of the present invention provides a resistive processing unit (RPU), including a circuit having at least two current mirrors connected in series, and a capacitor connected with the at least two current mirrors, the capacitor providing a weight based on a charge level of the capacitor. The capacitor is charged or discharged by one of the at least two current mirrors.
Another aspect of the present invention provides a method of a resistive processing unit (RPU), the method including charging or discharging a capacitor of the resistive processing unit by one of at least two series connected current mirrors, and providing a weight based on a charge level of the capacitor connected to the current mirrors.
Yet another aspect of the present invention provides an array of resistive processing units (RPUs), each RPU including a circuit having at least two current mirrors that are connected, and a capacitor connected with the at least two current mirrors, the capacitor providing a weight based on a charge level of the capacitor. The capacitor is charged or discharged by one of the at least two current mirrors.
There has thus been outlined, rather broadly, certain embodiments of the invention in order that the detailed description thereof herein may be better understood, and in order that the present contribution to the art may be better appreciated. There are, of course, additional embodiments of the invention that will be described below and which will form the subject matter of the claims appended hereto.
The exemplary aspects of the invention will be better understood from the following detailed description of the exemplary embodiments of the invention with reference to the drawings.
The invention will now be described with reference to the drawing figures, in which like reference numerals refer to like parts throughout. It is emphasized that, according to common practice, the various features of the drawing are not necessarily to scale. On the contrary, the dimensions of the various features can be arbitrarily expanded or reduced for clarity. Exemplary embodiments are provided below for illustration purposes and do not limit the claims.
As mentioned, training DNNs is an extremely computationally intensive task that requires massive computational resources and enormous training time, which hinders their further application. For example, a 70% relative improvement has been demonstrated for a DNN with 1 billion connections that was trained on a cluster with 1000 machines for three days. Training DNNs relies in general on the backpropagation algorithm, which is intrinsically local and parallel. Various hardware approaches to accelerating DNN training that exploit this locality and parallelism have been explored with different levels of success, from the early 1990s to current developments with GPUs, FPGAs, or specially designed ASICs. Further acceleration is possible by fully utilizing the locality and parallelism of the algorithm. For a fully connected DNN layer that maps N neurons to N neurons, significant acceleration can be achieved by minimizing data movement, using local storage and processing of the weight values on the same node and connecting the nodes together into a massive N×N systolic array in which the whole DNN can fit. Instead of the usual O(N²) time complexity, the problem can therefore be reduced to a constant time O(1), independent of the array size. However, the addressable problem size is limited to the number of nodes in the array, which is challenging to scale up to billions even with the most advanced CMOS technologies. Novel nano-electronic device concepts based on non-volatile memory (NVM) technologies, such as phase change memory (PCM) and resistive random access memory (RRAM), have been explored recently for implementing neural networks with a learning rule inspired by spike-timing-dependent plasticity (STDP) observed in biological systems.
Only recently has their implementation for acceleration of DNN training using the backpropagation algorithm been considered, with reported acceleration factors ranging from 27× to 900×, and even 2140×, and a significant reduction in power and area. All of these bottom-up approaches of using previously developed memory technologies look very promising; however, the estimated acceleration factors are limited by device specifications intrinsic to their application as NVM cells.
Device characteristics usually considered beneficial or irrelevant for memory applications, such as high on/off ratio, digital bit-wise storage, and asymmetrical set and reset operations, become limitations for the acceleration of DNN training. These non-ideal device characteristics can potentially be compensated for by proper design of the peripheral circuits and the overall system, but only partially and at the cost of significantly increased operational time.
Therefore, as mentioned, there is a need to provide an RPU circuit which can be highly linear and symmetric in order to implement practical ANNs.
Resistive processing units (RPUs) are trainable resistive crosspoint circuit elements that can be used to build artificial neural networks (ANNs) and dramatically accelerate ANNs by providing local data storage and local data processing. Since a highly symmetric and linear programming property of the RPU device is required to implement practical ANNs, finding a linear and symmetric RPU implementation is key to taking advantage of the RPU-based ANN implementation. Here, a CMOS-based RPU circuit that can be highly linear and symmetric is proposed.
In the related art, it has been disclosed how the learning rate can be controlled using the length of the stochastic bit streams or the population probability of the stochastic bit streams. These techniques make it possible to control the learning rate, although each has drawbacks. For very large learning rates, increasing the bit length slows down the overall performance of the training. Similarly, for very small learning rates, reducing the population probability of the streams would make the updates too stochastic, and training may not achieve sufficiently high accuracies. In the present invention, it is shown how the learning rate can be controlled by varying the voltage of the pulses, so that the learning rate can be varied over a large range without sacrificing training time or accuracy.
The present invention provides a new class of devices (RPUs) that can be used as processing units to accelerate various algorithms, including neural network training. In the present invention, it is shown how the operating voltage of an array of RPU devices can be controlled to tune the learning rate for neural network training. One way of tuning the learning rate is to control the time duration of the pulses; however, for very large learning rates this approach would be significantly slower, as a very long duration might be needed for the update cycle. The present invention instead proposes that the operating voltage be controlled to achieve larger or smaller learning rates.
One of the features of the invention is the use of voltage pulse height control to vary the learning rate of DNN training on RPU hardware, so that the system sacrifices neither time (for large learning rates) nor accuracy (for small learning rates).
The described method has the advantage of controlling the learning rate without changing the time needed for the update cycle. The present approach should therefore be faster than approaches in which the duration of the pulses controls the learning rate.
Artificial neural networks (ANNs) can be formed from crossbar arrays of RPUs that provide local data storage and local data processing without the need for additional processing elements beyond the RPU. The trainable resistive crosspoint devices are referred to as resistive processing units (RPUs).
Crossbar arrays (crosspoint arrays or crosswire arrays) are high density, low cost circuit architectures used to form a variety of electronic circuits and devices, including ANN architectures, neuromorphic microchips and ultra-high density nonvolatile memory. A basic crossbar array configuration includes a set of conductive row wires and a set of conductive column wires formed to intersect the set of conductive row wires. The intersections between the two sets of wires are separated by so-called crosspoint devices, which may be formed from thin film material.
Crosspoint devices, in effect, function as the ANN's weighted connections between neurons. Nanoscale two-terminal devices, for example memristors having conduction state switching characteristics, are often used as the crosspoint devices in order to emulate synaptic plasticity with high energy efficiency. The conduction state (e.g., resistance) of the memristive material may be altered by controlling the voltages applied between individual wires of the row and column wires.
The backpropagation algorithm is composed of three cycles, forward, backward, and weight update, that are repeated many times until a convergence criterion is met. The forward and backward cycles mainly involve computing vector-matrix multiplication in the forward and backward directions. This operation can be performed on a 2D crossbar array of two-terminal resistive devices, as was proposed more than 50 years ago. In the forward cycle, the stored conductance values in the crossbar array form a matrix, whereas the input vector is transmitted as voltage pulses through each of the input rows. In the backward cycle, when voltage pulses are supplied from the columns as an input, the vector-matrix product is computed on the transpose of the matrix. These operations achieve the required O(1) time complexity, but only for two out of three cycles of the training algorithm.
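The forward and backward cycles described above can be sketched in a few lines of Python. This is an illustrative model only, assuming ideal devices and simple summation of currents on each wire; the conductance values and the function names `forward` and `backward` are hypothetical and are not taken from the disclosure.

```python
# Idealized crossbar: G[i][j] is the conductance stored at the crosspoint of
# input row i and output column j (values below are made up for illustration).

def forward(G, v_rows):
    """Drive the rows with voltages; column j collects sum_i G[i][j] * v_rows[i]."""
    n_rows, n_cols = len(G), len(G[0])
    return [sum(G[i][j] * v_rows[i] for i in range(n_rows)) for j in range(n_cols)]

def backward(G, v_cols):
    """Drive the columns instead; row i collects sum_j G[i][j] * v_cols[j],
    which is the product with the transpose of the same stored matrix."""
    n_rows, n_cols = len(G), len(G[0])
    return [sum(G[i][j] * v_cols[j] for j in range(n_cols)) for i in range(n_rows)]

G = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]                   # 3 rows x 2 columns
y = forward(G, [1.0, 0.0, 2.0])    # column currents
z = backward(G, [1.0, 1.0])        # row currents
```

In a physical array every wire sums its currents simultaneously, which is why the text treats both cycles as constant-time operations; the explicit loops here only emulate that sum in software.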
In contrast to the forward and backward cycles, implementing the weight update on a 2D crossbar array of resistive devices locally and all in parallel, independent of the array size, is challenging. It requires calculating a vector-vector outer product, which consists of a multiplication operation and an incremental weight update to be performed locally at each cross-point as illustrated in
$$w_{ij} \leftarrow w_{ij} + \eta\, x_i\, \delta_j \qquad (1)$$
where $w_{ij}$ represents the weight value for the $i$th row and the $j$th column (for simplicity, the layer index is omitted), $x_i$ is the activity at the input neuron, $\delta_j$ is the error computed by the output neuron, and $\eta$ is the global learning rate.
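As a reference point for the update rule of Eq. (1), the outer-product update can be written out directly. The sketch below is a plain software version, assuming the weights are held in a list of lists; the function name `outer_product_update` and the sample values are illustrative only.

```python
def outer_product_update(w, x, delta, eta):
    """Apply w[i][j] <- w[i][j] + eta * x[i] * delta[j] in place at every crosspoint."""
    for i, x_i in enumerate(x):
        for j, d_j in enumerate(delta):
            w[i][j] += eta * x_i * d_j
    return w

w = [[0.0, 0.0],
     [0.0, 0.0]]
outer_product_update(w, x=[1.0, 2.0], delta=[0.5, -1.0], eta=0.5)
```

Each crosspoint needs only its own row value and column value, which is the locality the text relies on. In software the double loop touches every weight in turn, whereas the array is intended to update all crosspoints at once.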
In order to implement a local and parallel update on an array of two-terminal devices that can perform both weight storage and processing (Resistive Processing Unit or RPU) we first propose to significantly simplify the multiplication operation itself by using stochastic computing techniques. It has been shown that by using two stochastic streams the multiplication operation can be reduced to a simple AND operation.
$$w_{ij} \leftarrow w_{ij} \pm \Delta w_{\min} \sum_{n=1}^{BL} A_i^n \wedge B_j^n \qquad (2)$$
where BL is the length of the stochastic bit stream at the outputs of the stochastic translators (STRs) that is used during the update cycle, $\Delta w_{\min}$ is the change in the weight value due to a single coincidence event, $A_i^n$ and $B_j^n$ are random variables characterized by a Bernoulli process, and the superscript $n$ represents the bit position in the trial sequence. The probabilities that $A_i^n$ and $B_j^n$ are equal to unity are controlled by $C x_i$ and $C \delta_j$, respectively, where $C$ is a gain factor in the STR.
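The stochastic update rule of Eq. (2) can be emulated in software as a minimal sketch. Several assumptions are made here that are not stated in the text: the products C·|x| and C·|δ| are taken to lie in [0, 1] so that they are valid probabilities, the sign of each update is taken from the sign of x·δ, and the function name `stochastic_update` is hypothetical.

```python
import random

def stochastic_update(w, x, delta, BL, dw_min, C=1.0, rng=None):
    """One update cycle following Eq. (2), applied in place to w (list of lists).

    Assumes 0 <= C*|x_i| <= 1 and 0 <= C*|delta_j| <= 1 (valid probabilities).
    The update sign is taken from the sign of x_i * delta_j (an assumption).
    """
    rng = rng or random.Random()
    for _ in range(BL):
        # One Bernoulli bit per row and one per column for this bit position.
        a = [rng.random() < C * abs(x_i) for x_i in x]
        b = [rng.random() < C * abs(d_j) for d_j in delta]
        for i, a_i in enumerate(a):
            for j, b_j in enumerate(b):
                if a_i and b_j:  # coincidence event: logical AND of the two bits
                    sign = 1.0 if x[i] * delta[j] >= 0 else -1.0
                    w[i][j] += sign * dw_min
    return w

w = [[0.0]]
stochastic_update(w, x=[1.0], delta=[1.0], BL=10, dw_min=0.1)
```

With both probabilities equal to one, every bit position produces a coincidence, so the weight moves by BL·Δwmin. For smaller inputs the number of coincidences is random, and only its average follows the product of the two inputs.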
One possible pulsing scheme that enables the stochastic update rule of Eq.2 is presented in
Network training with RPU array using stochastic update rule is shown in the following. To test the validity of this approach, we compare classification accuracies achieved with a deep neural network composed of fully connected layers with 784, 256, 128 and 10 neurons, respectively. This network is trained with a standard MNIST training dataset of 60,000 examples of images of handwritten digits using cross-entropy objective function and backpropagation algorithm. Raw pixel values of each 28×28 pixel image are given as inputs, while sigmoid and softmax activation functions are used in hidden and output layers, respectively. The temperature parameter for both activation functions is assumed to be unity.
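For orientation, the test network described above can be outlined in code. This is only a sketch of the stated layer sizes, activation functions, and objective, assuming a temperature of unity as in the text. Bias terms are not mentioned in the text and are left out, and the helper names are hypothetical.

```python
import math

LAYER_SIZES = [784, 256, 128, 10]  # 28x28 input pixels, two hidden layers, 10 classes

def sigmoid(z):
    """Hidden-layer activation (temperature parameter of unity)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Output-layer activation; the maximum is subtracted for numerical stability."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Cross-entropy objective for a single example with an integer class label."""
    return -math.log(probs[label])

def count_weights(sizes):
    """Number of weights in the fully connected layers (bias terms excluded)."""
    return sum(n_in * n_out for n_in, n_out in zip(sizes, sizes[1:]))

n_weights = count_weights(LAYER_SIZES)
```

Under these assumptions the three weight matrices have 784×256, 256×128, and 128×10 entries, each of which would map to one crosspoint in an array implementation.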
Specifically,
In order to make a fair comparison between the baseline model and the stochastic model in which the training uses the stochastic update rule of Eq.2, the learning rates need to match. In the most general form the average change in the weight value for the stochastic model can be written as
$$E(\Delta w_{ij}) = BL\, \Delta w_{\min}\, C^2\, x_i\, \delta_j \qquad (3)$$
Therefore, the learning rate for the stochastic model is controlled by three parameters, BL, $\Delta w_{\min}$, and $C$, which should be adjusted to match the learning rates that are used in the baseline model.
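Eq. (3) can be checked numerically with a small Monte Carlo experiment. The sketch below assumes independent Bernoulli bits with probabilities C·x and C·δ, non-negative inputs, and a fixed random seed; the function name `mean_weight_change` and the sample values of η, x, and δ are illustrative only.

```python
import random

def mean_weight_change(x, delta, BL, dw_min, C, trials, seed=0):
    """Average weight change over many simulated update cycles of length BL."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        coincidences = 0
        for _ in range(BL):
            a = rng.random() < C * x       # bit from the row stream
            b = rng.random() < C * delta   # bit from the column stream
            if a and b:
                coincidences += 1
        total += dw_min * coincidences
    return total / trials

eta, BL, C = 0.01, 10, 1.0
x, delta = 0.5, 0.4
estimate = mean_weight_change(x, delta, BL, dw_min=eta / BL, C=C, trials=20000)
expected = BL * (eta / BL) * C**2 * x * delta   # right-hand side of Eq. (3)
```

With Δwmin = η/BL and C = 1, the expected change reduces to η·x·δ, which is the increment of the baseline update rule of Eq. (1).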
Although the stochastic update rule allows the multiplication operation to be replaced with a simple AND operation, the result of the operation is no longer exact, but probabilistic, with a standard-deviation-to-mean ratio that scales with $1/\sqrt{BL}$. Increasing the stochastic bit stream length BL would decrease the error, but in turn would increase the update time. In order to find an acceptable range of BL values that allows classification errors similar to those of the baseline model to be reached, we performed training using different BL values while setting $\Delta w_{\min} = \eta/BL$ and $C = 1$ in order to match the learning rates used for the baseline model, as discussed above. As it is shown in
To determine how strong non-linearity in the device switching characteristics is required for the algorithm to converge to classification errors comparable to the baseline model, a non-linearity factor is varied as shown
These results validate that, although the updates in the stochastic model are probabilistic, classification errors can become indistinguishable from those achieved with the baseline model. The implementation of the stochastic update rule on an array of analog RPU devices with non-linear switching characteristics effectively utilizes the locality and the parallelism of the algorithm. As a result, the update time becomes independent of the array size and is a constant value proportional to BL, thus achieving the required O(1) time complexity.
The current vector I1 to I4 508 is the output vector “y”, while the input vector “x” is shown as the vector V1 to V3 510 with the conductance matrix σ.
Basically, there are two pairs of terminals: two terminals are for updating and two terminals are for reading. First, Vin1 and Vin2 are the two update input terminals of the logic AND gate 702 (other configurations are possible using, for example, a NAND gate and an inverter). Whenever the inputs Vin1 and Vin2 coincide, that is, when both are in the ON state, the output signal at the Out 716 of the AND gate 702 is in the ON state. Only when the OUTPUT 716 is “1” is there an active connection to the two current sources 718 and 708, through local switches 732 and 730. The local switches 730 and 732 are ON when the OUTPUT 716 returns an ON signal. At the transistor 704, there are two terminals (source/drain) 722 and 724 that are used to measure the resistance of this RPU device 700.
The current source 718 supplies current into the capacitor 706, and the current source 708 discharges the capacitor 706. The capacitor 706 stores the weight of the RPU device. Depending on the voltage stored on the capacitor 706, the resistance of the transistor 704 changes, because the control terminal (gate) of the transistor 704 is directly connected to the capacitor 706. Therefore, whenever Vin1 and Vin2 at the AND gate (or other configuration) 702 coincide, one of the current sources 708 or 718 (not both) is in the ON state at a time, allowing the capacitor 706 to discharge or charge.
The charging or discharging is controlled by other control signals, shown as bias voltages Vb,up and Vb,dn applied at local switches 714 and 712, respectively. The bias voltages Vb,up and Vb,dn, applied to the gate terminals 742 and 740 of current mirror transistors 718 and 708, respectively, are supplied from an external circuit and are also used as a global programming-mode signal (Vprog) for all the RPUs 700 in an array. Vprog is determined globally for all the cells in the array. Vprogram (Vprog) at input 712 and the inverted Vprog at local switch 714 are used globally.
Therefore, when Vprog is “1” at switch 714, the charging current source is enabled. The OUT 716 of the AND gate 702 also has to be ON. Then the current source 718 is turned ON, supplying current 720 to charge the capacitor 706. This changes the voltage at the gate 710 of the transistor 704, which changes its resistance. The complementary switches 714 and 712 are for the global Up/Down programming signal (Vprog). Vb,up provides the global UP programming signal at 714, allowing the capacitor 706 to be charged.
Referring again to
Therefore, the UP/DOWN cycle control is achieved by using the current mirror bias voltage as a signal, together with the external Vprog switch. The coincidence detection is performed by the AND gate (or an alternative configuration using, for example, a NAND gate and an inverter) and the local switches. Meanwhile, the charge storage and output are handled by the capacitor 706, the current mirrors 708 and 718, and the read transistor (read-out transistor) 704.
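The control flow summarized above can be captured in a small behavioral model. This is a sketch under idealized assumptions: constant currents, no leakage, and no limit on the capacitor voltage. The class name `RPUCell` and the component values are hypothetical and are not taken from the disclosure.

```python
class RPUCell:
    """Behavioral sketch of one cell: AND gate, two current sources, one capacitor."""

    def __init__(self, c_farads, i_up, i_dn, dt, v_cap=0.0):
        self.c = c_farads   # weight capacitor (706 in the text)
        self.i_up = i_up    # charging current (source 718 in the text)
        self.i_dn = i_dn    # discharging current (source 708 in the text)
        self.dt = dt        # duration of one coincidence pulse
        self.v_cap = v_cap  # capacitor voltage, which drives the read-transistor gate

    def pulse(self, vin1, vin2, v_prog):
        """Apply one update pulse; v_prog=True selects charging, False discharging."""
        if vin1 and vin2:  # coincidence detection (AND gate 702)
            if v_prog:
                self.v_cap += self.i_up * self.dt / self.c
            else:
                self.v_cap -= self.i_dn * self.dt / self.c
        return self.v_cap

# Assumed example values: 100 fF capacitor, 1 uA currents, 1 ns pulses.
cell = RPUCell(c_farads=100e-15, i_up=1e-6, i_dn=1e-6, dt=1e-9)
for _ in range(3):
    cell.pulse(True, True, v_prog=True)   # three coincident UP pulses
v_after_up = cell.v_cap
cell.pulse(True, False, v_prog=True)      # no coincidence: no change
for _ in range(3):
    cell.pulse(True, True, v_prog=False)  # three coincident DOWN pulses
```

With equal up and down currents, each pulse moves the capacitor voltage by the same step in either direction, which is the symmetry the circuit is meant to provide.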
With a constant current supply:
The voltage Vcap on the capacitor is determined by the programming current (iprog) integrated over the pulse time t, and N is the number of pulses. N∝Vcap∝Iread, where Iread is the read current. Vcap and N are noted in the equation.
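The proportionality stated above follows from charging a capacitor with a constant current. A minimal derivation is given here as a sketch, assuming an ideal constant programming current, N identical pulses of width Δt, an initially uncharged capacitor, and a capacitance written as C_cap (a symbol introduced here for illustration):

```latex
V_{cap} \;=\; \frac{1}{C_{cap}} \int i_{prog}\, dt
        \;=\; \frac{i_{prog}\,\Delta t}{C_{cap}}\; N
\qquad\Longrightarrow\qquad V_{cap} \propto N
```

Under these assumptions every pulse adds the same voltage step, i_prog·Δt/C_cap, regardless of the charge already stored.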
When the read transistor is in deep triode region:
The ID and VGS are noted in the equation. The same holds for the discharge case.
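For reference, the standard long-channel approximation for a MOSFET in the deep triode region is sketched below. It is given as an assumption about the intended relation rather than as the equation of the disclosure; the symbols μ, C_ox, W, L, and V_T (mobility, gate-oxide capacitance per unit area, channel width and length, and threshold voltage) are introduced here for illustration.

```latex
I_D \;\approx\; \mu\, C_{ox}\, \frac{W}{L}\, \left(V_{GS} - V_{T}\right) V_{DS},
\qquad V_{DS} \ll V_{GS} - V_{T}
```

With the gate driven by the capacitor, so that V_GS = V_cap, the read current at a fixed small V_DS is linear in V_cap. A fixed voltage step per pulse then gives a fixed step in read current, for charging and discharging alike.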
Therefore, in the proposed circuit, a highly symmetric and linear weight update is achieved using a current-source-based circuit. As shown above, a mixed-signal RPU circuit with silicon technology elements is thus proposed, which shows ideal RPU characteristics.
Referring back to
There is a weight capacitor 706 and a read transistor 704. The weight capacitor 706 stores the weight in the form of electric charge serving as a current integrator. The read transistor 704 converts the voltage at the weight capacitor 706 to resistance which can be accessed from source-drain terminals 722 and 724 by applying a read voltage at the gate 710 of the transistor 704.
Another set of elements is the current mirrors 708 and 718. Two current mirror transistors 718 and 708 serve as constant current sources to charge and discharge, respectively, the weight capacitor 706 with a constant current. The bias voltages to the gate terminal of current mirror transistors 718 and 708 are supplied from an external circuit and also used as a global signal of programming mode (Vprog).
Another element is the AND gate 702, which is a coincidence detector. The AND gate 702 receives the voltage input signals from the connected column and row and performs the multiplication.
Other configurations can be made. For example, a NAND gate can be connected in series to an inverter, thereby using both the output of the NAND gate and the output of the inverter (the NAND logic of Vin1 and Vin2 and also the AND logic output by the inverter) to control the activation of the mirror current sources 708 and 718. Other configurations can also be included.
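The logical equivalence behind the NAND-plus-inverter alternative can be verified with a short truth-table check. This sketch treats the signals as ideal Boolean values; the function names are illustrative.

```python
def nand(a, b):
    """Output of a two-input NAND gate."""
    return not (a and b)

def inverter(a):
    """Output of an inverter."""
    return not a

# Rows are (Vin1, Vin2, NAND output, inverter output).
truth_table = []
for vin1 in (False, True):
    for vin2 in (False, True):
        nand_out = nand(vin1, vin2)
        and_out = inverter(nand_out)  # NAND followed by an inverter gives AND
        truth_table.append((vin1, vin2, nand_out, and_out))
```

Only the row with both inputs ON yields an ON inverter output, so the series combination detects coincidence just as a single AND gate does, while also making the complementary NAND signal available.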
Some of the many advantages achieved are as follows. The weight update is highly linear and symmetric. Analog, incremental weight change is also implemented. High-frequency updates are possible, with the potential for low power. The present invention can also be implemented in a small area with a deep trench capacitor and advanced silicon technology such as nanowire FETs (field effect transistors), carbon nanotube FETs, and FinFETs.
The many features and advantages of the invention are apparent from the detailed specification, and thus, it is intended by the appended claims to cover all such features and advantages of the invention which fall within the true spirit and scope of the invention. Further, since numerous modifications and variations will readily occur to those skilled in the art, it is not desired to limit the invention to the exact construction and operation illustrated and described, and accordingly, all suitable modifications and equivalents may be resorted to, falling within the scope of the invention.
The present application is a Continuation Application of U.S. patent application Ser. No. 16/367,497, filed on Mar. 28, 2019, now U.S. Pat. No. 10,950,304, which is a Continuation Application of U.S. patent application Ser. No. 15/831,059, filed on Dec. 4, 2017, now U.S. Pat. No. 10,269,425, which is a Continuation Application of U.S. patent application Ser. No. 15/335,171, filed on Oct. 26, 2016, now U.S. Pat. No. 9,852,790, the entire contents of which are hereby incorporated by reference.
Number | Name | Date | Kind |
---|---|---|---|
5136176 | Castro | Aug 1992 | A |
5668508 | Pulvirenti et al. | Sep 1997 | A |
5825218 | Colli et al. | Oct 1998 | A |
6597598 | Tran et al. | Jul 2003 | B1 |
6597898 | Iwata et al. | Jul 2003 | B1 |
6937025 | Fong et al. | Aug 2005 | B1 |
6949937 | Knoedgen | Sep 2005 | B2 |
7079436 | Perner et al. | Jul 2006 | B2 |
8275727 | Elmegreen et al. | Sep 2012 | B2 |
9595310 | Zhang et al. | Mar 2017 | B2 |
9852790 | Gokmen | Dec 2017 | B1 |
10269425 | Gokmen | Apr 2019 | B2 |
10345348 | Gobbi | Jul 2019 | B2 |
10950304 | Gokmen | Mar 2021 | B2 |
20060294034 | Fuji | Dec 2006 | A1 |
20140214738 | Pickett | Jul 2014 | A1 |
20140344200 | Schie | Nov 2014 | A1 |
20150170025 | Wu et al. | Jun 2015 | A1 |
20160049195 | Yu et al. | Feb 2016 | A1 |
20180053089 | Gokmen et al. | Feb 2018 | A1 |
Entry |
---|
United States Notice of Allowance dated Dec. 13, 2018, in U.S. Appl. No. 15/831,059. |
United States Non-Final Office Action dated May 15, 2018, in U.S. Appl. No. 15/831,059. |
U.S. Notice of Allowance dated Aug. 15, 2017 in U.S. Appl. No. 15/335,171. |
U.S. Non-Final Office Action dated May 2, 2017 in U.S. Appl. No. 15/335,171. |
URL: arxiv.org/ftp/arxiv/papers/1603/1603.07341.pdf, Paper, authored by Gokmen et al., entitled “Acceleration of Deep Neural Network Training with Resistive Cross-Point Devices”. |
Number | Date | Country | |
---|---|---|---|
20210151102 A1 | May 2021 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 16367497 | Mar 2019 | US |
Child | 17137615 | US | |
Parent | 15831059 | Dec 2017 | US |
Child | 16367497 | US | |
Parent | 15335171 | Oct 2016 | US |
Child | 15831059 | US |