The subject matter disclosed herein relates to the field of neural networks and more particularly relates to a magnetic tunnel junction (MTJ) based hardware synapse implementation for binary and ternary deep neural networks.
Deep neural networks (DNNs) are the state-of-the-art solution for a wide range of applications, such as image and natural language processing. Classical DNNs are compute-intensive, i.e., they require numerous multiply-and-accumulate (MAC) operations with frequent memory accesses. As such, DNN performance is limited by computing resources and available power. Working with DNNs is composed of two stages, training and inference, where the computational complexity of training exceeds that of inference. Both the training and inference stages of DNNs are usually executed on commodity hardware (mostly FPGA and GPU platforms), but effort has been devoted to developing dedicated hardware optimized for executing DNN tasks. The two main approaches to accelerating DNN execution are: (1) moving the computation closer to the memory, and (2) improving the performance of the MAC operation.
Efforts have been made to design dedicated hardware for DNNs. Current DNN models, however, are power hungry and not suited to run on low-power devices. Therefore, discrete neural networks, such as ternary and binary neural networks (TNNs, BNNs), are being explored as a way to reduce the computational complexity and memory consumption of DNNs. By reducing the weight and activation function resolution to binary {−1, 1} or ternary {−1, 0, 1} values, the MAC operations in discrete neural networks are replaced by much less demanding logic operations, and the number of required memory accesses is significantly reduced. This insight triggered recent research efforts to design novel algorithms that can support binary and/or ternary DNNs without sacrificing accuracy. Recently, the GXNOR algorithm was proposed for training discrete neural networks, especially TNNs and BNNs. This algorithm uses a stochastic update function to facilitate the training phase and does not need to keep the full value (e.g., floating point) of the weights and activations.
A disadvantage of the large data structures associated with prior art synapses and activations is that they cause (i) frequent memory accesses, due to the memory-computation separation of von Neumann based solutions with digital CMOS MAC operations, resulting in high power consumption and increased execution latency, and (ii) impractical on-chip memory capacity requirements (at least tens of MBs).
In addition, digital MAC circuits are computation intensive, while supporting the GXNOR algorithm requires a stochastic step engine design that is difficult to implement in standard digital logic.
The foregoing examples of the related art and limitations related therewith are intended to be illustrative and not exclusive. Other limitations of the related art will become apparent to those of skill in the art upon a reading of the specification and a study of the figures.
The following embodiments and aspects thereof are described and illustrated in conjunction with systems, tools and methods which are meant to be exemplary and illustrative, not limiting in scope.
There is provided, in an embodiment, a synapse device comprising: first and second magnetic tunnel junction (MTJ) devices, wherein each of the MTJ devices has a fixed layer port and a free layer port, and wherein the fixed layer ports of the first and second MTJ devices are connected to each other; a first control circuit connected to the free layer port of the first MTJ device and configured to provide a first control signal; and a second control circuit connected to the free layer port of the second MTJ device and configured to provide a second control signal; wherein the first and second control circuits are configured to perform a gated XNOR (GXNOR) operation between a synapse value and activation values; and wherein an output of the GXNOR is represented by a sum of the output currents through both of the first and second MTJ devices.
In some embodiments, the synapse device is configured to store a ternary or binary synapse weight represented by a state of the MTJ devices.
In some embodiments, the synapse weight is defined as and stored as a combination of respective resistance values of each of the first and second MTJ devices.
In some embodiments, the synapse device is further configured to perform in-situ stochastic update of the ternary or binary synapse weights.
There is also provided, in an embodiment, an array of synapse devices comprising: a plurality of synapse devices arranged in an array of rows and columns, wherein each of the synapse devices comprises: first and second magnetic tunnel junction (MTJ) devices, wherein each of the MTJ devices has a fixed layer port and a free layer port, and wherein the fixed layer ports of the first and second MTJ devices are connected to each other, a first control circuit connected to the free layer port of the first MTJ device and configured to provide a first control signal, and a second control circuit connected to the free layer port of the second MTJ device and configured to provide a second control signal, wherein the first and second control circuits are configured to perform a gated XNOR (GXNOR) operation between synapse and activation values; and wherein an output of the GXNOR is represented by the output current through both of the first and second MTJ devices, wherein all of the synapse devices arranged in any one of the columns share an input voltage, wherein all of the synapse devices arranged in any one of the rows share the first and second control signals, and wherein outputs of all of the synapse devices arranged in any one of the rows are connected.
In some embodiments, each of the synapse devices is configured to store ternary or binary synapse weights represented by a state of the MTJ devices.
In some embodiments, the synapse weight is defined as and stored as a combination of respective resistance values of each of the first and second MTJ devices.
In some embodiments, each of the synapse devices is further configured to perform in-situ stochastic update of the ternary or binary synapse weights.
In some embodiments, the array forms a trainable neural network.
In some embodiments, the neural network represents a synaptic weight matrix comprising all of the synapse weights of each of the synapse devices in the array.
In some embodiments, an output vector of the neural network is calculated as a weighted sum of all of the input voltages multiplied by the synaptic weightings matrix.
There is further provided, in an embodiment, a method comprising: providing an array of synapse devices arranged in rows and columns, wherein each of the synapse devices comprises: first and second magnetic tunnel junction (MTJ) devices, wherein each of the MTJ devices has a fixed layer port and a free layer port, and wherein the fixed layer ports of the first and second MTJ devices are connected to each other, a first control circuit connected to the free layer port of the first MTJ device and configured to provide a first control signal, and a second control circuit connected to the free layer port of the second MTJ device and configured to provide a second control signal, wherein the first and second control circuits are configured to perform a gated XNOR (GXNOR) operation between synapse and activation values, and wherein an output of the GXNOR is represented by the output current through both of the first and second MTJ devices, wherein all of the synapse devices arranged in any one of the columns share an input voltage, wherein all of the synapse devices arranged in any one of the rows share the first and second control signals, and wherein outputs of all of the synapse devices arranged in any one of the rows are connected; and at a training stage, training the array of synapse devices by: (i) inputting all of the input voltages associated with each of the columns, (ii) setting the first and second control signals associated with each of the rows to perform the GXNOR operation, and (iii) calculating an output vector of the array as a weighted sum of the input voltages multiplied by a synaptic weightings matrix comprising synapse weights of all of the synapse devices in the array.
In some embodiments, the training further comprises comparing the output vector to a training dataset input, wherein the comparing leads to an adjustment of the synaptic weightings matrix.
In some embodiments, each of the synapse devices is configured to store the synapse weight represented by a state of the MTJ devices, wherein the synapse weight is ternary or binary.
In some embodiments, the synapse weight is defined as and stored as a combination of respective resistance values of each of the first and second MTJ devices.
In some embodiments, each of the synapse devices is further configured to perform in-situ stochastic update of the ternary or binary synapse weights.
In some embodiments, the array forms a trainable neural network.
In some embodiments, the neural network represents the synaptic weight matrix comprising all of the synapse weights of each of the synapse devices in the array.
In some embodiments, the output vector of the neural network is calculated as a weighted sum of all of the input voltages multiplied by the synaptic weightings matrix.
There is further provided, in an embodiment, a computer memory structure comprising: a plurality of synapse devices, each comprising: first and second magnetic tunnel junction (MTJ) devices, wherein each of the MTJ devices has a fixed layer port and a free layer port, and wherein the fixed layer ports of the first and second MTJ devices are connected to each other, a first control circuit connected to the free layer port of the first MTJ device and configured to provide a first control signal, and a second control circuit connected to the free layer port of the second MTJ device and configured to provide a second control signal, wherein the first and second control circuits are configured to perform a gated XNOR (GXNOR) operation between a synapse value and activation values; and wherein an output of the GXNOR is represented by a sum of the output currents through both of the first and second MTJ devices.
In addition to the exemplary aspects and embodiments described above, further aspects and embodiments will become apparent by reference to the figures and by study of the following detailed description.
Exemplary embodiments are illustrated in referenced figures. Dimensions of components and features shown in the figures are generally chosen for convenience and clarity of presentation and are not necessarily shown to scale. The figures are listed below.
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be understood by those skilled in the art, however, that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention.
The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings.
In some embodiments, the present disclosure provides for a novel MTJ-based synapse circuit. In some embodiments, the present MTJ-based synapse circuit may be employed in a neural network, and especially a TNN and/or a BNN, which may be trained without sacrificing accuracy. The proposed MTJ-based synapse circuit enables in-situ, highly parallel and energy efficient execution of weight-related computation. Such a circuit can accelerate TNN inference and training execution on low-power devices, such as IoT and consumer devices.
It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
Because the illustrated embodiments of the present invention may for the most part, be implemented using electronic components and circuits known to those skilled in the art, details will not be explained in any greater extent than that considered necessary, for the understanding and appreciation of the underlying concepts of the present invention and in order not to obfuscate or distract from the teachings of the present invention.
Any reference in the specification to a method should be applied mutatis mutandis to a system capable of executing the method. Any reference in the specification to a system should be applied mutatis mutandis to a method that may be executed by the system.
Disclosed herein are a memory device and an associated method. While the following description will be described in terms of memory synaptic devices for clarity and placing the invention in context, it should be kept in mind that the teachings herein may have broad application to all types of systems, devices and applications.
A synapse is an active memory element, which may include a bi-polar memory element having polarity-dependent switching.
In some embodiments, the present disclosure provides for a stochastic synapse for use in a neural network. In some embodiments, a stochastic synapse of the present disclosure comprises magnetic tunnel junction (MTJ) devices, wherein each of the MTJ devices has a fixed layer port and a free layer port, and wherein the fixed layer ports of the MTJ devices are connected to each other. In some embodiments, control circuits operationally connected to the MTJ devices are configured to perform a gated XNOR operation between synapse and activation values, wherein an output of the gated XNOR is represented by the output current through both of the MTJ devices.
Quantized neural networks are being actively researched as a solution for the computational complexity and memory intensity of deep neural networks. This has sparked efforts to develop algorithms that support both inference and training with quantized weight and activation values without sacrificing accuracy. A recent example is the GXNOR framework for stochastic training of ternary and binary neural networks. Further reduction of the power consumption and latency can be obtained by designing dedicated hardware for parallel, in-situ execution of those algorithms with low power consumption.
Accordingly, in some embodiments, the present disclosure provides for a novel hardware synapse circuit that uses magnetic tunnel junction (MTJ) devices to support GXNOR training methods.
As noted above, binary neural networks (BNNs) and ternary neural networks (TNNs) are being explored as a way to reduce the computational complexity and memory footprint of DNNs. By reducing the weight resolution and activation function precision to quantized binary {−1,1} or ternary {−1,0,1} values, the MAC operations are replaced by much less demanding logic operations, and the number of required memory accesses is significantly reduced. Such networks are also known as quantized neural networks (QNNs). This insight triggered recent research efforts to design novel algorithms that can support binary and/or ternary DNNs without sacrificing accuracy.
The GXNOR algorithm for training networks uses a stochastic update function to facilitate the training phase. Unlike other algorithms, GXNOR does not require keeping the full value (e.g., in a floating point format) of the weights and activations. Hence, GXNOR enables further reduction of the memory capacity during the training phase.
Emerging memory technologies such as Spin-Transfer Torque Magnetic Tunnel Junction (STT-MTJ) can be used to design dedicated hardware to support in-situ DNN training, with parallel and energy efficient operations. Furthermore, the near-memory computation enabled by these technologies reduces overall data movement.
An MTJ is a binary device with two stable resistance states. Switching the MTJ device between resistance states is a stochastic process, which may limit the use of STT-MTJ device as a memory cell.
Accordingly, in some embodiments, the stochastic behavior of the MTJ is used to support GXNOR training.
In some embodiments, the present disclosure provides for an MTJ-based synapse circuit comprising, e.g.:
The present inventors have evaluated TNN and BNN training using the MTJ-based synapse of the present disclosure over provided datasets. The results show that using the MTJ-based synapse for training yielded similar results to an ideal GXNOR algorithm, with a small accuracy loss of 0.7% for the TNN and 2.4% for the BNN. Moreover, the proposed hardware design is energy efficient in both the feedforward and weight update phases.
An MTJ device is composed of two ferromagnetic layers, a fixed magnetization layer and a free magnetization layer, separated by an insulator layer, as shown in
where α, Ms, V, P, Meff are the Gilbert damping, the saturation magnetization, the free layer volume, the spin polarization of the current, and the effective magnetization, respectively.
In a low current regime where I<<Ic
where τ is the mean switching time, and Δt is the write duration. Due to the exponential dependency of τ on the current value, long write periods are needed to reach high switching probabilities (Psw→1).
In the high current regime where I>>Ic
where γ is the gyromagnetic ratio, and θ is the initial magnetization angle, given by a normal distribution θ∼N(0,θ0), with θ0=√(kBT/(μ0HkMsV)), where Hk is the shape anisotropy field.
Unlike the high- and low-current regimes, which can be described by analytic models, the intermediate current regime has no simple model that describes it. The low-current regime exhibits long switching times (τ much longer than nanoseconds), which limits its practical use for computation. Therefore, in some embodiments, the present invention focuses on the high-current regime.
In recent years, efforts have been made to make DNN models more efficient and hardware-compatible. Compression methods have been explored, where the DNN weights and activations are constrained to discrete values such as binary {−1, 1} or ternary {−1, 0, 1}.
Recently, a framework for constraining the weights and activations to the discrete space was suggested. Compared to other state-of-the-art algorithms, GXNOR eliminates the need for saving the full-precision weight values during the network training. The MAC operations in TNNs and BNNs are replaced with simple logic operations, i.e., XNOR, and the network's memory footprint is reduced dramatically. The GXNOR algorithm is a framework for constraining the weights and activations to the quantized space while training the network. An example GXNOR neural network is shown in
The quantized space is defined by zNn=n·ΔzN−1, for n=0, 1, . . . , 2^N, where N is a non-negative integer which defines the space values and zNn∈[−1,1]. For example, the binary space is given for N=0 and the ternary space for N=1. The quantized space resolution, i.e., the distance between two adjacent states, is given by ΔzN=2^(1−N), so that Δz0=2 for the binary space and Δz1=1 for the ternary space.
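By way of non-limiting illustration, the following Python sketch enumerates the quantized space and its resolution consistent with the definition above; the function name is illustrative only.

```python
# Illustrative sketch (not part of the disclosed circuit): quantized weight
# space Z_N and its resolution, matching the binary (N=0) and ternary (N=1)
# examples in the text.

def quantized_space(N: int):
    """Return the quantized space values z_N^n = n * dz - 1, n = 0..2^N, and dz."""
    dz = 2.0 ** (1 - N)          # space resolution (distance between states)
    return [n * dz - 1.0 for n in range(2 ** N + 1)], dz

if __name__ == "__main__":
    for N in (0, 1):
        values, dz = quantized_space(N)
        print(f"N={N}: values={values}, dz={dz}")
    # N=0 -> [-1.0, 1.0], dz=2.0 (binary); N=1 -> [-1.0, 0.0, 1.0], dz=1.0 (ternary)
```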
The quantized activation is a step function, where the number of steps is defined by the space. To support backpropagation through the quantized activations, the derivative of the activation function is approximated. Accordingly, in some embodiments, a simple window function may be used which replaces the ideal derivative, given by a sum of delta functions.
In GXNOR networks, the activation function (
To support training with weights which are constrained to the discrete weight space (DWS), the GXNOR algorithm uses a stochastic gradient based method to update the weights. First, a boundary function must be defined to guarantee that the updated value will not exceed the [−1, 1] range.
In some embodiments, the boundary function is
where Wijl is the synaptic weight between neuron j and neuron i of the following layer (l+1), ΔWijl is the gradient-based update value, and k is the update iteration. Then, the update function is
Wijl(k+1)=Wijl(k)+Δwijl(k),  (6)
where Δwijl(k) is the discrete update value, obtained by projecting ΔWijl(k) onto the quantized weight space using a probabilistic projection function defined by
where κij and vij are, respectively, the quotient and the remainder values of ΔWij(k) divided by ΔzN, and
where m is a positive adjustment factor. Hence,
Δwijl=κijΔzN+sign(vij)Bern(τ(vij))ΔzN, (9)
where Bern(τ(vij)) is a Bernoulli variable with parameter τ(vij).
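By way of non-limiting illustration only, the following Python sketch mimics the probabilistic projection of equation (9): the update is decomposed into a quotient κ and a remainder v with respect to ΔzN, and the remainder contributes one extra quantization step with probability τ(v). Since the exact form of τ(·) is not reproduced above, a tanh-shaped placeholder with adjustment factor m is assumed here.

```python
import math
import random

def probabilistic_project(dW: float, dz: float, m: float = 3.0) -> float:
    """Project a real-valued update dW onto the quantized grid of resolution dz.

    kappa is the integer quotient and v the remainder of dW divided by dz; the
    remainder contributes one extra +/-dz step with probability tau(v).  The
    exact tau(v) of the GXNOR algorithm is not reproduced in the text, so a
    tanh-shaped placeholder with adjustment factor m is assumed here.
    """
    kappa = math.trunc(dW / dz)        # integer quotient (deterministic part)
    v = dW - kappa * dz                # remainder, |v| < dz (stochastic part)
    tau = math.tanh(m * abs(v) / dz)   # assumed probability shape
    bern = 1 if random.random() < tau else 0
    sign_v = (v > 0) - (v < 0)
    return kappa * dz + sign_v * bern * dz

if __name__ == "__main__":
    random.seed(0)
    counts = {}
    for _ in range(10000):
        w = probabilistic_project(1.5, dz=1.0)   # ternary space, dz = 1
        counts[w] = counts.get(w, 0) + 1
    print(counts)   # realized updates land on the grid points 1.0 or 2.0
```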
In some embodiments, the present disclosure focuses on TNNs and BNNs. The binary weight space (BWS) is given by N=0 and Δz0=2. The ternary weight space (TWS) is given by N=1 and Δz1=1.
In some embodiments, the present disclosure provides for a ternary synapse circuit to support stochastic GXNOR training. In some embodiments, the stochastic behavior of the MTJ device may be leveraged to support the stochastic update function.
Table 1 below lists the different values of the synapse weight, W. This weight is defined and stored as the combination of the two resistances of the MTJ devices. The zero state in the present ternary synapse has two representations, as opposed to one in a regular ternary synapse. Moreover, thanks to the bi-stability of the MTJ, the proposed synapse value is limited to {−1,0,1}; thus, the boundary function in (5) is enforced by the hardware synapse.
The synapse circuits are the basic cells of an array structure, as shown in
As described in Table 1 above, the synapse state is defined and stored as the combination of the MTJ resistances. R1 represents the resistance of MTJ device M1 (
During feedforward (i.e., inference) operation, u1 and u2 represent the value of the activation function. Note that u1 and u2 can be {−1, 0, 1} for ternary activations, as well as {−1, 1} for binary activations. Note also that {−1, 0, 1} and {−1, 1} represent logic values. In the circuit implementation, the logic values {−1,0,1} are mapped to voltages {−Vrd,0,Vrd}, with Vrd≤Ic0Ron to ensure that the MTJ does not change its resistance during the feedforward operation mode.
During backpropagation, specifically during the update operation, the weights are updated according to an error function. u1 and u2 are fixed to the values +1 and −1, respectively. An update value of zero indicates that the weight already stored in the synapse does not change.
To perform the gated-XNOR logic operation between the synapse and activation values, the input neuron values are represented by the voltage sources. The logic values {−1,0,1} are represented by u ∈ {−Vrd, 0, Vrd}, where Vrd is set to guarantee the low-current regime of an MTJ, so the switching probability is negligible. During this operation, u1=u and u2=−u are connected, and the output current is
Iout=(G1−G2)u,  (10)
where G1 and G2 are the conductances of the two MTJs. As listed in Table 1 above, the polarity of Iout depends on the input voltage and the synapse weight. If u=0 or W∈{0w,0s}, the output current is Iout≈0. If the weight and input have the same polarity, then sign(Iout)=1; otherwise, sign(Iout)=−1.
To perform feedforward with the GXNOR operation, the row output is connected to ground potential and the output currents from all synapses are summed based on KCL. Thus, the current through row i is
where Gj,R1 and Gj,R2 are the conductances of MTJ devices R1 and R2 of synapse j, and the sum is taken over all synapses in row i.
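By way of non-limiting illustration, the following behavioral sketch (not a circuit-level model) reproduces the read path described above: each synapse contributes Iout=(G1−G2)u, and the row output is the KCL sum over all synapses, whose sign encodes the accumulated gated-XNOR result. The conductance and voltage values are arbitrary placeholders.

```python
# Behavioral sketch of the GXNOR read path: the synapse weight is encoded by
# the pair of MTJ conductances (G1, G2), the activation by the read voltage u,
# and the row output is the KCL sum of the per-synapse currents.

G_ON, G_OFF = 1.0 / 3e3, 1.0 / 6e3    # placeholder conductances (Ron=3k, Roff=6k)
V_RD = 0.1                            # placeholder read voltage [V]

# weight -> (G1, G2); the zero weight has two representations (0w and 0s)
WEIGHT_TO_G = {
    +1: (G_ON, G_OFF),
    -1: (G_OFF, G_ON),
    "0w": (G_ON, G_ON),
    "0s": (G_OFF, G_OFF),
}

def synapse_current(weight, activation):
    """I_out = (G1 - G2) * u  with u in {-V_RD, 0, +V_RD}."""
    g1, g2 = WEIGHT_TO_G[weight]
    return (g1 - g2) * activation * V_RD

def row_current(weights, activations):
    """Row output: KCL sum of all synapse currents (feedforward accumulate)."""
    return sum(synapse_current(w, a) for w, a in zip(weights, activations))

if __name__ == "__main__":
    weights = [+1, -1, "0w", +1]
    activations = [+1, +1, -1, -1]   # logic activation values; 0 is also allowed
    expected = sum((w if w in (1, -1) else 0) * a for w, a in zip(weights, activations))
    print(row_current(weights, activations), "expected sign:", expected)
```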
In some embodiments, in order to support DNN training, the present disclosure provides for a synaptic array which supports various optimization algorithms such as SGD, momentum, Adagrad and ADAM. These algorithms differ in the type and number of operations, where a higher level of parallelism can be achieved for SGD.
In some embodiments, the present disclosure provides for two weight update schemes, one which supports SGD and another which can support more sophisticated gradient-based optimization algorithms such as ADAM. In both schemes the update is done in the high current domain, guaranteed by the update input voltage Vup, for the update period marked by Tup. The weight update is influenced by the current direction and the time interval in which the current flows through the MTJs.
The row control signal, ei,j, connects the MTJ to one of the voltage sources {ui,ūi} per update operation, for time interval Δt. Hence, an input voltage pulse u=±Vup is applied to the MTJ, with pulse width Δt∈[0,Tup]. Therefore, using (2a), the switching probability of each MTJ is
where Δt is the pulse width, u is the voltage drop over the device, and R is the resistance of the device. The update period, Tup, and Vup are set to ensure that if Δt=Tup then Psw≈1. To update the MTJ with respect to a given real value λ, the pulse width is set to Δt=min(|λ|Tup,Tup). Thus, Psw is a function of λ.
The control signals select the current direction through the synapses, as a function of sign(λ). For λ>0 (λ<0), {u1,ū2} ({ū1,u2}) are connected; thus, the current flows from R1 (R2) to R2 (R1).
To support advanced optimization algorithms, the weight columns are updated iteratively, i.e., a single synapse array column is updated at each iteration. During this operation, the input voltages are set to u1=u2=Vup>0 for all the synapses. To support the probabilistic projection, the MTJ is updated proportionally to κij, the integer quotient of ΔWij, and to vij=Remainder(ΔWij); that is, for a single synapse, one MTJ is updated using a pulse width of Δt=|κij|Tup and the other with Δt=|vij|Tup. It is assumed that the κ and v data are inputs to the synapse array. Using this work scheme, the synapse weight is updated as follows. κij is an integer, so if κij≠0, then the MTJ switching probability is approximately 1 and can be described as an indicator variable sign(κij)1κ≠0. vij is a fraction, so the switching probability of the MTJ with respect to vij is a Bernoulli variable with probability Psw(vij). Thus, the MTJ-based synapse update is given by Δwij=sign(ΔWij)(1κ≠0+Bern(Psw(vij))).
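A hedged behavioral sketch of this update scheme is given below: a non-zero κ switches its MTJ with probability close to 1, while the fractional remainder v switches the other MTJ with a Bernoulli probability set by the pulse width Δt=|v|Tup. Since the exact Psw expression is not reproduced above, an exponential pulse-width dependence is assumed purely as a placeholder.

```python
import math
import random

T_UP = 1.0          # update period (normalized); Psw(T_UP) ~ 1 by construction
LAMBDA = 5.0        # placeholder rate constant of the assumed Psw model

def p_switch(pulse_fraction: float) -> float:
    """Assumed switching probability vs. normalized pulse width dt/T_up.

    Placeholder model: Psw = 1 - exp(-LAMBDA * dt/T_up), chosen only so that
    Psw(0) = 0 and Psw(T_up) is close to 1, as required by the text.
    """
    return 1.0 - math.exp(-LAMBDA * min(abs(pulse_fraction), 1.0))

def mtj_synapse_update(dW: float, dz: float = 1.0) -> int:
    """Realized quantized step dw = sign(dW) * (1_{kappa!=0} + Bern(Psw(v)))."""
    kappa = math.trunc(dW / dz)     # integer quotient -> near-deterministic step
    v = dW - kappa * dz             # remainder -> stochastic step
    step = 1 if kappa != 0 else 0
    step += 1 if random.random() < p_switch(v / dz) else 0
    return int(math.copysign(step, dW)) if dW != 0 else 0

if __name__ == "__main__":
    random.seed(1)
    realized = [mtj_synapse_update(1.5) for _ in range(10000)]
    print("mean step:", sum(realized) / len(realized))  # between 1 and 2 for dW=1.5
```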
The control signals are given by
To obtain the required functionality of control signal ein, voltage comparators may be used (
When the SGD algorithm is used to train the network, all the synapses in the array are updated in parallel. To support SGD training, minor changes need to be made to the proposed update scheme. Using SGD, the update is given by the gradient value and is equal to ΔW=uTy, where y is the error propagated back to the layer using the backpropagation algorithm, and u is the input. For a TNN and a BNN, the input activations are u ∈ {−1,0,1}={−Vup,0,Vup} and u ∈ {−1,1}={−Vup,Vup}, respectively; thus, ΔWi,j=yiuj=sign(uj)yi, or ΔWi,j=0 for uj=0. In this scheme, the voltage sources keep the activation values, so u1=u2=u (whereas in the general scheme the voltage sources are set to u1=u2=Vup). The control signals are a function of the error y, whereas in ADAM and other optimization algorithms they are a function of the update value ΔW. The control signal functionality for SGD is
The functionality of the control signals remains unchanged, the voltage source is selected according to y, and the voltage sign and the effective update duration are set as a function of κ and v, the integer and remainder values of y, respectively. Therefore, the update equation is given by
Δwij=sign(yi)sign(uj)(1κ≠0+Bern(Psw(vij))) (19)
In some embodiments, to train the TNN, backpropagation of the error must be performed. Thus, an inverse matrix-vector multiplication WTy is supported, using the output row interface as an input. This allows reusing the same synapse array. Due to the synapse structure, the data is separated into two columns, as shown in
To clarify the update scheme proposed by the present disclosure, two examples of synapse updates are given.
Therefore, R2 will switch with probability
In this example, the synapse weight will be updated from −1→0 with probability
and might switch to 1 with probability
P−1→1=Psw,1Psw,2≈Psw,2.  (22)
Note that when W=−1, {R1,R2}={Roff,Ron}. Thus, if ΔW<0, the current flow direction will be from R2 to R1 and the MTJ cannot switch.
Therefore, R1 will switch with probability
In this example, the synapse weight is updated from 0w→−1 with probability P=Psw,1. Although theoretically no current should flow through R2, with probability Psw,2≈0 it might switch from Ron to Roff due to leakage currents. It is important to note that the switching probability is a function of the resistance; therefore, the switching probability of 0s={Roff,Roff} is lower than 0w={Ron,Ron}.
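The two examples above may be checked with a small Monte-Carlo sketch of the two-MTJ state machine, assuming a positive update in which each MTJ switches independently toward the state favored by the current direction, with probabilities Psw,1 and Psw,2 whose numeric values below are placeholders.

```python
import random

def update_ternary_synapse(state, p_sw1, p_sw2):
    """One stochastic update of the two-MTJ synapse for a positive update.

    state is (R1, R2) with values 'on'/'off'; W=+1 <-> (on, off),
    W=-1 <-> (off, on), W=0 <-> (on, on) or (off, off).  MTJ1 switches with
    probability p_sw1 toward Ron and MTJ2 with probability p_sw2 toward Roff,
    i.e., toward the states favored by the applied current direction.
    """
    r1, r2 = state
    if random.random() < p_sw1:
        r1 = "on"            # current direction drives R1 toward Ron
    if random.random() < p_sw2:
        r2 = "off"           # and R2 toward Roff
    return (r1, r2)

def weight(state):
    return {"onoff": +1, "offon": -1}.get(state[0] + state[1], 0)

if __name__ == "__main__":
    random.seed(2)
    p_sw1, p_sw2 = 0.99, 0.5      # placeholder probabilities (kappa and v parts)
    counts = {}
    for _ in range(100000):
        w = weight(update_ternary_synapse(("off", "on"), p_sw1, p_sw2))  # start at W=-1
        counts[w] = counts.get(w, 0) + 1
    print(counts)   # P(-1 -> 0) ~ p_sw1*(1-p_sw2), P(-1 -> +1) ~ p_sw1*p_sw2
```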
To support BNN instead of TNN, the GXNOR operation is replaced by a simple XNOR operation and the quantized space resolution is Δz0=2.
To support BWS, a 2T1R synapse is used, as illustrated in the figures. A reference resistor is added per synapse and is connected in parallel to ū of the corresponding synapse.
In some embodiments, the ternary synapse may be separated into two binary synapses with e1,n=e2,n and e1,p=e2,p. Unfortunately, due to the use of the comparator, the ternary array cannot support the inverse read from all the columns; thus, it cannot support the backpropagation when the ternary synapse is split into two binary synapses. The 2T1R synapse can be used to design a dedicated engine for BNN; such a design does not need the comparators.
Table 2 below defines the values of the weights when a 2T1R synapse is used. MTJ resistance of Ron leads to W=1 and resistance of Roff leads to W=−1. To compute the XNOR operation between the weights and activation, u, the synapse current is compared to the reference value
The result of the XNOR operation is given in the right column of Table 2 below. While other methods to support binary weights can be considered (for example, using the resistance threshold value to separate the ±1 weight values), this solution was chosen due to the low ratio between Roff and Ron, which is a common property of MTJ devices.
If the proposed synapse array is used, each weight can use only one branch of the ternary synapse; thus, the synapse can represent only a single bit, and half of the array is deactivated in binary mode. The reference resistors added to each row are located together (see
As in the GXNOR operation, the input neuron values are represented by the voltage sources. The logic values {−1,1} are represented by u ∈ {−Vrd,Vrd}. The result of each XNOR operation is
Iout=Gu,  (24)
where G is the conductance of the MTJ. During feedforward, the control signal ebr=‘1’, and hence the reference resistors are connected and the current through each row is
where Gij is the MTJ conductivity of synapse j in row i, M is the number of synapses per row, M+1,i is the total number of positive products in row i, and M−1,i is the total number of negative products in row i.
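By way of non-limiting illustration, the following behavioral sketch models the binary read-out: each 2T1R synapse contributes Iout=Gu, and the per-synapse reference resistor, tied to the complement voltage ū, subtracts a reference current by KCL. The midpoint reference conductance used below is an assumption for illustration and is not taken from the disclosure.

```python
# Behavioral sketch of the binary (2T1R) read-out: W=+1 is stored as Ron and
# W=-1 as Roff, and a per-synapse reference resistor driven by the complement
# voltage (-u) subtracts a mid-level current by KCL, so the sign of the row
# current encodes the XNOR-and-accumulate result.

G_ON, G_OFF = 1.0 / 3e3, 1.0 / 6e3       # placeholder conductances
G_REF = 0.5 * (G_ON + G_OFF)             # assumed reference conductance (midpoint)
V_RD = 0.1                               # placeholder read voltage [V]

def binary_row_current(weights, activations):
    """Row current: KCL sum over synapses of G_w*u plus G_ref*(-u)."""
    total = 0.0
    for w, a in zip(weights, activations):
        g_syn = G_ON if w == +1 else G_OFF
        total += (g_syn - G_REF) * a * V_RD
    return total

if __name__ == "__main__":
    weights = [+1, +1, -1]
    activations = [+1, +1, +1]
    expected = sum(w * a for w, a in zip(weights, activations))   # XNOR accumulate
    print(binary_row_current(weights, activations), "expected sign:", expected)
```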
In a manner similar to the TNN update scheme disclosed herein, the MTJ device of each binary synapse is updated to support the GXNOR algorithm.
The control signals are set as follows. First, the reference resistors are disconnected, and thus
ebr=‘0’.  (26)
The row control signals are
so branch 2 of each synapse is deactivated. Signals e1,p and e1,n, which control the weight update, are given by
where ω=max(|κij|,|vij|).
To compute the value of each multiplication, the current read from the activated synapse must be compared to the reference value
As in the feedforward solution, a reference resistor is added per synapse in the column, and voltage y is applied across it. The resistors are located together as illustrated in
where N is the number of synapses per column.
The present inventors have conducted an evaluation of the synapse circuit and array, and the circuit parameters and behavior were extracted and used for the training simulations. Herein, the software and the MTJ-based implementations of the GXNOR algorithm are referred to as GXNOR and MTJ-GXNOR, respectively.
The synapse circuit was designed and evaluated in Cadence Virtuoso for the GlobalFoundries 28 nm FD-SOI process. The MTJ device parameters are listed in Table 3 below. The read voltage, Vrd, was set to guarantee a low-current regime and negligible switching probability for the feedforward and inverse read operations. Likewise, the update voltage, Vup, was set to guarantee a high-current regime. The update time period was set to match Psw(Tup)≈1.
To evaluate the MTJ transition resistance and the impact of the MTJ transient response on the synapse circuit operation, the present inventors ran a Monte-Carlo simulation of the MTJ operation. The simulation numerically solves the Landau-Lifshitz-Gilbert (LLG) differential equation (assuming the MTJ is a single magnetic domain) with the addition of a stochastic term for the thermal fluctuations and Slonczewski's STT term. For each iteration of the Monte-Carlo simulation, a different random sequence was introduced to the LLG equation and the resulting MTJ resistance trace was retrieved. The equation was solved using a standard midpoint scheme and was interpreted in the sense of Stratonovich, assuming no external magnetic field and a voltage pulse waveform. The resistance of the MTJ was taken as
where θ is the angle between magnetization moments of the free and fixed layers and P is the spin polarization of the current. To approximate the time-variation resistance of an MTJ during the switch between states, all the traces from the Monte-Carlo simulation were aligned using the first time that the resistance of the MTJ reached
After the alignment, a mean trace was extracted and used for the fit. This fit was used as the time-variation resistance when the MTJ made a state switch.
The GXNOR operation for a single synapse is shown in
The GXNOR result for a 128×128 synapse array with four active synapses in a single row was also simulated, for simplicity. The synapses were located at row 128, and columns [0,32,96,128], to maximize the effect of wire parasitic resistance and capacitance on the results. The simulation results are listed in Table 4 below, which shows GXNOR and accumulate for four synapses. The activation value of the input (a), the weight value of the ternary synapse (w), and the current per synapse (Isyn) are listed, together with the expected output and Iout, the current measured at the output of each row.
To evaluate the training performance of the MTJ-based synapse, the present inventors simulated the training of two TNN and BNN architectures using the MTJ-based synapse over the MNIST and SVHN datasets in PyTorch (see, Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, pp. 2278-2324, November 1998; Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011). The network architecture for MNIST is “32C5-MP2-64C5-MP2-512FC-SVM,” and for SVHN it is “2×(128C3)-MP2-2×(256C3)-MP2-2×(512C3)-MP2-1024FC-SVM.” The synapse circuit parameters were extracted from the SPICE simulations. Table 5 lists the test accuracy of MTJ-GXNOR as compared to GXNOR and other state-of-the-art algorithms. BNNs and BWNs constrain the weights and activation to the ternary and binary spaces. However, in contrast to GXNOR, these networks keep the full-precision weights during the training phase, which increases the frequency of memory access and requires supporting full-precision arithmetic. The results of the TNN training using the MTJ-based synapse (MTJ-GXNOR TNN) are similar to the results of the GXNOR training. When the ternary synapse is used, the activation can be constrained to binary values using a sign function, although the weights cannot be constrained to the binary space. Therefore, a mixed precision network that uses binary activations with ternary weights (MTJ-GXNOR Bin-Activation) is also explored. When trained on the SVHN dataset, the test accuracy of MTJ-GXNOR BNN is lower than that of GXNOR BNN, while the test accuracy of MTJ-GXNOR Bin-Activation is closer to that of GXNOR TNN.
Variation in the device parameters and environment may affect the performance of the proposed circuits. Herein, the sensitivity of the TNN training performance to process variation is evaluated.
Two cases of process variation were considered: resistance variation; and variation in θ distribution. Variation in the device resistance and θ distribution may lead to different switching probability per MTJ device. To evaluate the sensitivity of the training to the device-to-device variation, the MNIST-architecture training was simulated with variations in the resistance and θ distributions. Several Gaussian variabilities were examined with different relative standard deviations (RSD). Table 6 lists the training accuracy for resistance variation and θ variation. The resistance RSD was found to be approximately 5%, while the present simulations show that the training accuracy is robust to the resistance variation even for higher RSD values (e.g. only 0.46% accuracy degradation for RSD=30%). The training accuracy is more sensitive to variations in θ. Nevertheless, high standard deviation of θ values results in better training accuracy. The performance of the MTJ-GXNOR algorithm improves for higher variations in θ. Table 7 lists the training results for different θ0 values; the θ0 value used in this work is marked in bold. Larger θ0 values, which correspond to higher randomness of the MTJ switching process, yield better accuracy.
The operation most sensitive to voltage variation is the weight update operation, where the update probability is a function of the voltage drop across the MTJ device. Therefore, the test accuracy obtained for variation in the voltage source is evaluated.
Lower voltage magnitude decreases the switching probability, thus lowering the network accuracy; the value selected for Vup in the present disclosure is V=1 V. Increasing the voltage leads to higher switching probability and θ0 variance. Hence, increasing the voltage magnitude increases the randomness of the MTJ switching. Therefore, the voltage magnitude can be used to improve the stochastic switching process and to improve the network training performance when using an MTJ device with low θ0 variance. In the case simulated in this work, increasing the voltage magnitude above Vup=1.1 V only slightly improves test accuracy; hence, Vup=1 V was selected herein to constrain the power consumption of the present design.
The ambient temperature affects the switching behavior of the MTJ. When the temperature increases, the Roff resistance decreases. The Ron resistance value has a much weaker temperature dependency and is nearly constant. The transistors can be described as variable current sources, where for high temperatures the drivability of the MOS transistor is degraded because the electron mobility decreases. Hence, the ambient temperature has opposite effects on the Roff of the MTJ and the drivability of the MOS transistor, which affect the switching probability. Additionally, the initial magnetization angle, θ, depends on the temperature through the normal distribution θ∼N(0,θ0), where the standard deviation is θ0=√(kBT/(μ0HkMsV)). Hence, θ0 increases for higher temperature.
As mentioned above, the training performance is highly dependent on the variance of θ. To estimate the sensitivity of the MTJ-based synapse to the temperature, MTJ-based training with different temperatures in the range [260K,373K] was simulated, where the resistances are extrapolated to emulate the temperature dependence. Table 8 below lists the test accuracy obtained for different temperatures. Although better accuracy is obtained for higher temperatures, the training phase and network accuracy are robust to temperature variations.
The power consumption and area were evaluated for a single synapse and synapse array, including the interconnect parasitics. The results are listed in Table 9 below. During the read operation, all the synapses are read in parallel; therefore, the feedforward power is higher than the write power, where the columns are updated serially.
QNNs were proposed as a way to reduce the overall power consumption and complexity of full-precision DNNs; hence, the energy efficiency of the present design was evaluated. For the feedforward phase in a 128×128 synapse array, 128×(128+128) GXNOR and accumulate operations are done in parallel (1 OP = 1-bit GXNOR/accumulate/update), which determines the energy efficiency the synapse array can reach in this phase. For the update phase, each update is counted as a single operation when evaluating the energy efficiency of updating the weights.
During the update phase the voltage source is set to guarantee a high current domain; the energy efficiency of the update operation is therefore bounded by the MTJ device properties.
To evaluate the performance when integrating the present design into a full system, the following setup (which is not the only possible one) may be considered; the performance will change for different setups. The synapse array is used as an analog computation engine and as memory for the weights; hence, the input and output of the array are converted using a 1-bit DAC and an 8-bit ADC. In the inverse read phase, a bit-streaming method is used to compute the multiplication with the full-precision error data; thus, only a 1-bit DAC is needed. To generate the control signals, an 8-bit DAC and voltage comparators are needed. The power and area of those components are listed in Table 9. The respective energy efficiency in the feedforward and update phases is
where the power consumption of the data converters limits the overall performance. For the bit-streaming method with 8-bit precision for the error data, the energy efficiency of the inverse read operation is
A DNN architecture is structured as layers of neurons connected by synapses. Each synapse is weighted, and the functionality of the network is set by supplying different values to those weights. To find the values suitable for a specific task, machine learning algorithms are used to train the network. After the training is complete, the network is provided with new data and it infers the result based on its training; this stage is called the inference stage.
The basic computation element in a DNN is the neuron. DNNs are constructed from layers of neurons, each of which determines its own value from a set of inputs connected to the neuron through a weighted connection called a synapse. Therefore, the value of the output is given by the weighted sum of the input,
rn=Σm=1MWnmxm,  (31)
where xm, Wnm, and rn are, respectively, input neuron m, the connection weight (synapse weight) between neuron n and neuron m, and output n. In the general case, each connection has its own weight, and thus the output vector r is determined by a matrix-vector multiplication,
r=Wx, (32)
To perform matrix-vector multiplication, several multiply-and-accumulate (MAC) operations are needed. Applying new input to the network and computing the output is also referred to as feed-forward. When training a network, after the feed-forward, the weights are updated in another phase called back-propagation.
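For reference, a minimal sketch of the feed-forward computation of equations (31)-(32), i.e., the matrix-vector product that the synapse array implements in the analog domain:

```python
def feedforward(W, x):
    """r_n = sum_m W[n][m] * x[m]  (equation (31)); r = W x (equation (32))."""
    return [sum(w_nm * x_m for w_nm, x_m in zip(row, x)) for row in W]

if __name__ == "__main__":
    W = [[1, -1, 0],
         [0,  1, 1]]          # 2 output neurons, 3 inputs (ternary weights)
    x = [1, -1, 1]
    print(feedforward(W, x))  # -> [2, 0]
```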
In ternary neural networks, i.e., networks with weights and activations of {−1,0,1}, the complex MAC operation is replaced by simple logic gated XNOR and popcount operations. The gated XNOR operation is described in Table 8 below:
Thus, to support ternary neural networks, the hardware needs to support the gated XNOR operation.
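By way of non-limiting illustration, the following sketch shows how the MAC is replaced for ternary values: the gated XNOR returns 0 whenever either operand is 0 and otherwise returns the XNOR of the signs, so the dot product reduces to a popcount-style accumulation of ±1 results.

```python
def gated_xnor(w: int, a: int) -> int:
    """Gated XNOR for ternary operands in {-1, 0, +1}: 0 if either input is 0,
    +1 if the signs agree, -1 if they differ (equivalent to w * a here)."""
    if w == 0 or a == 0:
        return 0
    return +1 if w == a else -1

def ternary_dot(weights, activations):
    """Popcount-style accumulation of gated-XNOR results, replacing the MAC."""
    return sum(gated_xnor(w, a) for w, a in zip(weights, activations))

if __name__ == "__main__":
    print(ternary_dot([1, -1, 0, 1], [1, 1, -1, -1]))   # -> -1
```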
During the back-propagation phase, the new value of the weights (i.e., the update) is calculated using gradient-based optimization algorithms. During this phase, the error at the network output layer needs to be back-propagated to the internal layers of the network. As part of the computation, another matrix-vector multiplication, y=WTδ, is performed. This matrix-vector multiplication cannot be replaced by the gated-XNOR operation, and the multiplication is performed as described in Table 9 below:
After the back-propagation phase, the weight update values, ΔW, are calculated. Then the weights are updated according to the GXNOR algorithm.
Note that the MTJ devices of the synapse of the present invention are used to store the weights and perform the XNOR operation. The synapse exploits stochastic writing of the weights to support stochastic training and processing-in-memory (PIM), yielding reduced power consumption, reduced memory capacity requirements, and faster training.
To perform the gated XNOR logic operation between the synapse and activation values, the input value to the neuron is represented by the voltage sources. The logic values {−1,0,1} are represented by u ∈ {−Vrd,0,Vrd}. During this operation, u1=u and u2=−u are connected. The resulting output current is
Iout=(G1−G2)u,  (33)
where G1 and G2 are the conductances of the two MTJs. As shown in Table 10 below, the polarity of Iout depends on the input voltage and the synapse state. If u=0 or s∈{0w,0s}, the output current is Iout≈0. However, if the state and input activation have the same polarity, then sign(Iout)=1; otherwise, sign(Iout)=−1.
To perform feed forward with the GXNOR operation, the row output is grounded and the output currents from all synapses are summed based on KCL. Thus, the current through row i is given by
where Gj,n/p, N, N+1,i and N−1,i are the conductivity of each MTJ, the number of synapses per row, the total number of positive synapses, and the total number of negative synapses in row i, respectively.
Regarding backpropagation, the error function o=WTδ is used to determine the outputs used for the updates, where oj and δj may be 8-bit or 16-bit values, for example. The data is split into 'positive' and 'negative' columns. The δ value is input to each row, which represents the inputs to one of M neurons. The current in each positive and negative column is summed, and the difference between the positive and negative sums is generated by a comparator or op amp. The N columns represent the inputs, with each column representing a different input (or neuron output from a previous layer). The output from column i is given by
Oi=Σj=1M(G+−G−)δj=Σj=1MSjδj,  (35)
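By way of non-limiting illustration, the following behavioral sketch models the inverse read: the error δ is driven on the rows, and each column output is the difference between its 'positive' and 'negative' branch currents, realizing O=WTδ up to a conductance scale factor. Conductance values are placeholders.

```python
# Behavioral sketch of the inverse read (backpropagation) O = W^T * delta.
# Each ternary weight is stored as a conductance pair (G_plus, G_minus); the
# column output is the difference of the two branch currents.

G_ON, G_OFF = 1.0 / 3e3, 1.0 / 6e3     # placeholder conductances
WEIGHT_TO_G = {+1: (G_ON, G_OFF), -1: (G_OFF, G_ON), 0: (G_ON, G_ON)}

def inverse_read(W, delta):
    """Return per-column currents O_i = sum_j (G+_ij - G-_ij) * delta_j."""
    n_rows, n_cols = len(W), len(W[0])
    out = [0.0] * n_cols
    for j in range(n_rows):            # delta_j is applied to row j
        for i in range(n_cols):
            g_plus, g_minus = WEIGHT_TO_G[W[j][i]]
            out[i] += (g_plus - g_minus) * delta[j]
    return out

if __name__ == "__main__":
    W = [[1, -1, 0],
         [0,  1, 1]]
    delta = [0.2, -0.1]
    ideal = [sum(W[j][i] * delta[j] for j in range(len(W))) for i in range(len(W[0]))]
    print(inverse_read(W, delta), "ideal W^T delta:", ideal)
```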
Regarding weight updates, by exploiting the stochastic nature of the MTJ devices, the stochastic update is done in-situ. The weight update is done in the high-current regime, guaranteed by the update input voltage Vup=Vin+Ic0Rmid>Ic0Roff, where Rmid=(Roff+Ron)/2. Thus, the switching probability of each MTJ is defined by
where Δt is the update duration, u is the voltage drop over the device, and R is the resistance of the device. Note that the probability is a function of Δt and uin. The update duration is set so that P(Tup)≈1. To support the GXNOR algorithm, each column is updated once per cycle, where ΔW=sign(ΔW)|ΔW|, u1=−u2=Vup, and e1,n/p and e2,n/p are control signals that (1) select the sign of the update, sign(ΔW)=sign(u1,2), and (2) open the source transistor for a duration Δt=|ΔW|.
To support advanced optimization algorithms, such as the well-known ADAM algorithm, with the synapse array, it is assumed that the update value is computed outside the synapse array and provided as an input to the synapse array. The update process is iterative, where a single column is updated at each iteration. However, a higher level of parallelism can be achieved for stochastic gradient descent. The update value Δ is represented by the update duration and the voltage drop over the device, so that Δt=abs(Δ) and sign(u)=sign(Δ). To support this scheme, the voltage sources are set to u1=u2=Vup>0 at all columns, and the update period Tup is chosen to ensure that Pswitch(Tup,Vup)≈1. The control signals are used to select the update sign and update duration per row. If sign(Δ)>0, the control signals select {u1,ū2}; otherwise, {ū1,u2} are selected.
The control signal functionality is given by
Thus, the switching probability is a function of the effective update duration and the current polarity, both defined by the control signals. When the update sign is positive, M1 is updated as a function of κ, and M2 is updated as a function of v. The different zero states have different switching probabilities, but for each zero state the probability of switching to −1 and to 1 is equal. The dashed line in the corresponding figure represents the switching probability of the GXNOR algorithm for S=−1,1.
The drawback of this circuit is the double representation of zero, which has non-symmetric switching behavior. The above-mentioned update scheme is a partial solution to make the switching response more symmetric.
To implement the control signal functionality, comparators may be used. The positive port of the comparator is fed with the voltage signal, Vp=vVdd, and the other port is connected to a saw signal, which maintains Vsaw(Tup)=Vdd. Thus, if Vi=viVdd>0, −Vsaw is always smaller than v. Therefore, ei1,n=−Vdd and Np is closed. ei2,n=Vdd as long as Vsaw<vVdd, meaning that N2 will be open for Twr,eff=vTwr.
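By way of non-limiting illustration, the following sketch demonstrates the pulse-width generation principle just described: comparing a level v·Vdd against a saw-tooth that reaches Vdd at Tup yields a control pulse whose effective width is approximately v·Tup, which in turn sets the MTJ switching probability. The time discretization below is for illustration only.

```python
# Sketch of pulse-width generation with a comparator and a saw-tooth reference:
# the comparator output stays high while v*VDD exceeds the saw voltage, so the
# transistor is kept open for an effective duration of roughly v*T_UP.

VDD = 1.0
T_UP = 1.0
N_STEPS = 1000           # time discretization, illustration only

def effective_pulse_width(v: float) -> float:
    """Duration for which v*VDD > saw(t), with saw(t) = VDD * t / T_UP."""
    high_steps = 0
    for k in range(N_STEPS):
        t = (k + 0.5) / N_STEPS * T_UP
        saw = VDD * t / T_UP
        if v * VDD > saw:
            high_steps += 1
    return high_steps / N_STEPS * T_UP

if __name__ == "__main__":
    for v in (0.25, 0.5, 0.9):
        print(v, "->", effective_pulse_width(v))   # ~ v * T_UP in each case
```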
In a first example, consider an update with W=−1 and ΔWij=1.5; then κij=1 and vij=0.5. Thus, Psw,1≈1, and
Device M1 moves from Roff to Ron and M2 may move from Ron to Roff with probability
The control transistor P1 is open for a duration Δt=Tup while the control transistor N2 is open for a duration Δt=vijTwr related to the probability P(0.5). The state of the synapse moves from −1 to 0 to 1. Note that the move from 0 to 1 is not deterministic. The move from 0 to 1 occurs with probability
In a second example, consider an update with W=0w and ΔWij=−0.5; then κij=0 and vij=−0.5. Thus, Psw,1≈0. Device M1 moves from Ron to Roff and M2 moves from Ron to Roff in a non-deterministic manner. The control transistor N2 is closed while the control transistor N1 is open for a duration Δt=vijTwr, leading to a switching probability of P(0.5). The state of the synapse moves from 0w to −1 with probability
Regarding inverse reads, to train the TNN, backpropagation of the error should be performed. Thus, an inverse matrix vector multiplication WTy is supported using the output row interface as input. This allows the same synapse array to be reused. Due to the synapse structure, the data is separated into two columns, where the output data is given by Ii,p−Ii,n, the currents through each column. Therefore, the data may be converted into voltage and used as the voltage comparator.
Those skilled in the art will recognize that the boundaries between logic and circuit blocks are merely illustrative and that alternative embodiments may merge logic blocks or circuit elements or impose an alternate decomposition of functionality upon various logic blocks or circuit elements. Thus, it is to be understood that the architectures depicted herein are merely exemplary, and that in fact many other architectures may be implemented which achieve the same functionality.
Any arrangement of components to achieve the same functionality is effectively “associated” such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality may be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or intermediary components. Likewise, any two components so associated can also be viewed as being “operably connected,” or “operably coupled,” to each other to achieve the desired functionality.
Furthermore, those skilled in the art will recognize that boundaries between the above-described operations are merely illustrative. Multiple operations may be combined into a single operation, a single operation may be distributed over additional operations, and operations may be executed at least partially overlapping in time. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The use of introductory phrases such as “at least one” and “one or more” in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an.” The same holds true for the use of definite articles. Unless stated otherwise, terms such as “first,” “second,” etc. are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements. The mere fact that certain measures are recited in mutually different claims does not indicate that a combination of these measures cannot be used to advantage.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. As numerous modifications and changes will readily occur to those skilled in the art, it is intended that the invention not be limited to the limited number of embodiments described herein. Accordingly, it will be appreciated that all suitable variations, modifications and equivalents may be resorted to, falling within the spirit and scope of the present invention. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
This application claims the benefit of priority of U.S. Provisional Patent Application No. 62/943,887, filed on Dec. 5, 2019, the contents of which are incorporated by reference as if fully set forth herein in their entirety.