The present disclosure relates generally to machine-learning accelerators, and more particularly, to multiply-and-accumulate (MAC) acceleration for improving the efficiency of machine learning operations.
Non-Volatile Memory (NVM)-based crossbar architectures provide an alternative mechanism for performing MAC operations in machine-learning algorithms, particularly neural networks. The mixed-signal approach using NVM bit cells relies upon Ohm's law to implement multiply operations by taking advantage of the resistive nature of emerging NVM technologies (e.g., phase change memory (PCM), resistive random-access memory (RRAM), correlated electron random access memory (CeRAM), and the like). An application of a voltage bias across an NVM bit cell generates a current that is proportional to the product of the conductance of the NVM element and the voltage bias across the cell.
Currents from multiple bit cells are added in parallel to implement an accumulated sum. Thus, a combination of Ohm's law and Kirchhoff's current law implements multiple MAC operations in parallel. The same MAC operations, however, can be energy-intensive when implemented using explicit multipliers and adders in the digital domain.
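As a purely illustrative aid (not part of the disclosed embodiments), the following Python sketch models one crossbar column numerically: each cell contributes a current I = G·V per Ohm's law, and the bitline accumulates those currents per Kirchhoff's current law, yielding a MAC result. The function name and example values are assumptions for exposition.

```python
# Illustrative numerical model of a single NVM crossbar column.
# Each cell conducts I = G * V (Ohm's law); the bitline current is the
# sum of all cell currents (Kirchhoff's current law), i.e., a MAC result.

def bitline_current(conductances, voltages):
    """Return the accumulated bitline current for one column."""
    assert len(conductances) == len(voltages)
    return sum(g * v for g, v in zip(conductances, voltages))

# Example: four cells with conductances (siemens) and input voltages (volts).
G = [1e-6, 2e-6, 0.5e-6, 1.5e-6]
V = [0.2, 0.1, 0.3, 0.0]
print(bitline_current(G, V))  # proportional to the dot product G . V
```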
In neural networks, MAC acceleration utilizing NVM crossbars requires programming NVM elements with precision conductance levels that represent a multi-bit weight parameter. Due to inherent device limitations, the bit-precision that can be represented is limited to 4 or 5 bits, which provides 16 to 32 distinct conductance levels. This complicates the weight programming step since the entire crossbar array of NVM bits needs to be precisely programmed (capacities of 1-10 Mb are typical).
In accordance with the present disclosure, there is provided an improved technique for refactoring MAC operations to reduce programming steps in such systems.
The present disclosure is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements.
Specific embodiments of the disclosure will now be described in detail with reference to the accompanying figures. For simplicity and clarity of illustration, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. In addition, numerous specific details are set forth in order to provide a thorough understanding of the examples described herein. However, it will be understood by those of ordinary skill in the art that the examples described herein may be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the examples described herein. Also, the description is not to be considered as limiting the scope of the examples described herein.
It will be appreciated that the examples and corresponding diagrams used herein are for illustrative purposes only. Different configurations and terminology can be used without departing from the principles expressed herein. For instance, components and modules can be added, deleted, modified, or arranged with differing connections without departing from these principles.
In the following detailed description of embodiments of the disclosure, numerous specific details are set forth in order to provide a more thorough understanding of the disclosure. However, it will be apparent to those skilled in the art that the disclosure may be practiced without these specific details. In other instances, well-known features have not been described in detail to avoid unnecessarily complicating the description.
It is to be understood that the terminology used herein is for the purposes of describing various embodiments in accordance with the present disclosure and is not intended to be limiting. The terms “a” or “an,” as used herein, are defined as one or more than one. The term “plurality,” as used herein, is defined as two or more than two. The term “another,” as used herein, is defined as at least a second or more. The terms “including” and/or “having,” as used herein, are defined as comprising (i.e., open language). The term “coupled,” as used herein, is defined as connected, although not necessarily directly, and not necessarily mechanically. The term “providing” is defined herein in its broadest sense, e.g., bringing/coming into physical existence, making available, and/or supplying to someone or something, in whole or in multiple parts at once or over a period.
As used herein, the terms “about” or “approximately” apply to all numeric values, irrespective of whether these are explicitly indicated. Such terms generally refer to a range of numbers that one of skill in the art would consider equivalent to the recited values (i.e., having the same function or result). These terms may include numbers that are rounded to the nearest significant figure. In this document, any references to the term “longitudinal” should be understood to mean in a direction corresponding to an elongated direction of a personal computing device from one terminating end to an opposing terminating end.
a ← a + (b·c)  (Eq. 1)
The composition of a group of MACs may represent dot products and vector-matrix multiplication. MAC operations are utilized in Machine Learning (ML) applications, and more specifically in Deep Neural Networks (DNNs).
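As a brief illustration (hypothetical, for exposition only), chaining the MAC step of Eq. 1 over two vectors produces their dot product, the primitive from which DNN layer computations are composed:

```python
# Dot product built from repeated multiply-accumulate steps (Eq. 1: a <- a + b*c).
def dot(b_vec, c_vec):
    a = 0.0
    for b, c in zip(b_vec, c_vec):
        a = a + b * c  # one MAC operation
    return a

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```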
BL_K = Σ_c [G_011·V_c11 + … + G_022·V_c22]  (Eq. 2)
Such arrangements differ from SRAM behavior (which has significantly higher write endurance) and are not amenable to reprogramming the weights during inference. As a consequence, the entire network needs to be unrolled into an on-chip crossbar and fixed during inference. While this has the advantage of eliminating DRAM power consumption, it undesirably limits the maximum size of the network that can be programmed on-chip. Further, it also incurs an area penalty, as mapping larger networks requires instantiation of crossbars that are megabits in capacity. This consumes more area and increases susceptibility to chip failures due to yield loss. Moreover, instantiating multiple crossbars requires instantiation of multiple ADCs/DACs, all of which need to be programmed, trimmed and compensated for drift.
An NVM/CeRAM element is a particular type of random access memory formed (wholly or in part) from a correlated electron material. The CeRAM may exhibit an abrupt conductive or insulative state transition arising from electron correlations rather than solid state structural phase changes such as, for example, filamentary formation and conduction in resistive RAM devices. An abrupt conductor/insulator transition in a CeRAM may be responsive to a quantum mechanical phenomenon, in contrast to melting/solidification or filament formation.
A quantum mechanical transition of a CeRAM between an insulative state and a conductive state may be understood in terms of a Mott transition. In a Mott transition, a material may switch from an insulative state to a conductive state if a Mott transition condition occurs. When a critical carrier concentration is achieved such that a Mott criteria is met, the Mott transition will occur and the state will change from high resistance/impedance (or capacitance) to low resistance/impedance (or capacitance).
A “state” or “memory state” of the CeRAM element may be dependent on the impedance state or conductive state of the CeRAM element. In this context, the “state” or “memory state” means a detectable state of a memory device that is indicative of a value, symbol, parameter or condition, just to provide a few examples. In a particular implementation, a memory state of a memory device may be detected based, at least in part, on a signal detected on terminals of the memory device in a read operation. In another implementation, a memory device may be placed in a particular memory state to represent or store a particular value, symbol or parameter by application of one or more signals across terminals of the memory device in a “write operation.”
A CeRAM element may comprise material sandwiched between conductive terminals. By applying a specific voltage and current between the terminals, the material may transition between the aforementioned conductive and insulative states. The material of a CeRAM element sandwiched between conductive terminals may be placed in an insulative state by application of a first programming signal across the terminals having a reset voltage and reset current at a reset current density, or placed in a conductive state by application of a second programming signal across the terminals having a set voltage and set current at a set current density.
In accordance with embodiments of the disclosure, a vector-matrix multiplication performs the following MAC operations, where, given an input vector V = {v_i}, i ∈ [0, M−1], and a matrix W = {w_ij}, j ∈ [0, N−1], an output vector O is composed of:
O_j = Σ_i w_ij·a_i  (Eq. 3)
The NVM equivalent is represented by:
I_j = Σ_i g_ij·v_i  (Eq. 4)
where I_j represents the current flowing through bitline j, V is the vector of input voltages, and g_ij is the conductance of the corresponding NVM element. For a K-bit weight representation, there can be only 2^K unique weight values. For low-precision weight encoding (3- or 4-bit values), this leads to only 8 or 16 such unique weight values:
g_ij = Σ_{k=0..K−1} g′_ijk·(R_ON + Δ·k)  (Eq. 5)
where g′_ijk ∈ {0, 1}, k ∈ [0, K−1], and Δ = (R_OFF − R_ON)/2^K.
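To make the limited precision concrete, the sketch below (an illustration only; the bounds G_OFF and G_ON and the uniform level spacing are assumptions, not taken from Eq. 5) maps a K-bit weight code onto one of only 2^K programmable conductance levels:

```python
# Illustrative mapping of a K-bit weight code to one of 2**K conductance levels.
# G_OFF, G_ON, and the uniform spacing are assumptions made for exposition.

K = 3                      # bits per weight -> 2**3 = 8 distinct levels
G_OFF, G_ON = 1e-9, 1e-6   # low and high conductance bounds (siemens)

def weight_to_conductance(code, k=K):
    levels = 2 ** k
    step = (G_ON - G_OFF) / (levels - 1)
    return G_OFF + code * step

print([weight_to_conductance(c) for c in range(2 ** K)])  # 8 programmable levels
```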
Voltages V′ = {v′_i} are defined as follows:
where v′_i = 0 if activation a_i does not have g_ijk as a multiplicand, and v′_i = v_i otherwise, Eq. 5 can be rewritten as:
given that g′_ijk = 0 implies v′_i = 0.
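The following sketch (hypothetical names and toy values; only the binary indicators g′_ijk and the masked voltages v′ are taken from the definitions above) shows how v′ selects exactly those activations that share a given weight level, so that their sum can be formed before any precision multiplication:

```python
# Illustration of the masked voltages v': for a column j and level k,
# v'_i equals v_i only where the binary indicator g'_ijk is 1, else 0.

def masked_inputs(v, g_prime_col, k):
    """Return v' for level k: v'_i = v_i if g'_ijk == 1, else 0."""
    return [v_i if g_prime_col[i][k] == 1 else 0.0 for i, v_i in enumerate(v)]

v = [0.3, 0.1, 0.7, 0.2]                         # input activations / voltages
g_prime_col = [[1, 0], [0, 1], [1, 0], [0, 0]]   # one column j, K = 1 -> 2 levels
for k in range(2):
    print(k, sum(masked_inputs(v, g_prime_col, k)))  # per-level activation sums
```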
Refactoring as represented by Eq. 10 leads to a simpler implementation in which the input activations are first conditionally added together, depending on whether they factor into the MAC operation with a specific weight value. This initial addition operation can be done using NVM elements. However, in accordance with embodiments of the disclosure, these NVM elements need not be precisely programmed: a binary weight encoding (R_ON/R_OFF) is utilized to connect an input activation to a weight value without the need for precision programming.
output_j = sum(g_ijk)·sum(v′_ik)  (Eq. 11)
where K bits provide 2^K levels. In block 502, the process is initialized and proceeds to block 504 of the low-precision write loop. In block 504, the indices i, j, k are updated. In block 506, the binary resistance g_ijk is read. Then in block 508, the binary resistance g_ijk is written. The process terminates at block 510. In this implementation, the M×N×2^K cells are programmed to either a “0” (R_OFF) or a “1” (R_ON), where 2^K defines the number of levels. From here, only 2^K×N << M×N non-volatile memory cells are precision programmed.
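A minimal end-to-end sketch of the refactored computation of Eq. 11 follows, assuming a binary routing (summing) stage followed by a small precision (multiplying) stage; the array shapes, names, and values are illustrative assumptions rather than the disclosed programming flow of blocks 502-510:

```python
# Sketch of the refactored MAC: M x N x 2**K binary cells route activations to
# weight levels, and only 2**K x N precision conductances scale the per-level sums.

def refactored_output(v, g_prime, g_level, j, K):
    """output_j = sum_k g_level[j][k] * (sum of activations routed to level k)."""
    total = 0.0
    for k in range(2 ** K):
        level_sum = sum(v[i] for i in range(len(v)) if g_prime[i][j][k] == 1)
        total += g_level[j][k] * level_sum
    return total

M, N, K = 4, 2, 1
v = [0.3, 0.1, 0.7, 0.2]
# Binary routing cells: M x N x 2**K of them, each simply ON or OFF.
g_prime = [[[1, 0], [0, 1]], [[0, 1], [1, 0]], [[1, 0], [0, 0]], [[0, 0], [1, 0]]]
# Precision cells: only 2**K x N of them need exact conductance levels.
g_level = [[0.25, 0.75], [0.5, 1.0]]      # g_level[j][k]
print([refactored_output(v, g_prime, g_level, j, K) for j in range(N)])
```

In this toy case the routing stage uses M×N×2^K = 16 binary cells while only 2^K×N = 4 cells require precision programming, illustrating the 2^K×N << M×N relationship noted above.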
output_j = sum(v_i·g_ij)  (Eq. 12)
where v_i and g_ij are high-precision numbers with 2^K levels.
Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the system. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.
Embodiments of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
Some portions of the detailed descriptions, like the processes, may be presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. An algorithm may generally be conceived to be a sequence of steps leading to a desired result. The steps are those requiring physical transformations or manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “deriving” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
The operations described herein can be performed by an apparatus. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Accordingly, embodiments and features of the present disclosure include, but are not limited to, the following combinable embodiments.
In one embodiment, a method of performing multiply-accumulate acceleration in a neural network includes generating, in a summing array having a plurality of non-volatile memory elements arranged in columns, a summed signal by the columns of non-volatile memory elements in the summing array, each non-volatile memory element in the summing array being programmed to either a high or low resistance state; and inputting the summed signal from the summing array to a multiplying array having a plurality of non-volatile memory elements, each non-volatile memory element in the multiplying array being precisely programmed to a conductance level proportional to a weight in the neural network.
In another embodiment, the summing array and multiplying array form an M×N crossbar having K-bit weights, where M×N×2^K elements are programmed to either the high or low resistance state and 2^K×N elements are precisely programmed to the conductance level proportional to the weight in the neural network.
In another embodiment, a plurality of input activations are conditionally summed depending upon specific weight values.
In another embodiment, a plurality of input activations is significantly greater in number than a plurality of weight values.
In another embodiment, the summing array comprises a plurality of high and low resistance levels.
In another embodiment, the method further comprises the M×N×2^K elements programmed to either a resistance “off” state or a resistance “on” state for 2^K levels.
In another embodiment, 2^K×N << M×N non-volatile memory cells are fine-tuned.
In another embodiment, the method further comprises scaling an output in a multiplier/scaling module.
In a further embodiment, an architecture for performing multiply-accumulate operations in a neural network includes a summing array having a plurality of non-volatile memory elements arranged in columns, the summing array generating a summed signal by the columns of non-volatile memory elements in the summing array, each non-volatile memory element in the summing array being programmed to either a high or low resistance state; and a multiplying array having a plurality of non-volatile memory elements that receive a summed signal from the summing array, each non-volatile memory element in the multiplying array being precisely programmed to a conductance level proportional to a weight in the neural network.
In another embodiment, the summing array and multiplying array form an M×N crossbar having K-bit weights, where M×N×2^K elements are programmed to either the high or low resistance state and 2^K×N elements are precisely programmed to the conductance level proportional to the weight in the neural network.
In another embodiment, a plurality of input activations is conditionally summed depending upon specific weight values.
In another embodiment, a plurality of input activations is significantly greater in number than a plurality of weight values.
In another embodiment, the architecture further comprises a plurality of resistors and where the summing array comprises a plurality of high and low resistance levels.
In another embodiment, the architecture further comprises the M×N×2^K elements programmed to either a resistance “off” state or a resistance “on” state for 2^K levels.
In another embodiment, 2^K×N << M×N non-volatile memory cells are fine-tuned.
In another embodiment, the architecture further comprises a multiplier/scaling module for scaling an output.
In another further embodiment, an architecture for performing multiply-accumulate operations in a neural network includes a crossbar including a plurality of crossbar nodes arranged in an array of rows and columns, each crossbar node being programmable to a first resistance level or a second resistance level, the crossbar being configured to sum a plurality of analog input activation signals over each column of crossbar nodes and output a plurality of summed activation signals; and a multiplier, coupled to the crossbar, including a plurality of multiplier nodes, each multiplier node being programmable to a resistance level proportional to one of a plurality of neural network weights, the multiplier being configured to sum the plurality of summed activation signals over the multiplier nodes and output an analog output activation signal.
In another embodiment, each crossbar node includes one or more non-volatile memory (NVM) elements, and each multiplier node includes a plurality of NVM elements.
In another embodiment, the crossbar includes M rows, N columns, K-bit weights, and M×N×2^K programmable NVMs, and the multiplier includes N multiplier nodes and N×2^K programmable NVMs.
In another embodiment, the architecture further comprises a plurality of digital-to-analog converters (DACs) coupled to the crossbar, each DAC being configured to receive a plurality of digital input activation signals and output the plurality of analog input activation signals.
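As a final illustrative sketch relating to the embodiment above (the ideal DAC model, resolution, and reference voltage are assumptions, not limitations of the embodiments), digital input activation codes can be converted to analog voltages before driving the crossbar and multiplier stages described herein:

```python
# Hypothetical ideal-DAC front end: each digital activation code becomes an
# analog voltage that then drives the binary summing crossbar and the
# precision multiplying array.

def dac(code, bits=8, v_ref=1.0):
    """Convert an unsigned digital code to an analog voltage (ideal DAC model)."""
    return v_ref * code / (2 ** bits - 1)

digital_activations = [12, 200, 255, 0]
analog_voltages = [dac(c) for c in digital_activations]
print(analog_voltages)  # these voltages feed the M x N x 2**K binary crossbar
```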
In accordance with the foregoing, a method and architecture for performing multiply-accumulate acceleration is disclosed. Having thus described the disclosure of the present application in detail and by reference to embodiments thereof, it will be apparent that modifications and variations are possible without departing from the scope defined in the appended claims as follows: