The present application relates generally to analog memory-based artificial neural networks and more particularly to techniques that compensate for fixed asymmetries in multiply and accumulate operations.
Artificial neural networks (ANNs) can include a plurality of node layers, such as an input layer, one or more hidden layers, and an output layer. Each node can connect to another node, and has an associated weight and threshold. If the output of any individual node is above the specified threshold value, that node is activated, sending data to the next layer of the network. Otherwise, no data is passed along to the next layer of the network. ANNs can rely on training data to learn and improve their accuracy over time. Once an ANN is fine-tuned for accuracy, it can be used for inference (e.g., classifying and predicting input data).
Analog memory-based neural networks may utilize, by way of example, the storage capability and physical properties of memory devices to implement an artificial neural network. This type of in-memory computing hardware increases speed and energy efficiency, providing potential performance improvements. Rather than moving data from memory devices to a processor to perform a computation, analog neural network chips can perform computation in the same place (e.g., in the analog memory) where the data is stored. Because there is no movement of data, tasks can be performed faster and require less energy.
The summary of the disclosure is given to aid understanding of a system and method of compensating fixed asymmetries for multiply and accumulate (MAC) operations in analog memory-based artificial neural networks, which can provide efficiency, and not with an intent to limit the disclosure or the invention. It should be understood that various aspects and features of the disclosure may advantageously be used separately in some instances, or in combination with other aspects and features of the disclosure in other instances. Accordingly, variations and modifications may be made to the system and/or its method of operation to achieve different effects.
In one embodiment, a memory device for compensating multiply and accumulate (MAC) operations is generally described. The memory device can include a plurality of memory elements arranged into a plurality of memory blocks. Each memory block can include a first set of memory elements of the plurality of memory elements and a second set of memory elements of the plurality of memory elements. The first set of memory elements can be configured to store a synaptic weight of a trained artificial neural network (ANN). The second set of memory elements can be configured to store an inverse of the synaptic weight stored in the first set of memory elements.
Advantageously, the memory device in an aspect can compensate for fixed asymmetries in MAC operations and improve their accuracy.
In one embodiment, a method for compensating multiply and accumulate (MAC) operations is generally described. The method can include sending an input vector to a first portion of a memory device. The first portion can store synaptic weights of a trained artificial neural network (ANN). The method can further include reading a first result of a multiply and accumulate (MAC) operation performed on the input vector and the synaptic weights stored in the first portion. The method can further include sending an inverse of the input vector to a second portion of the memory device. The method can further include reading a second result of a MAC operation performed on the inverse of the input vector and an inverse of synaptic weights stored in the second portion. The method can further include combining the first result and the second result to generate a final result. The final result can be a compensated result of the MAC operation performed on the input vector and the synaptic weights stored in the first portion.
Advantageously, the method in an aspect can compensate for fixed asymmetries in MAC operations and improve their accuracy.
In one embodiment, a system for compensating multiply and accumulate (MAC) operations is generally described. The system can include a memory device and a processor. The processor can be configured to send an input vector to a first portion of the memory device. The first portion can store synaptic weights of a trained artificial neural network (ANN). The processor can be further configured to read a first result of a multiply and accumulate (MAC) operation performed on the input vector and the synaptic weights stored in the first portion. The processor can be further configured to send an inverse of the input vector to a second portion of the memory device. The processor can be further configured to read a second result of a MAC operation performed on the inverse of the input vector and an inverse of synaptic weights stored in the second portion. The processor can be further configured to combine the first result and the second result to generate a final result. The final result can be a compensated result of the MAC operation performed on the input vector and the synaptic weights stored in the first portion.
Advantageously, the system in an aspect can compensate for fixed asymmetries in MAC operations and improve their accuracy.
Further features as well as the structure and operation of various embodiments are described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements.
Analog neural network chips can perform parallel vector-multiply operations, such as multiply and accumulate (MAC) operations. An analog neural network chip can receive input data in the form of excitation vectors. These excitation vectors can be applied onto multiple row-lines of the analog neural network chip in order to perform MAC operations across a matrix of stored weights encoded into conductance values of the analog memory elements. In an aspect, analog memory elements in the analog neural network chip can be sensitive to fixed asymmetries, including but not limited to shifts between positive and negative weights, positive and negative inputs, or peripheral circuit asymmetries. These fixed asymmetries can impact the accuracy of the MAC computation.
In an embodiment, device 114 can include a plurality of multiply accumulate (MAC) hardware having a crossbar structure or array. There can be multiple crossbar structure or arrays, which can be arranged as a plurality of tiles, such as a tile 102. While
In an aspect, each tile 102 can represent at least a portion of a layer of an ANN. Each memory element 112 can be connected to a respective one of a plurality of input lines 104 and to a respective one of a plurality of output lines 106. Memory elements 112 can be arranged in an array with a constant distance between crossing points in a horizontal and vertical dimension on the surface of a substrate. Each tile 102 can perform vector-matrix multiplication. By way of example, tile 102 can include peripheral circuitry such as pulse width modulators at 120 and peripheral circuitry such as readout circuits 122.
Electrical pulses 116 or voltage signals can be input (or applied) to input lines 104 of tile 102. Output currents can be obtained from output lines 106 of the crossbar structure, for example, according to a multiply-accumulate (MAC) operation, based on the input pulses or voltage signals 116 applied to input lines 104 and the values (synaptic weights values) stored in memory elements 112.
Tile 102 can include N input lines 104 and M output lines 106. A controller 108 (e.g., global controller) can program memory elements 112 to store synaptic weights values of an ANN, for example, to have electrical conductance (or resistance) representative of such values. Controller 108 can include (or can be connected to) a signal generator (not shown) to couple input signals (e.g., to apply pulse durations or voltage biases) into the input lines 104 or directly into the outputs.
In an embodiment, readout circuits 122 can be connected or coupled to read out the M output signals (electrical currents) obtained from the M output lines 106. Readout circuits 122 can be implemented by a plurality of analog-to-digital converters (ADCs). Readout circuits 122 may read currents as directly output from the crossbar array, which can be fed to another hardware circuit 118 that can process the currents, such as by performing compensations or determining errors.
Processor 110 can be configured to input (e.g., via the controller 108) a set of input activation vectors into the crossbar array. In one embodiment, the set of input activation vectors, which is input into tile 102, is encoded as electrical pulse durations. In another embodiment, the set of input activation vectors, which is input into tile 102, can be encoded as voltage signals. Processor 110 can also be configured to read, via controller 108, output activation vectors from the plurality of output lines 106 of tile 102. The output activation vectors can represent outputs of operations (e.g., MAC operations) performed on the crossbar array based on the set of input activation vectors and the synaptic weights stored in memory elements 112. In an aspect, the input activation vectors are multiplied by the values (e.g., synaptic weights) stored on memory elements 112 of tile 102, and the resulting products are accumulated (added) column-wise to produce output activation vectors in each of those columns (output lines 106). These output activation vectors can further pass through a respective activation function for activating a respective neuron.
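The column-wise multiply-accumulate behavior described above can be sketched in software. The following is a minimal NumPy model of one tile's forward pass; the array shapes, the conductance matrix G, and the ReLU activation are illustrative assumptions rather than the hardware's actual encoding:

```python
import numpy as np

# Hypothetical tile: N input lines (rows) and M output lines (columns).
N, M = 4, 3

rng = np.random.default_rng(0)
G = rng.uniform(-1.0, 1.0, size=(N, M))  # synaptic weights encoded in memory elements
u = rng.uniform(0.0, 1.0, size=N)        # input activation vector applied to row lines

# Each input is multiplied by the stored weight at every crossing point, and
# the products accumulate column-wise on the output lines.
y = u @ G                                # one MAC result per output line

# The output activations can then pass through an activation function
# (ReLU is chosen here purely as an example).
a = np.maximum(y, 0.0)
```

The matrix product performs all N×M multiplications and the M column-wise accumulations in a single step, mirroring how the crossbar computes every output line in parallel.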
Further, processor 110 can be further configured to train an ANN by adjusting the synaptic weights values, of the ANN, stored in the crossbar array. Processor 110 can repeatedly adjust the synaptic weights values stored in the crossbar array until the error between the expected outcome and a predicted outcome by the ANN converges to a target accuracy. Once the error converges to the target accuracy, processor 110 can deploy the ANN to perform inference, such as classifying or predicting input data. In an aspect, once the ANN is deployed, the synaptic weights values stored in the crossbar array can remain fixed or unchanged. However, the synaptic weights values stored in the crossbar array can be adjustable if the ANN is being retrained using new training data.
The plurality of memory elements 202 can be analog non-volatile memory elements, such as resistive random-access memory (RRAM), conductive bridging random access memory (CBRAM), ferroelectric field-effect transistors (FeFET), ferroelectric tunneling junction, or electro-chemical random-access memory (ECRAM). If a tile 102 has N rows of memory elements 202 and M columns of memory elements 202, then tile 102 can include N×M/4 memory blocks 204 and tile 102 can store N×M/4 synaptic weights values.
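As a quick check of the capacity arithmetic above, with four memory elements per block (one pair storing wk and one pair storing −wk), an N×M element array yields N×M/4 blocks and stores one synaptic weight per block. A small illustrative computation (the tile dimensions are arbitrary examples):

```python
# Four memory elements per block: two encode w_k, two encode -w_k.
ELEMENTS_PER_BLOCK = 4

def tile_capacity(n_rows: int, m_cols: int) -> int:
    """Number of memory blocks (and thus stored synaptic weights)
    in an n_rows x m_cols array of memory elements."""
    return (n_rows * m_cols) // ELEMENTS_PER_BLOCK

# Example: a 512 x 512 element tile stores 512*512/4 = 65536 weights.
print(tile_capacity(512, 512))  # -> 65536
```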
In one embodiment, processor 110 can sequentially enable first portion 210 and second portion 212 of memory elements 202 in tile 102. Processor 110 can also sequentially provide input data to tile 102 for inference (e.g., classifying, predicting, clustering, or other types of inference). By way of example, processor 110 can send control signals to enable first portion 210 of memory elements 202 that include columns of memory elements storing synaptic weights values wk. If first portion 210 is enabled, then second portion 212 of memory elements 202 storing the inverse values −wk can be deactivated. First portion 210 and second portion 212 of memory elements 202 can be enabled separately, such that when first portion 210 is enabled by processor 110, second portion 212 is deactivated, and vice versa.
In one embodiment, in response to enabling first portion 210, processor 110 can provide first input data 206 representing a vector U to first portion 210. Memory elements among the enabled first portion 210 can receive input data 206 and perform a MAC operation on vector elements of vector U and synaptic weights values wk. Processor 110 can read a result 216 of the MAC operation performed on input data 206 and synaptic weights values wk from first portion 210. In response to reading result 216, processor 110 can enable (or activate) second portion 212 of memory elements 202 and disable (or deactivate) first portion 210.
In response to enabling second portion 212, processor 110 can provide second input data 208 representing a vector −U to second portion 212. Vector −U can be an inverse of vector U, such that each vector element of vector −U can be the inverse of a corresponding vector element of vector U. Tile 102 can receive input data 208 and perform a MAC operation on vector elements of vector −U and the inverse values −wk stored in second portion 212. Processor 110 can read a result 218 of the MAC operation performed on input data 208 and inverse values −wk from second portion 212. Processor 110 can combine results 216, 218 to generate a final MAC operation result that can be a compensated version of result 216.
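The enable/read sequence above can be sketched as a behavioral model. The class below is an illustrative assumption, not the circuit: the two portions are represented as a weight vector and its inverse, only the enabled portion can be read, and the two reads are combined by averaging:

```python
import numpy as np

class TileModel:
    """Behavioral sketch of a tile with two separately enabled portions.
    Names and structure are hypothetical, chosen only for illustration."""

    def __init__(self, w: np.ndarray):
        # The first portion stores w_k; the second portion stores -w_k.
        self.portions = {"first": w, "second": -w}
        self.enabled = None

    def enable(self, name: str):
        # Enabling one portion implicitly deactivates the other.
        self.enabled = name

    def read_mac(self, u: np.ndarray) -> float:
        # MAC of the applied input against the enabled portion's weights.
        return float(np.dot(u, self.portions[self.enabled]))

w = np.array([0.5, -0.25, 0.75])
u = np.array([1.0, 2.0, -1.5])

tile = TileModel(w)
tile.enable("first")
result_216 = tile.read_mac(u)        # MAC of U with w_k
tile.enable("second")                # first portion deactivated
result_218 = tile.read_mac(-u)       # MAC of -U with -w_k
final = (result_216 + result_218) / 2.0
```

In this ideal model both reads are equal, so the final value simply matches the dot product of U and wk; the benefit of the second pass appears once fixed asymmetries are present.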
By way of example, a switch S11 and a resistor R11 can form a first memory element in a memory block 300. A switch S21 and resistor R21 can form a second memory element in memory block 300. The first and second memory element in memory block 300 can be among first portion 210 of tile 102 (see
Switches among first portion 210 can be connected to a control line 310. For example, gate terminals of switches S11, S21, S12, S22 can be connected to control line 310. Processor 110 can send control signals to switches among first portion 210 using control line 310 to enable or disable memory elements among first portion 210. Switches among second portion 212 can be connected to a control line 312. For example, gate terminals of switches S31, S41, S32, S42 can be connected to control line 312. Processor 110 can send control signals to switches among second portion 212 using control line 312 to enable or disable memory elements among second portion 212.
In one embodiment, for each memory element 202, the analog memory element (e.g., resistor) can be connected to a column line and the switch can be connected between the analog memory element and a row line. Using resistor R11 and switch S11 as an example, when switch S11 is enabled, resistor R11 remains connected to both the column line 306 and row line 304. When switch S11 is deactivated, resistor R11 is disconnected from row line 304 and current does not flow through resistor R11. In one embodiment, processor 110 can switch control lines 310, 312 on or off separately. By way of example, processor 110 can switch on control line 310 and switch off control line 312, and provide a control signal to the entire crossbar array, such that switches connected to control line 310 can be enabled by the control signal but switches connected to control line 312 are deactivated. Similarly, processor 110 can switch off control line 310 and switch on control line 312, and provide the control signal to the entire crossbar array, such that switches connected to control line 310 can be deactivated by the control signal but switches connected to control line 312 are enabled. Therefore, the separate control of gate terminals of the switches using different control lines 310, 312 can allow processor 110 to selectively enable or disable first portion 210 and second portion 212.
In response to enabling first portion 210, processor 110 can provide input data 206 to first portion 210 via row lines. Input data 206 can be an input that needs to undergo inference, such as classification and/or clustering, using synaptic weights values wk. In the example shown in
In response to enabling second portion 212, processor 110 can provide input data 208 to tile 102. Input data 208 can be an inverse of input data 206, thus can include vector elements −u1 and −u2. In the example shown in
In response to completing a first MAC operation on input data 206 and synaptic weights values wk (see
Under ideal conditions, results 216, 218 are identical because, for example, u1w1 is equivalent to (−u1)(−w1). However, results 216, 218 can differ from one another due to any fixed mismatch or asymmetry between positive and negative weights, positive and negative inputs, or peripheral circuitry (e.g., mismatches between the current paths of the first MAC operation and the second MAC operation). The systems and methods described herein can average out these differences, since combining the currents generated during the first MAC operation with the currents generated during the second MAC operation provides compensation for the fixed asymmetries. The storage of inverse values −wk, and performing the second MAC operation on the inverse of an input and −wk, can provide compensation to the first MAC operation, and the compensation can improve the accuracy of the first MAC operation.
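The cancellation can be illustrated numerically. Below, a fixed asymmetry is modeled as a gain mismatch that depends on the sign of the stored weight; this is one simple assumption about the asymmetry, and real mismatches can take other forms. Because inverting both the input and the weights routes each product through the opposite-sign weight path, averaging the two reads cancels the mismatch:

```python
def asymmetric_mac(u, w, eps=0.1):
    """Dot product with a sign-dependent gain error: terms using a positive
    stored weight see gain (1 + eps), negative weights see gain (1 - eps)."""
    return sum(ui * wi * (1 + eps if wi > 0 else 1 - eps)
               for ui, wi in zip(u, w))

u = [1.0, 2.0, -1.5]
w = [0.5, -0.25, 0.75]

exact = sum(ui * wi for ui, wi in zip(u, w))   # ideal MAC result: -1.125

first = asymmetric_mac(u, w)                   # first pass, biased by the mismatch
second = asymmetric_mac([-ui for ui in u],     # inverse input ...
                        [-wi for wi in w])     # ... against inverse weights
compensated = (first + second) / 2.0

# Each term with w_i > 0 saw gain (1 + eps) in the first pass and (1 - eps)
# in the second pass (where -w_i < 0), so the average restores a gain of 1.
```

Each individual pass is off by the mismatch, while the averaged result recovers the exact dot product, which is the compensation effect described above.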
Process 400 can begin at block 402. At block 402, a processor (e.g., processor 110 in
Process 400 can proceed from block 404 to block 406. At block 406, the processor can send an inverse of the input vector to a second portion of the memory device. Process 400 can proceed from block 406 to block 408. At block 408, the processor can read a second result of a MAC operation performed on the inverse of the input vector and an inverse of synaptic weights stored in the second portion.
Process 400 can proceed from block 408 to block 410. At block 410, the processor can combine the first result and the second result to generate a final result. The final result can be a compensated result of the MAC operation performed on the input vector and the synaptic weights stored in the first portion. In one embodiment, the processor can combine the first result and the second result by averaging the first result and the second result to generate the final result.
In one embodiment, the memory device can include a plurality of memory elements arranged into a plurality of memory blocks. The first portion can include a first set of memory elements in each memory block among the plurality of memory blocks. The second portion can include a second set of memory elements in each memory block among the plurality of memory blocks. In one embodiment, the first set of memory elements can include a first pair of memory elements, and the second set of memory elements can include a second pair of memory elements. In one embodiment, the memory device can be an analog non-volatile memory device.
In one embodiment, the processor can enable the first portion of the memory device, and sending the input vector to the first portion can be performed in response to enabling the first portion. The processor can, in response to reading the first result, disable the first portion. The processor can, in response to disabling the first portion, enable the second portion, and sending the inverse of the input vector to the second portion can be performed in response to enabling the second portion.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be implemented substantially concurrently, or the blocks may sometimes be implemented in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. As used herein, the term “or” is an inclusive operator and can mean “and/or”, unless the context explicitly or clearly indicates otherwise. It will be further understood that the terms “comprise”, “comprises”, “comprising”, “include”, “includes”, “including”, and/or “having,” when used herein, can specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the phrase “in an embodiment” does not necessarily refer to the same embodiment, although it may. As used herein, the phrase “in one embodiment” does not necessarily refer to the same embodiment, although it may. As used herein, the phrase “in another embodiment” does not necessarily refer to a different embodiment, although it may. Further, embodiments and/or components of embodiments can be freely combined with each other unless they are mutually exclusive.
As used herein, a “module” or “unit” may include hardware (e.g., circuitry, such as an application specific integrated circuit), firmware and/or software executable by hardware (e.g., by a processor or microcontroller), and/or a combination thereof for carrying out the various operations disclosed herein. For example, a processor or hardware may include one or more integrated circuits configured to perform function mapping or polynomial fits based on reading currents outputted from one or more of the output lines of the crossbar array at different time points, and/or apply the function to subsequent outputs to correct or compensate for temporal conductance variations in the crossbar array. The same or another processor may include circuits configured to input activation vectors encoded as electric pulse durations and/or voltage signals across the input lines for the crossbar array to perform its operations.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements, if any, in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.