With advances in modern-day semiconductor manufacturing processes and the continually increasing amount of data generated each day, there is an ever-greater need to store and process large amounts of data, and therefore a motivation to find improved ways of doing so. Although it is possible to process large quantities of data in software using conventional computer hardware, existing computer hardware can be inefficient for some data-processing applications.
Aspects of the present disclosure are best understood from the following detailed description when read with the accompanying figures. It is noted that, in accordance with the standard practice in the industry, various features are not drawn to scale. In fact, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion.
The following disclosure provides many different embodiments, or examples, for implementing different features of the provided subject matter. Specific examples of components and arrangements are described below to simplify the present disclosure. These are, of course, merely examples and are not intended to be limiting. For example, the formation of a first feature over, or on a second feature in the description that follows may include embodiments in which the first and second features are formed in direct contact, and may also include embodiments in which additional features may be formed between the first and second features, such that the first and second features may not be in direct contact. In addition, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
Further, spatially relative terms, such as “beneath,” “below,” “lower,” “above,” “upper,” “top,” “bottom” and the like, may be used herein for ease of description to describe one element or feature's relationship to another element(s) or feature(s) as illustrated in the figures. The spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. The apparatus may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein may likewise be interpreted accordingly.
In this regard, machine learning has emerged as an effective way to analyze and derive value from such large quantities of data. Generally, machine learning is a field of computer science that involves algorithms that allow computers to “learn” (e.g., improve performance of a task) without being explicitly programmed. Machine learning can involve different techniques for analyzing data to improve upon a task. One such technique, deep learning, is based on neural networks. However, machine learning performed on conventional computer systems can involve excessive data transfers between memory and the processor, leading to high power consumption and slow compute times.
Compute-in-Memory (CiM) (which can also be referred to as in-memory processing) involves performing compute operations within a memory array. Stated another way, compute operations are performed directly on the data read from the memory cells instead of transferring the data to a digital processor for processing. By avoiding some of these data transfers to the digital processor, the bandwidth limitations associated with transferring data back and forth between the processor and memory in a conventional computer system are reduced.
One application for such a CiM is artificial intelligence (AI), and specifically machine learning. For example, a computing system (e.g., a CiM system) can use multiple layers of computational nodes, where lower layers perform computations based on results of computations performed by higher layers. These computations may rely on the computation of dot products and absolute differences of vectors, typically computed with MAC operations performed on the parameters, input data, and weights. The term “MAC” can refer to multiply-accumulate, multiplication/accumulation, or multiplier accumulator, in general referring to an operation that includes the multiplication of two values and the accumulation of a sequence of multiplications.
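For readers less familiar with the operation, a MAC over two vectors can be sketched in software as follows. This Python snippet is illustrative only (the function name is hypothetical) and is not part of the disclosed circuitry:

```python
def mac(inputs, weights):
    """Multiply-accumulate: sum of element-wise products (a dot product)."""
    acc = 0
    for x, w in zip(inputs, weights):
        acc += x * w  # multiply two values, accumulate the running sum
    return acc

# e.g., inputs [1, 0, 1] with weights [3, 5, 2] accumulate to 1*3 + 0*5 + 1*2 = 5
```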
The present disclosure provides various embodiments of a CiM system that can efficiently output a number of MAC values on a number of input signals. For example, the CiM system, as disclosed herein, can include a number of macros formed as an array, and a control circuit operatively coupled to the array. Each macro can output a number of MAC values of a first input signal and a second input signal. Each of the first and second input signals can include a respective plural number of (e.g., binary) bits. The macro can compute or otherwise determine a MAC value on a first one of the bits of the first input signal and a first one of the bits of the second input signal obtained in a current cycle. Further, the macro can determine the MAC value in the current cycle as either a fixed logic value or a value computed based on the respective first bits obtained in the current cycle. In various embodiments, prior to computing the MAC value (of the respective first bits), the control circuit can output a control signal to the macro based on the first bits, and the macro can determine whether there is a need to toggle its inputs to the first bits. As such, as a frequency of the cycles increases (e.g., thereby computing the MAC values at a higher frequency), the macro can significantly decrease the amount of toggling of the bits of the input signals, which can advantageously reduce power consumption of the whole CiM system while maintaining high-speed computation.
A neuron's total input stimulus corresponds to the combined stimulation of all of its weighted input connections. According to various implementations, if a neuron's total input stimulus exceeds some threshold, the neuron is triggered to perform some mathematical function, e.g., linear or non-linear, on its input stimulus. The output of the mathematical function corresponds to the output of the neuron, which is subsequently multiplied by the respective weights of the neuron's output connections to its following neurons.
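The neuron model described above can be illustrated with a short Python sketch. The function name is hypothetical, and the ReLU-style non-linearity is chosen only for concreteness; any linear or non-linear function could stand in its place:

```python
def neuron_output(inputs, weights, threshold=0.0):
    """Hypothetical neuron: applies a non-linear function to its total input
    stimulus only when that stimulus exceeds a threshold."""
    # Combined stimulation of all weighted input connections
    stimulus = sum(x * w for x, w in zip(inputs, weights))
    if stimulus <= threshold:
        return 0.0  # not triggered
    return max(0.0, stimulus)  # e.g., a ReLU-style non-linearity
```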
Generally, the more connections between neurons, the more neurons per layer and/or the more layers of neurons, the greater the intelligence the network is capable of achieving. As such, neural networks for actual, real-world artificial intelligence applications are generally characterized by large numbers of neurons and large numbers of connections between neurons. Extremely large numbers of calculations (not only for neuron output functions but also weighted connections) are therefore involved in processing information through a neural network.
As mentioned above, although a neural network can be completely implemented in software as program code instructions that are executed on one or more traditional general-purpose central processing unit (CPU) or graphics processing unit (GPU) cores, the read/write activity between the CPU/GPU core(s) and system memory that is needed to perform all of the calculations is extremely intensive. The overhead and energy associated with repeatedly moving large amounts of read data from system memory, processing that data with the CPU/GPU cores, and then writing resultants back to system memory, across the many millions or billions of computations needed to effect the neural network, have not been entirely satisfactory in many respects.
As shown, the CiM system 200 includes a CiM array 202 and a control circuit 252, in accordance with various embodiments. The CiM array 202 includes a number of (e.g., CiM) macros: 212A, 212B, 212C, 212D, 212E, 212F, 212G, and 212H. Although eight macros are shown, it should be understood that the CiM array 202 can include any number of macros while remaining within the scope of the present disclosure. These macros of the CiM array 202 are sometimes collectively referred to as macros 212. In some embodiments, the macros 212 can be arranged across multiple columns and rows. For example in
As will be discussed in further detail with respect to
In some embodiments, the control circuit 252 includes a number of logic gates that each can generate the control signal for a respective column of the CiM array 202. For example in
Referring to
The storage components 302 to 310 can each store at least two respective bits of a first input signal and a second input signal. The input storage components 302 to 308 are configured to store respective bits of the first and second input signals received or otherwise obtained for a current CiM operation, while the backup storage component 310 is configured to store two (e.g., last computed) bits of the first and second input signals received or otherwise obtained for a previous CiM operation. Further, the storage component 302 may correspond to respective most significant bits (MSB) of the first and second input signals obtained in the current CiM operation, while the storage component 308 may correspond to respective least significant bits (LSB) of the first and second input signals obtained in the current CiM operation.
Within each CiM operation, the macro 212A may perform a MAC operation on the bits stored in each of the input storage components 302 to 308 during a respective one of a number of different cycles. The macro 212A can sequentially perform the MAC operations according to a value of the bits of the first and second input signals, in some embodiments. For example, the macro 212A can perform a first MAC operation on the respective MSBs of the first and second input signals (stored in 302A and 302B of the input storage component 302, respectively) in a first cycle; a second MAC operation on the respective next MSBs of the first and second input signals (stored in 304A and 304B of the input storage component 304, respectively) in a second cycle; a third MAC operation on the respective next LSBs of the first and second input signals (stored in 306A and 306B of the input storage component 306, respectively) in a third cycle; and a fourth MAC operation on the respective LSBs of the first and second input signals (stored in 308A and 308B of the input storage component 308, respectively) in a fourth cycle. Accordingly, the backup storage component 310 may store, in 310A and 310B, respectively, the LSBs of the first and second input signals obtained in the previous CiM operation.
However, it should be understood that the macro 212A can sequentially perform the MAC operations in a different order, while remaining within the scope of the present disclosure. For example, the macro 212A can perform the MAC operations starting with the LSBs of the first and second input signals (in the current CiM operation). In such a scenario, the backup storage component 310 may store the MSBs of the first and second input signals in the previous CiM operation. Additionally, the macro 212A can “selectively” perform each of the MAC operations based on a control signal, which will be discussed in further detail below.
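The bit-serial, cycle-by-cycle scheme described above can be modeled in software. The following Python sketch uses hypothetical names, and the MSB-first, significance-weighted accumulation shown here is an assumption standing in for however a downstream consumer combines the per-cycle MAC values; it is not the disclosed hardware:

```python
def bit_serial_mac(x0_bits, x1_bits, w0, w1):
    """Model of the cycle-sequenced MAC: one bit of each input per cycle,
    MSB first; each cycle's partial MAC value is weighted by bit significance.
    (Software illustration only; the macro emits one MAC value per cycle.)"""
    assert len(x0_bits) == len(x1_bits)
    total = 0
    for cycle, (b0, b1) in enumerate(zip(x0_bits, x1_bits)):  # one cycle per bit pair
        partial = b0 * w0 + b1 * w1       # per-cycle MAC on the two bits
        shift = len(x0_bits) - 1 - cycle  # MSB-first bit significance
        total += partial << shift
    return total

# XIN[0] = 0b1010 (=10) and XIN[1] = 0b0110 (=6) accumulate to 10*w0 + 6*w1
```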
The macro 212A further includes a number of switches 322, 324, 326, 328, and 330. The switches 322 to 330 are coupled to the input/backup storage components 302 to 310, respectively. Further, in each cycle, only one of the switches 322 to 330 can be turned on to toggle or otherwise couple the corresponding storage component to a MAC computation unit 331 of the macro 212A. In accordance with various embodiments, the switches 322 to 328 may be sequentially turned on in respective cycles, unless the switch 330 is turned on. The switch 330 can be turned on based on the control signal, XTRL[0], specifically, a logic inverse value of the control signal, XTRL[0].
As discussed with respect to
The macro 212A further includes at least a first multiplier 340, a second multiplier 342, and an adder 354, which can form the MAC computation unit 331. The first multiplier 340 and second multiplier 342 are each configured to multiply a bit of one of the first or second input signals (e.g., obtained in a current cycle) by a respective weight. In some embodiments, the first multiplier 340 can retrieve one of the bits of the input signal, XIN[0], upon the corresponding switch being turned on, and multiply the retrieved bit by a weight 341; and the second multiplier 342 can retrieve one of the bits of the input signal, XIN[1], upon the corresponding switch being turned on, and multiply the retrieved bit by a weight 343. Next, the adder 354 can sum the multiplication results provided by the multipliers 340 and 342, and output the sum as an intermediate MAC value 355.
For example, in response to the switch 322 being turned on, 302A and 302B of the storage component 302 can be coupled to the multipliers 340 and 342, respectively. Next, the multiplier 340 can multiply the bit obtained from 302A by the weight 341, and the multiplier 342 can multiply the bit obtained from 302B by the weight 343. The adder 354 can then sum the multiplied bits as the intermediate MAC value 355 in the current cycle. On the other hand (where the switch 322 is not turned on as originally scheduled, and in turn, the switch 330 is turned on), the macro 212A can skip the MAC operation in this cycle and output a final MAC value 357 as a fixed logic value.
The macro 212A can store the weights 341 and 343 in respectively different memory (or bit) cells 352 of a coupled memory array 350. Although in the illustrated embodiment of
Operatively coupled to the MAC computation unit 331, the macro 212A further includes a logic gate (e.g., an AND gate) configured to receive the intermediate MAC value 355 (whether or not it has been computed) and the control signal, XTRL[0], as inputs, and to perform an AND operation on these two inputs to output the final MAC value 357. As discussed above, a logic value of the control signal XTRL[0] is determined by OR'ing the bits of the input signals, XIN[0] and XIN[1], in a certain cycle. For example, if the bits are each equal to a logic 0, the control signal XTRL[0] is equal to a logic 0, which causes the final MAC value 357 to be a logic 0 regardless of the intermediate MAC value 355. Alternatively stated, the macro 212A can determine or otherwise identify the bits of the first and second input signals in a certain cycle based on the control signal, XTRL[0]. If both of the bits are logic 0s, the macro 212A can skip toggling the corresponding switch (one of the switches 322 to 328) and performing the MAC operation, and directly output the final MAC value as a fixed logic 0.
In brief overview, the method 400 starts with operation 402 of receiving a first input signal (e.g., XIN[0]) and a second input signal (e.g., XIN[1]). The method 400 proceeds to operation 404 of determining whether respective bits of the first and second input signals are each equal to a logic 0. In response to determining that the bits are both equal to logic 0s, the method 400 continues to operation 406 of maintaining inputs of a MAC computation unit unchanged. Next, the method 400 continues to operation 408 of outputting a final MAC value as a fixed logic value. In response to determining that at least one of the bits is not equal to a logic 0, the method 400 continues to operation 410 of coupling the bits of the input signals to the MAC computation unit. Next, the method 400 continues to operation 412 of outputting the final MAC value based on the MAC computation.
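The flow of the method 400 can be sketched as a single cycle in Python. This is an illustrative software model with hypothetical names; in the disclosed embodiments the determination is made by the control circuit, switches, and logic gates rather than by software:

```python
def zero_skip_mac_cycle(b0, b1, w0, w1, latched):
    """One cycle of the zero-skip flow (sketch of operations 404-412).
    `latched` models the MAC computation unit's current inputs; leaving it
    unchanged when both bits are 0 is what avoids the toggle (and its power)."""
    if b0 == 0 and b1 == 0:            # op 404: are both bits logic 0?
        return 0, latched              # ops 406/408: inputs unchanged, fixed logic 0 out
    latched = (b0, b1)                 # op 410: couple the bits to the MAC unit
    return b0 * w0 + b1 * w1, latched  # op 412: output the computed MAC value
```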
To further elaborate the method 400,
Referring first to
Referring next to
Referring next to
Referring next to
Referring then to
In one aspect of the present disclosure, an integrated circuit is disclosed. The integrated circuit includes a first logic gate configured to receive a first input signal and a second input signal, and generate a first control signal based on a first bit of first input signal and a first bit of the second input signal obtained in a current cycle. The integrated circuit includes a first backup storage component configured to store a second bit of the first input signal and a second bit of the second input signal obtained in a previous cycle. The integrated circuit includes a plurality of first macros each configured to selectively compute, based on the first control signal, a first multiply-accumulate (MAC) value for the first bit of the first input signal and the first bit of the second input signal.
In another aspect of the present disclosure, an integrated circuit is disclosed. The integrated circuit includes an array comprising a plurality of macros. Each macro is configured to output a plurality of multiply-accumulate (MAC) values of a first input signal and a second input signal in respectively different cycles. Each macro is configured to determine a first one of the plurality of MAC values in a current one of the cycles as either a fixed logic value or being computed based on a first bit of the first input signal and a first bit of the second input signal obtained in the current cycle.
In yet another aspect of the present disclosure, a method for operating a CiM system is disclosed. The method includes receiving a first input signal and a second input signal. The method includes, in response to determining that at least one of a first bit of the first input signal or a first bit of the second input signal obtained in a current cycle is not equal to a first logic value, computing a multiply-accumulate (MAC) value of the first bit of the first input signal and the first bit of the second input signal. The method includes, in response to determining that the first bit of the first input signal and the first bit of the second input signal obtained in the current cycle are each equal to the first logic value, outputting the MAC value as the first logic value.
As used herein, the terms “about” and “approximately” generally mean plus or minus 10% of the stated value. For example, about 0.5 would include 0.45 to 0.55, about 10 would include 9 to 11, and about 1000 would include 900 to 1100.
The foregoing outlines features of several embodiments so that those skilled in the art may better understand the aspects of the present disclosure. Those skilled in the art should appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions, and alterations herein without departing from the spirit and scope of the present disclosure.
This application claims priority to and the benefit of U.S. Provisional Application No. 63/283,018, filed Nov. 24, 2021, entitled “ZERO SKIP FOR COMPUTING IN MEMORY,” which is incorporated herein by reference in its entirety for all purposes.