Computer artificial intelligence (AI) has been built on machine learning, for example, using deep learning techniques. With machine learning, a computing system organized as a neural network computes a statistical likelihood of a match between input data and prior computed data. A neural network refers to a number of interconnected processing nodes that enable the analysis of data to compare an input to “trained” data. Trained data refers to computational analysis of properties of known data to develop models to use to compare input data. An example of an application of AI and data training is found in object recognition, where a system analyzes the properties of many (e.g., thousands or more) images to determine patterns that can be used to perform statistical analysis to identify an input object.
Aspects of the present disclosure are best understood from the following detailed description when read with the accompanying figures. It is noted that, in accordance with the standard practice in the industry, various features are not drawn to scale. In fact, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion.
The following disclosure provides many different embodiments, or examples, for implementing different features of the provided subject matter. Specific examples of components and arrangements are described below to simplify the present disclosure. These are, of course, merely examples and are not intended to be limiting. For example, the formation of a first feature over, or on a second feature in the description that follows may include embodiments in which the first and second features are formed in direct contact, and may also include embodiments in which additional features may be formed between the first and second features, such that the first and second features may not be in direct contact. In addition, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
Further, spatially relative terms, such as “beneath,” “below,” “lower,” “above,” “upper,” “top,” “bottom” and the like, may be used herein for ease of description to describe one element or feature's relationship to another element(s) or feature(s) as illustrated in the figures. The spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. The apparatus may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein may likewise be interpreted accordingly.
Neural networks compute “weights” to perform computation on new data (an input data “word”). Neural networks use multiple layers of computational nodes, where deeper layers perform computations based on the results of computations performed by higher layers. Machine learning currently relies on the computation of dot products and absolute differences of vectors, typically computed with multiply-accumulate (MAC) operations performed on the parameters (input data and weights). The computation of a large, deep neural network typically involves so many data elements that it is not practical to store them all in processor cache. Accordingly, these data elements are usually stored in a memory.
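For illustration only, the MAC-based dot product described above can be sketched in Python; the function name is hypothetical and not part of any disclosed embodiment:

```python
def mac_dot_product(inputs, weights):
    """Compute a dot product as a chain of multiply-accumulate (MAC)
    operations: one multiply and one accumulate per element pair."""
    acc = 0
    for x, w in zip(inputs, weights):
        acc += x * w  # a single MAC operation
    return acc

# e.g., mac_dot_product([1, 2, 3], [4, 5, 6]) yields 1*4 + 2*5 + 3*6 = 32
```

In a neural network, such a dot product would be evaluated once per node, with the inputs being the activations of the previous layer.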
Thus, machine learning is very computationally intensive with the computation and comparison of many different data elements. The computation of operations within a processor is orders of magnitude faster than the transfer of data elements between the processor and main memory resources. Placing all the data elements closer to the processor in caches is prohibitively expensive for the great majority of practical systems due to the memory sizes needed to store the data elements. Thus, the transfer of data elements becomes a major bottleneck for AI computations. As the data sets increase, the time and power/energy a computing system uses for moving data elements around can end up being multiples of the time and power used to actually perform computations.
In this regard, compute-in-memory (CIM) circuits or systems have been proposed to perform such MAC operations. Similar to a human brain, a CIM circuit conducts data processing in situ, within a suitable memory circuit. The CIM circuit suppresses the latency of data/program fetches and of uploading output results to the corresponding memory (e.g., a memory array), thus alleviating the memory (or von Neumann) bottleneck of conventional computers. Another key advantage of the CIM circuit is its high computing parallelism, thanks to the specific architecture of the memory array, where computation can take place along several current paths at the same time. The CIM circuit also benefits from the high density of multiple memory arrays with computational devices, which generally feature excellent scalability and the capability of 3D integration. As a non-limiting example, a CIM circuit targeted for various machine learning applications can perform the MAC operations locally within the memory (i.e., without having to send data elements to a host processor) to enable higher-throughput dot products of neuron activations and weight matrices, while still providing higher performance and lower energy compared to computation by the host processor.
The data elements, processed by the CIM circuit, have various data types or forms, such as an integer data type and a floating point data type. The integer data types, each of which represents a range of mathematical integers, may be of different sizes. For example, the integer data types are of 4 bits (sometimes referred to as an INT4 data type), 8 bits (sometimes referred to as an INT8 data type), etc. The floating point data type is typically represented by a sign portion, an exponent portion, and a significand (mantissa) portion that consists of the significant digits of the number. For example, one floating point number format specified by the Institute of Electrical and Electronics Engineers (IEEE®) has sixteen bits in size (sometimes referred to as an FP16 data type), which includes ten mantissa bits, five exponent bits, and one sign bit. Another floating point number format also has sixteen bits in size (sometimes referred to as a BF16 data type), which includes seven mantissa bits, eight exponent bits, and one sign bit.
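The FP16 and BF16 formats differ only in how the fifteen non-sign bits are apportioned between the exponent and the mantissa. As an illustrative sketch (the function name and sample bit patterns are hypothetical), the fields of a 16-bit word can be separated as:

```python
def split_float_fields(bits16, exp_bits, man_bits):
    """Split a 16-bit pattern into (sign, exponent, mantissa) fields.
    FP16 uses exp_bits=5, man_bits=10; BF16 uses exp_bits=8, man_bits=7."""
    sign = (bits16 >> (exp_bits + man_bits)) & 0x1
    exponent = (bits16 >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits16 & ((1 << man_bits) - 1)
    return sign, exponent, mantissa

word = 0b0101010110011001  # an arbitrary 16-bit pattern
fp16_fields = split_float_fields(word, 5, 10)  # FP16 interpretation
bf16_fields = split_float_fields(word, 8, 7)   # BF16 interpretation
```

The same bit pattern thus decodes to different exponent/mantissa values depending on which of the two formats is assumed.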
In machine learning applications, the CIM circuit is frequently configured to process dot product multiplications by performing MAC operations on a large number of data elements (e.g., an input word vector and a weight matrix), which may each be in the integer data type or the floating point data type, and then to process the addition (or accumulation) of such dot products. However, in existing technologies, few if any CIM circuits have been configured to process data elements in both the integer data type and the floating point data type. Instead, dedicated hardware circuit components are generally needed for processing the different data types, which disadvantageously lowers the hardware utilization rate. In turn, such CIM circuits may occupy an additional portion of the precious real estate of a substrate. Thus, the existing CIM circuits have not been entirely satisfactory in certain aspects.
The present disclosure provides various embodiments of a compute-in-memory (CIM) circuit that can switch between a first mode and a second mode, in which the first mode is configured for processing a number of input data elements and a corresponding number of weight data elements that are each provided as an integer data type, and the second mode is configured for processing a number of input data elements and a corresponding number of weight data elements that are each provided as a floating point data type. For example, the CIM circuit, as disclosed herein, can perform in-memory computations (e.g., multiply-accumulate (MAC) operations) on the input data elements and the weight data elements. Based on whether the input/weight data elements are provided as the integer or floating point data type, the CIM circuit can use the same hardware components to perform the MAC operations. In various embodiments, the disclosed CIM circuit may include a number of multi-mode local computing cells (LCCs). Based on the data type received or identified, each of the LCCs can selectively perform MAC operations on a pair of weight data elements and a pair of input data elements (when, e.g., each of the input/weight data elements is provided with the INT8 data type), a quadruple of weight data elements and a quadruple of input data elements (when, e.g., each of the input/weight data elements is provided with the INT4 data type), or a single weight data element and a single input data element (when, e.g., each of the input/weight data elements is provided with the FP16 or BF16 data type).
As shown, the CIM circuit 100 includes a memory circuit 102, an input circuit 104, a number of local computing cells 106, and an adder circuit (or adder tree) 108. Each of the components shown in
The memory circuit 102 may include one or more memory arrays and one or more corresponding circuits. The memory arrays are each a storage device including a number of storage elements 103, each of the storage elements 103 including an electrical, electromechanical, electromagnetic, or other device configured to store one or more data elements, each data element including one or more data bits represented by logical states. In some embodiments, a logical state corresponds to a voltage level of an electrical charge stored in a portion or all of a storage element 103. In some embodiments, a logical state corresponds to a physical property, e.g., a resistance or magnetic orientation, of a portion or all of a storage element 103.
In some embodiments, the storage element 103 includes one or more static random-access memory (SRAM) cells. In various embodiments, an SRAM cell includes a number of transistors, e.g., a five-transistor (5T) SRAM cell, a six-transistor (6T) SRAM cell, an eight-transistor (8T) SRAM cell, a nine-transistor (9T) SRAM cell, etc. In some embodiments, the storage element 103 includes one or more dynamic random-access memory (DRAM) cells, resistive random-access memory (RRAM) cells, magnetoresistive random-access memory (MRAM) cells, ferroelectric random-access memory (FeRAM) cells, NOR flash cells, NAND flash cells, conductive-bridging random-access memory (CBRAM) cells, data registers, non-volatile memory (NVM) cells, 3D NVM cells, or other memory cell types capable of storing bit data.
In addition to the memory array(s), the memory circuit 102 can include a number of circuits to access or otherwise control the memory arrays. For example, the memory circuit 102 may include a number of (e.g., word line) drivers operatively coupled to the memory arrays. The drivers can apply signals (e.g., voltages) to the corresponding storage elements 103 so as to allow those storage elements 103 to be accessed (e.g., programmed, read, etc.). For another example, the memory circuit 102 may include a number of programming circuits and/or read circuits that are operatively coupled to the memory arrays.
The memory arrays of the memory circuit 102 are each configured to store a number of the weight data elements WtDE. In some embodiments, the programming circuits may write the weight data elements WtDE into corresponding storage elements 103 of the memory arrays, respectively, while the read circuits may read the bits written into the storage elements 103, so as to verify or otherwise test whether the written weight data elements WtDE are correct. The drivers of the memory circuit 102 can include or be operatively coupled to a number of input activation latches that are configured to receive and temporarily store the input data elements InDE. In some other embodiments, such input activation latches may be part of the input circuit 104, which can further include a number of buffers that are configured to temporarily store the weight data elements WtDE retrieved from the memory arrays of the memory circuit 102. As such, the input circuit 104 can receive the input data elements InDE and the weight data elements WtDE.
In some embodiments, the input word vector (including, e.g., the input data elements InDE) and the weight matrix (including, e.g., the weight data elements WtDE), on which the CIM circuit 100 is configured to perform MAC operations, can be configured in any of at least the following data types: the INT8 data type, the INT4 data type, the FP16 data type, and the BF16 data type. It should be understood, however, that in some other embodiments, each of the input data elements InDE and the weight data elements WtDE can have any of various other integer or floating point data types such as, for example, an INT16 data type, an FP32 data type, an FP64 data type, an FP128 data type, etc., while remaining within the scope of the present disclosure.
When configured as the INT8 data type, each of the input data elements InDE and weight data elements WtDE includes 8 bits. When configured as the INT4 data type, each of the input data elements InDE and weight data elements WtDE includes 4 bits. When configured as the FP16 data type, each of the input data elements InDE and weight data elements WtDE includes 1 sign bit, 5 exponent bits, and 10 mantissa bits. When configured as the BF16 data type, each of the input data elements InDE and weight data elements WtDE includes 1 sign bit, 8 exponent bits, and 7 mantissa bits.
Referring still to
In response to identifying that the input data elements InDE and weight data elements WtDE are provided as an integer data type (e.g., the INT8 data type), each of the local computing cells 106 can provide one multiply-accumulate (MAC) result of the corresponding pair of the input data elements InDE and weight data elements WtDE. Such a MAC result is a sum of (i) a product of a first one of the input data elements InDE (e.g., IN0) and a first one of the weight data elements WtDE (e.g., W0); and (ii) a product of a second one of the input data elements InDE (e.g., IN1) and a second one of the weight data elements WtDE (e.g., W1).
The MAC result may be an accumulated sum of multiple partial MAC results, each of which represents a sum of (i) a product of a corresponding bit of the first input data element InDE and the first weight data element WtDE (e.g., IN0×W0); and (ii) a product of a corresponding bit of the second input data element InDE and the second weight data element WtDE (e.g., IN1×W1).
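A minimal behavioral sketch (illustrative only, using unsigned values for simplicity; the function name is hypothetical) of this bit-serial accumulation of partial MAC results:

```python
def bit_serial_mac(in0, in1, w0, w1, bits=8):
    """Accumulate partial MAC results, one per input bit position k:
    each partial result is W0*IN0[k] + W1*IN1[k], weighted by 2**k.
    The final accumulated value equals IN0*W0 + IN1*W1."""
    acc = 0
    for k in range(bits):
        partial = w0 * ((in0 >> k) & 1) + w1 * ((in1 >> k) & 1)
        acc += partial << k  # shift-and-accumulate by bit weight
    return acc
```

The shift-and-accumulate loop corresponds to processing the input data elements one bit position at a time, while the weight data elements are applied in full width at every step.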
Further, each of the local computing cells 106 can include a first part and a second part, generally referred to as an adder part and a multiplexer part, respectively. In response to identifying the data type of the input/weight data elements, the first part can sum the pair of weight data elements WtDE and provide the sum to the second part, causing the second part to calculate the MAC result based on a logic combination of the pair of input data elements InDE, which will be discussed in further detail with respect to
In response to identifying that the input data elements InDE and weight data elements WtDE are provided as another integer data type (e.g., the INT4 data type), each of the local computing cells 106 can provide four MAC results of the corresponding quadruple of the input data elements InDE and weight data elements WtDE. A first one of the MAC results is a sum of (i) a product of a first one of the input data elements InDE (e.g., IN0) and a first one of the weight data elements WtDE (e.g., W0); and (ii) a product of a second one of the input data elements InDE (e.g., IN2) and a second one of the weight data elements WtDE (e.g., W2). A second one of the MAC results is a sum of (i) a product of the first input data element InDE and a third one of the weight data elements WtDE (e.g., W1); and (ii) a product of the second input data element InDE and a fourth one of the weight data elements WtDE (e.g., W3). A third one of the MAC results is a sum of (i) a product of a third one of the input data elements InDE (e.g., IN1) and the first weight data element WtDE; and (ii) a product of a fourth one of the input data elements InDE (e.g., IN3) and the second weight data element WtDE. A fourth one of the MAC results is a sum of (i) a product of the third input data element InDE and the third weight data element WtDE; and (ii) a product of the fourth input data element InDE and the fourth weight data element WtDE.
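The four MAC results enumerated above can be summarized, for illustration only, by the following sketch (the function name is hypothetical):

```python
def int4_quad_mac(in0, in1, in2, in3, w0, w1, w2, w3):
    """The four MAC results for a quadruple of INT4 input data elements
    (IN0..IN3) and a quadruple of INT4 weight data elements (W0..W3)."""
    return (
        in0 * w0 + in2 * w2,  # first MAC result
        in0 * w1 + in2 * w3,  # second MAC result
        in1 * w0 + in3 * w2,  # third MAC result
        in1 * w1 + in3 * w3,  # fourth MAC result
    )
```

Note the pattern: IN0 and IN2 pair with the weight quadruple in two ways, and IN1 and IN3 pair with it in the same two ways, yielding the four results.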
The MAC result may be an accumulated sum of multiple partial MAC results, each of which represents a sum of (i) a product of a corresponding bit of the first/third input data element InDE and the first/third weight data element WtDE (e.g., IN0×W0, IN0×W1, IN1×W0, IN1×W1); and (ii) a product of a corresponding bit of the second/fourth input data element InDE and the second/fourth weight data element (e.g., IN2×W2, IN2×W3, IN3×W2, IN3×W3).
Further, each of the local computing cells 106 can include a first part and a second part, generally referred to as an adder part and a multiplexer part, respectively. In response to identifying the data type of the input/weight data elements, the first part can sum the second and fourth weight data elements WtDE, sum the first and third weight data elements WtDE, and provide the sums to the second part, causing the second part to calculate the MAC results based on a logic combination of the first and second input data elements InDE and a logic combination of the third and fourth input data elements InDE, which will be discussed in further detail with respect to
In response to identifying that the input data elements InDE and weight data elements WtDE are provided as a floating point data type (e.g., the BF16 data type), each of the local computing cells 106 can provide a pair of MAC elements of the corresponding input data element InDE and the corresponding weight data element WtDE. Such MAC elements include: (i) a sum of an exponent portion of the input data element InDE (e.g., INE0) and an exponent portion of the weight data element (e.g., WE0); and (ii) a product of a mantissa portion of the input data element InDE (e.g., INM0) and a mantissa portion of the weight data element WtDE (e.g., WM0).
The MAC element (e.g., the mantissa product) may be an accumulated sum of multiple partial mantissa products, each of which represents a product of a corresponding bit of the mantissa portion of the input data element InDE and the mantissa portion of the weight data element WtDE (e.g., INM0×WM0).
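For illustration only, the exponent sum and bit-serial mantissa product described above can be sketched as follows (the function name and argument layout are hypothetical):

```python
def bf16_mac_elements(in_exp, in_man, wt_exp, wt_man, man_bits=7):
    """One BF16 MAC element pair: (i) the exponent sum INE0 + WE0, and
    (ii) the mantissa product INM0 * WM0, accumulated from per-bit
    partial mantissa products."""
    exp_sum = in_exp + wt_exp
    man_prod = 0
    for k in range(man_bits):  # BF16 has 7 mantissa bits
        man_prod += (wt_man * ((in_man >> k) & 1)) << k  # WM0 * INM0[k]
    return exp_sum, man_prod
```

As in the integer modes, the weight mantissa is applied in full width while the input mantissa is consumed one bit position at a time.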
Further, each of the local computing cells 106 can include a first part and a second part, generally referred to as an adder part and a multiplexer part, respectively. In response to identifying the data type of the input/weight data elements, the first part can provide the sum of the exponent portions, and the second part can provide the product of the mantissa portions, which will be discussed in further detail below with respect to
The adder tree 108 can receive the MAC results/elements from all of the local computing cells 106, and sum them up to generate a final MAC result (PS) of the N input data elements InDE and the N weight data elements WtDE. For example, in response to identifying that a data type of the input/weight data elements is the INT8, the adder tree 108 can sum the N/2 MAC results provided by the local computing cells 106, respectively, and provide the PS result through one output channel. In another example, in response to identifying that a data type of the input/weight data elements is the INT4, the adder tree 108 can sum the N/4 MAC results provided by the local computing cells 106, respectively, and provide the PS result through four output channels. In yet another example, in response to identifying that a data type of the input/weight data elements is the BF16, the adder tree 108 can sum the N MAC elements (mantissa products) provided by the local computing cells 106, respectively, and provide the PS result through one output channel.
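The level-by-level, pairwise reduction performed by an adder tree can be sketched, for illustration only, as follows (the function name is hypothetical; a hardware adder tree would perform each level's additions in parallel):

```python
def adder_tree_sum(values):
    """Reduce a list of MAC results/elements with pairwise additions,
    one tree level at a time."""
    level = list(values)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:        # an odd element passes through to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

The final value equals the plain sum of all inputs; the tree structure matters for latency (log-depth) rather than for the result.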
As shown, the local computing cell 200 includes a multi-mode data selector 210, a configurable adder 220, and a number of multiplexers (MUXs) 230. In various embodiments of the present disclosure, regardless of the data type of the input data elements InDE and weight data elements WtDE being received, the local computing cell 200 can use the same hardware components, e.g., 210-230, to process the corresponding input data element(s) InDE and weight data element(s) WtDE and provide the MAC result(s)/element(s). For example, based on the identified data type, the components 210 to 230 can respond differently (or operate in different modes) to provide respective outputs. Accordingly, each of the hardware components of the local computing cell 200 will be introduced as follows, and will be further described when operating under different modes in
In some embodiments, the local computing cell 200 can process MAC operations on data elements of 16 bits each time (e.g., each clock cycle or each time duration). For example, the local computing cell 200 can perform MAC operations on 2 input data elements InDE and 2 weight data elements WtDE, each of which has 8 bits. In another example, the local computing cell 200 can perform MAC operations on 4 input data elements InDE and 4 weight data elements WtDE, each of which has 4 bits. In yet another example, the local computing cell 200 can perform MAC operations on 1 input data element InDE and 1 weight data element WtDE, each of which has 16 bits. However, the local computing cell 200 can process other numbers of bits while remaining within the scope of the present disclosure. Further, the number of the multiplexers 230 of each local computing cell 200 may correspond to the number of processed bits. For example, the number of multiplexers 230 may be equal to one half of the number of processed bits.
Upon receiving the input data elements InDE and weight data elements WtDE, the local computing cell 200 can separate the weight data elements WtDE into a signal A and a signal B. When the data elements are in the INT8 data type, the signal B and the signal A may represent a first weight data element WtDE (e.g., W0) and a second weight data element WtDE (e.g., W1), respectively. Further, in the example where the data elements each have 16 bits, the signal B may have 8 bits, which may be expressed as W0[7:0], and the signal A may also have 8 bits, which may be expressed as W1[7:0]. When the data elements are in the INT4 data type, the signal B may represent first and second weight data elements WtDE (e.g., W0 and W1), and the signal A may represent third and fourth weight data elements WtDE (e.g., W2 and W3). In the same example where the data elements each have 16 bits, the signal B may have a total of 8 bits, which may be expressed as W0[3:0] and W1[3:0], and the signal A may also have 8 bits, which may be expressed as W2[3:0] and W3[3:0]. When the data elements are in the BF16 data type, the signal B and the signal A may represent the mantissa portion of a weight data element WtDE (e.g., WM0) and the exponent portion of the weight data element WtDE (e.g., WE0), respectively. Still with the example where the data elements each have 16 bits, the signal B may have 8 bits, which may be expressed as WM0[7:0], and the signal A may also have 8 bits, which may be expressed as WE0[7:0].
The multi-mode data selector 210 can receive the signal B and a signal C (which represents the exponent portion of an input data element InDE, e.g., INE0), and select one of them as its output based on a control signal 211. In the 16-bit example, the signal C (e.g., INE0) may also have 8 bits, when the data elements are provided in the BF16 data type, which may be expressed as INE0[7:0]. The control signal 211 may be generated based on identifying the data type of the input data elements InDE and weight data elements WtDE. For example, when the data type is an integer type (e.g., the INT8 data type, the INT4 data type), the control signal 211 may be indicated as “INT,” which causes the multi-mode data selector 210 to select the signal B; and when the data type is a floating point type (e.g., the FP16 data type, the BF16 data type), the control signal 211 may be indicated as “FP,” which causes the multi-mode data selector 210 to select the signal C. The multi-mode data selector 210 can provide the selected signal as a D_SEL signal (e.g., either the signal B or C) to the configurable adder 220. Continuing with the 16-bit example, the multi-mode data selector 210 may include an 8-bit 2-to-1 multiplexer.
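The behavior of the multi-mode data selector 210 can be sketched, for illustration only, as a 2-to-1 selection (the function name and string-valued control encoding are hypothetical):

```python
def multi_mode_data_selector(signal_b, signal_c, control):
    """Pass signal B for an integer data type ('INT'), or signal C
    (the input exponent portion, e.g., INE0) for a floating point
    data type ('FP'). The selected value becomes the D_SEL signal."""
    return signal_b if control == "INT" else signal_c
```

In hardware this corresponds to an 8-bit 2-to-1 multiplexer steered by the control signal 211.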
The configurable adder 220 can sum the signal A and the D_SEL signal, and output the result as a signal SUM. Continuing with the 16-bit example, the signal SUM may have 10 bits. In some embodiments, the configurable adder 220 may have a number (e.g., 8) of full adders that can be configured differently based on a control signal 221. The control signal 221 may be generated based on identifying the data type of the input data elements InDE and weight data elements WtDE. For example, when the data elements are identified as the INT8 data type, the control signal 221 may be indicated as “8b,” which causes all 8 full adders to sum the 8-bit signal A (e.g., W1[7:0]) and the 8-bit D_SEL signal (e.g., W0[7:0]). As such, the signal SUM can represent W0[7:0]+W1[7:0]. In another example, when the data elements are identified as the INT4 data type, the control signal 221 may be indicated as “4b,” which causes a first 4 of the 8 full adders to sum a first half of the 8-bit signal A (e.g., W2[3:0]) and a first half of the 8-bit D_SEL signal (e.g., W0[3:0]), and a second 4 of the 8 full adders to sum a second half of the 8-bit signal A (e.g., W3[3:0]) and a second half of the 8-bit D_SEL signal (e.g., W1[3:0]). As such, the signal SUM can represent W0[3:0]+W2[3:0] and W1[3:0]+W3[3:0]. In yet another example, when the data elements are identified as the BF16 data type, the control signal 221 may be indicated as “X,” which causes all 8 full adders to sum the 8-bit signal A (e.g., WE0[7:0]) and the 8-bit D_SEL signal (e.g., INE0[7:0]). As such, the signal SUM can represent WE0[7:0]+INE0[7:0], which is sometimes referred to as an exponent sum.
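A behavioral sketch of the configurable adder's three modes (illustrative only; the function name and string-valued control encoding are hypothetical, and the model ignores the physical carry chain):

```python
def configurable_adder(a, d_sel, control):
    """Model of the 8-full-adder bank: '8b' (and 'X') sums the full
    8-bit operands; '4b' breaks the carry chain between the halves and
    sums the two 4-bit halves independently."""
    if control == "4b":
        low = (a & 0xF) + (d_sel & 0xF)                  # e.g., W2[3:0] + W0[3:0]
        high = ((a >> 4) & 0xF) + ((d_sel >> 4) & 0xF)   # e.g., W3[3:0] + W1[3:0]
        return low, high
    return a + d_sel  # '8b': W1 + W0; 'X': WE0 + INE0 (exponent sum)
```

The key design point is that the same adder bank serves all three data types; only the carry propagation between the two halves is enabled or disabled.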
Each of the multiplexers 230 can select one of the signal A, the signal B, the signal SUM, or a fixed voltage (e.g., VSS/ground) based on a number of corresponding bits of the input data elements InDE. Such bits to control the multiplexers 230 may sometimes be referred to as MUX control bits. In some embodiments, each of the multiplexers 230 is configured to receive 2 MUX control bits, at least one of which corresponds to the corresponding input data element InDE or to a mantissa portion of the corresponding input data element InDE. Based on the MUX control bits, the multiplexers 230 can each provide an output signal. For example, based on different logic combinations of these 2 MUX control bits, each of the multiplexers 230 can provide a respective output signal that is a logically processed version of the signal A, the signal B, the signal SUM, or VSS. The term “logically processed version” may refer to a signal having each of its terms/components multiplied by a corresponding logical value (e.g., either 0 or 1).
In the example where the data elements InDE/WtDE are provided in the INT8 data type, the input data elements InDE, received by the local computing cell 200, may consist of a first input data element (e.g., IN0) and a second input data element (e.g., IN1). The first and second input data elements (e.g., IN0 and IN1) each have 8 bits, and may be respectively expressed as IN0[7:0] and IN1[7:0].
In some embodiments, the 2 MUX control bits, received by each of the multiplexers 230, may consist of a corresponding one of the 8 bits of the IN0 (e.g., IN0[7]) and a corresponding one of the 8 bits of the IN1 (e.g., IN1[7]), respectively. For example, a first one of the multiplexers 230 can receive IN0[7] and IN1[7] as its 2 MUX control bits, respectively; a second one of the multiplexers 230 can receive IN0[6] and IN1[6] as its 2 MUX control bits, respectively; a third one of the multiplexers 230 can receive IN0[5] and IN1[5] as its 2 MUX control bits, respectively; a fourth one of the multiplexers 230 can receive IN0[4] and IN1[4] as its 2 MUX control bits, respectively; a fifth one of the multiplexers 230 can receive IN0[3] and IN1[3] as its 2 MUX control bits, respectively; a sixth one of the multiplexers 230 can receive IN0[2] and IN1[2] as its 2 MUX control bits, respectively; a seventh one of the multiplexers 230 can receive IN0[1] and IN1[1] as its 2 MUX control bits, respectively; and an eighth one of the multiplexers 230 can receive IN0[0] and IN1[0] as its 2 MUX control bits, respectively.
Upon receiving the signal A (e.g., W1), signal B (e.g., W0), signal SUM (e.g., W0+W1), and VSS, each of the multiplexers 230 is configured to select one of these signals and output a signal OUT through multiplying the selected signal by the MUX control bits (e.g., a partial MAC result). The signal OUT may have 10 bits. For example, each of the multiplexers 230 is configured to derive a first product through multiplying the signal B by the corresponding first MUX control bit (e.g., W0[7:0]×IN0[7]) and a second product through multiplying the signal A by the corresponding second MUX control bit (e.g., W1[7:0]×IN1[7]), and then sum up the first product and the second product as the signal OUT.
Stated another way, each of the multiplexers 230 can provide a partial MAC result derived based on the corresponding MUX control bits of the input data elements InDE received by the local computing cell 200, and either 0 (VSS), the signal A, the signal B, or the signal SUM. Based on this principle, the multiplexers 230 of the local computing cell 200 receiving the input data elements InDE, IN0[7:0] and IN1[7:0], can provide the partial MAC results, W0[7:0]×IN0[7]+W1[7:0]×IN1[7], W0[7:0]×IN0[6]+W1[7:0]×IN1[6], W0[7:0]×IN0[5]+W1[7:0]×IN1[5], W0[7:0]×IN0[4]+W1[7:0]×IN1[4], W0[7:0]×IN0[3]+W1[7:0]×IN1[3], W0[7:0]×IN0[2]+W1[7:0]×IN1[2], W0[7:0]×IN0[1]+W1[7:0]×IN1[1], and W0[7:0]×IN0[0]+W1[7:0]×IN1[0], respectively.
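For illustration only, the selection logic of one multiplexer 230 in the INT8 mode can be sketched as a lookup on its 2 MUX control bits (the function name is hypothetical):

```python
def mux_output(a, b, sum_ab, in0_bit, in1_bit):
    """Output of one multiplexer 230 in INT8 mode. The output equals
    B*IN0[k] + A*IN1[k], realized by selecting among VSS, B, A, and SUM
    based on the two MUX control bits."""
    selection = {
        (0, 0): 0,       # both bits 0 -> VSS
        (1, 0): b,       # only IN0[k] set -> signal B (e.g., W0)
        (0, 1): a,       # only IN1[k] set -> signal A (e.g., W1)
        (1, 1): sum_ab,  # both bits set  -> signal SUM (e.g., W0 + W1)
    }
    return selection[(in0_bit, in1_bit)]
```

This is why precomputing the signal SUM in the adder part lets the multiplexer part produce a two-term partial MAC result with a single selection rather than two multiplications.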
In the example where the data elements InDE/WtDE are provided in the INT4 data type, the input data elements InDE, received by the local computing cell 200, may consist of a first input data element (e.g., IN0), a second input data element (e.g., IN1), a third input data element (e.g., IN2), and a fourth input data element (e.g., IN3). The first to fourth input data elements (e.g., IN0, IN1, IN2, and IN3) each have 4 bits, and may be respectively expressed as IN0[3:0], IN1[3:0], IN2[3:0], and IN3[3:0].
In some embodiments, the multiplexers 230 may be grouped into a plurality of pairs, each of which corresponds to a corresponding bit of the first to fourth input data elements, IN0, IN1, IN2, and IN3. Accordingly, the 2 MUX control bits, received by a first one of a first multiplexer pair 230, may consist of one of the 4 bits of the IN0 (e.g., IN0[3]) and one of the 4 bits of the IN2 (e.g., IN2[3]), respectively; and the 2 MUX control bits, received by a second one of the first multiplexer pair 230, may consist of one of the 4 bits of the IN1 (e.g., IN1[3]) and one of the 4 bits of the IN3 (e.g., IN3[3]), respectively. Similarly, the 2 MUX control bits, received by a first one of a second multiplexer pair 230, may consist of one of the 4 bits of the IN0 (e.g., IN0[2]) and one of the 4 bits of the IN2 (e.g., IN2[2]), respectively; and the 2 MUX control bits, received by a second one of the second multiplexer pair 230, may consist of one of the 4 bits of the IN1 (e.g., IN1[2]) and one of the 4 bits of the IN3 (e.g., IN3[2]), respectively; and so on. Upon receiving the signal A (e.g., W2 and W3), signal B (e.g., W0 and W1), signal SUM (e.g., W0+W2 and W1+W3), and VSS, each of the multiplexers 230 is configured to select one of these signals and output a signal OUT through multiplying the selected signal by the MUX control bits (e.g., two partial MAC results). The signal OUT may have 10 bits.
For example, the first one of the multiplexer pair 230 is configured to derive a first product through multiplying the signal B by the corresponding first MUX control bit (e.g., W0[3:0]×IN0[3]) and a second product through multiplying the signal A by the corresponding second MUX control bit (e.g., W2[3:0]×IN2[3]), and then sum up the first product and the second product as a first sum (e.g., a first partial MAC result). Further, the first one of the multiplexer pair 230 is configured to derive a third product through multiplying the signal A by the corresponding first MUX control bit (e.g., W1[3:0]×IN0[3]) and a fourth product through multiplying the signal B by the corresponding second MUX control bit (e.g., W3[3:0]×IN2[3]), and then sum up the third product and the fourth product as a second sum (e.g., a second partial MAC result). The first one of the multiplexer pair 230 can then provide the first sum and the second sum as its signal OUT. Similarly, the second one of the multiplexer pair 230 can provide a corresponding signal OUT through multiplying the signal A, B, SUM, or VSS by the MUX control bits. Continuing with the same example, the second one of the multiplexer pair 230 can provide a first sum (e.g., W0[3:0]×IN1[3]+W2[3:0]×IN3[3]) and a second sum (e.g., W1[3:0]×IN1[3]+W3[3:0]×IN3[3]) as its signal OUT.
Stated another way, each of the multiplexers 230 can provide a first partial MAC result and a second partial MAC result derived based on the corresponding MUX control bits of two of the input data elements InDE (e.g., IN0, IN1, IN2, and IN3) received by the local computing cell 200, and either 0 (VSS), the signal A, the signal B, or the signal SUM. The multiplexers 230 of each local computing cell 200 can be grouped to a number of multiplexer pairs 230. Specifically, one multiplexer of each multiplexer pair 230 can provide a first pair of partial MAC results based on the input data elements, IN0 and IN2, and the other multiplexer of each multiplexer pair 230 can provide a second pair of partial MAC results based on the input data elements, IN1 and IN3. As a non-limiting example, one of the multiplexer pair 230 receiving the input data elements, IN0[3] and IN2[3], as its MUX control bits can provide the partial MAC results as W0[3:0]×IN0[3]+W2[3:0]×IN2[3] and W1[3:0]×IN0[3]+W3[3:0]×IN2[3]; and the other of the multiplexer pair 230 receiving the input data elements, IN1[3] and IN3[3], as its MUX control bits can provide the partial MAC results as W0[3:0]×IN1[3]+W2[3:0]×IN3[3] and W1[3:0]×IN1[3]+W3[3:0]×IN3[3].
Based on this principle, the multiplexers 230 of the local computing cell 200 receiving the input data elements, IN0[2] and IN2[2], can provide the partial MAC results as W0[3:0]×IN0[2]+W2[3:0]×IN2[2] and W1[3:0]×IN0[2]+W3[3:0]×IN2[2]; and the multiplexers 230 of the local computing cell 200 receiving the input data elements, IN1[2] and IN3[2], can provide the partial MAC results as W0[3:0]×IN1[2]+W2[3:0]×IN3[2] and W1[3:0]×IN1[2]+W3[3:0]×IN3[2]. The multiplexers 230 of the local computing cell 200 receiving the input data elements, IN0[1] and IN2[1], can provide the partial MAC results as W0[3:0]×IN0[1]+W2[3:0]×IN2[1] and W1[3:0]×IN0[1]+W3[3:0]×IN2[1]; and the multiplexers 230 of the local computing cell 200 receiving the input data elements, IN1[1] and IN3[1], can provide the partial MAC results as W0[3:0]×IN1[1]+W2[3:0]×IN3[1] and W1[3:0]×IN1[1]+W3[3:0]×IN3[1]. The multiplexers 230 of the local computing cell 200 receiving the input data elements, IN0[0] and IN2[0], can provide the partial MAC results as W0[3:0]×IN0[0]+W2[3:0]×IN2[0] and W1[3:0]×IN0[0]+W3[3:0]×IN2[0]; and the multiplexers 230 of the local computing cell 200 receiving the input data elements, IN1[0] and IN3[0], can provide the partial MAC results as W0[3:0]×IN1[0]+W2[3:0]×IN3[0] and W1[3:0]×IN1[0]+W3[3:0]×IN3[0].
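For illustration purposes only, the INT4 mode may be sketched in software the same way. In this non-limiting model (function names and operand values are illustrative; unsigned 4-bit operands are assumed), each multiplexer pair yields two partial MAC results per bit position, and the weighted accumulation yields the four MAC results W0×IN0+W2×IN2, W1×IN0+W3×IN2, W0×IN1+W2×IN3, and W1×IN1+W3×IN3.

```python
def int4_mux(w_b: int, w_a: int, c0: int, c1: int) -> int:
    """One multiplexer output: 0 (VSS), w_a (A), w_b (B), or w_b + w_a (SUM)."""
    return w_b * c0 + w_a * c1

def int4_cell(w0, w1, w2, w3, in0, in1, in2, in3):
    """Accumulate the partial MAC results of one local computing cell."""
    r1 = r2 = r3 = r4 = 0
    for n in range(4):
        b0, b1, b2, b3 = (((v >> n) & 1) for v in (in0, in1, in2, in3))
        # first multiplexer of each pair: controlled by IN0[n] and IN2[n]
        r1 += int4_mux(w0, w2, b0, b2) << n  # -> W0*IN0 + W2*IN2
        r2 += int4_mux(w1, w3, b0, b2) << n  # -> W1*IN0 + W3*IN2
        # second multiplexer of each pair: controlled by IN1[n] and IN3[n]
        r3 += int4_mux(w0, w2, b1, b3) << n  # -> W0*IN1 + W2*IN3
        r4 += int4_mux(w1, w3, b1, b3) << n  # -> W1*IN1 + W3*IN3
    return r1, r2, r3, r4

assert int4_cell(5, 3, 7, 2, 9, 11, 4, 6) == (
    5 * 9 + 7 * 4, 3 * 9 + 2 * 4, 5 * 11 + 7 * 6, 3 * 11 + 2 * 6)
```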
In the example where the data elements InDE/WtDE are provided in the BF16 data type, the input data elements InDE, received by the local computing cell 200, may consist of one input data element (e.g., IN0). The input data element (e.g., IN0) may have 16 bits, 8 of which represent a mantissa portion of the input data element (e.g., INM0[7:0]). Further, the multiplexers 230 of the local computing cell 200 may each receive a corresponding bit of the mantissa portion of the input data element (e.g., one of the INM0[7:0]) as a first one of its 2 MUX control bits. Each of the multiplexers 230 can receive VSS as a second one of the 2 MUX control bits. For example, a first one of the multiplexers 230 can receive INM0[7] as one of its 2 MUX control bits; a second one of the multiplexers 230 can receive INM0[6] as one of its 2 MUX control bits; a third one of the multiplexers 230 can receive INM0[5] as one of its 2 MUX control bits; a fourth one of the multiplexers 230 can receive INM0[4] as one of its 2 MUX control bits; a fifth one of the multiplexers 230 can receive INM0[3] as one of its 2 MUX control bits; a sixth one of the multiplexers 230 can receive INM0[2] as one of its 2 MUX control bits; a seventh one of the multiplexers 230 can receive INM0[1] as one of its 2 MUX control bits; and an eighth one of the multiplexers 230 can receive INM0[0] as one of its 2 MUX control bits.
Upon receiving the signal B (e.g., WM0) and VSS, each of the multiplexers 230 is configured to provide a signal OUT (as, e.g., an MAC element) through multiplying the signal B by the MUX control bits. The signal OUT may have 10 bits. For example, each of the multiplexers 230 is configured to derive a product through multiplying the signal B by the corresponding first MUX control bit (e.g., WM0[7:0]×INM0[7]), and then output the product as the signal OUT. Stated another way, each of the multiplexers 230 can provide an MAC element derived based on the corresponding first MUX control bit of the input data elements InDE received by the local computing cell 200 and VSS. Based on this principle, the multiplexers 230 of the local computing cell 200 receiving the mantissa portions of the input data elements, INM0[7:0], can provide the MAC elements, WM0[7:0]×INM0[7], WM0[7:0]×INM0[6], WM0[7:0]×INM0[5], WM0[7:0]×INM0[4], WM0[7:0]×INM0[3], WM0[7:0]×INM0[2], WM0[7:0]×INM0[1], and WM0[7:0]×INM0[0], respectively. These MAC elements (sometimes referred to as mantissa products) can be shifted by aligning their respective exponent sums (WE0[7:0]+INE0[7:0]) with a maximum one of the exponent sums. Next, such shifted MAC elements can be summed, with the maximum exponent sum serving as the exponent of the result, to provide a MAC result.
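For illustration purposes only, the floating-point flow described above may be modeled with plain integers. The non-limiting sketch below treats each weight/input pair as (mantissa, exponent) integer tuples and ignores sign bits, implicit leading bits, exponent bias, and rounding; the function name is illustrative and not part of the disclosed circuit.

```python
def bf16_accumulate(pairs):
    """pairs: list of ((wm, we), (inm, ine)) mantissa/exponent tuples.

    Each pair yields a mantissa product wm*inm (the MAC elements above)
    and an exponent sum we+ine; products are right-shifted to align with
    the maximum exponent sum, then accumulated under that exponent.
    """
    terms = [(wm * inm, we + ine) for (wm, we), (inm, ine) in pairs]
    e_max = max(e for _, e in terms)
    acc = sum(p >> (e_max - e) for p, e in terms)  # align, then sum
    return acc, e_max
```

When all exponent sums are equal, no shifting occurs and the accumulation is exact; otherwise the smaller products lose their shifted-out low bits, as in any floating-point alignment step.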
Referring first to
As shown, the local computing cell 106 can receive the W0[7:0] and W1[7:0] as the signal B and the signal A, respectively. The multi-mode data selector 210 can select the signal B and output it as the D_SEL signal (e.g., 8 bits) to the configurable adder 220. The configurable adder 220 can also receive the signal A, and sum the signal A and the D_SEL signal as the signal SUM which may have 10 bits. Each of the multiplexers 230 can receive the signal A, the signal B, the signal SUM, and VSS. Based on the respective MUX control bits, which are provided by a corresponding bit of the first input data element InDE (e.g., IN0[7]) and a corresponding bit of the second input data element InDE (e.g., IN1[7]), the multiplexers 230 can each provide the corresponding signal OUT by performing MAC operations on the signal A, the signal B, the signal SUM, or VSS. In the example where the multiplexer 230 receives the IN0[7] and IN1[7] as its MUX control bits, the multiplexer 230 can provide the signal OUT as W0[7:0]×IN0[7]+W1[7:0]×IN1[7], which may have 10 bits. In some embodiments, the multi-mode data selector 210 and the configurable adder 220 may operatively form the first (adder) part of the local computing cell 106, and the multiplexers 230 may operatively form the second (multiplexer) part of the local computing cell 106.
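Because the control bits IN0[n] and IN1[n] are each a single bit, the expression W0[7:0]×IN0[n]+W1[7:0]×IN1[n] can only take one of four values, which is why a 4-to-1 selection among VSS, A, B, and SUM suffices in place of a multiplier. For illustration purposes only, this selection may be modeled as follows (non-limiting; the function name and operand values are illustrative):

```python
def mux_out(sig_a: int, sig_b: int, sig_sum: int, c0: int, c1: int) -> int:
    """Model of one multiplexer 230: select among VSS, A, B, and SUM."""
    table = {
        (0, 0): 0,        # VSS: neither weight contributes
        (0, 1): sig_a,    # only the A-side weight (e.g., W1) contributes
        (1, 0): sig_b,    # only the B-side weight (e.g., W0) contributes
        (1, 1): sig_sum,  # both contribute: the precomputed sum W0+W1
    }
    return table[(c0, c1)]

w0, w1 = 0x5A, 0x13          # signal B and signal A (illustrative values)
s = w0 + w1                  # signal SUM from the configurable adder
for c0 in (0, 1):
    for c1 in (0, 1):
        assert mux_out(w1, w0, s, c0, c1) == w0 * c0 + w1 * c1
```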
A schematic diagram of the configurable adder 220 is also shown in
Referring next to
As shown, the local computing cell 106 can receive the W0[3:0] and W2[3:0] as the signal B, and the W1[3:0] and W3[3:0] as the signal A. The multi-mode data selector 210 can select the signal B and output it as the D_SEL signal (e.g., 8 bits) to the configurable adder 220. The configurable adder 220 can also receive the signal A, and sum the signal A and the D_SEL signal as the signal SUM which may have 10 bits. In the INT4 example, the multiplexer 220M can be configured to provide VSS to the next full adder 220E. That is, the first half of the full adders 220A-D may sum W0[3:0] and W2[3:0], and the second half of the full adders 220E-H may sum W1[3:0] and W3[3:0]. Each of the multiplexers 230 can receive the signal A, the signal B, the signal SUM, and VSS. Based on the respective MUX control bits, which are provided by a corresponding bit of one of the four input data elements InDE (e.g., IN0[3]) and a corresponding bit of another of the four input data elements InDE (e.g., IN2[3]), the multiplexers 230 can each provide the corresponding signal OUT by performing MAC operations on the signal A, the signal B, the signal SUM, or VSS. In the example where the multiplexer 230 receives the IN0[3] and IN2[3] as its MUX control bits, the multiplexer 230 can provide the signal OUT as W0[3:0]×IN0[3]+W2[3:0]×IN2[3], which may have 5 bits, and W1[3:0]×IN0[3]+W3[3:0]×IN2[3], which may also have 5 bits. In some embodiments, the multi-mode data selector 210 and the configurable adder 220 may operatively form the first (adder) part of the local computing cell 106, and the multiplexers 230 may operatively form the second (multiplexer) part of the local computing cell 106.
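For illustration purposes only, the configurable adder behavior described above may be modeled as a ripple-carry chain of eight full adders whose mid-chain carry is replaced by VSS in the INT4 mode (corresponding to the multiplexer 220M feeding the full adder 220E). The sketch is non-limiting and the function name is illustrative.

```python
def configurable_add(a: int, b: int, split: bool):
    """One 8-bit add (split=False), or two independent 4-bit adds (split=True).

    In split mode the carry into the fifth full adder (220E) is forced to
    0 (VSS), and each 4-bit half yields a 5-bit sum, returned as (lo, hi).
    """
    carry, mid_carry, bits = 0, 0, []
    for i in range(8):
        if split and i == 4:
            mid_carry, carry = carry, 0  # capture low carry; inject VSS
        s = ((a >> i) & 1) + ((b >> i) & 1) + carry
        bits.append(s & 1)
        carry = s >> 1
    value = sum(bit << i for i, bit in enumerate(bits))
    if split:
        return (value & 0xF) | (mid_carry << 4), (value >> 4) | (carry << 4)
    return value | (carry << 8)

assert configurable_add(0xC7, 0x9A, split=False) == 0xC7 + 0x9A
assert configurable_add(0x9F, 0x8F, split=True) == (0xF + 0xF, 0x9 + 0x8)
```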
Referring then to
As shown, the local computing cell 106 can receive the exponent portion of the weight data element W0 as the signal A, which is expressed as WE0[7:0], and the mantissa portion of the weight data element W0 as the signal B, which is expressed as WM0[7:0]. The multi-mode data selector 210 can select the signal C representing the exponent portion of the input data element IN0 (e.g., INE0[7:0]) and output it as the D_SEL signal (e.g., 8 bits) to the configurable adder 220. The configurable adder 220 can also receive the signal A, and sum the signal A and the D_SEL signal as the signal SUM which may have 9 bits. The signal SUM may represent a sum of the exponent portions of the input data element IN0 and the weight data element W0, which may be expressed as INE0[7:0]+WE0[7:0]. Each of the multiplexers 230 can receive the signal B and VSS. Based on the respective MUX control bits, which are provided by a corresponding bit of the mantissa portion of the input data element IN0 (e.g., INM0[7]) and VSS, the multiplexers 230 can each provide the corresponding signal OUT by performing a multiplication on the signal B. In the example where the multiplexer 230 receives the INM0[7] and VSS as its MUX control bits, the multiplexer 230 can provide the signal OUT as WM0[7:0]×INM0[7], which may have 10 bits. In some embodiments, the multi-mode data selector 210 and the configurable adder 220 may operatively form the first (adder) part of the local computing cell 106, and the multiplexers 230 may operatively form the second (multiplexer) part of the local computing cell 106.
In each of
In
In
In
As shown, when operating with the INT8 data type, the signal B and the signal A received by the local computing cell may include a first 8-bit weight data element W0[7:0] and a second 8-bit weight data element W1[7:0], respectively. Based on the control signal 211 (which indicates the data type as INT), a multi-mode data selector of the local computing cell (e.g., 210) can output the signal B as the signal D_SEL. Based on the control signal 221 (which indicates the integer type as 8b), a configurable adder of the local computing cell (e.g., 220) can utilize all of its eight full adders to sum the signal A and the signal D_SEL so as to provide the signal SUM as W0[7:0]+W1[7:0]. Based on the MUX control bits (MUX0[n] and MUX1[n]), which correspond to one of the 8 bits of a first input data element IN0[7:0] and one of the 8 bits of a second input data element IN1[7:0], respectively, each multiplexer of the local computing cell (e.g., 230) can provide the corresponding signal OUT as 0, the signal A, the signal B, or the signal SUM.
When operating with the INT4 data type, the signal B received by the local computing cell may include a first 4-bit weight data element W0[3:0] and a second 4-bit weight data element W1[3:0], and the signal A received by the local computing cell may include a third 4-bit weight data element W2[3:0] and a fourth 4-bit weight data element W3[3:0]. Based on the control signal 211 (which indicates the data type as INT), a multi-mode data selector of the local computing cell (e.g., 210) can output the signal B as the signal D_SEL. Based on the control signal 221 (which indicates the integer type as 4b), a configurable adder of the local computing cell (e.g., 220) can utilize one half of its eight full adders to sum W3[3:0] and W1[3:0], and the other half of the full adders to sum W2[3:0] and W0[3:0], so as to provide the signal SUM as W2[3:0]+W0[3:0] and W3[3:0]+W1[3:0]. Based on the MUX control bits (MUX0[n] and MUX1[n]), which correspond to one of the 4 bits of a first or second input data element, IN0[3:0] or IN1[3:0], and one of the 4 bits of a third or fourth input data element, IN2[3:0] or IN3[3:0], respectively, each multiplexer of the local computing cell (e.g., 230) can provide the corresponding signal OUT as 0, the signal A, the signal B, or the signal SUM.
When operating with the BF16 data type, the signal B and the signal A received by the local computing cell may include one 8-bit mantissa portion of a weight data element WM0[7:0] and one 8-bit exponent portion of the weight data element WE0[7:0], respectively. Based on the control signal 211 (which indicates the data type as FP), a multi-mode data selector of the local computing cell (e.g., 210) can output the signal C, which includes one 8-bit exponent portion of an input data element INE0[7:0], as the signal D_SEL. Based on the control signal 221 (which indicates the integer type as “X”), a configurable adder of the local computing cell (e.g., 220) can utilize all of its eight full adders to sum the signal A and the signal D_SEL so as to provide the signal SUM as WE0[7:0]+INE0[7:0]. Based on the MUX control bits (MUX0[n] and MUX1[n]), which correspond to one of the 8 bits of a mantissa portion of the input data element INM0[7:0] and VSS, respectively, each multiplexer of the local computing cell (e.g., 230) can provide the corresponding signal OUT as 0 or the signal B.
In the example of
In the example of
In the example of
In some embodiments, the memory circuit 100 may configure three CIM banks, 1310, 1320, and 1330 to perform MAC operations on the data elements InDE/WtDE that are provided in the FP16 data type. In the example of
The CIM bank 1310 can have 64 local computing cells to provide, through an adder tree, a first MAC result 1312 based on the first group of the mantissa portions of the first weight data element WM0 (e.g., WM0[11:4]), exponent portions of the first weight data element WM0 (e.g., WE0[7:0]), and the input data elements InDE (e.g., IN0, IN1 . . . IN63). The CIM bank 1320 can have 64 local computing cells to provide, through an adder tree, a second MAC result 1322 based on the second group of the mantissa portions of first weight data element WM0 (e.g., WM0[3:0]) and the input data elements InDE (e.g., IN0, IN1 . . . IN63), and a third MAC result 1324 based on the second group of the mantissa portions of second weight data element WM1 (e.g., WM1[3:0]) and the input data elements InDE (e.g., IN0, IN1 . . . IN63). The CIM bank 1330 can have 64 local computing cells to provide, through an adder tree, a fourth MAC result 1332 based on the first group of the mantissa portions of the second weight data element WM1 (e.g., WM1[11:4]), exponent portions of the second weight data element WM1 (e.g., WE1[7:0]), and the input data elements InDE (e.g., IN0, IN1 . . . IN63). In some embodiments, the memory circuit 100 can include a global adder tree to sum the MAC results 1312 and 1322 as MAC result 1350 and provide it through a first output channel, and sum the MAC results 1324 and 1332 as MAC result 1360 and provide it through a second output channel. Specifically, the CIM banks 1310 and 1330 may be configured similarly to the operation mode when processing the BF16 data type (e.g.,
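For illustration purposes only, the split-mantissa recombination described above may be checked with plain integers. The non-limiting sketch below models only the mantissa arithmetic of one output channel (exponent alignment, as handled in the banks 1310 and 1330, is omitted); the function name and operand values are illustrative.

```python
def split_mac(wm0: int, inputs: list[int]) -> int:
    """MAC with a 12-bit mantissa cut into WM0[11:4] and WM0[3:0].

    The high and low groups are multiplied in separate CIM banks
    (e.g., 1310 and 1320) and recombined by the global adder tree with
    a 4-bit relative shift, reproducing the full-width MAC result.
    """
    hi, lo = wm0 >> 4, wm0 & 0xF
    mac_hi = sum(hi * x for x in inputs)  # first MAC result (e.g., 1312)
    mac_lo = sum(lo * x for x in inputs)  # second MAC result (e.g., 1322)
    return (mac_hi << 4) + mac_lo         # summed output (e.g., 1350)

ins = [3, 7, 12, 1]  # illustrative input data elements
assert split_mac(0xABC, ins) == sum(0xABC * x for x in ins)
```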
The method 1500 starts with operation 1510, in which a plurality of input data elements InDE and a plurality of weight data elements WtDE are received through a memory device. In some embodiments, the input data elements InDE may be received by a memory device (e.g., 102 of
In various embodiments of the present disclosure, the input data elements InDE and the weight data elements WtDE can be provided or otherwise identified in various data types such as, for example, the INT8 data type, the INT4 data type, the BF16 data type, the FP16 data type, etc. Upon identifying the data type of the input data elements InDE and the weight data elements WtDE, the memory circuit 100 can configure the same hardware components (e.g., the local computing cells 106) in a respective mode to process (e.g., MAC) the data elements. For example, in response to identifying that the data type is an integer type (e.g., the INT8 data type, the INT4 data type, etc.), the method 1500 may proceed to operation 1520; and in response to identifying that the data type is a floating point type (e.g., the BF16 data type, the FP16 data type, etc.), the method 1500 may proceed to operation 1530.
In operation 1520, each of the local computing cells 106 can provide at least a first sum of a first product and a second product. Further, the local computing cell 106 can provide the respective first sum when the integer type is the INT8 data type. In some embodiments (referring to
In operation 1530, each of the local computing cells 106 can provide at least a second sum and a third product. In some embodiments (referring to
In one aspect of the present disclosure, a compute-in-memory (CIM) circuit is disclosed. The CIM circuit includes an input circuit configured to receive a first number of input data elements and the first number of weight data elements. The CIM circuit includes a second number of local computing cells operatively coupled to the input circuit. Each of the local computing cells is configured to provide, in response to identifying that the input data elements and weight data elements are provided as a first data type, at least a first sum, the first sum including (i) a first product of a first one of the input data elements and a first one of the weight data elements; and (ii) a second product of a second one of the input data elements and a second one of the weight data elements. Each of the local computing cells is configured to provide, in response to identifying that the input data elements and weight data elements are provided as a second data type, (i) a second sum of a first portion of a third one of the input data elements and a first portion of a third one of the weight data elements; and (ii) a third product of a second portion of the third input data element and a second portion of the third weight data element.
In another aspect of the present disclosure, a compute-in-memory (CIM) circuit is disclosed. The CIM circuit includes a first local computing cell operatively coupled to a first number of input data elements and the first number of weight data elements, the first local computing cell comprising a first part and a second part operatively coupled to each other. The first part is configured to provide a first sum of the first number of weight data elements, in response to identifying that the input data elements and the weight data elements are provided as an integer data type, which causes the second part to (i) select one of the first number of weight data elements, the first sum, or a fixed logic state; and (ii) based on the first number of input data elements, provide a plurality of first multiply-accumulate (MAC) results of the first number of input data elements and the first number of weight data elements. The first part is configured to provide a second sum of respective exponent portions of the first number of input data elements and the first number of weight data elements, in response to identifying that the input data elements and the weight data elements are provided as a floating point data type, which causes the second part to (i) select a mantissa portion of the first number of weight data elements; and (ii) based on a mantissa portion of the first number of input data elements, provide a plurality of first products of the respective mantissa portions of the first number of input data elements and the first number of weight data elements.
In yet another aspect of the present disclosure, a method for operating a compute-in-memory (CIM) circuit is disclosed. The method includes receiving, through a memory device, a plurality of input data elements and a plurality of weight data elements. The method includes providing, in response to identifying that the input data elements and weight data elements are provided as an integer data type, at least a first sum, wherein the first sum includes (i) a first product of a first one of the input data elements and a first one of the weight data elements; and (ii) a second product of a second one of the input data elements and a second one of the weight data elements. The method includes providing, in response to identifying that the input data elements and weight data elements are provided as a floating point data type, (i) a second sum of an exponent portion of a third one of the input data elements and an exponent portion of a third one of the weight data elements; and (ii) a third product of a second portion of the third input data element and a second portion of the third weight data element.
As used herein, the terms “about” and “approximately” generally indicate the value of a given quantity that can vary based on a particular technology node associated with the subject semiconductor device. Based on the particular technology node, the term “about” can indicate a value of a given quantity that varies within, for example, 10-30% of the value (e.g., ±10%, ±20%, or ±30% of the value).
The foregoing outlines features of several embodiments so that those skilled in the art may better understand the aspects of the present disclosure. Those skilled in the art should appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions, and alterations herein without departing from the spirit and scope of the present disclosure.
This application claims priority to and the benefit of U.S. Provisional Application No. 63/582,921, filed Sep. 15, 2023, and also to U.S. Provisional Patent App. No. 63/611,413, filed Dec. 18, 2023, both of which are incorporated herein by reference in their entireties for all purposes.