The present application claims priority under 35 U.S.C. § 119(a) to Korean Application No. 10-2023-0071543, filed on Jun. 2, 2023, in the Korean Intellectual Property Office, which is incorporated herein by reference in its entirety.
Various embodiments of the present disclosure generally relate to processing-in-memory (hereinafter, referred to as “PIM”)-based accelerating devices, accelerating systems, and accelerating cards.
Recently, neural network algorithms have shown dramatic performance improvements in various fields such as image recognition, voice recognition, and natural language processing. In the future, neural network algorithms are expected to be actively used in various fields such as factory automation, medical services, and self-driving cars, and various hardware structures are being actively developed to process them efficiently. A neural network algorithm is a learning algorithm modeled after a biological neural network. Recently, among multi-layer perceptrons (hereinafter, referred to as “MLP”) composed of two or more layers, deep neural networks (hereinafter, referred to as “DNN”) composed of many layers, for example, eight or more, have been actively studied. Currently, most neural network operations are performed using a graphics processing unit (hereinafter, referred to as “GPU”). The GPU has a large number of cores, and thus is known to be efficient in performing simple repetitive operations and operations with high parallelism. However, a DNN may be composed of, for example, one million or more neurons, so the amount of operation is enormous. Accordingly, it is required to develop a hardware accelerator optimized for neural network operations involving such a huge amount of computation.
A processing-in-memory (PIM)-based accelerating device according to an embodiment of the present disclosure may include a plurality of PIM devices, a PIM network system configured to control traffic of signals and data for the plurality of PIM devices, and a first interface configured to perform interfacing with a host device. The PIM network system may control the traffic so that the plurality of PIM devices perform different operations, the plurality of PIM devices perform different operations for each group, or the plurality of PIM devices perform the same operation in parallel.
A processing-in-memory (PIM)-based accelerating device according to an embodiment of the present disclosure may include a plurality of PIM devices, a PIM network system configured to control traffic of signals and data for the plurality of PIM devices, and a first interface configured to perform interfacing with a host device. Each of the plurality of PIM devices may include a PIM device constituting a first channel and a PIM device constituting a second channel. The PIM network system may control the traffic such that the plurality of PIM devices perform different operations, the plurality of PIM devices perform different operations in groups, or the plurality of PIM devices perform the same operation in parallel.
A processing-in-memory (PIM)-based accelerating device according to an embodiment of the present disclosure may include a plurality of PIM devices of a first group, a plurality of PIM devices of a second group, a first PIM network system configured to control traffic of signals and data of the plurality of PIM devices of the first group, a second PIM network system configured to control traffic of signals and data of the plurality of PIM devices of the second group, and a first interface configured to perform interfacing with a host device. The first PIM network system may control the traffic such that the plurality of PIM devices of the first group perform different operations, the plurality of PIM devices of the first group perform different operations in groups, or the plurality of PIM devices of the first group perform the same operation in parallel. The second PIM network system may control the traffic such that the plurality of PIM devices of the second group perform different operations, the plurality of PIM devices of the second group perform different operations in groups, or the plurality of PIM devices of the second group perform the same operation in parallel.
A processing-in-memory (PIM)-based accelerating system according to an embodiment of the present disclosure may include a plurality of PIM-based accelerating devices, and a host device coupled to the plurality of PIM-based accelerating devices through a system bus. Each of the plurality of PIM-based accelerating devices may include a first interface coupled to the system bus, and a second interface coupled to another PIM-based accelerating device.
A processing-in-memory (PIM)-based accelerating card according to an embodiment of the present disclosure may include a printed circuit board, a plurality of PIM devices mounted over the printed circuit board in forms of chips or packages, a PIM network system mounted over the printed circuit board in a form of a chip or a package and configured to control signal and data traffic of the plurality of PIM devices, a first interface device attached to the printed circuit board, and a second interface device attached to the printed circuit board.
A processing-in-memory (PIM)-based accelerating card according to an embodiment of the present disclosure may include a printed circuit board, a plurality of groups of a plurality of PIM devices mounted over the printed circuit board in forms of chips or packages, a plurality of PIM network systems mounted over the printed circuit board in forms of chips or packages and configured to control signal and data traffic of the plurality of groups, a first interface device attached to the printed circuit board, and a second interface device attached to the printed circuit board.
In the following description of embodiments, it will be understood that although the terms “first,” “second,” “third,” etc. are used herein to describe various elements, these elements should not be limited by these terms. It will be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being “directly connected” or “directly coupled” to another element, there are no intervening elements present. The term “preset” means that the value of a parameter is predetermined when using that parameter in a process or algorithm. The value of the parameter may be set when a process or algorithm starts, or may be set while the process or algorithm is being performed, depending on embodiments.
A logic “high” level and a logic “low” level may be used to describe logic levels of signals. A signal having a logic “high” level may be distinguished from a signal having a logic “low” level. For example, when a signal having a first voltage corresponds to a signal having a logic “high” level, a signal having a second voltage corresponds to a signal having a logic “low” level. In an embodiment, the logic “high” level may be set as a voltage level which is higher than a voltage level of the logic “low” level. Meanwhile, the logic levels of signals may be set to be different or opposite according to the embodiments. For example, a certain signal having a logic “high” level in one embodiment may be set to have a logic “low” level in another embodiment, and a certain signal having a logic “low” level in one embodiment may be set to have a logic “high” level in another embodiment.
Each of the first to eighth PIM devices 111-118 may include at least one memory circuit and a processing circuit. In an example, the processing circuit may include a plurality of processing units. In an example, the first to eighth PIM devices 111-118 may be divided into a first PIM group 110A and a second PIM group 110B. The number of PIM devices included in the first PIM group 110A and the number of PIM devices included in the second PIM group 110B may be the same as each other. However, in another embodiment, the number of PIM devices included in the first PIM group 110A and the number of PIM devices included in the second PIM group 110B may be different from each other. As illustrated in
The PIM network system 120 may control the first to eighth PIM devices 111-118. Specifically, the PIM network system 120 may control or adjust both signals and data sent to and received from each of the first to eighth PIM devices 111-118. The PIM network system 120 may assign or direct each of the first to eighth PIM devices 111-118 to perform the same operation. The PIM network system 120 may assign or direct a subset of the eight PIM devices 111-118 to perform a particular operation and assign or direct each of the other PIM devices, i.e., PIM devices not part of the subset, to perform one or more other operations, which are different from the operation assigned to the first subset of PIM devices. The PIM network system 120 may assign a different operation to each of the first to eighth PIM devices 111-118 so that they perform different operations. The PIM network system 120 may direct the first to eighth PIM devices 111-118 to perform different operations in groups, or direct the first to eighth PIM devices 111-118 to perform the same operation in parallel, i.e., at the same time, or sequentially.
The PIM network system 120 may be coupled to the first to eighth PIM devices 111-118 through first to eighth signal/data lines 141-148, respectively. For example, the PIM network system 120 may transmit signals to the first PIM device 111 or exchange data with, i.e., send data to as well as receive data from, the first PIM device 111 through the first signal/data line 141. The PIM network system 120 may transmit signals to the second PIM device 112 or exchange data with, i.e., send data to as well as receive data from, the second PIM device 112 through the second signal/data line 142. In the same manner, the PIM network system 120 may transmit signals to the third to eighth PIM devices 113-118 or exchange data with, i.e., send data to as well as receive data from, the third to eighth PIM devices 113-118 through the third to eighth signal/data lines 143-148, respectively.
The PIM network system 120 may be coupled to the first interface 131 through a first interface bus 151. In addition, the PIM network system 120 may be simultaneously coupled to the second interface 132 through a second interface bus 152.
As used herein, “interface” should be construed as a hardware or software component that connects two or more other components for the purpose of passing information from one to the other. “Interface” may also be construed as an act or method of connecting two or more components for the purpose of passing information from one to the other. A “bus” is a set of two or more electrically parallel conductors, which forms a signal transmission path. With regard to the words “signals” and “data,” both words refer to information. In that regard, a “signal,” which may be a command or an instruction to a processor for example, is nevertheless information. As used herein therefore and depending on the context of its use, the word “information” may refer to a signal, data or both signals and data.
In
The second interface 132 may perform interfacing between the PIM-based accelerating device 100 and another PIM-based accelerating device or a network router. In an example, the second interface 132 may be a device employing a communication standard, for example, an Ethernet standard. In an example, the second interface 132 may be a small, hot-pluggable transceiver for data communication, such as a small form-factor pluggable (SFP) port. In an example, the second interface 132 may be a Quad SFP (QSFP) port in which four SFP ports are combined into one. In this case, the QSFP port may be used as four SFP ports using a breakout cable, or may be bonded to be used at four times the speed of the SFP standard. The second interface 132 may transmit data transmitted from the PIM network system 120 of the PIM-based accelerating device 100 through the second interface bus 152 to a PIM network system of another PIM-based accelerating device directly or through a network router. In addition, the second interface 132 may transmit data transmitted from another PIM-based accelerating device directly or through the network router to the PIM network system 120 through the second interface bus 152.
As used herein, the term “memory bank” refers to a plurality of memory “locations” in one or more semiconductor memory devices, e.g., static or dynamic RAM. Each location may contain (store) digital data transmitted, i.e., copied or stored, into the location and which can be retrieved, i.e., read therefrom. A “memory bank” may have virtually any number of storage locations, each location being capable of storing different numbers of binary digits (bits).
Referring to
In the peripheral circuit region 111B, a second memory circuit and a plurality of data input/output circuits DQs, for example, first to sixteenth data input/output circuits DQ0-DQ15 may be disposed. In an example, the second memory circuit may include a global buffer GB.
Each of the first to sixteenth processing units PU0-PU15 may be allocated to and operationally associated with one of the first to sixteenth memory banks BK0-BK15, respectively. Each processing unit may also be contiguous with its corresponding memory bank. For example, the first processing unit PU0 may be allocated and disposed adjacent to or at least proximate or near the first memory bank BK0. The second processing unit PU1 may be allocated and disposed adjacent to the second memory bank BK1. Similarly, the sixteenth processing unit PU15 may be allocated and disposed adjacent to the sixteenth memory bank BK15. As shown in
Each of the first to sixteenth memory banks BK0-BK15 may provide a quantity of data to a corresponding one of the first to sixteenth processing units PU0-PU15. In an example, the “first” data may be the first to sixteenth weight data. In another example, the first to sixteenth memory banks BK0-BK15 may provide a plurality of pieces of “second” data together with the plurality of pieces of “first” data to one or more of the first to sixteenth processing units PU0-PU15. The first data and the second data may be, for example, data used for an element-wise multiplication (EWM) operation.
More specifically, one of the first to sixteenth processing units PU0-PU15 may receive one piece of weight data among the first to sixteenth weight data from the memory bank BK to which the processing unit PU is allocated. For example, the first processing unit PU0 may receive the first weight data from the first memory bank BK0. The second processing unit PU1 may receive the second weight data from the second memory bank BK1. In the same manner, the third to sixteenth processing units PU2-PU15 may receive the third to sixteenth weight data from the third to sixteenth memory banks BK2-BK15, respectively.
The global buffer GB may provide the second data to each of the first to sixteenth processing units PU0-PU15. In an example, the second data may be vector data or input activation data, which may be input to each layer of a fully-connected (FC) layer in a neural network operation such as MLP.
Referring again to
In another embodiment, the number of memory banks and the number of processing units PU of the first PIM device 111 may be different from each other. For example, the first PIM device 111 may have a structure in which two memory banks share one processing unit PU. In this case, the number of processing units PU may be half the number of memory banks. In yet another embodiment, the first PIM device 111 may have a structure in which four memory banks share one processing unit PU. In such a case, the number of processing units PU may be ¼ of the number of memory banks.
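The bank-to-processing-unit mappings described above can be sketched as follows. This is a minimal Python illustration; the contiguous assignment scheme (each group of adjacent banks sharing one PU) is an assumption for illustration, not a detail taken from the embodiments:

```python
def assign_banks_to_pus(num_banks, banks_per_pu):
    """Map each memory bank index to the processing unit it is served by,
    assuming adjacent banks share one PU (illustrative scheme only)."""
    return {bank: bank // banks_per_pu for bank in range(num_banks)}

one_to_one = assign_banks_to_pus(16, 1)  # 16 PUs: one PU per bank
shared_2 = assign_banks_to_pus(16, 2)    # 8 PUs: two banks share one PU
shared_4 = assign_banks_to_pus(16, 4)    # 4 PUs: four banks share one PU
```

With two banks per PU the number of PUs is half the number of banks (8), and with four banks per PU it is one quarter (4), matching the ratios stated above.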
Referring to
As described above with reference to
Referring to
As illustrated in
The input data input to the input layer, the first hidden layer, the second hidden layer, and the output layer may have a format of a vector matrix used for a matrix multiplication operation. In the input layer, a first matrix multiplication, that is, a first multiplying-accumulating (MAC) operation, may be performed on the first vector matrix, which is the input data INPUT1, INPUT2, and INPUT3, and the first weight matrix. The input layer may perform the first MAC operation to generate a second vector matrix, and transmit the generated second vector matrix to the first hidden layer. In the first hidden layer, a second matrix multiplication for the second vector matrix and the second weight matrix, that is, a second MAC operation, may be performed. The first hidden layer may perform the second MAC operation to generate a third vector matrix, and transmit the generated third vector matrix to the second hidden layer. In the second hidden layer, a third matrix multiplication for the third vector matrix and the third weight matrix, that is, a third MAC operation, may be performed. The second hidden layer may perform the third MAC operation to generate a fourth vector matrix, and transmit the generated fourth vector matrix to the output layer. In the output layer, a fourth matrix multiplication for the fourth vector matrix and the fourth weight matrix, that is, a fourth MAC operation, may be performed. The output layer may perform the fourth MAC operation to generate final output data OUTPUT.
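The layer-by-layer MAC flow described above can be sketched in Python. The matrix sizes and values below are hypothetical placeholders (real layers are far larger); only the chaining of four MAC operations mirrors the description:

```python
def mac(weights, vector):
    """Matrix multiplication realized as one multiply-accumulate per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

# Hypothetical tiny weight matrices, one per layer (values made up).
W1 = [[1, 0, 1], [0, 1, 1]]  # input layer weight matrix
W2 = [[1, 1], [1, -1]]       # first hidden layer weight matrix
W3 = [[2, 0], [0, 2]]        # second hidden layer weight matrix
W4 = [[1, 1]]                # output layer weight matrix

v1 = [1, 2, 3]               # first vector matrix: INPUT1, INPUT2, INPUT3
v2 = mac(W1, v1)             # first MAC operation  -> second vector matrix
v3 = mac(W2, v2)             # second MAC operation -> third vector matrix
v4 = mac(W3, v3)             # third MAC operation  -> fourth vector matrix
output = mac(W4, v4)         # fourth MAC operation -> final output data OUTPUT
```

Each layer consumes the vector matrix produced by the previous layer, exactly as the second through fourth vector matrices are passed between layers above.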
The first to eighth PIM devices 111-118 of
Referring to
The weight matrix 311 may have 16 rows and 64 columns. That is, first to sixteenth weight data groups GW1-GW16 may be disposed in the first to sixteenth rows of the weight matrix 311, respectively. The first to sixteenth weight data groups GW1-GW16 may include first to sixteenth weight data each having 64 pieces of data. Specifically, as illustrated in
In an example, the first to sixteenth weight data groups GW1-GW16 of the weight matrix 311 may be stored in the first to sixteenth memory banks BK0-BK15, respectively. For example, the first weight data W1.1-W1.64 of the first weight data group GW1 may be stored in the first memory bank BK0. The second weight data W2.1-W2.64 of the second weight data group GW2 may be stored in the second memory bank BK1. Similarly, the sixteenth weight data W16.1-W16.64 of the sixteenth weight data group GW16 may be stored in the sixteenth memory bank BK15. Accordingly, the first processing unit PU0 may receive the first weight data W1.1-W1.64 of the first weight data group GW1 from the first memory bank BK0. The second processing unit PU1 may receive the second weight data W2.1-W2.64 of the second weight data group GW2 from the second memory bank BK1. In addition, the sixteenth processing unit PU15 may receive the sixteenth weight data W16.1-W16.64 of the sixteenth weight data group GW16 from the sixteenth memory bank BK15. The first to 64th vector data V1.1-V64.1 of the vector matrix 312 may be stored in the global buffer GB. Accordingly, the first to sixteenth processing units PU0-PU15 may receive the first to 64th vector data V1.1-V64.1 from the global buffer GB.
The first to sixteenth processing units PU0-PU15 may perform the MAC operations using the first to sixteenth weight data groups GW1-GW16 transmitted from the first to sixteenth memory banks BK0-BK15 and the vector data V1.1-V64.1 transmitted from the global buffer GB. The first to sixteenth processing units PU0-PU15 may output the result data generated by performing the MAC operations as the MAC result data RES1.1-RES64.1. The first processing unit PU0 may perform the MAC operation on the first weight data W1.1-W1.64 of the first weight data group GW1 and the vector data V1.1-V64.1 and output result data as the first MAC result data RES1.1. The second processing unit PU1 may perform the MAC operation on the second weight data W2.1-W2.64 of the second weight data group GW2 and the vector data V1.1-V64.1 and output result data as the second MAC result data RES2.1. In addition, the sixteenth processing unit PU15 may perform the MAC operation on the sixteenth weight data W16.1-W16.64 of the sixteenth weight data group GW16 and the vector data V1.1-V64.1 and output result data as the sixteenth MAC result data RES16.1.
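The per-bank weight storage and shared global-buffer vector described above can be modeled as follows. The numeric contents of the banks and the global buffer are made-up values, and the sequential loop stands in for the sixteen processing units operating in parallel:

```python
ROWS, COLS = 16, 64  # weight matrix 311 is 16 x 64; vector matrix 312 is 64 x 1

# Hypothetical data: weight data group GW(r+1) stored in memory bank BK(r),
# vector data V1.1-V64.1 held in the global buffer GB (values made up).
banks = [[(r + 1) * (c + 1) % 7 for c in range(COLS)] for r in range(ROWS)]
global_buffer = [c % 5 for c in range(COLS)]

def processing_unit(weights, vector):
    """One PU performs a MAC operation over its 64 weight/vector pairs."""
    return sum(w * v for w, v in zip(weights, vector))

# All 16 PUs operate concurrently in hardware; modeled sequentially here.
# results[r] corresponds to MAC result data RES(r+1).1 of the 16 x 1 result matrix.
results = [processing_unit(banks[r], global_buffer) for r in range(ROWS)]
```

Each PU reads its weight row from its own bank while all PUs share the same vector data from the global buffer, which is why the vector is stored once rather than per bank.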
Depending on the amount of data that can be processed by the first to sixteenth processing units PU0-PU15, the MAC operation for the weight matrix 311 and the vector matrix 312 may be divided into a plurality of sub-MAC operations and performed. Hereinafter, it is assumed that the amount of data that can be processed by the first to sixteenth processing units PU0-PU15 is 16 pieces of weight data and 16 pieces of vector data. The first to sixteenth weight data constituting the first to sixteenth weight data groups GW1-GW16 may each be divided into four sets. Similarly, the first to 64th vector data V1.1-V64.1 may also be divided into four sets.
For example, the first weight data W1.1-W1.64 constituting the first weight data group GW1 may be divided into a first set W1.1-W1.16, a second set W1.17-W1.32, a third set W1.33-W1.48, and a fourth set W1.49-W1.64. The first set W1.1-W1.16 of the first weight data W1.1-W1.64 may be composed of elements of the first to sixteenth columns of the first row of the weight matrix 311. The second set W1.17-W1.32 of the first weight data W1.1-W1.64 may be composed of elements of the 17th to 32nd columns of the first row of the weight matrix 311. The third set W1.33-W1.48 of the first weight data W1.1-W1.64 may be composed of elements of the 33rd to 48th columns of the first row of the weight matrix 311. In addition, the fourth set W1.49-W1.64 of the first weight data W1.1-W1.64 may be composed of elements of the 49th to 64th columns of the first row of the weight matrix 311.
Similarly, the second weight data W2.1-W2.64 constituting the second weight data group GW2 may be divided into a first set W2.1-W2.16, a second set W2.17-W2.32, a third set W2.33-W2.48, and a fourth set W2.49-W2.64. The first set W2.1-W2.16 of the second weight data W2.1-W2.64 may be composed of elements of the first to sixteenth columns of the second row of the weight matrix 311. The second set W2.17-W2.32 of the second weight data W2.1-W2.64 may be composed of elements of the 17th to 32nd columns of the second row of the weight matrix 311. The third set W2.33-W2.48 of the second weight data W2.1-W2.64 may be composed of elements of the 33rd to 48th columns of the second row of the weight matrix 311. In addition, the fourth set W2.49-W2.64 of the second weight data W2.1-W2.64 may be composed of elements of the 49th to 64th columns of the second row of the weight matrix 311.
Similarly, the sixteenth weight data W16.1-W16.64 constituting the sixteenth weight data group GW16 may be divided into a first set W16.1-W16.16, a second set W16.17-W16.32, a third set W16.33-W16.48, and a fourth set W16.49-W16.64. The first set W16.1-W16.16 of the sixteenth weight data W16.1-W16.64 may be composed of elements of the first to sixteenth columns of the sixteenth row of the weight matrix 311. The second set W16.17-W16.32 of the sixteenth weight data W16.1-W16.64 may be composed of elements of the 17th to 32nd columns of the sixteenth row of the weight matrix 311. The third set W16.33-W16.48 of the sixteenth weight data W16.1-W16.64 may be composed of elements of the 33rd to 48th columns of the sixteenth row of the weight matrix 311. In addition, the fourth set W16.49-W16.64 of the sixteenth weight data W16.1-W16.64 may be composed of elements of the 49th to 64th columns of the sixteenth row of the weight matrix 311.
The first to 64th vector data V1.1-V64.1 may be divided into a first set V1.1-V16.1, a second set V17.1-V32.1, a third set V33.1-V48.1, and a fourth set V49.1-V64.1. The first set V1.1-V16.1 of the vector data may be composed of elements of the first to sixteenth rows of the vector matrix 312. The second set V17.1-V32.1 of the vector data may be composed of elements of the 17th to 32nd rows of the vector matrix 312. The third set V33.1-V48.1 of the vector data may be composed of elements of the 33rd to 48th rows of the vector matrix 312. In addition, the fourth set V49.1-V64.1 of the vector data may be composed of elements of the 49th to 64th rows of the vector matrix 312.
Hereinafter, a MAC operation process performed by the first processing unit PU0 will be described. The MAC operation process described below may be equally applied to the MAC operation processes performed by the second to sixteenth processing units PU1-PU15. The first processing unit PU0 may perform a first sub-MAC operation on the first set W1.1-W1.16 of the first weight data and the first set V1.1-V16.1 of the vector data to generate first MAC data. The first sub-MAC operation may be performed by a multiplication on the first set W1.1-W1.16 of the first weight data and the first set V1.1-V16.1 of the vector data and an addition on multiplication result data.
The first processing unit PU0 may perform a second sub-MAC operation on the second set W1.17-W1.32 of the first weight data and the second set V17.1-V32.1 of the vector data to generate second MAC data. The second sub-MAC operation may be performed by multiplication on the second set W1.17-W1.32 of the first weight data and the second set V17.1-V32.1 of vector data, addition on multiplication result data, and accumulation on addition operation result data and the first MAC data.
The first processing unit PU0 may perform a third sub-MAC operation on the third set W1.33-W1.48 of the first weight data and the third set V33.1-V48.1 of the vector data to generate third MAC data. The third sub-MAC operation may be performed by multiplication on the third set W1.33-W1.48 of the first weight data and the third set V33.1-V48.1 of the vector data, addition on multiplication result data, and accumulation on addition result data and the second MAC data.
The first processing unit PU0 may perform a fourth sub-MAC operation on the fourth set W1.49-W1.64 of the first weight data and the fourth set V49.1-V64.1 of the vector data to generate fourth MAC data. The fourth sub-MAC operation may be performed by multiplications on the fourth set W1.49-W1.64 of the first weight data and the fourth set V49.1-V64.1 of the vector data, additions on multiplication result data, and accumulation on addition result data and the third MAC data. The fourth MAC data generated by the fourth sub-MAC operation may constitute the first MAC result data RES1.1 corresponding to an element of the first column of the result matrix 313.
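The four-step sub-MAC process above amounts to a tiled dot product with a running accumulation. The sketch below assumes, as in the text, a tile size of 16 and uses made-up weight and vector values:

```python
TILE = 16  # amount of data one processing unit can handle per sub-MAC operation

weights = [(c + 1) % 9 for c in range(64)]  # first weight data W1.1-W1.64 (made up)
vector = [(c + 2) % 7 for c in range(64)]   # vector data V1.1-V64.1 (made up)

acc = 0
for start in range(0, 64, TILE):
    w_set = weights[start:start + TILE]  # first..fourth set of the weight data
    v_set = vector[start:start + TILE]   # first..fourth set of the vector data
    partial = sum(w * v for w, v in zip(w_set, v_set))  # multiply, then add
    acc += partial  # accumulate with the MAC data from the previous sub-MAC

# After the fourth sub-MAC operation, acc constitutes MAC result data RES1.1.
```

Because addition is associative, the four accumulated partial sums equal the single 64-element dot product, which is why splitting the operation does not change the result.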
The description below may be applied to each of the processing units PU1-PU15 included in the first PIM device 111. Moreover, the processing unit PU description may be applied to the first to sixteenth processing units PU0-PU15 included in each of the second to eighth PIM devices 112-118 of
Still referring to
The multiplication circuit 410 may be configured to receive the first to sixteenth weight data W1-W16 and the first to sixteenth vector data V1-V16. The first to sixteenth weight data W1-W16 may be provided by, i.e., obtained from, the first memory bank (BK0 of
The first to sixteenth weight data W1-W16 and the first to sixteenth vector data V1-V16 may be the first set W1.1-W1.16 of the first weight data W1.1-W1.64 and the first set V1.1-V16.1 of the vector data V1.1-V64.1 described with reference to
As
The first to sixteenth multipliers MUL0-MUL15 may perform multiplications on the first to sixteenth weight data W1-W16 by the first to sixteenth vector data V1-V16, respectively. The first to sixteenth multipliers MUL0-MUL15 may output data generated as a result of the multiplications as the first to sixteenth multiplication data WV1-WV16, respectively. For example, the first multiplier MUL0 may perform a multiplication of the first weight data W1 and the first vector data V1 to output the first multiplication data WV1. The second multiplier MUL1 may perform a multiplication of the second weight data W2 and the second vector data V2 to output the second multiplication data WV2. In the same manner, the remaining multipliers MUL2-MUL15 may also output the third to sixteenth multiplication data WV3-WV16, respectively. The first to sixteenth multiplication data WV1-WV16 output from the multipliers MUL0-MUL15 may be transmitted to the addition circuit 420.
The addition circuit 420 may be configured by arranging a plurality of adders ADDERs in a hierarchical structure such as a tree structure. The addition circuit 420 may be composed of half-adders as well as full-adders. Eight adders ADD11-ADD18 may be disposed in a first stage of the addition circuit 420. Four adders ADD21-ADD24 may be disposed in the next lower second stage of the addition circuit 420. Not shown in
Each first stage adder ADD11-ADD18 may receive multiplication data WVs from two multipliers of the first to sixteenth multipliers MUL0-MUL15 of the multiplication circuit 410. Each first stage adder ADD11-ADD18 may perform an addition on the input multiplication data WVs to generate and output addition data. For example, the adder ADD11 of the first stage may receive the first and second multiplication data WV1 and WV2 from the first and second multipliers MUL0 and MUL1, and perform an addition on the first and second multiplication data WV1 and WV2 to output addition result data. In the same manner, the adder ADD18 of the first stage may receive the fifteenth and sixteenth multiplication data WV15 and WV16 from the fifteenth and sixteenth multipliers MUL14 and MUL15, and perform an addition on the fifteenth and sixteenth multiplication data WV15 and WV16 to output addition result data.
Each second stage adder ADD21-ADD24 may receive the addition result data from two first stage adders ADD11-ADD18 and perform an addition on the addition result data to output addition result data. For example, the second stage adder ADD21 may receive the addition results from first stage adders ADD11 and ADD12. The addition result data output from the second stage adder ADD21 may therefore have a value obtained by adding all of the first to fourth multiplication data WV1 to WV4. In this way, the fourth stage adder ADD41 may perform an addition of the addition results from two third-stage adders to generate and output multiplication addition data DADD, which is the data that is output from the addition circuit 420. The multiplication addition data DADD output from the addition circuit 420 may be transmitted to the accumulation circuit 430.
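The multiplier array and hierarchical adder tree described above can be modeled as a pairwise reduction. The weight and vector values are hypothetical, and the loop collapses the four adder stages (8, 4, 2, and 1 adders) one stage per iteration:

```python
weights = list(range(1, 17))  # weight data W1-W16 (hypothetical values)
vector = [2] * 16             # vector data V1-V16 (hypothetical values)

# Multiplication circuit: 16 multipliers MUL0-MUL15 produce WV1-WV16.
products = [w * v for w, v in zip(weights, vector)]

# Addition circuit: pairwise adders arranged in a tree, one stage per pass.
stage = products
while len(stage) > 1:
    stage = [stage[i] + stage[i + 1] for i in range(0, len(stage), 2)]

dadd = stage[0]  # multiplication addition data DADD output by adder ADD41
```

The first pass corresponds to adders ADD11-ADD18, the second to ADD21-ADD24, and so on, until the single fourth-stage adder emits DADD, the sum of all sixteen products.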
As used herein and depending on the context in which it is used, the word “latch” may refer to a device, which may retain or hold data. “Latch” may also refer to an action or a method by which data is stored, retained, or held. As used herein, the term “accumulative addition” refers to a running and accumulating sum (addition) of a sequence of partial sums of a data set. An accumulative addition may be used to show the summation of data over time.
In
The output circuit 440 may output accumulation data DACC, or may output inverted accumulation data DACC, which is transmitted from the latch circuit 432 of the accumulation circuit 430, depending on a logic level of a resultant read signal RD_RES. In an example, when the MAC operation described with reference to
On the other hand, the accumulation data DACC transmitted from the latch circuit 432 of the accumulation circuit 430 in any one of the first to third sub-MAC operation processes might not constitute the MAC result data RES. In such a case, the resultant read signal RD_RES of a logic “low” level may be transmitted to the output circuit 440. The output circuit 440 might not output the accumulation data DACC as the MAC result data RES in response to the resultant read signal RD_RES of the logic “low” level. Although not shown in
The PIM interface circuit 121 may be coupled to a first interface 131 through a first interface bus 151. Accordingly, the PIM interface circuit 121 may receive a host instruction HOST_INS from a host device through the first interface 131 and the first interface bus 151. Although not shown in
As used herein, “unicast” refers to a transmission mode in which a single message is sent to a single “network” destination, (i.e., one-to-one). “Broadcast” refers to a transmission mode in which a single message is sent to all “network” destinations. “Multicast” refers to a transmission mode in which a single message is sent to multiple “network” destinations but not necessarily all destinations.
The multimode interconnect circuit 123 may transmit the memory request MEM_REQ or the plurality of PIM requests PIM_REQs transmitted from the PIM interface circuit 121 to at least one PIM controller among first to eighth PIM controllers 122(1)-122(8). In an example, the multimode interconnect circuit 123 may operate in any one mode among a unicast mode, a multicast mode, and a broadcast mode.
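The three transmission modes defined above may be sketched as a routing helper that returns the set of destination PIM controllers for a request; the function name, index scheme, and default controller count are assumptions made for this sketch.

```python
def route(targets, mode, num_controllers=8):
    """Return the set of PIM-controller indices that receive a request.

    mode: 'unicast'   - one-to-one (exactly one destination)
          'multicast' - one-to-some (multiple, not necessarily all)
          'broadcast' - one-to-all
    """
    if mode == "broadcast":
        return set(range(num_controllers))
    if mode == "multicast":
        return set(targets)
    if mode == "unicast":
        (only,) = targets  # unicast requires exactly one destination
        return {only}
    raise ValueError(f"unknown mode: {mode}")
```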
Each of the first to eighth PIM controllers 122(1)-122(8) may generate at least one memory command MEM_CMD corresponding to the memory request MEM_REQ transmitted from the multimode interconnect circuit 123. In addition, each of the first to eighth PIM controllers 122(1)-122(8) may generate a plurality of PIM commands PIM_CMDs corresponding to the plurality of PIM requests PIM_REQs transmitted from the multimode interconnect circuit 123. The first to eighth PIM controllers 122(1)-122(8) may transmit the memory command MEM_CMD or the plurality of PIM commands PIM_CMDs to the first to eighth PIM devices 111-118, respectively. The first to eighth PIM controllers 122(1)-122(8) may be allocated to the first to eighth PIM devices 111-118, respectively. For example, the first PIM controller 122(1) may be allocated to the first PIM device 111. Accordingly, the first PIM controller 122(1) may transmit the memory command MEM_CMD or the plurality of PIM commands PIM_CMDs to the first PIM device 111. Similarly, the eighth PIM controller 122(8) may be allocated to the eighth PIM device 118. Accordingly, the eighth PIM controller 122(8) may transmit the memory command MEM_CMD or the plurality of PIM commands PIM_CMDs to the eighth PIM device 118.
The card-to-card router 124 may be coupled to the second interface 132 through the second interface bus 152. The card-to-card router 124 may transmit a network packet NET_PACKET to the second interface 132 through the second interface bus 152, based on the network request NET_REQ transmitted from the PIM interface circuit 121. The card-to-card router 124 may process the network packet NET_PACKET transmitted from another PIM-based accelerating device or a network router through the second interface 132 and the second interface bus 152. In this case, although not shown in
The host interface 511 may receive the host instruction HOST_INS from the host device through the first interface 131. The host interface 511 may be configured according to a high-speed interfacing protocol employed by the first interface 131. For example, when the host interface 511 adopts the PCIe standard, the host interface 511 may include an interface master and an interface slave, such as an advanced extensible interface (AXI) master and an AXI slave, respectively. The host interface 511 may transmit the host instruction HOST_INS transmitted from the first interface 131 to the instruction decoder/sequencer 512. Although not shown in
It is our experience that “queue” almost always refers to a list. In the following paragraphs, however, the word “queue” seems to refer to a structure because it is shown in
The word “queue” may refer to a list in which data items are appended to the last position of the list and retrieved from the first position of the list. Depending on the context in which “queue” is used, however, “queue” may also refer to a device, e.g., memory, in which data items may be appended to the last position of a list of items stored in the device and retrieved from the first position of the list of stored items.
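The definition above describes first-in, first-out behavior, which may be sketched as follows; the class and method names are illustrative assumptions.

```python
from collections import deque

class RequestQueue:
    """FIFO queue as defined above: items are appended at the last
    position and retrieved from the first position."""

    def __init__(self):
        self._items = deque()

    def append(self, item):
        self._items.append(item)      # last position of the list

    def retrieve(self):
        return self._items.popleft()  # first position of the list
```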
In
The memory/PIM request generating circuit 513 may generate and output at least one memory request MEM_REQ, the plurality of PIM requests PIM_REQs, or the local memory request LM_REQ, based on the host instruction HOST_INS transmitted from the instruction decoder/sequencer 512. In an example, the memory request MEM_REQ may request a read operation or a write operation for the first to sixteenth memory banks (BK0-BK15 of
In an example, the memory/PIM request generating circuit 513 may generate and output the memory request MEM_REQ, the plurality of PIM requests PIM_REQs, or the local memory request LM_REQ, based on a finite state machine (hereinafter, referred to as “FSM”) 513A. In this case, data included in the host instruction HOST_INS may be used as an input value to the FSM 513A. The memory/PIM request generating circuit 513 may transmit the memory request MEM_REQ or the plurality of PIM requests PIM_REQs to the multimode interconnect circuit (123 in
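A simplified, state-free sketch of the request-generation step is given below: data carried in the host instruction acts as the input value that selects which request or requests are emitted, in the spirit of the FSM-driven generation described above. The instruction fields ("kind", "steps") and the request tags are assumptions made for this sketch.

```python
def generate_requests(host_ins):
    """Map one decoded host instruction HOST_INS to a list of requests."""
    kind = host_ins["kind"]
    if kind in ("read_bank", "write_bank"):
        return [("MEM_REQ", host_ins)]           # memory read/write request
    if kind == "mac":
        # A single host instruction may expand into several PIM requests.
        return [("PIM_REQ", step) for step in host_ins["steps"]]
    if kind == "local":
        return [("LM_REQ", host_ins)]            # local memory request
    raise ValueError(f"unsupported instruction kind: {kind}")
```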
The local memory circuit 514 may perform a local memory operation, based on the local memory request LM_REQ transmitted from the memory/PIM request generating circuit 513. In an example, the local memory circuit 514 may store the bias data D_B, the operation result data D_R, and the maintenance data D_M transmitted together with the local memory request LM_REQ. In addition, the local memory circuit 514 may return the stored bias data D_B, the operation result data D_R, and the maintenance data D_M to the memory/PIM request generating circuit 513, based on the local memory request LM_REQ. In an example, the local memory circuit 514 may include a static random access memory (SRAM) device.
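A minimal sketch of the store/return local memory operation is shown below; the request verbs ("store", "return") and the keying scheme are illustrative assumptions, not part of the specification.

```python
class LocalMemoryCircuit:
    """Sketch: store data carried with a local memory request LM_REQ,
    or return previously stored data."""

    def __init__(self):
        self._sram = {}  # stands in for the SRAM device

    def handle(self, lm_req, key, data=None):
        if lm_req == "store":
            self._sram[key] = data   # e.g. bias data D_B, result data D_R
            return None
        if lm_req == "return":
            return self._sram[key]
        raise ValueError(f"unknown local memory request: {lm_req}")
```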
First, as shown in
Next, as shown in
Next, as shown in
As used herein, “arbiter” refers to a device or a method, which accepts bus requests from bus-requesting devices or methods (modules) and grants control of a bus to one requester at a time. “Physical layer” refers to the layer of the ISO Reference Model that provides the mechanical, electrical, functional, and procedural characteristics required to access a transmission medium.
The request arbiter 521 may store the memory request MEM_REQ or the plurality of PIM requests PIM_REQs transmitted from the multimode interconnect circuit (123 of
The bank engine 522 may generate and output the memory command MEM_CMD corresponding to the memory request MEM_REQ transmitted from the request arbiter 521. In an example, the memory command MEM_CMD generated by the bank engine 522 may include a pre-charge command, an activation command, a read command, and a write command.
The PIM engine 523 may generate and output a plurality of PIM commands PIM_CMDs corresponding to the plurality of PIM requests PIM_REQs transmitted from the request arbiter 521. In an example, the plurality of PIM commands PIM_CMDs generated by the PIM engine 523 may include an activation command for the memory banks, MAC operation commands, an activation function command, an element-wise multiplication command, a data copy command from the memory bank to the global buffer, a data copy command from the global buffer to the memory banks, a write command to the global buffer, a read command for MAC result data, a read command for MAC result data processed with activation function, and a write command for the memory banks. In this case, the activation command for the memory banks may target some memory banks among the plurality of memory banks or may target all memory banks. The activation command for the memory banks may be generated for read and write operations on the weight data, or may be generated for read and write operations on activation function data. The MAC operation commands may be divided into a MAC operation command for a single memory bank, a MAC operation command for some memory banks, and a MAC operation command for all memory banks.
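As an illustration of how one PIM request may expand into a sequence of PIM commands, consider the following sketch; the command mnemonics loosely follow the list above, but the request keys and the exact sequences are assumptions made for this sketch.

```python
# Illustrative request-to-command-sequence table for the PIM engine.
PIM_COMMAND_SEQUENCES = {
    "mac_single_bank":            ["ACT_BANK", "MAC_SINGLE_BANK", "RD_MAC_RESULT"],
    "mac_all_banks":              ["ACT_ALL_BANKS", "MAC_ALL_BANKS", "RD_MAC_RESULT"],
    "copy_bank_to_global_buffer": ["ACT_BANK", "COPY_BK_TO_GB"],
    "copy_global_buffer_to_bank": ["COPY_GB_TO_BK", "WR_BANK"],
}

def expand_pim_request(pim_req):
    """Return the PIM command sequence (PIM_CMDs) for one PIM request."""
    return list(PIM_COMMAND_SEQUENCES[pim_req])
```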
The refresh engine 524 may generate and output a refresh command REF_CMD. The refresh engine 524 may generate the refresh command REF_CMD at regular intervals. The refresh engine 524 may perform scheduling for the generated refresh command REF_CMD.
The command arbiter 525 may receive the memory command MEM_CMD output from the bank engine 522, the plurality of PIM commands PIM_CMDs output from the PIM engine 523, and the refresh command REF_CMD output from the refresh engine 524. The command arbiter 525 may perform a multiplexing operation on the memory command MEM_CMD, the plurality of PIM commands PIM_CMDs, and the refresh command REF_CMD so that the command with priority is output first.
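The priority multiplexing performed by the command arbiter may be sketched with a priority queue; the particular priority order used below (refresh first, then memory commands, then PIM commands) is an assumption chosen for illustration, as the description above does not fix a specific order.

```python
import heapq

class CommandArbiter:
    """Sketch: issue the highest-priority pending command first, preserving
    arrival order within a priority class."""

    PRIORITY = {"REF_CMD": 0, "MEM_CMD": 1, "PIM_CMD": 2}  # assumed order

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker: arrival order within a class

    def submit(self, kind, cmd):
        heapq.heappush(self._heap, (self.PRIORITY[kind], self._seq, cmd))
        self._seq += 1

    def issue(self):
        # The command with priority is output first.
        return heapq.heappop(self._heap)[2]
```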
The physical layer 526 may transmit the memory command MEM_CMD, the plurality of PIM commands PIM_CMDs, and the refresh command REF_CMD transmitted from the command arbiter 525 to the first PIM device (111 in
The PIM interface circuit 221 may be coupled to a first interface 131 through a first interface bus 151. Accordingly, the PIM interface circuit 221 may receive a host instruction HOST_INS from a host device through the first interface 131 and the first interface bus 151. Although not shown in
The multimode interconnect circuit 223 may transmit the memory request MEM_REQ or the plurality of PIM requests PIM_REQs transmitted from the PIM interface circuit 221 to at least one PIM controller among the first to eighth PIM controllers 222(1)-222(8). In an example, the multimode interconnect circuit 223 may operate in any one of the unicast mode, the multicast mode, and the broadcast mode, as described with reference to
Each of the first to eighth PIM controllers 222(1)-222(8) may generate at least one memory command MEM_CMD corresponding to the memory request MEM_REQ transmitted from the multimode interconnect circuit 223. In addition, each of the first to eighth PIM controllers 222(1)-222(8) may generate a plurality of PIM commands PIM_CMDs corresponding to the plurality of PIM requests PIM_REQs transmitted from the multimode interconnect circuit 223. The first to eighth PIM controllers 222(1)-222(8) may transmit the memory command MEM_CMD or the plurality of PIM commands PIM_CMDs to the first to eighth PIM devices 111-118, respectively. The first to eighth PIM controllers 222(1)-222(8) may be allocated to the first to eighth PIM devices 111-118, respectively. For example, the first PIM controller 222(1) may be allocated to the first PIM device 111. Accordingly, the first PIM controller 222(1) may transmit the memory command MEM_CMD or the plurality of PIM commands PIM_CMDs to the first PIM device 111. Similarly, the eighth PIM controller 222(8) may be allocated to the eighth PIM device 118. Accordingly, the eighth PIM controller 222(8) may transmit the memory command MEM_CMD or the plurality of PIM commands PIM_CMDs to the eighth PIM device 118. The description of the first PIM controller 122(1) described with reference to
The card-to-card router 224 may be coupled to a second interface 132 through a second interface bus 152. The card-to-card router 224 may transmit a network packet NET_PACKET to the second interface 132 through the second interface bus 152, based on the network request NET_REQ transmitted from the PIM interface circuit 221. The card-to-card router 224 may process the network packets NET_PACKETs transmitted from another PIM-based accelerating device or a network router through the second interface 132 and the second interface bus 152. In this case, although not shown in
The local memory 225 may receive the local memory request LM_REQ from the PIM interface circuit 221. Although not shown in
The local processing unit 226 may receive the local processing request LP_REQ from the PIM interface circuit 221. The local processing unit 226 may perform local processing designated by the local processing request LP_REQ in response to the local processing request LP_REQ. To this end, the local processing unit 226 may receive data required for the local processing from the PIM interface circuit 221 or the local memory 225. The local processing unit 226 may transmit result data generated by the local processing to the local memory 225.
The host interface 511 may receive the host instruction HOST_INS from the first interface 131. As described with reference to
The instruction sequencer 515 may generate and output a memory request MEM_REQ, PIM requests PIM_REQs, a network request NET_REQ, a local memory request LM_REQ, or a local processing request LP_REQ, based on the host instruction HOST_INS transmitted from the host interface 511. The instruction sequencer 515 may include an instruction queue 515A, an instruction decoder 515B, and an instruction sequencing FSM 515C. The instruction queue 515A may store the host instruction HOST_INS transmitted from the host interface 511. The instruction decoder 515B may decode the stored host instruction HOST_INS to transmit the decoded host instruction to the instruction sequencing FSM 515C. The instruction sequencing FSM 515C may generate and output the memory request MEM_REQ, the PIM requests PIM_REQs, the network request NET_REQ, the local memory request LM_REQ, or the local processing request LP_REQ, based on the decoding result of the host instruction HOST_INS. The instruction sequencing FSM 515C may transmit the memory request MEM_REQ and the PIM requests PIM_REQs to the multimode interconnect circuit (223 in
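The queue → decoder → sequencing-FSM path described above may be sketched as a small pipeline; the instruction text format and the opcode-to-request table are assumptions made only for this sketch.

```python
from collections import deque

class InstructionSequencer:
    """Sketch of the instruction queue, decoder, and sequencing FSM path."""

    # Assumed opcode-to-request mapping for illustration.
    REQUEST_TABLE = {"MEM": "MEM_REQ", "PIM": "PIM_REQ", "NET": "NET_REQ",
                     "LM": "LM_REQ", "LP": "LP_REQ"}

    def __init__(self):
        self.queue = deque()  # role of the instruction queue

    def push(self, host_ins):
        self.queue.append(host_ins)

    def decode(self, host_ins):
        # Role of the instruction decoder: split opcode from operand.
        op, _, arg = host_ins.partition(" ")
        return op, arg

    def step(self):
        # Role of the sequencing FSM: emit the request for the next
        # decoded instruction.
        op, arg = self.decode(self.queue.popleft())
        return (self.REQUEST_TABLE[op], arg)
```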
Referring to
Referring to
The first to eighth PIM devices 611-618 may include PIM devices each constituting a first channel CH_A (hereinafter, referred to as “first to eighth channel A-PIM devices”) and PIM devices each constituting a second channel CH_B (hereinafter, referred to as “first to eighth channel B-PIM devices”). In this example, the first to eighth PIM devices 611-618 include the first to eighth channel A-PIM devices and the first to eighth channel B-PIM devices constituting two channels, but this is just one example, and the first to eighth PIM devices 611-618 may include three or more channel-PIM devices each constituting three or more channels. In another example, each of the first to eighth channel A-PIM devices and each of the first to eighth channel B-PIM devices may include a plurality of ranks.
The first channel A-PIM device (PIM0-CHA) 611A of the first PIM device 611 may be coupled to the PIM network system 620 through a first channel A signal/data line 641A. The first channel B-PIM device (PIM0-CHB) 611B of the first PIM device 611 may be coupled to the PIM network system 620 through a first channel B signal/data line 641B. The second channel A-PIM device (PIM1-CHA) 612A of the second PIM device 612 may be coupled to the PIM network system 620 through a second channel A signal/data line 642A. The second channel B-PIM device (PIM1-CHB) 612B of the second PIM device 612 may be coupled to the PIM network system 620 through a second channel B signal/data line 642B. Similarly, the eighth channel A-PIM device (PIM7-CHA) 618A of the eighth PIM device 618 may be coupled to the PIM network system 620 through an eighth channel A signal/data line 648A. The eighth channel B-PIM device (PIM7-CHB) 618B of the eighth PIM device 618 may be coupled to the PIM network system 620 through an eighth channel B signal/data line 648B. Each of the first to eighth channel-A PIM devices 611A-618A and each of the first to eighth channel B-PIM devices 611B-618B may be configured the same as the PIM device (111 of
Referring to
Referring to
Referring to
The first PIM network system 720A may include a first high-speed interface, for example, a first PCIe interface 721A and a first chip-to-chip interface (1st C2C I/F) 722A. The second PIM network system 720B may include a second high-speed interface, for example, a second PCIe interface 721B and a second chip-to-chip interface (2nd C2C I/F) 722B. Each of the first PCIe interface 721A and the second PCIe interface 721B may be replaced with a CXL interface or a USB interface. Each of the first PCIe interface 721A of the first PIM network system 720A and the second PCIe interface 721B of the second PIM network system 720B may correspond to the host interfaces 511 described with reference to
The first PIM network system 720A may be coupled to the first group of PIM devices, that is, the first to eighth PIM devices 711A-718A through first to eighth signal/data lines 741A-748A. For example, the first PIM network system 720A may be coupled to the first PIM device 711A through the first signal/data line 741A. The first PIM network system 720A may be coupled to the second PIM device 712A through the second signal/data line 742A. Similarly, the first PIM network system 720A may be coupled to the eighth PIM device 718A through the eighth signal/data line 748A.
The second PIM network system 720B may be coupled to the second group of PIM devices, that is, the first to eighth PIM devices 711B-718B through ninth to sixteenth signal/data lines 741B-748B. For example, the second PIM network system 720B may be coupled to the ninth PIM device 711B through the ninth signal/data line 741B. The second PIM network system 720B may be coupled to the tenth PIM device 712B through the tenth signal/data line 742B. Similarly, the second PIM network system 720B may be coupled to the sixteenth PIM device 718B through the sixteenth signal/data line 748B. Accordingly, traffic control of signals and data for the first to eighth PIM devices 711A-718A may be performed by the first PIM network system 720A. In addition, traffic control of signals and data for the ninth to sixteenth PIM devices 711B-718B may be performed by the second PIM network system 720B.
The first interface 731 may perform interfacing between the PIM-based accelerating device 700A and a host device. In an example, the first interface 731 may operate by a PCIe protocol, a CXL protocol, or a USB protocol. The first interface 731 may transmit signals and data transmitted from the host device to the first PIM network system 720A through a first interface bus 751. The first interface 731 may transmit signals and data transmitted from the first PIM network system 720A through the first interface bus 751 to the host device. In this example, the first interface 731 may be coupled to a first PCIe interface 721A of the first PIM network system 720A. On the other hand, the first interface 731 might not be coupled to a second PCIe interface 721B of the second PIM network system 720B. Therefore, the second PIM network system 720B might not directly communicate with the host device, but may communicate with the host device through the first PIM network system 720A.
The second interface 732 may perform interfacing between the PIM-based accelerating device 700A and another PIM-based accelerating device or a network router. In an example, the second interface 732 may be a device employing a communication standard, for example, the Ethernet standard. In an example, the second interface 732 may be an SFP port. The second interface 732 may transmit data that is transmitted from the first PIM network system 720A of the PIM-based accelerating device 700A through the second interface bus 752 to a first PIM network system of another PIM-based accelerating device. In addition, the second interface 732 may transmit data that is transmitted from another PIM-based accelerating device to the first PIM network system 720A through the second interface bus 752. Such data transmission may be performed through a network router between the PIM-based accelerating devices. Although not shown in
The first chip-to-chip interface 722A of the first PIM network system 720A may be coupled to the second chip-to-chip interface 722B of the second PIM network system 720B through a third interface bus 753. The first PIM network system 720A may transmit signals and data that are transmitted from the host device to the first PCIe interface 721A through the first interface 731 and the first interface bus 751 to the second chip-to-chip interface 722B of the second PIM network system 720B through the first chip-to-chip interface 722A and the third interface bus 753. Similarly, the second PIM network system 720B may transmit the signals and data from the second chip-to-chip interface 722B to the first chip-to-chip interface 722A of the first PIM network system 720A through the third interface bus 753. In this case, the first PIM network system 720A may transmit the signals and data received through the first chip-to-chip interface 722A to the host device through the first PCIe interface 721A, the first interface bus 751, and the first interface 731.
Referring to
Referring to
Referring to
Each of the first interfaces 831(1)-831(K) may be the first interface 131 described with reference to
A system bus 850 may be disposed between the first to “K”th PIM-based accelerating devices 810(1)-810(K) and the host device 820. The first to “K”th PIM-based accelerating devices 810(1)-810(K) may communicate with the system bus 850 through first to “K”th interface buses 860(1)-860(K), respectively. The host device 820 may communicate with the system bus 850 through a host bus 870. The first to “K”th PIM-based accelerating devices 810(1)-810(K) may communicate with each other through a network line, for example, an Ethernet line 880.
Referring to
Referring to
The first interface device 913 may be a high-speed interface terminal conforming to high-speed interfacing for high-speed communication with the host device. In an example, the first interface device 913 may be a PCIe terminal. In another example, the first interface device 913 may be a CXL terminal or a USB terminal. The first interface device 913 may be physically coupled to a high-speed interface slot or port on a board on which the host device is disposed, such as a PCIe slot, a CXL slot, or a USB port. Although omitted from
The second interface device 914 may be configured as a card-to-card interface device for signal and data transmission with another PIM-based accelerating card or a network router. In an example, the second interface device 914 may be an SFP port or an Ethernet port. In this case, the second interface device 914 may be controlled by a network controller in the PIM network system 120. In addition, the second interface device 914 may be coupled to another PIM-based accelerating card or a network router through a network cable. The second interface device 914 may be disposed in a plural number.
Referring to
The first interface device 923 may be a high-speed interface terminal conforming to high-speed interfacing for high-speed communication with the host device. In an example, the first interface device 923 may be a PCIe terminal. In another example, the first interface device 923 may be a CXL terminal or a USB terminal. The first interface device 923 may be physically coupled to a high-speed interface slot or port on a board on which a host device is disposed, such as a PCIe slot, a CXL slot, or a USB port. Although omitted from
The second interface device 924 may be configured as a card-to-card interface device for signal and data transmission with another PIM-based accelerating card or a network router. In an example, the second interface device 924 may be an SFP port or an Ethernet port. In this case, the second interface device 924 may be controlled by a network controller in the PIM network system 620. In addition, the second interface device 924 may be coupled to another PIM-based accelerating card or a network router through a network cable. The second interface device 924 may be disposed in a plural number.
Referring to
The first interface device 933 may be a high-speed interface terminal conforming to high-speed interfacing for high-speed communication with the host device. In an example, the first interface device 933 may be a PCIe terminal. In another example, the first interface device 933 may be a CXL terminal or a USB terminal. The first interface device 933 may be physically coupled to a high-speed interface slot or port on a board on which a host device is disposed, such as a PCIe slot, a CXL slot, or a USB port. Although omitted from
The second interface device 934 may be configured as a card-to-card interface device for signal and data transmission with another PIM-based accelerating card or a network router. In an example, the second interface device 934 may be an SFP port or an Ethernet port. In this case, the second interface device 934 may be controlled by network controllers in the first and second PIM network systems 720A and 720B. In addition, the second interface device 934 may be coupled to another PIM-based accelerating card or a network router through a network cable. The second interface device 934 may be disposed in a plural number.
Referring to
The first interface device 943 may be a high-speed interface terminal conforming to high-speed interfacing for high-speed communication with the host device. In an example, the first interface device 943 may be a PCIe terminal. In another example, the first interface device 943 may be a CXL terminal or a USB terminal. The first interface device 943 may be physically coupled to a high-speed interface slot or port on a board on which a host device is disposed, such as a PCIe slot, a CXL slot, or a USB port. Although omitted from
The second interface device 944 may be configured as a card-to-card interface device for signal and data transmission with another PIM-based accelerating card or a network router. In an example, the second interface device 944 may be an SFP port or an Ethernet port. The second interface device 944 may be coupled to at least one of the first PIM network system 720A and the second PIM network system 720B of the PIM-based accelerating device 700C through the wiring of the PCB 941. The second interface device 944 may be controlled by network controllers in the first and second PIM network systems 720A and 720B. The second interface device 944 may be coupled to another PIM-based accelerating card or a network router through a network cable. The second interface device 944 may be disposed in a plural number.
The inventive concept has been disclosed in conjunction with some embodiments as described above. Those skilled in the art will appreciate that various modifications, additions, and substitutions are possible, without departing from the scope and spirit of the present disclosure. Accordingly, the embodiments disclosed in the present specification should be considered not from a restrictive standpoint but from an illustrative standpoint. The scope of the inventive concept is not limited to the above descriptions but is defined by the accompanying claims, and all distinctive features within the equivalent scope should be construed as being included in the inventive concept.
Number | Date | Country | Kind |
---|---|---|---|
10-2023-0071543 | Jun 2023 | KR | national |