The present disclosure relates to a systolic array device.
In a systolic array (SA) device, a plurality of processing elements (PEs) are adjacently arranged and connected with each other. Data such as an input feature and a partial sum are transferred between the adjacently arranged processing elements. At this time, the data may be transferred in a vertical direction or a horizontal direction. For example, the partial sum may be transferred in a vertical direction of SA (from the north to the south), and the input feature may be transferred in a horizontal direction (from the left to the right).
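The data movement described above can be sketched as a small cycle-by-cycle simulation. This is only an illustrative model, assuming a weight-stationary arrangement in which each processing element holds one weight; the function name `systolic_matvec` and the skewed input timing are assumptions made for the sketch and are not taken from the disclosure.

```python
def systolic_matvec(weights, x):
    """Simulate x^T * W on a grid of PEs (hypothetical, weight-stationary model).

    Input features move left to right, partial sums move north to south,
    and every PE updates its registers once per clock cycle.
    """
    rows, cols = len(weights), len(weights[0])
    x_reg = [[0] * cols for _ in range(rows)]  # input feature register of each PE
    p_reg = [[0] * cols for _ in range(rows)]  # partial sum register of each PE
    out = [0] * cols
    for t in range(rows + cols - 1):
        new_x = [[0] * cols for _ in range(rows)]
        new_p = [[0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                # Row r is fed x[r] at cycle r (skewed); otherwise take the left neighbor.
                x_in = x_reg[r][c - 1] if c > 0 else (x[r] if t == r else 0)
                # The top row starts from zero; otherwise take the upper neighbor.
                p_in = p_reg[r - 1][c] if r > 0 else 0
                new_x[r][c] = x_in
                new_p[r][c] = p_in + x_in * weights[r][c]
        x_reg, p_reg = new_x, new_p
        # Column c delivers its finished sum at cycle (rows - 1) + c.
        done = t - (rows - 1)
        if 0 <= done < cols:
            out[done] = p_reg[rows - 1][done]
    return out


print(systolic_matvec([[1, 2], [3, 4]], [5, 6]))
```

In this sketch each input feature is fetched once and then reused by every processing element in its row, which is the reuse property that the following paragraph relies on.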
In the SA, data such as the input feature is reused by being transferred between the adjacently arranged processing elements, so that a memory bottleneck may be avoided. Further, since the connections between the processing elements in the SA are short, it is possible to implement an ASIC or an FPGA having a relatively high clock frequency.
Meanwhile, each of the plurality of processing elements included in the SA may include a multiplier and an adder, and each of the multiplier and the adder may be embodied by a plurality of transistors. If a scan test performed after manufacturing the semiconductor determines that any one of the transistors has a fault, the entire chip die needs to be discarded. This is because the plurality of processing elements process the input feature and the output feature in lock-step with one another. Thus, there is a problem that the production yield is low. Accordingly, a way to solve this production yield problem is required.
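The effect of the lock-step constraint on yield can be illustrated with a short calculation. The sketch below assumes, purely for illustration, that each processing element is fault-free independently with the same probability `p_good`; neither this statistical model nor the function name `lockstep_yield` comes from the disclosure.

```python
def lockstep_yield(p_good: float, m: int, n: int) -> float:
    """Probability that all m * n processing elements are fault-free.

    Assumption (not from the source): faults are independent and each
    processing element is good with probability p_good.
    """
    return p_good ** (m * n)


# The larger the array, the lower the chance that every element is good.
for size in (4, 16, 64):
    print(size, lockstep_yield(0.999, size, size))
```

Under this assumed model the yield falls geometrically with the number of processing elements, which is why a single faulty element is so costly when no redundancy is provided.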
The problem to be solved in the present disclosure includes providing a way to utilize the SA without discarding the entire chip when the fault is found in the manufacturing process of the SA.
Meanwhile, the problem to be solved in the present disclosure is not limited to the description above. Other problems to be solved that are not described above may be clearly understood by a person skilled in the art from the description below.
In accordance with an embodiment of the present disclosure, there is provided a systolic array device including: a plurality of processing units arranged in a matrix form of M by N (M and N are natural numbers), wherein each of the processing units includes: a processing element configured to perform a predetermined processing based on data received from a processing unit arranged adjacent to one side of the corresponding processing unit to output a result thereof; and a transfer part configured to perform one of an operation of transferring the received data to another processing unit arranged adjacent to the other side of the corresponding processing unit and an operation of transferring the result.
According to an embodiment of the present disclosure, even if there is a fault in a part of the processing units constituting the systolic array device, the systolic array device can still be used through a configuration in which data is bypassed in each of the processing units and a preliminary processing unit is provided in advance. Thereby, it is possible to improve the production yield of the systolic array device.
The advantages and features of embodiments and methods of accomplishing these will be clearly understood from the following description taken in conjunction with the accompanying drawings. However, embodiments are not limited to those embodiments described, as embodiments may be implemented in various forms. It should be noted that the present embodiments are provided to make a full disclosure and also to allow those skilled in the art to know the full range of the embodiments. Therefore, the embodiments are to be defined only by the scope of the appended claims.
In describing the embodiments of the present disclosure, if it is determined that detailed description of related known components or functions unnecessarily obscures the gist of the present disclosure, the detailed description thereof will be omitted. Further, the terminologies to be described below are defined in consideration of functions of the embodiments of the present disclosure and may vary depending on a user's or an operator's intention or practice. Accordingly, the definition thereof may be made on a basis of the content throughout the specification.
The input feature cache unit 200 may be embodied by a cache for storing an input feature. Here, the input feature may indicate information representing features of, for example, a video or a voice.
The weight cache unit 300 may be embodied by a cache for storing a weight. Here, the weight is a type of kernel data, and indicates information by which the input feature is multiplied.
The processing element (PE) array 100 is configured to perform a processing by matching with a synchronizing signal (e.g., a clock signal). The processing element array 100 includes a plurality of processing units arranged in a matrix form of M by N (here, M and N are natural numbers). The processing unit itself will be described later.
Meanwhile, a part of the plurality of processing units are classified as main processing units and a remaining part are classified as preliminary processing units.
The main processing units may include at least two processing units, and the at least two processing units may be arranged in a matrix form of P by Q (here, P is a natural number equal to or smaller than M, Q is a natural number equal to or smaller than N, and the sum of P and Q is smaller than the sum of M and N).
The preliminary processing units indicate processing units having a form of at least one row or column added to the matrix form of P by Q formed by the mentioned main processing units. The preliminary processing units are provided to back up processing units having faults among the main processing units, which will be described later.
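The split between main and preliminary processing units can be sketched as follows. The sketch assumes, for illustration only, that the main processing units nominally occupy the top-left P by Q block of the M by N array; the function name `classify_units` and the 1-based (row, column) coordinates are choices made for this example.

```python
def classify_units(m, n, p, q):
    """Return (main, preliminary) sets of 1-based (row, column) coordinates.

    Assumption (not from the source): the main units nominally form the
    top-left p x q block, and every remaining unit is a preliminary unit.
    """
    if not (1 <= p <= m and 1 <= q <= n and p + q < m + n):
        raise ValueError("need P <= M, Q <= N and P + Q < M + N")
    all_units = {(r, c) for r in range(1, m + 1) for c in range(1, n + 1)}
    main = {(r, c) for r in range(1, p + 1) for c in range(1, q + 1)}
    return main, all_units - main


main, preliminary = classify_units(4, 4, 3, 3)
print(len(main), len(preliminary))
```

The condition `p + q < m + n` mirrors the requirement that at least one row or column of preliminary processing units is added to the P by Q matrix.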
The control unit 400 may control the systolic array device 1000 to operate in various ways, for example, to operate as a processing accelerator, using information on the main processing units and the preliminary processing units. The operation of the control unit 400 will be described later.
The integration unit 500 may integrate a plurality of partial sums when the plurality of partial sums are provided, and may generate the integrated result as the output feature.
The output feature cache unit 600 may receive the output feature from the integration unit 500 and store the output feature.
Meanwhile, the systolic array device 1000 according to the first embodiment of the present disclosure may operate according to its original purpose even if there are faults in a part of the plurality of processing units included in the processing element array 100. Thus, the manufacturing yield of the systolic array device 1000 including the processing element array 100 may be improved. Hereinafter, the specific configuration of the systolic array device 1000 that enables the above operation will be explained.
The processing element array 100 included in the systolic array device 1000 may include the plurality of processing units as described above, and each of the processing units may include the processing element 10 according to the first embodiment of the present disclosure illustrated in
More specifically, when data is received from a processing unit adjacently arranged on one side of itself, the processing element 10 may perform a predetermined processing based on the received data and output a result. Here, ‘itself’ indicates the processing unit including the corresponding processing element 10. Further, the ‘data’ may include the input feature, the weight, or the partial sum as mentioned above.
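As a rough illustration of the predetermined processing, the sketch below models a processing element as a multiply-accumulate step. That the processing is a multiply-accumulate is an assumption based on the multiplier and the adder mentioned earlier; the class name `ProcessingElement` and the method `step` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProcessingElement:
    """Hypothetical model of one processing element holding a single weight."""

    weight: int = 0

    def step(self, input_feature, partial_sum_in):
        # Multiplier followed by adder: accumulate onto the incoming partial sum.
        partial_sum_out = partial_sum_in + input_feature * self.weight
        # The input feature is passed on unchanged so a neighbor can reuse it.
        return input_feature, partial_sum_out


pe = ProcessingElement(weight=3)
print(pe.step(input_feature=2, partial_sum_in=10))
```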
More specifically, referring to
Meanwhile, the processing element 10 may include a withdrawal route for bypassing the information stored in each of the REG1 13, the REG2 14 and the REG3 15 to the processing unit adjacently arranged on the other side of itself. However, if the processing element 10 has a fault, in some cases the information stored in each of the REG1 13, the REG2 14 and the REG3 15 may not be bypassed to the outside. Here, having a ‘fault’ may mean that there is an abnormality in a component included in the processing element 10 or in the route.
Referring to
The explanation about the processing element 10 cites the part explained in
Then, each of the transfer parts 21, 22 is configured to transfer, to the processing unit 20 adjacently arranged on the other side of itself, either the data received from the processing unit 20 adjacently arranged on one side of itself or the result processed and output by the processing element 10. Here, the transfer part 21 may be referred to as the first bypass transfer part 21, the transfer part 22 may be referred to as the second bypass transfer part 22, and each of them may be embodied by a multiplexer.
Here, the case that the transfer parts 21, 22 transfer the data received from the processing unit 20 adjacently arranged on one side of itself is the case that there is a fault in the processing element 10. Meanwhile, the case that the transfer parts 21, 22 transfer the result that the processing element 10 processed and outputted is the case that there is no fault in the processing element 10.
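The selection performed by a transfer part can be sketched as a two-input multiplexer. The function name `transfer_part` and the boolean select signal `pe_has_fault` are hypothetical names used only for this illustration.

```python
def transfer_part(received_data, pe_result, pe_has_fault):
    """Two-input multiplexer model of a bypass transfer part (hypothetical).

    When the local processing element is faulty, the data received from the
    neighbor on one side is forwarded unchanged; otherwise the result of the
    local processing element is forwarded.
    """
    return received_data if pe_has_fault else pe_result


# A healthy unit forwards its own result; a faulty unit bypasses.
print(transfer_part(received_data=7, pe_result=42, pe_has_fault=False))
print(transfer_part(received_data=7, pe_result=42, pe_has_fault=True))
```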
The e-FLASH or External ROM 410 may provide the information about whether there is a fault in the processing element 10 to each of the transfer part 21, 22. Here, the information provided by the e-FLASH or External ROM 410 may be acquired through a previous scan test, which will be explained more specifically in
Meanwhile, the number of the transfer parts 21, 22 are illustrated as two in
In case of
If there is no fault in any main processing unit among the nine main processing units, the systolic array device may operate as the processing accelerator by only the nine main processing units. In this case, the preliminary processing unit may not operate as a component of the processing accelerator.
Unlike this, depending upon the case, there may be a fault in at least one main processing unit among the nine main processing units. For example, as illustrated in
In this case, the control unit 400 may identify the information on the two main processing units having the faults among the nine main processing units according to the ways illustrated in
Referring to
Further, referring to
Based on this, the data flow in the systolic array device 1000 illustrated in
After the systolic array device 1000 is manufactured, the bootloader may load the signal for controlling the transfer parts 21 and 22 from the e-FLASH or External ROM 410 included in the control unit 400, and may cause the bypass operation to be performed in the processing units in (2, 2) and (3, 3) having the faults. The bootloader may also cause the bypass operation to be performed in all the processing units in the second row and all the processing units in the third column. The weight cache unit 300 may provide weights to each of the processing units in (1, 1), (1, 2) and (1, 4). Further, the input feature cache unit 200 may provide input features to each of the processing units in (1, 1), (3, 1) and (4, 1).
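One way to arrive at such a bypass configuration is sketched below: search for a set of rows and a set of columns to bypass so that every faulty unit lies in a bypassed row or column and a P by Q array of working units remains. The exhaustive search and the function name `plan_bypass` are assumptions made for illustration; the disclosure does not state how the control signal is derived from the scan test result.

```python
from itertools import combinations


def plan_bypass(m, n, p, q, faults):
    """Pick (m - p) rows and (n - q) columns to bypass (hypothetical helper).

    Returns the first pair (rows, cols) such that every faulty unit lies in
    a bypassed row or a bypassed column, or None if no such pair exists.
    Coordinates are 1-based (row, column) pairs.
    """
    for rows in combinations(range(1, m + 1), m - p):
        for cols in combinations(range(1, n + 1), n - q):
            if all(r in rows or c in cols for r, c in faults):
                return set(rows), set(cols)
    return None


rows, cols = plan_bypass(4, 4, 3, 3, {(2, 2), (3, 3)})
active_rows = [r for r in range(1, 5) if r not in rows]
active_cols = [c for c in range(1, 5) if c not in cols]
print(rows, cols, active_rows, active_cols)
```

For the faults in (2, 2) and (3, 3), this search selects the second row and the third column, leaving rows 1, 3, 4 and columns 1, 2, 4 active, which agrees with the units that receive weights and input features in the paragraph above. Other plans, such as bypassing the third row and the second column, would also cover both faults.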
Then, the processing unit arranged in (1, 1) may perform the processing using the input feature and the weight, and then, may transfer the partial sum which is a result of the processing and the weight to the processing unit arranged in (2, 1) and transfer the input feature to the processing unit arranged in (1, 2).
Then, the processing unit arranged in (2, 1) may bypass the transferred information to the processing unit arranged in (3, 1). The processing unit arranged in (3, 1) may perform the processing in the same way as the processing unit arranged in (1, 1) and output the processing result.
Meanwhile, the processing unit arranged in (1, 2) may perform the processing in the same way that the processing unit arranged in (1, 1) performed based on the information transferred from the processing unit arranged in (1, 1), and then, may transfer the processing result to the processing unit arranged in (1, 3). Then, the processing unit arranged in (1, 3) may bypass the transferred processing result to the processing unit arranged in (1, 4).
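The flow just described can be traced with a short sketch that labels each unit along a path as computing or bypassing. It assumes the bypassed second row and third column of the example above; the function name `trace` is hypothetical.

```python
def trace(path, bypass_rows, bypass_cols):
    """Label each unit on a path as 'compute' or 'bypass' (hypothetical helper).

    A unit bypasses when it lies in a bypassed row or a bypassed column.
    """
    return [
        (unit, "bypass" if unit[0] in bypass_rows or unit[1] in bypass_cols else "compute")
        for unit in path
    ]


first_row = [(1, c) for c in range(1, 5)]     # horizontal path of the input feature
first_column = [(r, 1) for r in range(1, 5)]  # vertical path of the partial sum
print(trace(first_row, {2}, {3}))
print(trace(first_column, {2}, {3}))
```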
That is, in the first embodiment, even in the case that a part of the main processing units provided to constitute the systolic array device have faults, the systolic array device may be operated by partially utilizing the preliminary processing unit and the main processing unit having the fault. That is, the systolic array device need not be discarded even if one of the main processing units has a fault. Accordingly, the production yield of the systolic array device may be improved.
Meanwhile, each of the processing units is configured to receive the same synchronizing signal. Through the synchronizing signal, each processing performed in the processing unit is synchronized. For example, a plurality of processing units arranged in the same row (or column) may perform the processing at the same point in time, and transfer the processing result to the next row (or column) at the same timing. At this time, each of the input feature, the weight and the partial sum may be transferred to the adjacent processing unit periodically according to the synchronizing signal. Unlike this, according to an embodiment, the weight may be previously loaded at each processing unit regardless of the synchronizing signal.
Hereinafter, a systolic array device according to a second embodiment of the present disclosure will be explained. The systolic array device according to the second embodiment may have the same configuration with the systolic array device 1000 illustrated in
Referring to
Here, the processing element 10 may receive input of first information (e.g., an input feature) from a first direction (e.g., a column direction) and receive input of second information (e.g., a weight or a partial sum) from a second direction (e.g., a row direction). Then, the processing element 10 may perform the processing based on the input information and output a result.
Then, the transfer parts 31 to 33 may be configured to transfer, to the processing unit adjacently arranged on the other side of itself, either the data received from the processing unit adjacently arranged on one side of itself or the result output by the processing element 10, or may be configured not to transfer anything. Here, the transfer part 31 may be referred to as a bypass transfer part 31, the transfer parts 32, 33 may be referred to as direction control transfer parts 32 and 33, and each of the transfer parts 31 to 33 may be embodied by a multiplexer.
Here, ‘the processing unit adjacently arranged on one side of itself’ may refer to the processing unit adjacently arranged to the same row (or column) with itself or the processing unit adjacently arranged in the diagonal direction to itself. Further, ‘the processing unit adjacently arranged on the other side of itself’ may refer to the processing unit adjacently arranged to the same row (or column) with itself or the processing unit adjacently arranged in the diagonal direction to itself.
The description of the e-FLASH or External ROM 410 given in the first embodiment applies here as well.
Meanwhile, the number of the transfer part 31 to 33 are illustrated as three in
In case of
If there is no fault in any main processing unit among the nine main processing units, the systolic array device may operate as the processing accelerator by only the nine main processing units. In this case, the preliminary processing unit may not operate as a component of the processing accelerator.
Unlike this, depending upon the case, there may be a fault in at least one main processing unit among the nine main processing units. For example, as illustrated in
In this case, the control unit 400 may identify the information on the main processing units having the faults among the nine main processing units. Based on the identified information, the e-FLASH or External ROM 410 included in the control unit 400 may control the transfer parts 31 to 33 included in each of the processing units and cause the following operations to be performed.
<An Example of the Control Performed by the Bypass Transfer Part 31 in the Processing Unit Having the Fault>
In the row direction (or column direction), the control is performed such that the data transferred from the processing unit adjacently arranged on one side is bypassed to the processing unit adjacently arranged on the other side.
<An Example of the Control Performed by the Bypass Transfer Part 31 in the Processing Unit Having No Fault>
Based on this, referring to
After the systolic array device 1000 is manufactured, when booting, the bootloader may load the signal for controlling the transfer parts 31 to 33 from the e-FLASH or External ROM 410 included in the control unit 400.
By the signal, in each of the processing units (1, 1), (2, 3) and (3, 2), data transferred from the processing unit adjacently arranged on one side in the row direction is transferred to the processing unit adjacently arranged on the other side, and no data or result is transferred in the column direction.
Further, by the signal, in each of at least a part of the processing units having no fault, the result output by the unit's own processing element is transferred to the processing unit adjacently arranged on the other side in the row direction. In the column direction, data is received from either the processing unit arranged adjacent to one side of the corresponding processing unit in the same column or a processing unit arranged diagonally adjacent to the corresponding processing unit, and the result output by the unit's own processing element is transferred to either the processing unit arranged adjacent to the other side of the corresponding processing unit in the same column or a processing unit arranged diagonally adjacent to the corresponding processing unit.
For example, referring to
With this, the input feature that the input feature cache unit 200 transferred to the processing unit (1, 1) is transferred to the processing unit (1, 2) by being bypassed immediately, and then is transferred to the processing unit (1, 4) through the processing unit (1, 3). Further, the input feature that the input feature cache unit 200 transferred to the processing unit (2, 1) is transferred to the processing unit (2, 2), and then is transferred to the processing unit (2, 4) by being bypassed in the processing unit (2, 3). Furthermore, the input feature that the input feature cache unit 200 transferred to the processing unit (3, 1) is transferred to the processing unit (3, 3) by being bypassed in the processing unit (3, 2), and then is transferred to the processing unit (3, 4).
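The per-row bypass of the second embodiment can be sketched as a mapping from each row to the columns that still compute, together with the column-direction links between consecutive rows. Pairing the k-th computing unit of one row with the k-th computing unit of the next row is an assumption made for this sketch, and the function names `active_columns` and `vertical_links` are hypothetical.

```python
def active_columns(n_rows, n_cols, q, faults):
    """For each row, list the first q fault-free columns (hypothetical helper)."""
    plan = {}
    for r in range(1, n_rows + 1):
        good = [c for c in range(1, n_cols + 1) if (r, c) not in faults]
        if len(good) < q:
            raise ValueError(f"row {r} has too many faults")
        plan[r] = good[:q]
    return plan


def vertical_links(plan):
    """Link the k-th computing unit of each row to the k-th one of the next row."""
    links = []
    rows = sorted(plan)
    for upper, lower in zip(rows, rows[1:]):
        for col_up, col_low in zip(plan[upper], plan[lower]):
            if abs(col_low - col_up) > 1:
                raise ValueError("link is neither same-column nor diagonal")
            kind = "straight" if col_low == col_up else "diagonal"
            links.append(((upper, col_up), (lower, col_low), kind))
    return links


plan = active_columns(3, 4, 3, {(1, 1), (2, 3), (3, 2)})
print(plan)
for link in vertical_links(plan):
    print(link)
```

With the faults in (1, 1), (2, 3) and (3, 2), the computing units in each row are the ones that the input feature reaches without being bypassed in the paragraph above, and every column-direction link in this sketch connects units in the same column or in diagonally adjacent positions.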
That is, in the second embodiment, even in the case that there are faults in a part of the main processing units provided to constitute the systolic array device, the systolic array device may be operated by partially utilizing the preliminary processing unit and the main processing units having the faults. That is, even if there is a fault in any one of the main processing units, the systolic array device need not be discarded. Thus, it is possible to improve the production yield of the systolic array device.
Meanwhile, each processing unit is configured to receive the same synchronizing signal. By the synchronizing signal, the processes performed in the processing units are synchronized with each other. For example, a plurality of the processing units arranged in the same row (or column) may perform the processing at the same timing, and may transfer the result of performing the processing to the next row (or column) at the same timing. At this time, each of the input feature and the partial sum may be transferred to the adjacent processing unit periodically according to the synchronizing signal. Unlike this, according to an embodiment of the present disclosure, the weight may be previously loaded at each processing unit regardless of the synchronizing signal.
Hereinafter, a systolic array device according to a third embodiment will be explained. The systolic array device according to the third embodiment may have the same configuration with the systolic array device 1000 illustrated in
Referring to
Here, the processing element 10 may receive the input of first information (e.g., an input feature) from a first direction (e.g., a column direction), receive the input of second information (e.g., a weight or a partial sum) from a second direction (e.g., a row direction), and then, output a result after performing the process based on the input information.
Then, the transfer parts 41 to 43 may be configured to transfer, to the processing unit adjacently arranged on the other side of itself, either the data received from the processing unit adjacently arranged on one side of itself or the result output by the processing element 10, or may be configured not to transfer anything. Here, the transfer part 41 may be referred to as the bypass transfer part 41, the transfer parts 42, 43 may be referred to as direction control transfer parts 42 and 43, and each of the transfer parts 41 to 43 may be embodied by a multiplexer.
Here, ‘the processing unit adjacently arranged on one side of itself’ may refer to the processing unit adjacently arranged to the same row (or column) with itself or the processing unit adjacently arranged in the diagonal direction to itself. Further, ‘the processing unit adjacently arranged on the other side of itself’ may refer to the processing unit adjacently arranged to the same row (or column) with itself or the processing unit adjacently arranged in the diagonal direction of itself.
The description of the e-FLASH or External ROM 410 given in the first embodiment applies here as well.
Meanwhile, the number of the transfer part 41 to 43 are illustrated as three in
In
If there is no fault in any main processing unit among the nine main processing units, the systolic array device may operate as the processing accelerator by only the nine main processing units. In this case, the preliminary processing unit may not operate as a component of the processing accelerator.
Unlike this, depending upon the case, there may be a fault in at least one main processing unit among the nine main processing units. For example, as illustrated in
In this case, the control unit 400 may identify the information on the main processing units having the faults among the nine main processing units. Based on the identified information, the e-FLASH or External ROM 410 included in the control unit 400 may control the transfer parts 41 to 43 included in each of the processing units and cause the following operations to be performed.
<An Example of the Control Performed by the Bypass Transfer Part 41 in the Processing Unit Having the Fault>
In the row direction (or column direction), the control is performed such that the data transferred from the processing unit adjacently arranged on one side is bypassed to the processing unit adjacently arranged on the other side.
<An Example of the Control Performed by the Bypass Transfer Part 41 in the Processing Unit Having No Fault>
Based on this, referring to
After the systolic array device 1000 is manufactured, when booting, the bootloader may load the signal for controlling the transfer parts 41 to 43 from the e-FLASH or External ROM 410 included in the control unit 400.
By the signal, in each of the processing units (4, 1), (1, 2) and (3, 3), data transferred from the processing unit arranged adjacent to one side of the corresponding processing unit in the column direction is transferred to the processing unit arranged adjacent to the other side of the corresponding processing unit in the column direction, and no data or result is received or delivered in the row direction.
Further, by the signal, in each of at least a part of the processing units having no fault, the result output by the unit's own processing element is transferred to the processing unit adjacently arranged on the other side in the column direction. In the row direction, data is received from either the processing unit arranged adjacent to one side of the corresponding processing unit in the same row or a processing unit arranged diagonally adjacent to the corresponding processing unit, and the result output by the unit's own processing element is transferred to either the processing unit arranged adjacent to the other side of the corresponding processing unit in the same row or a processing unit arranged diagonally adjacent to the corresponding processing unit.
For example, referring to
With this, the input feature that the input feature cache unit 200 transferred to the processing unit (1, 1) is transferred to the processing unit (1, 2), and then is transferred to the processing unit (1, 3). Further, the input feature that the input feature cache unit 200 transferred to the processing unit (2, 1) is transferred to the processing unit (3, 2), and then is transferred to the processing unit (2, 3). Further, the input feature that the input feature cache unit 200 transferred to the processing unit (3, 1) is transferred to the processing unit (4, 2), and then is transferred to the processing unit (4, 3).
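The three input feature paths listed above can be checked against the rule that each hop in the row direction goes either to the unit in the same row or to a diagonally adjacent unit of the next column. The helper name `hop_kind` is hypothetical, and the paths are copied from the paragraph above.

```python
def hop_kind(src, dst):
    """Classify one row-direction hop between 1-based (row, column) units."""
    (r0, c0), (r1, c1) = src, dst
    if c1 != c0 + 1 or abs(r1 - r0) > 1:
        raise ValueError(f"hop {src} -> {dst} is neither same-row nor diagonal")
    return "same row" if r1 == r0 else "diagonal"


# Paths of the input features as listed in the description above.
paths = [
    [(1, 1), (1, 2), (1, 3)],
    [(2, 1), (3, 2), (2, 3)],
    [(3, 1), (4, 2), (4, 3)],
]
kinds = [[hop_kind(a, b) for a, b in zip(path, path[1:])] for path in paths]
print(kinds)
```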
That is, in the third embodiment, even in the case that there are faults in a part of the main processing units provided to constitute the systolic array device, the systolic array device may be operated by partially utilizing the preliminary processing unit and the main processing units having the faults. That is, even if there is a fault in any one of the main processing units, the systolic array device need not be discarded. Thus, it is possible to improve the production yield of the systolic array device.
Meanwhile, each processing unit is configured to receive the same synchronizing signal. By the synchronizing signal, the processes performed in the processing units are synchronized with each other. For example, a plurality of the processing units arranged in the same row (or column) may perform the processing at the same timing, and may transfer the result of performing the processing to the next row (or column) at the same timing. At this time, each of the input feature and the partial sum may be transferred to the adjacent processing unit periodically according to the synchronizing signal. Unlike this, according to an embodiment of the present disclosure, the weight may be previously loaded at each processing unit regardless of the synchronizing signal.
Meanwhile,
Then, as illustrated in
Embodiments of the present disclosure are advantageous, among other reasons, because the production yield of the systolic array device may be improved.
Number | Date | Country | Kind |
---|---|---|---|
10-2019-0087044 | Jul 2019 | KR | national |
Number | Date | Country | |
---|---|---|---|
20220129410 A1 | Apr 2022 | US |
Number | Date | Country | |
---|---|---|---|
Parent | PCT/KR2020/009532 | Jul 2020 | WO |
Child | 17569081 | US |