The present invention relates to an apparatus for parallel processing capable of error detection and error correction. This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government(Ministry of Science and ICT) (Project unique No.: 1711193550; Project No.: 2021-0-00863-003; R&D project: Development of new concept PIM semiconductor technology; Research Project Title: Development of an intelligent in-memory error correction device for high-reliability memory; and Project period: 2023.01.01.˜2023.12.31.), and Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government(Ministry of Science and ICT) (Project unique No.: 1711193592; Project No.: 2021-0-02052-003; R&D project: Information and Communication Broadcasting Innovation Talent Training; Research Project Title: Artificial intelligence system for smart mobility Development of core semiconductor technologies and training of personnel; and Project period: 2023.01.01.˜2023.12.31.), and Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government(Ministry of Science and ICT) (Project unique No.: 1711195788; Project No.: 00228970; R&D project: Next-generation intelligent semiconductor technology development (design); Research Project Title: Development of Flexible SW/HW Conjunctive Solution for on-edge self-supervised learning; and Project period: 2023.04.01.˜2023.12.31.).
An apparatus for parallel processing, including multiple cores, divides a single task into multiple threads and processes parallel calculations for the multiple threads using the multiple cores. In cases where most of the cores perform multiplication calculations, there is a problem that multiple cores need to collectively perform calculations even when at least one of the operands is zero, resulting in zero as the output of the multiplication calculations. Korean Patent Application Publication No. 10-2020-0096102 discloses the background of the present invention.
The object of the present invention is to provide a technology for detecting and correcting errors in calculations in parallel processing.
In addition, the object of the present invention is to provide a technology for detecting and correcting errors in calculations in parallel processing, including sparse matrix calculations such as training of an artificial neural network model.
In addition, the object of the present invention is to provide a technology for detecting and correcting errors in calculations in the parallel processing of a GPU.
In addition, the object of the present invention is to provide a technology for detecting the occurrence of faults in multiple cores for parallel processing and isolating the cores where faults have occurred.
In accordance with an aspect of the present disclosure, there is provided a method of operating an apparatus for parallel processing including multiple cores, comprising: searching for a zero core among the multiple cores performing multiplication calculations to process a first instruction, wherein the zero core is a core for which at least one operand of the multiplication calculations to be performed is zero; performing multiplication calculations in each of the multiple cores, wherein at least one zero core performs the multiplication calculations for the same operands as those of a non-zero core, wherein the non-zero core is a core whose operands are not zero; and comparing a calculation result of the at least one zero core with a calculation result of the non-zero core to determine whether a calculation error has occurred.
In determining whether the calculation error has occurred, when the calculation result of the at least one zero core does not match the calculation result of the non-zero core, it may be determined that an error has occurred in the multiplication calculations performed by the non-zero core or the at least one zero core.
The method may further comprise: re-performing, by at least three cores including the non-zero core and the zero core among the multiple cores, multiplication calculations for operands calculated by the non-zero core when it is determined that an error has occurred in the multiplication calculations performed by the non-zero core or the zero core; and determining whether there is a fault in the non-zero core based on multiplication calculation results of the at least three cores.
In the determining whether there is a fault in the non-zero core or the zero core, when the multiplication calculation result of the non-zero core or the zero core among the at least three cores does not match calculation results of the other cores, it may be determined that there is a fault in the non-zero core or the zero core.
The method may further comprise: setting, when it is determined that there is a fault in the non-zero core or the zero core, in processing a second instruction, to be processed after the first instruction has been processed, another zero core to perform multiplication calculations to be performed in the non-zero core or zero core determined to have a fault.
In the re-performing of the multiplication calculations, when it is determined that there is an error in the multiplication calculations performed by the non-zero core or the zero core, processing of a second instruction, to be processed after the first instruction has been processed, may not be started, and the multiplication calculations may be re-performed.
The method may further comprise: setting, when it is determined that an error has occurred in the multiplication calculations performed by the non-zero core or the zero core, and calculation results of two or more cores match each other, the multiplication calculation results of the two or more cores as the multiplication calculation result of the non-zero core or the zero core.
The first instruction may include multiplication calculations for a sparse matrix.
In the searching for the zero core, the zero core may be searched by performing logical multiplication calculations on operands to be processed by each of the multiple cores in the first instruction to identify whether at least one operand is zero.
In accordance with another aspect of the present disclosure, there is provided an apparatus for parallel processing, comprising: multiple cores that perform parallel processing; a zero-core search unit configured to search for a zero core among the multiple cores performing multiplication calculations to process a first instruction, wherein the zero core is a core for which at least one operand of the multiplication calculations to be performed is zero; a calculation controller configured to perform multiplication calculations in each of the multiple cores and control at least one zero core to perform the multiplication calculations for the same operands as those of a non-zero core, wherein the non-zero core is a core whose operands are not zero; and an error determination unit configured to determine whether a calculation error has occurred by comparing a calculation result of the at least one zero core with a calculation result of the non-zero core.
The error determination unit may be further configured to determine that there is an error in the multiplication calculations performed by the non-zero core or the zero core when the multiplication calculation result of the at least one zero core does not match the multiplication calculation result of the at least one non-zero core.
The calculation controller may be further configured to control at least three cores, including the non-zero core and the zero core among the multiple cores, to re-perform multiplication calculations for operands calculated by the non-zero core when it is determined that there is an error in the multiplication calculations performed by the non-zero core or the zero core, and wherein the error determination unit is further configured to determine whether there is a fault in the non-zero core or the zero core based on multiplication calculation results of the at least three cores.
The error determination unit may be further configured to determine that there is a fault in the non-zero core or the zero core when the multiplication calculation result of the non-zero core or the zero core among the at least three cores does not match calculation results of the other cores.
The error determination unit may be further configured to set, when it is determined that there is a fault in the non-zero core or the zero core, in processing a second instruction, to be processed after the first instruction has been processed, another zero core to perform multiplication calculations to be performed in the non-zero core or zero core determined to have a fault.
The calculation controller may be further configured to control to re-perform the multiplication calculations without starting processing of a second instruction, to be processed after the first instruction has been processed, when it is determined that there is an error in the multiplication calculations performed by the non-zero core or the zero core.
The calculation controller, when it is determined that there is an error in the multiplication calculations performed by the non-zero core or the zero core, and calculation results of two or more zero cores match each other, may be further configured to set the multiplication calculation results of the two or more zero cores as the multiplication calculation result of the non-zero core.
The first instruction may include multiplication calculations for a sparse matrix.
The zero-core search unit may be further configured to search for the zero core by performing logical multiplication calculations for operands to be processed by each of the multiple cores in the first instruction to identify whether at least one operand is zero.
In accordance with another aspect of the present disclosure, there is provided a method of operating an apparatus for parallel processing including multiple cores, comprising: receiving an instruction that includes N threads as input; allocating the N threads to each of multiple cores; searching for at least one zero thread among the N threads, where at least one of operands of multiplication calculations is zero; reallocating at least one zero core that has been allocated at least one zero thread to process the same non-zero thread as a non-zero core matched to a non-zero thread that is not the zero thread; and determining whether an error has occurred in the multiplication calculations of the non-zero core based on a multiplication calculation result of at least one zero core according to the reallocation and a multiplication calculation result of the non-zero core.
According to one aspect of the present invention, it is possible to detect and correct calculation errors in parallel processing.
In addition, according to another aspect of the present invention, it is possible to detect and correct calculation errors in parallel processing, including sparse matrix calculations such as training of an artificial neural network model.
In addition, according to yet another aspect of the present invention, it is possible to detect and correct calculation errors in the parallel processing of a GPU.
In addition, according to yet another aspect of the present invention, it is possible to detect the occurrence of faults in multiple cores for parallel processing and to isolate the cores where faults have occurred.
Advantages and features of the present invention and methods of achieving the advantages and features will be clear with reference to embodiments described in detail below together with the accompanying drawings. However, embodiments are not limited to those embodiments described, as embodiments may be implemented in various forms. It should be noted that the present embodiments are provided to make a full disclosure and also to allow those skilled in the art to know the full range of the embodiments. Therefore, the embodiments are to be defined only by the scope of the appended claims.
In describing embodiments of the present invention, detailed descriptions of well-known functions or configurations will be omitted unless they are essential for describing the embodiments of the present invention. In addition, the terms used in the exemplary embodiments of the present invention are defined considering the functions in the present invention and may vary depending on the intention or usual practice of a user or an operator. Therefore, the definition of the present disclosure should be made based on the entire contents of the present specification.
The term “unit”, “part”, “module”, or the like, which is described in the specification hereinafter, refers to a unit that performs at least one function or operation, and the “unit”, “part”, “module” or the like may be implemented by hardware, software, or a combination of hardware and software.
With reference to
The multiple cores 1100 are each an independent processing unit and may perform parallel calculations to process a single instruction.
In an embodiment, the multiple cores 1100 may refer to multiple cores 1100 included in a GPU or the like.
In an embodiment, the multiple cores 1100 may process an instruction that has N threads as a unit (e.g., warp), where each of the multiple cores 1100 processes calculations for a single thread, thereby performing parallel calculations for the instruction.
In an embodiment, the multiple cores 1100 may process instructions using the single instruction multiple thread (SIMT) method.
In an embodiment, the instruction may include multiplication calculations for a sparse matrix. Specifically, the instruction may include calculations for a feature matrix that has passed through the rectified linear unit (ReLU) activation function of an artificial neural network model such as a deep neural network (DNN) or a convolutional neural network (CNN), that is, a matrix containing many zero elements. An artificial neural network model, such as a DNN model or a CNN model, uses an activation function to learn complex nonlinear representations, and the most widely used activation function is ReLU. The ReLU activation function replaces all negative values in the activation feature matrix with zero. Due to this characteristic, the activation feature matrix that has passed through the ReLU function becomes a sparse matrix with many zero values. In addition, the pruning technique used for model compression sets weights of low importance to zero. Consequently, due to these two factors, the ReLU activation function and pruning, deep neural networks mainly perform sparse matrix multiplication calculations. Therefore, the apparatus for parallel processing according to the present invention may also be applied to the training and use of artificial neural network models that include calculations for sparse matrices. In addition, multiply-accumulate (MAC) calculations, which are mainly used in matrix multiplication calculations, are defined as D=A*B+C. In sparse matrix multiplication calculations, many of the operands A and B required for the MAC calculations are zero, which means that, regardless of whether the MAC calculation is performed, the final result remains unchanged (D=0+C=C). Therefore, the sparsity that occurs in the DNN may create unnecessary calculations, reducing calculation efficiency.
Accordingly, the present invention may make it possible to use these unnecessary calculations for error correction.
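As a small illustration of the sparsity described above, the following Python sketch applies ReLU to a feature vector, combines it with a weight vector containing one pruned (zero) weight, and counts how many MAC calculations have a zero product. The function names relu, mac, and count_redundant_macs and the sample values are assumptions made for this example and do not appear in the disclosure.

```python
def relu(values):
    """Replace every negative value with zero, as the ReLU function does."""
    return [v if v > 0 else 0 for v in values]


def mac(a, b, c):
    """Multiply-accumulate: D = A * B + C."""
    return a * b + c


def count_redundant_macs(activations, weights):
    """Count MACs whose product is zero because at least one operand is zero."""
    return sum(1 for a, w in zip(activations, weights) if a == 0 or w == 0)


# Illustrative data only: a feature vector before ReLU and a weight vector
# in which the weight at index 1 has been pruned to zero.
features = relu([-3, 2, -1, 0, 5, -7, 4, -2])
weights = [1, 0, 3, 2, 6, 1, 1, 9]

# With a zero operand the accumulator passes through unchanged: D = 0 + C = C.
unchanged = mac(0, 7, 11)

redundant = count_redundant_macs(features, weights)
```

In this sample, most of the eight MAC calculations have a zero product, and these are the calculations that the apparatus may reuse for error detection.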
The zero-core search unit 1200 may search for at least one core (hereinafter, “zero core”), among the multiple cores 1100, that performs multiplication calculations, where at least one of the operands in the multiplication calculations is zero, in order to parallel process the instruction.
In an embodiment, the zero-core search unit 1200 may perform logical multiplication calculations between operands allocated to each of the multiple cores 1100 to search for at least one zero core.
In an embodiment, the zero-core search unit 1200 may search among the N (a natural number of 2 or greater) threads allocated to each of the multiple cores 1100 for a thread (hereinafter, “zero thread”) where at least one of the operands of the multiplication calculations is zero. Depending on the search result, the core allocated the zero thread may be determined as a zero core, and the core allocated a thread that is not a zero thread (non-zero thread) may be determined as a non-zero core.
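The zero-thread search may be sketched as follows. This is a minimal Python illustration, assuming that the logical AND is taken over per-operand "non-zero" flags, so that the AND is false exactly when at least one operand is zero. The names is_zero_thread and classify_cores are hypothetical.

```python
def is_zero_thread(a, b):
    """Return True when at least one multiplication operand is zero.

    Assumption: the logical AND is applied to the flags (a != 0) and
    (b != 0); the thread is a zero thread when that AND is False.
    """
    both_nonzero = (a != 0) and (b != 0)
    return not both_nonzero


def classify_cores(operands):
    """Split core indices into zero cores and non-zero cores.

    operands[i] is the (A, B) operand pair of the thread allocated to core i.
    """
    zero_cores, nonzero_cores = [], []
    for core_id, (a, b) in enumerate(operands):
        if is_zero_thread(a, b):
            zero_cores.append(core_id)
        else:
            nonzero_cores.append(core_id)
    return zero_cores, nonzero_cores


# Four threads allocated to cores 0..3 (illustrative values).
operands = [(0, 4), (3, 5), (2, 0), (7, 1)]
zero_cores, nonzero_cores = classify_cores(operands)
```

Here cores 0 and 2 would be treated as zero cores and cores 1 and 3 as non-zero cores.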
The calculation controller 1300 may control the overall operation of the apparatus for parallel processing 1000 to process instructions.
In an embodiment, the calculation controller 1300 may control the operation of the multiple cores 1100 to perform parallel processing of instructions in the multiple cores 1100.
Specifically, the calculation controller 1300 may divide an instruction into N threads corresponding to the number of cores and control the multiple cores 1100 so that the N threads are processed in parallel, one in each of the multiple cores 1100.
In an embodiment, the calculation controller 1300 may control at least one zero core to perform multiplication calculations for the same operands as those of a core whose operand is not zero (hereinafter, “non-zero core”).
In an embodiment, the calculation controller 1300 may control at least one zero core to perform multiplication calculations for the same operands as those of the multiplication calculations performed by a non-zero core when processing a first instruction. Specifically, in parallel processing of the first instruction, when the operands processed by a first core (zero core) are A1 and B1, at least one of which is zero, while the operands processed by a second core are A2 and B2, both of which are not zero, since the result of the multiplication calculations by the first core is zero, the first core may be controlled to perform multiplication calculations for the same operands A2 and B2 as those of the second core, in order to use the calculation operation of the first core to detect whether an error occurred in the calculations of the second core.
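The reallocation of operands from a non-zero core to a zero core may be sketched as follows. The disclosure does not specify how zero cores are matched to non-zero cores, so the round-robin policy below is an assumption, as are the names pair_zero_cores and effective_operands.

```python
def pair_zero_cores(zero_cores, nonzero_cores):
    """Map each zero core to the non-zero core whose operands it duplicates.

    Assumption: zero cores are matched to non-zero cores in round-robin order.
    """
    if not nonzero_cores:
        return {}
    return {
        zero_core: nonzero_cores[i % len(nonzero_cores)]
        for i, zero_core in enumerate(zero_cores)
    }


def effective_operands(operands, pairing):
    """Return the operand pair each core actually multiplies after reallocation."""
    return [
        operands[pairing[core]] if core in pairing else operands[core]
        for core in range(len(operands))
    ]


# Core 0 holds (A1, B1) with a zero operand; core 1 holds (A2, B2), both non-zero.
operands = [(0, 4), (3, 5), (2, 0), (7, 1)]
pairing = pair_zero_cores(zero_cores=[0, 2], nonzero_cores=[1, 3])
reallocated = effective_operands(operands, pairing)
```

After reallocation, core 0 multiplies the same operands as core 1, and core 2 multiplies the same operands as core 3, so each non-zero product is computed twice.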
In an embodiment, when the calculation controller 1300 determines that an error has occurred in the calculations, the calculation controller 1300 may control such that the calculations for the operands in which the calculation error occurred are re-performed in each of three or more cores, including the non-zero core and the zero core where the calculation error occurred.
In an embodiment, when the calculation controller 1300 determines that an error has occurred in the calculations, the calculation controller 1300 may stop the input of the instruction to be processed after the first instruction has been processed (hereinafter, “second instruction”) and control each of three or more cores, including the non-zero core and the zero core in which the calculations have been performed, to re-perform the calculations for the operands to be processed by the non-zero core.
In an embodiment, the calculation controller 1300 may allocate threads to be processed by each of the multiple cores 1100. In addition, the calculation controller 1300 may allocate a non-zero thread allocated to a non-zero core to at least one zero core, even when a zero thread has already been allocated to the zero core.
The error determination unit 1400 may determine whether an error has occurred in the calculations performed by the core based on the calculation result of the non-zero core and the calculation result of at least one zero core.
In an embodiment, the error determination unit 1400 may determine that an error has occurred in the calculation result for the operand performed by the non-zero core or the zero core when the calculation result performed by the non-zero core does not match the calculation result performed by the zero core.
In an embodiment, the error determination unit 1400 may determine that an error has occurred in the operand calculations performed by the non-zero core when the calculation result of the non-zero core does not match the calculation result of at least one zero core.
In an embodiment, the error determination unit 1400 may determine that there is no error in the calculations of the non-zero core and the zero core when the calculation result of the non-zero core matches the calculation result of the zero core. In this case, the error determination unit 1400 may output the calculation result of the zero core as zero, which is the result of the multiplication calculations for the operands that were originally supposed to be processed.
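The comparison step may be sketched as follows, assuming the pairing is held as a mapping from each zero core to its matched non-zero core. The name finalize_outputs and the returned data layout are assumptions for illustration.

```python
def finalize_outputs(results, pairing):
    """Compare duplicated products and restore zero-core outputs.

    results[i] is the product computed by core i after reallocation.
    pairing maps each zero core to the non-zero core it duplicated.
    Returns (outputs, mismatches), where mismatches lists the
    (zero_core, nonzero_core) pairs whose products disagree.
    """
    outputs = list(results)
    mismatches = []
    for zero_core, nonzero_core in pairing.items():
        if results[zero_core] == results[nonzero_core]:
            # No error detected: the zero core outputs zero, the product of
            # the operands it was originally supposed to process.
            outputs[zero_core] = 0
        else:
            # Left unresolved here; such pairs would go on to re-execution.
            mismatches.append((zero_core, nonzero_core))
    return outputs, mismatches


# Cores 0 and 1 agree on 3 * 5; cores 2 and 3 disagree on 7 * 1.
outputs, mismatches = finalize_outputs([15, 15, 7, 9], {0: 1, 2: 3})
```

In this sample, core 0 outputs zero as originally intended, while the pair of cores 2 and 3 is reported as a detected error.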
In an embodiment, when the calculation results of two or more cores match each other and the calculation result of one core differs from them, the error determination unit 1400 may correct the calculation error by replacing the differing core's calculation result with the calculation result shared by the two or more cores.
In an embodiment, the error determination unit 1400 may determine whether one core has faulted based on the calculation results of three or more cores that calculated the same operand.
In an embodiment, when, even after re-performing the multiplication calculations for the operands with errors, the calculation result of one core among the three cores does not match the calculation result of the other two or more cores, the error determination unit 1400 may determine that a fault, i.e., a mechanical defect, has occurred in the one core whose result does not match. This is because, when calculation errors have repeatedly occurred, it is highly likely that a permanent fault has occurred in the core.
In an embodiment, when, after re-performing the multiplication calculations for the operands with errors, the calculation result of the non-zero core matches the calculation results of two or more non-zero cores, the error determination unit 1400 may use the currently performed calculation result as the calculation result of the non-zero core. Then, the error determination unit 1400 may receive the input of the second instruction to be processed after the first instruction.
In addition, the error determination unit 1400 may perform error detection and error correction using a voting method. For example, the error determination unit 1400 may determine the calculation result for the operand as the calculation result derived most frequently when three or more cores perform calculations for the same operand but different calculation results have been derived.
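The voting method may be sketched as follows. The disclosure does not state how a tie is handled, so returning no result on a tie is an assumption of this sketch, as is the name majority_vote.

```python
from collections import Counter


def majority_vote(results):
    """Pick the most frequently derived result among cores.

    results[i] is the product reported by the i-th participating core.
    Returns (winner, dissenters), where dissenters are the positions whose
    result differs from the winner. On a tie for the most frequent result,
    returns (None, []) (assumed behavior; not specified in the disclosure).
    """
    ranked = Counter(results).most_common()
    winner, top_votes = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_votes:
        return None, []
    dissenters = [i for i, r in enumerate(results) if r != winner]
    return winner, dissenters


# Three cores computed 3 * 5; the third core returned a wrong product.
winner, dissenters = majority_vote([15, 15, 13])
```

The winning value may then replace the dissenting core's result, which corresponds to the correction by replacement described above.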
Hereinafter, the method will be described in terms of example as being performed by the apparatus for parallel processing 1000 illustrated in
In step S2100, the apparatus for parallel processing 1000 may receive an instruction (hereinafter, “first instruction”) to perform parallel processing using the multiple cores 1100.
In an embodiment, the instruction may include multiplication calculations for a sparse matrix. Specifically, the instruction may include calculations for a feature matrix that has passed through the activation function ReLU (rectified linear unit) of an artificial neural network model such as a deep neural network (DNN) or a convolutional neural network (CNN), that is, a matrix containing many zero elements.
In step S2200, the apparatus for parallel processing 1000 may search for at least one core (hereinafter, “zero core”), among the multiple cores 1100, that performs multiplication calculations, where at least one of the operands in the multiplication calculations is zero, in order to parallel process the first instruction.
In an embodiment, the apparatus for parallel processing 1000 may search for at least one zero core by performing logical multiplication calculations (logical AND calculations) between the operands to be allocated to each of the multiple cores 1100.
In an embodiment, the apparatus for parallel processing 1000 may determine whether to perform error detection and correction operations based on the number of zero cores and the number of cores that are not zero cores (hereinafter, “non-zero cores”) according to the zero core search result. Since the present invention assumes processing for a sparse matrix in which zeros are more frequent than non-zero values, the following describes the case in which error detection and correction operations are performed in processing an instruction.
In step S2300, the apparatus for parallel processing 1000 may perform multiplication calculations for the same operands as those of the multiplication calculations to be performed by one non-zero core in at least one zero core in processing the first instruction. Specifically, in parallel processing of the first instruction, when the operands processed by a first core (zero core) are A1 and B1, at least one of which is zero, while the operands processed by a second core are A2 and B2, both of which are not zero, since the result of the multiplication calculations by the first core is zero, the first core may perform multiplication calculations for the same operands A2 and B2 as those of the second core, in order to use the calculation operation of the first core to detect whether an error occurred in the calculations of the second core.
In step S2400, the apparatus for parallel processing 1000 may determine whether an error has occurred in the calculations performed by the non-zero core or the zero core, based on the calculation result of the non-zero core and the calculation result of at least one zero core.
In an embodiment, the apparatus for parallel processing 1000 may determine that an error has occurred in the calculation result for the operand performed by the non-zero core or the zero core when the calculation result performed by the non-zero core does not match the calculation result performed by the zero core.
In an embodiment, the apparatus for parallel processing 1000 may determine that an error has occurred in the calculations performed by the non-zero core or the zero core when the calculation result of the non-zero core does not match the calculation result of at least one zero core.
In an embodiment, the apparatus for parallel processing 1000 may determine that there is no error in the calculations of the non-zero core and the zero core when the calculation result of the non-zero core matches the calculation result of the zero core. In this case, the apparatus for parallel processing 1000 may output the calculation result of the zero core as zero, which is the result of the multiplication calculations for the operands that were originally supposed to be processed.
In an embodiment, when the calculation results of two or more cores match each other and the calculation result of one core differs from them, the apparatus for parallel processing 1000 may correct the calculation error by replacing the differing core's calculation result with the calculation result shared by the two or more cores.
In step S2500, when the apparatus for parallel processing 1000 determines that an error has occurred in the calculations, the apparatus for parallel processing 1000 may re-perform the calculations for the operands in which the calculation error occurred in each of three or more cores, including the non-zero core and the zero core where the calculation error occurred.
In an embodiment, when the apparatus for parallel processing 1000 determines that an error has occurred in the calculations of the core, the apparatus for parallel processing 1000 may stop the input of the instruction to be processed after the first instruction (hereinafter, “second instruction”) and re-perform the calculations for the operands to be processed by the non-zero core in each of three or more cores, including the non-zero core and the zero core that performed the calculations.
In step S2600, the apparatus for parallel processing 1000 may determine whether a core has faulted based on the calculation results of three or more cores that performed calculations for the same operand.
In an embodiment, when, even after re-performing the multiplication calculations for the operands with errors, the calculation result of one core among the three cores does not match the calculation result of the other two or more cores, the apparatus for parallel processing 1000 may determine that a fault, i.e., a mechanical defect, has occurred in the one core whose result does not match. This is because, when calculation errors have repeatedly occurred, it is highly likely that a permanent fault has occurred in the core.
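Steps S2500 and S2600 may be sketched together with a small simulation. The stuck-bit fault model, the helper names make_core and retry_and_diagnose, and the handling of the case with no strict majority are all assumptions made for illustration and are not taken from the disclosure.

```python
from collections import Counter


def make_core(stuck_bit=None):
    """Return a multiply function for one simulated core.

    stuck_bit is a hypothetical fault model: that bit of every product is
    forced to 1, standing in for a permanent hardware defect.
    """
    def multiply(a, b):
        product = a * b
        if stuck_bit is not None:
            product |= 1 << stuck_bit
        return product
    return multiply


def retry_and_diagnose(cores, core_ids, a, b):
    """Re-perform a * b on three or more cores and look for a dissenting core.

    Returns (accepted_product, faulty_core_ids). accepted_product is None
    when no strict majority exists (assumed handling of the undecided case).
    """
    if len(core_ids) < 3:
        raise ValueError("re-execution needs at least three cores")
    results = {cid: cores[cid](a, b) for cid in core_ids}
    value, votes = Counter(results.values()).most_common(1)[0]
    if votes * 2 <= len(results):
        return None, []
    faulty = sorted(cid for cid, r in results.items() if r != value)
    return value, faulty


cores = [make_core(), make_core(stuck_bit=0), make_core()]
# Core 1 (non-zero core) and core 0 (zero core) disagreed on 3 * 4, so the
# multiplication is repeated on cores 1, 0 and 2.
product, faulty = retry_and_diagnose(cores, [1, 0, 2], 3, 4)
```

If all re-performed results agree, the earlier mismatch is treated as a one-time calculation error and no core is reported as faulty. If one core still disagrees, it is reported as the core in which a fault has occurred.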
In an embodiment, the apparatus for parallel processing 1000 may perform error detection and error correction using a voting method. For example, the error determination unit 1400 may determine the calculation result for the operand as the calculation result derived most frequently when three or more cores perform calculations for the same operand but different calculation results have been derived.
In an embodiment, when, after re-performing the multiplication calculations for the operands with errors, the calculation result of the non-zero core matches the calculation results of two or more non-zero cores, the apparatus for parallel processing 1000 may use the currently performed calculation result as the calculation result of the non-zero core. Then, the apparatus for parallel processing 1000 may receive the input of the second instruction to be processed after the first instruction.
In step S2700, when the apparatus for parallel processing 1000 determines that a fault has occurred in the non-zero core, the apparatus for parallel processing 1000 may modify core pairing information used for distributing the operands to be processed by each of the multiple cores 1100 for parallel processing of the instruction.
In an embodiment, the apparatus for parallel processing 1000 may modify the core pairing information to have the operands that need to be processed by the core determined to have a fault calculated by another zero core in processing subsequently input instructions.
In step S2800, the apparatus for parallel processing 1000 may receive the input of the second instruction and perform parallel processing of the second instruction through the multiple cores 1100 based on the modified core pairing information. That is, in processing the second instruction, the apparatus for parallel processing 1000 may have another zero core perform the calculations for the operands that need to be processed by the core determined to have a fault.
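The modification of the core pairing information in steps S2700 and S2800 may be sketched as follows. The disclosure does not define the format of the core pairing information, so the thread-to-core mapping, the name build_assignment, and the handling of the cases noted in the comments are assumptions.

```python
def build_assignment(operands, faulty_cores):
    """Return {thread_index: executing_core} for the next instruction.

    operands[i] is the (A, B) pair of the thread originally allocated to
    core i. Threads of faulty cores are moved to healthy zero cores.
    """
    healthy_zero = [
        core for core, (a, b) in enumerate(operands)
        if (a == 0 or b == 0) and core not in faulty_cores
    ]
    assignment = {thread: thread for thread in range(len(operands))}
    for core in sorted(faulty_cores):
        a, b = operands[core]
        if a == 0 or b == 0:
            # Assumption: a zero thread needs no multiplication at all,
            # because its product is known to be zero.
            assignment[core] = None
        elif healthy_zero:
            assignment[core] = healthy_zero.pop(0)
        else:
            # Assumption: the sketch simply reports that no zero core is free.
            raise RuntimeError("no healthy zero core available")
    return assignment


# Second instruction: core 1 was determined to have a fault earlier.
operands = [(2, 3), (4, 5), (0, 9), (6, 0)]
assignment = build_assignment(operands, faulty_cores={1})
```

In this sample, the thread of faulty core 1 is executed by zero core 2, whose own thread has a zero product and therefore leaves the core free for the reassigned calculation.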
As illustrated in
The device described above may be implemented using hardware components, software components, and/or a combination of hardware and software components. For example, the device and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, controller, arithmetic logic unit (ALU), digital signal processor (DSP), microcomputer, field-programmable array (FPA), programmable logic unit (PLU), microprocessor, or any other device capable of executing and responding to instructions. The processing device may execute one or more software applications operating on an operating system (OS).
Additionally, the processing device may access, store, manipulate, process, and generate data in response to the execution of software. For convenience, a single processing device may be described, but it will be understood by those skilled in the art that the processing device may include multiple processing elements and/or multiple types of processing elements. For example, the processing device may include multiple processors or a combination of a processor and a controller. Other processing configurations, such as a parallel processor, may be also possible.
Software may include a computer program, code, instructions, or any combination thereof, which can configure the processing device to operate as desired or command the processing device independently or collectively. Software and/or data may be permanently or temporarily embodied in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave to be interpreted by or provide instructions or data to the processing device. Software may also be distributed across networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.
Combinations of steps in each flowchart attached to the present disclosure may be executed by computer program instructions. Since the computer program instructions can be mounted on a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, the instructions executed by the processor of the computer or other programmable data processing equipment create a means for performing the functions described in each step of the flowchart. The computer program instructions can also be stored on a computer-usable or computer-readable storage medium that can direct a computer or other programmable data processing equipment to implement a function in a specific manner. Accordingly, the instructions stored on the computer-usable or computer-readable recording medium can also produce an article of manufacture containing an instruction means which performs the functions described in each step of the flowchart. The computer program instructions can also be mounted on a computer or other programmable data processing equipment. Accordingly, a series of operational steps is performed on the computer or other programmable data processing equipment to create a computer-executable process, and the instructions that operate the computer or other programmable data processing equipment can also provide steps for performing the functions described in each step of the flowchart.
In addition, each step may represent a module, a segment, or a portion of codes which contains one or more executable instructions for executing the specified logical function(s). It should also be noted that in some alternative embodiments, the functions mentioned in the steps may occur out of order. For example, two steps illustrated in succession may in fact be performed substantially simultaneously, or the steps may sometimes be performed in a reverse order depending on the corresponding function.
The above description is merely exemplary description of the technical scope of the present disclosure, and it will be understood by those skilled in the art that various changes and modifications can be made without departing from original characteristics of the present disclosure. Therefore, the embodiments disclosed in the present disclosure are intended to explain, not to limit, the technical scope of the present disclosure, and the technical scope of the present disclosure is not limited by the embodiments. The protection scope of the present disclosure should be interpreted based on the following claims and it should be appreciated that all technical scopes included within a range equivalent thereto are included in the protection scope of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
10-2023-0154867 | Nov 2023 | KR | national |