The present invention relates to the field of parallel computation systems. More particularly, the invention relates to a method and system for efficiently performing, by a plurality of processors, fault tolerant numerical linear algebra computations, such as classic and fast matrix multiplication.
Computer systems that employ a plurality of processors suffer from errors (faults), which may be hard errors or soft errors. Such errors introduce serious problems in high performance computing. Given the increase in machine size and decrease in operating voltage, hard errors (component failures) and soft errors (bit flips) are likely to become more frequent. Hardware trends predict from two errors per minute up to two per second on exascale machines (computing systems capable of at least one exaFLOPS, i.e., 10^18 calculations per second; see references [2], [24], [11] below).
General-purpose hard-error resiliency solutions such as checkpoint-restart [22] and diskless checkpoints [22] successfully meet this challenge but are costly and severely degrade performance. For numerical linear algebra, more efficient solutions incur lower overhead by combining error correcting codes with matrix computations. However, current solutions require a significant increase in the required number of processors. These solutions are also based on distributed 2D algorithms (such as Cannon's), which store a single copy of the (two or more) input matrices and use the minimum possible memory, in contrast to algorithms that allow using extra memory, and they can guarantee good performance only when the matrices are large enough to fill all the local memories. Otherwise, their inter-processor communication costs are asymptotically larger than the lower bounds, which degrades performance.
There are several existing solutions that consider the problem of communication costs of (non-resilient) matrix multiplication.
Communication Costs of (Non-Resilient) Matrix Multiplication
Cannon [8], Van De Geijn and Watts [25], and Fox et al. [13] proposed matrix multiplication algorithms that minimize communication when the memory is the minimum needed to store the input and output, namely M=Θ(n2/P),
where Θ is the upper and lower asymptotic bound. The communication costs of these algorithms (also known as 2D algorithms) are
where O is the upper asymptotic bound.
Agarwal et al. [1] put forward a 3D algorithm that uses less communication when additional memory is available: For
they obtained
McColl and Tiskin [18], and Solomonik and Demmel [23] generalized these algorithms to cover the entire relevant range of memory size, namely
The communication costs of their 2.5D matrix multiplication algorithm (the 2.5D algorithm has the 2D and 3D algorithms as special cases, and effectively interpolates between them) are
where c is the memory redundancy factor, namely c=Θ(P·M/n2).
For
the communication costs are bounded below by the memory independent lower bound Ω
[4], therefore increasing c beyond
cannot help reduce communication costs. McColl and Tiskin [18], Ballard et al. [5], and Demmel et al. [12] used an alternative parallelization technique for recursive algorithms, such as classical and fast matrix multiplication. The communication costs of the 2.5D algorithm and of the Breadth-First Search (BFS)-Depth-First Search (DFS) scheme (i.e., performing BFS and DFS steps, where in a BFS step all sub-problems are computed at once, each on an equal fraction of the processors, and in a DFS step all sub-problems are computed in sequence, each using all the processors), applied to classical matrix multiplication, are both optimal (up to an O(log P) factor), since they attain both the lower bound in Irony, Toledo and Tiskin [17] in the range
and the lower bound in Ballard et al. [4] for larger M values.
Checkpoint-Restart
One general approach to handling faults is checkpoint-restart; all the data and states of the processors are periodically saved to a disk, as shown in
Plank, Li, and Puening [22] suggested using a local memory for checkpoints instead of disks. This solution does not require additional hardware, and the writing and reading of checkpoints are faster. Still, the periodic write operations, as well as the restart operations significantly slow down the algorithms. Furthermore, this solution takes up some of the available memory from the algorithm. For many algorithms, matrix multiplication included, less memory implies a significant increase in communication cost, hence a slowdown.
Algorithm-Based Fault Tolerance
Huang and Abraham [16] suggested algorithm-based fault tolerance for classic matrix multiplication. The main idea was to add a row to A which is the sum of its rows, and a column to B which is the sum of its columns. The product of the two resulting matrices is the matrix A·B with an additional row containing the sum of its rows and an additional column containing the sum of its columns. They addressed soft errors, and showed that by using the sum of rows and columns it is possible to locate and fix one faulty element of C. Huang and Abraham [16] used a technique that allows recovery from a single fault throughout the entire execution. Gunnels et al. [14] presented fault tolerant matrix multiplication that can detect errors in the input, and distinguish between soft errors and round-off errors. Chen and Dongarra showed that using the technique of [16] combined with matrix multiplication as in Cannon [8] and Fox [13] does not allow for fault recovery in the middle of the run, but only at the end. This severely restricts the number of faults an algorithm can withstand. Chen and Dongarra [10, 11] adapted the approach described by Huang and Abraham for hard error resiliency, using the outer-product [26] multiplication as a building block. Their algorithm keeps the partially computed matrix C encoded correctly in the inner steps of the algorithm, not only at the end. By so doing, they were able to recover from faults occurring in the middle of a run without recomputing all the lost data. In [11] they analyzed the overhead when at most one fault occurs at any given time. In [10] they suggested an elegant multiple-fault recovery generalization of this algorithm. For this purpose they introduced a new class of useful erasure correcting codes. Their algorithm requires 2h·√{square root over (P)} additional processors to be able to deal with up to h simultaneous faults. Further, they analyzed its numerical stability. Wu et al. [27] used the outer product for soft error resiliency.
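A minimal numpy sketch of the Huang-Abraham check-sum idea described above (matrix sizes are illustrative only): appending the sum of the rows of A as an extra row and the sum of the columns of B as an extra column yields a fully check-summed product, so a single corrupted entry of C can be located and fixed.

import numpy as np

n = 4
A = np.random.rand(n, n)
B = np.random.rand(n, n)

A_c = np.vstack([A, A.sum(axis=0)])                  # extra row: the sum of the rows of A
B_c = np.hstack([B, B.sum(axis=1, keepdims=True)])   # extra column: the sum of the columns of B

C_c = A_c @ B_c                                      # fully check-summed product
C = A @ B

assert np.allclose(C_c[:n, :n], C)                   # leading block is C itself
assert np.allclose(C_c[n, :n], C.sum(axis=0))        # last row: sum of the rows of C
assert np.allclose(C_c[:n, n], C.sum(axis=1))        # last column: sum of the columns of C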
Hakkarinen and Chen [15] presented a fault tolerant algorithm for 2D Cholesky factorization. Bouteiller et al. [7] expanded this approach and obtained hard error fault tolerant 2D algorithms for matrix factorization computations. Moldaschl et al. [19] extended the Huang and Abraham scheme to the case of soft errors with memory redundancy. They considered bit flips in arbitrary segments of the mantissa and the exponent, and showed how to tolerate such errors with small overhead.
Fast matrix multiplication algorithms reduce both computational complexity and communication costs compared to classical algorithms, and are often significantly faster in practice. However, existing fault tolerance techniques are not adapted to such fast algorithms.
It is an object of the invention to provide methods for efficiently performing fault tolerant numerical linear algebra computations, such as classic and fast matrix multiplication, containing soft and hard errors, by a plurality of processors.
It is another object of the invention to provide a method for efficiently performing by a plurality of processors, fault tolerant numerical linear algebra computations, such as classic and fast matrix multiplication containing soft and hard errors, while reducing the required number of additional processors.
It is a further object of the invention to provide a method for efficiently performing by a plurality of processors, fault tolerant numerical linear algebra computations, such as classic and fast matrix multiplication containing soft and hard errors, while utilizing additional memory for saving communication between processors.
It is yet another object of the invention to provide a method for efficiently performing by a plurality of processors, fault tolerant numerical linear algebra computations, such as classic and fast matrix multiplication containing soft and hard errors, with efficient resource utilization and high performance for error-resilient algorithms that are close to the efficiency and performance of non-resilient algorithms.
It is still another object of the invention to provide a method for efficiently performing by a plurality of processors, fault tolerant fast matrix multiplication containing hard errors, with small cost overheads, for Strassen's and other fast matrix multiplication algorithms, while using a tradeoff between the number of additional processors and the communication costs.
It is yet another object of the invention to provide a method for efficiently performing by a plurality of processors, fault tolerant fast matrix multiplication containing soft and hard errors, which uses coding that allows reducing the number of required additional processors.
It is yet another object of the invention to provide a method for efficiently performing by a plurality of processors, fault tolerant fast matrix multiplication containing soft and hard errors, which reduces latency and communication required between processors and within memory elements.
It is a further object of the invention to provide a method for efficiently performing by a plurality of processors, a plurality of reduce operations, with substantially low latency.
Other objects and advantages of this invention will become apparent as the description proceeds.
A computer implemented method for performing a fault tolerant numerical linear algebra computation task consisting of calculation steps that include at least one classic matrix multiplication, comprising the following steps:
A computer system for performing a fault tolerant numerical linear algebra computation task consisting of calculation steps that include at least one classic matrix multiplication, comprising:
A computer implemented method for performing a fault tolerant numerical linear algebra computation task consisting of calculation steps that include at least one fast matrix multiplication, comprising:
A computer system for performing a fault tolerant fast matrix multiplication task consisting of steps, comprising:
A method for performing pipelined-reduce operation, comprising:
Classic matrix multiplication may be a part of one or more of the following:
A user interface for operating a controller, consisting of at least one processor with an appropriate software module, for controlling the parallel execution of a fault tolerant numerical linear algebra computation consisting of calculation steps that include at least one classic matrix multiplication task by a plurality of P processors for performing the multiplication, each of which having a corresponding memory, between two or more input matrices A and B, the controller is adapted to:
A user interface for operating a controller, consisting of at least one processor with an appropriate software module, for controlling the parallel execution of a fault tolerant numerical linear algebra computation consisting of calculation steps that include at least one fast matrix multiplication task by a plurality of P processors for performing the multiplication, each of which having a corresponding memory, between two or more input matrices A and B, the controller is adapted to:
Whenever the matrix multiplication algorithm is 2.5D, the method may further comprise:
Whenever the matrix multiplication algorithm is 2.5D, the controller is further adapted to:
In one aspect, the error correction codes are preserved during multiplication steps.
The controller may split the task among the plurality of P processors by assigning a different input block from each input matrix to a processor.
Classic matrix multiplication may be a part of one or more of the following:
The use of a second reduce or pipelined-reduce operation is repeated √{square root over (P)} times when the amount of memory is minimal.
In one aspect, at the initial step there is no input matrix, and the input matrices are created by a process for which the computation task is performed. Alternatively, there is only a first input matrix and the other input matrix is derived from the first input matrix.
In one aspect, the down-recursion DFS steps may be interleaved with the down-recursion BFS steps, and/or the up-recursion DFS steps may be interleaved with the up-recursion BFS steps.
In the drawings:
Numerical linear algebra extensively uses matrix multiplication as a basic building block, with many practical applications, such as machine learning and image processing. For example, machine learning algorithms include a training stage, which is then used to classify new and unknown events. In this example, the training data is held in a matrix of weights, which should be multiplied by a vector which may represent a portion of an image or an area of pixels (for example, for determining whether or not an image contains a specific object which should be identified). In order to increase efficiency, all vectors to be multiplied are arranged in another matrix, such that a single multiplication provides the results for all vectors. Other implementations of matrix multiplication may be in the field of audio analysis, voice recognition, speech recognition, etc., as well as in the field of graphics.
The present invention provides a method and system for efficiently performing fault tolerant classic and fast matrix multiplication by a plurality of processors. The proposed fault tolerant parallel matrix multiplication algorithms reduce the resource overhead by minimizing both the number of additional processors and the communication costs between processors. The number of additional processors has been reduced from Θ(√{square root over (P)}) to 1 (or from Θ(h·√{square root over (P)}) to h, where h is the maximum number of simultaneous faults). The latency costs have been reduced by a factor of Θ(log P).
The new fault tolerant algorithms for matrix multiplication reduce the number of additional processors and guarantee good inter-processor communication costs. For the 2D algorithm, the number of additional processors has been decreased from Θ(√{square root over (P)}) to 1 (or from Θ(h·√{square root over (P)}) to h, where h is the maximum number of simultaneous faults). Also, a log P factor of the latency cost has been saved.
These algorithms attain the bandwidth lower bound when f=O(√{square root over (P)}), where f is the total number of faults during runtime. When local memories are larger than the minimum needed to store the inputs and outputs, the communication costs have been further reduced using the 2.5D technique, with no (or very few) additional processors, thereby attaining the bandwidth lower bound for f=O(√{square root over (P/c)}).
The proposed computation model is a distributed machine with P processors, each having a local memory of size M words. The processors communicate via message passing. It is assumed that the cost of sending a message is proportional to its length and does not depend on the identity of the sender or of the receiver, as in [17] and, in the context of fault tolerance, [10, 11]. This assumption can be alleviated with predictable impact on communication cost, cf. [3]. The number of arithmetic operations is denoted by F. The bandwidth cost of the algorithm is given by the word count and is denoted by BW. The latency cost is given by the message count and is denoted by L. The number of words, messages, and arithmetic operations is counted along the critical path as defined in [28]. The total runtime is modeled by γ·F+β·BW+α·L, where α, β, γ are machine-dependent parameters.
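For illustration only, this cost model can be expressed as a one-line function; the machine parameters used below are arbitrary placeholders rather than measured values.

def modeled_runtime(F, BW, L, gamma=1e-11, beta=1e-9, alpha=1e-6):
    # modeled total runtime: gamma*flops + beta*words + alpha*messages
    return gamma * F + beta * BW + alpha * L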
h denotes the maximum number of faults that can occur simultaneously, i.e., the maximum number of faults in one step or iteration of the algorithm. f denotes the total number of faults throughout the execution. When comparing an algorithm to a fault tolerant adaptation, (P, M, F, BW, L) denotes the resources used by the original algorithm and (P′, M′, F′, BW′, L′) denotes the resources used by the fault tolerant adaptation.
(P′, M′, F′, BW′, L′) is expressed as a function of the former, of h, f, and the input size n. When a fault occurs, the faulty processor loses all its data, and the machine allocates a new processor to replace the faulty one. For simplicity, it is assumed that no faults occur during recovery phases (faults during the recovery phase of any of the algorithms may introduce at most a constant factor overhead to the recovery phase, and thus do not affect the analysis).
A distributed system (machine) also comprises a controller with an appropriate user interface, which splits the computation task among the P processors. The controller may equally split the task among the processors, while defining for each processor which fraction of the task to execute. According to a first operation mode, after splitting the task among the P processors, the processors are able to continue and complete the task without intervention of the controller. According to a second mode, all operations during execution are controlled by the controller, which also collects all fractions of the result from all processors and generates the final result.
The present invention introduces a new coding technique as well as ways to apply it to both 2D and 2.5D matrix multiplication algorithms. By doing so, fault tolerant algorithms for matrix multiplication are obtained. In the 2D case, only h additional processors (the minimum possible) are used, and even fewer processors are used for the 2.5D algorithm. The run-time overhead is low, and the algorithms can utilize additional memory for minimizing communication. These algorithms can also handle multiple simultaneous faults.
Pipeline Reduce Operations
Broadcast and reduce operations are used extensively in the algorithms proposed by the present invention. Sanders and Sibeyn [21] showed an efficient algorithm for performing broadcast and reduce.
Lemma 1 (bandwidth) Let P be the number of processors, and W the data size of each processor. It is possible to compute a weighted sum of the data of the P processors using: (F, BW, L)=(O(W), O(W), O(log P))
The present invention proposes an efficient way to perform l reduce operations in a row. The naïve implementation uses the algorithm above l times and requires (F, BW, L)=(O(l·W), O(l·W), O(l·log P)). The present invention proposes pipelining the reduce operations to save latency.
Lemma 2 (Efficient multiple weighted sums) Let P+l be the number of processors, and W the data size on P of them. It is possible to compute l weighted sums of the data of the P processors on the l other processors with resources: (F, BW, L)=(O(l·W), O(l·W), O(log P+l))
Proof The algorithm for one weighted sum has two phases. For ease of presentation, P is assumed to be an integer power of 2 (the generalization is straightforward). The first phase computes the weighted sum, but the result remains distributed among the processors. The second phase gathers the data to the destination processor.
The reduce function works as follows:
The first step is to divide the processors into two sets. Each set performs half of the task. The division involves communicating half of the data. Each set recursively calls the reduce function. The base case is when each set contains only one processor. Then, each processor holds a 1/P
fraction of the results.
In the next step, the data is gathered to the additional processor. The reduction phase costs
and L=log2P. The gathering costs
Thus the total cost of the single weighted sum algorithm is: (F, BW, L)=(O(W), O(W), O(log P)).
This algorithm can be efficiently pipelined since the message sizes decrease exponentially. Let the names of the processors be binary strings of length log2 P. In the first phase, the communication is between pairs of processors that agree on all the digits aside from the first digit. These pairs communicate the first weighted sum. In the second phase, the communication is between processors that agree on all digits aside from the second, and they send the second step of the first reduce, the first step of the second reduce, and so on. Each weighted sum takes at most O(log P) steps and then the data is sent to one of the l new processors. Therefore, at any time, at most O(log P) weighted sums are being computed. The memory required for all the reduce operations that can occur in parallel is at most
and the memory required for all the gathering is at most
Therefore, the memory footprint of this algorithm is at most 4W, i.e., it requires M≥4W. In summary, performing l reduce operations in a row with this algorithm uses local memories of size 4W and costs: (F, BW, L)=(O(l·W), O(l·W), O(log P+l))
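The following is a minimal serial simulation (in Python, with illustrative sizes, assuming P is a power of two and W is divisible by P) of the single weighted-sum building block of Lemma 1: a recursive-halving reduce followed by a gather to the destination processor. The pipelined scheme of Lemma 2 interleaves l such reductions so that the latency grows to O(log P+l) instead of O(l·log P).

import numpy as np

P, W = 8, 16                                   # processors and words per processor (illustrative)
data = [np.random.rand(W) for _ in range(P)]
weights = np.random.rand(P)                    # coefficients of the weighted sum

# Phase 1: recursive-halving reduce. Partners differ in one bit of their binary
# name and message sizes halve every round. Afterwards, processor p owns the
# slice [lo[p], hi[p]) of the weighted sum, i.e., a 1/P fraction of the result.
partial = [weights[p] * data[p] for p in range(P)]
lo, hi = [0] * P, [W] * P
step = P // 2
while step >= 1:
    new_partial, new_lo, new_hi = [None] * P, lo[:], hi[:]
    for p in range(P):
        q = p ^ step                           # partner of p in this round
        half = (hi[p] - lo[p]) // 2
        if p < q:                              # keep the first half of the current slice
            new_partial[p] = partial[p][:half] + partial[q][:half]
            new_hi[p] = lo[p] + half
        else:                                  # keep the second half
            new_partial[p] = partial[p][half:] + partial[q][half:]
            new_lo[p] = lo[p] + half
    partial, lo, hi = new_partial, new_lo, new_hi
    step //= 2

# Phase 2: gather the P slices on the destination (code) processor.
result = np.empty(W)
for p in range(P):
    result[lo[p]:hi[p]] = partial[p]

assert np.allclose(result, sum(w * d for w, d in zip(weights, data)))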
The present invention uses a linear erasure code for recovering faults. Definition 1: An (n, k, d)-code is a linear transformation T: ℝ^k→ℝ^n with distance d, where distance d means that for every x≠y∈ℝ^k, T(x) and T(y) have at least d coordinates with different values. The generator matrix of T is an n×k matrix G such that T(x)=G·x.
The erasure codes used preserve the original word and add redundant letters. A word x of length k is coded to a word y of length n using n−k additional letters, such that y_{k+i}=Σ_{j=1}^{k} E_{i,j}·x_j for some (n−k)×k matrix E. Therefore, the code generating matrix G is of the form of the k×k identity matrix I stacked above E.
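A small numpy sketch of this systematic form (the sizes and the particular choice of E below are arbitrary, for illustration only):

import numpy as np

k, extra = 6, 2                           # illustrative sizes; the code length is n = k + extra
E = np.vander(np.arange(1.0, extra + 1), k, increasing=True)   # some (n-k) x k matrix E
G = np.vstack([np.eye(k), E])             # generating matrix: the identity stacked above E

x = np.random.rand(k)                     # original word
y = G @ x                                 # codeword: x itself followed by the check letters
assert np.allclose(y[:k], x)              # the original word is preserved
assert np.allclose(y[k:], E @ x)          # y_{k+i} = sum_j E[i, j] * x[j]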
Minimum Memory, Single Fault
Table 1 shows results for fault tolerant algorithms for 2D algorithms, namely
with at most one simultaneous fault, where n is the matrix dimension, P is the number of processors, and f is the total number of faults occurring throughout the run of the algorithm.
Previous Algorithms for M=Θ(n2/P)
Chen and Dongarra [10, 11] used the Huang and Abraham scheme [16] to tolerate hard errors. Specifically, they added one row of processors that stores the sum of the rows of A, and similarly for C, and one column of processors that stores the sum of the columns of B, and similarly for C. These rows and columns are called the check-sum. A matrix that has both is called a fully check-sum matrix.
Chen and Dongarra showed that this approach, applied to 2D algorithms (e.g., Cannon [8] and Fox [13]), allows for the recovery of C at the end of the matrix multiplication. However, these 2D algorithms do not preserve the check-sum during the inner steps of the algorithm. To deal with a higher fault rate, faults must be recovered during the run of the algorithm; for this reason, Chen and Dongarra used the outer product as the building block of their algorithm. Thus their algorithm can recover faults throughout the run, at the end of each outer product iteration. The lost data of A and C of a faulty processor can be recovered at the end of every outer product step from the processors in the same column and the processor in the check-sum row. Similarly, the data of B and C can be recovered using the check-sum column.
Consider a 2D communication optimal matrix multiplication algorithm with resources (P, F, BW, L). Let (P′, F′, BW′, L′) be the resources required for the fault tolerant 2D matrix multiplication algorithm of Chen and Dongarra that can withstand a single fault at any given time. Let n be the matrix dimension and let f be the total number of faults. Then:
Proof: For completeness, a proof is provided, based on that of Chen and Dongarra [10, 11]. The algorithm uses an additional row and column of processors, thus P′=P+2·√{square root over (P)}+1. Next, the costs of the code creation (CC), the matrix multiplication (MM), and the recovery (Re) phases are analyzed: F′=FCC+FMM+FRe, similarly BW′=BWCC+BWMM+BWRe, and L′=LCC+LMM+LRe.
Code Creation and Recovery
The code creation and the recovery are reduce operations. They use the fractional tree [21] to this end, thus
Matrix Multiplication
Matrix multiplication is computed using the outer product in each iteration. The outer product broadcasts one row and one column of blocks at the beginning of each iteration. To do so, they used a (simple) broadcast instead of a “fractional tree” [21] thus
Total Costs
Summing up Equations 1, 2, and 3:
Since matrices A and B are not modified by the algorithm, lost input data can be easily handled using an erasure code of length P+1. The main challenge involves recovering C. The present invention proposes two new algorithms for recovering C: the first algorithm uses the outer product and encoding of the blocks of C with additional processors. This is similar to the approach by Chen and Dongarra, except for using a new coding scheme that decreases the additional processor count from Θ(√{square root over (P)}) to one. This first algorithm is called the slice-coded algorithm.
The second algorithm recovers the lost data of C by recomputing at the end of the run. This second algorithm is called the posterior-recovery algorithm.
Slice-Coded Algorithm
Theorem 2 (Slice-coded) Consider a 2D communication cost optimal matrix multiplication algorithm with resources (P, F, BW, L). Then there exists a fault tolerant 2D matrix multiplication algorithm that can withstand a single fault at any given time, where n is the matrix dimension and f is the total number of faults, with resources:
Further to the approach presented in [10, 11], the present invention uses the outer product matrix multiplication as the basis for the algorithm. However, while [10, 11] used 2·√{square root over (P)}+1 additional processors for the coded data, only one is used here. The additional processor contains the sum of the other processors. This processor acts similarly to the corner processor in Chen and Dongarra's algorithm (corresponding to the red processor in
where s is an inner index for the summation.
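The identity behind this accumulation can be checked directly: because matrix multiplication is bilinear, the product of the sum of the current block column of A with the sum of the current block row of B equals the sum of all per-processor block products of that iteration. The small numpy sketch below (with illustrative grid and block sizes) verifies that a single code processor accumulating these products holds the sum of all blocks of C at the end of every iteration.

import numpy as np

g, b = 3, 4                                   # sqrt(P) x sqrt(P) grid of b x b blocks (illustrative)
A = [[np.random.rand(b, b) for _ in range(g)] for _ in range(g)]
B = [[np.random.rand(b, b) for _ in range(g)] for _ in range(g)]
C = [[np.zeros((b, b)) for _ in range(g)] for _ in range(g)]
code_C = np.zeros((b, b))                     # block held by the single code processor

for s in range(g):                            # outer-product iterations
    col_sum = sum(A[i][s] for i in range(g))  # reduced sum of block column s of A
    row_sum = sum(B[s][j] for j in range(g))  # reduced sum of block row s of B
    code_C += col_sum @ row_sum               # the code processor's update
    for i in range(g):
        for j in range(g):
            C[i][j] += A[i][s] @ B[s][j]      # the original processors' updates
    # the code is preserved after every iteration, so any one lost block of C
    # can be recovered as code_C minus the sum of the surviving blocks
    assert np.allclose(code_C, sum(C[i][j] for i in range(g) for j in range(g)))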
The slice-coded algorithm allocates one additional processor for the code, thus P′=P+1. The algorithm is composed of three steps. In the first step, code creation (CC), the algorithm creates codes for A and B and stores them in the additional processor. The second step is the matrix multiplication (MM). Upon a fault, a recovery (Re) step is performed. Therefore, F′ is composed of three components, namely F′=FCC+FMM+FRe. Similarly, BW′=BWCC+BWMM+BWRe, and L′=LCC+LMM+LRe.
Code Creation
In this step, the slice-coded algorithm computes the sum of the blocks of A and of B, and stores them in the additional processor, using a reduce operation. By Lemma 1 this takes:
Matrix Multiplication
The matrix multiplication phase is performed as in an outer-product algorithm, but with a small change: every processor computes its share of the code. In the sth iteration (of √{square root over (P)} iterations), the processors compute the outer product A(:,s)·B(s,:). The processors of the current block column of A and the processors of the current block row of B broadcast them. The processors compute the sum of the current block column of A; specifically, each column of processors computes 1/√{square root over (P)} of this sum. Similarly, the processors compute the sum of the current block row of B. The processors send these two sums to the additional processor. Then each processor multiplies the two blocks. By Theorem 2 the broadcasting (B) takes
The reduce operation is distributed among the rows and the columns, where each row and column of processors performs a reduce operation with a
block size. Therefore, this reduce operation (R) takes:
The multiplication of two blocks takes
There are √{square root over (P)} iterations; thus the multiplications take:
Recovery
Each recovery is a reduce operation. By Lemma 1 f recoveries take:
Total Costs
Summing up Equations 5, 6, and 7 yields
Posterior-Recovery Algorithm
The posterior-recovery algorithm allocates an additional processor to encode A and B. In case of a fault, the algorithm recovers A and B and proceeds. After the multiplication ends, the algorithm re-computes the lost data of C. It does not use the outer product, and thus saves communication. The algorithm uses one additional processor, P′=P+1.
Theorem 3.3. (Posterior-recovery) Consider a 2D communication cost optimal matrix multiplication algorithm with resources (P, F, BW, L). Then there exists a fault tolerant 2D matrix multiplication algorithm that can withstand a single fault at any given time, where n is the matrix dimension and f is the total number of faults, with resources:
In this algorithm, the output is recovered by re-computation. That is, A and B input matrices are coded, but C is not. A faulty processor incurs the restoration of its share of A and B. Re-computing its lost share of the workload is performed at the end of the algorithm, using all processors. When a fault occurs, the algorithm recovers the lost data of A and B using their code, initializes the lost block of C to zeros, and resumes computations.
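A toy, fully serial numpy simulation of this strategy for a single fault follows (the grid size, block size, fault location, and fault time are illustrative; in the actual algorithm the blocks reside on separate processors and the sums are reduce operations). A and B are protected by one code block each (the sum of their blocks), C is not encoded, and the lost block products (the "cubes" of Definition 3.1 below) are recomputed after the multiplication.

import numpy as np

g, b = 3, 4                                  # sqrt(P) x sqrt(P) grid of b x b blocks
A = [[np.random.rand(b, b) for _ in range(g)] for _ in range(g)]
B = [[np.random.rand(b, b) for _ in range(g)] for _ in range(g)]
C = [[np.zeros((b, b)) for _ in range(g)] for _ in range(g)]

code_A = sum(A[i][j] for i in range(g) for j in range(g))   # one code block for A
code_B = sum(B[i][j] for i in range(g) for j in range(g))   # one code block for B

fi, fj, fault_step = 1, 2, 1                 # processor (fi, fj) faults in iteration 1
lost_cubes = []

for s in range(g):                           # outer-product style iterations
    for i in range(g):
        for j in range(g):
            C[i][j] += A[i][s] @ B[s][j]
    if s == fault_step:                      # fault: (fi, fj) loses its A, B and C blocks
        A[fi][fj] = code_A - sum(A[i][j] for i in range(g) for j in range(g)
                                 if (i, j) != (fi, fj))
        B[fi][fj] = code_B - sum(B[i][j] for i in range(g) for j in range(g)
                                 if (i, j) != (fi, fj))
        C[fi][fj] = np.zeros((b, b))         # C is not encoded; just reset the lost block
        lost_cubes = [(fi, fj, t) for t in range(s + 1)]   # contributions already lost

# posterior recovery: only the lost cubes are recomputed, after the multiplication
for (i, j, t) in lost_cubes:
    C[i][j] += A[i][t] @ B[t][j]

assert np.allclose(np.block(C), np.block(A) @ np.block(B))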
Definition 3.1. A cube denotes the set of scalar multiplications defined by the multiplication of two blocks (sub-matrices).
Proof. [of Theorem 3.3]
It is assumed that at each iteration, at most one fault occurs. Therefore, the algorithm needs only one additional processor to encode A and B, namely, P′=P+1
F′=FCC+FMM+FReIn+FReOut, where CC stands for code creation, MM for the matrix multiplication, ReIn for the recovery of the input A and B, and ReOut for the recomputation. Similarly, BW′=BWCC+BWMM+BWReIn+BWReOut, and L′=LCC+LMM+LReIn+LReOut.
Code Creation
The costs of this phase are as in the Slice-coded algorithm above.
Matrix Multiplication
The algorithm performs 2D matrix multiplication (e.g., Cannon's [8]), thus (FMM, BWMM, LMM)=(F, BW, L). (8)
Input Recovery
By Lemma 1 the costs of f recoveries are:
Output Recovery
This stage involves communication, multiplication, and reducing the data. It is assumed that the maximum number of faults in an iteration is 1. Each processor computes √{square root over (P)} cubes. Therefore there are at most P cubes to compute again, as there are √{square root over (P)} iterations. The algorithm distributes the workload of the lost cubes. Each processor gets at most one cube. Since computing a cube is multiplying two blocks of size
it takes
flops (floating-point operations).
The communication cost is due to moving two input blocks and the reduce of C. Thus it takes
Total Costs
Summing up Equations 5, 8, 9, and 10 yields
Table 2 shows results for fault tolerant algorithms for 2D algorithms, namely
with at most h simultaneous faults, where n is the matrix dimension, P is the number of processors, and f is the total number of faults occurring throughout the run of the algorithm.
Handling Multiple Faults
The proposed new algorithms may be extended to handle several simultaneous faults.
Previous Algorithm
Theorem 4.1. ([12]) Consider a 2D communication cost optimal matrix multiplication algorithm with resources (P, F, BW, L). Let n be the matrix dimensions. Then there exists a fault tolerant 2D matrix multiplication algorithm that can withstand h simultaneous faults at any given time, and f total faults, with resources:
The algorithm adds h rows of processors to A and C and h columns of processors to B and C. It stores weighted sums of the original processors in the additional processors. Chen and Dongarra's algorithm uses a (√{square root over (P)}+h, √{square root over (P)}, h+1)-code, with a generating matrix
such that every minor of E is invertible. It can therefore recover h simultaneous faults, even if they occur in the same row or column of processors.
Proof. Chen and Dongarra's algorithm has h processors for each row and column of the original processors (see
The code creation is done by h reduce operations, performed by each row and column of processors. Thus by Theorem 2:
Matrix Multiplication
The second phase of Chen and Dongarra's algorithm is the matrix multiplication. They used the outer product algorithm. This algorithm includes broadcasting to a slightly larger set of processors, h+√{square root over (P)} instead of √{square root over (P)}, and it runs the same number of iterations. Therefore it takes:
Recovery
When a fault occurs, the processors in the same column recover the block of A, the processors in the same row recover the block of B, and the same for C. By Theorem 2, for f faults this takes
Total Costs
Summing up Equations 11,12 and 13:
4.2 Erasure Correcting Code
For multiple faults, an erasure code is used (recall the definition in Section 2.2). To withstand h simultaneous faults, a (P+h, P, h+1)-code is required. In other words, any P letters suffice to recover the data. This is possible if and only if every minor of size P of the generating matrix
is invertible. In other words, every minor of E is invertible.
Similar to the single fault case, each code processor multiplies a weighted sum of the current block column of A with a weighted sum of the current block row of B, and adds it to the accumulated sum. Thus the weighted sum is of the form:
where wi,j=vi·uj for some vectors v and u. The code used in [10] does not have the above property, and therefore cannot be used. It is shown that there exists a code with the required properties.
Lemma 4.1. There exists a (P+h, P, h+1)-code such that the generating matrix
has the following property: for every row i of E there exist two vectors vi, ui of length √{square root over (P)} such that Ei=vi⊗ui; namely, the entry of Ei corresponding to position (a, b) of the √{square root over (P)}×√{square root over (P)} block grid equals va^i·ub^i. Proof: Consider an erasure code with generating matrix
where I=IP, and E is an h×P Vandermonde matrix. Every minor of the Vandermonde matrix is invertible. The ith row of the Vandermonde matrix is of the form ri=(αi^0, αi^1, . . . , αi^(P−1)), where αi is the generating scalar of the row ri (the elements of ri are powers of αi; the smallest power is 0 and the largest is P−1). Taking vi=(αi^0, . . . , αi^(√{square root over (P)}−1)) and ui=(αi^0, αi^√{square root over (P)}, . . . , αi^(P−√{square root over (P)})) gives ri=ui⊗vi, and therefore every entry of Ei is a product of an entry of vi and an entry of ui, as required.
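A small numpy check of this structure (the sizes and generating scalars below are illustrative only):

import numpy as np

sqrtP, h = 4, 3
P = sqrtP * sqrtP
alphas = np.arange(2.0, 2.0 + h)                   # generating scalars of the h Vandermonde rows
E = np.array([[a ** j for j in range(P)] for a in alphas])

for i, a in enumerate(alphas):
    v = a ** np.arange(sqrtP)                      # powers 0, 1, ..., sqrt(P)-1
    u = a ** (sqrtP * np.arange(sqrtP))            # powers 0, sqrt(P), ..., P-sqrt(P)
    assert np.allclose(E[i], np.kron(u, v))        # row i of E factors as u_i (x) v_i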
Slice-Coded Algorithm for 2D Matrix Multiplication
Theorem 4.2. (Slice-coded algorithm)
Consider a 2D communication cost optimal matrix multiplication algorithm with resources (P, F, BW, L). Let n be the matrix dimension. Then there exists a fault tolerant 2D matrix multiplication algorithm that can withstand h simultaneous faults at any given time, and f total faults, with resources:
Proof: Section 4.2 showed how to use h additional processors to obtain a code with distance h+1; thus P′=P+h. The rest of the analysis is similar to the single fault case in the proof of Theorem 2 and F′=FCC+FMM+FRe, and similarly for BW′ and L′.
Code Creation
The algorithm first creates an erasure code for A and B. By Theorem 2 as
and l=h this takes:
Matrix Multiplication
The multiplication involves broadcasting and reduction of h weighted sums. Each column of processors computes
weighted sums of the blocks of A and each row of processors computes
weighted sums of the blocks of B. The broadcasting and reduction (BR) takes:
The multiplication of two blocks takes
flops. There are √{square root over (P)} iterations. Therefore,
Recovery
When faults occur, the portions of A, B, and C of the faulty processors are recovered at the end of the iteration, using the erasure code. Assuming that at iteration i the number of faults is fi, then by Theorem 2, as W=n2/P and l=fi>0, the recovery takes:
Recall that f=Σi=1√{square root over (P)}fi. Therefore,
Total Costs
Summing up Equations 15, 16, and 18 yields,
Posterior-Recovery Algorithm for a 2D Matrix Multiplication
This algorithm allocates h processors for encoding A and B. It runs a 2D matrix multiplication (e.g., Cannon [8]; not necessarily an outer-product algorithm). When a processor faults, the algorithm recovers A and B and proceeds. After the multiplication, the algorithm re-computes the lost portion of C.
Theorem 4.3. (Posterior-Recovery)
Consider a 2D communication cost optimal matrix multiplication algorithm with resources (P, F, BW, L). Let n be the matrix dimension. Then there exists a fault tolerant 2D matrix multiplication algorithm that can withstand h simultaneous faults at any given time, and f total faults, with resources:
The analysis of this algorithm is very similar to the single fault case. A proof for Theorem 4.3 is provided below.
Proof The algorithm requires code with distance h+1, and uses h additional processors P′=P+h. The analysis is similar to the single fault case, specifically, F′=FCC+FMM+FReIn+FReOut, and the same for BW′ and L′.
Code Creation
By Theorem 2 the code creation costs:
Matrix Multiplication
The algorithm runs a 2D matrix multiplication, thus
(FMM,BWMM,LMM)=(F,BW,L) (20)
Input Recovery
Assume that at iteration i the number of faults is fi. By Theorem 2, as
and l=fi, the recovery of the input at iteration i takes:
if fi>0. Since f=Σi=1√{square root over (P)}fi summing up the recoveries of each iteration costs:
Output Recovery
When a processor faults, it loses at most √{square root over (P)} cubes of computations. Therefore, if f faults occur during the multiplication step, at the end of the multiplication the algorithm performs O(f·√{square root over (P)}) cube re-computations. Each cube computation involves three steps: receiving the input blocks, multiplying the matrices, and reducing the results. The receiving costs are as follows:
The multiplication costs
flops. Since processors may fault quite late, there is a reduce of O(√{square root over (P)}) blocks that increases the latency to O(log P). These multiplications can be done in parallel. There are O(f·√{square root over (P)}) blocks to multiply and P processors. Therefore, each processor performs
multiplications. Hence,
Total Costs
Summing up Equations 19, 20, 22, and 23
Memory Redundancy
Both the slice-coded algorithm and the posterior-recovery algorithm may be extended to the case where redundant memory is available, namely
for some
Fault Distribution
Recall that a 2.5D algorithm splits the processors into c sets, where each set performs
of the iterations of a 2D algorithm. When h is the maximum number of simultaneous faults, each set of processors has to be able to tolerate h simultaneous faults. For ease of analysis, it is assumed that the faults are distributed uniformly among the c sets. If this is not the case, the algorithm can divide the computations differently, and assign more computation to a set that has fewer faults. This is possible since each set of processors has sufficient data to perform all these computations.
Slice-Coded
Theorem 5.1. (Slice-coded)
Consider a 2.5D communication-cost optimal matrix multiplication algorithm with resources (P, F, BW, L). Let n be the matrix dimension and let M be the local memory size. Let c be the memory redundancy factor, namely
Then there exists a fault tolerant 2.5D matrix multiplication algorithm that can withstand h simultaneous faults at any given time, and f total faults, with resources:
Proof. The algorithm splits the processors into c sets (where c=Θ(P·M/n2)); each set performs √{square root over (P/c3)} iterations of the outer product. For each set of processors, the algorithm allocates h processors for the redundant code. Therefore P′=P+c·h. The analysis is similar to the minimum memory case, particularly F′=FCC+FMM+FRe, and the same for BW′ and L′.
Code Creation
First, the algorithm duplicates A and B and then it creates the code. Each set of processors computes the code of
processor, and the code processors duplicate themselves. By Theorem 2 it takes:
Matrix Multiplication
In each iteration (iter) of the outer product in the 2.5D matrix multiplication, one column of processors broadcasts the current blocks of A and one row of processors broadcasts the current blocks of B. Then the processors compute h weighted sums of those blocks, each row or column computes
sums, and sends them to the code-processors. Then each processor multiplies 2 blocks of
Summing up, the iteration costs are
There are √{square root over (P/c3)} iterations, thus
Recovery
Recovering faults is done by computing a weighted sum of √{square root over (P/c)} processors. At the end of each iteration the algorithm recovers faults by pipelining the reduce operations. According to Theorem 2, as W=M, this costs:
Total Costs
Summing up Equations 24, 25, and 26:
Posterior-Recovery
The 2.5D adaptation of the posterior-recovery algorithm is similar to the 2D case, with one main exception: there is an inherent redundancy in the replications of A and B in the 2.5D algorithm that is utilized to decrease the length of the code, hence reducing the number of additional processors required. If h<c, the algorithm does not require additional processors at all. The algorithm splits the processors into c sets where
Each set performs
of the iterations of a 2D algorithm (not necessarily the outer product algorithm). When a fault occurs, the processors in the set of the faulty processor wait for the recovery of that processor. The lost data of A and B are recovered from the next set of processors.
Theorem 5.2. (Posterior-Recovery)
Consider a 2.5D algorithm with resources (P, F, BW, L). Let n be the matrix dimension and let M be the local memory size. Let c be the memory redundancy factor, namely
Then there exists a fault tolerant 2.5D matrix multiplication algorithm that can withstand h simultaneous faults at any given time, and f total faults, with resources
Proof. As explained above, P′=P. The algorithm does not create code; similar to the 2D case, it recovers the input immediately and re-computes the lost output data after the multiplication ends. Therefore F′=FMM+FReIn+FReOut. Likewise BW′=BWMM+BWReIn+BWReOut, and L′=LMM+LReIn+LReOut.
Matrix Multiplication
The algorithm performs a 2.5D matrix multiplication therefore,
(FMM,BWMM,LMM)=(F,BW,L) (27)
Input Recovery
The algorithm recovers faults at the end of each iteration. Since c>h there is at least one copy of each block even when h processors fault simultaneously. If k processors that hold the same block of A (or B) fault simultaneously, the algorithm broadcasts this block. Therefore in the worst case, this recovery requires O(log k) messages. Recall that in the ith iteration fi<c processors fault. By Lemma 1 it costs:
(FReIn
thus the total recovery costs are:
Output Recovery
After the 2.5D matrix multiplication is completed, the algorithm computes the lost cubes (recall Definition 2). When a processor faults it loses O(√{square root over (P/c3)}) such cubes. Each processor gets
such cubes for recomputing, and multiplies pairs of them. The block size is
therefore multiplying two blocks costs
flops. Thus the costs are
log P is added to the latency because the output recovery may include the broadcast operation of the blocks and the reduce operation of the results.
Total Costs
Summing up Equations 27, 28, and 29:
Table 3 shows results for fault tolerant 2.5D algorithms, where c copies of the input and the output fit into the memory, namely
with at most h simultaneous faults, where n is the matrix dimension, P is the number of processors, and f is the total number of faults occurring throughout the run of the algorithm.
Fault Tolerant Fast Matrix Multiplication
The present invention also provides a method and system to achieve fault resilience (for hard errors) with small cost overheads for Strassen's and other fast matrix multiplication algorithms, and obtains a tradeoff between the number of additional processors and the communication costs. Parallelization is performed using the BFS-DFS parallelization technique of [33, 41, 52].
In a BFS step, the processors compute seven n/2×n/2 sub-problems locally and redistribute the data, so that each set of P/7 processors deals with one sub-problem. In a DFS step, the processors compute one of the seven n/2×n/2 sub-problems locally and proceed by jointly computing that sub-problem.
Using 7^l additional processors, the code can be preserved during l consecutive BFS steps. Therefore, there is a tradeoff between the number of additional processors and the number of code creations. The data size differs between code creations: each BFS step increases the data size by a factor of 7/4, and therefore the code creation after the multiplication layer dominates. By adding √{square root over (P)} processors, it is possible to handle half of the BFS steps; the code is then created again after the multiplication, and once more after half of the up (reverse) BFS steps.
There is a tradeoff between the number of the processors and the number of code creations.
Communication Costs of Fast Matrix Multiplication: Lower Bounds
Ballard et al. [34] obtained lower bounds on the communication costs required by any parallelization of Strassen's and several other fast matrix multiplication algorithms, namely
where ω0=log2 7 for Strassen's algorithm. In addition, they obtained a memory independent communication lower bound of
[32]. Scott et al. [56] generalized these lower bounds to all fast matrix multiplication algorithms. Recently, Bilardi and De Stefani [35] strengthened the lower bound for Strassen's algorithm so that it holds even when re-computation is allowed.
Existing Algorithms
Several parallelization techniques have been applied to Strassen's algorithm [51, 44, 42, 58, 49, 50, 33, 52], of which [52, 50, 33] are optimal, namely they attain the lower bounds of [32, 34] up to an O(log P) factor. The costs of these algorithms are:
Table 1* shows fault tolerant solutions for the unlimited memory case
where Ω is the lower asymptotic bound, with at most one simultaneous fault. n is the matrix dimension, ω0 is the exponent of the fast matrix multiplication algorithm (for Strassen's algorithm, ω0=log2 7), P is the number of processors, and f is the total number of faults occurring throughout the run of the algorithm. The second row is for the CAPS algorithm, which is not fault tolerant. The next rows present the proposed algorithms and the tradeoff between the number of additional processors and communication performance. (2≤d≤log7 P), where log7 P/d BFS steps are encoded together.
Table 2 shows fault tolerant solutions, unlimited memory case,
with at most h simultaneous faults. n is the matrix dimension, P is the number of processors, and f is the total number of faults occurring throughout the run of the algorithm. (2≤d≤log7 P), where log7 P/d BFS steps are encoded together.
Table 3 shows fault tolerant CAPS algorithm for the limited memory case, where
with at most h simultaneous faults. n is the matrix dimension, P is the number of processors, and f is the total number of faults occurring throughout the run of the algorithm. (2≤d≤log7 P), where log7 P/d BFS steps are encoded together.
Linear Erasure Code
A linear erasure code is used for recovering faults.
Definition 1 An (n, k, d)-code is a linear transformation T: ℝ^k→ℝ^n with distance d, where distance d means that for every x≠y∈ℝ^k, T(x), T(y) have at least d coordinates with different values.
The erasure code used preserves the original word and adds redundant letters. Chen and Dongarra [39] presented a class of such codes that attain the maximum distance possible, namely d=n−k+1.
2.3 Parallelization of Fast Matrix Multiplication Algorithms
Let us recall the BFS-DFS parallelization of [33, 50, 52]. The algorithm has two main steps. The BFS step computes the 7 sub-problems of the recursion in parallel, and each problem is assigned to 1/7 of the processors. The DFS step computes the sub-problems serially, such that all the processors compute the sub-problems one by one. The BFS step requires more memory, but it decreases communication.
In a BFS step, each processor computes its share of S1, . . . , S7, T1, . . . , T7 locally. Then the processors redistribute the data, and assign each sub-problem to a subset of processors. Each set (i) makes a recursive call to compute Si·Ti; then the processors redistribute the data again, and each processor computes its share of C locally from the results. Consider the numbering of the processors in base-7; that is, the name of a processor is a string of digits 0, . . . , 6 of length log7 P. In the sth BFS step the algorithm divides the processors into seven subsets, such that in each subset the sth digit is identical. Communication takes place between processors whose names match everywhere except at the sth place.
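The following small Python sketch makes the base-7 naming concrete (the helper names base7_digits and bfs_partners, and the choice P=49, are illustrative only; P is assumed to be a power of seven):

import math

def base7_digits(p, num_digits):
    # base-7 name of processor p, most significant digit first
    return [(p // 7 ** (num_digits - 1 - i)) % 7 for i in range(num_digits)]

def bfs_partners(p, s, P):
    # the six processors that p exchanges data with in the s-th BFS step
    num_digits = round(math.log(P, 7))
    digits = base7_digits(p, num_digits)
    partners = []
    for d in range(7):
        if d == digits[s]:
            continue
        q_digits = list(digits)
        q_digits[s] = d
        partners.append(sum(dig * 7 ** (num_digits - 1 - i) for i, dig in enumerate(q_digits)))
    return partners

print(bfs_partners(p=10, s=0, P=49))   # -> [3, 17, 24, 31, 38, 45]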
When the memory exceeds
the algorithm only performs BFS steps. Ballard et al. [33] denoted this the unlimited memory case, as additional memory here cannot further reduce communication costs. When
the algorithm performs DFS steps and reduces the problem size. When it is sufficiently small the algorithm performs BFS steps. They showed that
DFS steps are sufficient for this purpose. Each BFS step divides the processors into 7 sub-sets. When P=1 the processor computes the multiplication locally. Thus the number of BFS steps is exactly log7 P. This number is denoted by k, and the number of DFS steps depends on M.
Fault Tolerant BFS-DFS
In this section a fault tolerant algorithm is described, based on the BFS-DFS parallelization [33, 50, 52] of Strassen's matrix multiplication [59]. The algorithm requires 7 additional processors. The algorithm is first described for the case of unlimited memory and h=1. In Section 6 the algorithm is extended to tolerate more simultaneous faults (larger h). In Section 6.1 the algorithm is extended to the limited memory case.
3.1 Single Fault
It is next explained how an erasure code is preserved during the BFS and DFS steps. Assume each of the P original processors holds a block of numbers and a code processor c holds a linear code of them, i.e., a weighted sum of the blocks. If all processors (code processor included) locally compute the same linear combination of the elements of their blocks, then the code on the code processor is preserved. Therefore a code processor preserves the linear code during DFS steps, as they include only linear local computations. Further, the linear code is also preserved during the computation parts of BFS steps.
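A tiny numpy illustration of this linearity argument (block and processor counts are illustrative): if the code processor holds a weighted sum of the original processors' blocks and every processor applies the same linear map to its block, the code processor still holds the weighted sum of the transformed blocks.

import numpy as np

P, W = 7, 10
blocks = [np.random.rand(W) for _ in range(P)]
weights = np.random.rand(P)
code_block = sum(w * b for w, b in zip(weights, blocks))   # held by the code processor

T = np.random.rand(W, W)                                   # some linear local computation
new_blocks = [T @ b for b in blocks]
new_code_block = T @ code_block                            # the code processor applies the same map

assert np.allclose(new_code_block, sum(w * b for w, b in zip(weights, new_blocks)))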
Next, the communication part of the BFS steps is discussed. Assume the processors' names are strings of length log7 P over [7]. In the sth BFS step, each processor communicates with the six processors that agree with it on all coordinates except the sth. Recall that a BFS step assigns each of the seven sub-problems to 1/7 of the processors. In particular, each processor computes its share of S1, . . . , S7, T1, . . . , T7 and sends six of them. A processor whose sth digit equals 1 sends its share of S2, T2 to the processor whose name is identical on all coordinates except the sth coordinate, which is 2, and so on. Assume, for the sake of discussion, that the processors are arranged in a P/7×7 grid (see
Each set of P/7 processors is encoded by one processor. When the processors in a set send a block (to a processor in the same row), so does the code processor. Therefore, the code processor preserves the code during the communication phase.
The algorithm requires seven additional processors for the error correcting code (see Algorithm 5). At each BFS step, the algorithm first creates the code. That is, it divides the original processors into seven subsets, and for each set it computes a linear code (e.g., the sum, using a reduce operation) of the processors' blocks and sends it to one of the seven additional processors. Then the original processors perform a BFS step as usual. Each processor (including the additional seven) locally computes the seven sub-problems, and the processors redistribute the problems, one problem to one set of processors. By the linearity of the code, a code processor obtains the sum of its set during the BFS step. Next, the algorithm performs a recursive call. After the recursive call the algorithm creates the code again, the processors redistribute the results and compute C locally. When a fault occurs, the set of the faulty processor recovers the lost processor's data using the code, i.e., by computing a weighted sum of the processors' blocks, which is done using a reduce operation.
Exactly 7 additional processors are required to preserve the code during the local computations and during the communication of the BFS step.
BFS(A, B, C, P, k), where k is the number of BFS steps to be recursively computed:
1. Create code.
2. Locally compute S1, . . . , S7 and T1, . . . , T7 from A and B.
3. Redistribute the data.
4. BFS(Si, Ti, Qi, P/7, k−1): in parallel, each subset of processors computes one sub-problem.
5. Create code.
6. Redistribute the data.
7. Locally compute C from Q1, . . . , Q7.
Theorem 1 Let (P, F, BW, L) be the resource parameters of CAPS [33], and let (P′, F′, BW′, L′) be the corresponding parameters of a fault tolerant algorithm that can withstand one simultaneous fault at any given time. Let n be the matrix dimension, f be the total number of faults, and assume
Then (P′, F′, BW′, L′)=(P+7,F·(1+o(1)), BW·O(1), O(L·log P)).
Proof. Seven additional processors are sufficient; therefore P′=P+7. Costs are analyzed for the code creation (CC), matrix multiplication (MM), and recovery (Re) phases: F′=FCC+FMM+FRe, BW′=BWCC+BWMM+BWRe, and L′=LCC+LMM+LRe, where FCC, FMM, and FRe are the flop counts of the code creation, matrix multiplication, and recovery phases, and similarly for BW′ and L′.
Code Creation.
Each BFS step increases the data size by a factor of 7/4, such that in the sth step the data size in each processor is
The algorithm creates code at every BFS step. The code creation is achieved by a reduce operation. By Lemma 1 this takes:
Summing the costs of code creations:
Matrix Multiplication.
The matrix multiplication part of this algorithm is as in CAPS [33]. The only difference is the additional code processors. Each code processor encodes a set of P/7 processors, i.e., the local memory of a code processor contains a weighted sum of the local memories of P/7 original processors, where the weights are predefined (see Section 2.2). The code processor acts like a processor from this set during the BFS steps. It computes the seven sub-problems locally and redistributes the data with the other code processors. Therefore the matrix multiplication costs are exactly as in CAPS; hence
(FMM,BWMM,LMM)=(F,BW,L) (2)
Recovery.
When a fault occurs, the algorithm recovers the lost data and sends it to the new processor. A fault during the BFS steps incurs a reduce operation for computing the lost data using the code. This recovery is executed on the fly (ReF), namely it incurs only linear computations and no recomputation. A fault during the multiplication step incurs a reduce operation and a recomputation (ReC) of the multiplication. Thus FRe=FReF+FReC, and similarly for BW and L.
By Lemma 1 a reduce operation takes (O(W), O(W), O(log P)), where W is the size of the data. In the CAPS algorithm the data size forms a geometric series; thus the term with the maximum data size dominates the flops and the bandwidth, up to a constant factor. The maximum size of the data is
thus recovering a fault when the data size is maximal dominates FReF, BWReF. Summing f reduce operations yields:
It is assumed that at most one fault occurs in each step of the algorithm, in particular in the multiplication step. Recovery during the multiplication incurs recomputation of a block multiplication of size
By using CAPS [33] on the sub-problem, this multiplication takes
Summing the recovery costs
Total costs.
Summing Equations 1, 2, and 3 yields:
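To illustrate why the geometric growth of the per-processor data size makes the last BFS step dominate both the code-creation and the recovery costs above, the following sketch sums the per-step costs numerically (the starting per-processor size of roughly 3n^2/P and the concrete values of n and P are assumptions made only for this example):

    import math

    # Each BFS step multiplies the per-processor data by 7/4, so the per-step
    # code-creation cost (a reduce, linear in the data size) forms a geometric
    # series whose sum is dominated, up to a constant, by its largest final term.
    n, P = 2**12, 7**4
    k = round(math.log(P, 7))               # number of BFS steps (unlimited memory)
    sizes = [3 * n * n / P * (7 / 4) ** s for s in range(k + 1)]
    total, largest = sum(sizes), max(sizes)
    print(f"sum of per-step costs = {total:,.0f} words")
    print(f"largest single term   = {largest:,.0f} words")
    print(f"ratio (bounded by 1/(1-4/7) = 7/3): {total / largest:.3f}")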
4 Trading Off Processors for Communication
The previous section showed how to preserve the error correcting code during a BFS step using seven additional processors. This algorithm creates a new code for every BFS step. One can use more code processors and save communication. Specifically, the processors are arranged in a P/7^r × 7^r grid, where r is the number of consecutive BFS steps that can be performed before a new error correcting code needs to be computed,
and 2≤d≤log7 P, where d=log7 P/r is the number of code creations performed on the way down the recursion.
Therefore, in the unlimited memory case the algorithm creates code exactly 2d times (see Section 4.1).
The algorithm generates ECC I at the beginning. After half of the BFS steps, the algorithm generates ECC II. The code is not preserved during the multiplication step. The algorithm generates ECC III after the multiplications, and after log7 P/2 more BFS steps (back in the recursion) the algorithm generates ECC IV.
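The following sketch, written under the assumption that a code is created every log7 P/d BFS steps on the way down and again on the way up (2d codes in total), prints the resulting schedule; for d=2 it reproduces the four codes ECC I–IV described above:

    import math

    # Assumed schedule: with parameter d, a code is created every log_7(P)/d BFS
    # steps while descending the recursion, and again every log_7(P)/d steps while
    # unwinding after the multiplications, for 2d codes in total.
    def code_creation_schedule(P, d):
        k = round(math.log(P, 7))        # total number of BFS steps (unlimited memory)
        stride = k // d                  # BFS steps between consecutive code creations
        descending = [s * stride for s in range(d)]   # BFS steps completed at creation time
        ascending = [s * stride for s in range(d)]    # BFS steps unwound after the multiplications
        return descending, ascending

    down, up = code_creation_schedule(P=7**6, d=2)
    print("descending:", down)  # [0, 3] -> ECC I at the start, ECC II after half the BFS steps
    print("ascending: ", up)    # [0, 3] -> ECC III right after the multiplications, ECC IV after log_7(P)/2 more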
Theorem 2 Let (P, F, BW, L) be the resource parameters of CAPS [33], and let (P′, F′, BW′, L′) be the corresponding parameters of a fault tolerant algorithm that can withstand h simultaneous faults at any given time. Let n be the matrix dimension, f be the total number of faults, and assume
Then
To handle multiple faults the algorithm allocates
additional processors. Therefore
Each group of h of them encodes a set of
processors.
As in the case of seven additional processors, F′=FCC+FMM+FRe, and similarly for BW′ and L′. The costs of MM and Re are similar to those of the seven-processor case in Section 6.
Code Creation
The algorithm creates a code every log7 P/d BFS steps. The code creation is done by reduce operations in a row, which by Lemma 2 takes
Summing all the costs of code creations:
Total Costs
Summing up Equation 4 above with Equations 6 and 7 from Section 6 yields:
Note that when d=log7 P the number of additional processors is
and the processor-minimizing algorithm is obtained; when d=2 the communication-minimizing algorithm is obtained (see Table 2). Summing the above with Equations 13 and 14 from Section 6.1:
Inherent Fault Tolerance of Strassen's Algorithm
Strassen's algorithm is composed of three phases: the encoding of A and of B, the element-wise multiplications, and the decoding of C. Each encoding step is a linear transformation from the four input sub-blocks to the seven encoded sub-blocks. This encoding is a linear code in terms of coding theory, and is therefore useful for fault recovery. This code has a distance of 2 (see Appendix 8 for details), thus it enables recovery of a single erasure. Moreover, each BFS step increases the distance of the code by a factor of 2. Therefore, after the second BFS step, three faults can be recovered using this code (and in general 2^l−1 faults after l BFS steps). Therefore, for a small number of faults that occur during the encoding phase, the algorithm does not need code processors for recovery. This code is also very local, since only 7 processors are involved in the recovery. The inherent fault tolerance of the encoding phase of Strassen's algorithm may exist in other fast matrix multiplication algorithms, but does not have to. Furthermore, immediately following the multiplication, the existence of an element that can be expressed as a linear combination of the others is unlikely, since this would imply the existence of an algorithm with fewer multiplications, hence better complexity.
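The following sketch illustrates this inherent code using the standard Strassen encoding coefficients (the coefficient matrix, the block sizes, and the least-squares recovery procedure are included here only for illustration):

    import numpy as np

    # The seven encoded A-blocks S1..S7 are linear combinations of the four
    # quadrants A11, A12, A21, A22, so any single lost S_i is itself a linear
    # combination of the surviving ones (the code has distance 2).
    U = np.array([      # rows: S1..S7, columns: A11, A12, A21, A22
        [1, 0, 0, 1],   # S1 = A11 + A22
        [0, 0, 1, 1],   # S2 = A21 + A22
        [1, 0, 0, 0],   # S3 = A11
        [0, 0, 0, 1],   # S4 = A22
        [1, 1, 0, 0],   # S5 = A11 + A12
        [-1, 0, 1, 0],  # S6 = A21 - A11
        [0, 1, 0, -1],  # S7 = A12 - A22
    ], dtype=float)

    rng = np.random.default_rng(1)
    quadrants = rng.standard_normal((4, 8, 8))       # A11, A12, A21, A22 as 8x8 blocks
    S = np.einsum('ij,jkl->ikl', U, quadrants)       # the seven encoded blocks

    lost = 0                                         # say S1 is erased
    others = [i for i in range(7) if i != lost]
    # Coefficients expressing row `lost` of U via the other rows (least squares).
    coef, *_ = np.linalg.lstsq(U[others].T, U[lost], rcond=None)
    S_recovered = np.einsum('i,ikl->kl', coef, S[others])
    assert np.allclose(S_recovered, S[lost])
    print("recovered S%d with max error %.2e" % (lost + 1, np.abs(S_recovered - S[lost]).max()))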
Fault Tolerance for Other Fast Matrix Multiplication Algorithms
The proposed approach can be generalized to other fast matrix multiplication algorithms. Consider a fast matrix multiplication algorithm that multiplies two matrices of size n0×n0 using m0 multiplications (hence its asymptotic arithmetic complexity is Θ(n^ω0), where ω0=logn0 m0).
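For example, the exponent follows directly from the base-case parameters; the small helper below (illustrative only) evaluates ω0 for a few known base cases:

    import math

    # omega0 = log_{n0}(m0) for a base case that multiplies n0 x n0 matrices
    # with m0 scalar multiplications.
    def omega0(n0, m0):
        return math.log(m0) / math.log(n0)

    print(omega0(2, 8))    # classical 2x2 base case: 3.0
    print(omega0(2, 7))    # Strassen: ~2.807
    print(omega0(3, 23))   # a 3x3 base case with 23 multiplications (e.g., Laderman): ~2.854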
Theorem 3 Let (P, F, BW, L) be the resource parameters of a parallelization of fast matrix multiplication [33, 50, 52] with exponent ω0 and with m0 multiplications in the base case, and let (P′, F′, BW′, L′) be the corresponding parameters of a fault tolerant algorithm that can withstand h simultaneous faults at any given time. Let n be the matrix dimension, f be the total number of faults, and assume
Then for any d such that 2≤d≤logm0 P:
When the memory is a bounding resource, the same algorithm yields the following:
Theorem 4 Let (P, F, BW, L) be the resource parameters of a parallelization of fast matrix multiplication [33, 50, 52] with exponent ω0 and with m0 multiplications in the base case, and let (P′, F′, BW′, L′) be the corresponding parameters of a fault tolerant algorithm that can withstand h simultaneous faults at any given time. Let n be the matrix dimension, f be the total number of faults. Then for any d such that 2≤d≤logm0 P:
The proofs of Theorems 3 and 4 are the same as the proofs of Theorems 2 and 8.
Comparing the Proposed Solution to Existing Solutions
Chen and Dongarra [39, 40] presented a fault tolerant algorithm based on classic matrix multiplication. In a recent work [36], two fault tolerant algorithms based on classic matrix multiplication have been obtained that improve resource use, e.g., by minimizing the number of additional processors required and the communication costs.
Theorem 5 (Slice coded [36]) There exists a fault tolerant 2.5D matrix multiplication algorithm that can withstand h simultaneous faults at any given time, with the following resources:
where c is the memory redundancy factor, namely c=Θ(P·M/n^2), n is the matrix dimension, M is the local memory size, P is the number of processors of the original algorithm, and f is the total number of faults.
Strassen's algorithm is asymptotically better than classic matrix multiplication. However, the overhead of the proposed classic-based fault tolerant algorithm is lower than the overhead of the Strassen-based one. As a result, as long as h is not too large, the Strassen-based solution is expected to be faster. However, for huge h values, namely
the communication costs of the Strassen-based solution dominate those of the classic-based one, hence the latter is expected to be more efficient.
Two methods for obtaining fault tolerance at lower costs have been presented above: the slice-coded algorithm and the posterior-recovery algorithm. Both can handle multiple simultaneous faults. When the memory is minimal, both algorithms use as few processors as possible, namely h, where h is the maximum number of faults that may occur in one iteration. It has been shown how to combine these methods with a 2.5D algorithm that utilizes redundant memory, to reduce the communication costs. When the number of faults is not too large, the algorithms only marginally increase the number of arithmetic operations and the bandwidth costs. The slice-coded algorithm increases the latency by a factor of log P. If faults occur in every iteration of the posterior-recovery algorithm, its latency increases by a factor of log P as well.
The slice-coded algorithm uses the outer product in each iteration and keeps the code processors updated. The outer product uses up to a constant factor more words, and up to an O(log P) factor more messages. Therefore, the slice-coded algorithm communicates a little more, but it can recover faults quickly at each iteration. In contrast, the posterior recovery communicates less in this phase, but performs more operations to recover faults. Therefore the slice-coded algorithm is more efficient when many faults occur, and useful when quick recovery is needed. For fewer faults, the posterior recovery is more efficient.
The posterior recovery with redundant memory uses the input replication of the 2.5D algorithm. It utilizes the redundant memory to reduce the communication costs and the number of required additional processors. The case of h<c has been analyzed, where the maximum number of simultaneous faults is smaller than the number of copies of the input. In this case the algorithm does not need to allocate additional processors, but rather recovers the input using the existing replication. The case of h≥c, where h−c+1 additional processors are required, is not analyzed, and the recovery run-time depends on the fault distribution. Briefly, in this case, if a code processor faults, the recovery requires computations, whereas when an original processor faults, the recovery uses the input replication and is very fast.
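A minimal sketch of the h<c recovery path is given below (the dictionary-of-layers layout, the block size, and the variable names are assumptions made only for this illustration; recovery amounts to copying the block from a surviving layer):

    import numpy as np

    # The input block lives in c layers (the 2.5D-style replication), so a faulted
    # copy is restored by fetching it from any surviving layer: no code processors
    # and no recomputation are needed as long as fewer than c copies are lost.
    rng = np.random.default_rng(3)
    c = 4                                             # number of input copies (layers)
    block = rng.standard_normal((4, 4))               # one processor's input block
    layers = {layer: block.copy() for layer in range(c)}

    faulted = {2}                                     # layer 2 loses its copy (h = 1 < c)
    for layer in faulted:
        layers[layer] = None

    survivor = next(l for l in layers if layers[l] is not None)
    for layer in faulted:
        layers[layer] = layers[survivor].copy()       # recovery = one block transfer

    assert all(np.array_equal(layers[l], block) for l in range(c))
    print("restored", len(faulted), "copy/copies from layer", survivor)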
For Strassen's [54] and other fast matrix multiplication algorithms, Ballard et al. [5] described a communication-optimal parallelization that matches the communication cost lower bound [36]. However, this parallelization technique does not allow for a direct application of either of the introduced methods.
Although embodiments of the invention have been described by way of illustration, it will be understood that the invention may be carried out with many variations, modifications, and adaptations, without exceeding the scope of the claims.
To handle h simultaneous faults, the algorithm uses multiple code processors for encoding each set of the original processors. As in the single-fault case, the original processors are divided into seven sets, but h code processors are used to encode each set rather than one (so a total of 7h additional processors rather than 7). The algorithm uses h reduce operations (performed efficiently; recall Lemma 2).
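One standard way to realize h code processors per set is a maximum-distance (e.g., Vandermonde) code; in the following sketch, the specific weights, set size, and block size are illustrative assumptions: the blocks of one set are encoded with h checksums (h reduce operations) and h erased blocks are recovered by solving an h×h linear system.

    import numpy as np

    rng = np.random.default_rng(2)
    m, h = 6, 2                                   # m original blocks in the set, h code processors
    blocks = rng.standard_normal((m, 4, 4))

    # h x m Vandermonde weights: any h columns form an invertible matrix.
    V = np.vander(np.arange(1, h + 1), m, increasing=True).astype(float)
    checksums = np.einsum('ij,jkl->ikl', V, blocks)   # h reduce operations over the set

    # Two original blocks are lost.
    lost = [1, 4]
    known = [i for i in range(m) if i not in lost]

    # checksums - (contribution of the known blocks) = V[:, lost] @ (lost blocks)
    rhs = checksums - np.einsum('ij,jkl->ikl', V[:, known], blocks[known])
    recovered = np.einsum('ij,jkl->ikl', np.linalg.inv(V[:, lost]), rhs)
    assert np.allclose(recovered, blocks[lost])
    print("recovered", len(lost), "lost blocks")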
Theorem 6 Let (P, F, BW, L) be the resource parameters of CAPS [3] and let (P′, F′, BW′, L′) be the corresponding parameters of a fault tolerant algorithm that can withstand h simultaneous faults at any given time. Let n be the matrix dimension, f be the total number of faults, and assume memory is unlimited, namely
Then (P′, F′, BW′, L′)=(P+7h, F·(1+o(h)), BW·O(h+1), L+O(log2 P+(f+h)·log P)).
Code Creation.
Each BFS step increases the data size by a factor of 7/4, such that in the s'th step the data size in each processor is
The algorithm creates code every BFS step. The code creation is done by reduce operations in a row, which by Lemma 2 takes
Summing all the costs of code creations:
Matrix Multiplication
As in the single-fault case, this yields
(FMM,BWMM,LMM)=(F,BW,L) (6)
Recovery
As in the single fault case, FRe=FReF+FReC and similarly for BW and L. The recoveries when the data size is maximal dominate FReF, BWReF. There are at most h such recoveries. Therefore the costs of f reduce operations are:
At most h faults occur during the multiplication step. After the multiplication step the algorithm recomputes the lost data. It needs to recompute h matrix multiplications of size
The algorithm splits the processors into h sets and each set computes one block multiplication. Using CAPS [3] for these multiplications, this takes:
Summing up the recovery costs yields
Total Costs.
Summing Equations 5, 6, and 7 yields:
6.1 Limited Memory Case
When memory is limited
the CAPS algorithm starts with l DFS steps, followed by k BFS steps, where
and k=log7 P. The linear code is easily preserved during the DFS steps, because they only involve local computations. Each DFS step reduces the problem size by a factor of 4, such that after l DFS steps the algorithm executes the unlimited model (i.e., k BFS steps) on a problem of size
There are 7^l such sub-problems; therefore it performs the unlimited model 7^l times, on input problems of size
Theorem 7 Let (P, F, BW, L) be the resource parameters of CAPS [3] and let (P′, F′, BW′, L′) be the corresponding parameters of the fault tolerant algorithm that can withstand h simultaneous faults at any given time. Let n be the matrix dimension, f be the total number of faults, and
be the memory size. Then:
Proof. To handle multiple faults the algorithm allocates 7h additional processors, therefore P′=P+7h. Each group of h of them encodes a set of P/7 processors using a maximum distance code. For the analysis below, 7^l and
are computed
therefore
Code Creation.
The algorithm creates code of the input data at the beginning. By Theorem 2 this takes the form of:
This code is preserved throughout all the DFS steps and the first BFS step. The algorithm creates a new code at every BFS step, as in the unlimited memory case (Section 6). The size of the matrices at the first BFS step is
Each BFS step increases the data size by 7/4, as the input shrinks by a factor of 4 but the number of inputs increases by a factor of 7. In the s'th step the input size in each processor is
The code creation is done by a reduce operation, which by Lemma 2 costs:
Summing all the costs of code creations:
Matrix Multiplication.
Aside from code creation and recoveries the algorithm performs matrix multiplication as usual. Therefore the matrix multiplication costs are as in CAPS, namely
(FMM,BWMM,LMM)=(F,BW,L) (13)
Recovery
The algorithm recovers the lost data using the code. The lost data is a linear combination of the blocks of P/7 other processors, and can be recovered by a reduce operation. Each fault incurs recovery of at most O(M) data. By Lemma 2 each such recovery takes (O(M), O(M), O(log P)). In case of a fault during the multiplication step, the algorithm recomputes the lost data by applying the CAPS algorithm to the corresponding multiplication. This takes
Therefore the recovery takes at most:
Total Costs
Summing up Equations 12, 13, and 14:
Limited Memory Model When Trading Off Processors for Communication
Theorem 8 Let (P, F, BW, L) be the resource parameters of CAPS [3] and let (P′, F′, BW′, L′) be the corresponding parameters of a fault tolerant algorithm that can withstand h simultaneous faults at any given time. Let n be the matrix dimension, f be the total number of faults. Then
To handle multiple faults the algorithm allocates
additional processors. Each group of h encodes a set of
processors using a maximum distance code. Therefore
Similar to the seven additional processors case, F′=FCC+FMM+FRe, BW′=BWCC+BWMM+BWRe, and L′=LCC+LMM+LRe.
Code Creation
The algorithm creates code at the beginning of the run on the input data. By Theorem 2 it takes:
This code is preserved during the DFS steps and the first log7 P/d BFS steps. The algorithm creates code every log7 P/d BFS steps. The problem size at the first BFS step is
Each BFS step increases the data size by 7/4, such that in the s step the problem size in each processor is
The code creation is done by a reduce operation, which by Lemma 2 takes:
Summing all the costs of code creations:
Total Costs
Summing up the equations above with Equations 13 and 14 from Section 6.1 yields:
Strassen's Encoding from Coding Theory Perspective
The distance of the inherent code of Strassen's algorithm will be shown below.
The encoding matrices of Strassen's algorithm generate codes of distance 2.
Proof. Strassen's [29] matrix multiplication is a bilinear algorithm with encoding and decoding matrices (the linear operations) U, V, W, as follows:
It is easy to verify that each row in U can be represented by a linear combination of the other rows; therefore U is a generating matrix of a code of distance at least 2. Note that neither the fifth row nor the seventh row of U can be represented as a linear combination of the five other rows, hence the distance is exactly 2. Similarly, V generates a code of distance 2, where rows three and six of V show that the distance is exactly 2.
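These claims can be checked numerically; the sketch below (using the standard Strassen coefficients for U, reproduced here only for the check) confirms that every single row lies in the span of the remaining six, while rows five and seven do not lie in the span of the other five:

    import numpy as np

    U = np.array([[1, 0, 0, 1],    # (A11 + A22)
                  [0, 0, 1, 1],    # (A21 + A22)
                  [1, 0, 0, 0],    # A11
                  [0, 0, 0, 1],    # A22
                  [1, 1, 0, 0],    # (A11 + A12)
                  [-1, 0, 1, 0],   # (A21 - A11)
                  [0, 1, 0, -1]],  # (A12 - A22)
                 dtype=float)

    def in_row_span(rows, v):
        # v lies in the row span of `rows` iff appending it does not raise the rank
        return np.linalg.matrix_rank(np.vstack([rows, v])) == np.linalg.matrix_rank(rows)

    # Distance at least 2: every row is a combination of the other six.
    print(all(in_row_span(np.delete(U, i, axis=0), U[i]) for i in range(7)))   # True

    # Distance exactly 2: neither row 5 nor row 7 (indices 4 and 6) lies in the
    # span of the remaining five rows, since only these two touch the A12 column.
    rest = np.delete(U, [4, 6], axis=0)
    print(in_row_span(rest, U[4]), in_row_span(rest, U[6]))                    # False False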