1. Field of the Invention
Embodiments of the present invention relate to high-performance computing. More particularly, embodiments of the invention relate to solving a large-scale matrix equation using a system that includes reconfigurable computing devices.
2. Description of the Related Art
Many high-performance computing applications in science and engineering, involving fields such as computational fluid dynamics, electromagnetics, geophysical exploration, economics, linear programming, astronomy, chemistry, and structural analysis, require the solution of large matrix equations. The matrix equation may take the form Ax=b, where A is a known n×n matrix, b is a known vector of size n, and x is an unknown vector of size n. Some approaches to finding the solution vector, x, involve the use of reconfigurable computing devices, such as field programmable gate arrays (FPGAs). In some instances requiring extremely high performance, the solution vector may include millions of elements. In such cases, the reconfigurable computing device may not be able to store the solution vector in its own internal memory. Thus, the solution vector may be stored in a memory unit external to the reconfigurable computing device. As a result, the performance of the reconfigurable computing device may be reduced due to the latency involved in accessing the external memory unit to retrieve and update the solution vector.
In other instances, the nature of the problem to be solved, the characteristics of the system that is modeled, or similar circumstances may produce a matrix, A, that is sparsely populated. In other words, a significant portion of the elements of the matrix may have a value of zero. Traditional matrix linear solvers may not recognize this fact and take advantage of it. As a result, performance may be sacrificed by unnecessarily retrieving data and performing calculations.
Embodiments of the present invention solve the above-mentioned problems and provide a distinct advance in the art of high-performance computing. More particularly, embodiments of the invention provide a system that includes reconfigurable computing devices that find the solution to a large-scale matrix equation wherein the solution vector may be extremely large or the matrix may include sparse data.
Various embodiments of the system for solving a large-scale matrix equation involving a matrix, a first vector, and a second vector comprise a plurality of field programmable gate arrays (FPGAs), a matrix memory element, a plurality of matrix memory element controllers, and a plurality of processing elements.
Each FPGA includes a plurality of configurable logic elements and a plurality of configurable storage elements. The matrix memory element may be accessible by the FPGAs and may be configured to store the matrix. The matrix memory element controllers may be formed from the configurable logic elements and the configurable storage elements and may be configured to access the matrix memory element and to supply a plurality of portions of a row of the matrix to the processing elements.
Each processing element generally solves an iteration of an element of the first vector. Each processing element may store a portion of the first vector and at least one element of the second vector. Each processing element includes a matrix-vector product summation unit that calculates the matrix-vector product sum by receiving all of the portions of the first vector from other processing elements and all of the portions of a row from the matrix memory element controller. A linear solver update unit receives the matrix-vector product sum along with an element of the second vector, an inverse of the row diagonal of the matrix, and the element of the first vector that was calculated in a previous iteration. From these values, the linear solver update unit calculates the element of the first vector for the current iteration. The other processing elements calculate the values of the other elements in the first vector. In the next iteration, the current values of the first vector are used to calculate new values of the first vector. The iterations continue until a stopping criterion is met and the solution of the first vector is established.
Another embodiment of the system solves a large-scale matrix equation involving a sparse matrix, a first vector, and a second vector. The system may be substantially similar to the system described above but may further include a vector memory element and a vector memory element controller. Other differences may be described as follows.
The matrix memory element controller may be configured to access the matrix memory element and to supply non-zero data of a row of the matrix. The vector memory element may be accessible by the FPGAs and may be configured to store the first vector. The vector memory element controllers may be formed from the configurable logic elements and the configurable storage elements and may be configured to access the vector memory element and to supply matching elements of the first vector that correspond to the non-zero data of the row of the matrix.
The matrix-vector product summation unit may receive the non-zero data of the row of the matrix and the matching elements of the first vector to calculate the matrix-vector product sum. The linear solver update unit calculates the element of the first vector for the current iteration, as described above. Likewise, the iterations continue until a final solution for the first vector is found.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Other aspects and advantages of the present invention will be apparent from the following detailed description of the embodiments and the accompanying drawing figures.
Embodiments of the present invention are described in detail below with reference to the attached drawing figures, wherein:
The drawing figures do not limit the present invention to the specific embodiments disclosed and described herein. The drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the invention.
The following detailed description of the invention references the accompanying drawings that illustrate specific embodiments in which the invention can be practiced. The embodiments are intended to describe aspects of the invention in sufficient detail to enable those skilled in the art to practice the invention. Other embodiments can be utilized and changes can be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense. The scope of the present invention is defined only by the appended claims, along with the full scope of equivalents to which such claims are entitled.
A system 10 for matrix partitioning in large-scale sparse matrix iterative linear solvers, as constructed in accordance with various embodiments of the current invention, is shown in
The matrix equation may have the form Ax=b, where A is a known n×n matrix (referred to as the “A-matrix”), b is a known vector of size n (referred to as the “b-vector”), and x is an unknown vector of size n (referred to as the “x-vector” or alternatively the “solution vector”). The matrix and the two vectors may all have a total of n rows. For a large scale matrix equation, n may be in the millions. The matrix equation may be expanded as shown in EQ. 1:
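As a sketch under standard notation, the expanded form of Ax=b can be written as follows. The element labels a_ij, x_j, and b_i are illustrative names assumed here; they are not taken from the text.

```latex
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}
```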
One approach to solving the matrix equation for x involves approximating an initial solution for x and then iteratively solving the matrix equation to find successive solutions for x. The approach may include solving EQ. 2 for one element of the x-vector for one iteration, as shown:
$$x_r^{\text{next}} = x_r + \Delta x_r \qquad \text{(EQ. 2)}$$
wherein xr
such that Dr is the diagonal of the row r from the A-matrix, br is the element of the known b-vector in row r, Ar is all the values in row r of the A-matrix, and x is all the values of the x-vector that were calculated in the last iteration of the solution.
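The terms defined above (the row diagonal, the b-vector element, the row values, and the previous x-vector) are consistent with a Jacobi-style update. As a sketch under that assumption, inferred from the listed terms rather than quoted from the text, the incremental change may be written:

```latex
\Delta x_r = \frac{1}{D_r}\left( b_r - \sum A_r \cdot x \right)
```

Because the sum in this form runs over the whole row, including the diagonal entry, adding Δxr to xr gives the classical Jacobi step, (br minus the off-diagonal sum) divided by Dr.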
In applying the iterative process, the term xr
The iterative equations of EQ. 2 and EQ. 3 may include a linear solver update component and a matrix-vector product summation component. The term ΣAr·x forms the matrix-vector product summation component. The rest of EQ. 3 and the addition portion of EQ. 2 form the linear solver update component, as the incremental change, Δxr, is added to xr to update its value.
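The split into two components can be illustrated with a minimal software sketch. It assumes the incremental change takes the Jacobi form Δxr = (1/Dr)(br − ΣAr·x), which is inferred from the terms defined above; the function names and the small 2×2 test system are hypothetical and chosen only for illustration.

```python
def matvec_sum(row, x):
    """Matrix-vector product summation component: sum of A_r[j] * x[j]."""
    return sum(a * xj for a, xj in zip(row, x))


def solver_update(x_r, b_r, inv_d_r, mv_sum):
    """Linear solver update component (assumed Jacobi form):
    x_r_next = x_r + (1/D_r) * (b_r - sum)."""
    return x_r + inv_d_r * (b_r - mv_sum)


def jacobi_sweep(A, b, x):
    """One iteration: every element is updated from the previous x only."""
    return [
        solver_update(x[r], b[r], 1.0 / A[r][r], matvec_sum(A[r], x))
        for r in range(len(b))
    ]


# Hypothetical diagonally dominant system, so the iteration converges.
A = [[4.0, 1.0], [2.0, 5.0]]
b = [9.0, 12.0]
x = [0.0, 0.0]  # initial approximation
for _ in range(50):
    x = jacobi_sweep(A, b, x)
```

Each call to `solver_update` corresponds to the work of one processing element for one row, and each call to `jacobi_sweep` corresponds to one iteration over all rows.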
The FPGA 14 generally provides the resources to implement the partitioning memory controllers 18, the processing elements 20, and the inter FPGA links 22. The FPGA 14, as seen in
The FPGA 14 may be programmed in a generally traditional manner using electronic programming hardware that couples to standard computing equipment, such as a workstation, a desktop computer, or a laptop computer. The functional description or behavior of the circuitry may be programmed by writing code using a hardware description language (HDL), such as very high-speed integrated circuit hardware description language (VHDL) or Verilog, which is then synthesized and/or compiled to program the FPGA 14. Alternatively, a schematic of the circuit may be drawn using a computer-aided drafting or design (CAD) program, which is then converted into FPGA 14 programmable code using electronic design automation (EDA) software tools, such as a schematic-capture program. The FPGA 14 may be physically programmed or configured using FPGA programming equipment, as is known in the art.
The matrix memory element 16 generally stores all the values of the A-matrix. The matrix memory element 16 may receive the values of the A-matrix from an external source, and may be updated either in full or in part each time a solution for the x-vector is found. The matrix memory element 16 may also receive requests and/or control signals from the partitioning memory controllers 18 to send A-matrix data to each partitioning memory controller 18 to be distributed to the processing elements 20. The matrix memory element 16 may include a matrix memory data output 28, through which A-matrix data is transmitted. The matrix memory data output 28 may include serial lines, multi-bit parallel busses, or combinations thereof, and may follow a standardized protocol, such as PCI Express, or the like.
The matrix memory element 16 may include random-access memory (RAM) elements, multi-port RAM elements, read-only memory (ROM) elements, programmable ROM (PROM) elements, buffers, registers, flip-flops, floppy disks, hard-disk drives, optical disks, or combinations thereof.
In various embodiments, the number of elements in the x-vector may be too large for the x-vector to be stored in a single FPGA 14. In such embodiments, the x-vector may be divided into a plurality of subsets, xs, such that the x-vector is the concatenation of xsi, where i is an index with a range from 1 to m, m being the total number of subsets. The x-vector may be divided into equal-sized subsets or the size of each subset may vary. The x-vector may also be divided such that at least one subset of the x-vector, xs, is stored on an FPGA 14, while in certain embodiments, more than one subset is stored on an FPGA 14.
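A minimal sketch of such a division into contiguous subsets is given below. The near-equal sizing rule and the name `partition` are assumptions made for illustration, and the code indexes subsets from 0 whereas the text indexes them from 1 to m.

```python
def partition(x, m):
    """Split x into m contiguous subsets whose sizes differ by at most one."""
    base, extra = divmod(len(x), m)
    subsets, start = [], 0
    for i in range(m):
        size = base + (1 if i < extra else 0)  # first `extra` subsets get one more
        subsets.append(x[start:start + size])
        start += size
    return subsets


x = list(range(10))
xs = partition(x, 3)

# Concatenating the subsets in index order recovers the x-vector.
rebuilt = [value for subset in xs for value in subset]
```

Unequal subset sizes, which the text also allows, would only require replacing the sizing rule.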
The processing element 20 generally calculates one element of the x-vector, xr
The processing element 20, as shown in
The x-vector subset tracking and storage element 30 generally tracks and stores the subsets of the x-vector, xsi. The x-vector subset tracking and storage element 30 may receive an x-vector subset data input 48 from the communication destination element 44 that includes the x-vector subset data as well as an index tag to identify which subset is being received. The x-vector subset tracking and storage element 30 may receive all the subsets of the x-vector during one iteration. The x-vector subset tracking and storage element 30 may also transmit an x-vector subset data output 50 as requested by the matrix-vector product summation unit 38 or at regular intervals.
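The tracking behavior can be sketched in software as a small container keyed by the index tag. The class name `SubsetTracker` and its methods are hypothetical; the sketch only models the bookkeeping, not the hardware interfaces.

```python
class SubsetTracker:
    """Stores tagged x-vector subsets and reports when all m have arrived."""

    def __init__(self, m):
        self.m = m
        self.subsets = {}

    def receive(self, index_tag, data):
        # The index tag identifies which subset is being received,
        # so subsets may arrive in any order.
        self.subsets[index_tag] = list(data)

    def complete(self):
        return len(self.subsets) == self.m

    def get(self, index_tag):
        return self.subsets[index_tag]

    def start_iteration(self):
        # Discard the previous iteration's subsets before new ones arrive.
        self.subsets.clear()


tracker = SubsetTracker(m=3)
tracker.receive(2, [5.0, 6.0])
tracker.receive(0, [1.0, 2.0])
done_early = tracker.complete()
tracker.receive(1, [3.0, 4.0])
```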
The x-vector subset tracking and storage element 30 may be formed from combinational logic gates, e.g., AND, OR, and NOT, control logic blocks such as finite state machines (FSMs), as well as configurable storage elements 26, such as first-in first-out registers (FIFOs), single-port or multi-port RAM elements, memory cells, registers, latches, flip-flops, combinations thereof, and the like. The x-vector subset tracking and storage element 30 may also include built-in components of the FPGA 14, and may further be implemented through one or more code segments of an HDL.
The x-vector row storage element 32 generally stores the element of the x-vector for row r, xr. The x-vector row storage element 32 may receive an x-vector row data input 52 from the communication destination element 44 and may transmit an x-vector row data output 54 to the linear solver update unit 40. The x-vector row storage element 32 may receive the x-vector row data input 52 at the end of each iteration. The x-vector row storage element 32 may supply the x-vector row data output 54 as necessary.
The x-vector row storage element 32 may be formed from configurable storage elements 26, such as FIFOs, single-port or multi-port RAM elements, memory cells, registers, latches, flip-flops, combinations thereof, and the like. The x-vector row storage element 32 may also include built-in components of the FPGA 14, and may further be implemented through one or more code segments of an HDL.
The b-vector storage element 34 generally stores the element of the b-vector for row r, br. The b-vector storage element 34 may transmit a b-vector row data output 56 to the linear solver update unit 40, as necessary. The b-vector storage element 34 may receive the b-vector row data from an external source or other components of the system once a solution for the x-vector is found.
The b-vector storage element 34 may be formed from configurable storage elements 26, such as FIFOs, single-port or multi-port RAM elements, memory cells, registers, latches, flip-flops, combinations thereof, and the like. The b-vector storage element 34 may also include built-in components of the FPGA 14, and may further be implemented through one or more code segments of an HDL.
The row diagonal storage element 36 generally stores the data for a given row of the A-matrix. The row diagonal storage element 36 may receive A-matrix row data for the row, r, from the matrix memory element 16. From the row data, the row diagonal storage element 36 may calculate the row diagonal, Dr, and the inverse of the row diagonal, 1/Dr. Thus, the row diagonal storage element 36 may supply a row diagonal inverse output 58 to the linear solver update unit 40 as necessary.
The row diagonal storage element 36 may be formed from configurable logic elements 24 such as combinational logic gates, as well as adders, multipliers, shift registers, combinations thereof, and the like. The row diagonal storage element 36 may also be formed from configurable storage elements 26, such as FIFOs, single-port or multi-port RAM elements, memory cells, registers, latches, flip-flops, combinations thereof, and the like. The row diagonal storage element 36 may also include built-in components of the FPGA 14, and may further be implemented through one or more code segments of an HDL.
The matrix-vector product summation unit 38 generally calculates the matrix-vector product summation, ΣAr·x, for a given row, r, of the A-matrix. The matrix-vector product summation unit 38 may receive the subset of x-vector data, xsi, through the x-vector subset data output 50 as well as the corresponding subset of the row data for the A-matrix, Arsi, from the matrix row data input 46. With these two sets of data, the matrix-vector product summation unit 38 may calculate a subset of the matrix-vector product sum, ΣArsi·xsi. The matrix-vector product summation unit 38 may then receive another subset of the x-vector data, xsi, and another subset of the A-matrix data, Arsi, to calculate another subset of the sum. The earlier calculated subset of the sum may be temporarily stored to be finally added later or may be successively added to the ongoing sum. Once all the subsets of the matrix-vector product summation have been added, the matrix-vector product summation unit 38 may transmit the total sum, ΣAr·x, as a matrix-vector product summation output 60 to the linear solver update unit 40.
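The subset-by-subset accumulation can be sketched as follows, using the option in which each partial sum is successively added to the ongoing sum. The function names, the sample row, and the subset boundaries are hypothetical values chosen for illustration.

```python
def partial_sum(a_rs, xs):
    """Sum of products for one subset of the row and the matching x subset."""
    return sum(a * v for a, v in zip(a_rs, xs))


def row_sum_by_subsets(row_subsets, x_subsets):
    """Accumulate the partial sums in the order the subsets are received."""
    total = 0.0
    for a_rs, xs in zip(row_subsets, x_subsets):
        total += partial_sum(a_rs, xs)
    return total


row = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 1.0, 2.0, 2.0, 3.0]
bounds = [(0, 2), (2, 4), (4, 5)]  # three contiguous subsets of unequal size
row_subsets = [row[s:e] for s, e in bounds]
x_subsets = [x[s:e] for s, e in bounds]

total = row_sum_by_subsets(row_subsets, x_subsets)
direct = sum(a * v for a, v in zip(row, x))
```

The accumulated total matches the sum computed over the undivided row, which is why the row can be processed one subset at a time.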
The matrix-vector product summation unit 38 may be formed from configurable logic elements 24 such as combinational logic gates, as well as adders, multipliers, shift registers, accumulators, multiply-accumulate units (MACs), combinations thereof, and the like. The matrix-vector product summation unit 38 may also be formed from configurable storage elements 26, such as FIFOs, single-port or multi-port RAM elements, memory cells, registers, latches, flip-flops, combinations thereof, and the like. The matrix-vector product summation unit 38 may also include built-in components of the FPGA 14, and may further be implemented through one or more code segments of an HDL.
The linear solver update unit 40 generally calculates xr
The linear solver update unit 40 may be formed from configurable logic elements 24 such as combinational logic gates, as well as adders, multipliers, shift registers, accumulators, multiply-accumulate units (MACs), combinations thereof, and the like. The linear solver update unit 40 may also be formed from configurable storage elements 26, such as FIFOs, single-port or multi-port RAM elements, memory cells, registers, latches, flip-flops, combinations thereof, and the like. The linear solver update unit 40 may also include built-in components of the FPGA 14, and may further be implemented through one or more code segments of an HDL.
The communication source element 42 generally broadcasts data from one processing element 20 to all the other processing elements 20. The communication source element 42 may transmit the x-vector solution output 62. The communication source element 42 may also transmit the subset of the x-vector data, xs.
The communication destination element 44 generally receives the data that is broadcast from all the other processing elements 20. The communication destination element 44 may receive xr
The communication source element 42 and the communication destination element 44 may be formed from configurable logic elements 24 such as combinational logic gates, multiplexers, demultiplexers, crossbar or crossover or crosspoint switches, combinations thereof, and the like. The communication source element 42 and the communication destination element 44 may also be formed from configurable storage elements 26, such as FIFOs, single-port or multi-port RAM elements, memory cells, registers, latches, flip-flops, combinations thereof, and the like. The communication source element 42 and the communication destination element 44 may also include built-in components of the FPGA 14, and may further be implemented through one or more code segments of an HDL. In addition, the communication source element 42 and the communication destination element 44 may include an architecture such as the one described in “SWITCH-BASED PARALLEL DISTRIBUTED CACHE ARCHITECTURE FOR MEMORY ACCESS ON RECONFIGURABLE COMPUTING PLATFORMS”, U.S. patent application Ser. No. 11/969,003, filed Jan. 3, 2008, which is herein incorporated by reference in its entirety.
Referring to
The partitioning memory controller 18 may be formed from configurable logic elements 24 such as combinational logic gates, FSMs, combinations thereof, and the like. The partitioning memory controller 18 may also be formed from configurable storage elements 26, such as FIFOs, single-port or multi-port RAM elements, memory cells, registers, latches, flip-flops, combinations thereof, and the like. The partitioning memory controller 18 may also include built-in components of the FPGA 14, and may further be implemented through one or more code segments of an HDL.
The inter FPGA link 22 generally allows communication from the components, such as the processing elements 20, on one FPGA 14 to the components on another FPGA 14. The inter FPGA link 22 may buffer the data and add packet data, serialize the data, or otherwise prepare the data for transmission.
The inter FPGA link 22 may include buffers in the form of flip-flops, latches, registers, SRAM, DRAM, and the like, as well as shift registers or serialize-deserialize (SERDES) components. The inter FPGA link 22 may be a built-in functional FPGA block or may be formed from one or more code segments of an HDL or one or more schematic drawings. The inter FPGA link 22 may also be compatible with or include Gigabit Transceiver (GT) components, as are known in the art. The inter FPGA link 22 may receive data from the communication source element 42 and may transmit data to the communication destination element 44. The inter FPGA link 22 may couple to an inter FPGA bus 66 to communicate with another FPGA 14.
The inter FPGA bus 66 generally carries data from one FPGA 14 to another FPGA 14 and is coupled with the inter FPGA link 22 of each FPGA 14. The inter FPGA bus 66 may be a single-channel serial line, wherein all the data is transmitted in serial fashion, a multi-channel (or multi-bit) parallel link, wherein different bits of the data are transmitted on different channels, or variations thereof, wherein the inter FPGA bus 66 may include multiple lanes of bi-directional data links. The inter FPGA bus 66 may be compatible with GTP components included in the inter FPGA link 22. The inter FPGA link 22 and the inter FPGA bus 66 may also be implemented as disclosed in U.S. Pat. No. 7,444,454, issued Oct. 28, 2008, which is hereby incorporated by reference in its entirety.
In other embodiments of the system 10, the A-matrix may be a sparse data matrix, in which a significant portion of the elements of the A-matrix may have a value of zero. Thus, the calculation of the matrix-vector product summation, ΣAr·x, may be affected. Since Ar may include only a small percentage of non-zero elements, only the corresponding elements of the x-vector need to be retrieved for the summation calculation. Furthermore, since the indices of the non-zero elements of the A-matrix may be different for each row, the corresponding elements of the x-vector required for the summation calculation may differ as well. As a result, in another embodiment, the system 10 may include a vector memory element 68 and a plurality of lookahead memory controllers 70, as shown in
The vector memory element 68 generally stores all the values of the x-vector. The vector memory element 68 may receive the values of an initial approximation for the x-vector from an external source, and may be updated by the lookahead memory controller 70 after every iteration of the calculation of xr
The vector memory element 68 may include RAM elements, multi-port RAM elements, ROM elements, PROM elements, buffers, registers, flip-flops, floppy disks, hard-disk drives, optical disks, or combinations thereof.
The lookahead memory controller 70 generally retrieves the elements of the x-vector that correspond to the non-zero values for a given row, r, of the A-matrix. For example, if the non-zero elements for a given row, r, of the A-matrix are located in columns 1, 10, and 50, then the lookahead memory controller 70 may retrieve elements 1, 10, and 50 from the vector memory element 68. Thus, a subset of the x-vector may be created for each row, r, of the A-matrix. However, in contrast to the x-vector subsets discussed above, which generally included contiguous portions of the x-vector and were the same for each row, the x-vector subsets of the current embodiments include only those elements corresponding to non-zero elements of the A-matrix and may be different for each row. The lookahead memory controller 70 may then transmit the appropriate subset of the x-vector, xs, for row r to the processing element 20 that is calculating xr
The lookahead memory controller 70 may be formed from configurable logic elements 24 such as combinational logic gates, control logic blocks such as FSMs, combinations thereof, and the like. The lookahead memory controller 70 may also be formed from configurable storage elements 26, such as FIFOs, single-port or multi-port RAM elements, memory cells, registers, latches, flip-flops, combinations thereof, and the like. The lookahead memory controller 70 may also include built-in components of the FPGA 14, and may further be implemented through one or more code segments of an HDL.
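The retrieval performed by the lookahead memory controller 70 can be sketched as a gather by column index. The function name `lookahead_gather` is hypothetical, and the column numbers are treated as 0-based list indices, which is an assumption; the text does not state the indexing base.

```python
def lookahead_gather(x, nonzero_columns):
    """Return only the x-vector elements matching a row's non-zero columns."""
    return [x[c] for c in nonzero_columns]


# Hypothetical x-vector in which element i holds the value i,
# so the gathered values are easy to check by eye.
x = [float(i) for i in range(100)]

columns_row_r = [1, 10, 50]    # non-zero columns of one row
columns_row_s = [0, 7]         # a different row has a different pattern

xs_r = lookahead_gather(x, columns_row_r)
xs_s = lookahead_gather(x, columns_row_s)
```

The two rows yield different subsets of the same x-vector, in contrast to the contiguous subsets of the dense case, which are the same for every row.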
The system 10 may function substantially the same as the system 10 of
The components of another embodiment of the processing element 20, shown in
The processing element 20 may further include a sparse matrix-vector product summation unit 84, which may be structurally equivalent to the matrix-vector product summation unit 38 as described above. However, the sparse matrix-vector product summation unit 84 may compute the matrix-vector product summation, ΣAr·x, wherein Ar includes only the non-zero elements of row r and x includes only the corresponding x-vector elements, xs.
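A minimal sketch of the sparse summation is given below. It assumes the non-zero values of a row are supplied together with their column indices; that storage layout and the function name `sparse_row_sum` are illustrative choices, not details taken from the text.

```python
def sparse_row_sum(values, columns, x):
    """Sum of products over the non-zero entries of one row only."""
    return sum(a * x[c] for a, c in zip(values, columns))


dense_row = [0.0, 2.0, 0.0, 0.0, -1.0, 0.0]
x = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# Keep only the non-zero entries and remember where they came from.
columns = [c for c, a in enumerate(dense_row) if a != 0.0]
values = [dense_row[c] for c in columns]

sparse_total = sparse_row_sum(values, columns, x)
dense_total = sum(a * v for a, v in zip(dense_row, x))
```

Both totals agree, while the sparse version performs two multiplications instead of six for this row.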
Embodiments of the system 10 as shown in
Each processing element 20 may proceed to calculate each element of the x-vector, xr
Once all the xr
Embodiments of the system 10 as shown in
Likewise, the iterations to find a solution for the x-vector continue until some stopping criterion is met and the solution for the x-vector is then transmitted to an external destination, as described above.
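The overall loop can be sketched as follows for a sparse system. The text does not specify the stopping criterion, so the largest change between successive iterations is used here as an assumption, along with an assumed Jacobi-form update; the function name `solve`, the tolerance, and the sample system are hypothetical.

```python
def solve(rows, b, x0, tol=1e-10, max_iters=1000):
    """Iterate until the largest element change falls below tol.

    rows[r] is a list of (column, value) pairs for the non-zeros of row r.
    """
    x = list(x0)
    for iteration in range(1, max_iters + 1):
        x_next = []
        for r, row in enumerate(rows):
            diag = dict(row)[r]                      # row diagonal, D_r
            mv_sum = sum(a * x[c] for c, a in row)   # non-zero entries only
            x_next.append(x[r] + (b[r] - mv_sum) / diag)
        change = max(abs(new - old) for new, old in zip(x_next, x))
        x = x_next
        if change < tol:  # assumed stopping criterion
            break
    return x, iteration


# Hypothetical diagonally dominant tridiagonal system with solution [1, 1, 1].
rows = [
    [(0, 4.0), (1, -1.0)],
    [(0, -1.0), (1, 4.0), (2, -1.0)],
    [(1, -1.0), (2, 4.0)],
]
b = [3.0, 2.0, 3.0]
x, iterations = solve(rows, b, [0.0, 0.0, 0.0])
```

Other criteria, such as a residual norm or a fixed iteration count, could replace the change test without altering the rest of the loop.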
Although the invention has been described with reference to the embodiments illustrated in the attached drawing figures, it is noted that equivalents may be employed and substitutions made herein without departing from the scope of the invention as recited in the claims.