A sparse matrix (also referred to as a sparse array) is a data structure in which most of the elements of the matrix are zero. Sparse matrices are often used in computational domains where the concept of sparsity is applicable, such as network theory, numerical analysis, and scientific computing, which typically have a low density of significant data or connections. For example, sparse matrices can be useful for large-scale applications where dense matrices are intractable, such as solving partial differential equations. Sparse matrix-vector multiplication (SpMV) is a mathematical operation, often used in scientific and engineering applications, that involves the multiplication of a sparse matrix by a dense vector. SpMV is a fundamental kernel employed in high performance computing, machine learning, and scientific computation, and is also heavily utilized in graph analytics. Furthermore, SpMV is often considered a building block for many other computational algorithms, such as graph algorithms, graphics processing, numerical analysis, and conjugate gradients.
However, a single, uniform approach to formatting a sparse matrix may not be suitable for all sparse matrices (e.g., matrices having different sparsity patterns) or most efficient for all applications that utilize sparse matrices, including SpMV. Therefore, it may be beneficial to represent a sparse matrix (e.g., the input matrix for a SpMV operation) in a manner that improves the performance of processing sparse matrices, and particularly SpMV, in order to optimize processing time and space usage on different computer platforms.
The present disclosure, in accordance with one or more various implementations, is described in detail with reference to the following figures. The drawings are provided for purposes of illustration only and merely depict typical or example implementations. These drawings are provided to facilitate the reader's understanding of various implementations and shall not be considered limiting of the breadth, scope, or applicability of the present disclosure. It should be noted that for clarity and ease of illustration these drawings are not necessarily made to scale.
The figures are not intended to be exhaustive or to limit various implementations to the precise form disclosed. It should be understood that various implementations can be practiced with modification and alteration.
Sparse matrix-vector multiplications (SpMV) are widely used in many scientific computations, such as graph algorithms, graphics processing, numerical analysis, and conjugate gradients. The performance and efficiency of SpMV are greatly affected by the sparse matrix representation utilized. The disclosed embodiments provide a distinct column-partitioned format for representing the sparse matrix, which improves cache utilization, improves scalability, reduces latency, and leverages the advantages of distributed computer processing, particularly for SpMV calculations.
As referred to herein, a sparse matrix is a matrix in which a large number (e.g., a majority) of the elements within the matrix are zero. Thus, sparse matrices have very few non-zero values spread throughout. By contrast, if most of the elements of the matrix are non-zero, then the matrix is considered a dense matrix. In many computing platforms, processing a large sparse matrix, with few non-zero values, is inefficient due to the overhead in space usage and computational processing associated with storing and operating on the many zero values. These drawbacks related to sparse matrices are motivating continued development of specialized data layouts (or formats) and algorithms for efficiently transferring and manipulating the data in a manner that leverages the advantages of the sparse structure of the sparse matrix.
The sparsity of a matrix is defined as the number of non-zero elements (NZ) divided by the number of rows in the matrix. As used herein, the term sparsity is not limited to the abovementioned definition and can also refer to the more conceptual sparsity of a matrix. The Sparse Matrix Vector Multiplication (SpMV) benchmark computes a function that is represented mathematically below as:
Y = Ax (1)

where A is the M×N sparse matrix (the multiplicand), x is the dense column multiplier vector (DCMV), and Y is the dense column result vector (DCRV).
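For illustration only, the following is a minimal sketch of equation (1) using plain Python lists and a naive row-by-row dot product; the toy matrix, vector, and function name are placeholders rather than anything taken from the disclosure.

```python
# Minimal sketch of equation (1), Y = A x, using plain Python lists.
# The matrix, vector, and function name are illustrative placeholders.

def spmv_dense(A, x):
    """Multiply an M x N matrix A (list of rows) by an N x 1 vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

A = [
    [5, 0, 0, 2],   # most entries are zero: a (very small) sparse matrix
    [0, 0, 3, 0],
    [0, 7, 0, 0],
]
x = [10, 8, 6, 4]

print(spmv_dense(A, x))  # [58, 18, 56]
```

Note that this naive form visits every zero element, which is exactly the inefficiency that the specialized sparse layouts discussed below are intended to avoid.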
There are several traditional approaches to representing a sparse matrix with a more efficient layout (or format) in order to address the problems associated with sparse matrices (e.g., processing and space inefficiencies). Examples of specialized layouts for a sparse matrix that are conventionally used include, but are not limited to: Compressed Sparse Row (CSR); Compressed Sparse Column (CSC); Block Sparse Row (BSR); and the like. Although these conventional layouts for a sparse matrix, such as CSR, can be parallelized, these layouts are still subject to limitations, bottlenecks, and inefficiencies that prevent efficient scalability even when implemented across a distributed computing platform. For instance, the CSR layout may experience scalability problems as the system and problem sizes increase. For example, in the CSR layout it can be difficult to partition groups of rows such that processing entities (PEs) have the same number of transfers (e.g., transfers of data during computational processing). Additionally, there are other considerations associated with the CSR layout, including: the row index vector must be parsed to discern how many non-zero values are in each row; which values of the dense column vector x will be needed cannot be determined a priori; and the entire x-vector is typically read, which places extra pressure on the PE memory capacity. In order to address these and other drawbacks associated with traditional layouts, the disclosed embodiments include an enhanced specialized layout for representing a sparse matrix, referred to as a column-partition sparse matrix. The column-partition sparse matrix (CPSM) format, which is described in greater detail herein, provides improvements over the aforementioned traditional layouts. Particularly, the disclosed CPSM format is an enhanced layout that can be used for matrix operations, where the data is distinctly arranged by column-partitioning the sparse matrix and partitioning the dense multiplier vector in a manner that improves scalability and computational efficiency when performing sparse matrix vector multiplication (SpMV).
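For comparison with the CPSM format described below, the following is a brief sketch of the conventional CSR layout referenced above, applied to the same toy matrix; the array names (values, col_idx, row_ptr) follow common CSR conventions rather than the disclosure.

```python
# Sketch of the conventional CSR layout for the same toy matrix:
#   row 0: (c0, 5), (c3, 2)   row 1: (c2, 3)   row 2: (c1, 7)
values  = [5, 2, 3, 7]        # non-zero values, row by row
col_idx = [0, 3, 2, 1]        # column index of each non-zero value
row_ptr = [0, 2, 3, 4]        # row i occupies values[row_ptr[i]:row_ptr[i+1]]

def spmv_csr(values, col_idx, row_ptr, x):
    """CSR-based SpMV: only the non-zero elements are visited, but the
    full dense vector x must be available to every worker."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

print(spmv_csr(values, col_idx, row_ptr, [10, 8, 6, 4]))  # [58, 18, 56]
```

As the comment in the sketch notes, CSR skips the zero elements but still requires the entire dense vector at every worker, which is one of the limitations the CPSM format addresses.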
Referring now to
Initially (e.g., prior to arranging data of the sparse matrix 120 and DCMV 130 into the CSR layout),
Moreover, the sparse matrix 120 can be re-formatted in accordance with a traditional layout in order to realize some improved efficiency as previously described. In the example of
In addition,
In operation, the cores of each PE 111-113 can parallelize decoding and processing the sparse matrix 120 structure, multiply-accumulating into elements of the DCRV 150, and writing elements of the DCRV 150 as each row completes. As one of the partitions 121-123 of the sparse matrix 120 is processed by the corresponding PE's 111-113 processing core, the next of the partitions 121-123 can be prefetched from an external memory of the system (not shown) to improve execution efficiency. Because the computing system processes the larger sparse matrix 120 as smaller partition blocks 121-123, as each partition 121-123 is completed the corresponding partial DCRV 150 can be sent back to the external memory. Processing for the SpMV benchmark operation continues until all of the partition blocks 121-123 have been processed and the final result of the SpMV operation, which is the full DCRV 150, is generated and stored.
The structure of the CSR layout data 140 strongly influences processing (e.g., while performing the SpMV computation) to progress sequentially through the columns of each sequential row, respectively for each of the partitions 141-143 within the CSR layout data 140. The sparsity of non-zero columns in a row provides no expectation of which elements in the DCMV 130 need to be used at any given time during the SpMV computation. Consequently, while the sparse matrix 120 and the DCMV 130 do achieve some efficiency in being partitioned across the system, the entire DCMV 130 needs to be distributed to every endpoint PE and persist at every endpoint PE for the multiply-accumulate operations to proceed in the order that the matrix values are accessed. Also,
Thus, there is a full data representation of the DCMV 130 stored respectively at each of the PEs 111-113 as a portion of the SpMV computation that is also distributed to each of the PEs 111-113 to be performed in parallel. However, sparsity of the sparse matrix 120 implies that there will be a low probability of reuse for recently cached cache lines that contain multiple adjacent data from the respective DCMVs 130a-130c. For example, there is a low probability that each of the elements within the DCMV 130a will be reused for the specific SpMV computations executed on PE 111 (corresponding to the sparse matrix elements in the CSR layout data 140 on PE 111), which leads to cache misses during processing at the PE 111. This potential for cache misses also corresponds to the distributed processing at the other PEs, as data that may be needed for a specific computation on a PE, such as PE 111, may be currently stored on another PE that is executing its portion of processing. Thus, it can be assumed in the example of
The disclosed column-partition sparse matrix (CPSM) format, as described in greater detail in reference to
In the example of
As previously described, storing all the elements (e.g., majority of which are zero elements) of a sparse matrix can lead to a wastage of resources, such as processing and memory. Accordingly, reorganizing the sparse matrix 210 into the CPSM format involves initially partitioning the sparse matrix 210 based on columns in a manner that reduces the amount of data that is distributed to each PE for parallelized processing. As illustrated in
In an example, the size for each column partition (or column range) is a defined static parameter, for example being defined prior to deployment of the computer system, where the parameter is based on the specifications for one or more of the computer's resources. In an embodiment, the size of a column partition is defined as a function of the L2 cache size utilized by the computer system (or a PE) in order to reduce the number of cache misses experienced during computational processing. In some cases, the size for a column partition can be defined using a mathematical relationship with respect to the L2 cache size, such as a percentage, rate, magnitude, fraction, factor, multiple, amount, proportion, or other quantifiable relationship/algorithm. For example, the computer system implementing the sparse matrix 210 can have a predefined default L2 cache size of 8 MB that is utilized for each of the PEs in the system. Thus, the size for the column partitions 241-246 is defined as a static parameter that is set with respect to the known L2 cache size of the PEs, for instance being approximately 800 KB, which is quantifiably a factor of 10 smaller than the default L2 cache size. Generally, the defined size for a column partition may increase proportionally to the L2 cache size used by the computer system, where column partitions of a smaller size may be more optimal for substantially smaller cache sizes (e.g., a 4 MB cache) and column partitions of a larger size may be more optimal for substantially larger L2 cache sizes (e.g., a 128 MB cache). The defined size for the column partitions may be based on other specifications or characteristics of the various resources of the computer system implementing the SpMV (or other matrix operations/algorithms) in addition to (or in lieu of) the L2 cache size, such as processor speed, memory size, clock speed, L1 cache size, L3 cache size, number of PEs dedicated for compute processing, and the like.
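One way the static sizing described above might be computed is sketched below; the 8 MB default, the one-tenth budget, the density-based estimate, and 8-byte elements are assumptions for illustration rather than values taken from the disclosure.

```python
# Hypothetical static sizing of a column partition from the L2 cache size.
# The 8 MB cache, the one-tenth budget, and 8-byte (double-precision)
# elements are assumptions used only for illustration.

L2_CACHE_BYTES = 8 * 1024 * 1024   # assumed default L2 cache size per PE
PARTITION_FRACTION = 1 / 10        # partition budget roughly 10x smaller than L2
BYTES_PER_ELEMENT = 8              # assumed double-precision matrix values

def columns_per_partition(rows_in_matrix, density):
    """Rough number of columns whose expected non-zero data fits the budget."""
    budget_bytes = L2_CACHE_BYTES * PARTITION_FRACTION        # ~800 KB
    est_bytes_per_column = rows_in_matrix * density * BYTES_PER_ELEMENT
    return max(1, int(budget_bytes // est_bytes_per_column))

# A 1,000,000-row matrix where ~0.1% of entries are non-zero:
print(columns_per_partition(rows_in_matrix=1_000_000, density=0.001))  # 104
```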
Alternatively, the size for column partitions can be a dynamic parameter, for instance being selected at run-time by the computer system. As an example, utilizing dynamic sizing for column partitions may be suitable in scenarios where the sparsity of individual columns is found to have significant variance relative to the overall matrix sparsity. In this case, dynamically allocating a variable number of columns to each respective partition can allow the total number of non-zero elements in each partition to be nearly equal. Thus, as a dynamic parameter, the column partition size can be based on variables that may change substantially for different SpMV operations, such as: the size of the sparse matrix; the size of the dense vector; the sparsity of the sparse matrix; the sparsity of the individual columns of the sparse matrix; the complexity of the problem (e.g., 2n matrix scaling factor); the total data transferred; and the like. Moreover, in some embodiments, all of the column partitions have a uniform size (e.g., the sizes for each of the column partitions are the same/equal). In some embodiments, the size can vary amongst multiple column partitions, where each column partition that is derived from dividing a sparse matrix has its own respective size (which may be different from the sizes of the other column partitions). For example, it may be desirable for column partitions to be similar in size even where the distribution of non-zero values is non-uniform. The overhead of encoding non-sequential columns may be acceptable in order to aggregate columns with a similar number of non-zero elements.
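The dynamic alternative can be sketched as a greedy pass that assigns a variable number of columns to each partition so that the per-partition non-zero counts come out roughly equal; the function name and the greedy rule are illustrative assumptions, not details taken from the disclosure.

```python
# Sketch of the dynamic alternative described above: assign a variable
# number of columns to each partition so that every partition holds a
# roughly equal count of non-zero elements. Names are illustrative.

def dynamic_column_partitions(nnz_per_column, num_partitions):
    """Return a list of (first_col, last_col_exclusive) ranges whose
    non-zero counts are approximately balanced."""
    total_nnz = sum(nnz_per_column)
    target = total_nnz / num_partitions
    ranges, start, running = [], 0, 0
    for col, nnz in enumerate(nnz_per_column):
        running += nnz
        last_partition = len(ranges) == num_partitions - 1
        if running >= target and not last_partition:
            ranges.append((start, col + 1))
            start, running = col + 1, 0
    ranges.append((start, len(nnz_per_column)))
    return ranges

# Columns with uneven sparsity still land in roughly balanced partitions:
print(dynamic_column_partitions([4, 8, 0, 2, 6, 1, 5, 3, 0, 7], 3))
# [(0, 2), (2, 7), (7, 10)]  with 12, 14, and 10 non-zeros respectively
```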
As previously described, another distinct aspect of the approach to reorganize a sparse matrix into the CPSM format involves also partitioning the dense multiplier vector (shown as the DCMV 230). In contrast to other conventional sparse matrix layouts where the full dense vector is replicated at every PE involved in the SpMV operation (e.g., CSR layout), restructuring data into the disclosed CPSM format also includes partitioning the dense vector in addition to column-partitioning the sparse matrix. Accordingly,
In the embodiments, the number of distinct blocks, or vector partitions, that are generated from partitioning the dense vector is based on the number of column partitions that are available. Restated, the dense vector is separated into a number of vector partitions such that each vector partition has a corresponding column partition, where the vector partitions and column partitions can be distributed to all of the PEs in the computer system that are dedicated to compute processing. Specifically, in the example of
In the example of
Particularly, in
Additionally,
sizeCPG <= (sizeMM)/2 (2)

where sizeCPG is the size of a column partition group and sizeMM is the size of the main memory. For instance, the column partition groups are created such that, based on the size of main memory, the column partitions 241-246 can be distributed across all of the PEs of a computer system that are available for compute processing of the SpMV operation.
In an embodiment, the number of individual column partitions that are collected to form a column partition group may be based on other CPSM format-based parameters, or on specifications and/or characteristics of the various resources of the computer system implementing the SpMV (or other matrix operations/algorithms), in addition to (or in lieu of) the size of main memory, including: the number of PEs in the computer system that are dedicated to compute processing; processor speed; memory size; clock speed; L1 cache size; L2 cache size; L3 cache size; the number of vector partitions; the number of column partitions; and the like.
As a general description, SpMV involves multiplying each element in a column of the sparse matrix by the corresponding element in a row of the dense vector. To this point, the DCMV 230 is an N×1 vector having a number of rows that is equal to the number of columns in the M×N sparse matrix 210, such that each row of the DCMV 230 has a corresponding column in the sparse matrix 210 for the per-element computations executed to complete the SpMV operation. In the example of
Similarly, column partition group 248 includes the elements from the sparse matrix 210 that correspond to the elements from the DCMV 230 that are held by vector partitions 254-256. Referring back to the SpMV operation between sparse matrix 210 and the DCMV 230, row c9 of the DCMV 230 corresponds to column c9 of the sparse matrix 210, row c10 of the DCMV 230 corresponds to column c10 of the sparse matrix 210, row c11 of the DCMV 230 corresponds to column c11 of the sparse matrix 210, row c12 of the DCMV 230 corresponds to column c12 of the sparse matrix 210, and so on. Accordingly,
Thus, as a general description,
The tuple arrays 261, 262 shown in
Furthermore, in the CPSM format, the tuple array structures only represent those elements which are non-zero in the sparse matrix (or the column partitions) for improved scalability and efficiency. Referring again to the example of
Moreover, the disclosed CPSM format provides a distinct arrangement/representation of data in order to achieve reduced space (e.g., less consumption of cache/memory resources), more efficient cache utilization, and improved processing speed and efficiency (e.g., decreased running time, decreased per-element computation time) associated with matrix operations, such as SpMV.
As seen in
In an embodiment, without departing from the scope of the disclosed invention, additional optimizations may be applicable to the CPSM format, as described herein. As previously described, processing data in the aforementioned CSC format will increase the data transferred in the communication and may also result in poor cache performance during result accumulation, since there will be more partial results that need to be accumulated. In contrast, by utilizing the disclosed CPSM format, when a row in a column partition is computed, all of the non-zero elements in that row are accumulated, and that partial result is sent to the appropriate destination node for result accumulation.
For example, in the case where matrices have a large number of rows, each column is likely to have more than one non-zero value. Rather than duplicating the column value in each tuple, the enumeration of tuples down a column preferably encodes the column value once, along with a count value indicating the number of corresponding non-zero values that follow, which allows each tuple to encode only the row index and element value. Similarly, for some matrix operations, the non-zero values are always 1. In this scenario, it may also be advantageous to utilize an encoding optimization such that the element tuple is further reduced to encode only the row index. By applying these types of additional optimizations, the total number of bytes representing column partition groups is reduced, thereby improving the efficiencies that can be achieved in using the disclosed CPSM format in SpMV operations.
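A minimal sketch of the two encoding optimizations described above follows, assuming a simple list-based encoding; the function name and tuple layout are illustrative, not taken from the disclosure.

```python
# Sketch of the encoding optimizations described above. Instead of a
# (row, column, value) tuple per non-zero element, each column in a
# partition is emitted once as a (column, count) header followed by
# `count` entries; if all values are known to be 1 (e.g., an adjacency
# matrix), the value field can be dropped entirely. Names are illustrative.

def encode_column(column_index, entries, values_are_all_one=False):
    """entries: list of (row_index, value) pairs for one column."""
    encoded = [(column_index, len(entries))]          # column header + count
    for row_index, value in entries:
        if values_are_all_one:
            encoded.append((row_index,))              # row index only
        else:
            encoded.append((row_index, value))        # row index + value
    return encoded

# Column c1 with three non-zero values:
print(encode_column(1, [(0, 7), (4, 2), (9, 5)]))
# [(1, 3), (0, 7), (4, 2), (9, 5)]  (column index stored once, not three times)
```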
In accordance with the disclosed CPSM format, a tuple array (which represents the data from a column partition group) is also arranged as an array having a sequenced list, or row, of elements corresponding to each respective column partition that is clustered in the same column partition group. Restated, each column partition can be represented by an individual sequenced row in a tuple array. For example,
Moreover, in the disclosed CPSM format, a tuple array is structured to include the elements from the vector partition that correspond to a particular column partition (with respect to computations for the SpMV operation) on the same row as their tuples within the tuple array. That is, in the CPSM format, the specific dense vector elements that are needed for computations are logically arranged with (e.g., on the same row as) the representation of their corresponding sparse matrix elements (e.g., tuples). For example, as previously described, the SpMV operation between sparse matrix 210 and the DCMV 230 involves computations between the element "10" in row c0 of the DCMV 230 (also in row c0 of the vector partition 251) and the elements in column c0 of the sparse matrix 210 (also in column c0 of column partition 241), the element "8" in row c1 of the DCMV 230 (also in row c1 of the vector partition 251) and the elements in column c1 of the sparse matrix 210 (also in column c1 of column partition 241), and the element "6" in row c2 of the DCMV 230 (also in row c2 of the vector partition 251) and the elements in column c2 of the sparse matrix 210 (also in column c2 of column partition 241). Accordingly, by arranging this data in the CPSM format, the tuple array 261 includes the specific elements from the DCMV 230 (or the vector partitions 251-253), namely the list of elements contained in sections 290-292 of the tuple array 261, which particularly correspond to the column partitions 241-243 (or column partition group 247) that are also represented in that tuple array. Additionally,
In this same manner, the tuple array 262 is structured to include: elements from vector partition 254 that correspond specifically to the column partition 244 on a row (section 293) with the tuples 280-282 which represent the elements from the column partition 244 (e.g., having a range of three columns) in the tuple array 262; the elements from the vector partition 255 that correspond specifically to the column partition 245 are on a row with the tuples 283-285 which represent the elements from the column partition 245 (e.g., having a range of three columns); and the elements from the vector partition 256 that correspond specifically to the column partition 246 are on a row with the tuples 286-288 which represent the elements from the column partition 246 (e.g., having a range of three columns).
Therefore, it may be described that a row in a tuple array in the CPSM format has a number of elements from the dense vector that is equal to the number of columns that are included in the defined range of column partitions. In the example of
In an additional embodiment, a sparse matrix that is already in a CSR layout, as described above in reference to
Referring back to
The computer system 300 is implemented as a distributed computing system having a plurality of PEs, shown as PEs 310-312. As referred to herein, a distributed computer system is a computing environment in which various components of the computer, such as hardware and/or software components, are spread across multiple devices (or other computers). By utilizing multiple devices, shown as PEs 310-312, the computer system 300 can split up the work associated with calculations and processing, coordinating their efforts to complete a larger computational job, particularly the SpMV operation, more efficiently than if a single processor had been responsible for the task. The computer system 300 can be a computing environment that is suitable for executing scientific, engineering, and mathematical applications that may involve matrix operations, namely SpMV operations. For instance, the computer system 300 may be a computer of a file system (controlling operation of a storage device, e.g., disk), high-performance computer, or graphics processing system. It should be appreciated that the architecture for the computer system 300 shown in
In the example of
In
Thus, in order to derive an RMij entry of the result matrix 320, the computations for non-zero elements in a row of the sparse matrix that are spread across both of the tuple arrays 261, 262 need to be accumulated. That is, the per-element product computations using data from tuple array 261 and the separate per-element product computations using data from tuple array 262 are needed to compute a sum, which is the RMij entry of the result matrix 320. For example, referring back to
Due to the CPSM format spreading the columns of a sparse matrix across the system 300, vis-à-vis tuple arrays, partial results that correspond to a portion of a row from the sparse matrix (e.g., some elements from a row that are included in a column-partition group) will be generated. In order to calculate a dot product result for an entire row of elements from the sparse matrix (e.g., an element in the result vector), partial results will still need to be accumulated with the partial results corresponding to the rest of the elements in its row (in other tuple arrays).
Generally, each PE that is utilized for SpMV computations will individually generate corresponding partial results (also referred to as result data), where the partial results correspond to the computations for data in a tuple array that is transferred to that particular PE. Further, result data associated with each of the partial results generated from its respective PE will need to be combined (e.g., summed) in order to generate the final result vector for the SpMV operation. In addition, as previously described, the result vector 320 can be partitioned based on the number of PEs that are available on the computer system. Restated, a partial result vector can be maintained respectively by each of the available PEs, where the multiple partial result vectors collectively represent the entire result vector for the SpMV operation. As depicted in
By default, all of the computed elements in a row of a column partition will be combined before they are communicated. For instance, the larger the column partition size, the higher the probability that a row will have more non-zero elements in that column partition. Thus, all of the non-zero values in that row will be accumulated, which in turn leads to fewer additions required by the accumulation core and also less data that is required to be communicated. Furthermore, combining the partial results of the same row from different column partitions in a computation node may degrade performance, because it is difficult to predict when the partial results of the same row from different column partitions will be computed. Therefore, sending these partial results to the accumulation core as they are computed helps speed up the process. This also ensures overlap between computation and communication.
For example, in a case where the total number of available PEs (e.g., n) on the computer system 300 is three, namely PEs 310-312, the result vector 320 can be partitioned such that: PE 310 is configured to maintain partial result vector 321a on its memory; PE 311 is configured to maintain the partial result vector 321b on its memory; and PE 312 is configured to maintain the partial result vector 321c on its memory. Accordingly, as the compute processing for the SpMV operation is distributed across multiple PEs of the computer system 300, the result data from those calculations that are associated with each of the partial result vectors 321a-321c can be transferred to the appropriate one of the PEs 310-312. As an example, as PE 310 executes compute processing for the CPSM data that it has received, which is tuple array 261, that data is processed such that: the result data that is calculated by PE 310 and corresponds to the partial result vector 321a can remain thereon; the result data that is calculated by PE 310 and corresponds to the partial result vector 321b can be communicated to PE 311; and the result data that is calculated by PE 310 and corresponds to the partial result vector 321c can be communicated to PE 312. Compute processing on PE 311 is also conducted in a similar fashion.
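A brief sketch of how result data might be routed to the PE that maintains the matching slice of the result vector follows; the uniform row-range ownership rule and the helper names are assumptions made for illustration.

```python
# Sketch of routing partial results to the PE that maintains the matching
# slice of the result vector. The uniform row-range ownership rule and the
# helper names are assumptions made for illustration.

NUM_PES = 3
TOTAL_ROWS = 9            # rows r0..r8 of the result vector (DCRV)
ROWS_PER_PE = TOTAL_ROWS // NUM_PES

def owner_pe(row_index):
    """PE that maintains the partial result vector containing this row."""
    return min(row_index // ROWS_PER_PE, NUM_PES - 1)

def route_partial_results(partial_results):
    """partial_results: dict {row_index: partial_sum} computed on one PE.
    Returns {destination_pe: [(row_index, partial_sum), ...]}."""
    outgoing = {pe: [] for pe in range(NUM_PES)}
    for row_index, partial_sum in partial_results.items():
        outgoing[owner_pe(row_index)].append((row_index, partial_sum))
    return outgoing

# Partial sums computed on one PE for rows owned by three different PEs:
print(route_partial_results({0: 58, 4: 18, 8: 56}))
# {0: [(0, 58)], 1: [(4, 18)], 2: [(8, 56)]}
```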
In some implementations, elemental partial results corresponding to a result vector row processed within a PE are combined (also referred to as accumulation) before being communicated to the final PE containing the row. In reference to
In other implementations, combining, or accumulation, of the result data can be done at the column partition, the column partition group or opportunistically as a finite cache where capacity misses are evicted to the final PE. In instances where combining is commutative, the result is independent of what choice is made and is not a key aspect of the embodiments disclosed herein.
In other words, result data associated with all of the partial results, which are individually calculated by each of the PEs as a partial component of the full result, have to be generated and accumulated together at some point in a manner that allows the final result vector for the SpMV operation to be obtained. By accumulating result data for each partial result from each of the PEs that are involved in compute processing, all of these partial components of the result that are distributed across the system (being separately generated at a respective PE) can be amassed and pieced together to generate the full final result for the SpMV operation, namely a result vector (or DCRV).
Referring to the example of
Since every PE can compute multiple column partitions, partial results of the same row from different column partitions are typically not accumulated before communicating the partial result to the accumulation core. The compute PE will accumulate the partial computed results of a row within that column partition. Eventually, one partial result for every row in a column partition will be communicated.
As previously described, a key aspect of the disclosed CPSM format is to partition the sparse matrix and the dense multiplier vector, which enables the parallel processing capabilities of a distributed computer system, such as computer system 300, to be leveraged in order to separate the various computations needed to execute a SpMV operation amongst the processing power of the PEs 310-312. However, this distributed and parallelized processing, which enhances efficiency of the SpMV operation in some respects (e.g., efficient cache utilization, and improved scalability) does require the additional processing associated with accumulation, as the result vector is also distributed across the PEs (each PE has a partial result for the result vector).
To address the need to accumulate the multiple partial results that are output by the distributed PEs, a specialized procedure for handling the SpMV operation utilizing the CPSM format is also disclosed. In the disclosed handling procedure, computer system resources are dedicated to handling the accumulation (and summation) of each partial result, where each partial result is individually computed by a respective PE in a manner that properly derives the full dot product result for each row of the sparse matrix (and the column of the DCMV). According to the embodiments, the PEs on the computer system support threads for executing the specialized procedure for accumulating (and summing) the result data for the partial results that represent the full result for the SpMV operation utilizing the disclosed CPSM format. As a result, the full result of the SpMV operation is partitioned across many physical nodes, and partial results can be sourced by any PE. Furthermore, as previously described, accumulating the result data for a particular partial result within a PE before communicating this data to the partial result's final location reduces internode communication. Consequently, the number of PEs that can be involved in both compute processing for the SpMV and accumulation can be n, where n is the total number of available distributed PEs in the computer system, thereby allowing each of the PEs to be more equally loaded to reduce overall completion time.
In some embodiments, each of the PEs 310-312 of computer system 300 can have at least one thread (processing core) that is dedicated to result accumulation. For example, PEs will have a condition denoting that all threads have processed all CPSM groups required of them for a particular partial result, which triggers flushing any remaining data for the partial result to its final PE (e.g., the PE handling the particular partial result vector) for accumulation. Likewise, all PEs with result vector partitions, shown in
Also,
Further, as one of the main sub queues 336a-336c, 346a-346c, and 356a-356c fills up, the data is communicated to the appropriate one of the PEs 330, 340, 350. For example,
Referring now to
The process 400 can begin at operation 405, where a sparse matrix is partitioned into a plurality of column partitions. The sparse matrix can be an M×N sparse matrix that is the multiplicand in the SpMV operation. Further, the SpMV operation can include a dense vector, also referred to as a DCMV, which is structured as an N×1 vector that is the multiplier in the SpMV operation. In operation 405, the elements of the sparse matrix can be partitioned by column, forming a plurality of smaller column-partitioned blocks that include a number of columns from the original sparse matrix.
The number of columns that are included in a column partition, or column-partitioned block of elements, is referred to as the size. The size for column partitions can be predefined, for example being set as a static parameter at deployment, or dynamically selected, for example being set at run-time when performing the SpMV operation. In accordance with the embodiments, the size is defined as a function of the L2 cache size (for a PE) of the computer system. By using memory, specifically the L2 cache size, as a limitation that governs how the column-partitioned data is organized in the CPSM format, this ensures that all of the data needed for the computations (e.g., elements within the same set of columns of the sparse matrix) processed by a respective PE will be consecutively stored in its cache, thereby reducing cache misses and improving the overall computational speed of the SpMV operation by the computer system. Furthermore, in an embodiment, the number of column partitions (or the number of column partition groups) and/or the size of the column partitions generated by operation 405 is based on other specifications of the computing resources in addition to (or in lieu of) the L2 cache size, such as the number of non-zero elements of the sparse matrix (e.g., so that column partitions are of equal size).
Thus, after operation 405, the sparse matrix has been effectively reduced by reorganizing the sparse matrix data into several smaller blocks, namely column partitions, that can be efficiently distributed across the PEs in the system. Stated another way, the elements contained in all of the smaller column partitions generated in operation 405 can collectively be considered the larger original sparse matrix.
Next, at operation 410, the process 400 groups several individual column partitions (resulting from previous operation 405) together in order to form multiple distinct column partition groups. A column partition group is a collection of multiple column partitions that are contiguous by column. Thus, each column partition group comprises a column-partitioned monolithic block of elements from the original sparse matrix. Transferring an entire column partition group, including several individual column partitions, to a PE allows a group of data, which is larger than the individual column partitions, to be distributed in the system in a manner that better optimizes the balance between the number of transfers needed and the size of the data processed (e.g., the amount of computational processing time) by each PE. As previously described, because each column partition group includes one or more column partitions, a group only includes a portion of an entire row from the original sparse matrix. According to an embodiment, the number of column partition groups (or the number of column partitions included in each column partition group) formed in operation 410 is based on the number of vector partitions and/or the number of PEs used for the distributed compute processing.
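A minimal sketch of operation 410 follows, assuming contiguous column partitions are simply chunked evenly into one column partition group per available compute PE (equation (2) would additionally bound each group to at most half of main memory); the names and the even-chunking rule are illustrative assumptions.

```python
# Sketch of operation 410: contiguous column partitions are collected into
# column partition groups, here simply one group per available compute PE.
# The even chunking rule is an assumption for illustration; equation (2)
# additionally bounds each group to at most half of main memory.

def group_column_partitions(column_partitions, num_compute_pes):
    """column_partitions: ordered list of column-partition blocks.
    Returns a list of groups, each a list of contiguous partitions."""
    groups = [[] for _ in range(num_compute_pes)]
    per_group = -(-len(column_partitions) // num_compute_pes)  # ceiling division
    for index, partition in enumerate(column_partitions):
        groups[index // per_group].append(partition)
    return groups

# Six column partitions distributed to two compute PEs (cf. column
# partitions 241-246 grouped into column partition groups 247 and 248):
print(group_column_partitions(["cp0", "cp1", "cp2", "cp3", "cp4", "cp5"], 2))
# [['cp0', 'cp1', 'cp2'], ['cp3', 'cp4', 'cp5']]
```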
The process 400 can continue to operation 415 to partition the DCMV into multiple distinct vector partitions. A critical aspect of the CPSM format involves also partitioning the dense vector, namely the DCMV, in addition to column-partitioning the sparse matrix in the previous operations 405, 410. In operation 415, the dense vector is row-partitioned into multiple smaller and distinct vector partitions, where each vector partition has fewer elements and rows than the full DCMV. Thus, partitioning the DCMV into smaller, separate blocks allows the dense vector data to also be distributed across the PEs (with the column partition groups) in a scalable and efficient manner for parallelized compute processing. Consequently, process 400 performs an efficient partitioning and distribution of the dense vector (as smaller vector partition blocks) that is distinct from conventional approaches that have to recreate the full dense vector multiple times, once for each PE used in parallelization, which leads to inefficiencies and bottlenecks (e.g., poor scalability, and inefficient consumption of space, memory, and processing resources).
In an embodiment, the number of vector partitions formed in operation 415 is based on the number of column partition groups and/or the number of PEs used for the distributed compute processing. Thus, a vector partition is a smaller subset block of elements, contiguous by row, from the original DCMV.
In addition, operation 415 can include formatting the column partition groups and the vector partitions as tuple arrays, in accordance with the CPSM format. A tuple array is defined as a data structure that includes a sequenced list of tuples of type (row, column, value), which represent the non-zero elements from the column partition groups, together with the elements from the vector partition. Tuple arrays are also particularly structured to logically arrange a specific column partition group with its corresponding vector partition, where correspondence is determined with respect to the per-element computations needed for SpMV. This ensures that each tuple array, which is used in the CPSM format, includes the specific elements from the dense vector, or DCMV (without including the entire DCMV), that are needed to perform a SpMV computation for each element in the corresponding column partition group (from the sparse matrix) that is represented in that tuple array.
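A minimal sketch of how a tuple array might be represented and built follows, assuming a simple in-memory structure in which each row pairs one column partition's non-zero (row, column, value) tuples with the slice of the DCMV those columns require; the class and field names are illustrative, not taken from the disclosure.

```python
# Sketch of the tuple-array structure described above: each row of the
# tuple array pairs the non-zero (row, column, value) tuples of one column
# partition with the vector-partition elements those columns require.
# The class and field names are illustrative, not taken from the disclosure.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TupleArrayRow:
    vector_elements: List[float]             # slice of the DCMV for this partition
    first_column: int                        # first column covered by the partition
    tuples: List[Tuple[int, int, float]]     # (row, column, value) non-zeros only

@dataclass
class TupleArray:                            # one per column partition group / PE
    rows: List[TupleArrayRow]

def build_tuple_array(dense_matrix, x, column_ranges):
    rows = []
    for first, last in column_ranges:        # contiguous column partitions
        tuples = [(r, c, dense_matrix[r][c])
                  for r in range(len(dense_matrix))
                  for c in range(first, last)
                  if dense_matrix[r][c] != 0]
        rows.append(TupleArrayRow(x[first:last], first, tuples))
    return TupleArray(rows)

A = [[5, 0, 0, 2],
     [0, 0, 3, 0],
     [0, 7, 0, 0]]
x = [10, 8, 6, 4]
print(build_tuple_array(A, x, [(0, 2), (2, 4)]))
```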
Moreover, as previously described, organizing the elements in the column partition groups (from the sparse matrix) into tuple arrays of the CPSM format reduces and/or compresses the amount of data that needs to be transferred to the PEs by only representing the non-zero elements. Additional details regarding the structure and contents of tuple arrays, in accordance with the disclosed CPSM format, are described above, for example in reference to
Subsequently, process 400 can proceed to operation 420 where data in the disclosed CPSM format, namely the tuple arrays including the column partition groups and the vector partitions, are distributed to a respective processor, or PE for parallelized processing. In an embodiment, the computer system executing the SpMV operation has multiple PEs that are designated for compute processing associated with the SpMV operation and result accumulation associated with the SpMV operation. Further, the number of separate tuple arrays may be equal to the number of compute PEs. Thus, for example, each tuple array in the CPSM format can be transferred to a separate compute PE. The computational and result accumulation operations performed by the distributed PEs to execute the SpMV operation, using the data in the CPSM format, is previously disclosed in detail, for example in reference to
By utilizing the CPSM format, parallelized processing capabilities of the distributed PEs can be leveraged such that the SpMV operation as a whole is distributed across the multiple PEs in the computer system, where each PE can independently perform a portion of the SpMV compute processing using only the data in its respective tuple array. The data in the disclosed CPSM format is also particularly organized such that PEs access and process this data in a manner that also dramatically improves scalability and improves computational efficiency (e.g., reduced data transferred, reduced cache misses, reduced computation time).
The disclosed CPSM format, and the parallelized distributed processing associated with the SpMV operation using this data, does require a specialized procedure for accumulating the final DCRV result. That is, each PE only has a partial result for the SpMV operation because only a portion of the compute processing is performed at each of the multiple PEs. In other words, each partial result that is individually computed by each PE has to be accumulated and properly compiled together in order to derive the full final DCRV result. The specialized procedure for executing the SpMV computations and result accumulation, utilizing data in the disclosed CPSM format, is depicted by
As a general description of the coordination between the processes in
Next, at operation 510, the CPSM data that is read during previous operation 505 can be allocated to memory. In some cases, the memory is local to the PE receiving the CPSM data, such as a L2 cache for a core.
Subsequently, at operation 515, the process 500 can perform a check to determine whether there is any additional CPSM data that can still be read from the computer system. In the case where there is remaining CPSM data (indicated by “Yes” in
Returning back to operation 515, in the case where there is no remaining CPSM data (indicated by “No” in
Referring now to
Because different row ranges correspond to different partitioned sections of the DCRV, as a column partition is processed the PE can more efficiently store partial results in its L2 cache buffer and more efficiently write the partial results to their respective destination message buffers in L3 cache/DRAM in batches as the matrix data is processed. This allows the multiplication threads to efficiently fetch CPSM data from memory to perform the multiplication and to amortize the overhead of moving the partial results both out of the PE's core and out of the PE to be aggregated. This organization also allows all cores in a PE to contribute to a message before it is sent out, allowing the overhead for all messages to be amortized efficiently across all cores.
Process 600 can begin at operation 605, where a conditional check is performed to determine whether there is CPSM data to be read. As previously described, data in the CPSM format includes column-partitioned blocks from the sparse matrix, namely column partitions, and partitioned blocks for the dense vector, or DCMV. In the case where there is CPSM data to be read (indicated as “Yes” in
At operation 610, matrix multiplication is performed. For example, operation 610 involves the per-element calculations between a partial row of elements in a column partition group and the elements from the portion of the DCMV that are represented in the same tuple array in the CPSM format.
Subsequently, at operation 615, the partial result generated by executing the matrix multiplication in previous operation 610 is stored in a buffer. As an entire row from the sparse matrix is distributed across all of the PEs, the matrix multiplication performed by an individual PE only corresponds to the per-element calculations for a portion of a row, or a partial result. Matrix multiplication of operation 610 can involve partially calculating an entry of the result matrix, RMij, by multiplying term-by-term each element of the entire ith row of the sparse matrix and a corresponding entry in the entire column (jth column) of the DCMV (the summing of these n products is performed by the result accumulation process in
The process 600 continues to operation 620 to perform a conditional check in order to determine whether the buffer, which is storing partial results, is full. In the case that operation 620 determines that the buffer is full (indicated in
At operation 625, the data within the buffer (e.g., L2 cache), namely the partial result computed by the individual PE, is flushed to a message buffer. Thereafter, at operation 630, a conditional check is performed in order to determine whether the message buffer is full. In the case that operation 630 determines that the message buffer is full (indicated in
Subsequently, at operation 635, a message is sent to the additional core in the PE, and the process 600 ends. In some cases, the message buffer will be filled up many times over the execution of the application, so whenever it fills up, the entire message buffer is sent to the appropriate PE and the message buffer can be reutilized. Accordingly, in these scenarios, once the message buffer is sent in operation 635, the process 600 can then go to either operation 610 if there is still data to process, or to operation 605 to check whether there is any remaining CPSM data to process. Consequently, the process 600 computes and communicates a partial result for the SpMV, corresponding to the CPSM data it has retrieved, to other cores and/or PEs in the system. According to the embodiments, the compute process 600 can be executed by multiple PEs of the computer system in parallel, which leverages the parallelization capabilities of the distributed computing environment and improves processing efficiency.
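A minimal sketch of the compute loop of process 600 follows, under simplifying assumptions: per-row partial sums are accumulated in a small buffer, grouped by destination PE, and sent when the buffer fills or the tuple-array data is exhausted; the buffer threshold, ownership rule, and send_message stub are illustrative, not taken from the disclosure.

```python
# Sketch of the compute loop of process 600: multiply each non-zero tuple
# by its matching dense-vector element, accumulate per-row partial sums in
# a small buffer (standing in for the L2-resident buffer), and flush full
# batches to per-destination message buffers. The buffer threshold, the
# send_message stub, and the ownership rule are assumptions.

ROWS_PER_PE = 3
FLUSH_THRESHOLD = 4          # pretend the buffer is "full" after 4 buffered rows

def send_message(destination_pe, batch, is_last):
    print(f"to PE {destination_pe}: {batch} last={is_last}")  # stand-in for an interconnect send

def flush(partial, num_pes, is_last):
    batches = {}
    for row, value in partial.items():             # group rows by destination PE
        batches.setdefault(min(row // ROWS_PER_PE, num_pes - 1), []).append((row, value))
    for pe, batch in batches.items():              # send each message buffer (cf. op 635)
        send_message(pe, batch, is_last)

def compute_tuple_array(tuple_array_rows, num_pes):
    partial = {}                                   # row index -> partial sum
    for vector_elements, first_column, tuples in tuple_array_rows:
        for row, col, value in tuples:             # matrix multiplication (cf. op 610)
            partial[row] = partial.get(row, 0) + value * vector_elements[col - first_column]
        if len(partial) >= FLUSH_THRESHOLD:        # buffer full (cf. ops 620/625)
            flush(partial, num_pes, is_last=False)
            partial = {}
    flush(partial, num_pes, is_last=True)          # flush remainder, flag the last message

# Rows of the tuple array from the earlier sketch, written as plain tuples:
compute_tuple_array([([10, 8], 0, [(0, 0, 5), (2, 1, 7)]),
                     ([6, 4], 2, [(0, 3, 2), (1, 2, 3)])], num_pes=3)
```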
Referring now to
Process 700 can begin at operation 705, where a message is received. The message can be received from another PE (or core), where the message is indicative of a partial result associated with the SpMV operation that has been computed by the respective PE. For example, during computations, every PE will send data from its message buffer to the appropriate PE as the buffer fills. Thus, over the course of execution, many messages from a PE can be received. Every message from a PE will include a flag indicating whether that specific message is the last message to be received from that particular PE. If this flag is present, and this is the last message from a specific PE, then the receiving PE increments a count.
Thereafter, at operation 715, a conditional check is performed to determine whether a current count is less than the total number of PEs on the computer system. In the case that operation 715 determines that the count is not less than (e.g., is equal to or greater than) the total number of distributed PEs (indicated in
Thus, the process can continue to operation 720. Alternatively, in the case that operation 715 determines that the current count is still less than the total number of distributed PEs (indicated in
After the process 700 has accumulated all of the data for a partial result from the other distributed PEs (and all of the PEs in the computer system have completed accumulating their respective partial results), the full result of the SpMV operation is obtained. That is, each element in the result vector has been calculated, and the full result vector, or DCRV, has been generated. Thus, the process 700 allows result accumulation and matrix multiplication to occur in parallel, rather than in two independent phases, as the result vector accumulation and computation for an SpMV operation are divided and distributed across multiple PEs.
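A minimal sketch of the accumulation loop of process 700 follows, assuming a simple message format carrying (row, partial sum) entries and a last-message flag; the queue, names, and message layout are illustrative assumptions.

```python
# Sketch of the accumulation loop of process 700: entries from each incoming
# message are summed into the locally owned slice of the result vector, and
# a count of "last message" flags from the compute PEs signals completion.
# The message format, queue, and names are assumptions for illustration.

from collections import namedtuple
from queue import Queue

Message = namedtuple("Message", ["source_pe", "entries", "is_last"])  # entries: [(row, partial_sum)]

def accumulate_result_partition(incoming, local_rows, num_pes):
    partial_result_vector = {row: 0 for row in local_rows}   # locally owned DCRV slice
    last_message_count = 0
    while last_message_count < num_pes:                      # cf. operation 715
        message = incoming.get()                             # cf. operation 705
        for row, partial_sum in message.entries:             # accumulate into the owned slice
            partial_result_vector[row] += partial_sum
        if message.is_last:                                  # last message from a compute PE
            last_message_count += 1
    return partial_result_vector                             # completed slice of the DCRV

# Hypothetical scenario: PE 0 computed the partial sums for columns c0-c1,
# PE 1 for columns c2-c3, and PE 2 had nothing for these rows.
q = Queue()
q.put(Message(source_pe=0, entries=[(0, 50), (2, 56)], is_last=True))
q.put(Message(source_pe=1, entries=[(0, 8), (1, 18)], is_last=True))
q.put(Message(source_pe=2, entries=[], is_last=True))
print(accumulate_result_partition(q, local_rows=[0, 1, 2], num_pes=3))
# {0: 58, 1: 18, 2: 56}
```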
The computer system 800 also includes a main memory 806, such as a random-access memory (RAM), cache and/or other dynamic storage devices, coupled to bus 802 for storing information and instructions to be executed by processor 804. Main memory 806 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 804. Such instructions, when stored in storage media accessible to processor 804, render computer system 800 into a special-purpose machine that is customized to perform the operations specified in the instructions.
The computer system 800 further includes storage devices 810, such as a read only memory (ROM) or other static storage device, coupled to bus 802 for storing static information and instructions for processor 804. A storage device 810, such as a magnetic disk, optical disk, or USB thumb drive (Flash drive), etc., is provided and coupled to bus 802 for storing information and instructions.
The computer system 800 may be coupled via bus 802 to a display 812, such as a liquid crystal display (LCD) (or touch screen), for displaying information to a computer user. An input device 814, including alphanumeric and other keys, is coupled to bus 802 for communicating information and command selections to processor 804. Another type of user input device is cursor control 816, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 804 and for controlling cursor movement on display 812. In some implementations, the same direction information and command selections as cursor control may be implemented via receiving touches on a touch screen without a cursor.
The computing system 800 may include a user interface module to implement a GUI that may be stored in a mass storage device as executable software codes that are executed by the computing device(s). This and other modules may include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.
In general, the word “component,” “engine,” “system,” “database,” data store,” and the like, as used herein, can refer to logic embodied in hardware or firmware, or to a collection of software instructions, possibly having entry and exit points, written in a programming language, such as, for example, Java, C or C++. A software component may be compiled and linked into an executable program, installed in a dynamic link library, or may be written in an interpreted programming language such as, for example, BASIC, Perl, or Python. It will be appreciated that software components may be callable from other components or from themselves, and/or may be invoked in response to detected events or interrupts. Software components configured for execution on computing devices may be provided on a computer readable medium, such as a compact disc, digital video disc, flash drive, magnetic disc, or any other tangible medium, or as a digital download (and may be originally stored in a compressed or installable format that requires installation, decompression or decryption prior to execution). Such software code may be stored, partially or fully, on a memory device of the executing computing device, for execution by the computing device. Software instructions may be embedded in firmware, such as an EPROM. It will be further appreciated that hardware components may be comprised of connected logic units, such as gates and flip-flops, and/or may be comprised of programmable units, such as programmable gate arrays or processors.
The computer system 800 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 800 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 800 in response to processor(s) 804 executing one or more sequences of one or more instructions contained in main memory 806. Such instructions may be read into main memory 806 from another storage medium, such as storage device 810. Execution of the sequences of instructions contained in main memory 806 causes processor(s) 804 to perform the process steps described herein. In alternative implementations, hard-wired circuitry may be used in place of or in combination with software instructions.
As used herein, a circuit might be implemented utilizing any form of hardware, software, or a combination thereof. For example, one or more processors, controllers, ASICs, PLAs, PALs, CPLDs, FPGAs, logical components, software routines or other mechanisms might be implemented to make up a circuit. In implementation, the various circuits described herein might be implemented as discrete circuits or the functions and features described can be shared in part or in total among one or more circuits. Even though various features or elements of functionality may be individually described or claimed as separate circuits, these features and functionality can be shared among one or more common circuits, and such description shall not require or imply that separate circuits are required to implement such features or functionality. Where a circuit is implemented in whole or in part using software, such software can be implemented to operate with a computing or processing system capable of carrying out the functionality described with respect thereto, such as computer system 800.
As used herein, the term “or” may be construed in either an inclusive or exclusive sense. Moreover, the description of resources, operations, or structures in the singular shall not be read to exclude the plural. Conditional language, such as, among others, “can,” “could,” “might,” or “may,” unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain implementations include, while other implementations do not include, certain features, elements and/or steps.
Terms and phrases used in this document, and variations thereof, unless otherwise expressly stated, should be construed as open ended as opposed to limiting. Adjectives such as “conventional,” “traditional,” “normal,” “standard,” “known,” and terms of similar meaning should not be construed as limiting the item described to a given time period or to an item available as of a given time, but instead should be read to encompass conventional, traditional, normal, or standard technologies that may be available or known now or at any time in the future. The presence of broadening words and phrases such as “one or more,” “at least,” “but not limited to” or other like phrases in some instances shall not be read to mean that the narrower case is intended or required in instances where such broadening phrases may be absent.