The following are incorporated by reference for all purposes as if fully set forth herein:
The technology disclosed relates to computing systems for executing data parallel and DP computing applications. In particular, the technology disclosed relates to executing matrix computations in data parallel computing systems. Some such systems can employ reconfigurable processors, such as Coarse-Grain Reconfigurable Processors (CGRPs) to perform matrix computations.
The present disclosure relates to computing systems for executing data parallel and/or DP computing applications, such as in machine learning and neural networks. The disclosure further relates to methods and structures of a computing system to perform matrix computations such as computing dot products of matrices. Such computations can be included in machine learning and/or neural networks. Computing systems of the present disclosure include computing systems utilizing reconfigurable processing architectures, such as computing systems comprising Coarse-Grained Reconfigurable Processors (CGRPs).
A method comprises computing dot products of a left side matrix and a right side matrix having a shared dimension. The left side matrix comprises the shared dimension number of columns and the right side matrix comprises the shared dimension number of rows. In the method, a first Matrix Processing Unit (MPU) of a computing system receives column elements of a row of a first column-split matrix and row elements of a column of a first row-split matrix. The first column-split matrix comprises a first number of columns among columns of the left side matrix and the first row-split matrix comprises the first number of rows among rows of the right side matrix. A second MPU included in the computing system receives column elements of a row of a second column-split matrix and row elements of a column of a second row-split matrix. The second column-split matrix comprises a second number of columns among columns of the left side matrix and the second row-split matrix comprises the second number of rows among rows of the right side matrix.
In the method, the first MPU computes a first partial dot product comprising a sum of products of elements among the column elements of the row of the first column-split matrix multiplied by corresponding elements among the row elements of the column of the first row-split matrix. Concurrently, the second MPU computes a second partial dot product comprising a sum of products of elements among the column elements of the row of the second column-split matrix multiplied by corresponding elements among the row elements of the column of the second row-split matrix. A third MPU included in the computing system computes a dot product comprising a sum of the first partial dot product and the second partial dot product.
In the method the second and third MPU can be the same MPU.
A computer program product can comprise processor instructions to cause one or more processors to execute aspects of the method.
A computing system can perform the method. The computing system can comprise the first, second, and third MPUs and a shared dimension (SD) splitter. The SD splitter can be configured to generate, from the left side and right side matrices, the first column-split and first row-split matrices and the second column-split and second row-split matrices. The first, second, and third MPUs can be configured to compute the respective first partial dot product, second partial dot product, and the dot product comprising the sum of the first and second partial dot products.
The drawings included in the present disclosure are incorporated into, and form part of, the specification. They illustrate implementations of the present disclosure (hereinafter, “the disclosure”) and, along with the description, serve to explain the principles of the disclosure. The drawings are intended to be only illustrative of certain implementations and are not intended to limit the disclosure.
Aspects of the present disclosure (hereinafter, “the disclosure”) relate to methods of performing matrix computations in computing systems. More particular aspects relate to improving parallelism of matrix computations and reducing processing cycle times of computing systems by exploiting shared dimensions of matrices. As will be seen from a discussion of techniques and structures of the disclosure, implementations of the disclosure (hereinafter, “implementations”) can perform matrix computations more efficiently and with higher degrees of parallelism by exploiting shared dimensions of two multiplicand matrices in matrix computations.
Aspects of the disclosure can also particularly apply to processors of data parallel (DP) computing systems, such as Central Processing Units (CPUs), Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), and Digital Signal Processors (DSPs). Certain aspects of the disclosure relate to performing tensor and/or matrix computations in computing systems utilizing reconfigurable processor architectures, such as computing systems utilizing Coarse-Grain Reconfigurable Processors (CGRPs), and/or reconfigurable Application Specific Integrated Circuits (ASICs) or Application Specific Instruction-set Processors (ASIPs).
Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. The disclosure in some instances repeats references to these options. However, omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections; these recitations are hereby incorporated forward by reference into each of the following implementations.
Particular expressions of the disclosure will be understood to have the following operative meanings:
As used herein, “incorporated subject matter” refers, collectively, to subject matter disclosed, and/or otherwise encompassed, among the disclosures incorporated herein by reference. For purposes of illustrating the disclosure, but not intended to limit implementations, various terms of the disclosure are drawn from the incorporated subject matter. As used herein, unless expressly stated otherwise, such terms as may be found in the incorporated subject matter have the same meanings, herein, as their meanings in their respective incorporated disclosures.
Aspects of the disclosure can be appreciated through a discussion of example implementations and/or applications of methods and/or systems. However, such examples are for purposes of illustrating the disclosure. It should be understood that the intention is not to limit the disclosure to the example implementations described herein, but to encompass all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure. Thus, the disclosure is not intended to be limited to the implementations shown but is to be accorded the widest scope consistent with the principles and features disclosed herein. Various modifications to the disclosed examples will be readily appreciated by those of ordinary skill in the art, and the general principles defined herein may be applied to other implementations without departing from the spirit and scope of the disclosure.
Turning now to more particular aspects of the disclosure, DP computing applications can comprise computations that can be executed concurrently, in parallel, among a plurality of computational elements (processors and/or programs executing on processors) of a DP computing system. Examples of such DP applications include machine learning (ML) and deep machine learning (DML) methods of Artificial Intelligence (AI) applications; image processing; stream processing (e.g., processing of streaming video and/or audio data); natural language processing (NLP); and/or recommendation engines.
DP computing systems can comprise reconfigurable processing elements (reconfigurable processors, or “RPs”) particularly designed and/or configured to efficiently perform DP computing applications. Reconfigurable processors, such as field programmable gate arrays (FPGAs) and/or CGRP-based processors, can be configured to implement a variety of computational and/or data transfer functions more efficiently or faster than might be achieved using a general-purpose processor executing a computer program.
Prabhakar, et al., “Plasticine: A Reconfigurable Architecture for Parallel Patterns,” ISCA '17, Jun. 24-28, 2017, Toronto, ON, Canada, (hereinafter, “Prabhakar”) describes example CGRPs and systems utilizing such CGRPs. U.S. Nonprovisional patent application Ser. No. 16/239,252, “VIRTUALIZATION OF A RECONFIGURABLE DATA PROCESSOR”, to Grohoski, et al, (hereinafter, “Grohoski”), and U.S. Nonprovisional patent application Ser. No. 16/922,975, “RUNTIME VIRTUALIZATION OF RECONFIGURABLE DATA FLOW RESOURCES”, to Kumar, et al, (hereinafter, “Kumar”), both incorporated herein by reference, illustrate additional example implementations of CGRPs and DP systems utilizing CGRPs. As used herein, the term “CGRP” refers to processors based on coarse-grain reconfigurable architectures and, interchangeably, to a hardware implementation (such as an integrated circuit, chip, or module) of a CGRP. In implementations, systems based on, and/or incorporating,
Owing to their dynamic reconfigurability and the potential to incorporate many hundreds or even thousands of CGRPs in a computation system, DP computing systems can particularly take advantage of CGRPs to improve computing performance. Accordingly, aspects of the disclosure relate to methods and systems utilizing reconfigurable DP resources, such as resources of a CGRP. However, the disclosure is not necessarily limited to computing systems utilizing CGRPs and it will be appreciated by one of ordinary skill in the art that computing systems can employ processing elements other than CGRPs (e.g., CPUs, FPGAs, GPUs, etc.) and remain within the scope and spirit of the disclosure.
As used herein, the term “reconfigurable DP system (RDS)” refers to a computing system that can utilize reconfigurable processing resources, such as CGRPs, to perform operations of DP applications. Owing to reconfigurability, reconfigurable DP systems can perform these operations more efficiently than systems comprising fixed or non-reconfigurable resources. As also used herein, the term “application” refers to any computing application (e.g., software program), and/or computing system, that utilizes an RDS, to perform algorithms and/or computations of the application. An application can execute, for example, on a processor included in, or coupled to, an RDS.
Kumar illustrates a DP system (e.g., an RDS) comprising user applications, programming libraries (e.g., deep learning frameworks), a software development kit, computation graphs associated with user applications, compilers, execution files that can specify operations of a user application to perform using resources (reconfigurable data flow resources) of the DP system, and host and runtime processors. User applications can comprise data parallel and/or DP applications. As illustrated by the examples of Kumar, an RDS can comprise a plurality of physical racks, each comprising one or more compute nodes (hereinafter, for brevity, “nodes”).
In the examples of Kumar a host and runtime processors can, for example, facilitate compiling a DP application, determining particular RDS resources to execute the application, and managing execution of the RDS resources in performing operations of the application. In the examples of Kumar a node can comprise a host processor, a runtime processor, and, more generally, reconfigurable processors (“RPs”), such as CGRPs. A runtime processor can include kernel drivers and/or a user space library (e.g., a library of programs a user can include, or can invoke, in a DP application and that can execute in a user space of a runtime processor).
In implementations, an RP can comprise reconfigurable processing elements with reconfigurable interconnections. Using the examples of Prabhakar, Grohoski, and Kumar, hardware implementations of an RP can comprise pattern compute units (PCUs), pattern memory units (PMUs), arrays of PCUs and/or PMUs (“tiles”), networks of tiles, and/or network interfaces. The hardware implementations can comprise one or more Integrated Circuits (ICs). As used herein, the term “chip” refers to an IC (or combination of ICs) that can embody elements of a CGRP. A chip can typically be packaged in a chip module (e.g., a single chip module, “SCM” or, alternatively, a multi-chip module, “MCM”).
As illustrated by Grohoski and Kumar, a reconfigurable dataflow unit (RDU) of a DP system can comprise a dynamically reconfigurable hardware resource of the system that includes processing elements (e.g., RPs) to perform operations of DP applications. In the examples of Grohoski and Kumar, an RDU can comprise a set of processing elements (e.g., one or more RPs), I/O interfaces to communicate among processors of differing RDUs, and, optionally, a memory. In the examples of Kumar and Grohoski, an RDU can comprise elements other than simply computational elements (e.g., processors, such as PCUs) and/or memories (e.g., PMUs), such as clock circuits, control circuits, switches and/or switching circuits, and interconnection interface circuits (e.g., processor, memory, I/O bus, and/or network interface circuits). Kumar also illustrates that an RDU can include virtualization logic and/or RP configuration logic.
For purposes of illustrating the disclosure, but not intended to limit implementations, the disclosure occasionally refers to the example of an RDU comprising RPs of Grohoski and Kumar to illustrate a reconfigurable processing element for executing operations (e.g., computations and/or data transfer) of DP applications, such as matrix computations of DP applications. However, it will be appreciated by one of ordinary skill in the art that a processing element of a DP computing system can comprise any form of hardware processor, or combination of hardware processors, memories, interconnections, and/or ancillary circuits (e.g., clocks, control, interface, and/or status circuits), that can perform operations of DP applications. DP processing elements can comprise, for example, central processing units (CPUs); accelerator-class processors; matrix processing units (MPUs); intelligence processing units (IPUs); graphics processing units (GPUs); and/or field programmable gate arrays (FPGAs) configured to perform particular DP application computations.
DP applications, such as machine learning and neural networks, commonly involve processing tensor data, such as tensors representing elements of image data, audio data, video data, and/or natural language data. To process such data the applications perform matrix computations using matrices of tensor data. Such computations can include, for example, matrix multiplication, matrix summation, matrix convolutions, and matrix transposition.
As used herein, in reference to matrices, a capital letter, such as A, is used to refer to a matrix A as a whole, while lowercase letters, such as “a”, are used to refer to an element, or set of elements, of a matrix A. The term “element”, in reference herein to a matrix, refers to the contents (e.g., a scalar value) of a row and column cell of the matrix. The notation “M×K” refers to a matrix having M number of rows and K number of columns and “K×N” similarly refers to a matrix having K number of rows and N number of columns.
In particular, machine learning and neural network applications commonly perform matrix multiplication computations, commonly referred to as “General Matrix Multiply”, or “GeMM”. A GeMM computation produces a sum of products (a “dot product”) of all elements of a row of one matrix multiplied by all elements of a column of another, where the two matrices share a dimension. For example, a “left side” M×K matrix, A, can be multiplied by a “right side” K×N matrix, B, based on the shared dimension K. The result is an M×N matrix, C, in which each element of C, cij for each row i and column j, is a dot product that adds the products of all K elements of row i of the left side matrix A multiplied by corresponding K elements of column j of the right side matrix B. For example, c11 is computed as (a11b11+a12b21+ . . . +a1kbk1) for row 1 of matrix A and column 1 of matrix B; c12 is computed as (a11b12+a12b22+ . . . +a1kbk2) for row 1 of matrix A and column 2 of matrix B; and, c1n is computed as (a11b1n+a12b2n+ . . . +a1kbkn) for row 1 of matrix A and column N of matrix B.
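The GeMM definition above can be illustrated with a minimal Python sketch. The function name `gemm` and the list-of-rows matrix representation are assumptions chosen for illustration only; the sketch is not drawn from the incorporated subject matter and is not intended to represent how an MPU is implemented.

```python
def gemm(a, b):
    """Multiply an M x K matrix a by a K x N matrix b (each a list of rows).

    Hypothetical helper for illustration: element c[i][j] is the dot product
    of row i of a and column j of b, summed over the shared dimension K.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("columns of a must equal rows of b (shared dimension K)")
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


A = [[1, 2, 3],
     [4, 5, 6]]    # M=2, K=3
B = [[7, 8],
     [9, 10],
     [11, 12]]     # K=3, N=2
C = gemm(A, B)     # M x N = 2 x 2
```

In this example c11 is (1·7 + 2·9 + 3·11), following the same pattern as (a11b11+a12b21+ . . . +a1kbk1) above.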
As used herein, the term “dot product” refers to a sum of two or more products of elements of a row of a left side matrix multiplied by corresponding elements of a column of a right side matrix, such as dot product c11 of row 1 of left side matrix A multiplied by column 1 of right side matrix B in the foregoing example. The term “dot product computation”, as used herein, refers to computing a dot product of a row of a left side matrix multiplied by a column of a right side matrix in a matrix multiplication computation.
As also used herein, the term “partial dot product” refers to a sum of one or more products of some, but not all, elements of a row of a left side matrix multiplied by corresponding elements of a column of a right side matrix. For example, a partial dot product can comprise a product of one element of a row of a left side matrix A, and a corresponding element of a column of a right side matrix B, prior to computing and adding other products of that row of matrix A and column of matrix B, such as partial dot product (a11b1n), of c1n=(a11b1n+a12b2n+ . . . +a1kbkn). In another example, c=(a11b1n+a12b2n) is a partial dot product of c1n=(a11b1n+a12b2n+ . . . +a1kbkn) comprising a sum of the products of the first 2 elements of row 1 of matrix A multiplied by the corresponding first 2 elements of column n of matrix B.
The term “complete dot product” refers herein to a sum of products of all elements, 1 to K, of a row of an M×K left side matrix multiplied by all corresponding K elements of a column of a K×N right side matrix. For example, c1n=(a11b1n+a12b2n+ . . . +a1kbkn), for all values of k from 1 to K, is a complete dot product of all K elements of row 1 of an M×K left side matrix A multiplied by all corresponding K elements of column n of a K×N right side matrix B. An expression such as [Σ(a b)] represents herein, interchangeably, a complete dot product, and a computation of a complete dot product, of a row of a left side matrix A multiplied by a column of a right side matrix B.
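The distinction between partial and complete dot products can be sketched as follows. The helper name `partial_dot` and the half-open index range are assumptions for illustration, and indices are 0-based rather than the 1-based indices used in the description.

```python
def partial_dot(row, col, start, stop):
    """Sum the products row[p] * col[p] for p in [start, stop).

    Hypothetical helper: covering only part of the shared dimension yields a
    partial dot product; covering all K elements yields a complete one.
    """
    return sum(row[p] * col[p] for p in range(start, stop))


row = [1, 2, 3, 4]   # one row of a left side matrix, K=4
col = [5, 6, 7, 8]   # one column of a right side matrix, K=4

single = partial_dot(row, col, 0, 1)            # one product is already a partial dot product
first_two = partial_dot(row, col, 0, 2)         # partial: first 2 of K products
last_two = partial_dot(row, col, 2, 4)          # partial: remaining 2 products
complete = partial_dot(row, col, 0, len(row))   # complete: all K products
```

Adding `first_two` and `last_two` gives the same value as `complete`, which is the property that splitting along the shared dimension relies on.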
DP computing systems can include processing units particularly designed, or configured, to perform matrix computations with much improved performance. As used herein, the term “matrix processing unit” (MPU) refers to any type or arrangement of processing elements (e.g., RDUs, tiles, and/or arrays of PCUs/PMUs of a tile) and/or computational circuit(s) of a DP computing system designed to perform matrix computations, and that can be configured to process large numbers of matrix elements in parallel with other MPUs, processors and/or processing elements, and/or logic circuits, to improve performance of such computations.
A “shared dimension” (SD) matrix processing system can take advantage of a shared dimension of a left side and a right side matrix to improve computational latency, communications/data transfer (e.g., among MPUs and/or resources of MPUs) latency, and/or utilization of hardware resources of a DP system. An SD processing system can include an “SD splitter” component that can divide, or “split”, an M×K left side matrix and a K×N right side matrix based on their shared dimension, K. As used herein, the term “Shared Dimension Matrix Processor” (SDMP) refers to a computing system (e.g., an RDS) configured to perform matrix multiplication based on splitting “parent” multiplicand matrices along a shared dimension of the parent matrices. An SDMP can comprise, for example, a DP computing system having an SD splitter, to split parent matrices into SD “split matrices”, and having multiple MPUs each configured to compute a subset of products and/or dot products of the split matrices in parallel with each other.
An SD splitter can split “parent” left and right side matrices into pairs of “column-split” and “row-split” matrices, in which each pair comprises a fraction of respective columns and rows among dimension K shared by the parent matrices. For example, to multiply an M×K left side parent matrix, A, by a K×N right side parent matrix, B, an SD splitter can split parent matrix A into two M×(K/2) column-split matrices, A0 and A1, and can split the parent matrix B into two (K/2)×N row-split matrices, B0 and B1. Matrices A0 and A1 can each have (K/2) number of the K columns of the left side parent, and matrices B0 and B1 can each have (K/2) number of the K rows of the right side parent. Column-split matrix A0 can comprise, for example, all M rows and columns 1 to (K/2) of the left side parent, and column-split matrix A1 can comprise all M rows and columns (K/2)+1 to K of the left side parent. Row-split matrix B0 can comprise, correspondingly, rows 1 to (K/2) and all N columns of the right side parent, and row-split matrix B1 can comprise rows (K/2)+1 to K and all N columns of the right side parent.
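The splitting just described can be sketched in a few lines of Python. The function name `sd_split`, the `parts` parameter, and the restriction to a shared dimension evenly divisible by `parts` are assumptions made to keep the sketch short; they are not features required by the disclosure.

```python
def sd_split(a, b, parts=2):
    """Split M x K matrix a by columns and K x N matrix b by rows along K.

    Hypothetical sketch: returns a list of (column-split, row-split) pairs,
    each pair covering the same K/parts slice of the shared dimension.
    """
    k = len(b)
    if any(len(row) != k for row in a):
        raise ValueError("a and b must share dimension K")
    if k % parts:
        raise ValueError("this sketch only handles K divisible by parts")
    width = k // parts
    pairs = []
    for start in range(0, k, width):
        a_split = [row[start:start + width] for row in a]       # M x (K/parts)
        b_split = [list(r) for r in b[start:start + width]]     # (K/parts) x N
        pairs.append((a_split, b_split))
    return pairs


A = [[1, 2, 3, 4],
     [5, 6, 7, 8]]                      # M=2, K=4
B = [[1, 0], [0, 1], [2, 0], [0, 2]]    # K=4, N=2
(A0, B0), (A1, B1) = sd_split(A, B)
```

Here A0 holds columns 1 to (K/2) of A and B0 holds rows 1 to (K/2) of B, while A1 and B1 hold the remaining columns and rows.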
SD MPUs of the SDMP can then multiply the column- and row-split matrices along dimension (K/2) to compute two partial dot products, corresponding to their respective (K/2) portions of the parent matrices. For example, one SD MPU can compute a partial dot product comprising a sum of (K/2) products of a row of matrix A0 multiplied by a column of matrix B0. A second SD MPU can compute a second partial dot product comprising a sum of (K/2) products of a row of matrix A1 multiplied by a column of matrix B1. One of the two SD MPUs (or, alternatively, another SD MPU or an adder circuit, such as an adder arithmetic logic unit, “ALU”) can then add the two partial dot products to compute a complete dot product of the corresponding row of the left side parent matrix multiplied by the corresponding column of the right side parent matrix, which can then be an element cij of an M×N results matrix C.
In particular, in an SDMP the two SD MPUs can compute their respective row/column products, and/or partial dot products, in parallel, reducing the overall latency to compute a complete dot product of any one row and column of the left and right side matrices. Additionally, because one of the SD MPUs can add the partial dot products using the same adder circuitry it uses to compute its respective partial dot product, an SDMP can reduce the hardware components required to compute a complete dot product of any one row and column of the left and right side matrices.
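As a rough software analogy only, the following sketch uses two worker threads to stand in for two SD MPUs and an element-wise addition to stand in for the adder. The names `matmul`, `add`, and `sd_matmul` are hypothetical, the sketch assumes the parent matrices are split into two halves, and Python threads do not model hardware parallelism or latency.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul(a, b):
    """Plain row-by-column multiply of two lists-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def sd_matmul(a, b):
    """Multiply a by b by splitting the shared dimension K into two halves."""
    half = len(b) // 2
    a0, a1 = [r[:half] for r in a], [r[half:] for r in a]   # column-split
    b0, b1 = b[:half], b[half:]                             # row-split
    # Each worker stands in for one SD MPU computing partial dot products.
    with ThreadPoolExecutor(max_workers=2) as pool:
        f0 = pool.submit(matmul, a0, b0)
        f1 = pool.submit(matmul, a1, b1)
        partial0, partial1 = f0.result(), f1.result()
    # Summing the partial results yields the complete dot products.
    return add(partial0, partial1)


A = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
B = [[1, 0], [0, 1], [2, 0], [0, 2]]
C = sd_matmul(A, B)
```

The result matches a direct multiplication of the parent matrices, since each element of C is the sum of the two corresponding partial dot products.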
Similarly, SD splitter 104 can receive or access right side parent matrix B in memory 102B and can split matrix B into row-split matrices B0 and B1, shown in
MPUs of an SDMP (not shown in
In implementations, an SD splitter can comprise, or can be included in, a processor of an SDMP, such as a host processor, runtime processor, RDU, and/or PCUs of tiles of an RDS and/or a program executable on one or more of these. An SD splitter can comprise a specialized logic circuit designed to split input matrices into split matrices. An SD splitter can comprise a compiler of an SDMP that can generate split matrices as, for example, an output of compiling a machine learning application model (e.g., an execution or configuration file of an RDS such as in the examples of Grohoski and Kumar). An SD splitter can comprise a configuration or runtime component of an SDMP (e.g., runtime processor of an RDS) and can generate split matrices as an output of configuring resources of an SDMP to execute or train a machine learning application model. In implementations, split matrices can be components of data associated with performing matrix operations in an SDMP (e.g., an RDS comprising an SDMP). For example, split matrices can be components of an execution file, an application graph, and/or configuration file of an RDS.
An SD splitter can comprise an input function of an SDMP to input left and right side parent matrices A and B into the MPUs for multiplying matrix A and matrix B. For example, an SD splitter can comprise a memory read function of an SDMP to read matrices A and B from a memory. When reading matrix A from the memory to input matrix A into the MPUs, for a memory address of matrix A in the memory corresponding to an address among columns 1 to (K/2) of matrix A, the SD splitter can output elements of these columns of matrix A from the memory to one set of MPUs (and/or to an M×(K/2) column-split matrix in a memory). For a memory address of matrix A corresponding to an address among columns (K/2)+1 to K of matrix A, the SD splitter can output elements of these columns of matrix A from the memory to another set of MPUs (and/or to another M×(K/2) column-split matrix in a memory).
Similarly, when reading matrix B from the memory to input matrix B into the MPUs, for a memory address of matrix B in the memory corresponding to an address among rows 1 to (K/2) of matrix B, the SD splitter can output elements of these rows of matrix B from the memory to one set of MPUs (and/or to a (K/2)×N row-split matrix in a memory). For a memory address of matrix B in the memory corresponding to an address among rows (K/2)+1 to K of matrix B, the SD splitter can output elements of these rows of matrix B from the memory to another set of MPUs (and/or to another (K/2)×N row-split matrix in a memory). In some implementations (e.g., an implementation in which memories containing matrices A and/or B comprise multiple read ports) the SD splitter can concurrently read multiple columns of parent matrix A and/or multiple rows of parent matrix B.
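The address-based routing described above can be sketched under an explicit assumption that is not stated in the disclosure: both parent matrices are stored in row-major order in a flat memory, so that an element's column (for matrix A) or row (for matrix B) can be derived from its address. The function names `route_a` and `route_b` are hypothetical, and the two output lists stand in for two sets of MPUs or two split matrices in a memory.

```python
def route_a(flat_a, m, k):
    """Route elements of a row-major M x K matrix by column index.

    Assumed layout: address = row * K + column. Columns below K/2 go to the
    first destination, the rest to the second.
    """
    half = k // 2
    first = [[] for _ in range(m)]
    second = [[] for _ in range(m)]
    for addr, value in enumerate(flat_a):
        row, col = divmod(addr, k)
        (first if col < half else second)[row].append(value)
    return first, second


def route_b(flat_b, k, n):
    """Route rows of a row-major K x N matrix by row index (address // N)."""
    half = k // 2
    first, second = [], []
    for addr in range(0, k * n, n):
        row_values = flat_b[addr:addr + n]
        (first if addr // n < half else second).append(row_values)
    return first, second


flat_A = [1, 2, 3, 4, 5, 6, 7, 8]    # 2 x 4 parent matrix A, row-major
flat_B = [1, 0, 0, 1, 2, 0, 0, 2]    # 4 x 2 parent matrix B, row-major
A0, A1 = route_a(flat_A, m=2, k=4)
B0, B1 = route_b(flat_B, k=4, n=2)
```

With a different memory layout (for example, column-major storage) the address arithmetic would change, but the routing decision would still depend only on the position of an element along the shared dimension.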
In implementations, memories among memories 102A-102F can be the same memory, or can be different memories. For example, memories 102C-102F can be memories of a host processor, runtime processor, RDU, and/or PMUs of tiles of an RDS. Memories 102C-102F can include memories communicatively coupled to an SDMP, and/or to an SD splitter.
As used herein, “SD MPUs” refers to MPUs of an SDMP designed or configured to compute dot products of split matrices, such as A0 and B0 and/or A1 and B1 in the examples of
SD MPU 200 is shown in
In
Matrix 202A can comprise an M×(K/2) column-split matrix of an M×K left side parent matrix A, and matrix 202B can comprise a (K/2)×N row-split matrix of a right side parent matrix B, where matrix A and B are split on shared dimension K, such as illustrated in the example of
To compute dot products of elements of a row of matrix 202A multiplied by elements of a column of matrix 202B, SD MPU 200, and/or MACC ALU 210, can execute from 2 to (K/2) number of MACC computation cycles to input elements (e.g., via read logic 204) of matrices 202A and 202B to MACC ALU 210, multiply the elements, and sum the products.
In MACC computation cycles, multiplier ALU 216 can multiply a pair of buffer A and corresponding buffer B elements and output the products to adder ALU 218. Adder ALU 218 can add the products to a value of ACC 220 to compute a partial dot product and, by summing products of the other elements of matrices 202A and 202B, to compute a complete dot product for a particular row of matrix 202A and column of matrix 202B. For example, multiplier ALU 216 can compute each product (a0b0), (a2b2), and (a3b3) and can output each of the products to adder ALU 218. Adder ALU 218 can add each product to ACC 220 to compute a partial dot product of a row of matrix 202A and column of matrix 202B.
As previously described, a partial dot product can comprise a single product of one element of a row of a left side matrix and a corresponding element of a column of a right side matrix. ACC 220 can comprise dot products computed for products of a row of matrix 202A and column of matrix 202B. MACC ALU 210 can, optionally, output the value of ACC 220 as a partial or complete dot product (comprising all (K/2) products) of a row of matrix 202A multiplied by a column of matrix 202B. MACC ALU 210 can initialize ACC 220 to have the value of product (a0b0) corresponding to the first column element of that row of matrix 202A (in matrix A buffer 212 a0) multiplied by the first row element of that column of matrix 202B (in matrix B buffer 214 b0). The initial dot product, as stored in ACC 220, is then just the product (a0b0) prior to computing and adding to ACC 220 products (a1b1), (a2b2), and (a3b3).
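The multiply-accumulate behavior described above can be modeled in software as a simple loop. This is a minimal sketch, not a description of the circuit: the function name `macc` and the `history` list (which records the accumulator value after each cycle) are assumptions added for illustration, and one loop iteration is assumed to correspond to one MACC computation cycle.

```python
def macc(buffer_a, buffer_b):
    """Model multiply-accumulate cycles over two equal-length buffers.

    The accumulator is initialized to the first product (a0b0); each later
    cycle multiplies one pair of elements and adds the product to it.
    """
    acc = buffer_a[0] * buffer_b[0]    # initial value: product (a0b0)
    history = [acc]
    for x, y in zip(buffer_a[1:], buffer_b[1:]):
        product = x * y                # role of the multiplier ALU
        acc = acc + product            # role of the adder ALU and accumulator
        history.append(acc)
    return acc, history


buffer_a = [1, 2, 3, 4]   # a0..a3: elements of one row of a column-split matrix
buffer_b = [5, 6, 7, 8]   # b0..b3: elements of one column of a row-split matrix
dot, history = macc(buffer_a, buffer_b)
```

Every intermediate value in `history` is a partial dot product, and the final value is the dot product over all elements held in the two buffers.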
Matrix 202C can, then, comprise partial results of multiplying parent matrices A and B (not shown in
While not shown in
As just described, in implementations a plurality of SD MPUs can each multiply a set of split matrices generated from a pair of parent matrices, which can enable an SDMP to multiply two parent matrices in parallel among the SD MPUs. In
While not shown in
In
Matrices 242 can comprise SD split matrices generated based on shared dimension K of matrices 260 such as in the examples of
Matrix 250A can be a results matrix comprising products, partial dot products, and/or complete dot products of multiplying matrix 242A and matrix 242B. Matrix 250A can be a results matrix comprising products, partial dot products, and/or complete dot products of column elements 1 to K/2 of a row of matrix 242A multiplied by corresponding row elements 1 to K/2 of a column of matrix 242B. Matrix 250B can be a similar M×N matrix comprising products, partial dot products, and/or complete dot products of column elements 1 to K/2 of a row of matrix 242C multiplied by corresponding row elements 1 to K/2 of a column of matrix 242D. Matrix 250C can be a results matrix comprising sums of product and/or dot product elements included in matrix 250A and/or matrix 250B.
While not shown explicitly in
In implementations, SD MPUs 246 can be SD MPUs similar or equivalent, for example, to SD MPU 200 of
For example, SD MPU 246A can output to matrix 250A one or more products and/or dot products of matrix 242A multiplied by matrix 242B. SD MPU 246B can output to matrix 250B one or more products and/or dot products of matrix 242C multiplied by matrix 242D. Alternatively, or additionally, SD MPU 246A can output to SD adder 248 one or more products and/or dot products of matrix 242A multiplied by matrix 242B. Similarly, alternatively or additionally, SD MPU 246B can output to SD adder 248 one or more products and/or dot products of matrix 242C multiplied by matrix 242D. SD adder 248 can receive products/dot products output to matrix 250A, and/or from SD MPU 246A, and products/dot products output to matrix 250B, and/or from SD MPU 246B, and can add the products/dot products to compute dot product elements of matrix 250C.
In implementations, SD adder 248 can comprise an adder ALU and, optionally, accumulator, such as adder ALU 218 and ACC 220 in
In
The examples of
Additionally, in implementations pairs of split matrices need not comprise the same number of column/row portions (e.g., K/n for n number of split matrices). For example, shared dimension K of two parent matrices (M×K and K×N) can be odd, such that splitting the parent matrices into two pairs of column- and row-split matrices leaves one pair with a (K/2) portion and the other with a (K/2)−1 portion, where (K/2) is rounded up to an integer.
However, it can be advantageous to generate symmetric pairs of matrices, such that each column-split matrix and each row-split matrix among pairs of column- and row-split matrices all have the same row and column dimensions. This can facilitate computing partial dot products of the pairs of split matrices in parallel in a uniform number of compute cycles to compute products and sum of products of each of the pairs of matrices. For example, if K=10, an SD splitter can split the parent matrices into 3 pairs of split matrices having dimensions M×3 and 3×N (such as A0/B0, A1/B1, and A2/B2) and 1 pair of split matrices, A3/B3, having dimensions M×1 and 1×N.
As matrices A3 and B3 are asymmetric with respect to matrices A0/B0, A1/B1, and A2B2, SD MPUs computing a partial dot product of A3 and B3 can compute the partial dot product in one dot product computation cycle, while SD MPUs computing partial dot products of matrices A0/B0, A1/B1, and A2B2 compute their respective partial dot products in three dot product computation cycles. Alternatively, an SD splitter can generate matrices A3 and B3 to include respective columns and rows of all zeros, such that matrices A3 and B3 are generated as respective M×3 and 3×N matrices and are symmetric to matrices A0/B0, A1/B1, and A2B2. The SD MPUs can then compute their respective partial dot products in parallel in the same 3 dot product computation cycles, without having to synchronize computation of a partial dot product computed in a single dot product computation cycle with computation of partial dot products computed in an asymmetric (e.g., 3) number of dot product computation cycles.
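The zero-padding alternative can be sketched in software as follows. This is a minimal illustration only, not the disclosed SD splitter: the function names (`split_shared_dimension`, `matmul`, `add`) and the example matrix values are assumptions introduced here, and matrices are modeled as plain Python lists.

```python
def split_shared_dimension(a, b, width):
    """Split A (M x K) by columns and B (K x N) by rows into pairs of
    `width` columns/rows, zero-padding the last pair so that every pair
    has dimensions M x width and width x N (hypothetical helper)."""
    k = len(b)        # shared dimension K: columns of A, rows of B
    n = len(b[0])
    pairs = []
    for start in range(0, k, width):
        stop = min(start + width, k)
        pad = width - (stop - start)  # number of all-zero columns/rows to add
        a_split = [list(row[start:stop]) + [0] * pad for row in a]
        b_split = [list(r) for r in b[start:stop]] + [[0] * n for _ in range(pad)]
        pairs.append((a_split, b_split))
    return pairs


def matmul(a, b):
    """Reference matrix multiply used to check the split result."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(c, d):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(r, s)] for r, s in zip(c, d)]


# K = 10 split into pairs of width 3: A0/B0, A1/B1, A2/B2, and a padded A3/B3.
M, K, N = 2, 10, 2
A = [[i + j for j in range(K)] for i in range(M)]
B = [[(i * j) % 5 for j in range(N)] for i in range(K)]

pairs = split_shared_dimension(A, B, 3)
total = [[0] * N for _ in range(M)]
for a_split, b_split in pairs:
    total = add(total, matmul(a_split, b_split))  # sum of partial results
```

Because the padded columns and rows contribute only zero-valued products, the sum of the four partial results equals the product of the parent matrices, while each pair takes the same number of multiply steps.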
In operation 302 the SD splitter determines that matrix A and matrix B share dimension K. Based on matrix A and B sharing dimension K, in operation 304 the SD splitter divides matrix A into column-split matrices A0 and A1 and divides matrix B into row-split matrices B0 and B1. In operation 304, the SD splitter can form the split matrices as previously described in reference to
In operation 306, the SD splitter can, optionally, determine if dimension K is odd. If so, splitting matrix A and B into two pairs of SD matrices can result in one of SD matrix A0 and A1 having dimension M×(K/2) and the other of matrix A0 and A1 having dimension M×(K/2+1), and one of SD matrix B0 and B1 having dimension (K/2)×N and the other of matrices B0 and B1 having dimension (K/2+1)×N. For example, if K=5, splitting matrices A and B into two pairs of SD matrices results in, for example, matrix A0 having dimension M×3 and matrix A1 having dimension M×2. Similarly, splitting matrices A and B on dimension K=5 results in matrix B0, for example, having dimension 3×N and matrix B1 having dimension 2×N.
Based on determining, in operation 306, that K is odd, in operation 308 the SD splitter can add an extra column (e.g., column 3 of M×2 matrix A1 in the foregoing example) of all zeros, and can add an extra row (e.g., row 3 of 2×N matrix B1 in the foregoing example) of all zeros. SD MPU 246A and SD MPU 246B can, concurrently, each execute 3 MACC computations to compute, respectively, a complete dot product of a row of M×3 matrix A0 multiplied by a column of 3×N matrix B0, and a complete dot product of a row of M×3 matrix A1 (as extended with all zeros in column 3) multiplied by a column of 3×N matrix B1 (as extended with all zeros in row 3). The all-zeros column and/or row can permit the SDMP to compute dot products of each pair of matrices symmetrically (each performing the same number of concurrent MACC computations), as the SDMP multiplying the last column element of a row of matrix A1 and the last row element of a column of matrix B1 produces a value of zero to include in dot products of matrices A1 and B1.
Alternatively, based on determining, in operation 306, that the shared dimension (e.g., K) is odd, in operation 308 an SDMP can program a processor, circuit, or memory (e.g., a processor, memory, or memory read or other special circuit of MPU0 and/or MPU1) to output zeros as elements of the (K/2)+1 column of a row of matrix A1 and/or the (K/2)+1 row of a column of matrix B1. In computing product (a13×b13), for example, the SDMP can output a value of zero for element b13 and/or a value of zero for element a13. Value zero for elements a13 and/or b13 produces a zero-value product to include in dot products of matrices A1 and B1, such that SD MPU 246A and SD MPU 246B can concurrently execute a symmetric number (3) of MACC computations to compute respective dot products of matrix A0 multiplied by matrix B0 and matrix A1 multiplied by matrix B1.
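The alternative of outputting zeros in place of missing elements, rather than storing padded matrices, can be illustrated with a short sketch. The `read_element` helper below is a hypothetical stand-in for a read circuit or memory programmed to output zeros; it is an assumption made for illustration and does not describe a particular circuit of the disclosure.

```python
def read_element(vector, index):
    """Hypothetical read path: return the stored element, or zero when the
    index falls past the end of the shorter split operand."""
    return vector[index] if index < len(vector) else 0


def macc(row, col, cycles):
    """Run a fixed number of multiply-accumulate cycles over a row and column."""
    acc = 0
    for i in range(cycles):
        acc += read_element(row, i) * read_element(col, i)
    return acc


# K = 5: A0/B0 hold 3 elements of the row/column, A1/B1 hold only 2.
a_row = [1, 2, 3, 4, 5]
b_col = [6, 7, 8, 9, 10]
cycles = (len(a_row) + 1) // 2  # 3 MACC cycles for both units

partial0 = macc(a_row[:3], b_col[:3], cycles)  # 1*6 + 2*7 + 3*8
partial1 = macc(a_row[3:], b_col[3:], cycles)  # 4*9 + 5*10 + 0*0
complete = partial0 + partial1
```

Both calls run the same three cycles; the third cycle of the second call reads zeros and leaves its partial dot product unchanged.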
In operation 310, the two sets of SD MPUs, MPU0 and MPU1, perform MACC cycles to compute dot products of a row of matrix A0 multiplied by a column of matrix B0 and dot products of a row of matrix A1 multiplied by a column of matrix B1. In implementations, MPU0 and MPU1 can each comprise one MPU, or one or both of MPU0 and MPU1 can comprise a plurality of MPUs operating in parallel as one combined SD MPU. To compute the dot products symmetrically (and, optionally, concurrently), MPU0 and MPU1 each perform K/2 (K/2 plus 1 if K is odd) number of MACC cycles.
In operation 312 of the (K/2) MACC cycles MPU0 computes products and/or dot products of a row of matrix A0 multiplied by a column of matrix B0, and in operation 314 MPU1 computes products and/or dot products of a row of A1 multiplied by a column of matrix B1. In operation 316 of the (K/2) MACC cycles MPU0 can, optionally, output products computed in operation 312. In operation 318, MPU0 can, optionally, output dot products computed in operation 312, and the dot products output by MPU0 can be partial dot products and/or can be complete dot products. Similarly, in operation 320 of the (K/2) MACC cycles MPU1 can, optionally, output products computed in operation 314 and/or, in operation 322 MPU1 can, optionally, output dot products computed in operation 314. In operation 322 dot products output by MPU1 can be partial dot products and/or can be complete dot products.
To compute products/dot products in operations 312 and 314, as described in reference to operation 308, for odd values of K the SD splitter can add a column of zeros to the smaller of split matrices A0 and A1, and can add a row of zeros to the smaller of split matrices B0 and B1. Alternatively, as also described in reference to operation 308, to compute products/dot products in operations 312 and 314 MPU0 and MPU1 (or, a read circuit reading matrices A0, A1, B0, and B1 from a memory, for example) can output zeros for the (K/2)+1 elements of the smaller of split matrices A0 and A1, and the smaller of split matrices B0 and B1.
In operations 316, 318, 320, and/or 322 MPU0 and/or MPU1 can output products/dot products to an adder component of the SDMP. In implementations, an adder component of the SDMP can comprise, for example, an adder ALU such as adder ALU 218 in
In operation 324, the adder can add products/dot products output by MPU0 and MPU1 to compute a complete dot product corresponding to a dot product of a row of parent matrix A multiplied by a corresponding column of parent matrix B. In implementations MPU0 and/or MPU1 can output, in operations 316, 318, 320, and/or 322, products/dot products to memories and/or registers, and the adder can access the products and/or dot products in the memories/registers. Alternatively, in operations 316, 318, 320, and/or 322 MPU0 and/or MPU1 can output the products and/or dot products directly to the adder. In operations 316, 318, 320, and/or 322 MPU0 and MPU1 can output any combination of products and/or dot products and in any particular order or sequence. In operation 324 the adder can receive and/or add outputs of MPU0 and MPU1 in any combination or sequence to produce a complete dot product.
In operation 326, the adder outputs the complete dot product. In operation 326 the adder can output the complete dot product of a row and column of respective matrices A and B to other MPUs, such as a successive forward and/or backward layer in a neural network. Additionally, or alternatively, in operation 326 the adder can output the complete dot product of a row and column of respective matrices A and B to a memory or registers, such as a memory containing a complete matrix C to receive the results of matrix A multiplied by matrix B.
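As a software analogy only, the flow of operations 302 through 326 can be sketched with two worker threads standing in for MPU0 and MPU1 and a plain addition standing in for the adder. The function names and the use of `ThreadPoolExecutor` are assumptions made for illustration. The sketch also relies on `zip` stopping at the shorter operand instead of padding with zeros, which gives the same sums.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_dot(row, col):
    """MACC loop of one (hypothetical) SD MPU over its share of K."""
    acc = 0
    for x, y in zip(row, col):
        acc += x * y
    return acc


def sd_multiply(a, b):
    """Compute C = A x B by splitting shared dimension K into two parts."""
    k = len(b)               # operation 302: A has K columns, B has K rows
    half = (k + 1) // 2      # operation 304: split point (A0/B0 gets the extra element if K is odd)
    m, n = len(a), len(b[0])
    c = [[0] * n for _ in range(m)]
    with ThreadPoolExecutor(max_workers=2) as pool:
        for i in range(m):
            for j in range(n):
                col = [b[r][j] for r in range(k)]
                # operations 312/314: the two partial dot products
                f0 = pool.submit(partial_dot, a[i][:half], col[:half])
                f1 = pool.submit(partial_dot, a[i][half:], col[half:])
                # operations 324/326: the adder sums and outputs the result
                c[i][j] = f0.result() + f1.result()
    return c


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C = sd_multiply(A, B)
```

In this example K=3 is odd, so one worker multiplies two element pairs and the other multiplies one; the summed results match an ordinary matrix product.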
In implementations, SDMPs, and/or components of SDMPs (e.g., SD MPUs), such as in the examples of
As has been described in reference to operations 316, 318, 320, and 322, for example, SD MPUs can compute products and/or dot products for one split matrix (e.g., a row of one split matrix multiplied by a column of another split matrix) and can output the products/dot products to another SD MPU. The receiving SD MPU can add the products/dot products to products/dot products computed by that and/or other SD MPUs.
Similar to
For purpose of illustrating the method, but not intended to limit implementations, K is assumed to be even. However, as illustrated in the example of method 300 in
Turning now to the details of method 400, based on two parent matrices having shared dimension K, in operation 402 the SDMP initiates computation of left side matrix A multiplied by right side matrix B (ΣAB) utilizing column-split matrices A0 and A1, and row-split matrices B0 and B1. More particularly, in operation 402 the SDMP initiates MPU0 computing ΣA0B0 and MPU1 computing ΣA1B1. Thus, in operation 404 MPU0 computes products and/or dot products of ΣA0B0 and, in operation 408, MPU1 computes products and/or dot products of ΣA1B1. In particular, in operation 404, MPU0 computes products/dot products of c11 among (a11b11+a12b21+ . . . +a1(k/2)b(k/2)1) and, in operation 408, MPU1 computes products/dot products of c11 among (a1(k/2+1)b(k/2+1)1+a1(k/2+2)b(k/2+2)1+ . . . +a1kbk1).
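Written in summation form, and assuming K is even as in this example, the division of element c11 between the two MPUs is:

```latex
c_{11} \;=\; \sum_{j=1}^{K} a_{1j}\, b_{j1}
\;=\; \underbrace{\sum_{j=1}^{K/2} a_{1j}\, b_{j1}}_{\text{MPU0: } \Sigma A_0 B_0}
\;+\; \underbrace{\sum_{j=K/2+1}^{K} a_{1j}\, b_{j1}}_{\text{MPU1: } \Sigma A_1 B_1}
```

The first sum covers the columns of A and rows of B assigned to A0 and B0, and the second covers those assigned to A1 and B1.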
In operation 406 MPU0 outputs products and/or dot products of ΣA0B0 to MPU1. For example, in operation 406 MPU0 can output products/dot products of a multiplier ALU, and/or an accumulator of MPU0, to MPU1. MPU0 can comprise a MACC ALU similar or equivalent to MACC ALU 210 in
In operation 410, MPU1 receives the products and/or dot products output from MPU0. In operation 410 MPU1 can receive the outputs of MPU0 as, for example, inputs to an input such as input 226 of MACC ALU 210 in
In operation 412 MPU1 adds the products and/or dot products received from MPU0 to products/dot products computed by MPU1. MPU1 can comprise a MACC ALU similar or equivalent to MACC ALU 210 of
In operations 406-412, to compute products/dot products of the split matrices, MPU0 and/or MPU1 can perform computations similar or equivalent to computations (e.g., MACC computations) of the example of SD MPU 200 in
In operation 414, MPU1 determines if the dot product computed in operation 412 is a complete dot product of all elements of a row of matrix A0 multiplied by all corresponding elements of a column of matrix B0, and all elements of a corresponding row of matrix A1 multiplied by all elements of a corresponding column of matrix B1. That is, in operation 414 MPU1 determines if the dot product computed in operation 412 comprises a complete dot product c11=(a11b11+a12b21+ . . . +a1kbk1).
If MPU1 determines, in operation 414, that the dot product computed in operation 412 is not a complete dot product, MPU0 and/or MPU1 repeat operations 404-412 to compute products/dot products needed to compute the complete dot product. If, on the other hand, MPU1 determines in operation 414 that the dot product computed in operation 412 is a complete dot product (e.g., element c11 of matrix C), in operation 416 MPU1 outputs the complete dot product to matrix C.
In implementations, in operation 416 MPU1 can output the complete dot product to a memory and/or to additional MPUs of the SDMP, such as successor forward and/or backward layer MPUs of a neural network. The SDMP can repeat operations 402 to 416 until MPU0 and MPU1 have computed an M times N number of elements of M×N matrix C (e.g., all elements from c11 to cmn of matrix C).
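The chained arrangement of method 400, in which MPU0 forwards its result to MPU1 and MPU1 adds it to its own products, can be sketched as follows. This is a simplified, sequential illustration under stated assumptions: the names `macc` and `chained_multiply` are hypothetical, K is even, and the forwarded value is modeled as the initial accumulator value of the second MACC loop.

```python
def macc(row, col, acc=0):
    """Multiply-accumulate a row and column, starting from `acc`."""
    for x, y in zip(row, col):
        acc += x * y
    return acc


def chained_multiply(a0, b0, a1, b1):
    """Compute C from split pairs A0/B0 and A1/B1 with MPU0 feeding MPU1."""
    m, n = len(a0), len(b0[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            col0 = [r[j] for r in b0]
            col1 = [r[j] for r in b1]
            forwarded = macc(a0[i], col0)           # operations 404 and 406 (MPU0)
            c[i][j] = macc(a1[i], col1, forwarded)  # operations 408 to 416 (MPU1)
    return c


# K = 4: parent A = [[1, 2, 3, 4], [5, 6, 7, 8]] split into A0 and A1,
# parent B = [[1, 0], [0, 1], [1, 0], [0, 1]] split into B0 and B1.
A0, A1 = [[1, 2], [5, 6]], [[3, 4], [7, 8]]
B0, B1 = [[1, 0], [0, 1]], [[1, 0], [0, 1]]
C = chained_multiply(A0, B0, A1, B1)
```

Seeding the second accumulation with the forwarded partial dot product avoids a separate adder stage, at the cost of MPU1 waiting for the output of MPU0 in this sequential model.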
SDMP 500 is shown in
SDMP 500 is shown in
Additionally, as will also be seen from further discussion of
Continuing with the example of SDMP 500, in implementations SD MPUs 510A and/or 510B can be SD MPUs similar or equivalent to SD MPU 200 of
In implementations, SD MPU 510A and/or SD MPU 510B can compute products and/or dot products of matrix 508A multiplied by matrix 508B and matrix 508C multiplied by matrix 508D. For example, SD MPU 510A can compute products and/or dot products of matrix 508A multiplied by matrix 508B, and SD MPU 510B can compute products and/or dot products of matrix 508C multiplied by matrix 508D. SD MPU 510A and/or SD MPU 510B can access matrix 508A, matrix 508B, matrix 508C, and/or matrix 508D in memories among memories 508, for example, to compute products and dot products of matrices 508.
SD MPUs 510 can compute the product and/or dot product results using a method, or operations of a method similar to method 300 of
In implementations, one SD MPU can compute products/dot products of one pair of split matrices and another SD MPU can compute products/dot products of another pair of split matrices. One of the SD MPUs, another SD MPU, and/or an adder component of an SDMP, can add the products/dot products together to compute a complete dot product of a row of matrix A multiplied by a column of matrix B to store in an M×N results matrix C.
SD MPU 510A can input elements of matrix 508A from memory 508A, and elements of matrix 508B from memory 508B, to compute products, and/or dot products, of matrix 508A multiplied by matrix 508B. SD MPU 510B can input elements of matrix 508C from memory 508C, and elements of matrix 508D from memory 508D, to compute products, and/or dot products, of matrix 508C multiplied by matrix 508D.
As shown in
As SD MPU 510A outputs the products and/or dot products to SD MPU 510B, SD MPU 510B (e.g., SD MACC ALU 512B of SD MPU 510B) can receive the products/dot products via output/input 518 and can add the products/dot products received from SD MPU 510A to dot products computed by SD MPU 510B (and/or computed by another SD MPU, not shown in
SD MPU 510A can output to SD MPU 510B products of matrix 508A multiplied by matrix 508B from, for example, a multiplier ALU, such as a multiplier ALU similar to multiplier ALU 216 of
SD MPU 510B can input the products/dot products and add these to products/dot products of optional matrix C1 in memory 516B or, alternatively, to an accumulator of SD MPU 510B containing a dot product. The accumulator can comprise (accumulate) a sum of products/dot products computed by SD MPU 510B, computed by SD MPU 510A, and/or computed by another SD MPU of SDMP 500 not shown in
The examples of the disclosure are illustrated using two SD MPUs and two pairs of column- and row-split matrices for simplicity of the illustrations. However, these examples are not intended to limit implementations; as previously described, SDMP systems, and/or configurations of SDMP systems, can utilize a plurality of SD MPUs, and/or a plurality of SD split matrices, to perform SD-based matrix multiplication of parent matrices. Further, an SD MPU is not limited to outputting products/dot products to only one other SD MPU (and/or to one memory or storage element), nor is an SD MPU limited to receiving products/dot products from only one other SD MPU (and/or from one memory or storage element).
In implementations, an SD MPU can have a plurality of outputs and/or inputs to output products/dot products to, and/or input products/dot products computed by, other SD MPUs of an SDMP. Multiple SD MPUs can compute and/or output/input products/dot products in parallel. A single SD MPU can accumulate products/dot products of multiple other SD MPUs to compute a dot product of products/dot products output by multiple other SD MPUs.
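A single unit accumulating the outputs of several others can be sketched as below. The helper names and the even-as-possible split are assumptions for illustration; the sketch models each contributing SD MPU as one call to `partial_dot` and the accumulating SD MPU as a running sum.

```python
def partial_dot(row, col):
    """Partial dot product computed by one (hypothetical) SD MPU."""
    return sum(x * y for x, y in zip(row, col))


def split_bounds(k, parts):
    """Return (start, stop) bounds covering range(k) in at most `parts` chunks."""
    width = -(-k // parts)  # ceiling division
    return [(s, min(s + width, k)) for s in range(0, k, width)]


def accumulate_from_many(row, col, parts):
    """Accumulate the partial dot products of several split pairs."""
    acc = 0
    for start, stop in split_bounds(len(row), parts):
        acc += partial_dot(row[start:stop], col[start:stop])
    return acc


row = [1, 2, 3, 4, 5, 6, 7, 8]
col = [1, 1, 1, 1, 1, 1, 1, 1]
bounds = split_bounds(len(row), 4)
result = accumulate_from_many(row, col, 4)
```

Because addition is associative, the accumulating unit can take the partial dot products in any order.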
Multiple SD MPUs can compute products/dot products of the same pairs of column- and row-split matrices. As one alternative to SD MPU 510A operating on matrix 508A and matrix 508B, and SD MPU 510B operating on matrix 508C and matrix 508D, as shown in
While not shown in
Additionally, or alternatively, as illustrated in
Implementations can comprise a computer program product and can include a computer readable storage medium (or media) having computer readable program instructions of the computer program product incorporated therein. It will be understood by one of ordinary skill in the art that computer readable program instructions can implement each or any combination of operations and/or structure of the disclosure, such as illustrated by the drawings and described herein.
The computer readable program instructions can be provided to one or more processors, and/or other elements, of a computing system or apparatus to produce a machine which can execute, via the processor(s), to implement operations and/or actions similar or equivalent to those of the disclosure. The computer readable program instructions can be stored in a computer readable storage medium that can direct one or more processors, and/or other elements, of a computing system or apparatus to function in a particular manner, such that the computer readable storage medium comprises an article of manufacture including instructions to implement operations and/or structures similar or equivalent to those of the disclosure.
The computer readable program instructions of the computer program product can cause one or more processors to perform operations of the disclosure. A sequence of program instructions, and/or an assembly of one or more interrelated programming modules, of the computer program product can direct one or more processors and/or computing elements of a computing system to implement the elements and/or operations of the disclosure including, but not limited to, the structures and operations illustrated and/or described in the present disclosure.
A computer readable storage medium can comprise any tangible (e.g., hardware) device, or combination of tangible devices, that can store instructions of the computer program product and that can be read by a computing element to download the instructions for use by a processor. A computer readable storage medium can comprise, but is not limited to, electronic, magnetic, optical, electromagnetic, and/or semiconductor storage devices, or any combination of these. A computer readable storage medium can comprise a portable storage medium, such as a magnetic disk/diskette, optical disk (CD or DVD); a volatile and/or non-volatile memory; a memory stick, a mechanically encoded device, and any combination of these. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as electrical signals transmitted through a wire, radio waves or other freely propagating electromagnetic waves, or electromagnetic waves propagating through a wave transmission medium (e.g., a wave guide or fiber-optic cable).
The computer readable program instructions can be communicated from the computer readable storage medium to the one or more computing/processing devices, via a programming API of a computing system, and/or a communications interface of a computing system, having access to the computer readable storage medium, and/or a programming API of a computing system, and/or a communications interface of the one or more computing/processing devices. The API(s) and/or communications interface(s) can couple communicatively and/or operatively to a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The API(s) and/or communications interface(s) can receive the computer readable program instructions read from computer readable storage medium and can forward the computer readable program instructions to the one or more computing/processing devices via the API(s), communications interface(s), and/or network.
In implementations, the computer readable program instructions of the computer program product can comprise machine language and/or assembly language instructions, instruction-set-architecture (ISA) instructions, microcode and/or firmware instructions, state-setting data, configuration data for integrated circuitry, source code, and/or object code. The instructions and/or data can be written in any combination of one or more programming languages.
The computer readable program instructions can execute entirely, or in part, on a user's computer, as a stand-alone software package; partly on a user's computer and partly on a remote computer; or, entirely on a remote computer. A remote computer can be connected to a user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN). In implementations, electronic circuitry including, for example, FPGAs, PLAs, and/or CGRPs can execute the computer readable program instructions by utilizing state information of the computer readable program instructions to configure the electronic circuitry to perform operations or elements of the disclosure, such as illustrated by the drawings and described herein.
In implementations, computer readable program instructions can also be loaded onto a computing system, or component(s) thereof, to cause the computing system and/or component(s) thereof to perform a series of operational steps to produce a computer implemented process, such that the instructions which execute on the computing system, or component(s) thereof, implement the operations or elements of the disclosure, such as illustrated by the drawings and described herein.
The flowchart and block diagrams in the Drawings and Incorporations illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various implementations of the present invention. Individual elements illustrated in the Figures—such as individual operations illustrated in the flowcharts or individual blocks of block diagrams—may represent a module, segment, or portion of executable instructions for implementing the disclosed function(s). In various alternative implementations, particular operations may occur in an order differing from that illustrated in the examples of the drawings. For example, two operations shown in succession in a diagram of the disclosure may, in a particular implementation, be executed substantially concurrently, or may sometimes be executed in a reverse order, depending upon the functionality involved. It will be further noted that particular blocks of the block diagrams, operations of the flowchart illustrations, and/or combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented using special purpose hardware and/or systems that, individually or in combination, perform the specified functions, acts, and/or computer instructions.
Terminology used herein, and the examples disclosed, are chosen to illustrate the principles of the implementations, the practical application or technical improvement over alternative technologies, and to enable others of ordinary skill in the art to understand the implementations disclosed herein. The disclosure illustrates various example implementations, and the examples are intended to illustrate principles and aspects of the disclosure, but are not intended to limit implementations, nor intended to be exhaustive of implementations that may be conceived within the scope of the disclosure. It would be apparent to one of ordinary skill in the art that alternative implementations can comprise modifications and combinations within the spirit of the disclosure and the scope of the claims.
As can be seen in the foregoing examples, features of the disclosure can comprise methods and apparatuses of computing systems. A summary of example implementations of such features includes:
A method comprises: determining, by a computing system, that a left hand matrix, comprising M number of rows and K number of columns, and a right hand matrix, comprising K number of rows and N number of columns, share dimension K;
The example of implementation 1, wherein the dot product comprises a complete dot product.
The example of implementation 1, wherein the first MPU and the second MPU comprise different MPUs.
The example of implementation 1, wherein P is numerically less than Q; wherein the method of the computing system generating the second column-split matrix comprises generating, by the computing system, the second column-split matrix comprising P minus Q number of columns, columns (P+1) to Q of the second column-split matrix comprising all zeros; wherein the method of the computing system generating the second row-split matrix comprises generating, by the computing system, the second row-split matrix comprising P minus Q number of rows, rows (P+1) to Q of the second row-split matrix comprising all zeros; and, wherein the method of the second MPU computing the second partial dot product comprises computing, by the second MPU, products of elements among columns (P+1) to Q of the row of the second column-split matrix multiplied by respective elements among rows (P+1) to Q of the column of the second row-split matrix.
The example of implementation 1, wherein P is numerically less than Q; and,
The example of implementation 1, wherein the method of the first MPU computing the first partial dot product comprises the first MPU computing the first partial dot product as a multiply-accumulate (MACC) computation.
The example of implementation 6, wherein the MACC computation comprises adding, by the first MPU, the products of the row of the first column-split matrix multiplied by the column of the first row-split matrix, to an accumulator.
The example of implementation 7, wherein the method of the second MPU computing the second partial dot product comprises adding, by the second MPU, an output of the accumulator to the second partial dot product.
A computer program product, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, wherein the program instructions are executable by at least one processor to cause the at least one processor to:
The example of implementation 9, wherein P is numerically less than Q; and,
11. A computing system, the system comprising: a plurality of matrix compute units (MPUs), and a Shared Dimension (SD) splitter; the SD splitter configured to: determine that a left hand matrix, comprising M number of rows and K number of columns, and a right hand matrix, comprising K number of rows and N number of columns, share dimension K;
The example of implementation 11, wherein the dot product comprises a complete dot product.
The example of implementation 11, wherein the first MPU and the second MPU comprise different MPUs.
The example of implementation 13, wherein the first MPU is further configured to output the first partial dot product to the second MPU; and, wherein the second MPU configured to compute the second partial dot product comprises the second MPU further configured to add the first partial dot product to the products among the products of the row of the second column-split matrix multiplied by the column of the second row-split matrix.
The example of implementation 11, wherein P is numerically less than Q; wherein the SD splitter configured to generate the second column-split matrix comprises the SD splitter further configured to generate the second column-split matrix comprising P minus Q number of columns, columns (P+1) to Q of the second column-split matrix comprising all zeros; wherein the SD splitter configured to generate the second row-split matrix comprises the SD splitter further configured to generate the second row-split matrix comprising P minus Q number of rows, rows (P+1) to Q of the second row-split matrix comprising all zeros; and, wherein the SD splitter configured to compute the second partial dot product comprises the SD splitter configured to compute products of elements among columns (P+1) to Q of the row of the second column-split matrix multiplied by respective elements among rows (P+1) to Q of the column of the second row-split matrix.
The example of implementation 11, wherein P is numerically less than Q; and, wherein the second MPU configured to compute the second partial dot product comprises the second MPU further configured to compute a (P+1) product as a value of zero and adding the (P+1) product to products among products included in the second partial dot product.
The example of implementation 11, wherein the first MPU comprises a multiply-accumulate arithmetic logic unit; and, wherein the first MPU configured to compute the first partial dot product comprises the multiply-accumulate arithmetic logic unit configured to compute the first partial dot product as a multiply-accumulate computation.
The example of implementation 17, wherein the multiply-accumulate arithmetic logic unit comprises an accumulator; and, wherein the multiply-accumulate arithmetic logic unit configured to compute the first partial dot product as a multiply-accumulate computation comprises the multiply-accumulate arithmetic logic unit configured to: compute a product of a column element of the row of the first column-split matrix and a corresponding row element of the column of the first row-split matrix; compute the first partial dot product as a sum of the product and a first value of the accumulator; and, store the first partial dot product in the accumulator.
The example of implementation 11, wherein at least one of the first MPU, the second MPU, and the third MPU comprise more than one MPU among the plurality of MPUs.
The example of implementation 11, wherein at least one of the first MPU, the second MPU, and the third MPU comprise a reconfigurable dataflow unit.
A method comprises receiving, by a first Matrix Processing Unit (MPU) included in a computing system, based on a left side matrix and a right side matrix having a shared dimension, column elements of a row of a first column-split matrix and row elements of a column of a first row-split matrix, the left side matrix comprising the shared dimension number of columns and the right side matrix comprising the shared dimension number of rows, the first column-split matrix comprising a first number of columns among columns of the left side matrix, the first row-split matrix comprising the first number of rows among rows of the right side matrix;
The method further comprises computing, by the first MPU, a first partial dot product comprising a sum of products of elements among the column elements of the row of the first column-split matrix multiplied by corresponding elements among the row elements of the column of the first row-split matrix; computing, by the second MPU, concurrent with the first MPU computing the first partial dot product, a second partial dot product comprising a sum of products of elements among the column elements of the row of the second column-split matrix multiplied by elements among the row elements of the column of the second row-split matrix; and, computing, by a third MPU included in a computing system, a dot product comprising a sum of the first partial dot product and the second partial dot product.
The example of implementation 21, wherein the first MPU computing the first partial dot product comprises the first MPU outputting the first partial dot product to the third MPU; and, wherein the method of the third MPU computing the sum of the first partial dot product and the second partial dot product comprises the third MPU adding the first partial dot product, output by the first MPU, to the second partial dot product.
The example of implementation 21, wherein the first number is greater than the second number; wherein, based on the first number greater than the second number, the second column-split matrix further comprises an all-zeros column, each element of the all-zeros column having value zero; wherein, based on the first number greater than the second number, the second row-split matrix further comprises an all-zeros row, each element of the all-zeros row having the value zero; and, wherein the second MPU computing the second partial dot product comprises the second MPU adding, to the second partial dot product, a product of a row element of the all-zeros column of the second column-split matrix multiplied by a row element of the all-zeros row of the second row-split matrix.
The example of implementation 21, wherein the first number is greater than the second number; and, wherein the second MPU computing the second partial dot product comprises the second MPU adding to the second partial dot product, based on the first number greater than the second number, a value of zero.
The example of implementation 21, wherein the first MPU comprises an accumulator; and, wherein the first MPU computing the first partial dot product comprises the computing system adding a product, among products included in the first partial dot product, to the accumulator.
The example of implementation 25, wherein the first MPU computing the first partial dot product further comprises the first MPU computing the first partial dot product as a first multiply-accumulate (MACC) computation.
The example of implementation 26, wherein the third MPU computing the dot product comprises the third MPU computing the sum of the first partial dot product and the second partial dot product as a second MACC computation.
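A minimal Python sketch of the accumulator-based computation follows. The class `MaccUnit` and the function `partial_dot_macc` are hypothetical names used only for illustration. The sketch also assumes that the third MPU expresses the sum as MACC steps with a multiplier of 1, which is one possible reading and is not stated in the text above.

```python
class MaccUnit:
    """Toy model of an MPU that holds a single accumulator."""

    def __init__(self):
        self.accumulator = 0

    def macc(self, a, b):
        """One multiply-accumulate step: accumulator += a * b."""
        self.accumulator += a * b
        return self.accumulator


def partial_dot_macc(unit, row_elements, column_elements):
    """Compute a partial dot product as a sequence of MACC steps."""
    unit.accumulator = 0
    for a, b in zip(row_elements, column_elements):
        unit.macc(a, b)
    return unit.accumulator


first_mpu, second_mpu, third_mpu = MaccUnit(), MaccUnit(), MaccUnit()
first_partial = partial_dot_macc(first_mpu, [1, 2], [1, 0])
second_partial = partial_dot_macc(second_mpu, [3, 4], [2, 3])

# Assumption: the third unit forms the sum as MACC steps with multiplier 1.
third_mpu.macc(first_partial, 1)
third_mpu.macc(second_partial, 1)
```

With these sample inputs the first accumulator ends at 1, the second at 18, and the third at 19.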
A computing system comprises a first matrix processing unit (MPU), a second MPU, and a third MPU.
The first MPU is configured to receive, based on a left side matrix and a right side matrix having a shared dimension, column elements of a row of a first column-split matrix and row elements of a column of a first row-split matrix, the left side matrix comprising the shared dimension number of columns and the right side matrix comprising the shared dimension number of rows, the first column-split matrix comprising a first number of columns among columns of the left side matrix, the first row-split matrix comprising the first number of rows among rows of the right side matrix; and, to compute a first partial dot product comprising a sum of products of column elements of a row of the first column-split matrix multiplied by corresponding row elements of a column of the first row-split matrix.
The second MPU is configured to receive, based on the left side matrix and the right side matrix having the shared dimension, column elements of a row of a second column-split matrix and row elements of a column of a second row-split matrix, the second column-split matrix comprising a second number of columns among the shared dimension number of columns of the left side matrix, the second row-split matrix comprising the second number of rows among the shared dimension number of rows of the right side matrix; and, to compute, concurrent with the first MPU computing the first partial dot product, a second partial dot product comprising a sum of products of column elements of a row of the second column-split matrix multiplied by corresponding row elements of a column of the second row-split matrix. The third MPU is configured to compute a sum of the first partial dot product and the second partial dot product.
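The arrangement of three MPUs can be modeled loosely in Python, with two worker threads standing in for the first and second MPUs and the calling thread standing in for the third. The function `three_mpu_dot` and the sample inputs are assumptions for this sketch. Threads are used only to mirror the structure of concurrent computation and do not represent the hardware described.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_dot(row_elements, column_elements):
    """Sum of products of corresponding elements (the work of one MPU)."""
    return sum(a * b for a, b in zip(row_elements, column_elements))


def three_mpu_dot(first_inputs, second_inputs):
    """Each argument is a (row_elements, column_elements) pair for one MPU."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(partial_dot, *first_inputs)    # first MPU
        second = pool.submit(partial_dot, *second_inputs)  # second MPU
        # Third MPU: add the two partial dot products once both are ready.
        return first.result() + second.result()


RESULT = three_mpu_dot(([1, 2], [1, 0]), ([3, 4], [2, 3]))
```

Here the two submitted tasks may run at the same time, and the final addition waits on both results, matching the dependency between the partial dot products and their sum.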
The example of implementation 28, wherein the second MPU and the third MPU comprise the same MPU.
The example of implementation 28, wherein the first MPU and the second MPU comprise different MPUs.
The example of implementation 28, wherein the computing system further comprises a memory; wherein the first MPU is further configured to output the first partial dot product to the memory; and, wherein the third MPU configured to compute the dot product comprises the third MPU further configured to: input the first partial dot product from the memory; and, add the first partial dot product input from the memory to the second partial dot product to compute the sum of the first partial dot product and the second partial dot product.
The example of implementation 28, wherein the first number is greater than the second number; wherein the processor configured to generate the second column-split matrix comprises the processor further configured to generate the second column-split matrix, based on the first number greater than the second number, further comprising an all-zeros column, each element of the all-zeros column having the value zero; wherein the processor configured to generate the second row-split matrix comprises the processor further configured to generate the second row-split matrix, based on the first number greater than the second number, further comprising an all-zeros row, each element of the all-zeros row having the value zero; and, wherein the second MPU configured to compute the second partial dot product comprises the second MPU further configured to add, to the second partial dot product, a product of a row element of the all-zeros column of the second column-split matrix multiplied by a row element of the all-zeros row of the second row-split matrix.
The example of implementation 32, wherein the computing system further comprises read logic configured to: input, to the second MPU, from a first memory included in the computing system, the row element of the all-zeros column of the second column-split matrix; and, input, to the second MPU, from a second memory included in the computing system, the column element of the all-zeros row of the second row-split matrix.
The example of implementation 28, wherein the first number is greater than the second number; and, wherein the second MPU configured to compute the second partial dot product comprises the second MPU further configured to add, based on the first number greater than the second number, a value of zero to the second partial dot product.
The example of implementation 28, wherein at least one of the first MPU, the second MPU, and the third MPU comprises a reconfigurable dataflow unit.
The example of implementation 28, wherein the first MPU comprises a first accumulator; and, wherein the first MPU configured to compute the first partial dot product comprises the first MPU further configured to add, to the first accumulator, products among the products of column elements of the row of the first column-split matrix multiplied by corresponding row elements of the column of the first row-split matrix.
The example of implementation 36, wherein the first MPU is further configured to output, from the first accumulator to the third MPU, the first partial dot product; and, wherein the third MPU configured to compute the dot product comprises the third MPU further configured to add the first partial dot product, output to the third MPU from the first accumulator, to the second partial dot product.
The example of implementation 37, wherein the third MPU comprises a second accumulator; and, wherein the third MPU configured to add the first partial dot product, output to the third MPU from the first accumulator, to the second partial dot product comprises the third MPU further configured to add the first partial dot product, output to the third MPU from the first accumulator, to the second accumulator.
The example of implementation 28, wherein the first MPU configured to compute the first partial dot product comprises the first MPU further configured to perform a multiply-accumulate (MACC) computation to compute the first partial dot product.
The example of implementation 39, wherein the first MPU comprises a MACC arithmetic logic unit (ALU); and, wherein the first MPU configured to perform the MACC computation comprises the MACC ALU configured to perform the MACC computation.
This application is a continuation of U.S. Non-Provisional patent application Ser. No. 18/105,695, filed Feb. 3, 2023, titled “Exploiting Shared Dimensions In Matrix Computations”, which is incorporated by reference herein in its entirety. This application claims the benefit of U.S. Provisional Patent Application No. 63/307,593 filed Feb. 7, 2022, which is incorporated by reference herein in its entirety. This application claims the benefit of U.S. Provisional Patent Application No. 63/307,594 filed Feb. 7, 2022, which is incorporated by reference herein in its entirety. This application claims the benefit of U.S. Provisional Patent Application No. 63/307,604 filed Feb. 7, 2022, which is incorporated by reference herein in its entirety.
Number | Date | Country
---|---|---
63307593 | Feb 2022 | US
63307594 | Feb 2022 | US
63307604 | Feb 2022 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 18105695 | Feb 2023 | US
Child | 18378278 | | US