The technology disclosed relates to computing devices and methods for performing matrix and tensor computations in computing systems. The computations can be utilized in applications such as artificial intelligence (e.g., knowledge-based systems, reasoning systems, machine learning systems, and knowledge acquisition systems), unstructured data (e.g., video, audio, and natural language) analysis, and neural networks. Computing systems and/or devices utilizing technology disclosed herein can comprise Coarse-Grained Reconfigurable Architectures (CGRAs).
The present disclosure relates to computing systems for executing data parallel and/or dataflow computing applications, such as in machine learning and neural networks. The disclosure further relates to methods and structures of a computing system to perform tensor and/or matrix computations such as can be included in machine learning and/or neural networks. Computing systems of the present disclosure include computing systems utilizing reconfigurable processing architectures, such as computing systems comprising Coarse-Grained Reconfigurable Processors (CGRPs).
The drawings included in the present disclosure are incorporated into, and form part of, the specification. They illustrate implementations of the present disclosure (hereinafter, “the disclosure”) and, along with the description, serve to explain the principles of the disclosure. The drawings are intended to be only illustrative of certain implementations and are not intended to limit the disclosure.
Aspects of the present disclosure (hereinafter, “the disclosure”) relate to methods of performing matrix sum-product computations in computing systems. More particular aspects relate to improving parallelism of matrix computations, and reducing processing cycle times of computing systems, by integrating a matrix addend in an additional column of a multiplicand matrix and extending a row or column of another multiplicand matrix to include a constant. Implementations of the disclosure (hereinafter, “implementations”) can perform matrix summation computations, such as a sum of a matrix addend and the sum-product of multiplicand matrices (Σw a+b), by computing a sum-product of two integrated summation (ISUM) multiplicand matrices (Σwb a) and omitting a separate addition of an addend to the sum-product of the multiplicand matrices.
Aspects of the disclosure can also particularly apply to processors of data parallel (DP) computing systems, such as Central Processing Units (CPUs), Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), and Digital Signal Processors (DSPs). Certain aspects of the disclosure relate to performing tensor and/or matrix computations in computing systems utilizing reconfigurable processor architectures, such as computing systems utilizing Coarse-Grained Reconfigurable Architectures (CGRAs), and/or reconfigurable Application Specific Integrated Circuits (ASICs) or Application Specific Instruction-set Processors (ASIPs).
Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. The disclosure in some instances repeats references to these options. However, omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections—these recitations are hereby incorporated forward by reference into each of the following implementations.
Particular expressions of the disclosure will be understood to have the following operative meanings:
As used herein, “incorporated subject matter” refers, collectively, to subject matter disclosed, and/or otherwise encompassed, among the disclosures incorporated herein by reference. For purposes of illustrating the disclosure, but not intended to limit implementations, various terms of the disclosure are drawn from the incorporated subject matter. As used herein, unless expressly stated otherwise, such terms as may be found in the incorporated subject matter have the same meanings, herein, as their meanings in their respective incorporated disclosures.
Aspects of the disclosure can be appreciated through a discussion of example implementations and/or applications of methods and/or systems. However, such examples are for purposes of illustrating the disclosure. It should be understood that the intention is not to limit the disclosure to the example implementations described herein, but to encompass all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure. Thus, the disclosure is not intended to be limited to the implementations shown but is to be accorded the widest scope consistent with the principles and features disclosed herein. Various modifications to the disclosed examples will be readily appreciated by those of ordinary skill in the art, and the general principles defined herein may be applied to other implementations of the disclosure without departing from the spirit and scope of the disclosure.
Turning now to more particular aspects of the disclosure, some computing applications comprise computations that can be executed concurrently, in parallel among a plurality of computational elements, and/or by a pipeline of computational elements (processors, and/or programs executing on processors, of a dataflow computing system). As the application data and computational results “flow” through successive processing elements of a dataflow computing system, such pipelined applications are also referred to as “dataflow” applications. Examples of such dataflow applications include machine learning (ML) and deep machine learning (DML) methods of Artificial Intelligence (AI) applications; image processing; stream processing (e.g., processing of streaming video and/or audio data); natural language processing (NLP); and/or recommendation engines.
Dataflow computing systems can comprise reconfigurable processing elements (reconfigurable processors, or “RPs”) particularly designed and/or configured to efficiently perform dataflow computing applications. Reconfigurable processors, such as field programmable gate arrays (FPGAs) and/or CGRA-based processors, can be configured to implement a variety of computational and/or data transfer functions more efficiently or faster than might be achieved using a general-purpose processor executing a computer program. Prabhakar, et al., “Plasticine: A Reconfigurable Architecture for Parallel Patterns,” ISCA '17, Jun. 24-28, 2017, Toronto, ON, Canada, (hereinafter, “Prabhakar”) describes example CGRAs, and systems utilizing such CGRAs, that can be particularly advantageous in dataflow computing systems. Accordingly, aspects of the disclosure relate to methods and systems utilizing reconfigurable dataflow resources, such as resources of a CGRA. However, the disclosure is not necessarily limited to such applications and/or computing systems.
As used herein, the term “CGRA” refers interchangeably to a coarse grain reconfigurable architecture and a computing hardware implementation—such as an integrated circuit, chip, or module—based on, or incorporating, a coarse grain reconfigurable architecture. In implementations of the disclosure (hereinafter, “implementations”), systems based on, and/or incorporating, CGRAs, such as the example of Prabhakar, can be particularly adaptable to, and increasingly efficient in, performing dataflow and/or data parallel application processing. Hardware resources of a CGRA (e.g., PCUs, PMUs, tiles, networks, and/or network interfaces) can comprise one or more Integrated Circuits (ICs). As used herein, the term “chip” refers to an IC (or, combination of ICs) that can embody elements of a CGRA. A chip can typically be packaged in a chip module (e.g., a single chip module, “SCM” or, alternatively, a multi-chip module, “MCM”).
As used herein, the term “reconfigurable dataflow system (RDS)” refers to a computing system that is based on, and/or can utilize, reconfigurable dataflow resources, such as resources of CGRAs, to perform operations of dataflow applications. Owing to reconfigurability, reconfigurable dataflow systems can perform these operations more efficiently than systems comprising fixed or non-reconfigurable resources. As also used herein, the term “application” refers to any computing application (e.g., software program), and/or computing system, that utilizes an RDS, to perform algorithms and/or computations of the application. An application can execute, for example, on a processor included in, or coupled to, an RDS.
U.S. Nonprovisional patent application Ser. No. 16/239,252, “VIRTUALIZATION OF A RECONFIGURABLE DATA PROCESSOR”, to Grohoski, et al, (hereinafter, “Grohoski”), and U.S. Nonprovisional patent application Ser. No. 16/922,975, “RUNTIME VIRTUALIZATION OF RECONFIGURABLE DATA FLOW RESOURCES”, to Kumar, et al, (hereinafter, “Kumar”), both incorporated herein by reference, illustrate example implementations of a reconfigurable dataflow architecture and reconfigurable dataflow systems.
Kumar illustrates a dataflow system (e.g., an RDS) comprising user applications, programming libraries (e.g., deep learning frameworks), a software development kit, computation graphs associated with user applications, compilers, execution files that can specify operations of a user application to perform using resources (reconfigurable data flow resources) of the dataflow system, and host and runtime processors. User applications can comprise data parallel and/or dataflow applications. As illustrated by the examples of Kumar an RDS can comprise a plurality of physical racks each comprising one or more compute nodes (hereinafter, for brevity, “nodes”).
In the examples of Kumar a host and runtime processors can, for example, facilitate compiling a dataflow application, determining particular RDS resources to execute the application, and managing execution of the RDS resources in performing operations of the application. In the examples of Kumar a node can comprise a host processor, a runtime processor, and reconfigurable processors (“RPs”), and a runtime processor can include kernel drivers and/or a user space library (e.g., a library of programs a user can include, or can invoke, in a dataflow application and that can execute in a user space of a runtime processor).
In implementations, an RP can comprise reconfigurable processing elements with reconfigurable interconnections. In the examples of Grohoski and Kumar, reconfigurable processing elements of RPs can comprise one or more arrays (“tiles”) of configurable processors (pattern compute units, “PCUs”) and/or memory units (pattern memory units, “PMUs”). Within a tile, the PCU processing and memory units can be interconnected by an array-level network (ALN) of switches. Tiles can be interconnected, such as via a top-level network (TLN), to form RPs comprising multiple tiles. Thus, in the examples of Grohoski and Kumar, an RP can comprise a set of tiles and/or subarrays of a tile.
As illustrated by Kumar and Grohoski, a reconfigurable data-flow unit (RDU) of a dataflow system can comprise a dynamically reconfigurable hardware resource of the system that includes processing elements (e.g., RPs) to perform operations of dataflow applications. RDUs of a dataflow system can comprise (e.g., be based upon), for example, a CGRA. An RDU can comprise a set of processing elements (e.g., RPs), I/O interfaces to communicate among processors of differing RDUs, and, optionally, a memory. In the examples of Kumar and Grohoski an RDU can comprise other than simply computational elements (e.g., processors, such as PCUs) and/or memories (e.g., PMUs), such as clock circuits, control circuits, switches and/or switching circuits, and interconnection interface circuits (e.g., processor, memory, I/O bus, and/or network interface circuits). Kumar also illustrates that an RDU can include virtualization logic and/or RP configuration logic.
For purposes of illustrating the disclosure, but not intended to limit implementations, the disclosure occasionally refers to the example of an RDU comprising RPs of Kumar to illustrate a reconfigurable processing element for executing operations (e.g., computations and/or data transfer) of dataflow applications, such as matrix and tensor computations of dataflow applications. However, it would be appreciated by one of ordinary skill in the art that a processing element of a dataflow computing system can comprise any form of hardware processor, or combination of hardware processors, memories, interconnections, and/or ancillary circuits (e.g., clocks, control, interface, and/or status circuits), that can perform operations of dataflow applications. Dataflow processing elements can comprise, for example, central processing units (CPUs); accelerator-class processors; matrix computation units (MCUs); intelligence processing units (IPUs); graphics processing units (GPUs); and/or field programmable gate arrays (FPGAs) configured to perform particular dataflow application computations. According to examples of the incorporated references, RPs can comprise (e.g., can be based upon), for example, a coarse-grained reconfigurable architecture (CGRA).
Many dataflow applications—such as machine learning, stream processing, image/video processing, and other complex computational applications—involve linear algebra computations over tensor data, such as matrix multiplication, transposition, and addition. Algorithms commonly employed in dataflow applications include algorithms such as linear regression and gradient descent over tensors and/or matrices of tensors. As used herein, the term “Tensor Computing System (TCS)” refers to a computing system configured to process tensors, such as dataflow computing systems, systems including neural networks, and any other computing system that includes hardware and/or software components for processing tensors.
A TCS can include general processors and can include specialized processors and/or computation units, such as accelerators, GPUs, FPGAs, CGRA accelerators, and other types of compute units. With reference to the examples of Grohoski and Kumar, processors and/or memories of a TCS can comprise processors and/or memories of RDUs and/or RPs of RDUs (e.g., tiles, PCUs, and/or PMUs). A TCS can comprise programs executable on such processors. A TCS can comprise specialized programs for processing tensors, such as programs for compiling dataflow applications for execution on particular TCS processing elements, programs to configure particular TCS processing elements for executing dataflow applications (e.g., matrix computations of dataflow applications), and/or programs for executing dataflow applications on particular TCS processing elements.
Tensors can comprise matrices of varying dimensions, and a variety of computing systems, including dataflow computing systems, can perform matrix computations, such as General Matrix Multiplication (GeMM), matrix summation, matrix transposition, gradient computations, and/or backpropagation of matrix computations, to process tensors in dataflow applications such as machine learning in neural networks. As used herein, brackets and a capital letter, such as [M], are used to refer to a matrix as a whole, while lowercase letters, such as m, are used to refer to an element, or set of elements, of a matrix [M]. For example, an expression such as (w×a) refers, herein, to a multiplication of a set of elements of matrices [W] and [A], such as elements of a row of matrix [W] multiplied by elements of a corresponding column of matrix [A]. The term “element”, in reference herein to a matrix, refers to the contents (e.g., a scalar value) of a row and column cell of the matrix.
A common computation for processing tensors in dataflow applications is a sum of products of two multiplicand matrices added to a matrix addend. The products comprise products of elements of a row of one multiplicand matrix multiplied by corresponding elements of a column of a second multiplicand matrix, where the row and column are the same (shared) matrix dimension. As used herein, the term “sum-product” refers to a sum of two or more products of elements of multiplicand matrices. An expression such as (Σw a) refers to a sum-product of elements w and a (e.g., a sum of products w×a for elements of a row of a matrix [W] multiplied by elements of a column of a matrix [A]). As an example, a sum-product of element w11 of matrix [W] multiplied by a11 of matrix [A], and w12 multiplied by a21 of matrix [A], is [w11×a11+w12×a21].
A “matrix summation” computation, as used herein, refers to a matrix computation in which a sum-product of two multiplicand matrices is added to a matrix addend. A matrix addend can comprise a constant or can comprise a matrix (which can itself be multiplied by a matrix multiplied by a constant) sharing a row dimension of the sum-product of two multiplicand matrices. A “weight-bias function”, y=Σw a+b, is one example of such a computation, in which a weights matrix [W] is multiplied by an activation matrix [A] and the sum-products, Σw a, for each row/column set of products, is added to elements of a bias matrix [B]. A more general form of a matrix summation computation can be expressed as y=Σw a+sb, where “s” is a constant, such as one or another constant. When “s” equals constant one, the more general matrix summation computation becomes the weights-bias function y=Σw a+b. Thus, while the examples of the disclosure frequently refer to an example weights-bias function in which “s”=1, it will be understood by one of ordinary skill in the art that “s” can equally have values other than “1” without materially altering the examples of the disclosure.
Tensor computing systems can utilize neural networks to execute dataflow application algorithms, and neurons in a neural network can process tensors (e.g., can perform matrix computations) of such algorithms. A combination of neurons in a layer of a neural network is often referred to as an “operator” and an operator can perform an activation function involving tensor computations.
The term “Addend sum-product”, as used herein, refers to a sum yi,j = Σ wi,dim adim,j + bi computed for elements of row i of an M×K multiplicand matrix [W], column j of a K×N multiplicand matrix [A], and row i of an M×1 addend matrix [B] (or, a column of constant values), computed for some or all values of dim within 1 to K. Correspondingly, as used herein, the term “Addend Sum matrix” refers to an M×N matrix of Addend sum-product elements, yi,j, computed as yi,j = Σ wi,dim adim,j + bi for all values of i from 1 to M, all values of j from 1 to N, and all values of dim from 1 to K.
In an activation function such as a weight-bias computation a conventional TCS, as presently known in the art, commonly computes sum-products of two multiplicand matrices (e.g., [W] and [A]) and then adds an Addend matrix (e.g., [B]) as a separate and subsequent computation. That is, a conventional TCS commonly computes a complete M×N intermediate sum-product matrix of (Σw a) and subsequently adds all row elements, from 1 to M, of matrix [B], to all elements, from row 1 to M, of all columns, from 1 to N, of the intermediate results matrix.
Continuing the example of a weight-bias function,
GEMM 206 can perform general matrix multiplication of weights matrix 210 and activation matrix 212. GEMM 206 can comprise a matrix multiply processor and can receive elements of weights matrix 210 and activation matrix 212 from memories 202A and 202B, and can compute sum-products of the weights and activation elements. GEMM 206 can store the sum-product results in memory 202C as elements of intermediate sum-product matrix 214. Subsequently, adder 208 can retrieve elements of intermediate sum-product matrix 214 from memory 202C and elements of bias matrix 216 from memory 202D, can add these, and can store the Addend Sum (elements of the bias matrix plus the sum-product elements of intermediate results matrix) results in Addend Sum matrix 218 in memory 202E.
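For illustration, the following sketch models the conventional two-step computation just described in software, using NumPy. The matrix names, shapes, and values are arbitrary assumptions, and the sketch is not intended to represent the hardware datapath of a conventional TCS:

import numpy as np

# Illustrative model of the conventional two-step weights-bias computation:
# a GeMM producing an intermediate sum-product matrix, followed by a separate,
# subsequent addition of the addend (bias) matrix. Shapes are assumed.
M, K, N = 4, 3, 5
W = np.random.rand(M, K)   # weights matrix [W]
A = np.random.rand(K, N)   # activation matrix [A]
B = np.random.rand(M, 1)   # bias (addend) matrix [B]

intermediate = W @ A       # M x N intermediate sum-product matrix
Y = intermediate + B       # separate addition of the addend to every column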
Performing tensor sum-product and addition computations as two separate and serial computations, such as in a conventional TCS, can add computational latency and can correspondingly limit, or reduce, computational performance of a dataflow computing system. For example, serial addition of multiplicand matrix sum-products and addend elements can include additional latency associated with transfer of intermediate sum-product results between memories of (or, accessible to) computational elements of a TCS, such as one memory holding intermediate sum-products and a second memory holding resulting Addend Sum matrix elements.
Serial multiplicand sum-product and addend addition computations can require dedicated memories (e.g., scratch pad memories) and/or computation units (e.g., additional MCUs) to perform sum-product computations prior to, and separate from, addition of a matrix addend. Computational units of a TCS (e.g., a sum-product ALU and/or adder ALU) can be underutilized while awaiting other computational results. For example, an adder ALU, and/or related circuits or processors, can be idle and, correspondingly, underutilized while awaiting results (and/or transfer of results) of sum-product computations stored in an intermediate memory. A sum-product ALU can be idle, or underutilized, for example, while awaiting completion of addend addition utilizing an intermediate sum-product matrix, or a memory containing an intermediate sum-product matrix.
To improve matrix computational efficiency, reduce computational and/or memory transfer latencies, increase computational throughput, and/or reduce the number and/or type of computational units and/or memories, implementations can comprise an enhanced, “Integrated Summation (ISUM)” TCS. An ISUM TCS can generate two “ISUM matrices” from multiplicand and addend matrices of a matrix summation computation (e.g., [W], [A], and [B] in a weights-bias computation). Using the ISUM matrices an ISUM TCS can compute a sum-product of the two ISUM matrices that is equivalent to an Addend Sum matrix computed as an intermediate sum-product matrix of two multiplicand matrices subsequently added to a matrix addend.
An ISUM TCS can generate ISUM matrices that take advantage of a shared dimension of multiplicand and addend matrices. An ISUM TCS can integrate an addend matrix that shares a row dimension with a multiplicand matrix to generate an ISUM “integrated matrix”. For example, an ISUM TCS can generate an M×(K+1) integrated matrix having, in columns 1 to K, columns 1 to K of an M×K multiplicand matrix [W] and, in an additional (K+1) column of the ISUM integrated matrix, an M×1 addend matrix [B]. More generally, an ISUM TCS can generate an M×(K+P) ISUM integrated matrix comprising K number of multiplicand columns having, in columns 1 to K of the ISUM integrated matrix, corresponding columns of an M×K multiplicand matrix; and, comprising P number of “addend columns” having, in each of columns (K+1) to (K+P) of the ISUM integrated matrix, an “integrated addend”.
As used herein, the term “multiplicand column” refers to an M×1 column of an M×K multiplicand matrix, such as an M×K matrix [W], or a K×N matrix [A], in a weights-bias computation such as [Σw a+b]. The term “integrated addend”, as used herein, refers to a single column of an addend matrix sharing the row dimension of a multiplicand matrix, such as an M×1 column of an addend matrix sharing row dimension M of an M×K multiplicand matrix.
Correspondingly, as used herein, the term “addend column” refers to a column of an ISUM integrated matrix comprising an integrated addend. In an M×(K+P) ISUM integrated matrix, each of the P number of addend columns of the ISUM integrated matrix can comprise an integrated addend of an addend matrix having row dimension M. In implementations, as just described, an addend column of an ISUM integrated matrix can comprise elements of a column of an addend matrix (e.g., matrix [B] in computing [Σw a+b]) sharing the row dimension of a multiplicand matrix (e.g., matrix [W] in computing [Σw a+b]). An addend column can comprise, alternatively, a value of a constant (e.g., constant value 1 or a value of another constant) in each of its rows.
An ISUM TCS can generate a second ISUM multiplicand matrix based on a shared (or, partially shared) dimension of an ISUM integrated matrix and the second input multiplicand matrix, such as dimension K of an M×(K+P) ISUM integrated matrix and a K×N input multiplicand matrix. An ISUM multiplicand matrix can comprise, for example, a K×N input multiplicand matrix or, alternatively, can comprise a (K+P)×N “ISUM row-extended matrix” comprising the K×N input multiplicand matrix extended to have an additional P number of rows (or, in a column-extended variant, P number of columns) of constants (e.g., a constant in each column of each of the P additional rows of the ISUM row-extended matrix).
As used herein, the term “constant row” refers to a matrix having row dimension 1 and containing the same constant value in each column of the matrix. In an ISUM row-extended matrix, each row of the P rows of the ISUM row-extended matrix can comprise a constant row, and each constant row can comprise the same constant value, or can comprise different constants (e.g., values of a plurality of constant factors in a matrix summation computation such as will be seen in
As also used herein, the term “ISUM multiplicand matrix” refers to any input multiplicand matrix to be multiplied by an ISUM TCS (or, components thereof) to compute a sum-product of the ISUM multiplicand matrix and an ISUM integrated matrix. Thus, an ISUM multiplicand matrix can be an input multiplicand matrix as input (i.e., having only the elements of the input multiplicand matrix) or, alternatively, can be an ISUM row-extended matrix.
An ISUM TCS can compute an Integrated Sum matrix (or, elements thereof) equivalent to (Σw a+sb), where s is a constant, such as one, or another constant, by computing only sum-products of an ISUM integrated matrix and an ISUM multiplicand matrix. The ISUM TCS can compute the equivalent output matrix, [Y], without requiring, or utilizing, a separate and subsequent addition of the matrix addend to an intermediate sum-product matrix. An ISUM TCS can thereby improve overall TCS design and/or tensor computational performance, by simplifying TCS computations and eliminating latencies and/or under-utilization of TCS resources associated with storing intermediate sum-product matrices and performing serial sum-product and addend matrix addition computations.
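As an illustration of this equivalence, the following NumPy sketch constructs an ISUM integrated matrix and an ISUM row-extended matrix from assumed weights, activation, and bias matrices and checks that their sum-product matches the conventional Addend Sum; the names, shapes, and constant are illustrative assumptions rather than a prescribed implementation:

import numpy as np

# Construct ISUM matrices and verify that their sum-product equals the
# conventional Addend Sum (W @ A + s*B), using assumed example shapes.
M, K, N = 4, 3, 5
s = 1.0                                    # constant "s" (1 for a weights-bias function)
W = np.random.rand(M, K)
A = np.random.rand(K, N)
B = np.random.rand(M, 1)

WB = np.hstack([W, B])                     # M x (K+1) ISUM integrated matrix
AE = np.vstack([A, s * np.ones((1, N))])   # (K+1) x N ISUM row-extended matrix

Y_isum = WB @ AE                           # sum-products only; no separate addend step
Y_conventional = W @ A + s * B             # intermediate sum-product plus addend
assert np.allclose(Y_isum, Y_conventional)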
Continuing with the example of a weight-bias function,
To simplify the illustration of generating an ISUM integrated and row-extended matrix, and computing a matrix from these ISUM matrices, the description of
As used herein, the term “integrated sum-product” refers to a sum of (all, or only some) products of elements of a row i of an M×(K+1) ISUM integrated matrix and respective elements of a column j of a (K+1)×N ISUM row-extended matrix, such as Σwi, dim adim,j for values of dim within the range 1 to (K+1) for a given value of i and j. Correspondingly, as used herein, the term “Integrated Sum” refers to an integrated sum-product computed over all (K+1) elements of a row i of an ISUM integrated matrix and corresponding (K+1) or, alternately, K, elements of a column j of an ISUM multiplicand matrix, and “Integrated Sum Matrix” refers to a matrix comprising Integrated Sums. As will be seen through a discussion of the examples of the disclosure, an Integrated Sum is equivalent to an Addend Sum, and an Integrated Sum matrix equivalent to an Addend Sum matrix.
In computing an Integrated Sum equivalent to an Addend Sum, an ISUM TCS can omit a separate and subsequent addition of a matrix addend, such as bias matrix [B] 118 in
In implementations, an ISUM TCS can comprise an ISUM matrix integrator (hereinafter, for brevity, “an integrator”), illustrated by the example of integrator 228 in
An ISUM integrator can comprise processors and/or programs of an ISUM TCS (or, of one or more components of a TCS, such as a processing unit of an ISUM TCS), and/or can comprise logic circuits, configured to compute ISUM matrices. An ISUM integrator can comprise a processor of a TCS, such as a host or runtime processor of an RDS, or an RP of an RDU. An ISUM integrator can comprise a processor of a computer, or computing system, including or coupled to memories 230D and/or 230E, and/or can comprise a specialized logic circuit of a TCS, or of a component of a TCS.
An ISUM TCS can receive ISUM matrices (e.g., in a memory, or as an argument of an API) as inputs, and need not include a component to generate the ISUM matrices. Thus, while the examples of the disclosure refer to an ISUM integrator as a component of an ISUM TCS, it would be appreciated by one of ordinary skill in the art that an ISUM integrator can be any component of a dataflow system, or communicatively coupled to a dataflow system, that can generate ISUM matrices from input multiplicand and addend matrices.
Using the example of a weights-bias function, in
Integrator 228 can generate ISUM matrix A 234 (hereinafter, “matrix A 234”) as a (K+1)×N ISUM row-extended matrix containing rows 1 through K of K×N matrix A 226 in rows 1 through K of matrix A 234, and a constant row in row (K+1), such as illustrated by the example of ISUM matrix 124 of
An ISUM TCS can compute an integrated sum-product, such as a sum of products of elements of a row i of matrix WB 232 and column j of matrix A 234, using a multiply-accumulate (MACC) computation, in which an accumulator stores a cumulative sum of products of elements of matrix WB 232 row i and matrix A 234 column j. As used herein, the term “MACC sum-product” refers to a sum of integrated sum-products computed as a sequence of MACC computations, and “MACC Sum” refers to a sum of MACC sum-products computed over all elements of a row i of an ISUM integrated matrix and a column j of an ISUM multiplicand matrix. Thus, an element, yi,j, of an Integrated Sum matrix can comprise a MACC Sum, Σ wbi,dim adim,j, computed over all values of dim from 1 to (K+1) for a row i of an integrated matrix [WB] and column j of an ISUM multiplicand matrix [A]. An Integrated Sum matrix of MACC Sums is equivalent to an Addend Sum matrix computed as a sum of an intermediate sum-product matrix and a matrix addend.
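The following sketch illustrates a MACC sum-product in software: a single Integrated Sum element is accumulated from products of a row of an ISUM integrated matrix and a column of an ISUM multiplicand matrix. The function name and Python form are illustrative assumptions, not an interface of an ISUM MCU:

# Illustrative MACC loop computing one Integrated Sum element from row i of
# the integrated matrix and column j of the multiplicand matrix.
def macc_sum(wb_row, a_col):
    acc = 0.0                      # accumulator, initialized to zero
    for wb, a in zip(wb_row, a_col):
        acc += wb * a              # one MACC computation: multiply, then accumulate
    return acc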
As shown by the example of TCS 220 in
Memories among memories 302 can be memories of components of a dataflow computing system, such as memories of an RDU, memories of a host and/or runtime processor, and/or memories of an ISUM TCS and/or ISUM MCU.
Integrator 304 can be an ISUM matrix integrator, such as integrator 228 in
ISUM MCU 310 can compute, in an integrated MACC computation, an Integrated Sum matrix, shown in
ISUM MCU 310 is shown, in
ISUM TCS 300 (or, ISUM MCU 310 of ISUM TCS 300) can perform MACC computations cyclically to compute an element of an Integrated Sum matrix. An ISUM MACC computation cycle (hereinafter, for brevity, simply “MACC cycle”) can comprise MACC computations that compute one Integrated Sum element of an Integrated Sum matrix. For example, in
A constant input element, such as 336, can comprise, for example, a single instance of constant value s. Elements of an addend matrix can have a particular data size, such as 8 or 16 bits. A constant input element can, then, have a data size corresponding to the data size (e.g., a respective 8 or 16 bits) of elements of the addend matrix. A constant input element can comprise a scalar value stored in a location in a memory of a TCS or ISUM MCU, a register of an ISUM MCU, and/or a hard-wired input element having constant value s conforming to a data size of the elements of the addend matrix.
ACC 330 can comprise an accumulator to accumulate sums of matrix products. Prior to performing a sequence of ISUM MACC cycles, MACC ALU 320 can initialize ACC 330 to zero. In a MACC cycle ISUM MCU 310 can multiply pairs of tensor A buffer 322 and tensor WB buffer 324 elements and output the products to adder ALU 328 and adder ALU 328 can add the products to a value stored in ACC 330. Adder ALU 328 can store the sum-product result in ACC 330 to compute, in successive buffer load and MACC cycles, an Integrated Sum, yij, over all (K+1) elements of row i of matrix 302B and column j of matrix 302A.
As multiplier ALU 326 outputs tensor A buffer 322 and tensor WB buffer 324 element products, adder ALU 328 can add each product to ACC 330. For example, as multiplier ALU 326 generates a product of (a0×w0), adder ALU 328 can add that product to the current value of ACC 330. Similarly, as multiplier ALU 326 generates a product of (a1×w1), adder ALU 328 can add that product to the current value of ACC 330, such that the accumulator now has the value of (a0×w0)+(a1×w1) added to a preceding value of ACC 330. Multiplier ALU 326 and adder ALU 328 can repeat MACC cycles to compute the sum-product of all 4 elements of tensor A buffer 322 and tensor WB buffer 324.
Adder ALU 328 can receive each product and can serially (e.g., in each computation cycle of multiplier ALU 326) add it to a value stored in ACC 330. Alternatively, multiplier ALU 326 can compute some or all of the tensor A buffer 322 times tensor WB buffer 324 products concurrently, adder ALU 328 can receive more than one product output from multiplier ALU 326 concurrently, and adder ALU 328 can add those products to the value of accumulator ACC 330. Adder ALU 328 and ACC 330 can thereby compute a sum of products output from multiplier ALU 326 over a sequence of MACC ALU 320 computation cycles. An ISUM TCS (and/or, an ISUM MCU of an ISUM TCS) can store computed MACC Sum elements in a memory.
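To illustrate the alternation of buffer load cycles and MACC compute cycles described above, the following sketch accumulates one Integrated Sum element in chunks of an assumed 4-element buffer; the function form and buffer size are illustrative assumptions, not the MCU circuit itself:

# Illustrative alternation of buffer load cycles (filling 4-element buffers from
# memory) and MACC compute cycles (multiplying and accumulating the buffered
# element pairs) to produce one Integrated Sum element.
def integrated_sum(wb_row, a_col, buf_size=4):
    acc = 0.0
    for start in range(0, len(wb_row), buf_size):
        wb_buf = wb_row[start:start + buf_size]   # buffer load cycle
        a_buf = a_col[start:start + buf_size]
        for wb, a in zip(wb_buf, a_buf):          # MACC compute cycles
            acc += wb * a
    return acc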
To illustrate in more detail, consider that in
In a buffer load cycle, ISUM MCU 310 can load elements a11, a21, a31, and a41 of matrix 302A (e.g., the first 4 elements of column 1 of ISUM matrix [A]), from memory 302A into tensor A buffer 322, and can load elements w11, w12, w13, and w14 of matrix 302B (e.g., the first 4 elements of row 1 of ISUM matrix [WB]) from memory 302B into tensor WB buffer 324. MACC compute cycles of MACC ALU 320 can then compute [a11×w11+a21×w12+a31×w13+a41×w14] for the four (i.e., “K”) elements of row 1 of matrix 302B and column 1 of matrix 302A.
In computing the (K+1) product, element K+1 of matrix 302A column 1 comprises scalar 1, and element K+1 of row 1 of matrix 302B comprises element 1 of column 1 of addend matrix [B]. Thus, the product (a51×w15) is computed as (1×b1), and the sum-product of all K+1 products [a11×w11+a21×w12+a31×w13+a41×w14+a51×w15] is equivalent to [a11×w11+a21×w12+a31×w13+a41×w14+1×b1]. Thus, by computing K+1 products of an ISUM integrated matrix and an ISUM multiplicand matrix (in the example just described, an ISUM row-extended matrix), MACC ALU 320 can compute an Integrated Sum, equivalent to an Addend Sum, utilizing only sum-product (e.g., MACC) computations, without performing a subsequent addition of a sum-product matrix and a matrix addend.
In implementations, a multiplier ALU, such as multiplier ALU 326, and an adder ALU and accumulator, such as adder ALU 328 and ACC 330, can perform multiplication and addition computations concurrently (in parallel). For example, multiplier ALU 326 can compute a subset of tensor A buffer 322 and tensor WB buffer 324 products and output these to adder ALU 328 to add and accumulate to prior products. Concurrent with adder ALU 328 adding the output products to current values of ACC 330, multiplier ALU 326 can continue to compute additional (new) products of tensor A buffer 322 and tensor WB buffer 324 elements. Likewise, concurrent with multiplier ALU 326 computing additional (new) products of tensor A buffer 322 and tensor WB buffer 324 elements, adder ALU 328 can compute an accumulated sum of previous products received from multiplier ALU 326.
In implementations an ISUM MCU can, optionally, include multiplier selection logic, shown as selection logic 340 in
During a MACC cycle, select 332 can receive outputs of tensor A buffer 322 and constant input element 336 and can output to multiplier ALU 326 either an input received from tensor A buffer 322 or constant input element 336, for multiplier ALU 326 to compute a product of the output of select 332 and an element of tensor WB buffer 324. In computing a product of an element of column (K+1) of matrix 302B (elements b1 to bM of addend matrix [B]) and a constant in row (K+1) of matrix 302A, on a (K+1) MACC cycle select 332 can output constant s of constant input element 336 to multiplier ALU 326, as an alternative to outputting an element of a row (K+1) of matrix 302A. For example, prior to computing a MACC Sum of a row of matrix 302B and a column of matrix 302A, ISUM MCU 310 can set the value of counter 334 to “1”. After computing a sum-product of each element of the row of matrix 302B and column of matrix 302A, MCU 310 can increment counter 334.
For values of counter 334 from 1 to K, counter 334 can configure select 332 to output elements received from tensor A buffer 322. When the value of counter 334 reaches (K+1), the counter can configure select 332 to output the value, “s”, of constant input element 336 as a multiplicand of a (K+1) element of matrix 302B received from tensor WB buffer 324. If the value of “s” is 1, for example, the (K+1) product computation of the column (K+1) element of matrix 302B, which is an element b of addend matrix [B], is then (1×b) and the MACC Sum of that row (e.g., row i) of matrix 302B and column (e.g., column j) of matrix 302A for dim=1 to (K+1) is [wi,1×a1,j + wi,2×a2,j + … + wi,K×aK,j + 1×wi,K+1], in which wi,K+1 is bi of addend matrix [B]. As can be seen in this example, by select 332 selecting constant input element 336 on the (K+1) MACC cycle, matrix 302A can be a K×N ISUM multiplicand matrix, omitting the (K+1) row of constants (value “s” of constant input element 336, for example).
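The following sketch models the multiplier selection logic in software: a counter selects elements of a K×N multiplicand column for the first K cycles and substitutes the constant s on the (K+1) cycle, so the row of constants need not be stored. The function form is an illustrative assumption, not the selection circuit itself:

# Illustrative model of multiplier selection logic: for cycles 1..K the
# multiplicand comes from the column of the K x N matrix; on cycle K+1 the
# constant s is selected instead, multiplying the integrated addend element.
def integrated_sum_with_select(wb_row, a_col, s=1.0):
    K = len(a_col)                 # wb_row has K+1 elements; a_col has only K
    acc = 0.0
    for dim in range(K + 1):       # counter values 1..(K+1), zero-based here
        multiplicand = a_col[dim] if dim < K else s
        acc += wb_row[dim] * multiplicand
    return acc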
In implementations, an ISUM MCU, such as ISUM MCU 310 in
In implementations, an ISUM TCS can comprise one or more ISUM Processing Units (ISUM PUs). ISUM PUs can comprise, for example, components for generating ISUM matrices, memories to contain ISUM matrices, and/or MCUs (or, components of ISUM MCUs, such as MACC ALUs, etc.).
Integrator 354 can be an ISUM matrix integrator, such as the example of integrator 228 in
ISUM PU 352 (or, ISUM MCU 360 of ISUM PU 352) can compute ISUM integrated sum-products, and/or an Integrated Sum, such as (Σwb aE) over matrix 356A and matrix 356B. An ISUM PU (or, an ISUM MCU of an ISUM PU) can perform K+P computation cycles to compute an Integrated Sum of a row of an M×(K+P) ISUM integrated matrix and a (K+P)×N, or K×N, ISUM multiplicand matrix.
Programs of an ISUM PU can comprise programs executable on a processor of an ISUM PU (and/or an ISUM MCU of an ISUM PU) to perform operations of an ISUM integrator, to generate ISUM integrated and/or multiplicand matrices. Programs of an ISUM PU can comprise programs to compute products of ISUM matrix elements, and/or sum-products of ISUM matrix elements, and can compute the sum-products using MACC computations. Programs of an ISUM PU can comprise programs to program multiplier selection logic. Memories of an ISUM PU can contain program instructions of programs of an ISUM PU; can comprise matrix element buffers, such as tensor A buffer 322 and/or tensor WB buffer 324 in
ISUM TCS 400 is further shown comprising integrator 410 and memories 404A, 404B, and 404C (collectively, “memories 404”).
In another example, the number, “n”, of MCUs among MCUs 402 can be larger than the number of rows, M, of matrix 404B. A TCS can comprise many thousands of MCUs (e.g., in the example of Grohoski and Kumar, an RDS can comprise many thousands of PCUs and/or PMUs) such that the number, “n”, of MCUs 402 can be many thousands and MCUs among MCUs 402 can compute a subset of products, and/or sum-products, of matrices 404B and 404A and can thereby greatly increase parallel computations of Integrated Sums of matrices 404B and 404A.
Based on respective subset elements received from matrices 404A and 404B, each of MCUs 402 can compute a corresponding subset, shown in
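For illustration, the following sketch models row-parallel computation of an Integrated Sum matrix, assigning an assumed subset of rows of the ISUM integrated matrix to each of an assumed number of compute units; the function, NumPy inputs, and partitioning are illustrative assumptions, and in hardware the per-unit computations would proceed concurrently rather than in a loop:

import numpy as np

# Illustrative row-parallel computation: each compute unit receives a subset of
# rows of the ISUM integrated matrix and computes the corresponding rows of the
# Integrated Sum matrix; the loop stands in for concurrent hardware units.
def parallel_integrated_sum(WB, AE, n_units=4):
    M = WB.shape[0]
    row_blocks = np.array_split(np.arange(M), n_units)   # one row subset per unit
    Y = np.empty((M, AE.shape[1]))
    for rows in row_blocks:
        Y[rows, :] = WB[rows, :] @ AE                     # sum-products only
    return Y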
MCUs among MCUs 402 can include multiplier selection logic (not shown explicitly in
ISUM TCS 400, and/or ISUM MCUs among MCUs 402, can include processors, such as a neural network, a host processor, runtime processor, RDU and/or processors of RDUs, and/or accelerator processors (CGRAs, FPGAs, GPUs, etc.). TCS 400 and/or ISUM MCUs among MCUs 402 can comprise ISUM programs, such as programs for generating ISUM integrated matrices and/or computing ISUM integrated sum-products and/or Integrated Sums, and the programs can execute on processors of the TCS and/or MCUs.
As previously described, an ISUM TCS can comprise an ISUM PU.
Integrator 440 can be a component of ISUM PU 430, as shown in
For example, in one implementation an ISUM TCS can utilize K number of ISUM MCUs in which each ISUM MCU computes a sum-product of one row of an ISUM integrated matrix and one column of an ISUM multiplicand matrix, and one of the ISUM MCUs computes an Integrated sum of the sum-products of all of the ISUM MCUs over the row of the ISUM integrated matrix and column of the ISUM multiplicand matrix. In another example, an ISUM MCU can utilize K number of ISUM ALUs in which each ISUM ALU computes a sum-product of one row of an ISUM integrated matrix and one column of an ISUM multiplicand matrix, and one of the ISUM ALUs computes an Integrated sum of the sum-products of all of the ISUM ALUs over the row of the ISUM integrated matrix and column of the ISUM multiplicand matrix. It would be appreciated by one of ordinary skill in the art that an ISUM TCS can employ any combination of individual ISUM PUs, ISUM MCUs, and/or ISUM ALUs to compute any individual product and/or subset of sum-products of ISUM matrices.
As described in reference to TCS 400, in
In operation 502 of method 500, the TCS receives, or otherwise accesses, input matrices [A], [W], and [B] and generates ISUM matrix [AE] as a (K+1)×N ISUM row-extended matrix, having rows 1-K of input matrix [A] in rows 1-K of ISUM matrix [AE] and a constant row having constant s in row (K+1) of ISUM matrix [AE]. In operation 502, the TCS can, additionally or alternatively, generate ISUM matrix [WB] as an M×(K+1) ISUM integrated matrix, having columns 1-K of input matrix [W] in columns 1-K of ISUM matrix [WB] and M×1 addend matrix [B], as an integrated addend, in column K+1 of ISUM matrix [WB].
In implementations, the TCS can include an integrator, such as integrator 228 in
In operation 504 the TCS initializes loop counters R and C, which can be counters corresponding to respective rows and columns of ISUM matrices [WB] and [AE] in computing sum-products of ISUM matrices [WB] and [AE]. Counter R can correspond, for example, to a row index of ISUM matrix [WB] and C can correspond, for example, to a column index of ISUM matrix [AE].
In operation 506, for a particular value of R and C, the TCS (e.g., an ISUM MCU of the TCS) computes an Integrated Sum (yR,C = Σ wbR,DIM aE DIM,C) for a particular row R of ISUM matrix [WB] and column C of ISUM matrix [AE]. In operation 506 the TCS can utilize a counter, DIM, to count products of [wbR,DIM×aE DIM,C], for values of DIM from 1 to (K+1), to compute and sum (K+1) products of elements of row R of ISUM matrix [WB] and column C of ISUM matrix [AE]. Thus, in operation 506 the TCS computes yR,C over all (K+1) elements of row R of ISUM matrix [WB] and column C of ISUM matrix [AE] utilizing only sum-product computations (e.g., MACC computations). In operation 506 the TCS can compute (ΣwbR,DIM aE DIM,C) utilizing an ISUM MCU, such as example ISUM MCU 310 in
In operation 508 the TCS outputs the Integrated Sum yR,C computed in operation 506. In operation 508 the TCS can output yR,C to, for example, an Integrated Sum matrix stored in a memory, such as matrix [Y] in memory 302C of
In operation 510 the TCS determines if loop counter C equals the value of N, corresponding to column dimension N of ISUM matrix [AE] and indicating operation 506 has computed an Integrated Sum yR,C for all columns of ISUM matrix [AE] multiplied by all (K+1) elements of row R of ISUM matrix [WB]. If C does not equal N, in operation 512 the TCS increments C and repeats operations 506-512 until these operations have iterated over all N columns of ISUM matrix [AE].
If, in operation 510, the TCS determines that C has incremented to value N, in operation 514 the TCS determines if R has reached a value of M, corresponding to dimension M of ISUM matrix [WB] and indicating that operation 506 has computed an Integrated Sum, yR,C, for all M rows of ISUM matrix [WB] multiplied by all (K+1) elements of all N columns of ISUM matrix [AE]. If R does not equal M, in operation 516 the TCS increments R and, in operation 518, the TCS resets counter C to 1 (to compute an Integrated Sum for the next row of ISUM matrix [WB] and all N columns of ISUM matrix [AE]). The TCS repeats operations 506-518 until these operations have iterated over all M rows of ISUM matrix [WB] computed with all N columns of ISUM matrix [AE] to compute a complete M×N Integrated Sum matrix [Y].
Alternatively, if in operation 514 the TCS determines that R has reached a value of M, in operation 520 the TCS can, optionally, output a complete Integrated Sum matrix computed over all M rows of ISUM matrix [WB] and all N columns of ISUM matrix [AE]. For example, if the TCS output Integrated Sums yR,C to an Integrated Sum matrix [Y] in a memory, in operation 520 the TCS can output Integrated Sum matrix [Y], and/or sum-products included in Integrated Sum matrix [Y], to one or more alternative memories (e.g., memories other than the memory used, in operation 508, to store Integrated Sums yR,C), and/or to one or more ISUM PUs and/or ISUM MCUs of the TCS for the TCS to perform back propagation computations, such as in a gradient descent computation, utilizing an Integrated Sum (or, alternatively, a partial sum-product of an Integrated Sum) computed in operation 506.
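For illustration, the following sketch expresses the loop structure of method 500 in software, iterating counters R, C, and DIM over an ISUM integrated matrix [WB] and an ISUM row-extended matrix [AE]; the function form and data layout are illustrative assumptions rather than the method as performed by a TCS:

# Illustrative software model of method 500: loop over rows R of [WB] and
# columns C of [AE], computing each Integrated Sum y(R,C) using only MACC
# computations (operation 506) and outputting it (operation 508).
def method_500(WB, AE):
    M, K_plus_1 = len(WB), len(WB[0])
    N = len(AE[0])
    Y = [[0.0] * N for _ in range(M)]
    for R in range(M):                        # row loop (operations 514-518)
        for C in range(N):                    # column loop (operations 510-512)
            acc = 0.0
            for DIM in range(K_plus_1):       # MACC computations (operation 506)
                acc += WB[R][DIM] * AE[DIM][C]
            Y[R][C] = acc                     # output y(R,C) (operation 508)
    return Y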
Method 500 illustrates an example of ISUM Integrated Sum computations using an ISUM row-extended matrix (ISUM matrix [AE] in the example of method 500) having a row of constants, S, such as scalar 1 or other constants. However, as illustrated with the example of optional multiplier selection logic 340 and the example of ISUM MCU 310 in
Similar to the description of method 500 in
As in operation 502 of method 500, in operation 602 of method 600 the TCS can generate an M×(K+1) ISUM matrix [WB] that integrates multiplicand matrix [W], in columns 1-K and rows 1 through M of ISUM matrix [WB], and addend matrix [B], as an integrated addend, in column K+1 of ISUM matrix [WB]. The TCS can then compute an Integrated Sum of matrix [WB] and an ISUM multiplicand matrix [AM] comprising input matrix [A]. In operation 602 the TCS can, optionally, generate matrix [AM] as a (K+1)×N ISUM row-extended matrix, with a (K+1) row of constants.
However, as illustrated in the example of ISUM MCU 310 in
In operation 602 of method 600, the MCU receives, or otherwise accesses, input multiplicand matrices [A] and [W], and input addend matrix [B], to generate ISUM matrix [WB] and (optionally) ISUM matrix [AM]. The TCS can include an integrator, such as integrator 228 in
Similar to operation 504 of method 500, in operation 604 the TCS initializes loop counters R and C, which can correspond, respectively, to a row R of ISUM matrix [WB] and a column C of matrix [AM] in computing an Integrated Sum of ISUM matrix [WB] and matrix [AM].
In operation 606 of method 600, the TCS initializes a counter, DIM, to count sum-product computations within row R of ISUM matrix [WB] and column C of matrix [AM]. Counter DIM can serve to select elements of matrix [AM] and ISUM matrix [WB] to compute sum-product yR,C = [ΣwbR,DIM aM DIM,C] for row R and column C for all (K+1) elements of a row, R, of ISUM matrix [WB]. The TCS (or, an ISUM PU or MCU of the TCS), can include multiplier selection logic, such as multiplier selection logic 340 in
In operation 608, the TCS (e.g., an ISUM PU or MCU of the TCS) determines if DIM has reached a value of K+1, indicating that the TCS has computed a sum-product of all K elements of row R of matrix [WB] and all K elements of column C of matrix [AM]. If not, in operation 610 the TCS computes a current value of yR,C as the product (wbR,DIM×aM DIM,C) of elements DIM of the row R and column C of respective matrices [WB] and [AM] added to an accumulated sum (e.g., a value of an accumulator, such as ACC 330 of
If the TCS determines in operation 608 that DIM has reached a value of K+1, in operation 614 the TCS computes the product (wbR,K+1×s), where “s” is a constant multiplied by column element (K+1) of row R, which in matrix [WB] is element bR of addend matrix [B]. In operation 608 (or, alternatively, operation 614) multiplier selection logic of the TCS can, for example, set an input gate, such as input select 332 in
In operation 616 the TCS resets the value of DIM to 1 and, in operation 618, the TCS outputs the Integrated Sum yR,C computed in operations 606-614. In operation 618 the TCS can output Integrated Sum yR,C to, for example, an Integrated Sum matrix [Y] stored in a memory, such as matrix [Y] in memory 302C of
In operations 620 and 622 the TCS can increment counter C and, in operations 624-628, can increment loop counter R and reset counter C to 1 (to compute sum-products with the next column of matrix [AM]) to repeat operations 608-626 over all M rows of matrix [WB] and all N columns of matrix [AM].
Upon determining, in operation 624, that counter R has reached a value of M, similar to operation 520 of method 500, the TCS can determine that the TCS has computed all Integrated Sums to generate an M×N Integrated Sum matrix [Y] and, in operation 630, the TCS can output Integrated Sum matrix [Y]. In operation 630 the TCS can output Integrated Sum matrix [Y] to, for example, one or more memories (e.g., memories other than a memory used, in operation 618, to store a sum-product computed in operations 608-614), and/or to ISUM PUs and/or MCUs of the TCS, to perform back propagation of Integrated Sum matrix [Y] elements, such as in a gradient descent computation utilizing sum-products included in Integrated Sum matrix [Y].
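For illustration, the following sketch expresses method 600 in software: the input multiplicand matrix [A] serves directly as the K×N ISUM multiplicand matrix [AM], and the constant s is applied on the (K+1) cycle in place of a stored row of constants; the function form is an illustrative assumption:

# Illustrative software model of method 600: the multiplicand matrix keeps its
# original K x N shape and the constant s is substituted on the (K+1) cycle
# (operation 614), scaling the integrated addend column of [WB].
def method_600(WB, A, s=1.0):
    M, K = len(WB), len(A)                   # WB is M x (K+1); A is K x N
    N = len(A[0])
    Y = [[0.0] * N for _ in range(M)]
    for R in range(M):
        for C in range(N):
            acc = 0.0
            for DIM in range(K):             # operations 608-612
                acc += WB[R][DIM] * A[DIM][C]
            acc += WB[R][K] * s              # operation 614: (wb(R,K+1) x s)
            Y[R][C] = acc                    # operation 618: output y(R,C)
    return Y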
For example, in performing a method such as method 500 of
In another example, in performing a method such as method 600 of
Method 700 can be performed by a TCS (hereinafter, with reference to
In operation 702 of method 700 the TCS (e.g., an integrator of the TCS) generates subsets of a row R of ISUM matrix [WB] and column C of ISUM matrix [AM] to compute an Integrated Sum of elements of the row R and column C. The TCS can generate subsets of the K+1 elements of row R of matrix [WB] and K+1 elements of column C of matrix [AM] (or, alternatively, subsets of K elements of column C, if the TCS utilizes multiplier selection logic to input a constant as a K+1 multiplicand of wbR,K+1). The TCS can generate a subset 1 to include, for example, elements 1 to n of each of row R of matrix [WB] and column C of matrix [AM], and a subset 2 to include elements (n+1) to (K+1) of row R of matrix [WB] and elements (n+1) to K+1 (or, n+1 to K) of column C of matrix [AM]. The TCS can determine the size of the subsets (e.g., the value of “n”) based on factors such as, for example, sizes, performance, and/or design characteristics of computation units (e.g., ISUM PUs/MCUs of the TCS) and/or memories to store elements of ISUM matrices [WB] and/or [AM], and/or to store MACC Sum outputs.
For purposes of illustrating method 700, the TCS can compute the Integrated Sum as an ISUM MACC sum computed by a combination of two MCUs of the TCS, MCU0 and MCU1 (hereinafter, with reference to method 700, collectively “the MCUs”). MCU0 and/or MCU1 can be, for example, an MCU similar or equivalent to ISUM MCU 310 of
In operation 704 MCU0 computes products and/or MACC sum-products over elements of subset 1 and, in operation 706, MCU0 outputs the products/sum-products to MCU1. In operation 708 MCU1 computes products and/or MACC sum-products over elements of subset 2 and, in operation 710, MCU1 adds products/sum-products output by MCU0 to products or, alternatively, to sum-products, computed by MCU1.
In operation 704 MCU0 can compute only products of elements of subset 1 and can output the products to MCU1. Alternatively, in operation 704 MCU0 can compute a complete sum-product, or can compute partial sum-products, of all elements of subset 1 and can, in operation 706, output the sum-product(s) to MCU1. Similarly, in operation 708 MCU1 can compute products of elements of subset 2 and, or, alternatively, can compute a complete sum-product, or can compute partial sum-products, of all elements of subset 2.
In operation 710, MCU1 can add the products/sum-products computed in operation 708 to products/sum-products output, in operation 706, from MCU0. In operation 710 MCU1 can add outputs of MCU0 to products or, alternatively, to sum-products, computed by MCU1 as a MACC sum, adding the products/sum-products output by MCU0 to an accumulator of MCU1, for example.
In operation 712 the MCUs determine if they have computed all of their respective products/sum-products such that, in combination, they have computed an Integrated Sum of all (K+1) computations of subsets 1 and 2 elements. If not, the MCUs repeat operations 704-710 until each of MCU0 and MCU1 has computed products/sum-products over all of the elements in their respective subsets 1 and 2.
If, in operation 712, the MCUs determine that they have computed an Integrated Sum of all (K+1) computations of subsets 1 and 2 elements, in operation 714 MCU1 outputs the complete Integrated Sum of row R of matrix [WB] and column C of matrix [AM]. In operation 714 MCU1 can output the Integrated Sum to a memory (e.g., to a memory containing an Integrated Sum matrix of sum-products of matrices [WB] and [AM]), and/or to other computational elements of the TCS, such as other ISUM PUs/MCUs configured to compute functions utilizing Integrated Sums, or sum-products of Integrated Sums computed by MCU0 and MCU1. For example, in operation 714 MCU1 can output an Integrated Sum, or sum-products of an Integrated Sum, to a forward operator of a neural network, or other computing model, or in a backpropagation computation (e.g., a gradient computation), to a backward operator of a neural network, or other computing model.
In operation 716, the TCS (or, one of the MCUs) determines if the MCUs have computed an Integrated Sum for all N columns of ISUM matrix [AM]. If not, in operation 718 the MCUs increment column counter, C, and the TCS and MCUs repeat operations 702 through 718. In operation 720, the TCS (or, one of the MCUs) determines if the MCUs have computed an Integrated Sum for all M rows of ISUM matrix [WB]. If not, in operation 722 the MCUs increment row counter, R, reset column counter C to 1, and the TCS and MCUs repeat operations 702 through 720 for the next row R of ISUM matrix [WB] multiplied by all N columns of matrix [AM].
If, in operation 720, the TCS (or, one of the MCUs) determines that the MCUs have computed an Integrated Sum for all M rows of ISUM matrix [WB] (and, by implication, for each row of matrix [WB], for all N columns of matrix [AM]), in operation 724 the TCS (or, one or both of MCU0 and MCU1) can, optionally, output an Integrated Sum matrix, [Y], comprising the Integrated Sums of all rows/columns of matrices [WB] and [AM], which corresponds to an Addend Sum matrix of (Σw a+sb), where s is a constant multiplied by elements of addend matrix [B]. In operation 724, the TCS/MCUs can output the Integrated Sum matrix [Y] to a memory and/or to other computational units of the TCS, such as forward and/or backward operator computational units of the TCS.
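As an illustration of method 700, the following sketch computes one Integrated Sum element as two partial MACC sums over assumed subsets of the row and column elements, with the second unit adding the first unit's output to complete the sum; the split point n and the function form are illustrative assumptions, and the surrounding row/column loops are omitted:

# Illustrative software model of method 700: two compute units each form a
# partial sum over a subset of the K+1 element pairs of row R and column C,
# and the second unit adds the first unit's output to complete the Integrated Sum.
def method_700_element(wb_row, am_col, n):
    # MCU0: partial MACC sum over subset 1 (elements 1..n), operations 704-706
    partial_0 = sum(wb * am for wb, am in zip(wb_row[:n], am_col[:n]))
    # MCU1: partial MACC sum over subset 2 (elements n+1..K+1), operations 708-710
    partial_1 = sum(wb * am for wb, am in zip(wb_row[n:], am_col[n:]))
    return partial_1 + partial_0             # complete Integrated Sum (operation 714)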
While the disclosure illustrates method 700, in
As described in reference to TCS 400, in
In implementations, a matrix addend can comprise a constant. For example, in a function such as (Σw a+s), addend s can be a constant added to each sum-product of Σw a. In another example, an Integrated Sum addend can be a product of a scalar and elements of a matrix addend, such as (Σw a+sb), where s is a constant multiplied by elements of a matrix addend [B].
An ISUM integrator can combine M×K multiplicand matrix [W] with a “constant integrated addend” to generate M×(K+1) ISUM integrated matrix WS 802. As used herein, “constant integrated addend” refers to an integrated addend having the same constant in each row element of the matrix. In
Alternatively, as illustrated in the example of method 600 of
Similar to the example of
Implementations are also not necessarily limited to computing Integrated Sums for functions having a single addend matrix. For example, using ISUM integrated and/or ISUM extended matrices, an ISUM TCS can compute (Σw a+s1b1+s2b2+ . . . +spbp) for P number of addend matrices, [B1] to [Bp], in which each of the addend matrices can be multiplied by a constant, respectively s1 through sp.
Correspondingly, an ISUM integrator can generate a (K+P)×N ISUM row-extended matrix, shown in
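The following NumPy sketch (dimensions, names, and constants chosen here for illustration; one of several equivalent integrations) shows the case of P=2 addend matrices, placing the addend columns in columns (K+1) to (K+P) of the integrated matrix and the constants s1 and s2 in the corresponding extended rows:

# Hedged sketch (NumPy) for P addend matrices [B1]..[BP], each scaled by a
# constant s_p, as in (sum(w a) + s1*b1 + ... + sP*bP).
import numpy as np

M, K, N, P = 3, 4, 5, 2
rng = np.random.default_rng(1)
W = rng.standard_normal((M, K))
A = rng.standard_normal((K, N))
B = [rng.standard_normal((M, 1)) for _ in range(P)]   # M x 1 addend matrices
s = [0.5, -1.25]                                      # constants s1..sP

# ISUM integrated matrix: columns (K+1)..(K+P) hold the addend columns.
WB = np.hstack([W] + B)
# ISUM row-extended matrix: rows (K+1)..(K+P) are constant rows of s1..sP.
AE = np.vstack([A] + [np.full((1, N), sp) for sp in s])

expected = W @ A + sum(sp * bp for sp, bp in zip(s, B))
assert np.allclose(WB @ AE, expected)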
As in the examples of
The example of
The examples of
As shown in
Correspondingly, an ISUM integrator can generate (K+P)×N ISUM row-extended matrix AE 834, in
As in the examples of
An ISUM TCS can combine computations of the examples of
As illustrated in
Correspondingly, an ISUM integrator can generate (K+P)×N ISUM row-extended matrix AE 844, in
As described with reference to
As previously described an ISUM TCS can comprise a plurality, possibly many thousands, of ISUM PUs/MCUs and the plurality of ISUM PUs/MCUs can compute Integrated sum-products and/or Integrated Sums of a single row, or of a set of particular rows, of an ISUM integrated matrix, such as in the examples of
Components of an ISUM TCS, such as ISUM matrix integrators, ISUM PUs, and ISUM MCUs can perform any or all of the methods of the disclosure, and/or any or all of the operations of the methods of the disclosure, in any particular combination and/or order of the methods or operations thereof. ISUM components of a TCS, such as ISUM matrix integrators, ISUM PUs, and ISUM MCUs can be combined and/or subdivided in any particular arrangement suitable to perform ISUM matrix integration and computations, such as sum-product and/or transposition computations used to illustrate the disclosure (but, not limited to only these example computations and matrix operations).
As illustrated in the examples of the disclosure, an ISUM TCS, ISUM PU, and/or ISUM MCU can compute Integrated Sums of an ISUM integrated matrix, comprising a multiplicand and one or more addend matrices, and an ISUM row- or column-extended matrix, and/or can compute Integrated Sums of an ISUM integrated matrix and an ISUM multiplicand matrix, using only MACC computations. The resulting MACC sum Integrated Sums are equivalent to a computation of a sum-product of two multiplicand matrices added, as a subsequent matrix computation, to an addend matrix and/or a product of a scalar and a matrix addend. The ISUM integrated matrices can comprise a plurality of addend matrices, and addend matrices integrated into an ISUM integrated matrix can comprise column dimensions of an arbitrary size greater than 1.
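Stated as a formula (with element indices chosen here only for illustration), the equivalence for a single integrated addend column is:

$\sum_{k=1}^{K} w_{m,k}\,a_{k,n} + b_{m} \;=\; \sum_{k=1}^{K+1} wb_{m,k}\,ae_{k,n}, \quad \text{where } wb_{m,K+1}=b_{m} \text{ and } ae_{K+1,n}=1,$

so a single MACC pass over the (K+1) terms on the right produces the same value as the separate sum-product and matrix addition on the left.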
Computing applications, such as machine learning and applications utilizing neural networks, can utilize a “backpropagation” algorithm to tune results of tensor computations (e.g., to achieve closer agreement of machine learning and/or data analysis with predicted, or known, results). In a backpropagation algorithm, computational results output from a “forward” computational element can be used to adjust parameters of tensor computations, such as weights and/or bias values in a weights-bias function. A tensor computation system, and/or tensor computing application, can use a “loss function” to optimize tensor computations to achieve closer agreement with predicted, or known, results of an application, such as machine learning or data analysis applications.
For example, in a weights-bias function, a forward ISUM TPU/MCU can compute sum-products of input multiplicand and addend matrices, such as (Σwb aE). The forward TPU/MCU can output a resulting Integrated Sum matrix (or, can output integrated sum-products to an Integrated Sum matrix), and the Integrated Sum matrix can be input to a TPU/MCU to compute a loss function over the Integrated Sum matrix. The loss function TPU/MCU can use a loss function to compute adjusted weight and bias values of weights-bias computations. For example, a loss function TPU/MCU can utilize a gradient descent algorithm to compute gradients of elements of a weights and/or bias matrix. The loss function TPU/MCU can output weight and/or bias gradient values to matrices of adjusted weights and biases, such as to a weights matrix [W] and/or a bias matrix [B]. In a backpropagation algorithm, the loss function TPU/MCU can feed the adjusted weights-bias matrices "backward" to an ISUM TPU/MCU to repeat weights-bias computations using the adjusted (gradient) weights and/or bias values.
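As a hedged, informal sketch of such a training loop (the mean-squared loss, learning rate, iteration count, and names are chosen here for illustration and are not part of the disclosed apparatus), the forward, loss, and backward stages can be modeled in NumPy as:

# Hedged sketch of a forward / loss / backward flow using ISUM-style matrices.
import numpy as np

rng = np.random.default_rng(2)
M, K, N = 3, 4, 5
W = rng.standard_normal((M, K))      # weights matrix [W]
b = rng.standard_normal((M, 1))      # bias matrix [B] (one column, broadcast over N)
A = rng.standard_normal((K, N))      # input multiplicand matrix [A]
T = rng.standard_normal((M, N))      # target values for the loss (assumed)
lr = 0.01                            # learning rate (assumed)

for _ in range(10):
    WB = np.hstack([W, b])                       # ISUM integrated matrix
    AE = np.vstack([A, np.ones((1, N))])         # ISUM row-extended matrix
    FO = WB @ AE                                 # forward Integrated Sum matrix
    LF_in = 2.0 * (FO - T) / (M * N)             # gradient of a mean-squared loss w.r.t. FO
    dW = LF_in @ A.T                             # weight gradients
    db = LF_in @ np.ones((N, 1))                 # bias gradients (sum over columns)
    W -= lr * dW                                 # "backward" adjusted weights
    b -= lr * db                                 # "backward" adjusted biases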
An ISUM TCS can generate ISUM integrated and, optionally, ISUM extended matrices and can compute a loss function Integrated Sum matrix as an integrated summation computation, such as in the foregoing examples of the disclosure.
For purposes of illustrating the example, but not intended to limit implementations, the description of
However, this is not intended to limit implementations; any variety of alternative processors and/or processing elements of a TCS, such as RDUs, MCUs, tiles and/or processors of tiles of an RDU, can compute a forward output matrix, generate an ISUM transpose-extended matrix, and/or compute gradients (or, other sum-products of an application) using an ISUM transpose-extended matrix. It will be further appreciated by one of ordinary skill in the art that a forward PU, XP PU, and/or a BP PU, such as used to illustrate the examples of
The forward PU can compute an Integrated Sum (e.g., MACC sum-products) matrix of matrix WB 900 and matrix AE 902 to compute, for example, a weights-bias function. Matrix FO 904, as shown in
In backpropagation algorithms, one method to compute a weight gradient is to compute a sum-product of a row of a loss function input matrix (e.g., a row of an Integrated Sum matrix) multiplied by a column of a transposed multiplicand matrix. For example, the BP PU can compute a weights gradient, [Δw=ΣlfIN aT], as a sum-product of each of the N column elements of a row of a loss function input matrix, [LFIN], multiplied by a corresponding row element among the N rows of an N×K transposition of a K×N matrix [A], denoted as matrix [AT].
One method of computing a bias gradient in a backpropagation algorithm is to compute a sum-product of a row of a loss function input matrix multiplied by a multiplicand column comprising a scalar constant in each element of the multiplicand column or, alternatively, a column of a multiplicand matrix having a row dimension (e.g., "N" of an N×K multiplicand matrix) shared with the column dimension of a loss function matrix (e.g., "N" in an M×N loss function matrix). For example, the BP PU can compute a bias gradient of an M×N loss function input matrix, [LFIN], as a sum-product of a row of the matrix [LFIN] multiplied by an N×1 multiplicand column, [Δb=ΣlfIN s], where s comprises N number of elements of the multiplicand column.
In a case in which the multiplicand column comprises constant value 1, a bias gradient [Δb=ΣlfIN s] is computed as [Δb=ΣlfIN 1], which computes the sum of all elements of a row of matrix [LFIN]. In an alternative case in which the multiplicand column comprises elements of a column of a multiplicand matrix, bias gradient [Δb=ΣlfIN s] is computed as a sum-product of a row of matrix [LFIN] multiplied by a multiplicand column of a constant s or, alternatively, a multiplicand column of a multiplicand matrix having (in this example) row dimension N.
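Restated compactly (with element indices chosen here only for illustration), the two gradients described above are:

$\Delta w_{m,k} = \sum_{n=1}^{N} lf_{IN}[m,n]\;a^{T}[n,k], \qquad \Delta b_{m} = \sum_{n=1}^{N} lf_{IN}[m,n]\;s_{n},$

where taking $s_{n}=1$ for all n reduces the bias gradient to the sum of the elements of row m of matrix [LFIN].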
In the example of
A conventional computation of a weights gradient and bias gradients (e.g., to compute a gradient-adjusted weights and/or bias matrix) computes the weights and bias gradients as two separate sum-product computations, one to compute ΣlfIN aT and another to compute ΣlfIN s. This can require either dedicating additional compute resources of a TCS (e.g., a set of MCUs to compute the weight gradients and additional MCUs to compute the bias gradients), or can serialize the computations within a set of MCUs configured to compute both gradients.
However, an XP PU can generate an N×(K+P) ISUM "transpose-extended" matrix as a multiplicand matrix of a loss function input matrix to compute weights and/or bias gradients using the foregoing equations. As used herein, the term "transpose-extended matrix" refers to an N×(K+P) ISUM matrix that extends an N×K matrix transposition of a K×N matrix to have P number of N×1 multiplicand columns in each of columns (K+1) to (K+P) of the transpose-extended matrix. The XP PU can transpose a K×N multiplicand matrix to generate, in columns 1 to K of the ISUM transpose-extended matrix, corresponding rows 1 to K of the multiplicand matrix. The XP PU can generate columns (K+1) to (K+P) of the ISUM transpose-extended matrix to comprise columns of scalar constants, and/or columns of one or more multiplicand matrices having row dimension N.
Similar to the manner of computing an Integrated Sum by means of sum-product computations of an ISUM integrated matrix and an ISUM multiplicand matrix, the BP PU can compute weights gradients and bias gradients as sum-products (e.g., MACC sum-products) of a loss function input matrix and an ISUM transpose-extended matrix. As will be seen from further discussion of
By executing a single sequence of integrated sum-product computations, an ISUM PU can avoid computing each of the weights and bias gradients as separate computations. Further, computing each of the weights and bias gradients as separate computations can require computing each gradient using different MCUs. By executing a single sequence of integrated sum-product computations, an ISUM PU can, alternatively, compute the gradients using a single MCU configured to compute the sum-products of the loss function input matrix and an ISUM transpose-extended matrix. Additionally, as will be seen in the examples of
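A hedged NumPy sketch (names and dimensions chosen here) of this single integrated sum-product follows: one matrix product of the loss function input matrix and an N×(K+1) transpose-extended matrix yields the weight gradients in columns 1 to K and the bias gradients in column (K+1):

# Hedged sketch of the single integrated sum-product: one matrix product of the
# loss-function input and a transpose-extended matrix yields both gradients.
import numpy as np

rng = np.random.default_rng(3)
M, K, N = 3, 4, 5
LF_in = rng.standard_normal((M, N))          # M x N loss function input matrix
A = rng.standard_normal((K, N))              # K x N multiplicand matrix

# N x (K+1) ISUM transpose-extended matrix: A transposed, plus a constant column of 1s.
ATE = np.hstack([A.T, np.ones((N, 1))])

G = LF_in @ ATE                              # one MACC sum-product pass
dW = G[:, :K]                                # weight gradients, corresponding to sum(lf_in * a^T)
db = G[:, K]                                 # bias gradients, corresponding to sum(lf_in * 1)

assert np.allclose(dW, LF_in @ A.T)
assert np.allclose(db, LF_in.sum(axis=1))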
To illustrate,
The XP PU can generate matrix ATE 908 as a transposition of matrix A 902 (although, not necessarily as extracted from matrix AE 902 itself) and can append column (K+1) of matrix ATE 908 as a multiplicand column. The XP PU can generate matrix ATE 908 as a transposition of matrix AE 902. In this example, the XP PU can generate columns 1 to (K+1) of matrix ATE 908 as a transposition of matrix AE 902. In this case, column (K+1) of matrix ATE 908 comprises row (K+1) of matrix AE 902 transposed. Alternatively, the XP PU can generate columns 1 to K of matrix ATE 908 as a transposition of matrix A 902. In this case, the XP PU can generate column (K+1) of matrix ATE 908 to comprise a column of scalar constants or, alternatively, an N×1 multiplicand matrix.
The example of
While
The example of
By performing method 920, or a method similar or equivalent to method 920, the XP PU can generate an N×(K+P) transpose-extended matrix, [ATE], having rows 1 to K of a K×N input matrix, [A], in columns 1 to K of matrix [ATE] and having, in columns (K+1) to (K+P) of matrix [ATE], multiplicand columns comprising constants or N×1 matrices. The XP PU can generate columns 1 to K of matrix [ATE] from an N×K matrix, [AT], transposed from K×N matrix [A]. Alternatively, the XP PU can generate columns 1 to K of matrix [ATE] by transposing the matrix [A] or, alternatively, by transposing rows 1 to K of a (K+P)×N ISUM extended matrix [AE]. Accordingly, in describing method 920, matrix [AIN] represents any one of matrix [A], matrix [AT], or matrix [AE] used to generate columns 1 to K of matrix [ATE].
To perform the method, the XP PU can utilize a row counter, R, and a column counter, C, corresponding to row R of the matrix [AIN] to be transposed to column C of matrix [ATE]. In operation 922 of method 920, the XP PU initializes counters R and C to 1, corresponding initially to row 1 of the matrix [AIN] to be transposed to column 1 of matrix [ATE]. In operation 924, the XP PU outputs row R of matrix [AIN] to column C of matrix [ATE]. In operation 926 the XP PU increments R and C to indicate the next successive row of matrix [AIN] and next successive column of matrix [ATE]. In implementations, counter R and/or counter C can comprise a simple integer counter or, alternatively, can comprise, for example, an address of elements of respective matrices [AIN] and [ATE] in a memory of the TCS.
In operation 928 the XP PU determines if counter R is greater than dimension K, indicating that rows 1 to K of matrix [AIN] have been transposed to corresponding columns 1 to K of matrix [ATE]. If not, the XP PU repeats operations 924-926. If, on the other hand, the XP PU determines, in operation 928, that counter R is greater than dimension K, in operation 930 the XP PU determines if columns (K+1) to (K+P) of matrix [ATE] are to be generated as a transposition of an ISUM row-extended matrix; generated by insertion of an M×1 multiplicand matrix [S]; or, generated by the XP PU injecting a column of constants (e.g., constant value 1 or another constant value).
In operation 930 the XP PU can determine to generate columns (K+1) to (K+P) of matrix [ATE] as a transposition of an ISUM row-extended matrix based on, for example, that matrix [AIN] is a (K+P)×N ISUM extended matrix, [AE]. As seen in the foregoing examples of the disclosure, rows (K+1) to (K+P) of matrix [AE] can comprise constant rows such that transposing rows (K+1) to (K+P) of matrix [AE] generates columns (K+1) to (K+P) of matrix [ATE] comprising the constants of respective rows (K+1) to (K+P) of matrix [AE].
The XP PU can, alternatively, determine in operation 930 that columns (K+1) to (K+P) of matrix [ATE] are to be generated by insertion of an M×1 multiplicand matrix [S] or by injecting a column of constants. The XP PU can make this determination based on, for example, that matrix [AIN] comprises matrix [A] or the transposed matrix [AT] of matrix [A].
If, in operation 930, the XP PU determines that columns (K+1) to (K+P) of matrix [ATE] are to be generated as a transposition of matrix [AE], in operation 932 the XP PU outputs row R of the matrix [AE] to column C of matrix [ATE].
If the XP PU determines, in operation 930, that columns (K+1) to (K+P) of matrix [ATE] are to be generated inserting a multiplicand matrix [S], in operation 934 the XP PU outputs matrix [S] to column C of matrix [ATE]. As previously described, matrix [S] can comprise, for example, a matrix of constants, or of differing scalar values.
If the XP PU determines, in operation 930, that columns (K+1) to (K+P) of matrix [ATE] are to be generated injecting a column of constants, in operation 936 the XP PU outputs to column C of matrix [ATE] an N×1 multiplicand column having constant s in each element of the multiplicand column. To output a column of matrix [ATE] as a column of constants, in operation 936, the XP PU can include a constant input element similar, for example, to constant input element 336 of
To inject constant s from a constant input element, in operation 936 the XP PU can output constant s from the constant input element into each row element of column C of matrix [ATE]. For example, the XP PU can perform N number of output cycles that each output an instance of constant s into each of rows 1 to N of column C of matrix [ATE]. In another example, the XP PU can have (or, have access to) a scratchpad column stored in a register, or a memory, and can output the N instances of constant s into row elements of the scratchpad column. Upon completing the N output cycles, in operation 936 the XP PU can output the scratchpad column to column C of matrix [ATE]. In a third example, a constant input element can comprise an N×1 constant matrix having constant s in each row of the constant matrix, and in operation 936 the XP PU can output the constant matrix to column C of matrix [ATE].
In operation 940 the XP PU determines if C is greater than (K+P), indicating that the XP PU has generated all (K+P) columns of matrix [ATE]. If not, the XP PU repeats operations 926 through 940 to generate the remaining columns among columns (K+1) to (K+P) of matrix [ATE]. If, alternatively, the XP PU determines in operation 940 that counter C is greater than (K+P), in operation 942 the XP PU outputs matrix [ATE]. In operation 942 the XP PU can output matrix [ATE] to, for example, a memory, and/or to a BP PU, such that matrix [ATE] can be utilized to compute weights and bias gradients as in the example of
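For illustration only, the following Python sketch (counter handling simplified; function and parameter names chosen here; it assumes matrix [AIN] supplies rows, i.e., is matrix [A] or matrix [AE] rather than the pre-transposed matrix [AT]) mirrors the flow of method 920, transposing rows 1 to K of the input matrix and then generating each extension column from an extended row, an N×1 matrix [S], or an injected constant:

# Hedged Python sketch of the method 920 flow.
import numpy as np

def generate_transpose_extended(A_in, K, P, N, S=None, const=1.0):
    ATE = np.zeros((N, K + P))
    # Operations 922-928: rows 1..K of [AIN] become columns 1..K of [ATE].
    for r in range(K):
        ATE[:, r] = A_in[r, :N]
    # Operations 930-936: generate each extension column (K+1)..(K+P).
    for c in range(K, K + P):
        if A_in.shape[0] >= K + P:           # [AIN] is an ISUM row-extended matrix [AE]
            ATE[:, c] = A_in[c, :N]          # operation 932: transpose row c of [AE]
        elif S is not None:                  # operation 934: insert N x 1 multiplicand matrix [S]
            ATE[:, c] = S[:, 0]
        else:                                # operation 936: inject a column of constant s
            ATE[:, c] = const
    return ATE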
In implementations, an ISUM PU, ISUM MCU, or an ISUM matrix integrator can perform a method such as method 920 to generate an ISUM transpose-extended matrix, and/or to compute gradients of a loss function input matrix using a transpose-extended matrix.
In
In
Matrix WB 1002 can be an ISUM integrated matrix, such as in the examples of
PUs among PUs 1004 can comprise ISUM PUs and/or ISUM MCUs such as illustrated in the foregoing examples of the disclosure. PUs among PUs 1004 can comprise hardware circuits and/or include programs executable on processors of the TCS. PUs 1004 can comprise, for example, RDUs, and/or tiles of RDUs, that can be included (but not shown explicitly in
FWD PU 1004A can compute a forward Integrated Sum matrix (e.g., a matrix of MACC sum-products, Σwb aE) of matrix AE 1002 and matrix WB 1002, shown in
XP PU 1004C can be configured to generate matrix ATE 1002 from matrix A 1002, from matrix AT 1002, or, from matrix AE 1002; and, can be configured to, optionally, generate columns among columns (K+1) to (K+P) of matrix ATE 1002 to include matrix [S]. XP PU 1004C can, for example, input matrix A 1002 to generate matrix AT 1002 and/or matrix AE 1002, and can store one or both matrices in memory 1002E. XP PU 1004C can input matrix A 1002, matrix AT 1002, or matrix AE 1002 to generate matrix ATE 1002 in memory 1002E. XP PU 1004C can input matrix A 1002, matrix AT 1002, or matrix AE 1002 to generate columns 1 to K of matrix ATE 1002. XP PU 1004C can input rows (K+1) to (K+P) of matrix AE 1002 or, optionally, matrix [S] 1002, and/or a constant input element, such as constant input element 1008 in
In
While
Considering again operation 936 of method 920 in
In
XP PU 1104 can execute a (K+P) number of transposition cycles to generate matrix ATE 1102. In transposition cycles 1 to K, XP PU 1104 can input (e.g., read from memory 1102A) elements of matrix AIN 1102 for output to columns 1 to K of matrix ATE 1102. In transposition cycles (K+1) to (K+P), XP PU 1104 can input a value of constant s from constant input element S 1112 (e.g., overriding a read operation from memory 1102A) to output to columns (K+1) to (K+P) of matrix ATE 1102.
In
Count 1118 can comprise, for example, a count of transposition cycles, from 1 to (K+P). In each of the (K+P) transposition cycles XP PU 1104 can input to output vector 1108, via input 1124A, an output of gate 1116. In a transposition cycle, boolean 1114 can operate to selectively output from gate 1116 either data read from matrix AIN 1102, via input 1122A to gate 1116, or, via input 1122B to gate 1116, value s of constant input element 1112. Gate 1116 can output the selected input to output vector 1108. Correspondingly, in each transposition cycle column output logic 1110 can receive from output vector 1108 one or more elements of matrix AIN 1102, or one or more instances of constant value s, to output, via input 1128 to memory 1102B, to a column of matrix ATE 1102.
To illustrate in more detail, boolean 1114 can be hardwired, and/or can be programmable, to evaluate a boolean expression and, in a transposition cycle, based on a result of the evaluation, can select among input 1122A (i.e., a row of matrix AIN 1102) and input 1122B (i.e., constant input element S 1112) for output from gate 1116 to output vector 1108 via input 1124A. For example, boolean 1114 can evaluate a boolean expression such as [C>K] (or, [C<K+1], for example), where C is a value of count 1118 input to boolean 1114 via input 1126. In cycles 1 to K of the (K+P) transposition cycles, RD logic 1106 can read, via input 1122A, from memory 1102A, elements of matrix AIN 1102 (where matrix AIN is un-transposed matrix [A], elements of a row of matrix AIN 1102 or, alternatively, where matrix AIN is transposition matrix [AT], elements of a column of matrix AIN 1102). Boolean 1114 can evaluate [C>K] as FALSE and, in response, can configure gate 1116 to output to output vector 1108, during that transposition cycle, elements of matrix AIN 1102 read on input 1122A. Alternatively, in cycles (K+1) to (K+P) of the (K+P) transposition cycles, boolean 1114 can evaluate [C>K] as TRUE. In response, boolean 1114 can configure gate 1116 to output to output vector 1108, during that transposition cycle, constant s from constant input element S 1112.
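The following behavioral Python model (illustrative only; it is not the hardware of the figure, and it assumes matrix AIN is the un-transposed K×N matrix [A]) captures this per-cycle selection: when the cycle count exceeds K, the constant s is written to the output column instead of a row of matrix AIN:

# Hedged behavioral model of the per-cycle selection in the XP PU.
import numpy as np

def transposition_cycles(A_in, K, P, N, s=1.0):
    ATE = np.zeros((N, K + P))
    for count in range(1, K + P + 1):            # count 1118: cycles 1..(K+P)
        if count > K:                            # boolean 1114 evaluates [C > K]
            output_vector = np.full(N, s)        # gate 1116 selects constant input 1122B
        else:
            output_vector = A_in[count - 1, :]   # gate 1116 selects matrix read 1122A
        ATE[:, count - 1] = output_vector        # column output logic 1110 writes column C
    return ATE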
As described in reference to BP PU 1004B of
Output vector 1108 can comprise a number of storage elements to store elements of matrix AIN 1102 or instances of constant s for output to column output logic 1110. For example, output vector 1108 can comprise a memory location or register to input one element of matrix AIN 1102, or to input one instance of constant s. In a transposition cycle, RD logic 1106 can, accordingly, perform N number of read cycles to read N elements of matrix AIN 1102, or N instances of constant s, and output each element, or instance of constant s, via input 1124B, to column output logic 1110. Column output logic 1110 can generate a column of matrix ATE 1102 from outputs of output vector 1108.
Alternatively, output vector 1108 can comprise multiple memory locations or registers to input some or all elements of a row (where matrix AIN is un-transposed matrix [A]) or column (where matrix AIN is transposition matrix [AT]) of matrix AIN 1102, or multiple instances of constant s. In this case, RD logic 1106 can generate a column of matrix ATE 1102 in a single input cycle, or fewer than N input cycles, to fill output vector 1108.
While XP PU 1104 in
Components of a TCS, such as ISUM matrix integrators, ISUM TPUs, and ISUM MCUs can perform techniques of the disclosure, and/or any or all of the operations of the methods of the disclosure, in any particular combination and/or order. Components of a TCS, such as ISUM matrix integrators, ISUM PUs, and ISUM MCUs can be combined and/or subdivided in any particular arrangement suitable to perform ISUM matrix integration and computations, such as sum-product, transposition, and/or backpropagation computations used to illustrate the disclosure (but, not limited to only these example computations and matrix operations).
Implementations can comprise a computer program product and can include a computer readable storage medium (or media) having computer readable program instructions of the computer program product incorporated therein. It will be understood by one of ordinary skill in the art that computer readable program instructions can implement each or any combination of operations and/or structure of the disclosure, such as illustrated by the drawings and described herein.
The computer readable program instructions can be provided to one or more processors, and/or other elements, of a computing system or apparatus to produce a machine which can execute, via the processor(s), to implement operations and/or actions similar or equivalent to those of the disclosure. The computer readable program instructions can be stored in a computer readable storage medium that can direct one or more processors, and/or other elements, of a computing system or apparatus to function in a particular manner, such that the computer readable storage medium comprises an article of manufacture including instructions to implement operations and/or structures similar or equivalent to those of the disclosure.
The computer readable program instructions of the computer program product can cause one or more processors to perform operations of the disclosure. A sequence of program instructions, and/or an assembly of one or more interrelated programming modules, of the computer program product can direct one or more processors and/or computing elements of a computing system to implement the elements and/or operations of the disclosure including, but not limited to, the structures and operations illustrated and/or described in the present disclosure.
A computer readable storage medium can comprise any tangible (e.g., hardware) device, or combination of tangible devices, that can store instructions of the computer program product and that can be read by a computing element to download the instructions for use by a processor. A computer readable storage medium can comprise, but is not limited to, electronic, magnetic, optical, electromagnetic, and/or semiconductor storage devices, or any combination of these. A computer readable storage medium can comprise a portable storage medium, such as a magnetic disk/diskette, optical disk (CD or DVD); a volatile and/or non-volatile memory; a memory stick, a mechanically encoded device, and any combination of these. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as electrical signals transmitted through a wire, radio waves or other freely propagating electromagnetic waves, or electromagnetic waves propagating through a wave transmission medium (e.g., a wave guide or fiber-optic cable).
The computer readable program instructions can be communicated from the computer readable storage medium to the one or more computing/processing devices, via a programming API of a computing system, and/or a communications interface of a computing system, having access to the computer readable storage medium, and/or a programming API of a computing system, and/or a communications interface of the one or more computing/processing devices. The API(s) and/or communications interface(s) can couple communicatively and/or operatively to a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The API(s) and/or communications interface(s) can receive the computer readable program instructions read from computer readable storage medium and can forward the computer readable program instructions to the one or more computing/processing devices via the API(s), communications interface(s), and/or network.
In implementations, the computer readable program instructions of the computer program product can comprise machine language and/or assembly language instructions, instruction-set-architecture (ISA) instructions, microcode and/or firmware instructions, state-setting data, configuration data for integrated circuitry, source code, and/or object code. The instructions and/or data can be written in any combination of one or more programming languages.
The computer readable program instructions can execute entirely, or in part, on a user's computer, as a stand-alone software package; partly on a user's computer and partly on a remote computer; or, entirely on a remote computer. A remote computer can be connected to a user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN). In implementations, electronic circuitry including, for example, FPGAs, PLAs, and/or CGRAs can execute the computer readable program instructions by utilizing state information of the computer readable program instructions to configure the electronic circuitry to perform operations or elements of the disclosure, such as illustrated by the drawings and described herein.
In implementations, computer readable program instructions can also be loaded onto a computing system, or component(s) thereof, to cause the computing system and/or component(s) thereof to perform a series of operational steps to produce a computer implemented process, such that the instructions which execute on the computing system, or component(s) thereof, implement the operations or elements of the disclosure, such as illustrated by the drawings and described herein.
The flowcharts and block diagrams in the Drawings and Incorporations illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various implementations of the present invention. Individual elements illustrated in the Figures—such as individual operations illustrated in the flowcharts or individual blocks of block diagrams—may represent a module, segment, or portion of executable instructions for implementing the disclosed function(s). In various alternative implementations, particular operations may occur in an order differing from that illustrated in the examples of the drawings. For example, two operations shown in succession in a diagram of the disclosure may, in a particular implementation, be executed substantially concurrently, or may sometimes be executed in a reverse order, depending upon the functionality involved. It will be further noted that particular blocks of the block diagrams, operations of the flowchart illustrations, and/or combinations of blocks in the block diagrams and/or flowcharts illustrations, can be implemented using special purpose hardware and/or systems that, individually or in combination, perform the specified functions, acts, and/or computer instructions.
Terminology used herein, and the examples disclosed, are chosen to illustrate the principles of the implementations, the practical application or technical improvement over alternative technologies, and to enable others of ordinary skill in the art to understand the implementations disclosed herein. The disclosure illustrates various example implementations, and the examples are intended to illustrate principles and aspects of the disclosure, but are not intended to limit implementations, nor intended to be exhaustive of implementations that may be conceived within the scope of the disclosure. It would be appreciated by one of ordinary skill in the art that alternative implementations can comprise modifications and combinations within the spirit of the disclosure and the scope of the claims.
As can be seen in the foregoing examples, features of the disclosure can comprise methods and apparatuses of computing systems. A summary of example implementations of such features includes:
A computer-implemented method comprises: generating, by a computing system, an Integrated Summation (ISUM) integrated matrix comprising number K of multiplicand columns and number P of addend columns, wherein each of columns 1 through the number K of multiplicand columns comprises respective columns 1 through the number K of a first multiplicand matrix having the number K of columns, and wherein each of the number P of addend columns comprises an integrated addend; generating, by the computing system, an ISUM row-extended matrix comprising the number K of multiplicand rows and the number P of extended rows, wherein rows 1 through the number K of the multiplicand rows comprise respective rows 1 through the number K of a second multiplicand matrix having the number K of rows, and wherein each extended row, among the number P of extended rows, comprises a constant row; computing, by the computing system, (K+P) number of products, the (K+P) number of products comprising each column element of columns 1 through (K+P) of a row of the ISUM integrated matrix, multiplied by a corresponding row element, among rows 1 through (K+P), of a column of the ISUM row-extended matrix; and, computing, by the computing system, an Integrated Sum comprising a sum of the (K+P) number of products.
The example of implementation 1, wherein the method of the computing system computing the Integrated Sum comprising the sum of the (K+P) number of products comprises computing, by the computing system, the Integrated Sum as a multiply-accumulate computation of each column element of the columns 1 through (K+P) of the row of the ISUM integrated matrix multiplied by the corresponding row element, among rows 1 through (K+P), of the column of the ISUM row-extended matrix.
The example of implementation 1, wherein the method further comprises outputting, by the computing system, the Integrated Sum to an element of an Integrated Sum Matrix, the element of the Integrated Sum matrix included in a row element of the Integrated Sum matrix corresponding to the row of the ISUM integrated matrix and included in a column element of the Integrated Sum matrix corresponding to the column of the ISUM row-extended matrix.
The example of implementation 1, wherein an integrated addend, among the number P of addend columns included in the ISUM integrated matrix, is selected from a group consisting of: a column of a first addend matrix and a column of a second addend matrix comprising products of a constant multiplied by each element of a column of a third addend matrix.
The example of implementation 1, wherein each column element of an extended row, among the number P of extended rows, is a constant.
The example of implementation 1, wherein the computing system comprises a plurality of matrix computation units (MCUs); and, wherein the method of the computing system computing the Integrated Sum comprises: computing, by a first MCU among the plurality of MCUs, a first sum-product, the first sum-product comprising a sum of a first subset of the (K+P) number of products; computing, by a second MCU among the plurality of MCUs, a second sum-product, the second sum-product comprising a sum of a second subset of the (K+P) number of products; and adding, by a third MCU among the plurality of MCUs, the first sum-product and the second sum-product.
The example of implementation 6, wherein the method of the first MCU computing the first sum-product and the second MCU computing the second sum-product comprises the first MCU computing the first sum-product and the second MCU computing the second sum-product in parallel.
The example of implementation 6, wherein the computing system comprises an accumulator; and, wherein the method of the third MCU adding the first sum-product and the second sum-product comprises the third MCU adding a product among the first subset of the (K+P) number of products, and adding a product among the second subset of the (K+P) number of products, to the accumulator.
A computer program comprises a computer readable storage medium having first program instructions embodied therewith, wherein the first program instructions are executable by at least one processor to cause the at least one processor to: generate an Integrated Summation (ISUM) integrated matrix comprising number K of multiplicand columns and number P of addend columns, wherein each of columns 1 through the number K of multiplicand columns comprises respective columns 1 through the number K of a first multiplicand matrix having the number K of columns, and wherein each of the number P of addend columns comprises an integrated addend; generate an ISUM row-extended matrix comprising the number K of multiplicand rows and the number P of extended rows, wherein rows 1 through the number K of the multiplicand rows comprise respective rows 1 through the number K of a second multiplicand matrix having the number K of rows, and wherein each extended row, among the number P of extended rows, comprises a constant row; compute a (K+P) number of products, the (K+P) number of products comprising each column element of columns 1 through (K+P) of a row of the ISUM integrated matrix, multiplied by a corresponding row element, among rows 1 through (K+P), of a column of the ISUM row-extended matrix; and, compute an Integrated Sum comprising a sum of the (K+P) number of products.
The example of implementation 9, wherein the first program instructions are executable by at least one processor to further cause the at least one processor to output the Integrated Sum to an element of an Integrated Sum Matrix, the element of the Integrated Sum matrix included in a row element of the Integrated Sum matrix corresponding to the row of the ISUM integrated matrix and included in a column element of the Integrated Sum matrix corresponding to the column of the ISUM row-extended matrix.
The example of implementation 9, wherein the first program instructions are executable by at least one processor to further cause the at least one processor to compute the Integrated Sum as a multiply-accumulate computation.
The example of implementation 9, wherein the first program instructions are executable by at least one processor to further cause the at least one processor to compute, in parallel, the Integrated Sum as a sum of a first sum-product and a second sum-product, the first sum-product comprising a sum of a first subset of the (K+P) number of products, the second sum-product comprising a sum of a second subset of the (K+P) number of products.
A computing system comprises: an Integrated Summation (ISUM) matrix integrator and an ISUM processing unit (ISUM PU), wherein the ISUM matrix integrator is configured to:
The example of implementation 13, wherein the ISUM PU configured to compute the Integrated Sum comprises the ISUM PU further configured to compute the Integrated Sum as a multiply-accumulate computation of each column element of the columns 1 through (K+P) of the row of the ISUM integrated matrix multiplied by the corresponding row element, among rows 1 through (K+P), of the column of the ISUM row-extended matrix.
The example of implementation 13, wherein the first multiplicand matrix comprises a matrix of weight values; and, wherein an addend column of the ISUM integrated matrix comprises a column of a matrix of bias values.
The example of implementation 13, wherein the ISUM PU comprises a first matrix computation unit (MCU) and a second MCU; and, wherein the ISUM PU configured to compute the Integrated Sum comprises: the first MCU configured to compute, in a first multiply-accumulate (MACC) computation, a first set of MACC sum-products; the second ISUM MCU configured to compute, in a second MACC computation, a second set of MACC sum-products, the first set of MACC sum-products comprising a sum of a first subset of the (K+P) number of products and the second set of MACC sum-products comprising a sum of a second subset of the (K+P) number of products; and, one of the first MCU and the second MCU further configured to compute the Integrated Sum comprising a sum of the first set of MACC sum-products and the second set of MACC sum-products.
The example of implementation wherein the computing system further comprises an accumulator; wherein the ISUM PU comprises a first MCU and a second MCU; wherein the ISUM PU is further configured to: input, to the first MCU, a first column element, among the each column element of the columns 1 through (K+P) of the row of the ISUM integrated matrix and input, to the first MCU, a first row element, among the corresponding row element of rows 1 through (K+P) of the column of the ISUM multiplicand matrix; and, input, to the second MCU, a second column element, among the each column element of the columns 1 through (K+P) of the row of the ISUM integrated matrix and, input, to the second MCU, a second row element, among the corresponding row element of rows 1 through (K+P) of the column of the ISUM multiplicand matrix.
The first MCU is configured to compute a first product, among the (K+P) number of products, comprising the first row element multiplied by the first column element; the second MCU is configured to compute a second product, among the (K+P) number of products, comprising the second row element multiplied by the second column element; at least one of the first MCU and the second MCU is further configured to add the first product and the second product to the accumulator; and, the ISUM PU configured to compute the Integrated Sum comprises the ISUM PU further configured to compute the Integrated Sum including the accumulator.
The example of implementation 17, wherein the first MCU comprises a first tensor buffer, comprising a set of row element buffers, and a second tensor buffer comprising a set of column element buffers; wherein the ISUM PU configured to input the first column element to the first MCU comprises the ISUM PU configured to input the first column element into a column buffer among the set of column element buffers; wherein the first MCU configured to compute the first product comprises the first MCU further configured to input the first column element from the column buffer; wherein the ISUM PU configured to input the first row element to the first MCU comprises the ISUM PU configured to input the first row element into a row buffer among the set of row element buffers; and, wherein the first MCU configured to compute the first product comprises the first MCU further configured to input the first row element from the row buffer.
The example of implementation wherein the ISUM matrix integrator is a component of the ISUM PU.
The example of implementation 13, wherein the ISUM PU comprises a processor; and, wherein the ISUM PU configured to compute the (K+P) number of products comprises the processor configured to compute at least a subset of the (K+P) number of products.
A computer-implemented method comprises generating, by a computing system, an Integrated Summation (ISUM) integrated matrix comprising a number K of multiplicand columns and a number P of addend columns, wherein each of the number K of multiplicand columns comprises a corresponding column of a first multiplicand matrix, and wherein each of the number P of addend columns of the ISUM integrated matrix comprises an integrated addend; computing, by the computing system, a set of products comprising products of each column element, among the number K of multiplicand columns, of a row of the ISUM integrated matrix multiplied by a corresponding row element of a column of a second multiplicand matrix; computing, by the computing system, an addend product comprising an addend element multiplied by a constant, the addend element comprising an element of the row of the ISUM integrated matrix included in an addend column among the number P of addend columns of the ISUM integrated matrix; and, computing, by the computing system, an Integrated Sum comprising a sum of the products included in the set of products and the addend product.
The example of implementation 21, wherein the method further comprises outputting, by the computing system, the Integrated Sum to an element of an Integrated Sum Matrix, the element of the Integrated Sum matrix included in a row element of the Integrated Sum matrix corresponding to the row of the ISUM integrated matrix and included in a column element of the Integrated Sum matrix corresponding to the column of the second multiplicand matrix.
The example of implementation 21, wherein the integrated addend comprises one of a constant integrated addend and a column of an addend matrix.
The example of implementation 21, wherein the first multiplicand matrix comprises a matrix of weight values; and, wherein an addend column of the ISUM integrated matrix comprises a column of a matrix of bias values.
The example of implementation 21, wherein the computing system comprises at least one matrix computation unit (MCU); and, wherein the method of the computing system computing the Integrated Sum comprises:
The example of implementation 25, wherein the method of the first MCU computing the first sum-product comprises the first MCU computing the first sum-product as a multiply-accumulate computation.
The example of implementation 21, wherein the constant comprises a value of a constant input element of the computing system.
The example of implementation 27, wherein the computing system comprises multiplier selection logic and the constant input element comprises an input to the multiplier selection logic; and, wherein the multiplier selection logic outputs the value of the constant input element to compute the addend element multiplied by the constant.
A computing system comprises an Integrated Summation (ISUM) matrix integrator, at least one memory, and at least one matrix computation unit (MCU),
The example of implementation 29, wherein the computing system further comprises a constant input element, the constant input element comprising a value of the constant; and, wherein the computing system configured to compute the addend product comprising the addend element multiplied by the constant comprises the computing system further configured to multiply the addend element by the value of the constant included in the constant input element to compute the addend product.
The example of implementation 29, wherein the ISUM matrix integrator comprises a processor and a program; and, wherein the ISUM matrix integrator configured to generate the ISUM integrated matrix comprises the processor executing the program to generate at least a portion of the ISUM integrated matrix.
The example of implementation 29, wherein the at least one MCU configured to compute the Integrated Sum comprises a first MCU, among the at least one MCU, configured to compute a first subset of the set of products and a second MCU, among the at least one MCU, configured to compute a second subset of the set of products; and, wherein a third MCU, among the at least one MCU is configured to compute a sum of first products, included among the first subset of the set of products, and second products included among products among the second subset of the set of products.
The example of implementation 29, wherein the at least one MCU configured to compute the Integrated Sum comprises the at least one MCU further configured to: compute, in a first multiply-accumulate (MACC) computation, a first MACC sum-product comprising a sum of a first subset of the set of products; compute, in a second MACC computation, a second MACC sum-product comprising a sum of a second subset of the set of products; and, compute, in a third MACC computation, a third MACC sum-product comprising a sum of the addend product and at least one of the first MACC sum-product and the second MACC sum-product.
The example of implementation 29, wherein the integrated addend comprises one of a constant integrated addend and a column of an addend matrix.
The example of implementation 29, wherein the first multiplicand matrix comprises a matrix of weight values; and, wherein an addend column of the ISUM integrated matrix comprises a column of a matrix of bias values.
A matrix computation unit (MCU) comprises a multiply-accumulate (MACC) Arithmetic Logic Unit (ALU), multiplier selection logic, and a constant input element, wherein the MACC ALU comprises a first multiplier input and a second multiplier input; wherein the multiplier selection logic comprises a multiplicand input and a constant input; wherein the constant input element comprising a value of a constant;
The example of implementation 36, wherein the multiplier selection logic comprises a counter, the counter configured to count computations of products by the MACC ALU; and, wherein the multiplier selection logic configured to determine that the column element is input from the addend column of the ISUM integrated matrix comprises the multiplier selection logic further configured to determine that the column element is input from the addend column of the ISUM integrated matrix based on the counter reaching a value greater than the number K.
The example of implementation 37, wherein the counter is further configured to output, to the multiplier selection logic, a status indicating to the multiplier selection logic to output the constant input of the multiplier selection logic to the second multiplier input of the MACC ALU from the constant input element; and, wherein the multiplier selection logic is further configured to output the constant input of the multiplier selection logic to the second multiplier input of the MACC ALU responsive to the status.
The example of implementation 36, wherein the first multiplicand matrix comprises a matrix of weight values; and, wherein an addend column of the ISUM integrated matrix comprises a column of a matrix of bias values.
The example of implementation 36 wherein the integrated addend comprises one of a constant integrated addend and a column of an addend matrix.
Implementations can comprise, additionally or alternatively, methods and apparatuses of computing systems disclosed herein to process matrices in backpropagation. A summary of examples of such implementations includes:
A computer-implemented method comprises executing, by a computing system, (K+P) number of transposition cycles to generate an Integrated Summation (ISUM) transpose-extended matrix having N number of rows and (K+P) number of columns; generating, by the computing system, in cycles 1 to K of the (K+P) number of transposition cycles, columns 1 to K of ISUM transpose-extended matrix to comprise a matrix transposition of corresponding rows 1 to K of a first multiplicand matrix; generating, by the computing system, in cycles (K+1) to (K+P) of the (K+P) number of transposition cycles, each of columns (K+1) to (K+P) of the ISUM transpose-extended matrix to comprise a multiplicand column having N number of rows; computing, by the computing system, a first sum-product comprising a sum of products of elements of a row of a second multiplicand matrix, having M rows and N columns, multiplied by corresponding elements of a first column of the ISUM transpose-extended matrix, the first column among columns 1 to K, of the ISUM transpose-extended matrix; and, computing, by the computing system, a second sum-product comprising a sum of products of the elements of the row of the second multiplicand matrix multiplied by corresponding elements of a second column of the ISUM transpose-extended matrix, the second column among columns (K+1) to (K+P), of the ISUM transpose-extended matrix.
The example of implementation 41, wherein the first multiplicand matrix comprises an ISUM row-extended matrix having (K+P) number of rows and N number of columns; and, wherein the method of the computing system generating each of columns (K+1) to (K+P) of the ISUM transpose-extended matrix to comprise the multiplicand column comprises transposing, by the computing system, in the cycles (K+1) to (K+P) of the (K+P) number of transposition cycles, rows (K+1) to (K+P) of the ISUM row-extended matrix to comprise corresponding columns of columns (K+1) to (K+P) of the ISUM transpose-extended matrix.
The example of implementation 41, wherein the first multiplicand matrix has K number of columns; and, wherein the method of the computing system generating each of columns (K+1) to (K+P) of the ISUM transpose-extended matrix to comprise the multiplicand column comprises the computing system including in a third column, among columns (K+1) to (K+P) of the ISUM transpose-extended matrix, a column of a third multiplicand matrix having N rows and one column.
The example of implementation 41, wherein the method of the computing system generating each of columns (K+1) to (K+P) of the ISUM transpose-extended matrix to comprise the multiplicand column comprises: generating, by the computing system, a constant column consisting of N number of constant elements each comprising a value of a constant; and, including, by the computing system, in a third column among columns (K+1) to (K+P) of the ISUM transpose-extended matrix, the constant column.
The example of implementation 44, wherein the computing system includes a constant input element having the value of the constant; and, wherein the method of the computing system generating the constant column comprises the computing system generating the value of the constant from the constant input element.
The example of implementation 45, wherein the constant input element is included in multiplier selection logic of the computing system; and, wherein the method of the computing system generating the value of the constant from the constant input element comprises the computing system, in the cycles (K+1) to (K+P) of the (K+P) number of transposition cycles, configuring the multiplier selection logic to output the value of the constant from the constant input element.
The example of implementation 41, wherein the second sum-product consists of a sum of products of elements of columns 1 to N of the row of the second multiplicand matrix multiplied by the corresponding elements of the second column among the columns (K+1) to (K+P) of the ISUM transpose-extended matrix.
The example of implementation 41, wherein the second multiplicand matrix comprises a loss function input matrix having M rows and N columns; wherein the first sum-product comprises a gradient of elements of a row of the loss function input matrix multiplied by a third column of the ISUM transpose-extended matrix, the third column among columns 1 to K of the ISUM transpose-extended matrix; and, wherein the second sum-product comprises a gradient of elements of the row of the loss function input matrix multiplied by a fourth column of the ISUM transpose-extended matrix, the fourth column among columns (K+1) to (K+P) of the ISUM transpose-extended matrix.
A computing system comprises at least one memory, the at least one memory comprising a first multiplicand matrix having at least K number of rows and N number of columns and a second multiplicand matrix having M rows and N columns; a transposition processing unit (XP PU) configured to execute a (K+P) number of transposition cycles to: generate, in cycles 1 to K of the (K+P) number of transposition cycles, columns 1 to K of an Integrated Summation (ISUM) transpose-extended matrix to comprise a matrix transposition of corresponding rows 1 to K of the first multiplicand matrix, the ISUM transpose-extended matrix having N number of rows and (K+P) number of columns; and, generate, in cycles (K+1) to (K+P) of the (K+P) number of transposition cycles, each of columns (K+1) to (K+P) of the ISUM transpose-extended matrix to comprise a multiplicand column having N number of rows.
The computing system further comprises a backpropagation processing unit (BP PU) configured to: compute a first sum-product comprising a sum of products of elements of a row of a second multiplicand matrix, having M rows and N columns, multiplied by corresponding elements of a first column of the ISUM transpose-extended matrix, the first column among columns 1 to K, of the ISUM transpose-extended matrix; and, compute a second sum-product comprising a sum of products of the elements of the row of the second multiplicand matrix multiplied by corresponding elements of a second column of the ISUM transpose-extended matrix, the second column among columns (K+1) to (K+P), of the ISUM transpose-extended matrix.
The example of implementation 49, wherein the first multiplicand matrix comprises an ISUM row-extended matrix having (K+P) number of rows and N number of columns; and, wherein the XP PU configured to generate the ISUM transpose-extended matrix to comprise the multiplicand column in each of columns (K+1) to (K+P) of the ISUM transpose-extended matrix comprises the XP PU further configured to transpose, in the cycles (K+1) to (K+P) of the (K+P) number of transposition cycles, rows (K+1) to (K+P) of the ISUM row-extended matrix to comprise corresponding columns among columns (K+1) to (K+P) of the ISUM transpose-extended matrix.
The example of implementation 49, wherein the first multiplicand matrix comprises an ISUM row-extended matrix having (K+P) number of rows and N number of columns; and, wherein the XP PU configured to generate the ISUM transpose-extended matrix to comprise the multiplicand column in each of columns (K+1) to (K+P) of the ISUM transpose-extended matrix comprises the XP PU further configured to include, in a third column, among columns (K+1) to (K+P) of the ISUM transpose-extended matrix, a column of a third multiplicand matrix having N rows and one column.
The example of implementation 51, wherein the first multiplicand matrix having at least K number of columns comprises the first multiplicand matrix having K number of columns; and, wherein the XP PU configured to generate each of columns (K+1) to (K+P) of the ISUM transpose-extended matrix to comprise the multiplicand column comprises the XP PU further configured to include, in a third column, among columns (K+1) to (K+P) of the ISUM transpose-extended matrix, a column of a third multiplicand matrix having N rows and one column.
The example of implementation 49, wherein the XP PU configured to generate each of columns (K+1) to (K+P) of the ISUM transpose-extended matrix to comprise the multiplicand column comprises the XP PU further configured to: generate a constant column consisting of N number of constant elements each comprising a value of a constant; and, include, in a third column among columns (K+1) to (K+P) of the ISUM transpose-extended matrix, the constant column.
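For illustration only, a brief NumPy sketch of the two ways of populating columns (K+1) to (K+P) described in the examples above: transposing additional rows of an ISUM row-extended matrix, or inserting a constant column of N identical elements. The names are assumptions of this sketch, not elements of the disclosure:

```python
import numpy as np

def extra_columns_from_row_extended(W_ext, K):
    """Columns (K+1)..(K+P) taken as transposes of rows (K+1)..(K+P) of an
    ISUM row-extended matrix W_ext of shape (K + P, N)."""
    return [W_ext[r, :] for r in range(K, W_ext.shape[0])]

def constant_column(N, constant=1.0):
    """A constant column of N elements, each holding the value of the constant."""
    return np.full(N, constant)
```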
The example of implementation 53, wherein the computing system includes a constant input element having the value of the constant; and, wherein the XP PU configured to generate the constant column comprises the XP PU further configured to generate the value of the constant from the constant input element.
The example of implementation 54, wherein the computing system further comprises multiplier selection logic configurable to output the value of the constant from the constant input element; and, wherein the XP PU configured to generate the value of the constant from the constant input element comprises the XP PU further configured to configure the multiplier selection logic, in the cycles (K+1) to (K+P) of the (K+P) number of transposition cycles, to output the value of the constant from the constant input element to generate the value of the constant from the constant input element.
The example of implementation 49, wherein the second column comprises a constant column having constant value one; and, wherein the BP PU configured to compute the sum of products of the elements of the row of the second multiplicand matrix multiplied by the corresponding elements of the second column comprises the BP PU further configured to compute a sum of elements of columns 1 to N of the row of the second multiplicand matrix by multiplying the elements of the row of the second multiplicand matrix by the constant value one in the corresponding elements of the second column.
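For illustration only, the arithmetic identity relied on in this example, writing $d[m,n]$ for the elements of row $m$ of the second multiplicand matrix (a label assumed for this sketch): multiplying the row elementwise by the constant column of ones and summing the products yields the row sum directly:

$$
\sum_{n=1}^{N} d[m,n] \cdot 1 \;=\; \sum_{n=1}^{N} d[m,n].
$$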
A transposition processing unit (XP PU) comprises an output vector and column output logic, wherein the XP PU is configured to: execute a (K+P) number of transposition cycles to generate (K+P) number of columns of an Integrated Summation (ISUM) transpose-extended matrix; input to the output vector, in transposition cycles 1 to K of the (K+P) number of transposition cycles, a column element included in a row among respective rows 1 to K of an input matrix having K number of rows; input into the output vector, in transposition cycles (K+1) to (K+P) of the (K+P) number of transposition cycles, a value of a constant; and, output to a column of the ISUM transpose-extended matrix, the output vector, the column of the ISUM transpose-extended matrix corresponding to a first cycle number corresponding to a first transposition cycle among the (K+P) number of transposition cycles.
The example of implementation 57, wherein the column element is selected from a column of the row of the input matrix corresponding to a second cycle number corresponding to a second transposition cycle among the (K+P) number of transposition cycles, the second transposition cycle among the transposition cycles 1 to K; and, wherein the column of the ISUM transpose-extended matrix comprises a column of the ISUM transpose-extended matrix corresponding to the second cycle number.
The example of implementation 57, wherein the XP PU further comprises a counter, an input gate, a constant input element comprising the value of the constant, and Boolean expression logic; wherein the XP PU is further configured to set a value of the counter to correspond to a transposition cycle among the (K+P) number of transposition cycles; wherein the input gate is configured to receive, on a matrix input of the input gate, the column element, and to receive, on a constant input of the input gate, an output of the constant input element; wherein the output vector is configured to receive an output of the input gate; and, wherein the Boolean expression logic is configured to receive the value of the counter and, based on the value of the counter, select one of the matrix input and the constant input for output from the input gate to the output vector.
The example of implementation 59, wherein the XP PU configured to input, in transposition cycles 1 to K of the (K+P) number of transposition cycles, the column element into the output vector comprises the Boolean expression logic selecting, based on the counter corresponding to a second transposition cycle, the matrix input of the input gate for output from the input gate to the output vector, the second transposition cycle among the transposition cycles 1 to K of the (K+P) number of transposition cycles; and, wherein the XP PU configured to input into the output vector, in transposition cycles (K+1) to (K+P) of the (K+P) number of transposition cycles, the value of the constant comprises the Boolean expression logic selecting, based on the counter corresponding to a third transposition cycle, the constant input of the input gate for output from the input gate to the output vector, the third transposition cycle among the transposition cycles (K+1) to (K+P) of the (K+P) number of transposition cycles.
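For illustration only, a behavioral sketch in NumPy of the counter, input gate, and Boolean expression logic described in the preceding examples. The class name, attribute names, and loop structure are assumptions of this sketch and not elements of the disclosure:

```python
# Behavioral sketch only; not an implementation of the disclosed XP PU hardware.
import numpy as np

class XPUnitSketch:
    """Behavioral model of the XP PU's per-cycle column generation."""

    def __init__(self, input_matrix, P, constant=1.0):
        self.A = np.asarray(input_matrix)        # input matrix, shape (K, N)
        self.K, self.N = self.A.shape
        self.P = P
        self.constant = constant                 # constant input element
        # ISUM transpose-extended matrix, shape (N, K + P)
        self.output = np.empty((self.N, self.K + P))

    def run(self):
        # One iteration per transposition cycle; the counter tracks the cycle number.
        for counter in range(1, self.K + self.P + 1):
            out_vec = np.empty(self.N)           # output vector
            for n in range(self.N):
                # Boolean expression logic: select the matrix input of the gate
                # for cycles 1..K, the constant input for cycles (K+1)..(K+P).
                if counter <= self.K:
                    out_vec[n] = self.A[counter - 1, n]
                else:
                    out_vec[n] = self.constant
            # Column output logic: the output vector becomes the column of the
            # ISUM transpose-extended matrix corresponding to the cycle number.
            self.output[:, counter - 1] = out_vec
        return self.output

# Example usage: a 3 x 4 input matrix extended with P = 1 constant column of ones.
xt = XPUnitSketch(np.arange(12.0).reshape(3, 4), P=1, constant=1.0).run()
```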
The following are incorporated by reference for all purposes as if fully set forth herein: Prabhakar et al., "Plasticine: A Reconfigurable Architecture for Parallel Patterns," ISCA '17, Jun. 24-28, 2017, Toronto, ON, Canada; U.S. patent application Ser. No. 16/239,252, filed Jan. 3, 2019, entitled "VIRTUALIZATION OF A RECONFIGURABLE DATA PROCESSOR" (Attorney Docket No. SBNV1000USN01); and, U.S. patent application Ser. No. 16/922,975, filed Jul. 7, 2020, entitled "RUNTIME VIRTUALIZATION OF RECONFIGURABLE DATA FLOW RESOURCES" (Attorney Docket No. SBNV1026USN01).

This application is a continuation of U.S. Non-Provisional patent application Ser. No. 18/102,658, filed Jan. 27, 2023, entitled "MATRIX SUMMATION USING INTEGRATED MATRICES," which is incorporated by reference herein in its entirety. This application is a continuation of and claims benefit of priority to U.S. Provisional Patent Application No. 63/308,916, filed Feb. 10, 2022, titled "INTEGRATED TENSOR COMPUTATIONS IN A COMPUTING SYSTEM," which is incorporated by reference herein in its entirety. This application is a continuation of and claims benefit of priority to U.S. Provisional Patent Application No. 63/310,058, filed Feb. 14, 2022, titled "INTEGRATED TENSOR COMPUTATIONS UTILIZING CONSTANTS," which is incorporated by reference herein in its entirety. This application is a continuation of and claims benefit of priority to U.S. Provisional Patent Application No. 63/310,049, filed Feb. 14, 2022, titled "INTEGRATED TENSOR COMPUTATIONS WITH BACK PROPAGATION," which is incorporated by reference herein in its entirety.
Provisional Applications:

| Number | Date | Country |
| --- | --- | --- |
| 63310049 | Feb 2022 | US |
| 63310058 | Feb 2022 | US |
| 63308916 | Feb 2022 | US |
Continuation Data:

| Relation | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | 18102658 | Jan 2023 | US |
| Child | 18225339 | | US |