The present disclosure generally relates to systems and methods for decompositions, and more specifically, to systems and methods for Cholesky decompositions.
This section is intended to introduce the reader to various aspects of art that may be related to various aspects of the present disclosure, which are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.
In certain matrix operations, such as a matrix decomposition, the matrix is divided or factorized into a product of multiple matrices. For example, when solving a system of linear equations Ax=b, a Lower-Upper (LU) decomposition may factorize a matrix A into a lower triangular matrix L and an upper triangular matrix U. L(Ux)=b and/or Ux=L⁻¹b may then be used to result in fewer operations (e.g., additions, multiplications) when solving Ax=b. Cholesky decomposition may also be used to solve Ax=b, for example, if A is a symmetric and positive definite matrix. The Cholesky decomposition may find A=LL* where L is a lower triangular matrix with real and positive diagonal entries, and L* is a conjugate transpose of L. Forward substitution may then be used to solve for y via Ly=b, and back substitution may be used to solve for x via L*x=y. It may be advantageous to provide for systems and methods that more efficiently factorize A via Cholesky decomposition.
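As a brief illustration of the approach just described, the following sketch (assuming NumPy is available; the helper names forward_sub and back_sub are illustrative, not part of any particular system) solves Ax=b for a small symmetric positive-definite matrix via a Cholesky factor followed by forward and back substitution.

```python
import numpy as np

def forward_sub(L, b):
    # Solve L y = b for lower-triangular L by forward substitution.
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def back_sub(U, y):
    # Solve U x = y for upper-triangular U by back substitution.
    n = len(y)
    x = np.zeros(n)
    for i in reversed(range(n)):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

A = np.array([[4.0, 2.0], [2.0, 3.0]])   # symmetric positive definite
b = np.array([1.0, 2.0])
L = np.linalg.cholesky(A)                # A = L L^T with positive diagonal
x = back_sub(L.T, forward_sub(L, b))     # L y = b, then L^T x = y
print(np.allclose(A @ x, b))             # expected: True
```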
One or more specific embodiments will be described below. In an effort to provide a concise description of these embodiments, not all features of an actual implementation are described in the specification. It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another. Moreover, it should be appreciated that such a development effort might be complex and time consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure.
In certain operations, Cholesky decomposition techniques may be applied to solve a variety of matrix-based problems where a matrix A may be a Hermitian positive-definite matrix. Accordingly, the matrix A may be decomposed into A=LL* where L is a lower triangular matrix with real and positive diagonal entries, and where L* is a conjugate transpose of L. A linear equation in the form Ax=b may then be more easily solved, for example, when compared to solutions arrived at via Lower-Upper (LU) decompositions. For example, one can solve Ax=b by first deriving the Cholesky decomposition A=LL*. Forward substitution may then be used to solve for y via Ly=b, and back substitution may be used to solve for x via L*x=y. Accordingly, a variety of problems in digital signal processing (DSP), Monte Carlo simulations, Kalman filters, wireless communication, machine learning, and so on, may apply Cholesky decomposition. For instance, Linear Minimum Mean Squared Error (LMMSE) estimation and Principal Component Analysis (PCA) both may use Cholesky decomposition.
Applications that include Cholesky decomposition may include computationally intensive derivations, and may also include more stringent throughput and timing delay requirements. For example, a data dependency in a traditional approach for Cholesky decomposition may result in reduced parallelism for computing purposes and leave certain processing resources underutilized, especially when implemented in systems such as field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), and the like. In one example of the traditional approach, each entry L_{ij} (i ≥ j) may only be calculated after diagonal entries L_{kk} (k ≤ i) and non-diagonal entries L_{mn} (m = i, j; n < i) are derived first.
The techniques described herein provide for a hardware architecture and for processes that may compute certain numerators and denominators individually, minimizing or eliminating dependencies. For example, a data dependency between square root calculations of diagonal matrix elements and related multiplication and addition operations may now be minimized or eliminated. A novel triangular systolic array (TSA) architecture, which may include complex multiplier-accumulator (CMAC) cells, may be designed for use with the techniques described herein. The TSA architecture may enable enhanced parallelism for certain operations, thus improving the speed and/or efficiency of Cholesky decompositions. Additionally, the TSA architecture may be reused for other, non-Cholesky problem-solving derivations, such as back substitution computations, matrix-matrix multiplications, and/or matrix-vector multiplications.
With the foregoing in mind, the following describes a data processing system 100 that may implement the Cholesky decomposition techniques of the present disclosure, including one or more processors 102, memory and/or storage circuitry 104, and a network interface 106.
The processor(s) 102 may communicate with the memory and/or storage circuitry 104, which may be a tangible, non-transitory, machine-readable medium, such as random-access memory (RAM), read-only memory (ROM), one or more hard drives, flash memory, or any other suitable optical, magnetic, or solid-state storage medium. The memory and/or storage circuitry 104 may hold data to be processed by the data processing system 100, such as processor-executable control software, configuration software, system parameters, configuration data, etc.
The data processing system 100 may also include a network interface 106 that allows the data processing system 100 to communicate with other electronic devices. In some embodiments, the data processing system 100 may be part of a data center that processes a variety of different requests. For instance, the data processing system 100 may receive a data processing request via the network interface 106 to perform machine learning, video processing, voice recognition, image recognition, data compression, database search ranking, bioinformatics, network security pattern identification, spatial navigation, or some other specialized task. The data processing system 100 may also include one or more input/output systems 108, such as display devices (e.g., computer monitors), keyboards, mice, speakers, voice input devices, and so on, useful for entering and/or displaying information.
In the depicted embodiment, the processor 102 may be communicatively coupled to or include a Cholesky decomposition system 110. In use, the Cholesky decomposition system 110 may receive as input one or more matrices, such as Hermitian, positive-definite matrices A, and then decompose each matrix A into L and L* where L is a lower triangular matrix with real and positive diagonal entries, and L* is a conjugate transpose of L. In a traditional approach, L may be computed via Equation (1) and Equation (2).
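Equations (1) and (2) are not reproduced above; in one common indexing convention, the traditional entrywise Cholesky relations they refer to may be written as:

L_{ii} = \sqrt{A_{ii} - \sum_{k=1}^{i-1} L_{ik} L_{ik}^{*}} \quad (1)

L_{mi} = \frac{1}{L_{ii}} \left( A_{mi} - \sum_{k=1}^{i-1} L_{mk} L_{ik}^{*} \right), \quad m > i \quad (2)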
There exists a dependency between the square root calculation in Equation (1) used to compute L_{ii} and other operations in Equation (2) used to compute L_{mn} (m > n, m ≥ i). Only when the former square root calculation is complete can the latter part of the calculation start. The decomposition via Equation (1) and Equation (2) is thus tightly coupled as the index i goes from 1 to n. For example, the tight coupling may result in extra operations to complete the Cholesky decomposition.
The techniques described herein may improve decoupling of certain computations and thus increase parallelism and resource utilization. In one embodiment, certain entries, such as L_{22}, may be computed as fractions. For example, an Equation (3) may be used to derive L_{22} based on A_{11}, A_{21}, A_{22}, and L_{11} = √A_{11} values.
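Equation (3) itself is not reproduced above; a form consistent with the surrounding description (i.e., with a numerator free of L_{11}) is:

L_{22} = \sqrt{A_{22} - \frac{A_{21} A_{21}^{*}}{A_{11}}} = \frac{\sqrt{A_{11} A_{22} - A_{21} A_{21}^{*}}}{\sqrt{A_{11}}} \quad (3)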
As shown in Equation (3), the final numerator does not depend on L_{11}. Therefore, L_{11} and the numerator for L_{22} may be computed in parallel.
A more efficient parallel computing process 300 may thus be derived. For example, the process 300 may perform (block 304) Part I intermediate numerator calculations, in which deflated intermediate numerators A_{i,j}^k are computed from the entries of the input matrix A.
The process 300 may also perform (block 306) Part II square root calculations of A_{1,1}^k (k = 0, ..., n−1). The process 300 may additionally perform (block 308) Part III final result calculations, in which each intermediate numerator is divided by the corresponding square roots to produce the entries of the lower triangular matrix L.
Accordingly, the process may then result in a Cholesky decomposed matrix 312 having L as a lower triangular matrix with real and positive diagonal entries, and L* as a conjugate transpose of L.
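To make the three-part flow concrete, the following NumPy sketch is a software reference model of the process 300 as understood from the description above; the function name cholesky_three_part, the data layout, and the choice to defer all divisions to Part III are illustrative assumptions rather than the hardware implementation (as noted below, dividing earlier may improve numerical stability).

```python
# Minimal NumPy reference model of Parts I-III described above (an assumption
# for illustration, not the patented hardware process).
import numpy as np

def cholesky_three_part(A):
    """Return lower-triangular L with A = L @ L.conj().T for Hermitian positive-definite A."""
    A = np.array(A, dtype=complex)
    n = A.shape[0]

    # Part I: intermediate numerator calculations via repeated deflation,
    # A^{k+1}_{i-1,j-1} <- A^k_{1,1} A^k_{i,j} - A^k_{i,1} conj(A^k_{j,1}).
    first_columns = []            # first column of each deflated matrix A^k
    M = A.copy()
    for k in range(n):
        first_columns.append(M[:, 0].copy())
        m = M.shape[0]
        D = np.zeros((m - 1, m - 1), dtype=complex)
        for i in range(1, m):
            for j in range(1, i + 1):
                D[i - 1, j - 1] = M[0, 0] * M[i, j] - M[i, 0] * np.conj(M[j, 0])
                D[j - 1, i - 1] = np.conj(D[i - 1, j - 1])   # keep Hermitian structure
        M = D

    # Part II: square roots of the leading entries A^k_{1,1}, k = 0, ..., n-1.
    roots = [np.sqrt(first_columns[k][0].real) for k in range(n)]

    # Part III: final results -- each intermediate numerator A^k_{i,1} divided by
    # the accumulated square roots sqrt(A^0_{1,1}) ... sqrt(A^k_{1,1}) yields
    # column k of L (all divisions are deferred here purely for clarity).
    L = np.zeros((n, n), dtype=complex)
    for k in range(n):
        L[k:, k] = first_columns[k] / np.prod(roots[: k + 1])
    return L
```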
In the depicted process 300, most of the computations for the Cholesky decomposition may occur in Part I (block 304). Part I may not depend on Part II and Part III. Accordingly, the computations for Part I (block 304) may be pipelined, for example, via a systolic array system further described in the figures below. Further, Parts I, II, and III may be computed in parallel. For example, when Part II outputs √(A_{1,1}^k), the square root can already be divided into the intermediate numerators A_{i,j}^p (p ≥ k, i ≥ j). Accordingly, operations in Parts I, II, and III may be carried out simultaneously.
Advantageously, the time for dividing √(A_{1,1}^k) into the intermediate numerators may be flexible. In one embodiment, the division may take place during the computations (block 304) of Part I or after the completion of Part I. However, early division, substituting the resulting quotients for the intermediate numerators, may improve numerical stability and also improve hardware parallelism. In Part II (block 306), a reciprocal square root (1/√) may be computed instead of the depicted square root (√). When using the reciprocal square root in Part II (block 306), the division operations in Part III become multiplication operations. For floating point numbers, certain algorithms may compute 1/√ more efficiently than computing √ and then taking the reciprocal of the result, thus resulting in even more efficient process 300 computations.
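As a small numeric illustration of the point above (using NumPy purely as an example environment), replacing the square root with a reciprocal square root turns each Part III division into a multiplication:

```python
import numpy as np

d = 5.0                                  # a leading entry such as A^k_{1,1}
numerators = np.array([1.0, 2.0, 3.0])   # intermediate numerators to be scaled

r = 1.0 / np.sqrt(d)                     # Part II outputs the reciprocal square root
print(np.allclose(numerators / np.sqrt(d), numerators * r))   # expected: True
```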
It may be beneficial to illustrate an example application of the process 300 to a 4×4 matrix A. For example, the accompanying figures illustrate the process 300 applied to a 4×4 matrix A 350.
As indicated by the "for" loops in Part I (block 304), processing may then move to columns 362 and 364. For example, column 362 is processed so that entry A_{2,2}^1 is processed via (A_{1,1}^1 A_{2,2}^1 − A_{2,1}^1 A_{2,1}^{1*}) to derive A_{1,1}^2, and so that entry A_{3,2}^1 is processed via (A_{1,1}^1 A_{3,2}^1 − A_{3,1}^1 A_{2,1}^{1*}) to derive A_{2,1}^2.
Square root results of Part II (block 306) may then be used in the Part III (block 308) final result calculations.
Each column 382, 384, 386, and 388 may then be transformed into columns 390, 392, 394, and 396, respectively, included in an L matrix 398. The L matrix 398 is the lower triangular Cholesky factor of the matrix A 350.
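As a hypothetical check of the reference sketch given earlier (not the specific 4×4 matrix A 350 of the figures), a randomly generated Hermitian positive-definite matrix may be decomposed and compared against NumPy's built-in factorization:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
A = B @ B.conj().T + 4 * np.eye(4)       # Hermitian positive definite by construction

L = cholesky_three_part(A)               # sketch defined earlier
print(np.allclose(L, np.linalg.cholesky(A)))   # expected: True
print(np.allclose(L @ L.conj().T, A))          # expected: True
```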
In some embodiments, the process 300 or portions of the process 300 may be implemented via a complex-valued multiply-accumulate (CMAC) cell array, such as the CMAC cell array 400 described below.
The CMAC cell array 400, in one embodiment, is a systolic array. In systolic arrays, such as the CMAC cell array 400, the PEs are included in a homogeneous network of tightly coupled data processing units, where each PE independently computes a result (e.g., a partial result) of a function of the data received from its upstream neighbors, stores the result within itself, and then passes the result downstream to other PEs. If the input matrix A is n×n, a first iteration through the CMAC cell array 400 may generate a deflated lower triangular matrix of dimension (n−1)×(n−1), whose entries are A_{i,j}^1 (1 ≤ j ≤ n−1, j ≤ i ≤ n−1). There are n(n−1)/2 entries in the deflated matrix after the first iteration. Two CMAC operations are used to compute each entry, e.g., A_{i−1,j−1}^{k+1} ← A_{1,1}^k A_{i,j}^k − A_{i,1}^k A_{j,1}^{k*}. Therefore, the computational complexity in terms of CMAC operations in the first iteration is n(n−1). In the depicted embodiment, the n−1 PEs in the edge or boundary 402 of the systolic array are used exclusively for the first iteration. For these PEs, the number of cycles to input the data of one matrix is n cycles (the largest number of cycles, in order to be fully pipelined). The total number of CMAC operations that these n−1 PEs may provide is n(n−1), which is exactly the same as the number of CMAC operations required in the first iteration. Full utilization of the PEs may thus be achieved. After each subsequent iteration, the dimension of the subsequent deflated matrix decreases by one. For the k-th (1 ≤ k ≤ n−1) iteration, n−k PEs are assigned to the iteration, and each of these PEs takes n−k+1 cycles for each input matrix.
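As a rough software model of the per-entry work of one PE (the two CMAC operations noted above), one might write the update as follows; the function name and interface are assumptions for illustration, not the TSA hardware design:

```python
def pe_update(a11: complex, a_ij: complex, a_i1: complex, a_j1: complex) -> complex:
    # Computes A^{k+1}_{i-1,j-1} <- A^k_{1,1} A^k_{i,j} - A^k_{i,1} conj(A^k_{j,1})
    # as two complex multiply-accumulate (CMAC) operations.
    acc = 0j
    acc += a11 * a_ij                    # first CMAC
    acc += -a_i1 * a_j1.conjugate()      # second CMAC
    return acc
```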
More specifically, for the PEs used for one iteration, the order of input and output entries may differ from iteration to iteration.
For example, a first order 452 may be a diagonal pattern order, thus processing data via the main diagonal (e.g., edge 402). The next iteration 454 may then use a column pattern order, thus processing via the first column in the CMAC cell array 400. The following iteration 456 may then flip back to diagonal order, thus processing via a second diagonal of the CMAC cell array 400. The subsequent iteration may then flip back to column order, thus processing via columns 1 and 2 of the CMAC cell array 400, and so on. By providing for a systolic array, such as the CMAC cell array 400, the techniques described herein may more efficiently and rapidly process matrix data, for example, to result in a Cholesky decomposition of a matrix A.
The results of the Part I and Part II calculations may then be used in the Part III (block 308) final result calculations. As mentioned earlier, it may be possible to pipeline the CMAC cell array 400 and subsequent systems downstream (or upstream) of the CMAC cell array 400 so that data flows may stream through the pipelines, which may improve throughput and efficiency.
It is to be noted that entries of the Hermitian positive-definite matrix A may include digital signal processing (DSP) signals. For example, the signals may include data sampled over a time domain, a space domain, or both. Accordingly, the entries may be used as input to the CMAC cell array 400 described above. Further, the algorithms described herein, including process 300 may be implemented as computer code or instructions executable by the CMAC cell array 400, via processors, such as via the processor 102, via FPGAs, and/or via ASICs. Indeed, the techniques described herein, such as the process 300, may now be more easily implemented in a variety of computing systems.
While the embodiments set forth in the present disclosure may be susceptible to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and have been described in detail herein. However, it may be understood that the disclosure is not intended to be limited to the particular forms disclosed. The disclosure is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure as defined by the following appended claims.