Many applications that are implemented in hardware take as input a list or vector of data words and produce a vector of data words of the same length as output. These applications are often implemented using streaming architectures. In these architectures, the input vector is divided into chunks of equal length and these chunks enter the architecture at regular intervals. Similarly, the output is produced in chunks exiting the architecture at regular intervals. Streaming means that the architecture can start processing the first chunk of the next input vector immediately after the last chunk of the current input vector has entered. The present invention is concerned with the design of streaming architectures for permutations of a data vector. A permutation is a fixed reordering of the data words in a predetermined manner.
a shows a permutation in space. All 2^n datawords are available in one time interval and can hence be permuted using only wires (the wires are not shown).
The present invention is concerned generally with designing streaming architectures for permutation in the scenario shown in
In one general aspect, the present invention is directed to computer-implemented systems and methods that provide an efficient technique for performing a large class of permutations on data vectors of length 2^n, n>1, implemented with streaming width 2^k (where 1≦k≦n−1). The technique applies to any permutation Q on 2^n datawords that can be specified as a linear transform, i.e., as an n×n bit matrix (a matrix containing only 1s and 0s) P on the bit level. The relationship between Q and P is as follows: If Q maps (dataword) i to (dataword) j, then the bit representation of j is the bit-matrix-vector product of P with the bit representation of i. Bit-matrix-vector product means that the operations are over the finite field with two elements, 0 and 1 (often called GF(2)), addition being XOR and multiplication being AND.
Given such a permutation specified by the matrix P and given the streaming width (k), an architectural framework (or datapath) is calculated to implement the permutation. In various embodiments, the datapath comprises: (i) a number of memory banks that are capable, during one time interval, of reading a data word and writing a data word; (ii) a write-stage connection network that takes in each time interval as input the chunks of the input vector and writes the contained data words to the memory banks according to a matrix M as explained below; and (iii) a read-stage connection network that reads in each time interval data words from the memory banks according to the matrix N as explained below and outputs them in chunks. N and M are such that P=NM (again, this matrix-matrix product is over the finite field with two elements) and rank(M1)=rank(N1′)=k, wherein M1 is a submatrix of M and N1′ is a submatrix of N^{−1}. The details are described below.
In various embodiments, the computed solution is always at least partially optimal; that is, one of the stages (i.e., the write-stage or the read-stage) is optimal in terms of connectivity and control cost. Furthermore, in cases where the bit matrix P is a permutation matrix (i.e., a matrix having exactly one “1” in each row and in each column, with zeros elsewhere), both stages are optimal in terms of connectivity and control cost.
Various embodiments of the present invention are directed to a computer-implemented system and methods for computing the datapath, and, in other embodiments, to various classes of datapaths themselves.
Various embodiments of the present invention are described herein by way of example in conjunction with the following figures, wherein:
a is a diagram of a prior art permutation datapath that receives 2n inputs at once;
b is a diagram of a permutation datapath according to various embodiments of the present invention that receives streaming input data;
a) is a diagram illustrating the indexing of the streamed input vector with addresses according to various embodiments of the present invention;
b) is a diagram illustrating the indexing of the memory banks with addresses according to various embodiments of the present invention;
FIGS. 7a and 7b are diagrams of the processes according to various embodiments of the present invention;
According to various embodiments, the present invention is directed to computer-implemented systems and methods that compute a datapath architecture for performing certain types of specified permutations. The architecture, with reference to
The process works on a large class of permutations, that is, any permutation that can be specified as a linear transform or matrix-vector multiplication with a matrix P on the bit level as defined in [0015]. In addition, the solution is optimal in terms of connectivity and cost for a subset of this class, namely permutations for which said matrix P is a permutation matrix, that is, it has exactly one (1) in each row and exactly one (1) in each column, with zeros elsewhere. This class of permutations includes stride permutations, which are a widely used type of permutation, especially in applications utilizing fast Fourier transforms and other transforms.
Initially, some background on permutations may be helpful. Consider a permutation on 2^n points 0, . . . , 2^n−1. For example, the cyclic shift is defined by:
is chosen. The set of all permutations on 2^n points is denoted S_{2^n} and is a group, called the symmetric group. It has (2^n)! elements. The field (also called Galois field) with two elements may be denoted F_2={0, 1}, where its elements may be considered the two states of a bit. Addition and multiplication in F_2 are equivalent to the “xor” and “and” operations, respectively.
Permutations on 0, . . . , 2^n−1 can equivalently be seen as permutations on the corresponding bit representations x ∈ F_2^n of these numbers. These x may be viewed as column vectors and the least significant bit may be assumed to be on the bottom. For example, for n=2, 1 is represented as
In some cases the permutations of F2n correspond to linear mappings on F2n of the form,
δ=Px, P ∈ GLn(F2)
where GL_n(F_2) is the group of all invertible n×n bit matrices. Its size is (2^n−1)(2^n−2) . . . (2^n−2^{n−1}). Every bit matrix P ∈ GL_n(F_2) defines a permutation in S_{2^n}, which may be captured by the mapping:
$$\pi : GL_n(\mathbb{F}_2) \to S_{2^n}$$
GL_n(F_2) may be identified with its image π(GL_n(F_2)) in S_{2^n}.
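The mapping π can be illustrated with a short sketch. The function name `bit_matrix_to_permutation` and the two small matrices are assumptions chosen for illustration; the bit ordering follows the convention stated above (column vectors with the least significant bit at the bottom, so row 0 and column 0 correspond to the most significant bit).

```python
def bit_matrix_to_permutation(P):
    """Return perm with perm[i] = j, where bits(j) = P * bits(i) over GF(2).

    P is a list of n rows of 0/1 entries; row 0 / column 0 is the MSB.
    """
    n = len(P)
    perm = []
    for i in range(2 ** n):
        j = 0
        for r in range(n):
            bit = 0
            for c in range(n):
                # AND is the GF(2) product, XOR is the GF(2) sum
                bit ^= P[r][c] & (i >> (n - 1 - c)) & 1
            j |= bit << (n - 1 - r)
        perm.append(j)
    return perm

# Hypothetical example 1: swap the two address bits.
swap_bits = [[0, 1],
             [1, 0]]
# Hypothetical example 2: an invertible matrix that is not a bit permutation.
shear = [[1, 1],
         [0, 1]]

print(bit_matrix_to_permutation(swap_bits))  # [0, 2, 1, 3]
print(bit_matrix_to_permutation(shear))      # [0, 3, 2, 1]
```

Any invertible bit matrix yields a bijection on the 2^n addresses, which is why the result is a valid permutation in both cases.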
Consider an example where
implies:
To clearly distinguish matrices P that operate on n bits from permutation matrices that operate on 2^n points, the latter is marked with a →.
In addition, the “direct sum” and the Kronecker or “tensor product” of matrices P, Q respectively may be denoted:
Further, I_n denotes the n×n identity matrix and J_n is I_n with the columns in reversed order. The stride permutation may be defined via:
$$\vec{L}_{2^n,2^s} : i\,2^{n-s} + j \mapsto j\,2^{s} + i, \quad 0 \le i < 2^{s},\; 0 \le j < 2^{n-s}$$
Equivalently,
With this notation, the following properties of π in equation (2) above can now be stated. If P, Q ∈ GL_n(F_2), then the following holds:
π(PQ)=π(P)π(Q) and π(P^{−1})=π(P)^{−1} (i.e., π is a group homomorphism) (1)
$\pi(P \oplus Q) = \pi(P) \otimes \pi(Q)$ (2)
$\pi(I_n) = \vec{I}_{2^n}$
$\pi(C_n^k) = \vec{L}_{2^n,2^k}$
$\pi(J_n) = \vec{R}_{2^n}$
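The link between the stride permutation and a cyclic shift of the address bits can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the rotation direction shown (a left rotation of the n-bit address by s positions) is what the index formula i·2^{n−s}+j ↦ j·2^s+i gives when addresses are read as ordinary binary integers.

```python
def stride_permutation(n, s):
    """dest[i * 2**(n - s) + j] = j * 2**s + i for 0 <= i < 2**s, 0 <= j < 2**(n - s)."""
    dest = [0] * (2 ** n)
    for i in range(2 ** s):
        for j in range(2 ** (n - s)):
            dest[i * 2 ** (n - s) + j] = j * 2 ** s + i
    return dest

def rotate_left(x, s, n):
    """Cyclically rotate the n-bit address x left by s bit positions."""
    mask = (1 << n) - 1
    return ((x << s) | (x >> (n - s))) & mask

n, s = 4, 1
L = stride_permutation(n, s)
# The index map and the bit rotation agree on every address.
assert all(L[x] == rotate_left(x, s, n) for x in range(2 ** n))
print(L[:4])  # [0, 2, 4, 6]
```

Because the stride permutation only moves address bits around, its bit matrix is a permutation matrix, which is the case the text later singles out as fully optimal.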
Based on linear algebra, if P ∈ F2m×n is a m×n bit matrix, then
are the “image” and “kernel” (or nullspace), respectively, of the linear mapping defined by the matrix P. It holds that dim(im(P))=rank(P) and dim(ker(P))+dim(im(P))=n. Further, rank(P+Q)≦rank(P)+rank(Q) and rank(PQ)≦min(rank(P), rank(Q)).
If V≦F_2^n is a subvector space of dimension dim(V)=k, then |V|=2^k. If x ∈ F_2^n, then x+V={x+υ|υ∈V} is called a coset of V in F_2^n. Its size is again 2^k and there are precisely 2^{n−k} different cosets.
With this background, the problem posed is as follows: given an invertible linear bit mapping, or bit matrix, P ∈ GL_n(F_2), where π(P) is the corresponding permutation on 2^n points, the desire is to design a logic block that permutes a vector of 2^n data words with π(P). The vector is streamed with a streaming width 2^k, k≦n. This means that every cycle the logic block takes a segment of length 2^k of the vector as input, and produces a segment of equal length of the permuted vector as output. Further, the logic block takes advantage of 2^k banks of memory (e.g., addressable portions of memory, such as random access memory or RAM), each of which can hold at least 2^{n−k} data words of the vector. In other embodiments, if each available RAM can only hold 2^c words, c<n−k, then 2^{n−k−c} RAMs can be used to simulate a RAM of size 2^{n−k}. Preferably, each memory can concurrently read and write one (1) data word in each cycle. For example, the RAMs may be dual-ported (e.g., having an input port and an output port). This constraint can easily be generalized to any power-of-two number of ports.
A logic block satisfying the conditions described above can be implemented on current FPGA (field programmable gate array) platforms or ASICs (application specific integrated circuits).
As shown in
In addition to finding a solution for the permutation, the logic blocks of the connection networks 12a-b are also preferably optimized in terms of cost. The steps shown in
For the address scheme and stages (step 31), the streamed input vector of length 2n may be indexed with addresses x ∈ F2, as shown in
The addressing of the output vector y is analogous to the addressing of the input vector. Preferably P, i.e., y=Px, is performed in two stages, the write-stage z=Mx and the read-stage y=Nz, which is equivalent to a factorization
P=NM, (3)
where M, N ∈ GL_n(F_2) are again (necessarily) invertible bit matrices. In words, M determines how the streaming data is stored into the RAMs (the write-stage), and N determines how it is read out of the RAMs into the resulting data stream (the read-stage). The addressing preferably is chosen such that M=I_n or N=I_n makes the write or read stage trivial, respectively. This means that the connection network and all address computations vanish.
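The two-stage scheme can be sketched functionally in software. This is a behavioral model under stated assumptions, not the hardware datapath: the function names are hypothetical, the matrices M and N are arbitrary small invertible examples, the low k bits of the intermediate address z are taken as the bank number and the remaining bits as the address within the bank, and the read stage tabulates N z for all z instead of computing the inverse of N. The model ignores timing and bank conflicts.

```python
def matvec(A, x):
    """Multiply bit matrix A by address x over GF(2); row/column 0 is the MSB."""
    n = len(A)
    y = 0
    for r in range(n):
        bit = 0
        for c in range(n):
            bit ^= A[r][c] & (x >> (n - 1 - c)) & 1
        y |= bit << (n - 1 - r)
    return y

def matmul(A, B):
    """Product of two n x n bit matrices over GF(2)."""
    n = len(A)
    return [[sum(A[r][t] & B[t][c] for t in range(n)) % 2 for c in range(n)]
            for r in range(n)]

def stream_permute(data, M, N, k):
    """Write with z = M x, then read with y = N z, using 2**k banks."""
    n = len(M)
    low = (1 << k) - 1
    banks = [[None] * (1 << (n - k)) for _ in range(1 << k)]
    for x in range(1 << n):                      # write-stage
        z = matvec(M, x)
        banks[z & low][z >> k] = data[x]
    z_of_y = {matvec(N, z): z for z in range(1 << n)}
    out = []
    for y in range(1 << n):                      # read-stage
        z = z_of_y[y]
        out.append(banks[z & low][z >> k])
    return out

# Hypothetical invertible example matrices for n = 3, k = 1.
M = [[1, 0, 0], [0, 1, 1], [0, 0, 1]]
N = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
P = matmul(N, M)
data = [f"w{x}" for x in range(8)]
out = stream_permute(data, M, N, k=1)
# The word that entered at address x leaves at address P x.
assert all(out[matvec(P, x)] == data[x] for x in range(8))
```

The check at the end confirms that any factorization P=NM realizes the permutation functionally; the rank conditions discussed next are what make it realizable with one write and one read per bank and cycle.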
By partitioning the addresses as explained above, the following expanded version of the write-stage z=Mx may be obtained:
The matrix tiling is compatible with the partitioning of the vectors. For example, M1 is a k×k matrix.
Analogous to the write-stage, the read-stage y=N z may also be expanded. However, there is one crucial difference. Namely, the control logic 16 in
The primes emphasize that N−1 is tiled and not N.
For step 32, because the RAMs are preferably dual-ported, they allow only one write (and one read) per cycle. Thus, in the write-stage it is required that, for every fixed stage, the 2^k data words are mapped into different RAMs 14. Mathematically, this means that for any fixed x2 the mapping
$$z_1 = M_2 x_2 + M_1 x_1$$
is bijective. This is the case if and only if M1 is invertible, i.e., has full rank: rank(M1)=k. A similar discussion yields the requirement rank(N′1)=k.
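The equivalence between conflict-free writes and rank(M1)=k can be checked by brute force on small cases. The sketch below assumes the tiling implied by the text (M1 is the bottom-right k×k tile, the low k bits of an address select the position within a stage and the RAM number); the helper names and the two 3×3 example matrices are hypothetical.

```python
def matvec(A, x):
    """Multiply bit matrix A by address x over GF(2); row/column 0 is the MSB."""
    n = len(A)
    y = 0
    for r in range(n):
        bit = 0
        for c in range(n):
            bit ^= A[r][c] & (x >> (n - 1 - c)) & 1
        y |= bit << (n - 1 - r)
    return y

def gf2_rank(A):
    """Rank of a 0/1 matrix over GF(2), by elimination on integer-encoded rows."""
    rows = [sum(b << i for i, b in enumerate(row)) for row in A]
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot:
            rank += 1
            low = pivot & -pivot  # lowest set bit of the pivot row
            rows = [r ^ pivot if r & low else r for r in rows]
    return rank

def tile_M1(M, k):
    """Bottom-right k x k tile of M."""
    return [row[-k:] for row in M[-k:]]

def conflict_free(M, k):
    """True if, in every stage x2, the 2**k words go to 2**k different banks."""
    n = len(M)
    mask = (1 << k) - 1
    for x2 in range(1 << (n - k)):
        banks = {matvec(M, (x2 << k) | x1) & mask for x1 in range(1 << k)}
        if len(banks) != 1 << k:
            return False
    return True

k = 2
good = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # identity: M1 = I_2
bad = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]    # swaps bit 2 and bit 0: M1 is singular
for M in (good, bad):
    print(gf2_rank(tile_M1(M, k)), conflict_free(M, k))
```

For the second matrix, the lowest address bit no longer influences the bank number, so two words of the same stage collide, matching the rank deficiency of its M1 tile.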
For step 33, the major cost factors in the circuit of
Given M and N, the required connectivity can be measured as follows. Let M be as in equation (4) (and N as in equation (5)), and assume that M1 (N′1) is invertible and set rank(M2)=s (rank(N′2)=s). Then, the 2k-to-2k connection network 12a in the write-stage (read-stage) of
$$x_1 + \mathrm{im}(M_1^{-1} M_2) \quad\text{and}\quad M_1 x_1 + \mathrm{im}(M_2)$$
for the write-stage and are the cosets
$$y_1 + \mathrm{im}\big((N_1')^{-1} N_2'\big) \quad\text{and}\quad N_1' x_1 + \mathrm{im}(N_2')$$
for the read-stage. Further, each block has to support an all-to-all connection and precisely 2^s different configurations. We call this the “connectivity lemma.”
To prove the connectivity lemma, set s=rank(M2) and assume x1 is the address of a stage location. The RAM numbers z1 that x1 connects to can be accumulated, over all stages. This is the set
The size of this set is 2s. Now assume x′1 is another address within a stage, and satisfies Ux
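$$M_1(x_1 - x_1') \in \mathrm{im}(M_2) \;\Leftrightarrow\; x_1' \in x_1 + M_1^{-1}\,\mathrm{im}(M_2) = x_1 + \mathrm{im}(M_1^{-1} M_2).$$
The size of this set is also 2^s. Conversely, assume x′1 has the above form, i.e., x′1=x1+M1^{−1}M2x2 for some x2 ∈ F_2^{n−k}. Then, using (6),
$$U_{x_1'} = \mathrm{im}(M_2) + M_1 x_1 + M_2 x_2 = \mathrm{im}(M_2) + M_1 x_1 = U_{x_1}.$$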
In other words, if x1 and x′1 share one connection target z1, they share all 2^s targets, which proves the block decomposition. The input and output index sets are also computed as desired.
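The statements of the connectivity lemma can be checked by enumeration on a small case. The 4×4 matrix below is a hypothetical example (n=4, k=2, with M1=I_2 and rank(M2)=1), and the helper names are assumptions; the tiling and bit ordering follow the conventions used in the text.

```python
def matvec(A, x):
    """Multiply bit matrix A by address x over GF(2); row/column 0 is the MSB."""
    n = len(A)
    y = 0
    for r in range(n):
        bit = 0
        for c in range(n):
            bit ^= A[r][c] & (x >> (n - 1 - c)) & 1
        y |= bit << (n - 1 - r)
    return y

def gf2_rank(A):
    """Rank of a 0/1 matrix over GF(2)."""
    rows = [sum(b << i for i, b in enumerate(row)) for row in A]
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot:
            rank += 1
            low = pivot & -pivot
            rows = [r ^ pivot if r & low else r for r in rows]
    return rank

n, k = 4, 2
M = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [1, 0, 1, 0],   # bottom-left tile M2 = [[1, 0], [0, 0]]
     [0, 0, 0, 1]]   # bottom-right tile M1 = I_2
mask = (1 << k) - 1
s = gf2_rank([row[:n - k] for row in M[-k:]])   # s = rank(M2)

# Targets of each stage position x1, accumulated over all stages x2.
targets = {x1: frozenset(matvec(M, (x2 << k) | x1) & mask
                         for x2 in range(1 << (n - k)))
           for x1 in range(1 << k)}
blocks = set(targets.values())

# One tuple of bank numbers per stage; equal tuples mean equal switch settings.
configs = {tuple(matvec(M, (x2 << k) | x1) & mask for x1 in range(1 << k))
           for x2 in range(1 << (n - k))}

print(s, sorted(sorted(b) for b in blocks), len(configs))
```

In this example the 4-to-4 network splits into 2^{k−s}=2 blocks with 2^s=2 targets each, and only 2^s=2 distinct configurations occur over the four stages.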
The above also shows that each block has to support an all-to-all connection. The remaining question is the number of control configurations. Assume two stages x2, x′2 that connect all x1 equally, i.e., for all x1 ∈ F_2^k,
$$M_2 x_2 + M_1 x_1 = M_2 x_2' + M_1 x_1 \;\Leftrightarrow\; x_2' \in x_2 + \ker(M_2).$$
The size of this set is 2n−k−s, and the 2n−k stages partition into 2s many groups of size 2n−k−s each, such that within the groups, all connections are equal. These groups are the cosets x2+ker(M2). Between different groups of stages the connections differ since each x1 has 2s many targets (|Ux
The connectivity lemma implies that it is desirable to minimize rank(M2) and rank(N′2) in (4) and (5) to minimize the area cost of the implementation in
The control blocks 16a-b in
With the above notation and discussion, the problem can be formally stated as follows: given an invertible bit matrix P ∈ GL_n(F_2) and a streaming width 2^k, k≦n, determine a factorization P=NM, such that rank(M1)=rank(N′1)=k. The goal is to minimize rank(M2), rank(N′2), and the complexity of M and N^{−1}. Necessarily, M, N ∈ GL_n(F_2). This problem is sometimes referred to hereinafter as the “factorization problem.”
As an example, let M be the matrix representing the write stage of a permutation, streamed with 2^3 ports, blocked as in (4).
From the connectivity lemma, it is seen that the 8-to-8 network decomposes into two 4-to-4 blocks. The input set is given by the coset
and the output set is given by the coset
Thus, the input sets are {0, 2, 4, 6} and {1, 3, 5, 7}. The output sets are {0, 1, 2, 3} and {4, 5, 6, 7}. From the “connectivity” definition, it is seen that the connectivity of this network is given by
$$\mathrm{conn}(M) = 2^{\mathrm{rank}(M_2)} = 2^{2} = 4.$$
The definition for “control cost” given above shows that the control cost, cost (M), is given by the linear complexity of M. So,
cost (M)=2.
Before providing an explicit method to compute suitable factorizations P=NM, lower bounds on the quality (i.e., the connectivity and the control cost) of a possible solution may be determined. This will allow later identification of the cases where the solutions are guaranteed to be optimal. To do this, similar to M and N^{−1} in (4) and (5), P and P^{−1} may also be tiled as
for the following discussion.
Our “connectivity theorem” is that, assuming P=N M is a solution of the factorization problem, then
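$$\mathrm{rank}(M_2) \ge k - \mathrm{rank}(P_1') \quad\text{and}\quad \mathrm{rank}(N_2') \ge k - \mathrm{rank}(P_1),$$
which is equivalent to
$$\mathrm{conn}(M) \ge 2^{k-\mathrm{rank}(P_1')} \quad\text{and}\quad \mathrm{conn}(N) \ge 2^{k-\mathrm{rank}(P_1)}.$$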
To prove this theorem, assume a solution P=NM, which implies N^{−1}P=M, i.e., the equation
and M1 and N′1 have full rank k. Further, let r=rank(P1), which implies rank(P3)≧k−r, since P has full rank. Equation (9) implies M1=N′2P3+N′1P1. From this, the following is obtained:
As a consequence rank(N′2)≧k−r or
$$\mathrm{conn}(N) \ge 2^{k-r} = 2^{k-\mathrm{rank}(P_1)},$$
as desired. The bound on conn(M) may be obtained analogously, starting from MP^{−1}=N^{−1}.
Our control cost theorem is that, assuming P=N M is a solution, then
cost(M)≧k−rank(P′1) and cost (N)≧k−rank(P1)
To prove this theorem, recall that in the proof of the connectivity theorem it was asserted that rank(N′2)≧k−rank(P1). This implies that N′2 contains at least k−rank(P1) non-zero elements. Since rank(N′1)=k, the linear complexity of the matrix (N′2 N′1), and thus that of N^{−1}, is also at least k−rank(P1). The bound on cost(M) may be obtained analogously.
The connectivity and control cost theorems show that the lower bounds for both the connectivity and the control cost are determined by rank (P1) and rank (P′1), respectively. In the worst case, both ranks are zero.
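The two lower bounds can be evaluated directly from the tiles of P and P^{−1}. The sketch below is restricted to bit permutations, for which the inverse is the transpose. The example matrix, a cyclic shift of the bit positions by two with n=6 and k=3, is an assumption used as a stand-in for a stride-type bit matrix and is not taken from the document's own equations; the helper names are hypothetical.

```python
def gf2_rank(A):
    """Rank of a 0/1 matrix over GF(2)."""
    rows = [sum(b << i for i, b in enumerate(row)) for row in A]
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot:
            rank += 1
            low = pivot & -pivot
            rows = [r ^ pivot if r & low else r for r in rows]
    return rank

def bottom_right(A, k):
    """Bottom-right k x k tile (the tile called P1 or P1' in the text)."""
    return [row[-k:] for row in A[-k:]]

def transpose(A):
    return [list(col) for col in zip(*A)]

n, k, shift = 6, 3, 2
# Assumed example: permutation matrix that cyclically shifts the bit positions.
P = [[int(c == (r + shift) % n) for c in range(n)] for r in range(n)]
P_inv = transpose(P)  # valid because P is a permutation matrix

bound_M = k - gf2_rank(bottom_right(P_inv, k))  # lower bound on rank(M2) and cost(M)
bound_N = k - gf2_rank(bottom_right(P, k))      # lower bound on rank(N2') and cost(N)
print(bound_M, bound_N, 2 ** bound_M, 2 ** bound_N)
```

For this example both bounds equal 2, so any solution needs connectivity of at least 4 in each stage and at least two additions per address computation.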
As an example, consider the stride permutation
streamed with width 2^k. The corresponding tiling of P=C_n^2 as in (8) has, independent of k, the form
In this case rank(P1)=rank(P′1)=k−2 (using P^{−1}=P^T). This implies that a solution P=NM that meets the lower bounds satisfies conn(M)=conn(N)=4, i.e., both connection networks decompose into 4-to-4 all-to-all networks. Further, cost(M)=cost(N)=2, i.e., each address computation requires two additions (or xor operations). It is shown later that such an optimal solution does indeed exist.
In the connectivity lemma, it was established that the two connection networks 12a-b in the write and read stage decompose into blocks with 2^s inputs and outputs, where s=rank(M2) or s=rank(N′2), respectively. In the following, these networks are further analyzed and decomposed to obtain an efficient implementation.
Consider the write stage given by M partitioned as in (4) and with invertible M1. The connection network 12a of the write stage, now considered without subsequent writing into the RAMs 14, performs for every stage a permutation within that stage according to z1=M2x2+M1x1. If the addressing scheme in
Conversely, every matrix of this form with invertible M1 defines a connection network that decomposes into 2^s-to-2^s blocks, where s=rank(M2).
To implement the network efficiently, it may be decomposed further. First,
Next, the following may be used:
Namely, M2 may be written as a sum of s=rank(M2) many matrices Ti of rank one,
$$M_2 = T_1 + \dots + T_s. \quad (13)$$
This is possible constructively, for example, by performing Gauss elimination, M2=QM′2, where Q is invertible and M′2 has s nonzero rows. Each row yields a summand T′i:
$$M_2' = T_1' + \dots + T_s',$$
and setting Ti=QT′i yields the result.
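A decomposition into rank-one summands can be sketched as follows. Note that this sketch uses a pivot-based rank reduction (subtracting the outer product of a pivot column and pivot row), which is a closely related elimination procedure and not necessarily the exact construction M2=QM′2 described above; the function names and the 3×3 example are hypothetical.

```python
def rank_one_terms(M2):
    """Split a 0/1 matrix into rank-one summands over GF(2).

    Each step picks a nonzero pivot entry (r, c) and removes the outer
    product of column c and row r, which lowers the rank by exactly one.
    """
    A = [row[:] for row in M2]
    terms = []
    while True:
        pivot = next(((r, c) for r, row in enumerate(A)
                      for c, v in enumerate(row) if v), None)
        if pivot is None:
            return terms
        r, c = pivot
        col = [row[c] for row in A]
        row_r = A[r][:]
        T = [[u & v for v in row_r] for u in col]   # outer product, rank one
        terms.append(T)
        A = [[a ^ t for a, t in zip(ra, rt)] for ra, rt in zip(A, T)]

def xor_sum(terms, rows, cols):
    """Sum of the given matrices over GF(2)."""
    total = [[0] * cols for _ in range(rows)]
    for T in terms:
        total = [[a ^ t for a, t in zip(ra, rt)] for ra, rt in zip(total, T)]
    return total

# Hypothetical example of rank 2 (the third row is the sum of the first two).
M2 = [[1, 1, 0],
      [0, 1, 1],
      [1, 0, 1]]
terms = rank_one_terms(M2)
print(len(terms))                      # 2
assert xor_sum(terms, 3, 3) == M2
```

The number of summands equals rank(M2), which is the number of cascaded switching stages in the decomposed connection network.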
Using equation (12), T in equation (11) can be factorized to decompose the connection network. Analyzing the factors with the connectivity lemma yields the following theorem (which is referred to herein as the “connection network decomposition” theorem): The connection network 12a of the write stage in
This implies that the connection network can be decomposed into a cascade of s stages. Each stage is a connection network that consists of parallel 2-to-2 connection networks, or blocks, simultaneously controlled by one control bit. The input and output sets of these blocks are again cosets, as determined by the connectivity lemma. A similar statement holds for connection network of the read stage 12b in
In equation (13), the summands can be permuted into any order, which implies that the stages can also be permuted. Each stage may be controlled by one bit. This implies that the entire network has 2^s possible configurations. This number coincides with the number stated in the connectivity lemma, as desired. The control bit for each stage may be calculated by Tix2 (or Tiy2 when performing N^{−1}).
The input and output sets of the first (or write) state may be determined by:
For all other stages Ti, where 1<i≦s, both the input and output sets are given by
x1+im (Ti).
For the read stage, the same expressions hold, with N′1 substituted for M1.
Continuing the above example, let M be as given in (7). Based on the connection network decomposition theorem, it can be seen that M2 can be decomposed in the following way.
A matrix with rank 2 (M2) may be decomposed into two matrices of rank 1. This corresponds to breaking the 4-to-4 connection network into two independent 2-to-2 networks. This results in the following factorized connection network T:
Recall that T operates on a vector x of length 6. x can be written as (x5, x4, x3, x2, x1, x0)T, where x5 indicates the most significant bit, and x0 indicates the least significant bit. Reading T from right to left allows the resulting connection network to be determined, as shown in
First, the rightmost term may be used to determine the initial permutation and first switching stage. This stage has input sets:
and output sets
So, the input sets are {0, 2}, {1, 3}, {4, 6}, and {5, 7}, and the output sets are {0, 1}, {2, 3}, {4, 5}, and {6, 7}. This, along with the initial blocking examined above, gives the initial permutation and switching structure seen in
Next, the second stage's input and output sets may be determined by
So, both the input and output sets are {0, 2}, {1, 3}, {4, 6}, and {5, 7}. This gives the criss-crossing pattern seen before and after the switching column in
To solve the factorization problem outlined above (namely, determining a factorization P=N M, such that rank(M1)=rank(N′1)=k), so called “helper matrices,” denoted as H ∈ GLn(F2), may be used. According to various embodiments, a helper matrix has the form
and, due to (12), is always self-inverse:
$$H = H^{-1}.$$
Given P and any H, the factorization becomes:
P=H·HP.
Setting N=H and M=HP, it is observed that N satisfies the rank condition rank(N′1)=k set forth above, since N^{−1}=H and the bottom-right k×k tile of H is the identity. The remaining question is how to design H such that the rank condition on M is also satisfied and to minimize the connectivity and control costs.
In one embodiment, to solve this problem, assume P is given tiled as in (8). If M=HP, tiled as in (4), then
$$M_1 = H_2 P_3 + P_1.$$
In other words, an H2 must be found such that rank(M1)=k. H2 determines H in (15) and hence the factorization of P in (16). The following theorem (referred to sometimes herein as the “helper matrix theorem”) explains that this is possible. Let P ∈ GL_n(F_2) be tiled as in (8) with rank(P1)=r≦k; then there exists H2 with rank(H2)=k−r and with exactly k−r non-zero entries such that M1=H2P3+P1 has full rank k. To prove this theorem, first define E((i1, j1), . . . , (ik, jk)) as the matrix that has ones at the locations (i1, j1), . . . , (ik, jk), and zeros elsewhere. The size of the matrix is clear from the context. Assuming H2=E((i, j)), then H2P3+P1 is the matrix P1 with the jth row of P3 added to its ith row. This gives the basic idea: k−r suitable rows of P3 are selected and added to suitable k−r rows of P1 to correct its rank deficiency. Intuitively, this is possible since P, and thus its submatrix
have full rank.
Consider first the special case where P is a permutation of bits, since it is somewhat simpler and important for applications. If P is a permutation, P1 contains r base vectors (as rows); the remaining k−r rows, with row indices i1, . . . , ik−r (in any order), are zero. The missing k−r base vectors are in P3, say at row indices j1, . . . , jk−r. It follows that H2=E((i1, j1), . . . , (ik−r, jk−r)) satisfies the requirements.
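For the bit-permutation case, the construction of H2 can be sketched in a few lines. The sketch assumes the tiling implied by M1=H2P3+P1 (P1 is the bottom-right k×k tile, P3 the top-right (n−k)×k tile, and H is the identity with H2 placed in the bottom-left k×(n−k) tile); the function names and the example matrix (a cyclic shift of the bit positions by two, n=6, k=3) are hypothetical.

```python
def matmul(A, B):
    """Product of two n x n bit matrices over GF(2)."""
    n = len(A)
    return [[sum(A[r][t] & B[t][c] for t in range(n)) % 2 for c in range(n)]
            for r in range(n)]

def gf2_rank(A):
    """Rank of a 0/1 matrix over GF(2)."""
    rows = [sum(b << i for i, b in enumerate(row)) for row in A]
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot:
            rank += 1
            low = pivot & -pivot
            rows = [r ^ pivot if r & low else r for r in rows]
    return rank

def helper_for_bit_permutation(P, k):
    """Build H = [[I, 0], [H2, I]] for a permutation matrix P."""
    n = len(P)
    P1 = [row[n - k:] for row in P[n - k:]]
    P3 = [row[n - k:] for row in P[:n - k]]
    zero_rows = [i for i, row in enumerate(P1) if not any(row)]   # i_1, ..., i_{k-r}
    donor_rows = [j for j, row in enumerate(P3) if any(row)]      # j_1, ..., j_{k-r}
    H = [[int(r == c) for c in range(n)] for r in range(n)]
    for i, j in zip(zero_rows, donor_rows):
        H[n - k + i][j] = 1   # entry (i, j) of H2
    return H

n, k = 6, 3
P = [[int(c == (r + 2) % n) for c in range(n)] for r in range(n)]
H = helper_for_bit_permutation(P, k)
M = matmul(H, P)          # write-stage matrix; the read-stage matrix is N = H
M1 = [row[n - k:] for row in M[n - k:]]
identity = [[int(r == c) for c in range(n)] for r in range(n)]

assert matmul(H, H) == identity      # H is self-inverse
assert matmul(H, M) == P             # P = N M with N = H
print(gf2_rank(M1))                  # 3
```

In this example P1 has rank 1, so H2 receives k−r=2 ones, and the resulting M1 is a full-rank permutation matrix.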
For a general P, more work has to be done to identify the proper row indices. Since P1 has rank r, r linearly independent columns of P1 can be permuted into the first r locations, and a Gauss elimination on the columns may be performed to zero out the last k−r columns. In other words, there is an invertible matrix G ∈ GL_k(F_2) such that Q1=P1G has the last k−r columns equal to zero. Setting Q3=P3G, the following is obtained:
Now, r linearly independent rows of Q1 may be identified, and the other row indices (in any order) may be called i1, . . . , ik−r. In each of the rows, the rightmost k−r entries are equal to zero. Since
has full rank, there are k−r rows whose subvectors, consisting of the k−r rightmost entries, are linearly independent. Their indices may be denoted by j1, . . . , jk−r. Setting H2=E((i1, j1), . . . , (ik−r, jk−r)), it is clear that H2Q3+Q1 has full rank and so does
$$(H_2 Q_3 + Q_1)\,G^{-1} = M_1.$$
The other requirements are also satisfied by H2.
The proof of this theorem is constructive and yields the following algorithm (referred to as the “factorization algorithm”) for solving the factorization problem discussed above. Assume an input permutation P ∈ GL_n(F_2) and k≦n, with the goal of obtaining an output N, M ∈ GL_n(F_2) such that P=NM and rank(M1)=rank(N′1)=k.
For a case where P is a permutation of bits (a matrix having exactly one 1 in each row and exactly one 1 in each column, with zeros elsewhere), the process, as shown in
For cases of an arbitrary P, the process, as shown in
In addition, the factorization algorithm terminates and is correct. Also, the factorization algorithm always produces a “partially” optimal solution, that is, at least one stage is optimal. In some important cases, such as stride permutations and others, the solution is optimal (i.e., both stages are optimal). The corresponding theorem, referred to as the “optimality theorem,” states that, for a given permutation P and k (related to the streaming width), the factorization algorithm produces a solution in which the read-stage is optimal with respect to both connectivity and control cost. Further, if P is a permutation of bits, then the write-stage is also optimal with respect to both connectivity and control cost. To prove this theorem, the solution can be compared against the lower bounds of the connectivity and control cost theorems. For the read stage, the helper matrix theorem establishes that in the factorization algorithm N′2=H2 has rank k−r. Thus conn (N)=2k−rank(P
For the write-stage, if permutation P is a permutation of bits, then M=HP, which implies M2=H2P4+P2. Since rank (P1)=r, rank (P2)=k−r. H2 is constructed to extract the k−r nonzero rows with indices jl from P3. Since P is a permutation, this implies that the jlth row of P4 is zero. As a consequence, H2P4 is zero and thus M2=P2. Further, P−1=PT and, hence, P′1=P1T. In summary, rank (M2)=k−r=k−rank (P1)=k−rank (P′1), which is minimal (see connectivity theorem). Further, M incurs k−r additions, i.e., cost (M)=k−rank (P′1), which is minimal (see control cost theorem).
As just demonstrated, for bit permutations P, the factorization algorithm produces optimal solutions. A few other special properties hold in this case and are discussed next. The proof of the optimality theorem asserts that if P is a bit permutation, then the algorithm yields M2=P2. In other words, M has the form
and, hence, differs from P only in the bottom right k×k submatrix. In the general case, both bottom submatrices M1, M2 differ. Further, the matrix M1=H2P3+P1 is a permutation matrix in this case, since it has full rank and at most k and thus precisely k nonzero entries. Finally, the matrix H2 obtained by the factorization algorithm contains exactly k−r ones, which are located in precisely those rows of P1 that are zero. The same holds for P2. Since rank (P2)=rank (H2)=k−r, one gets im (M2)=im (P2)=im (H2)=im (N′2). As a consequence, recalling that if M is as in equation (4) (and N as in equation (5)), and assuming that M1 (N′1) is invertible and set rank(M2)=s (rank(N′2)=s), then the 2k-to-2k connection network 12a in the write-stage (12b in the read-stage) of
For the connection networks, the read stage of the solutions produced by the factorization algorithm has the form
This makes the decomposition of the connection network according to the connection network decomposition theorem easy by setting
$$T_l = E((i_l, j_l)).$$
As an example, one may derive a solution for P=C_6^2 streamed with width 2^3, as shown below. Performing the above algorithm step by step, P is a permutation, thus the algorithm of
Now, the complete hardware implementation of this example is discussed, seen in
Consider the input vector of length 64 to be indexed with addresses x ∈ F_2^6, with the upper three (3) bits corresponding to the stage number, and the lower three (3) bits indicating the location within the stage. x may be written as (x5, x4, x3, x2, x1, x0)T, with x5 as the most significant bit and x0 as the least significant. Likewise, the output vector may be indexed with addresses y ∈ F_2^6 of the same form.
The factorization of M's connection network was previously shown in (14). From this, the following characteristics for the write stage are determined. As explained in the example above, the first stage is controlled by x4 and the second stage is controlled by x5. As discussed above, the memory write addresses are calculated directly from M. The write addresses are given by M4x2+M3x1. So, this gives memory write addresses
It is seen that the memory address where each word must be written is the three bit value given by (x3, x2, x1)T. The input and output connections of each block may be determined from the cosets, as discussed above.
The same process may be performed for N. Note that N=N^{−1}, so all N′i matrices are equal to the Ni versions. N^{−1} may be decomposed, using the connection network decomposition theorem, to
The first stage has input sets
and output sets given by
The second stage has input and output sets given by
So, the first stage has input and output sets {0, 1}, {2, 3}, {4, 5}, and {6, 7}. The second stage has input and output sets {0, 2}, {1, 3}, {4, 6}, and {5, 7}. Using the technique described above, it is seen that the first stage is controlled by y3, and the second by y4. The memory read addresses may be determined from
So, the memory read addresses are given by the three bit value (y5, y4, y3)T. The resulting implementation is visualized in
For purposes of illustration, now presented are seven example permutations with their full solutions. In addition to an exact specification, a visual representation of the solution matrices is used to show the form of the matrix under general conditions. In these figures, each box represents a matrix P, N, or M. The dashed lines show the divisions between the submatrices, and the solid black lines indicate segments with value 1 in the matrix. Gray boxes are used to indicate portions of the matrix that are unknown or not specified in the problem. Lastly, the blank areas of the box indicate portions of the matrix equal to zero. For example, the matrix Cn2 given in (10) is represented as shown in
P with full rank(P1). If the bottom-right matrix P1 has full rank (i.e., rank(P1)=k), then P fulfills the restrictions by itself. So, a solution of P=N·M is given by P=I·IP. This factorization can be visualized as shown in
cost(N)=0 and conn(N)=1.
The cost of M, cost(M), is equal to the linear complexity of P, and the connectivity is conn(M)=2^{rank(P2)}.
π(Q) ⊗ I. Given π(P)=π(Q) ⊗ I_{2^l}, then P=Q⊕I_l. For the case where l≧k, P1 has full rank and the problem is solved as in Example 1 above. This produces solution shown in
N=N0⊕Il, M=M0⊕Il
The arithmetic and connection costs of N and M are identical to the costs for N0 and M0 (respectively).
I ⊗ π(Q). Given π(P)=I_{2^l} ⊗ π(Q_m), then P=I_l⊕Q_m. For the case where m≦k, P1 has full rank, so the factorization is trivial. Using the “full-rank” solution seen in Example 1, the solution shown in
N=Ik−m ⊕ N0, M=Ik−m ⊕ M0
The resulting solution will have the costs:
cost(N)=cost(N0), conn(N)=conn(N0),
cost(M)=cost(M0), conn(M)=conn(M0).
The solution has the form shown in
Iπ(Q)I. If π(P)=I2′π(Qm)I2
Stride permutation L. A solution for stride permutation π(P)=L2
Applying the factorization algorithm described above, the helper matrix H2 is obtained:
$$H_2 = E\big((\min(k, n-s) - i,\; n - \max(s, k) - i) \;\big|\; i = 1, \dots, k-r\big) \quad (20)$$
This allows the final factorization to be computed according to N=H, M=HC_n^s. This factorization yields cost(M)=cost(N)=k−r and conn(N)=conn(M)=2^{k−r}. For the case where k<s and k<n−s, the factorization P=C_n^s=N·M has the form shown in
Bit reversal R. If π(P)=R2
$$H_2 = E\big((r+i,\; i) \;\big|\; i = 0, \dots, k-r\big) \quad (22)$$
Then, using P=N·M=H·HP, a solution with the following costs is produced:
$$\mathrm{cost}(M) = \mathrm{cost}(N) = k - r, \quad \mathrm{conn}(M) = \mathrm{conn}(N) = 2^{k-r}.$$
For the case where k≦n/2, the solution has the form shown in
Hadamard reordering. The Hadamard reordering permutes the 2n-element data vector X=(x0, x1, . . . , x2n−1)T to the vector Y=(xh2n(0), xh2n(1), . . . , xh2n(2n−1))T, where h2
$$\begin{aligned} h_1(0) &= 0 \\ h_{2K}(2i) &= h_K(i) \\ h_{2K}(2i+1) &= 2K - 1 - h_K(i), \quad i = 0, 1, \dots, K-1. \end{aligned}$$
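The recursion for h can be evaluated directly. The following sketch uses hypothetical function names and assumes K is a power of two, as in the 2^n-element reordering described above.

```python
def hadamard_index(K, i):
    """Evaluate h_K(i) from the recursion, for K a power of two and 0 <= i < K."""
    if K == 1:
        return 0                      # h_1(0) = 0
    h = hadamard_index(K // 2, i // 2)
    # Even index 2i: h_K(2i) = h_{K/2}(i); odd index 2i+1: K - 1 - h_{K/2}(i)
    return h if i % 2 == 0 else K - 1 - h

def hadamard_order(K):
    """The full index sequence (h_K(0), ..., h_K(K-1))."""
    return [hadamard_index(K, i) for i in range(K)]

print(hadamard_order(4))  # [0, 3, 1, 2]
print(hadamard_order(8))  # [0, 7, 3, 4, 1, 6, 2, 5]
```

Each sequence is a rearrangement of 0, . . . , K−1, so the reordering is a permutation of the data vector.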
This permutation is represented as the matrix Z2
If r=rank(P1), then r=max(0, 2k−n). Applying the factorization algorithm, the helper matrix H2 is obtained:
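$$H_2 = E\big((k-1-i,\; k-r-i) \;\big|\; i = 0, \dots, k-r-1\big) \quad (24)$$
For the case where k≦n/2, the solution P=N·M=H·HP has the form shown in
$$\mathrm{cost}(M) = n - 1 + 2(k-r), \quad \mathrm{cost}(N) = k - r, \quad \mathrm{conn}(M) = \mathrm{conn}(N) = 2^{k-r}.$$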
Table I below includes a summary of the example bit matrices and their solutions.
The streaming permutations considered in this application have a wide range of applications.
Some of the most important applications include transforms, transposition, sorting networks, and Viterbi coding, many of which access data in a stride permutation pattern.
Permutations are extremely important in fast computation of linear transforms. The Cooley-Tukey fast Fourier transform (FFT) and its variants use stride permutation L and bit reversal R. A similar Cooley-Tukey-like fast algorithm for the Walsh-Hadamard transform (WHT) also uses stride permutation L. Streaming permutations are applicable to fast algorithms for the discrete cosine transform (DCT) and discrete sine transform (DST). Fast Cooley-Tukey type algorithms for these transforms have been derived. These algorithms use the stride permutation, as well as the permutation K2
K2,2=(I2
If π(Q)=I2
So, if π(P)=K2
P=(I_{m−1}⊕Q)C_n^m.
The K permutation is linear on the bits, so it can be implemented by the factorization algorithm.
In addition to Cooley-Tukey type algorithms, fast regularized (or constant geometry) algorithms for computing the DCT and DST exist. These algorithms require the stride permutation L as well as the Hadamard reordering, Z2
Additionally, some optimizations in the transform domain can be simply implemented using embodiments of the present invention. For example, the prior art shows how the bit reversal and matrix transposition needed in the 2-D FFT can be performed together at a reduced cost. According to various embodiments, this optimization is easily realized as
π(P)=(I2
Permutations can also be important applications in themselves. For example, the transposition (or corner turn) of an n×n matrix is simply the stride permutation Ln
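The equivalence between a corner turn and a stride permutation is easy to see on a matrix stored in row-major order. The sketch below is illustrative only; the function name and the flat-list storage are assumptions, not part of the described datapath.

```python
def transpose_via_stride(flat, n):
    """Transpose an n x n matrix stored row-major in a flat list by
    reading the list at stride n (a stride permutation of its n*n entries)."""
    if len(flat) != n * n:
        raise ValueError("expected n*n elements")
    # Output position i = r*n + c takes the element at row c, column r.
    return [flat[(i % n) * n + (i // n)] for i in range(n * n)]


# A 3 x 3 matrix holding 0..8 row by row becomes 0,3,6,1,4,7,2,5,8.
turned = transpose_via_stride(list(range(9)), 3)
```

Applying the function twice returns the original list, as expected for a transposition.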
Other possible applications for the present invention occur in a variety of domains. For example, sorting networks and Viterbi coding both access data in a stride permutation pattern. The techniques of the present invention can be used in streaming implementations of these applications.
The examples presented herein are intended to illustrate potential and specific implementations of the embodiments. It can be appreciated that the examples are intended primarily for purposes of illustration for those skilled in the art. No particular aspect or aspects of the examples is/are intended to limit the scope of the described embodiments.
It is to be understood that the figures and descriptions of the embodiments have been simplified to illustrate elements that are relevant for a clear understanding of the embodiments, while eliminating, for purposes of clarity, other elements. For example, certain operating system details for a computer system are not described herein. Those of ordinary skill in the art will recognize, however, that these and other elements may be desirable in a typical processor or computer system. Because such elements are well known in the art and because they do not facilitate a better understanding of the embodiments, a discussion of such elements is not provided herein.
In general, it will be apparent to one of ordinary skill in the art that at least some of the embodiments described herein may be implemented in many different embodiments of software, firmware and/or hardware. The software and firmware code may be executed by a processor or any other similar computing device. The software code or specialized control hardware, which may be used to implement embodiments, is not limiting. For example, embodiments described herein may be implemented in computer software using any suitable computer software language type, such as, for example, C or C++ using, for example, conventional or object-oriented techniques. Such software may be stored on any type of suitable computer-readable medium or media, such as, for example, a magnetic or optical storage medium. The operation and behavior of the embodiments may be described without specific reference to specific software code or specialized hardware components. The absence of such specific references is feasible, because it is clearly understood that artisans of ordinary skill would be able to design software and control hardware to implement the embodiments based on the present description with no more than reasonable effort and without undue experimentation.
Moreover, the processes associated with the present embodiments may be executed by programmable equipment, such as computers or computer systems and/or processors. Software that may cause programmable equipment to execute processes may be stored in any storage device, such as, for example, a computer system (nonvolatile) memory, an optical disk, magnetic tape, or magnetic disk. Furthermore, at least some of the processes may be programmed when the computer system is manufactured or stored on various types of computer-readable media.
It can also be appreciated that certain process aspects described herein may be performed using instructions stored on a computer-readable medium or media that direct a computer system to perform the process steps. A computer-readable medium may include, for example, memory devices such as diskettes, compact discs (CDs), digital versatile discs (DVDs), optical disk drives, or hard disk drives. A computer-readable medium may also include memory storage that is physical, virtual, permanent, temporary, semipermanent, and/or semitemporary. A computer-readable medium may further include one or more data signals transmitted on one or more carrier waves.
A “computer,” “computer system,” “host,” or “processor” may be, for example and without limitation, a processor, microcomputer, minicomputer, server, mainframe, laptop, personal digital assistant (PDA), wireless e-mail device, cellular phone, pager, fax machine, scanner, or any other programmable device configured to transmit and/or receive data over a network. Computer systems and computer-based devices disclosed herein may include memory for storing certain software applications used in obtaining, processing, and communicating information. It can be appreciated that such memory may be internal or external with respect to operation of the disclosed embodiments. The memory may also include any means for storing software, including a hard disk, an optical disk, floppy disk, ROM (read only memory), RAM (random access memory), PROM (programmable ROM), EEPROM (electrically erasable PROM) and/or other computer-readable media.
In various embodiments disclosed herein, a single component may be replaced by multiple components and multiple components may be replaced by a single component to perform a given function or functions. Except where such substitution would not be operative, such substitution is within the intended scope of the embodiments. Any servers described herein, for example, may be replaced by a “server farm” or other grouping of networked servers (such as server blades) that are located and configured for cooperative functions. It can be appreciated that a server farm may serve to distribute workload between/among individual components of the farm and may expedite computing processes by harnessing the collective and cooperative power of multiple servers. Such server farms may employ load-balancing software that accomplishes tasks such as, for example, tracking demand for processing power from different machines, prioritizing and scheduling tasks based on network demand and/or providing backup contingency in the event of component failure or reduction in operability.
While various embodiments have been described herein, it should be apparent that various modifications, alterations, and adaptations to those embodiments may occur to persons skilled in the art with attainment of at least some of the advantages. The disclosed embodiments are therefore intended to include all such modifications, alterations, and adaptations without departing from the scope of the embodiments as set forth herein.
The present application claims priority to U.S. provisional application Ser. No. 60/997,596, filed Oct. 4, 2007, entitled “Streaming Data Permutation Datapath Using Memory Arrays,” which is incorporated herein by reference in its entirety.
This invention was made with support from the U.S. government, under National Science Foundation No. ITR/ACI-0325687, and Defense Advanced Research Program No. NBCH-1050009. The U.S. government has certain rights in this invention.
Number | Date | Country
---|---|---
60997596 | Oct 2007 | US