This invention relates to a processing device and to a method for performing a round of a Fast Fourier Transform.
The Discrete Fourier Transform (DFT) is a linear transformation that maps a sequence of N input numbers X1 to XN (input operands) into a corresponding set of N transformed numbers (output operands). A Fast Fourier Transform (FFT) is a processing scheme for carrying out a DFT numerically in an efficient manner. The Cooley-Tukey algorithm is probably the most widely used FFT algorithm. It transforms the input operands in a sequence of several rounds. Each round is a linear transformation between a set of input operands and a corresponding set of output operands. The output operands of a given round may be used as the input operands of the next round, until the final output operands, i.e., the DFT of the initial input operands, are obtained. Each of these linear transformations may be represented by a sparse matrix and therefore can be carried out rapidly. The DFT can thus be represented as a product of sparse matrices.
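Purely for illustration, the following Python sketch evaluates the DFT directly from its definition (a zero-based index convention 0 to N-1 is assumed here, whereas the text above numbers the operands X1 to XN); an FFT produces the same result with fewer operations by splitting the computation into rounds.

```python
import cmath

def dft(x):
    # Direct evaluation of the DFT definition: X[k] = sum_n x[n] * W**(n*k),
    # with W = exp(-2j*pi/N).  This needs on the order of N*N operations;
    # an FFT such as Cooley-Tukey reaches the same result in roughly
    # N*log(N) operations by factoring the transform into rounds of
    # sparse linear transformations.
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]
```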
Each round of the FFT may involve the evaluation of so-called butterflies. A radix P butterfly is a linear transformation between P input operands and P output operands. In each round, the N input operands may be partitioned into N/P subsets of input operands. Each of these subsets may be transformed individually, i.e., independently of the other subsets of input operands, by means of the radix P butterfly. While the butterfly may be the same for each subset of input operands and for each round, the partitioning of the set of N input operands into the N/P subsets is generally different for each round.
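The partitioning described above may be illustrated by the following sketch. It rests on simplifying assumptions: the butterfly is modelled as a plain P-point DFT, the twiddle factors applied between rounds are omitted, and the names butterfly and fft_round are chosen here for illustration only.

```python
import cmath

def butterfly(block):
    # One radix-P butterfly on P input operands, modelled here as a P-point DFT.
    P = len(block)
    return [sum(block[n] * cmath.exp(-2j * cmath.pi * n * k / P) for n in range(P))
            for k in range(P)]

def fft_round(x, partition):
    # One round of the FFT: 'partition' lists the N/P subsets of P operand
    # indices; each subset is transformed by the same butterfly, independently
    # of the other subsets.
    y = [0j] * len(x)
    for subset in partition:
        for i, value in zip(subset, butterfly([x[i] for i in subset])):
            y[i] = value
    return y
```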
The left part of
Each input operand may be stored at an addressable memory cell. Similarly, each output operand of the round may be stored at an addressable memory cell. A memory cell or a buffer cell may also be referred to as a memory location or a buffer location, respectively. Conveniently, the input operands may be stored at input memory cells labelled 0 to 127 in the present example. Similarly, the output operands 0 to 127 may be written to output memory cells labelled 0 to 127. In other words, the I-th input operand (I=0 to 127) may be provided at the I-th input memory cell. The I-th output operand (I=0 to 127) may be written to the I-th output memory cell.
As noted above, the partitioning of the set of input operands into subsets corresponding to butterflies may, in general, be different for different rounds of the FFT. The butterflies of a given round may be executed independently of one another, either sequentially or in parallel. In the example of
The present invention provides a processing device and method for performing a round of a Fast Fourier Transform as described in the accompanying claims.
Specific embodiments of the invention are set forth in the dependent claims.
These and other aspects of the invention will be apparent from and elucidated with reference to the embodiments described hereinafter.
Further details, aspects and embodiments of the invention will be described, by way of example only, with reference to the drawings. In the drawings, like reference numbers are used to identify like or functionally similar elements. Elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale.
Because the illustrated embodiments of the present invention may for the most part, be implemented using electronic components and circuits known to those skilled in the art, details will not be explained in any greater extent than that considered necessary as illustrated above, for the understanding and appreciation of the underlying concepts of the present invention and in order not to obfuscate or distract from the teachings of the present invention.
Referring now to the diagram on the right side of
An alternative radix execution order, i.e., an order in which the butterflies may be executed, is indicated in
For example, in a first clock cycle, the first column in
For instance, in the case of N=64, the input operands may be required in the order illustrated by
An example of a scheme for determining the required order of input operands for different values of N is indicated in
Each input operand may have a certain real or, more generally, complex value. In the shown example, the 128 input operands are arranged in a first matrix M1 and a second matrix M2. M1 may comprise, for example, the input operands 0 to 7, 32 to 39 . . . 104 to 111. M2 may comprise, e.g., the input operands 16 to 23, 48 to 55 . . . 120 to 127. The matrices containing the input operands may also be referred to as the input operand matrices. Each input operand matrix may be arranged such that each of its lines may be read as a single block from, e.g., a memory unit. The memory unit may, for example, be a Static Random Access Memory (SRAM) unit. For instance, when the input operands are stored in the memory unit at consecutive locations in accordance with their numbering, each line of each input operand matrix may contain a sequence of consecutive input operands, as shown in the Figure. Thus, each of the eight lines of matrix M1 may be read in, e.g., a single clock cycle. The same may apply analogously to the second matrix M2. In the present example, each of the two matrices M1 and M2 may thus be read in, e.g., a total of eight clock cycles. Conveniently, each column of each of the matrices contains the input operands required as input data for a certain clock cycle of these eight clock cycles. Comparing
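A minimal sketch of this access pattern is given below. It assumes the operands are stored consecutively in a memory array and uses hypothetical helper names (read_line, load_matrix, column); the actual line addresses of each matrix follow the partitioning of the figure and are not reproduced here.

```python
def read_line(memory, line_address, width):
    # One block read: 'width' (= M*P, here 8) consecutive input operands
    # starting at 'line_address'; in hardware this may take one clock cycle.
    return list(memory[line_address:line_address + width])

def load_matrix(memory, line_addresses, width):
    # Assemble one input operand matrix from its line addresses, one line
    # (and, e.g., one clock cycle) at a time.
    return [read_line(memory, a, width) for a in line_addresses]

def column(matrix, j):
    # The j-th column collects the operands consumed in one processing cycle.
    return [line[j] for line in matrix]
```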
Conveniently, each of the input operand matrices is a square matrix, i.e., a matrix that has as many columns as lines. Reading a single line may take one clock cycle. Furthermore, processing a single column, i.e., computing the corresponding column of output operands, may also take a single clock cycle. For example, reading a set of, e.g., eight operands from the memory unit, e.g., an SRAM unit (see
The matrices may thus be processed efficiently in an overlapping or interlaced manner. Notably, when a first matrix, e.g., M1, has been read from an input operand memory unit and been buffered, the columns of the matrix may be transformed one by one with, e.g., one column per clock cycle. At the same time, the lines of the next matrix, e.g., M2, may be read from the input operand memory unit and buffered. Accordingly, the transformation of the I-th column of a given operand matrix, e.g., M1, and the retrieval of the I-th line of the next operand matrix, e.g., M2, from the input operand memory unit may be effected in parallel, e.g., within a single clock cycle.
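The following sketch models this overlap under simplifying assumptions: the matrices are held as Python lists of lines, a user-supplied butterfly_column callable stands in for the radix P execution, and the initial fill latency of the first matrix is not modelled.

```python
def interlaced_round(matrices, butterfly_column):
    # While column i of the buffered matrix is transformed, line i of the next
    # matrix is fetched, so after the first matrix has been loaded the reading
    # and the processing proceed fully in parallel.
    width = len(matrices[0])                      # M*P lines == M*P columns
    transformed_columns = []
    buffered = [list(line) for line in matrices[0]]
    for nxt in list(matrices[1:]) + [None]:
        incoming = [None] * width
        for i in range(width):
            col = [buffered[r][i] for r in range(width)]   # read column i
            transformed_columns.append(butterfly_column(col))
            if nxt is not None:
                incoming[i] = list(nxt[i])                 # read line i of next matrix
        if nxt is not None:
            buffered = incoming
    return transformed_columns
```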
The transformed matrices may be written to an output buffer. It is noted that when an operand matrix has been transformed, it may be replaced by the second-next matrix (in a scenario in which there are more than two matrices). For example, the matrices may be read, buffered, and processed in accordance with the following scheme with input operand matrices M1, M2, M3, M4: buffer the matrix M1 in an input buffer A; process M1 and, at the same time, buffer M2 in an input buffer B; process M2 and, at the same time, buffer M3 in input buffer A; process M3 and, at the same time, buffer M4 in input buffer B; and so on. It is noted that the total number of input operand matrices may depend on the total number N of operands, the radix P, and the number of butterflies that are executed in parallel. The input operand matrices may thus be buffered by alternating between the two buffers.
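A scheduling sketch of this two-buffer variant is given below (illustrative only; the buffer names A and B are taken from the scheme above).

```python
def ping_pong_schedule(num_matrices):
    # Matrix k+1 is loaded into one input buffer while matrix k is processed
    # out of the other, alternating between buffers A and B.
    steps = []
    for k in range(num_matrices):
        load_buf = "A" if k % 2 == 0 else "B"
        step = f"load M{k + 1} into buffer {load_buf}"
        if k > 0:
            step += f", process M{k} from buffer {'B' if k % 2 == 0 else 'A'}"
        steps.append(step)
    steps.append(f"process M{num_matrices} from buffer "
                 f"{'A' if (num_matrices - 1) % 2 == 0 else 'B'}")
    return steps

# ping_pong_schedule(4) yields, step by step:
#   load M1 into buffer A
#   load M2 into buffer B, process M1 from buffer A
#   load M3 into buffer A, process M2 from buffer B
#   load M4 into buffer B, process M3 from buffer A
#   process M4 from buffer B
```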
In another example, a single input buffer may be used. The size of the single input buffer should match the size of a single operand matrix (but the buffer may, of course, be integrated in a larger buffer not further considered herein). The input buffer may be represented as a matrix (referred to herein as the buffer matrix) of the same dimension as the input operand matrices. The M*P lines and M*P columns of the buffer matrix may be referred to as the buffer lines and the buffer columns, respectively. A first operand matrix may be written to the buffer matrix by filling buffer lines with lines of the first input operand matrix. The first operand matrix may then be read, column by column, from the respective buffer columns. When a column of the first operand matrix has been read from the corresponding buffer column in order to be further processed, the respective buffer column may be filled with a line of the next (the second) input operand matrix. The second input operand matrix may thus be written to the buffer matrix by filling buffer columns (not buffer lines) successively with lines of the second input operand matrix. The next (i.e. the third) input operand matrix may again be written to the input buffer in the same manner as the first input operand matrix, namely by writing lines of the third input operand matrix to corresponding buffer lines (not buffer columns). Successive input operand matrices may thus be written to the input buffer one after the other by adapting the buffer write direction, i.e. either vertical (columnwise) or horizontal (linewise), to the buffer read direction of the respective preceding input operand matrix. This alternating scheme makes good use of the memory space provided by the input buffer and may avoid the need for a second input buffer.
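The single-buffer scheme may be sketched as follows (illustrative only; the class name SingleInputBuffer is chosen here for the sketch): matrix k is written along the direction in which matrix k-1 is read, so every buffer column or line freed by a read can be refilled immediately.

```python
class SingleInputBuffer:
    # Sketch of the alternating write-direction scheme described above.
    def __init__(self, size):
        self.size = size                                   # size = M*P
        self.cells = [[None] * size for _ in range(size)]  # the buffer matrix

    def write_line(self, k, index, line):
        # Line 'index' of the k-th input operand matrix (k = 0, 1, 2, ...):
        # even k fills buffer line 'index', odd k fills buffer column 'index'.
        if k % 2 == 0:
            self.cells[index] = list(line)
        else:
            for r in range(self.size):
                self.cells[r][index] = line[r]

    def read_column(self, k, index):
        # Column 'index' of the k-th matrix sits in buffer column 'index' for
        # even k and in buffer line 'index' for odd k.
        if k % 2 == 0:
            return [self.cells[r][index] for r in range(self.size)]
        return list(self.cells[index])
```

Reading column i of matrix k and writing line i of matrix k+1 then address the same buffer column (or line), so each write directly reuses the locations just freed by the preceding read.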
At time t0, the input buffer may be empty or contain data from, e.g., a previous round of the FFT (see
By time t4, the first, second, third, and fourth lines of the first input operand matrix M1 have been written to corresponding lines of the input buffer (see
At time t4, the first column (M1_11, M1_21, M1_31, M1_41)^T of the first operand matrix M1 may be read from the input buffer and processed, e.g., fed to a radix-4 execution unit. At time t5, the second column (M1_12, M1_22, M1_32, M1_42)^T of the first operand matrix M1 may be read from the buffer and processed, e.g., fed to the radix-4 execution unit. At time t6, the third column (M1_13, M1_23, M1_33, M1_43)^T of the first operand matrix M1 may be read from the buffer and processed, e.g., fed to the radix-4 execution unit. At time t7, the fourth column (M1_14, M1_24, M1_34, M1_44)^T of the first operand matrix M1 may be read from the buffer and processed, e.g., fed to the radix-4 execution unit.
At time t8, the first column (M2_11, M2_21, M2_31, M2_41)^T of the second operand matrix M2 may be read from the input buffer and processed, e.g., fed to the radix-4 execution unit. At time t9, the second column (M2_12, M2_22, M2_32, M2_42)^T of the second operand matrix M2 may be read from the buffer and processed, e.g., fed to the radix-4 execution unit. At time t10, the third column (M2_13, M2_23, M2_33, M2_43)^T of the second operand matrix M2 may be read from the buffer and processed, e.g., fed to the radix-4 execution unit. At time t11, the fourth column (M2_14, M2_24, M2_34, M2_44)^T of the second operand matrix M2 may be read from the buffer and processed, e.g., fed to the radix-4 execution unit.
At time t12, the first column (M3_11, M3_21, M3_31, M3_41)^T of the third operand matrix M3 may be read from the input buffer and processed, e.g., fed to the radix-4 execution unit. At time t13, the second column (M3_12, M3_22, M3_32, M3_42)^T of the third operand matrix M3 may be read from the buffer and processed, e.g., fed to the radix-4 execution unit. At time t14, the third column (M3_13, M3_23, M3_33, M3_43)^T of the third operand matrix M3 may be read from the buffer and processed, e.g., fed to the radix-4 execution unit. At time t15, the fourth column (M3_14, M3_24, M3_34, M3_44)^T of the third operand matrix M3 may be read from the buffer and processed, e.g., fed to the radix-4 execution unit.
At time t16, the first column (M4_11, M4_21, M4_31, M4_41)^T of the fourth operand matrix M4 may be read from the input buffer and processed, e.g., fed to the radix-4 execution unit. At time t17 (not shown), the second column (M4_12, M4_22, M4_32, M4_42)^T of the fourth operand matrix M4 may be read from the buffer and processed, e.g., fed to the radix-4 execution unit. At time t18 (not shown), the third column (M4_13, M4_23, M4_33, M4_43)^T of the fourth operand matrix M4 may be read from the buffer and processed, e.g., fed to the radix-4 execution unit. At time t19 (not shown), the fourth column (M4_14, M4_24, M4_34, M4_44)^T of the fourth operand matrix M4 may be read from the buffer and processed, e.g., fed to the radix-4 execution unit.
Considering that M radix-P butterflies are executed in parallel, wherein M is a natural number greater than or equal to 1, each column of each input operand matrix may contain M times P input operands. Each of the input operand matrices may thus have M times P lines and M times P columns. Accordingly, the set of input operands may be partitioned into a total of N/(M*P)^2 input operand matrices. The caret, i.e., the symbol “^”, means “to the power of”. In the example shown in
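As a worked example, the values N = 64, P = 4, M = 1 are assumed below so as to match the 4x4 matrices M1 to M4 of the timing example above.

```python
N, P, M = 64, 4, 1                 # assumed values, matching the 4x4 example
width = M * P                      # each matrix has M*P = 4 lines and 4 columns
num_matrices = N // width ** 2     # 64 // 16 = 4 input operand matrices (M1 to M4)
```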
Referring now to
Each group of M*P input operands, e.g., stored under a single address in the input operand memory unit 12, may form a single line of one of the input operand matrices described above. Each line of each input operand matrix may thus be available as an addressable group of input operands in the input operand memory unit 12. When a complete input operand matrix has been buffered in the input buffer 18, it may be transformed into a corresponding output operand matrix by one or more radix P butterflies. These butterflies may be effected in parallel. For instance, in the shown example, there are two radix P operation units 28 and 30. The radix P may, for example, be 2, 4, or 8, or any other possible radix. The radix P operation units 28 and 30 may be identical. The first radix P operation unit 28 may be arranged to effect a first radix P butterfly on a first subset of input operands in the current column of the input operand matrix available in the input buffer 18. The second radix P operation unit 30 may, at the same time, effect the same radix P butterfly on a second subset of input operands of the same column of the input operand matrix available in the input buffer 18. In a variant of the shown device 10, the radix P operation units 28 and 30 may be substituted by a single radix P operation unit or by more than two radix P operation units.
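The parallel operation of the radix P operation units on one matrix column may be sketched as follows (a sketch only; butterfly stands in for the radix P transformation and the function name transform_column is chosen for illustration).

```python
def transform_column(col, M, P, butterfly):
    # The M*P operands of one column are split into M disjoint groups of P
    # operands; each group is handled by its own radix P operation unit
    # (e.g., units 28 and 30 for M = 2), all within the same clock cycle.
    assert len(col) == M * P
    out = []
    for m in range(M):
        out.extend(butterfly(col[m * P:(m + 1) * P]))
    return out
```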
Each input operand matrix may thus be read line by line from the input operand memory unit 12 and transformed column by column by means of the one or more radix P operation units, e.g., the radix P operation units 28 and 30. Each column of the input operand matrix may notably be transformed within a single clock cycle. At the same time, i.e., within the same clock cycle, a line of the next input operand matrix may be read from the input operand memory unit 12.
Each transformed column of the input operand matrix may be written as an output operand column into the output buffer 20. The output operand matrix may thus be collected in the output buffer 20. When a complete output operand matrix has been collected, e.g., column by column, in the output buffer 20, the output operand matrix may be written, e.g., line by line, to the output operand memory unit 14.
The above-described operations may be repeated similarly for each input operand matrix. In the present example, each line of the respective output operand matrix may be written to an address of the output operand memory unit 14 generated by a bit reversal operation from the original input operand memory address. In other words, a line of M*P input operands read from an input address characterizing a location in the input operand memory unit 12 may be transformed into a corresponding line of M*P output operands and saved to a location in the output operand memory unit 14 specified by a write address that is the bit-reversed input address. As described above, each line of input operands is not transformed individually but as part of a square input operand matrix, wherein the input operand matrix may be transformed column by column. The write addresses, i.e., the bit-reversed read addresses, may be generated from the corresponding read addresses by means of the bit reversal unit 22. The constant coefficients required for each radix P butterfly may be stored in the coefficient memory unit 16 and read therefrom by the radix P operation units 28 and 30, for example. The various read and write operations in the processing device 10 may be controlled at least in part by the control unit 26.
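For illustration, a bit reversal of the kind performed by the bit reversal unit 22 may be sketched as follows (the number of address bits depends on the number of line addresses in the respective memory unit and is passed in as num_bits here).

```python
def bit_reverse(address, num_bits):
    # Reverse the num_bits least significant bits of 'address'; the result may
    # serve as the write address in the output operand memory unit 14.
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (address & 1)
        address >>= 1
    return result

# Example: with 4 address bits, bit_reverse(0b0010, 4) == 0b0100.
```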
An example of the proposed processing scheme is further described in reference to
It is noted that the present example of N=16 may be of little practical interest and is described here mainly for the purpose of illustrating the general principle, which is applicable also for larger values of N, e.g., for N>=128.
An example of performing a round of an FFT is described in reference to the flow chart shown in
Turning back to
If the input operand matrix read in block S1 is not the last matrix of the above-mentioned sequence of input operand matrices, the operations of block S1 may be repeated for the next input operand matrix (blocks S1, S3). Otherwise, i.e., when the last input operand matrix has been read from the input operand memory unit, buffered, and processed in block S2, the current round of the FFT may end (block S4). Block S2 for a certain matrix and block S1 for the next input operand matrix may be executed in parallel.
Referring now to
The self-explanatory flowchart shown in
The invention may also be implemented in a computer program for running on a computer system, at least including code portions for performing steps of a method according to the invention when run on a programmable apparatus, such as a computer system, or enabling a programmable apparatus to perform functions of a device or system according to the invention.
A computer program is a list of instructions such as a particular application program and/or an operating system. The computer program may for instance include one or more of: a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, a shared library/dynamic load library and/or other sequence of instructions designed for execution on a computer system.
The computer program may be stored internally on a computer readable storage medium or transmitted to the computer system via a computer readable transmission medium. All or some of the computer program may be provided on transitory or non-transitory computer readable media permanently, removably or remotely coupled to an information processing system. The computer readable media may include, for example and without limitation, any number of the following: magnetic storage media including disk and tape storage media; optical storage media such as compact disk media (e.g., CD-ROM, CD-R, etc.) and digital video disk storage media; nonvolatile memory storage media including semiconductor-based memory units such as FLASH memory, EEPROM, EPROM, ROM; ferromagnetic digital memories; MRAM; volatile storage media including registers, buffers or caches, main memory, RAM, etc.; and data transmission media including computer networks, point-to-point telecommunication equipment, and carrier wave transmission media, just to name a few.
A computer process typically includes an executing (running) program or portion of a program, current program values and state information, and the resources used by the operating system to manage the execution of the process. An operating system (OS) is the software that manages the sharing of the resources of a computer and provides programmers with an interface used to access those resources. An operating system processes system data and user input, and responds by allocating and managing tasks and internal system resources as a service to users and programs of the system.
The computer system may for instance include at least one processing unit, associated memory and a number of input/output (I/O) devices. When executing the computer program, the computer system processes information according to the computer program and produces resultant output information via I/O devices.
In the foregoing specification, the invention has been described with reference to specific examples of embodiments of the invention. It will, however, be evident that various modifications and changes may be made therein without departing from the broader spirit and scope of the invention as set forth in the appended claims.
Those skilled in the art will recognize that the boundaries between logic blocks are merely illustrative and that alternative embodiments may merge logic blocks or circuit elements or impose an alternate decomposition of functionality upon various logic blocks or circuit elements. Thus, it is to be understood that the architectures depicted herein are merely exemplary, and that in fact many other architectures can be implemented which achieve the same functionality. For example, the radix operation units 28 and 30 may be merged. The units 22 and 24 may be integrated in the control unit 26.
Any arrangement of components to achieve the same functionality is effectively “associated” such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality can be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or intermedial components. Likewise, any two components so associated can also be viewed as being “operably connected,” or “operably coupled,” to each other to achieve the desired functionality.
Furthermore, those skilled in the art will recognize that boundaries between the above-described operations are merely illustrative. Multiple operations may be combined into a single operation, a single operation may be distributed in additional operations, and operations may be executed at least partially overlapping in time. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments.
Also for example, in one embodiment, the illustrated examples may be implemented as circuitry located on a single integrated circuit (IC) or within the same device. For example, device 10 may be a single IC. Alternatively, the examples may be implemented as any number of separate integrated circuits or separate devices interconnected with each other in a suitable manner. For example, the units 12, 14, 16, 18, 20, 22, 24, 26, 28, and 30 may be dispersed across more than one IC.
Also for example, the examples, or portions thereof, may be implemented as soft or code representations of physical circuitry or of logical representations convertible into physical circuitry, such as in a hardware description language of any appropriate type.
Also, the invention is not limited to physical devices or units implemented in non-programmable hardware but can also be applied in programmable devices or units able to perform the desired device functions by operating in accordance with suitable program code, such as mainframes, minicomputers, servers, workstations, personal computers, notepads, personal digital assistants, electronic games, automotive and other embedded systems, cell phones and various other wireless devices, commonly denoted in this application as ‘computer systems’.
However, other modifications, variations and alternatives are also possible. The specifications and drawings are, accordingly, to be regarded in an illustrative rather than in a restrictive sense.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word ‘comprising’ does not exclude the presence of other elements or steps than those listed in a claim. Furthermore, the terms “a” or “an,” as used herein, are defined as one or more than one. Also, the use of introductory phrases such as “at least one” and “one or more” in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an.” The same holds true for the use of definite articles. Unless stated otherwise, terms such as “first” and “second” are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements. The mere fact that certain measures are recited in mutually different claims does not indicate that a combination of these measures cannot be used to advantage.