A digital implementation of a recursive system typically includes a processing pipeline with a feedforward path and a feedback path. The length of the processing pipeline and, in particular, the latency of the feedback path, are the fundamental factors limiting the throughput of the recursive system.
A digital accumulator block represents a simple example of such a recursive system. In an accumulator block, an input number must be added to the result of a previous addition. In practice, however, it often takes more than a single clock cycle to perform each addition operation (especially for floating-point numbers). As a result, the expected result from the prior addition may not yet be ready in the next cycle, and the simple accumulated result will be erroneous.
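The failure mode described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the function name `pipelined_accumulate` and the cycle model (one add issued per clock cycle, each result becoming visible `latency` cycles later) are hypothetical and are not taken from the embodiments.

```python
def pipelined_accumulate(samples, latency):
    """Behavioral model (assumed): one add is issued per clock cycle, and the
    result of an add only becomes available `latency` cycles after issue."""
    results = []
    for t, x in enumerate(samples):
        # The only feedback value available at cycle t is the result of the
        # add issued `latency` cycles earlier (or 0 before the pipe fills).
        fed_back = results[t - latency] if t >= latency else 0
        results.append(x + fed_back)
    return results


samples = [1, 2, 3, 4, 5, 6]

# Single-cycle adder: the running sum is correct.
single_cycle = pipelined_accumulate(samples, latency=1)

# Three-cycle adder: each output only sums every third sample.
three_cycle = pipelined_accumulate(samples, latency=3)
```

With a latency of one cycle the last output is the full sum of the inputs. With a latency of three cycles the outputs are three unrelated partial sums, so the last output no longer equals the accumulated total.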
It is within such context that the embodiments described herein arise.
The present embodiments relate to a recursive system with a feedback path having a feedback latency. The recursive system may be provided with pre-processing circuitry configured to decompose a single input data stream into corresponding principal and independent components using a predetermined mathematical transformation. The recursive system may further be provided with post-processing circuitry configured to perform the inverse mathematical transformation (i.e., the inverse operation of the predetermined mathematical transform), which converts the decomposed samples back to the original domain.
A recursive system configured in this way provides a tangible technical improvement to computer technology by allowing the iterative system to support situations where the input streams cannot be presented as independent channels while ensuring that the system maintains maximum throughput by hiding any inefficiency due to the feedback latency. It will be recognized by one skilled in the art that the present exemplary embodiments may be practiced without some or all of these specific details. In other instances, well-known operations have not been described in detail in order not to unnecessarily obscure the present embodiments.
Recursive systems (sometimes referred to as “iterative” systems) are often implemented on a programmable integrated circuit.
Functional blocks such as LABs 11 may include smaller programmable regions (e.g., logic elements, configurable logic blocks, or adaptive logic modules) that receive input signals and perform custom functions on the input signals to produce output signals. Device 10 may further include programmable routing fabric that is used to interconnect LABs 11 with RAM blocks 13 and DSP blocks 12. The combination of the programmable logic and routing fabric is sometimes referred to as “soft” logic, whereas the DSP blocks are sometimes referred to as “hard” logic. The type of hard logic on device 10 is not limited to DSP blocks and may include other types of hard logic. Adders/subtractors, multipliers, dot product computation circuits, and other arithmetic circuits which may or may not be formed as part of a DSP block 12 may sometimes be referred to collectively as arithmetic logic.
Programmable logic device 10 may contain programmable memory elements for configuring the soft logic. Memory elements may be loaded with configuration data (also called programming data) using input/output elements (IOEs) 16. Once loaded, the memory elements provide corresponding static control signals that control the operation of one or more LABs 11, programmable routing fabric, and optionally DSPs 12 or RAMs 13. In a typical scenario, the outputs of the loaded memory elements are applied to the gates of metal-oxide-semiconductor transistors (e.g., pass transistors) to turn certain transistors on or off and thereby configure the logic in the functional block including the routing paths.
Programmable logic circuit elements that may be controlled in this way include parts of multiplexers (e.g., multiplexers used for forming routing paths in interconnect circuits), look-up tables, logic arrays, AND, OR, NAND, and NOR logic gates, pass gates, etc. The logic gates and multiplexers that are part of the soft logic, configurable state machines, or any general logic component not having a single dedicated purpose on device 10 may be referred to collectively as “random logic.”
The memory elements may use any suitable volatile and/or non-volatile memory structures such as random-access-memory (RAM) cells, fuses, antifuses, programmable read-only-memory memory cells, mask-programmed and laser-programmed structures, mechanical memory devices (e.g., including localized mechanical resonators), mechanically operated RAM (MORAM), programmable metallization cells (PMCs), conductive-bridging RAM (CBRAM), resistive memory elements, combinations of these structures, etc. Because the memory elements are loaded with configuration data during programming, the memory elements are sometimes referred to as configuration memory, configuration RAM (CRAM), configuration memory elements, or programmable memory elements.
In addition, programmable logic device 10 may use input/output elements (IOEs) 16 to drive signals off of device 10 and to receive signals from other devices. Input/output elements 16 may include parallel input/output circuitry, serial data transceiver circuitry, differential receiver and transmitter circuitry, or other circuitry used to connect one integrated circuit to another integrated circuit. As shown, input/output elements 16 may be located around the periphery of the chip. If desired, the programmable logic device may have input/output elements 16 arranged in different ways.
The routing fabric (sometimes referred to as programmable interconnect circuitry) on PLD 10 may be provided in the form of vertical routing channels 14 (i.e., interconnects formed along a vertical axis of PLD 10) and horizontal routing channels 15 (i.e., interconnects formed along a horizontal axis of PLD 10), each routing channel including at least one track to route at least one wire. If desired, routing wires may be shorter than the entire length of the routing channel. A length L wire may span L functional blocks. For example, a length four wire may span four functional blocks. Length four wires in a horizontal routing channel may be referred to as “H4” wires, whereas length four wires in a vertical routing channel may be referred to as “V4” wires.
Furthermore, it should be understood that the present embodiments may be implemented in any integrated circuit.
As shown in
In general, adaptive beamforming may be applied to a variety of applications, such as military applications of sonar and radar, wireless communication in commercial networks, radio applications, acoustic noise cancelling applications, microphone array speech processing, etc. The examples above in which recursive circuits are used to support adaptive beamforming applications are merely illustrative and are not meant to limit the scope of the present embodiments. In general, recursive circuits 202 may be included in any type of data computing system.
One example of a recursive circuit is a systolic array.
In the example of
One conventional way to address the feedback latency problem in iterative systems is to decrease the input throughput by a factor equal to the feedback path latency L_feedback. In other words, the user will throttle the input speed by a factor of L_feedback. This technique will produce correct system behavior, but the throughput of such a system will be reduced by the input throttling ratio. In systems with feedback paths having tens of clock cycles of latency, the resulting decrease in system throughput will be more than 10×, making the resulting system highly inefficient.
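The cost of input throttling can be sketched with a simple cycle count. The function name `throttled_accumulate` and the assumption that a new input is accepted only after the previous add has completed are illustrative and hypothetical.

```python
def throttled_accumulate(samples, feedback_latency):
    """Assumed model: a new sample is accepted only once the previous add,
    which takes `feedback_latency` cycles, has completed."""
    acc = 0
    cycles = 0
    for x in samples:
        acc += x                    # result is correct because we waited
        cycles += feedback_latency  # cycles consumed per accepted sample
    return acc, cycles


samples = [1, 2, 3, 4, 5, 6]
total, cycles = throttled_accumulate(samples, feedback_latency=3)
throughput = len(samples) / cycles  # samples accepted per clock cycle
```

In this model the result is correct, but only one sample is accepted every L_feedback cycles, so the throughput is 1/L_feedback of the unthrottled rate.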
Another conventional way to solve the feedback latency problem is to stagger different independent channels along the L_feedback clock cycles (i.e., to interleave the independent channel inputs over time). This technique requires the iterative system to present data patterns with N completely independent data streams, where N must be greater than or equal to L_feedback for maximum efficiency. This technique is, however, application dependent and requires a multichannel configuration with N independent input channels. Unfortunately, many applications simply lack the concept or nature of multi-channel input data streams.
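Channel interleaving can be sketched as follows, assuming N = L_feedback independent channels and the same hypothetical cycle model as before (one add issued per cycle, result visible `latency` cycles later). The helper names `interleave` and `pipelined_accumulate` are illustrative only.

```python
def pipelined_accumulate(samples, latency):
    """Assumed model: the result of an add is visible `latency` cycles later."""
    results = []
    for t, x in enumerate(samples):
        fed_back = results[t - latency] if t >= latency else 0
        results.append(x + fed_back)
    return results


def interleave(channels):
    """Round-robin the channels into one stream: c0[0], c1[0], ..., c0[1], ..."""
    return [x for frame in zip(*channels) for x in frame]


channels = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
n = len(channels)  # N independent channels, here equal to the latency

stream = interleave(channels)
results = pipelined_accumulate(stream, latency=n)

# De-interleave: every N-th output belongs to the same channel.
per_channel = [results[c::n] for c in range(n)]
```

Because consecutive samples of the same channel are exactly N cycles apart, each add finds its own channel's previous result at the feedback input, and the adder accepts a new sample every cycle.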
Yet another traditional way of solving the feedback latency problem involves unfolding the processing pipeline by a degree proportional to L_feedback. In the classical example of a floating-point accumulator, the system outputs partial sums into a shift register, and the parallel outputs of the shift register are summed in parallel using an explicit adder tree. This technique, however, can only be implemented in very simple iterative circuits. It is very challenging to convert complex iterative systems into such an unfolded structure. Even in simple accumulator functions, the shift register and adder tree structure result in a large compute footprint, let alone more complex operations like division, trigonometric functions, or square roots, which would make such an implementation practically infeasible.
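The unfolded accumulator can be sketched as L_feedback partial sums followed by an adder tree. This is a software analogy under assumptions: the list `partial` stands in for the shift register, and the names `unfolded_accumulate` and `adder_tree` are hypothetical.

```python
def adder_tree(values):
    """Sum values pairwise, level by level, as an explicit adder tree would."""
    while len(values) > 1:
        if len(values) % 2:
            values = values + [0]  # pad odd levels with a zero input
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0]


def unfolded_accumulate(samples, feedback_latency):
    """Keep one partial sum per pipeline slot, then combine them in a tree."""
    partial = [0] * feedback_latency  # stands in for the shift register
    for t, x in enumerate(samples):
        partial[t % feedback_latency] += x
    return adder_tree(partial)


total = unfolded_accumulate([1, 2, 3, 4, 5, 6], feedback_latency=3)
```

The sketch works only because addition is associative and commutative, which is why the approach does not carry over easily to more complex recursive operations.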
In accordance with an embodiment,
As shown in
Pre-processing circuit 410 may be configured to decompose, break, or channelize the received input data stream into independent/orthogonal components using some predetermined mathematical transformation algorithm. In the example of
Post-processing circuit 420 may be configured to receive the processed components S[i] from circuit 402 and to recombine the processed components back to the original input domain using an inverse mathematical transformation algorithm. In other words, the output data stream recombination circuit 420 may perform a corresponding inverse transform of the initial transformation provided by the input data stream decomposition circuit 410. Circuit 420 may transform or convert the processed components into a corresponding output data stream {Y[1], Y[2], Y[3], . . . , Y[L], . . . }, where the Y[i] components within a group of size L are no longer completely independent/orthogonal components.
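The relationship between the decomposition and recombination stages can be sketched with a direct discrete Fourier transform pair. This is an illustrative assumption: the functions `dft` and `idft` below are hypothetical software stand-ins for circuits 410 and 420, written with the standard library only, and a hardware design would more likely use an FFT structure.

```python
import cmath


def dft(frame):
    """Decompose one group of samples into spectral components."""
    n = len(frame)
    return [
        sum(frame[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
        for m in range(n)
    ]


def idft(bins):
    """Recombine spectral components back into the original domain."""
    n = len(bins)
    return [
        sum(bins[m] * cmath.exp(2j * cmath.pi * k * m / n) for m in range(n)) / n
        for k in range(n)
    ]


frame = [1.0, 2.0, 3.0, 4.0]
components = dft(frame)       # stand-in for pre-processing circuit 410
recovered = idft(components)  # stand-in for post-processing circuit 420
```

With no processing between the two stages, the recombined samples match the input group up to floating-point rounding, which is the inverse relationship the paragraph above describes.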
Conversely, the iFFT circuit 420 may operate as an inverse data stream channelizer (e.g., a synthesis filter bank circuit) that reconstructs the wideband signal by recombining the individual spectral components. The use of an analysis/synthesis filter bank pair can help achieve higher quality of output by reducing the energy leakage between adjacent spectral bins. Such spectral decomposition and recombination is illustrated in
The FFT/iFFT transformation may require the group length L to be a power of two (i.e., radix-2). Thus, when using FFT/iFFT transforms, the group length L (sometimes referred to as frame length) should be configured based on the following expression:
$$L \geq 2^{\lceil \log_2 L_{\text{feedback}} \rceil} \qquad (1)$$
where L_feedback represents the latency of the feedback path. Ideally, the group size L will be equal to the expression above for maximum throughput. In certain embodiments, one or more registers (see, e.g., registers 405 in
The example of
At step 702, the orthogonal transformation circuit 410 (e.g., an FFT circuit, a DFT circuit, a wavelet transform circuit, etc.) may be used to decompose or channelize the input data stream into independent components.
At step 704, the multi-channel processing circuit (e.g., an iterative processing circuit having at least one feedback path with a feedback latency) may be used to process the independent components received from the orthogonal transformation circuit 410.
At step 706, the inverse orthogonal transformation circuit 420 (e.g., an iFFT, an iDFT circuit, an inverse wavelet transform circuit, etc.) may be used to recombine the processed components output from the multi-channel processing circuit (e.g., to convert the processed channelized samples back to the original domain).
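Steps 702, 704, and 706 can be sketched end to end in software. Everything here is an assumption made for illustration: the recursive stage is modeled as a per-component accumulator whose feedback path has been padded to exactly L cycles, the group length is the smallest power of two that is at least the feedback latency, and the function names are hypothetical.

```python
import cmath


def group_length(feedback_latency):
    """Smallest power of two that is >= feedback_latency (assumes latency >= 1)."""
    return 1 << (feedback_latency - 1).bit_length()


def dft(frame):
    n = len(frame)
    return [sum(frame[k] * cmath.exp(-2j * cmath.pi * k * m / n)
                for k in range(n)) for m in range(n)]


def idft(bins):
    n = len(bins)
    return [sum(bins[m] * cmath.exp(2j * cmath.pi * k * m / n)
                for m in range(n)) / n for k in range(n)]


def recursive_stage(stream, latency):
    """Assumed model: accumulator whose feedback is visible `latency` cycles later."""
    results = []
    for t, x in enumerate(stream):
        fed_back = results[t - latency] if t >= latency else 0
        results.append(x + fed_back)
    return results


def process(samples, feedback_latency):
    size = group_length(feedback_latency)
    if len(samples) % size:
        raise ValueError("input length must be a multiple of the group length")
    # Step 702: decompose each group of samples into independent components.
    stream = []
    for i in range(0, len(samples), size):
        stream.extend(dft(samples[i:i + size]))
    # Step 704: recursive processing, one new component per cycle, no idle cycles.
    processed = recursive_stage(stream, latency=size)
    # Step 706: recombine each processed group back into the original domain.
    output = []
    for i in range(0, len(processed), size):
        output.extend(idft(processed[i:i + size]))
    return output


output = process([1, 2, 3, 4, 5, 6, 7, 8], feedback_latency=3)
```

In this sketch a feedback latency of three cycles gives a group length of four. Because both the transform and the accumulation are linear, the second output group equals the element-wise sum of the first two input groups, even though the single input stream has no independent channels of its own.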
Although the methods of operations are described in a specific order, it should be understood that other operations may be performed in between described operations, described operations may be adjusted so that they occur at slightly different times, or described operations may be distributed in a system which allows occurrence of the processing operations at various intervals associated with the processing, as long as the processing of the overlay operations is performed in a desired way.
The following examples pertain to further embodiments.
Example 1 is circuitry, comprising: an input configured to receive an input data stream; an orthogonal transformation circuit configured to receive the input data stream from the input and to decompose the input data stream into groups of independent components; and a processing circuit configured to receive the groups of independent components from the orthogonal transformation circuit.
Example 2 is the circuitry of example 1, wherein the processing circuit optionally has a feedback path with a feedback latency.
Example 3 is the circuitry of example 2, wherein the number of independent components in each of the groups is optionally a function of the feedback latency.
Example 4 is the circuitry of example 3, optionally further comprising: at least one register in the feedback path configured to balance the feedback latency with the number of independent components in each of the groups.
Example 5 is the circuitry of any one of examples 1-4, optionally further comprising: an inverse orthogonal transformation circuit configured to receive signals from the processing circuit and to recombine the signals into a corresponding output data stream.
Example 6 is the circuitry of example 5, wherein the orthogonal transformation circuit optionally comprises a fast Fourier transform (FFT) circuit.
Example 7 is the circuitry of example 6, wherein the inverse orthogonal transformation circuit optionally comprises an inverse fast Fourier transform (iFFT) circuit.
Example 8 is the circuitry of example 7, wherein the FFT circuit optionally comprises an analysis filter bank, and wherein the iFFT circuit optionally comprises a synthesis filter bank.
Example 9 is the circuitry of example 5, wherein the orthogonal transformation circuit optionally comprises a discrete Fourier transform circuit.
Example 10 is the circuitry of example 9, wherein the inverse orthogonal transformation circuit optionally comprises an inverse discrete Fourier transform circuit.
Example 11 is the circuitry of example 5, wherein the orthogonal transformation circuit optionally comprises a wavelet transform circuit, and wherein the inverse orthogonal transformation circuit optionally comprises an inverse wavelet transform circuit.
Example 12 is the circuitry of any one of examples 1-9, wherein the independent components generated by the orthogonal transformation circuit optionally comprise a plurality of independent spectral components in a frequency domain.
Example 13 is the circuitry of any one of examples 1-12, wherein there are no idle cycles at the processing circuit when processing the input data stream.
Example 14 is the circuitry of any one of examples 1-13, wherein the processing circuit optionally comprises a multi-channel processing circuit.
Example 15 is a method, comprising: receiving an input data stream; with an orthogonal transformation circuit, receiving the input data stream and decomposing the input data stream into a plurality of independent components; and with a processing circuit, receiving the plurality of independent components from the orthogonal transformation circuit and generating a plurality of processed components.
Example 16 is the method of example 15, wherein the input data stream optionally lacks independent streams of data.
Example 17 is the method of any one of examples 15-16, wherein the processing circuit optionally comprises a recursive circuit.
Example 18 is the method of any one of examples 15-17, optionally further comprising: with an inverse orthogonal transformation circuit, receiving the plurality of processed components from the processing circuit and recombining the plurality of processed components into a corresponding output data stream.
Example 19 is a system comprising: a recursive circuit having an input and an output; a pre-processing circuit that is coupled at the input of the recursive circuit and that is configured to channelize a wideband input signal into a plurality of independent spectral components; and a post-processing circuit that is coupled at the output of the recursive circuit and that is configured to reconstruct a wideband output signal based on the plurality of independent spectral components that have been processed by the recursive circuit.
Example 20 is the system of example 19, wherein the pre-processing circuit optionally comprises a transformation circuit selected from the group consisting of: a fast Fourier transform (FFT) circuit, a discrete Fourier transform circuit, and a wavelet transform circuit.
For instance, all optional features of the apparatus described above may also be implemented with respect to the method or process described herein. The foregoing is merely illustrative of the principles of this disclosure and various modifications can be made by those skilled in the art. The foregoing embodiments may be implemented individually or in any combination.