A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
In a typical system shown in
In a typical processor-based digital control loop for a plant, many inputs need to be processed, and possibly several outputs need to be generated.
In many control systems, designers simplify the design by sampling all analog input data from the plant at about the same time, and all with the same period between sampling a given input. The regular sampling ensures simpler and faster processing of the input data. Similarly, after all paths are processed and written to output storage, new output values are written to DACs or PWMs. The output storage is typically double buffered for each DAC or PWM, that is, a two-deep buffer is written at one location while the DACs and PWMs read from the other. When all new output value updates are completed, the DACs and PWMs are switched to read from the new values, and the previous set of DAC and PWM values then becomes available to be overwritten by the next new set of values, etc. Double buffering therefore can hide the order of processing each path within
Many applications require only linear processing operations, such as linear convolution (FIR filtering), multiplication (scaling), addition (offsets), and sometimes sine and cosine functions of sample time for the purposes of modulation and demodulation. Accordingly, there is a need for a special purpose and energy efficient programmable processor architecture that can nevertheless achieve high data throughput compared to a conventional DSP.
Some of the inventive principles of this patent disclosure relate to a special-purpose digital processor and controller, with the objective of trying to keep its central multiplier-accumulator (MAC) as fully utilized as possible. The controller may be externally programmed to execute a set of instructions within an A/D input sample period. All MAC data I/O may be stored in a dedicated and tightly coupled data memory, which may also take external data inputs, such as from the A/D converters. Multiple threads with very fast context-switching are supported in hardware in order to hide the pipeline delays inherent in MAC implementations, and thereby avoid write-before-read data hazards. The controller may have a stack memory for function calls, but in some embodiments, only for the purpose of pushing return addresses onto the stack. The processor may also support sine and cosine functions of sample time.
The hardware resources J2-J14 may include any type of hardware that may be useful for processing digital signals. Some examples include arithmetic units, delays, memories, multiplexers/demultiplexers, waveform generators, decoders/encoders, look-up tables, comparators, shift registers, latches, buffers, etc. The operation unit may include multiple instances of any of the hardware resources, which may be arranged individually, in functional groups, or in any other suitable arrangement.
Although the inventive principles are not limited to any specific arrangement, in some embodiments it may be particularly beneficial to include multiple memories J6, J10, J14 throughout the operation unit as shown in
The instruction generator J20 may be implemented in hardware, software, firmware or a hybrid combination. The instruction words J22 provided by the instruction generator may include any number of fields that define the actions of the operation unit J1. Examples of fields that may be included in the instruction words include control information, address information, coefficients, limits, etc.
The embodiment of
The second multiplexer R8 selects one of multiple sampled inputs from A/D converters R9, reference values R10 which may be provided, for example, by an external or supervisory microprocessor, or inputs from any other suitable input interface resources. The inputs to the second multiplexer R8 may be latched in input registers R11 to synchronize data transfers with tick events on timing signal R12.
A limit checking circuit R13 may be included to provide hardware limit checking on the MAC outputs based on limit data stored in Limit-data RAM memory R14. As with the H-data RAM memory, the Limit-data memory is pre-programmed by the external microprocessor prior to operation. During normal operation, the RAM is read-only, reading data at the same address as the write address to the X-data RAM R6, and essentially limiting the range of values that are allowed to be written at each X-data RAM memory location. The Limit-data RAM is split into two sets of data, upper limits, and lower limits, and each can be set separately by the external processor. A special lower and upper limit code combination (such as a lower limit being greater than an upper limit) can represent a “no limit” state, leaving the MAC output value unchanged if required.
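The limit behavior described above can be sketched in C as follows. This is a minimal illustration, not the disclosed circuit: the type `limit_entry`, the function `apply_limit`, the 32-bit data width, and the choice to clamp (rather than reject) an out-of-range value are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical Limit-data entry: one lower/upper pair per X-data address. */
typedef struct {
    int32_t lower;
    int32_t upper;
} limit_entry;

/* Returns the value that would be written to X-data memory.
 * A lower limit greater than the upper limit is treated as the
 * "no limit" code, so the MAC output passes through unchanged.
 * Clamping to the nearest limit is an assumption of this sketch. */
int32_t apply_limit(int32_t mac_out, limit_entry lim, bool *out_of_bounds)
{
    assert(out_of_bounds != NULL);
    *out_of_bounds = false;
    if (lim.lower > lim.upper) {
        return mac_out;                 /* "no limit" state */
    }
    if (mac_out > lim.upper) {
        *out_of_bounds = true;          /* could be reported to a supervisor */
        return lim.upper;
    }
    if (mac_out < lim.lower) {
        *out_of_bounds = true;
        return lim.lower;
    }
    return mac_out;
}
```

In this sketch the out-of-bounds flag is returned alongside the limited value, so a caller could use it to disable outputs or notify a supervisory processor.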
Outputs are taken from the MAC output, with or without limiting, and also applied to the inputs of a first set of registers R15. A second set of registers R16 may be included to synchronize the outputs with tick events on timing signal R12.
In typical operation, a set of data may be read from the input registers R11 on one tick event, processed during the interval between tick events and written to output register R15 as each becomes ready. The corresponding output data from R15 is then written into the output registers R16 on the next tick event, which simultaneously starts the processing of the next set of input data from R11, thereby forming a processing pipeline.
Typically, systems are designed to execute tens to hundreds of MAC instructions between each tick event. If tick periods are too long so that very large numbers of MAC instructions can be executed per tick period, then the system's minimum delay is increased, and its effectiveness in control loops becomes increasingly limited.
If too few MAC instructions can be executed per tick period, then some operations such as linear convolution could not be completed within a single tick period. Furthermore, more complex processing may require splitting a path into multiple paths. In this case, the paths may communicate the results of one path to the next path via X-data memory. The overhead of these extra X-data RAM accesses may become unacceptable.
The outputs from the output latches R16 may be applied to D/A converters, PWMs, or any other suitable output interface resources R17.
The processing unit R0 is controlled by a stream of MAC instruction words from the instruction generator R3. One type of information in an instruction word is an operand address to the H-data memory R2. Another is an operand address to the Limit-data RAM and X-data RAM. For example, if the processing unit is to implement a finite impulse response (FIR) filter, the filter coefficients may be read from the H-data memory through the instruction words, multiplied by the X-data from R6 at another address (via multiplexer R5), accumulated in the MAC, and the result written to another address in the X-data RAM (via limiter R13).
Control information may also be included in an instruction word. For example, the control information may instruct the first and second multiplexers R5 and R8 which inputs to use for an operation, it may instruct the MAC to begin a multiply-accumulate operation, it may instruct the processing unit where to direct the output from a MAC operation, etc.
A feature of the processing unit R0 is that it does not rely on conditional branch logic which is used in conventional systems for checking and decrementing loop counters, checking limits of arithmetic results, etc. Conditional branch logic typically reduces cycle efficiency in conventional systems because the MAC or other arithmetic logic unit (ALU) remains idle while branch instructions are executed in order to test the result of execution.
Instead of using branch logic, the processing unit R0 is fed a continuous stream of MAC instruction words from the generator R3 which handles any loop counting. For example, to implement a 5-tap FIR filter, the processing unit may be fed a continuous stream of five MAC instruction words. Each instruction specifies the source and destination of the data used for the MAC operation. After the fifth instruction is executed, the processing unit may proceed to the next set of instructions provided by the instruction generator. Thus, rather than spending time keeping track of loop iterations, the processing unit may continuously perform substantive signal processing at a high level of cycle utilization.
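As a rough software model of this idea, the following C sketch executes a pre-expanded stream of instruction words. The `mac_word` layout, the field names, and the function `run_stream` are hypothetical and chosen only for illustration; the `for` loop merely models words arriving one per cycle and does not represent loop counting inside the processing unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MAC instruction word: operand addresses plus control bits. */
typedef struct {
    uint16_t x_addr;     /* operand address in X-data memory */
    uint16_t h_addr;     /* operand address in H-data (coefficient) memory */
    uint16_t dest_addr;  /* X-data address that receives the result */
    uint8_t  clear_acc;  /* 1: begin a new accumulation */
    uint8_t  write_out;  /* 1: write the accumulator to dest_addr */
} mac_word;

/* Consumes a stream of instruction words. Each word carries its own
 * source and destination, so the consumer keeps no loop counter. */
void run_stream(const mac_word *miw, size_t count,
                int32_t *x_data, const int32_t *h_data)
{
    assert(miw != NULL && x_data != NULL && h_data != NULL);
    int64_t acc = 0;
    for (size_t i = 0; i < count; i++) {
        if (miw[i].clear_acc) {
            acc = 0;
        }
        acc += (int64_t)x_data[miw[i].x_addr] * h_data[miw[i].h_addr];
        if (miw[i].write_out) {
            x_data[miw[i].dest_addr] = (int32_t)acc;
        }
    }
}
```

A 5-tap FIR filter then corresponds to five consecutive words, the first with `clear_acc` set and the last with `write_out` set.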
The use of hardware limit checking may also improve cycle utilization. Rather than executing “compare and branch” instructions to check the limits of mathematical results, the outputs from the MAC may be checked in hardware on a cycle-by-cycle basis or at any other times using Limit-data that is provided in instruction words and stored in Limit-data memory R14. This may enable low or no overhead limit checking.
The hardware limit checking may enable the processing unit to immediately shut down the outputs and/or transfer control to a supervisory processor R18 upon detection of a parameter that is out of bounds.
The hardware limit checking may also enable the supervisory processor to monitor the system operation on a tick-by-tick or even a cycle-by-cycle basis to provide fast response to parameters that are out of bounds or other fault conditions. For example, the supervisory processor may disable the outputs, shut down a plant that is controlled by the processing unit, issue an alarm, send a warning message, or take any other suitable action.
Another feature of the processing unit R0 is the use of distributed memories. The X-data, H-data and Limit-data memories may enable simultaneous access by different hardware resources, thereby reducing cycle times. They may also be located physically close to the resources that utilize them, thereby reducing signal propagation delays. Moreover, the use of distributed memories may enable efficient context switching for multi-threading and other types of interleaved processes.
The embodiment of
The embodiments of
At tick t2, process K1(A1) is completed, and the result is applied to output W as an instance W1(K1) during interval T2. A second instance K2(A2) of process K is performed using input A2 during interval T2, and the result is applied as another instance W2(K2) of the output during interval T3. The method continues with additional instances of process K with each instance using an input obtained at the tick at the beginning of the process and output at the tick at the end of the process. Thus, during each time period between ticks, an input is obtained, a process is performed, and an output is provided in an interleaved manner.
An example of the process K is a scaling process where the input is multiplied by a fixed or variable scaling factor. Another example is an offset process where a fixed or variable offset is added to the input.
The embodiment of
Because process K uses more than one sample from an input for each iteration, it may leave cycles between process iterations during which resources may be available but unused. To achieve better cycle utilization, a second process or thread may be added as shown in the embodiment of
The instruction generator of
The intermediate instruction words IIW may include any number of different fields such as control, address, limit, and/or coefficient fields similar to those discussed above with respect to
In some embodiments, a first-in, first-out (FIFO) memory S7 may be included to help maintain a steady stream of instruction words out of the instruction generator while accommodating variations in the amount of time it takes the state machine to process different high level instructions. Some high level instructions such as calls, jumps and context setting instructions may not result in any instruction words being sent to the FIFO, in which case the FIFO occupancy may decrease. However, some instructions implement loop expansions as described below wherein one instruction is expanded into several instructions that are sent sequentially (one-by-one) to the processing unit. During loop expansions, no additional instruction words are read from the FIFO, while instructions may still be issued by the state machine S2, and therefore, the FIFO occupancy may increase.
A loop expansion unit S8 uses the stream of intermediate instruction words IIW to generate a stream of MAC instruction words (MIW) S10 that are applied to the processing unit. The loop expansion unit may include a hardware counter S9 that uses the loop-count field in IIW to determine the number of consecutive MAC instruction words MIW to send to the processing unit. For example, if an intermediate instruction word IIW includes an instruction to perform a FIR filter process, the loop-count field may be set to the number of taps included in the filter. For a 5-tap FIR filter, the loop-count field is set to five. At the beginning of the loop expansion operation, the loop-count field is loaded into the hardware counter S9 which keeps track of the number of MAC instruction words generated by the loop expansion unit. In the case of a 5-tap FIR filter, the hardware counter counts down each iteration until five MAC instruction words MIW have been generated.
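The loop expansion described above can be modeled in C roughly as follows. The structures `iiw_t` and `miw_t`, their field names, and the simple incrementing address pattern are assumptions made for this sketch; actual instruction words may carry additional fields and addressing modes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical intermediate instruction word (IIW). */
typedef struct {
    uint16_t x_src;       /* base address of input data */
    uint16_t h_src;       /* base address of coefficient data */
    uint16_t dest;        /* destination address for the result */
    uint16_t loop_count;  /* e.g. number of FIR taps */
} iiw_t;

/* Hypothetical MAC instruction word (MIW). */
typedef struct {
    uint16_t x_addr;
    uint16_t h_addr;
    uint16_t dest;
    uint8_t  first;       /* 1 on the first word of the expansion */
    uint8_t  last;        /* 1 on the final word (result is written) */
} miw_t;

/* Expands one IIW into loop_count consecutive MIWs, using a down
 * counter in the manner of hardware counter S9. Returns the number
 * of words generated. Linear address stepping is an assumption. */
size_t expand_loop(iiw_t iiw, miw_t *out, size_t out_cap)
{
    assert(out != NULL);
    uint16_t counter = iiw.loop_count;   /* loaded at start of expansion */
    size_t n = 0;
    while (counter > 0 && n < out_cap) {
        out[n].x_addr = (uint16_t)(iiw.x_src + n);
        out[n].h_addr = (uint16_t)(iiw.h_src + n);
        out[n].dest   = iiw.dest;
        out[n].first  = (n == 0);
        out[n].last   = (counter == 1);
        n++;
        counter--;                        /* counts down each iteration */
    }
    return n;
}
```

For a 5-tap FIR filter, an IIW with `loop_count` set to five yields five MIWs in this model.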
The instruction words may be implemented without flow control instructions, thereby eliminating feedback for MAC state information to the address generator. This may simplify the state machine and enable increased operating speeds.
A benefit of the inventive principles is that they may enable the system to set up the MAC unit to execute in response to a single instruction word. This may enable substantial time savings compared to a DSP which typically requires multiple instructions to set up a MAC. For example, in a DSP, it may be necessary to initialize modulo counters and to load various registers or other resources with input, coefficient and/or loop count data, or pointers to such data. All of these operations may take multiple clock cycles to execute before the MAC can begin executing.
In a system that implements some of the inventive principles of this patent disclosure, however, some or all of these setup tasks may be executed through a single instruction word. For example, an intermediate instruction word IIW may include the following fields which, in some embodiments, may be the minimum number of fields needed to set up the MAC unit: a field for the source of input data for the MAC unit; a field for the source of coefficient data for the MAC unit; a field for the destination of output data from the MAC unit; and a field for a loop count. In other embodiments, the minimum fields to set up the MAC unit may also include one or more fields to indicate the type of addressing being used, a field to indicate buffer length, etc. An example embodiment of an intermediate instruction word IIW is illustrated in Appendix A as described below. Depending on the implementation, any subset of the fields shown in Appendix A may be included in an IIW to set up the MAC unit.
The instruction generator and processing unit R0 shown in
The instruction generator of
In the embodiment of
As an example, if the embodiments of
In other embodiments, the level of granularity may be set at higher or lower levels.
Some additional details and refinements to the system of
One potential source of inefficiency is the pipeline nature of MAC systems. There may be some pipeline processing delay from beginning a MAC instruction, reading data from the X-data and H-data memories, possibly accumulating the multiplication results, possibly limiting the accumulation result, and writing the limited accumulation result back to X-data memory. This is illustrated in
In general, the instruction generator may attempt to apply a new instruction word MIW to the processing unit during every cycle of the clock to enable the system to operate as fast as possible. However, this may cause a possible write-before-read (WBR) conflict if a subsequent MAC instruction needs to use the result of a prior MAC instruction that is still pending in the pipeline. Referring again to
To avoid this problem, logic may be included in the processing unit to detect the approaching read of a memory location that is shared with, and scheduled to be written to by, a prior instruction. The logic may suspend the next MAC instruction until the write from the prior MAC instruction has been completed as illustrated by instruction MIW1B′ in
An approach to resolving the WBR problem without stalling the MAC unit is to use multiple threads in a round robin (circular) manner with each thread using its own resources within the X-data memory. This may enable context switching between threads which, in turn, may reduce or eliminate WBR problems. For example, if the number of threads is at least greater than the number of pipeline cycles between an X-data read used in a MAC instruction, and the final write of the MAC result, there may be no WBR problems at all.
This is illustrated in
Even if there are not enough threads to achieve full cycle utilization of the MAC, the use of multiple threads may reduce the number of stalls required for one or more threads.
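The relationship between thread count and stalls can be illustrated with a small C helper. This is a simplified worst-case model under stated assumptions, not a description of the stall logic itself: it assumes one instruction issues per clock cycle, that every instruction of a thread depends on that thread's previous result, and that a result becomes readable one cycle after `pipeline_cycles` have elapsed. The function name and parameters are hypothetical.

```c
#include <assert.h>

/* Worst-case stall cycles per instruction for one thread in a
 * round-robin of num_threads threads.
 * pipeline_cycles: cycles between the X-data read of a MAC instruction
 * and the final write of its result (assumed definition). */
unsigned wbr_stall_cycles(unsigned num_threads, unsigned pipeline_cycles)
{
    assert(num_threads > 0);
    /* The same thread issues again num_threads cycles later; in this
     * model the prior result is usable after pipeline_cycles + 1 cycles. */
    unsigned ready_after = pipeline_cycles + 1u;
    return (num_threads >= ready_after) ? 0u : ready_after - num_threads;
}
```

Under these assumptions, a single thread with four pipeline cycles stalls four cycles per dependent instruction, four threads stall one cycle, and five or more threads do not stall at all.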
In some embodiments, each thread may be suspended after it completes its processing for a specific tick. Each thread may then be enabled (woken up) at the next regular tick. In one example implementation of the embodiment of
When a thread is suspended, a no-operation (NO-OP) instruction may still be issued to the MAC as the round-robin thread execution continues. A NO-OP instruction may be implemented, for example, as a MAC instruction that writes to a reserved null address. Thus, even if a thread is suspended, the MAC instruction words MIW may be spaced apart for each thread, and therefore, the number of potentially wasted clock cycles spent on avoiding WBR conflicts may be reduced. This implies setting the maximum number of threads in the thread scheduler so that the round-robin cycle length does not change during execution. NO-OP insertion does not avoid WBR problems on its own unless there is a guaranteed minimum number of threads in the round-robin loop. If this is not the case, then a MAC stall mechanism is still needed.
Alternatively, a more complex thread scheduler can skip immediately to the next running thread as it changes the thread context. Then, as the number of running threads decreases towards the end of a tick period, WBR issues are then avoided by relying on the stall mechanism. This approach may be a little more complex, but allows smaller numbers of threads to run, if needed, and allows more rapid execution of the remaining running threads as the number of running threads diminishes. This is because not all instructions have WBR conflicts, so as the number of running threads decreases, the round-robin thread cycle length decreases, and therefore each remaining running thread may be able to run more often.
Some additional inventive principles of this patent disclosure relate to the processing order of multi-stage decimation processes. In a decimation process where the decimation factor is large, significant computational savings can be obtained by splitting the decimation process into stages as shown in
In an embodiment according to the principles of this patent disclosure, the processing order within a tick may be reversed so that later stages are processed before the earlier stages. An example will be described in the context of a three-stage decimating filter in which each filter stage decimates by two using the following pseudo code where n is the stage number, and filtern is the filter routine for that stage:
bn=get_datan−1( )
an=get_datan−1( )
cn=filtern(an,bn)
return(cn)
Within a tick, stage 3 is processed first, and the top level of code may appear as follows:
b3=get_data2( )
a3=get_data2( )
c3=filter3(a3,b3)
return(c3)
where a call to get_data2( ) invokes the following code for the second stage:
b2=get_data1( )
a2=get_data1( )
c2=filter2(a2,b2)
return(c2)
a call to get_data1( ) invokes the following code for the first stage:
b1=get_data0( )
a1=get_data0( )
c1=filter1(a1,b1)
return(c1)
and a call to get_data0( ) invokes the following code to get input data:
a0=input data
return(a0)
The call to get_data0( ) may need to suspend the thread for the remainder of the tick. Execution resumes at the beginning of the next tick when new data is available. Thus, an example sequence for three ticks may be as follows, where an arrow (→) indicates a subroutine call:
b3=get_data2( )→b2=get_data1( )→b1=get_data0( ), suspend
input data at start of tick returned as b1, a1=get_data0( ), suspend
input data at start of tick returned as a1, c1=filter1(a1,b1), c1 returned as b2, a2=get_data1( )→b1=get_data0( ), suspend
Some additional inventive principles relate to methods for scheduling tasks within threads to reduce worst-case timing constraints. These principles will be described in the context of hierarchical (multi-stage or cascaded) decimation filtering, but the principles are applicable to other types of processes as well. For example, with hierarchical decimate-by-two filters, the first stage filter process is executed for every other input sample, i.e., once every other tick. The second stage filter process is executed every fourth tick, the third stage is executed every eighth tick, etc. Using a conventional algorithm for decimation filters, there are occasional periodic ticks in which multiple filter processes need to be executed during the same tick, thereby requiring that tick period to accommodate a worst case timing scenario that is excessively long compared to the average time required for each tick.
This will be explained with respect to
an=get_datan−1( ) //step (1)
bn=get_datan−1( ) //step (2)
cn=filtern(an,bn) //step (3)
return(cn) //step (4)
In step (1), the get_datan−1( ) routine is called to get input “an”. In step (2), the get_datan−1( ) routine is called again to get the next input “bn”. In step (3), the actual decimation filtern(an,bn) routine is called to calculate the output “cn”, and in step (4), the output value “cn” from the decimation filter routine is returned to the next stage or the ultimate output. Each stage uses this same algorithm. Steps (1), (2) and (4) only take a nominal number of clock cycles per tick. Step (3), however, is the actual decimate process which may take a substantially longer time, especially for decimate filters using a large number of filter taps.
In
Referring to
For stage 2, the get_data1( ) routine must wait for RETC from stage one to obtain new data because stage 2 uses the outputs from stage 1 at its inputs. Thus, at tick 2, geta indicates that its call to the stage1 get_data1( ) does not return, but at tick 3, GETA obtains a new input from RETC in stage 1. Also during tick 3, get_data1( ) is called to get input b1, but it does not return until tick 5. Thus, during tick 5, FILT (i.e. filter2(a2,b2)) and RETC for stage 2 are executed. As is apparent from
For stage 3, the get_data2( ) routine must wait additional ticks until stage 2 returns data, but eventually the data is obtained and FILT (i.e. filter3(a3,b3)) and RETC for stage 3 are executed every eighth tick.
From
The following pseudo code illustrates an embodiment of a method according to some inventive principles of this patent disclosure that may reduce or eliminate the execution of multiple filter(a,b) routines during a single tick.
bn=get_datan−1( ) //step (1′)
cn=filtern(an,bn) //step (2′)
an=get_datan−1( ) //step (3′)
return(cn) //step (4′)
Here, the steps have been rearranged so that the results of the filtern(an,bn) call are not returned to the next stage until a different tick. That is, after cn=filtern(an,bn) is completed, calling an=get_datan−1( ) will prevent return(cn) from being executed because the next “an” data will not be available until a future tick.
This is illustrated in
Other than higher performance, the sequence described in
The method described in the context of the pseudo-code of steps (1′) through (4′) and
Moreover, the inventive principles have been described in the context of a decimation filter, but the inventive principles may be applied to any other type of signal processing system, for example, systems having multi-stage processes, in which processes having relatively long execution times may periodically align to create worst case timing situations that are longer than average timing constraints.
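The effect of rearranging the steps can be checked with a small counting model in C. The sketch below does not perform any filtering; it only tracks, for each stage, whether the next arriving sample is the first or second of a pair, and counts how many filter routines would run in each tick. The function name, the `rearranged` flag, and the `MAX_STAGES` bound are hypothetical, and the model assumes one new input sample per tick and decimate-by-two stages.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_STAGES 8

/* Returns the largest number of filter routines executed in any one tick.
 * rearranged == false: get, get, filter, return   (steps (1) to (4))
 * rearranged == true : get, filter, get, return   (steps (1') to (4')) */
unsigned worst_case_filters(unsigned stages, bool rearranged, unsigned ticks)
{
    assert(stages <= MAX_STAGES);
    bool second_pending[MAX_STAGES] = { false };
    unsigned worst = 0;

    for (unsigned t = 0; t < ticks; t++) {
        unsigned filters = 0;
        /* A new input sample enters stage 0. A stage passes data to the
         * next stage only in the tick where its return(c) executes. */
        for (unsigned s = 0; s < stages; s++) {
            if (!second_pending[s]) {
                /* First sample of the pair arrives at this stage. */
                second_pending[s] = true;
                if (rearranged) {
                    filters++;          /* filter runs right after first get */
                }
                break;                  /* stage suspends; nothing returned */
            }
            /* Second sample of the pair arrives at this stage. */
            second_pending[s] = false;
            if (!rearranged) {
                filters++;              /* filter runs after second get */
            }
            /* return(c) executes now, so c is delivered to stage s + 1. */
        }
        if (filters > worst) {
            worst = filters;
        }
    }
    return worst;
}
```

For three stages over sixteen ticks, this model reports a worst case of three filter routines in one tick for the conventional ordering and one for the rearranged ordering.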
The inventive principle relating to scheduling tasks within threads to reduce worst-case timing constraints as described above with respect to
b3=get_data2( )
c3=filter3(a3,b3)
a3=get_data2( )
return(c3)
where a call to get_data2( ) invokes the following code for the second stage:
b2=get_data1( )
c2=filter2(a2,b2)
a2=get_data1( )
return(c2)
a call to get_data1( ) invokes the following code for the first stage:
b1=get_data0( )
c1=filter1(a1,b1)
a1=get_data0( )
return(c1)
and a call to get_data0( ) invokes the following code to get input data:
a0=input data
return(a0)
where get_data0( ) may need to suspend the thread for the remainder of the tick. Therefore, an example sequence for three ticks may be as follows, where an arrow (→) indicates a subroutine call:
b3=get_data2( )→b2=get_data1( )→b1=get_data0( ), suspend
input data at start of tick returned as b1, c1=filter1(a1,b1), a1=get_data0( ), suspend
input data at start of tick returned as a1, c1 returned as b2, c2=filter2(a2,b2), a2=get_data1( )→b1=get_data0( ), suspend
Some additional inventive principles of this patent disclosure relate to methods for determining worst case timing conditions for multi-thread processes. In the embodiments of
One technique to calculate the worst case timing for a group of threads is to compute the total number of instructions for every possible combination of thread processes that may occur between ticks. As the number of threads, the number of processes per thread, and/or number of possible combinations of threads and processes increases, the number of possible combinations may rapidly become unmanageable.
To reduce the total number of combinations that must be analyzed to determine worst case timing, a least common multiple (LCM) routine may be utilized according to the inventive principles of this patent disclosure. An example is illustrated in
The LCM method may typically be used to check that all instructions can be executed within a tick period in the worst case, and therefore is of benefit when implemented in the compiler software that generates the code to run on the processor. Typically, the check would run late in the compiler processing, after instructions are generated, optimized and linked. Knowing the execution times of each instruction, and the maximum number of instructions that can be executed within each tick period, the compiler could issue a warning if it finds that this maximum could be exceeded. The compiler may also attempt to change the sequence of operations, e.g., by changing the relative phases of threads, to improve the timing conditions.
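One plausible reading of the LCM approach is sketched below in C. It assumes each process repeats with a fixed period in ticks, a fixed phase, and a fixed instruction cost, so that the combined pattern repeats every LCM of the periods and only that many ticks need to be examined. The `periodic_task` structure and the function names are hypothetical, and the disclosure's own routine may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one periodic process within a thread. */
typedef struct {
    unsigned period;  /* runs once every 'period' ticks (must be > 0) */
    unsigned phase;   /* tick offset within the period at which it runs */
    unsigned cost;    /* instructions executed when it runs */
} periodic_task;

static unsigned gcd_u(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Scans one LCM-length window of ticks and returns the largest total
 * instruction count found in any tick. Optionally reports the LCM. */
unsigned worst_case_cost(const periodic_task *tasks, size_t n, unsigned *lcm_out)
{
    unsigned lcm = 1;
    for (size_t i = 0; i < n; i++) {
        assert(tasks[i].period > 0);
        lcm = lcm / gcd_u(lcm, tasks[i].period) * tasks[i].period;
    }
    unsigned worst = 0;
    for (unsigned tick = 0; tick < lcm; tick++) {
        unsigned total = 0;
        for (size_t i = 0; i < n; i++) {
            if (tick % tasks[i].period == tasks[i].phase % tasks[i].period) {
                total += tasks[i].cost;
            }
        }
        if (total > worst) {
            worst = total;
        }
    }
    if (lcm_out != NULL) {
        *lcm_out = lcm;
    }
    return worst;
}
```

A compiler-side check could compare the returned value against the number of instructions that fit in a tick period, and could retry with different phases to look for a lower worst case.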
Some additional inventive principles of this patent disclosure relate to methods and apparatus for preprocessing inputs to an algebra unit to eliminate conditional branches when generating functions.
Signal processing systems often utilize lookup tables to determine the value of a function in response to an argument. To reduce the amount of memory required for a lookup table, the function may be decomposed into sub-functions that require smaller lookup tables. The output values from the smaller lookup tables are then used as operands for various arithmetic operations that calculate the corresponding value of the original function. The tradeoff for reducing the table size is an increased amount of processing time and power consumption for the arithmetic operations. Moreover, the arithmetic operations may require conditional branches that further reduce the speed of the function generation process, and may add complexity to an arithmetic unit that calculates the final values of the function being generated.
Some example embodiments will be described in the context of sine/cosine function generation, but the inventive principles are not limited to these examples. The description below makes use of the C99 language to describe expressions, examples, and code. An exception is x^y in equations, which is used to represent x to the power of y.
Signal processing systems (hardware or software) are commonly required to find approximations to the sine and cosine of angles at high speed while using a minimum of memory and computational resources. One well-known method is to use lookup tables, which are fast, but which may need a lot of memory for even modest precisions. Each input to the function is converted to an integer memory address, and the output value is read directly.
To find sin(x) in radians, x can be represented as a 16-bit unsigned integer int_x, such that 0<=int_x<=0xFFFF represents a full sine or cosine cycle (where “<=” is less-than-or-equal-to, and 0xFFFF is hexadecimal FFFF, or 2^16−1=65535 in decimal). The values of x and int_x are then related by:
x=int_x*(2*π)/0xFFFF (Eq. 1)
where π is the well-known mathematical constant 3.1415926535 . . . .
The integer representation has the advantage that larger arguments to sine and cosine can be handled by discarding (masking off) bits above the 16-bit unsigned input range. This is because the sine and cosine functions work modulo 2*π, which may be difficult to implement efficiently and accurately for large x, whereas discarding higher bits in int_x is essentially a modulo operation (modulo 2^16=0x10000 in this example).
To reduce the size of lookup tables, the following well-known trigonometric relations may be used:
sin(a+b)=sin(a)*cos(b)+cos(a)*sin(b) (Eq. 2)
cos(a+b)=cos(a)*cos(b)−sin(a)*sin(b) (Eq. 3)
Now int_x can be split into two parts, a and b, such that
int_x=(a*0x100)+b (Eq. 4)
where 0<=a<0x100 (the top 8 bits of x), and 0<=b<0x100 (the bottom 8 bits of x). Therefore, for all integer values of int_x (even beyond 0xFFFF, if larger integer representations are supported), a and b can be determined from int_x using:
a=(int_x>>8)&0xFF (Eq. 5)
b=int_x&0xFF (Eq. 6)
where >> is the C shift-right operator (x>>y is the integer part of x/(2^y)), and & is the bitwise ‘and’ masking operator. Therefore, for any int_x, a and b may be obtained using Eqs. 5 and 6, and then Eqs. 2 and 3 may be used to obtain sin(int_x) and cos(int_x), requiring only multiplication and addition operations.
From Eqs. 2 and 3, it appears that tables for sin(a), cos(a), sin(b) and cos(b) are required. However, the relation:
cos(x)=sin(π/2−x) (Eq. 7)
can be used to allow cos(a) to be calculated from sin(a), as both tables cover the full domain of each function. This is not true of cos(b) and sin(b): because b spans only a small range (the bottom 8 bits of 16 in this example), the sin(b) and cos(b) tables do not overlap. Therefore, just three 8-bit tables may be used to replace two direct 16-bit tables. This requires about 2^(16−8)=256 times less memory in exchange for some additional simple computations.
The tables are generally initialized prior to operation, and then only the selection and masking (Eqs. 5 and 6) and multiplication, addition, and subtraction operations in (Eqs. 2 and 3) are needed to generate each new sine and cosine value. If both sine and cosine of the same arguments are needed, then computational work can be shared up to and including the lookup tables.
As an added refinement, the mirroring relations shown in Table 1 may be used, where the quadrant numbering is the numeric value of the top two bits of int_x, i.e., with values in the range 0-3. Thus, the first quadrant is quadrant 0, the second quadrant is quadrant 1, the third quadrant is quadrant 2, and the fourth quadrant is quadrant 3.
Mirroring allows the use of tables with a smaller number of address bits. In this example, if 16 bits in ‘int_x’ represent a complete cycle, then mirroring in the inputs and outputs each reduces the number of address bits by 1, so 14 bits can be used instead of 16 bits. The mirroring on inputs and outputs can be implemented for unsigned 16-bit int_x with the equivalent operations of the following C-code fragment:
A problem with this approach is that the mirror_output boolean controls conditional code execution as a final step. This may add complexity in fast hardware dedicated to linear algebra calculations, which primarily consist of pipelined multiplies and adds.
In an embodiment according to some inventive principles of this patent disclosure, a compact lookup table method takes in an integer angle, processes it with logic, passes the resulting address to lookup tables, and then, with some additional logic, passes the results to a multiplication/addition/subtraction linear algebra processing system, which generates sine and cosine outputs directly. Depending on the implementation details, the logic functions may be implemented with relatively simple logic.
The signs of the table outputs of Eqs. 2 and 3 may be changed based on the quadrant, and then the modified table results may be passed to Eqs. 2 and 3 and the results used directly. If Eqs. 2 and 3 are expressed in matrix form:
then by inspection, it is apparent that there are only two methods of obtaining each combination of mirroring (negation) on the outputs of the sin( ) and cos( ) tables as shown in Table 2, where the symbol ← is used to denote behavior equivalent to “simultaneously becomes” in all selected assignments.
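One way to write the angle-sum identities in matrix form, offered here as an illustrative reading rather than a reproduction of the original figure, uses the shorthand sa=sin(a), ca=cos(a), sb=sin(b) and cb=cos(b):

```latex
\begin{bmatrix} \sin(a+b) \\ \cos(a+b) \end{bmatrix}
=
\begin{bmatrix} sa & ca \\ ca & -sa \end{bmatrix}
\begin{bmatrix} cb \\ sb \end{bmatrix}
```

Under this form, advancing the angle by a quarter cycle, for example, can be achieved either by the simultaneous assignment (sa, ca) ← (ca, −sa), since sin(a+π/2)=cos(a) and cos(a+π/2)=−sin(a), or by the simultaneous assignment (sb, cb) ← (cb, −sb), since the same identities hold for b. Which of these corresponds to Method 1 or Method 2 of Table 2 is not assumed here.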
Any combination of these two methods can be used for each of three quadrants, giving eight possible combinations. For example, the following code fragment illustrates the use of Method 1 for the mirroring in quadrants 1, 2 and 3:
Similar solutions can use other combinations of Method 1 and Method 2. For example, the following code fragment illustrates the use of Method 1 for quadrants 1 and 3, and Method 2 for quadrant 2:
Returning to the example in which Method 1 is used for the mirroring in quadrants 1, 2 and 3, the following code fragment illustrates how the initial values for sa, sb and cb can be obtained from tables sin_table_top[a], sin_table_bot[b] and cos_table_bot[b], respectively, which have 7-bit addressing to access 128 values in each table. Since cos(x)=sin(π/2−x) as set forth in Eq. 7 above, the initial value of ca can be obtained from sin_table_top[0x80−a].
In an implementation having an algebra unit such as a pipelined multiply-accumulate (MAC) unit, the last two lines of the code fragment above may be executed by the MAC without any conditional code execution (branch instructions). Thus, a fast sine/cosine function generator may be implemented using an existing algebra unit, relatively small lookup tables, and some simple logic to provide preprocessing of the operands for the algebra unit.
The embodiment of
Mirror logic AA6 mirrors the operands sa, ca, sb, cb as needed to enable a MAC unit or other arithmetic unit to calculate the value of the sinusoidal function in response to the operands without conditional code execution.
Although shown as separate blocks in
Appendix E illustrates example code for a sine cosine generation utility which may be integrated into a system such as that shown in
Appendix F illustrates example code that may be used to test the algorithms described above in C.
The inventive principles described herein may be implemented to provide numerous features and/or benefits depending on the implementation details, combinations of features, etc. Some examples are as follows.
In some embodiments, a configurable controller may be reconfigured depending on the specific processes to be implemented with the control strategy. In some embodiments, the hardware may be configured to perform operations without branch instructions. This may eliminate the branch logic and decision delays associated with branching. For example, hardware may be configured or dynamically reconfigured to perform linear convolution or vector processing without branches.
In some embodiments, limits on MAC output values may be imposed using dedicated hardware, which may reduce processing overhead conventionally associated with software limit checks.
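As a behavioral illustration only, the effect of such per-location limits can be modeled in C as shown below. The names limit_min, limit_max and clip_to_limits, the table size, and the limit values are all hypothetical. In hardware the comparison would be performed by dedicated logic rather than by instructions in the control program.

```c
#include <stdint.h>

#define NUM_ADDR 4  /* hypothetical number of data-memory locations */

/* Hypothetical per-address limit tables (one min/max pair per location). */
static const int32_t limit_min[NUM_ADDR] = { -1000, -50, INT32_MIN,   0 };
static const int32_t limit_max[NUM_ADDR] = {  1000,  50, INT32_MAX, 255 };

/* Model of the clipping applied to a MAC result before it is stored
   at data-memory location addr. */
static int32_t clip_to_limits(int64_t mac_result, unsigned addr)
{
    if (mac_result < limit_min[addr]) return limit_min[addr];
    if (mac_result > limit_max[addr]) return limit_max[addr];
    return (int32_t)mac_result;
}
```

For example, an integrator state stored at location 0 would saturate at ±1000 in this model instead of growing without bound.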
In some embodiments, widely distributed memories may improve MAC performance in terms of data bandwidth efficiency.
In some embodiments, a configurable controller may provide zero overhead task switching.
In some embodiments, the inventive principles may be implemented as a configurable controller having hardware acceleration with high cycle utilization.
In some embodiments, there may be no need to coordinate write-before-read issues because the use of no-operation (NOP) elements may help resolve timing issues.
In some embodiments, threads may be implemented, including running the threads in a round-robin fashion, and yielding to the next thread after each instruction. The number and/or type of threads may be set to any suitable values.
In some embodiments, as each thread finishes within a tick period, the round-robin thread cycle is shortened to eliminate that thread. Any write-before-read (WBR) faults are then detected, and MAC stalls are inserted as a last resort.
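The round-robin issue order described above can be illustrated with a small behavioral model in C. This is a sketch under stated assumptions: the function round_robin_issue and the constant MAX_THREADS are hypothetical, each thread is represented only by its instruction count for the tick, and WBR fault detection and MAC stalls are not modeled.

```c
#include <stddef.h>

#define MAX_THREADS 8  /* hypothetical upper bound for this model */

/* Issue one instruction per active thread in round-robin order, dropping
   a thread from the cycle once its instructions for the tick are done.
   Writes the thread index of each issued instruction into order[] and
   returns the number of instructions issued. */
static size_t round_robin_issue(const int *instr_count, size_t num_threads,
                                int *order, size_t order_cap)
{
    int remaining[MAX_THREADS];
    size_t active = 0, issued = 0;

    if (num_threads > MAX_THREADS)
        num_threads = MAX_THREADS;
    for (size_t t = 0; t < num_threads; t++) {
        remaining[t] = instr_count[t];
        if (remaining[t] > 0)
            active++;
    }
    while (active > 0 && issued < order_cap) {
        for (size_t t = 0; t < num_threads && issued < order_cap; t++) {
            if (remaining[t] <= 0)
                continue;              /* finished thread: skipped in the cycle */
            order[issued++] = (int)t;  /* issue one instruction, then yield */
            if (--remaining[t] == 0)
                active--;
        }
    }
    return issued;
}
```

With instruction counts of 2, 1 and 3 for threads 0, 1 and 2, the model issues in the order 0, 1, 2, 0, 2, 2. The two back-to-back instructions from thread 2 at the end show the situation in which pipeline hazards could arise once the cycle has shortened.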
In some embodiments, some of the inventive principles may enable the extension of older semiconductor processing technologies to higher performance levels. For example, a fabrication technology that is nearing the end of its useful life may become competitive again in terms of cost, efficiency, performance, etc., if used to implement a controller according to some of the inventive principles of this patent disclosure.
In some embodiments, and depending on the implementation details, some of the inventive principles may provide or enable the following advantages, features, etc.: (1) configurable real-time control for power conversion applications; (2) high-speed independent control processing and acceleration for a microcontroller; efficient real-time implementation of state-space control systems; (3) efficient real-time FIR filters for signal conditioning; (4) efficient real-time multi-rate decimation filtering (enables use of high sample rate converters followed by digital filtering to control the bandwidth of the signal); (5) high-speed sine/cosine generation used to drive high sample rate PWMs (used to generate AC with low distortion/corrected distortion); (6) a simple pipelined MAC may allow for low gate count/low power with one multiply-accumulate per clock; (7) multiple memory buses may enable a very high cycle utilization; (8) a code/address generator may keep the MAC unit fed with close to 100% cycle efficiency; (9) data may be bounded to a user-defined min/max level (for each address location); (10) this may enable zero-overhead clipping of data, which may be used primarily to limit the values of integrators, but can be used on any state variable; (11) inputs and outputs may be registered on a clock boundary, e.g., enabling a fixed one ADC clock delay through the system, e.g., output can be skewed relative to this clock; (13) an internal state can be logged without altering the timing; (14) hardware fault detection, e.g., stack/PC overflows/underflows may be detected and outputs may be disabled; thus, completion of code execution in the allocated time may be checked and outputs disabled if an error is detected.
The following additional advantages, features, etc., may be realized in some embodiments, depending on the implementation details: (15) zero-overhead task switching (fine-grain, instruction-level task switching), which may enable hiding the pipeline with other tasks; (16) separate data/coefficient/limit/address RAMs; (17) deterministic run-time behavior; synchronous inputs and outputs to the host controller (may be deterministic because the number of clock cycles is known in advance); (18) hardware fault detection; redundancy and safety margin improvement.
Appendices A through E illustrate examples of code, processes and/or methods that can be implemented using the systems of
Appendices A and B illustrate example embodiments of an intermediate instruction word IIW and a MAC external instruction word MIW, respectively, in the format of Verilog code. The symbol “//” marks the start of a comment line which applies to the Verilog declaration below the comment. A signal name such as “signal_name[x−1:0]” defines a bus “signal_name” of width x wires, with wire indices 0 through x−1 where 0 is the least significant bit. Bus widths are not defined in the example IIW, but can be chosen based on the level of performance needed. The choice of bus widths affects the number of gates used to implement the instruction words.
Appendix C illustrates an example of code for a signal processing engine using hardware that on each clock can perform a Multiply-Accumulate (MAC) instruction.
Appendix D illustrates example code to run on a compiler using system language as described in Appendix C. The subroutine filt1 illustrates an example of the method for reducing worst case timing constraints as described above in the context of
Appendix E illustrates example code for a sine/cosine generation utility which may be useful, for example, in phase lock applications such as locking the output of an AC power source to a grid waveform.
Appendix F illustrates example code that may be used to test the sine/cosine generation algorithms described above.
The inventive principles of this patent disclosure have been described above with reference to some specific example embodiments, but these embodiments can be modified in arrangement and detail without departing from the inventive concepts. For example, some of the embodiments have been described in the context of synchronous logic, but the inventive principles may be applied to embodiments that employ asynchronous logic as well. Such changes and modifications are considered to fall within the scope of the following claims.
Example of intermediate instruction word (IIW) format:
Example of MAC instruction word (MIW) format:
On each clock, the hardware can do one of the following Multiply-Accumulate (MAC) instructions in “loops+1” clocks (where loops >= 0):
In this example, the processing unit is fed by an address generator called AGEN. The AGEN supports the following instructions:
The “enable_context_switch” can be a bit set concurrently with the other AGEN instructions.
The instructions (a-f) above are AGEN instructions, and the remaining data at each address comprises Very Long Instruction Word (VLIW) instruction data to be sent to the MAC.
The system can include a system language and compiler for the system. The following is an example of code running on it:
For phase locking applications, it may be necessary to generate the sin( ) and cos( ) of a value accumulated in the X-DATA memory. This may be done in hardware using an equivalent of the following C code. The main( ) function only initializes the tables (which could be implemented as ROM in hardware) and checks the results from sincos( ), which uses the algorithm to calculate the desired results.
In the system language, we can calculate the final sin and cos values in an array:
The following code is a complete system for testing a sine/cosine function generator algorithm in C. If the code is placed in a file sin_cos.c, then on a Unix or Linux system, the code can be compiled in its directory using:
A test is run using the command “./sin_cos”
The code also allows one to adjust three independent precision parameters and to check the precision of the result, making it possible to experiment to find the smallest satisfactory precision. Note that “top” and “bot” are used in the code for “a” and “b”, respectively, as used in the main description.
This application claims priority from U.S. Provisional Patent application Ser. No. 61/239,756 filed Sep. 3, 2009, which is incorporated by reference.
Number | Date | Country
---|---|---
61/239,756 | Sep 2009 | US