1. Technical Field
The embodiments herein generally relate to dynamic range detection, and, more particularly, to dynamic range detection in CPUs in receivers.
2. Description of the Related Art
Typically, central processing unit (CPU) architectures in digital signal processors do not support efficient implementation of block floating point processing on arrays. Even in architectures where block floating point is supported, a lot of control code needs to be added to take care of pre- and post-scaling of data blocks based on the dynamic range of the signals at each stage. The problem with these existing methods is that the overhead of the control code for detecting the dynamic range of signals can be significant, to the extent that the application may run out of the available MIPS or cycles.
In addition, such architectures need to support arithmetic data-path widths that are wider than optimal. This potentially leads to bigger designs, consequently increasing area and dissipating more power than necessary. Both fixed and floating-point implementations have their respective advantages. It is possible to achieve a dynamic range approaching that of floating-point arithmetic while working with fixed-point processors. This can be accomplished by using floating-point emulation software routines.
Fixed point representation is a real data type for a number that has a fixed number of digits after the radix point. Floating point describes a system for representing real numbers which supports a wide range of values. Numbers are in general represented approximately to a fixed number of significant digits and scaled using an exponent. In fixed-point processors, it is possible to achieve a dynamic range similar to that of floating-point processors by using floating-point emulation software routines. Emulating floating-point behavior on a fixed-point processor is very cycle intensive, since the emulation routine manipulates all arithmetic computations to artificially implement floating-point math on a fixed-point device. This software emulation is only worthwhile if a small portion of the overall computation requires extended dynamic range. Hence, a cost-effective alternative for achieving floating-point dynamic range on a fixed-point processor is needed. This is where the block floating point algorithm plays a significant role.
The block floating point algorithm is based on the block automatic gain control (AGC) concept. The AGC scales values at the input stage of a signal processing function and only adjusts the input signal power. The block floating point algorithm takes it a step further by tracking the signal strength from stage to stage to provide a more comprehensive scaling strategy and extended dynamic range. The floating-point emulation scheme discussed here is the block floating-point algorithm. The primary benefit of the block floating-point algorithm emanates from the fact that operations are carried out on a block basis using a common exponent. Here, each value in the block can be expressed in two components namely a mantissa and a common exponent. The common exponent is stored as a separate data word. This leads to a minimum hardware implementation compared to that of a conventional floating-point implementation.
The value of the common exponent is determined by the data element in the block with the largest amplitude. In order to compute the value of the exponent, the number of leading zeros or leading ones bits has to be determined. This is determined by the number of left shifts required for this data element to be normalized to the dynamic range of the processor. If a given block of data of the input signal consists entirely of small values, a large common exponent can be used to shift the small data values left and provide more dynamic range. On the other hand, if a data block contains large data values, then a small common exponent will be applied. Once the common exponent is computed, all data elements in the block are shifted up by that amount, in order to make optimal use of the available dynamic range. Scaling each value up by the common exponent increases the dynamic range of data elements in comparison to that of a fixed-point implementation.
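The common-exponent computation described above can be sketched in software as follows. This is a minimal illustration only, assuming 16-bit two's-complement data; the function names `sign_bit_redundancy` and `block_normalize` are hypothetical and do not appear in the embodiments.

```python
def sign_bit_redundancy(value: int, width: int = 16) -> int:
    """Count the left shifts a two's-complement value tolerates without overflow.

    For a non-negative value this is (leading zeros - 1); for a negative
    value it is (leading ones - 1).
    """
    if not -(1 << (width - 1)) <= value < (1 << (width - 1)):
        raise ValueError("value does not fit in the given width")
    # Inverting a negative value turns its leading ones into leading zeros.
    magnitude = ~value if value < 0 else value
    return (width - 1) - magnitude.bit_length()


def block_normalize(block, width: int = 16):
    """Return (mantissas, common_exponent) for a non-empty block of integers.

    The common exponent is set by the element with the largest amplitude,
    i.e. the one that tolerates the fewest left shifts.
    """
    exponent = min(sign_bit_redundancy(v, width) for v in block)
    return [v << exponent for v in block], exponent


mantissas, exponent = block_normalize([100, -3, 12])
# exponent is 8, so the block becomes [25600, -768, 3072]
```

A block of small values yields a large exponent and a block containing a large value yields a small one, which matches the behavior described in the paragraph above.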
In communication-based applications, an analog to digital converter (ADC) is used for sampling the input signals. ADC specifications such as the effective number of bits (ENOB) are usually chosen on the basis of worst-case channel conditions, which is why they have sufficient headroom beyond the required SNR. This is shown in
There are some hardwired architectures for block floating point implementations of FFT computations. Since they are hardwired blocks, they incur no overhead in software cycle consumption, though they still consume a finite number of cycles. In addition, since they address only one class of signal processing functions, such as the FFT, they cannot be reused for other classes.
In view of the foregoing, an embodiment herein provides a system for computing a block floating point (BFP) scaling factor by detecting a dynamic range of an input signal in a central processing unit (CPU) without additional overhead cycles. The system includes a dynamic range monitoring unit that detects the dynamic range of the input signal by snooping (i) outgoing write data and (ii) incoming memory read data of the input signal. The dynamic range monitoring unit includes a leading zero and leading one detector and counter unit that detects a count of leading zeros and leading ones for each sub-word of the outgoing write data and the incoming memory read data of the input signal, a registered maximum count unit that stores the count of leading zeros and leading ones for each sub word of the outgoing write data and the incoming memory read data of the input signal, a least value finder unit that determines a least value of the count of the leading zeros and leading ones over a block of data, and a running maximum count unit that stores the least value of the count of the leading zeros and leading ones over the block of data. The dynamic range is detected based on the least value of the count of the leading zeros and leading ones over the block of data and a least value of a count of trailing zeros over the block of data. The system further includes a scaling factor computation module that computes a block floating point (BFP) scaling factor based on the dynamic range.
The dynamic range monitoring unit further includes a bus swapper unit that bus-swaps each of the sub-word of the outgoing write data and the incoming memory read data of the input signal such that (i) a most significant bit (MSB) position of each of the sub word occupies a least significant bit (LSB) position, and (ii) a LSB position of each of the sub-word occupies a MSB position, a trailing zeros detector and counter unit that detects a count of trailing zeros over the block of data for each of the sub-word of the outgoing write data and the incoming memory read data of the input signal, a registered minimum count unit that stores the count of trailing zeros for each of the sub-word of the outgoing write data and the incoming memory read data, a least value finder unit that determines the least value of the count of trailing zeros over the block of data, and a running minimum count unit that stores the least value of the count of trailing zeros over the block of data.
The count of leading zeros and leading ones, the least value of the counted leading zeros and leading ones, the count of trailing zeros, and the least value of the count of trailing zeros may be preset to a highest possible value before detecting the dynamic range at a start of the load and store operations. The system may further include a CPU control register (CCR) that turns on and turns off the dynamic range using a specified program. The dynamic range may be updated in a control register file at an end of a signal processing operation when a value of control bit signals is cleared to zero. The signal processing operation is at least one of a load operation, a store operation, an arithmetic operation, and a logical function operation. The dynamic range is detected in at least one of a load store unit, an arithmetic unit, and a logical function unit.
In another aspect, a method for implementing a block floating point (BFP) scaling factor by detecting a dynamic range of an input signal in a central processing unit (CPU) without additional overhead cycles is provided. The method includes detecting a count of leading zeros and leading ones for each sub-word of the outgoing write data and the incoming memory read data of the input signal using a leading zero and leading one detector and counter unit, storing the count of leading zeros and leading ones for each sub-word of the outgoing write data and the incoming memory read data of the input signal using a registered maximum count unit, determining a least value of the count of the leading zeros and leading ones over a block of data using a least value finder unit, storing the least value of the count of the leading zeros and leading ones over the block of data using a running maximum count unit, detecting the dynamic range based on (i) the least value of the count of the leading zeros and leading ones over the block of data and (ii) a least value of a count of trailing zeros over the block of data by a dynamic range monitoring unit, and computing the block floating point (BFP) scaling factor based on the dynamic range.
The method further includes determining whether a signal processing stage is a first stage, the BFP scaling factor is obtained from a previous stage when the signal processing stage is not the first stage, and computing a new BFP scaling factor for a second stage based on the dynamic range, the input data for the second stage is shifted using the new BFP scaling factor along with a first signal processing operation, an output of the first stage is shifted and written on a memory addressed by the CPU using the new BFP scaling factor along with a second signal processing operation.
The new BFP scaling factor may be set to zero when the signal processing stage is the first stage. It may be determined whether the second stage is a last signal processing stage. Arithmetic scaling and residue scaling components of the new BFP scaling factor for the second stage may be determined. A residue exponent may be computed by scaling the residue scaling components of the first stage until the last signal processing stage to obtain a required native precision of the input signal. Dummy load operations may be performed on two data sets of the input signal. A dynamic range of the two data sets may be detected. A scaling factor is computed based on the dynamic range of the two data sets of the input signal. Each of the first signal processing operation and the second signal processing operation is at least one of a load operation, a store operation, an arithmetic operation, and a logical function operation.
The method further includes bus-swapping using a bus swapper unit, each sub-word of the outgoing write data and the incoming memory read data of the input signal such that (i) a most significant bit (MSB) position of the each sub-word occupies a least significant bit (LSB) position, and (ii) a LSB position of the each sub-word occupies a MSB position, detecting the count of trailing zeros for the each sub-word of the outgoing write data and the incoming memory read data of the input signal using a trailing zeros detector and counter unit, storing the count of trailing zeros for the each sub-word of the outgoing write data and the incoming memory read data using a registered minimum count unit, determining a least value of the count of trailing zeros over the block of data using a least value finder unit, and storing the least value of the count of trailing zeros over the block of data using a running minimum count unit.
These and other aspects of the embodiments herein will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating preferred embodiments and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments herein without departing from the spirit thereof, and the embodiments herein include all such modifications.
The embodiments herein will be better understood from the following detailed description with reference to the drawings, in which:
The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
As mentioned, there remains a need for different types of digital signal processors (e.g., CPUs) like very long instruction word (VLIW) processors or superscalar or single-issue processors for Software defined radio subsystem or for receivers. The embodiments herein achieve this by providing a method by which dynamic range of input signal (for different classes) is detected with zero overhead using dynamic range detection with load store operations in VLIW processors. The scheme is generic and can be extended to any type of CPU architecture like single-issue or superscalar processors. Referring now to the drawings, and more particularly to
In a generic case, the CPU is likely to have one or more Load-Store units. The instruction is first decoded after the dispatch phase and then executed. In addition, the VLIW CPU 200 consists of a register file 210, an arithmetic slot 212, a logical function decode and operand fetch slot 214A, a logical function slot 214B, and a pipeline and interrupt control unit 216. The contents to be written to the memory are fetched from the register file 210 and along with the decoded bits and are latched in an operand fetch phase.
The arithmetic slot 212 processes real or complex signals, along with miscellaneous execution units like logic function slot 214B etc. The CPU control register (CCR) 202 turns on or turns off the dynamic range monitoring function as desired by a programmer for load or store operations from the memory. In one embodiment, the CPU control register (CCR) 202 turns on and turns off the dynamic range using a specified program (as desired by a programmer) for the load or store operations. The load store unit1 (LSU1) decode and operand fetch 206A and 208A perform load and store operations in a DSP processor (e.g., the CPU 200) for detecting the dynamic range of the input signal.
The load-store unit with dynamic range monitoring 300A further includes a memory launch pipe 310 which receives data from an address generation unit 312, and a write data operand signal 314 from the operand fetch pipe 302. The address mode 306 signifies the various types of addressing modes. The address operand 308 signifies the address of the memory operation. The address generation unit 312 generates an address based on an addressing mode of the input signal.
The load-store unit with dynamic range monitoring 300A further includes a dynamic range monitor block 316 that snoops on the write data operand 314 received from the operand fetch pipe 302. In one embodiment, the dynamic range monitor block 316 detects the dynamic range of the input signal by snooping (i) an outgoing write data and (ii) an incoming memory read data of the input signal.
The write back control 318 and the write back address 320 are latched in the write back control pipe 322 for a required number of load-delay cycles for eventually writing back to the register file 210. The memory read pipe 324 receives data from the memory which is clocked using a rdclk_phase signal 326. In one embodiment, an incoming memory read data is latched in a memory read pipe 324 and clocked using a memory read clock (e.g., the rdclk_phase 326). Similarly, the outgoing write data is obtained from the control register file 330 and written on a memory.
In one embodiment, a similar snooping operation is also performed on the register write data bus when the loaded data from the memory read pipe 324 is being written back into the register file 210. The outgoing write data and one or more decoded bits are latched in the operand fetch pipe 302. The outgoing write data and the incoming memory read data are enabled using control bit signals that are generated from the operand fetch pipe 302 (e.g., an operand fetch phase) and are set at a start of the load and store operations (e.g., load or store cycles) which needs to be monitored. The dynamic range monitor block 316 snoops the outgoing write data and incoming memory read data and is enabled using the control bit CPU control register dynamic range control 332.
The signal “cpu control dynamic range update” 328 is used to update the value of dynamic range detected in the CPU control register file 330. In one embodiment, the dynamic range is updated in the control register file 330 at an end of signal processing operations when a value of control bit signals is cleared to zero. In one embodiment, the signal processing operation is at least one of a load operation, a store operation, an arithmetic operation, and/or a logical function operation. A maximum exponent value is computed and latched into the CCR 202 when the control bit signals is cleared to zero. The dynamic range monitoring is turned on using the signal CPU control register dynamic range control 332 (e.g., ccr_dyn_range_ctrl).
The load store unit control signal 304, the address mode signal 306, and the address operand signal 308 are used to generate the appropriate address based on the addressing mode (e.g., a linear addressing mode, a circular addressing mode, a bit reverse addressing mode, and an indirect addressing mode) in the next phase. These signals are launched to the memory interface from the memory launch pipe 310. The data block to be written to the memory is carried on the signal named write data operand 314 (e.g., the write_data_operand 314), which is snooped to determine the dynamic range of a given block.
The write back control 318 and the write back address 320 signals are preserved throughout the load delay cycles in the intermediate pipeline stages, namely write back control pipe-1 322, write back control pipe-2, etc., up to write back control pipe-N. On the active-to-inactive transition of the CPU control register dynamic range control bit, the dynamic range monitoring 316 can be turned off and the previously calculated dynamic range value can be latched. In this manner, the load-store unit stores the dynamic range for a block of data of the input signal without adding overhead cycles. Similarly, the dynamic range can be detected in other units (e.g., an arithmetic unit, and/or a logical function unit as shown in
With reference to
The CPU 200 can perform the dynamic range monitoring in the arithmetic unit. During any CPU operation, the arithmetic unit fetches the required operands (e.g., an arithmetic write data operand 342) from the register file 210, where data would be preloaded using any memory read operation. During the pipelined stages of CPU operation, the required operands are fetched by the arithmetic slot through the arithmetic decode and operand fetch unit 212 and propagated to the arithmetic unit via the respective operand fetch pipe. The dynamic range monitoring block 316 can snoop on the arithmetic write data operand 342 (e.g., the arith_write_data_operand 342) and compute an appropriate dynamic range while other operations are concurrently ongoing in the arithmetic unit. On the active-to-inactive transition of the CPU control register dynamic range control bit, the dynamic range monitoring 316 can be turned off and the previously calculated dynamic range value can be latched.
With reference to
The CPU 200 can perform the dynamic range monitoring in the logical function unit. During any CPU operation, the logical function unit fetches the required operands (e.g., a logic write data operand 356) from the register file 210, where data would be preloaded using any memory read operation. During the pipelined stages of CPU operation, the required operands are fetched by the logic slot through the logical function decode and operand fetch unit 214 and propagated to the logic function unit via the respective operand fetch pipe. The dynamic range monitoring block 316 can snoop on the logic write data operand 356 (e.g., the logic_write_data_operand 356) and compute an appropriate dynamic range while other operations are concurrently ongoing in the arithmetic unit. On the active-to-inactive transition of the CPU control register dynamic range control bit, the dynamic range monitoring 316 can be turned off and the previously calculated dynamic range value can be latched. Hence, no additional overhead CPU cycles are required while the dynamic range of signals is computed in these units as well. Thus, the dynamic range is detected in at least one of a load store unit, an arithmetic unit, and a logical function unit. In one embodiment, the dynamic range can be detected in any of the load store slot with dynamic range monitoring 206A and 208B, the arithmetic slot of
With reference to
With reference to
With reference to
The leading zero or leading one detector 402 detects a count of leading zeros and leading ones for each sub-word of the outgoing write data and the incoming memory read data of the input signal. In one embodiment, the leading zero or leading one detector 402 detects the dynamic range of the input signal. The outgoing write data and incoming memory read data are snooped by detecting the count of leading zeros and leading ones for each sub-word of the outgoing write data and the incoming memory read data of the input signal. The registered maximum count 404 stores the count of leading zeros and leading ones as a registered maximum count. In one embodiment, the registered maximum count 404 stores the count of leading zeros and leading ones for each sub-word of the outgoing write data and the incoming memory read data of the input signal. A similar process is followed for different sub-words. The least value finder 414A determines the least value amongst all of these counts and any previous least value, that is, a least value of the count of the leading zeros and leading ones over a block of data, and stores it in the running maximum count indicator 406. The running maximum count indicator 406 maintains the smallest possible value of K over a block of data of the input signal. In one embodiment, the running maximum count indicator 406 maintains a least value of the count of the leading zeros and leading ones over the block of data of the input signal.
The bus swapper 408 bus-swaps each sub-word of the outgoing write data and the incoming memory read data such that (i) a most significant bit (MSB) position of the each sub-word occupies a least significant bit (LSB) position, and (ii) a LSB position of the each sub-word occupies a MSB position. This swapped data bus is processed to find the leading zeros. A combination of the bus swapper 408 and the trailing zeros detector and counter 410 enables determining the number of trailing zeros. The trailing zeros detector and counter 410 detects a count of trailing zeros for the each sub-word of the outgoing write data and the incoming memory read data of the input signal. The registered minimum count indicator 412 stores the count of trailing zeros for each sub-word of the outgoing write data and the incoming memory read data as a registered minimum count.
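The way a bus swap lets a leading-zero counter report trailing zeros can be illustrated with a short software model. This is only a sketch under the assumption of 16-bit unsigned sub-words; the names `bus_swap`, `leading_zeros` and `trailing_zeros` are hypothetical and stand in for the hardware units described above.

```python
def bus_swap(word: int, width: int = 16) -> int:
    """Reverse the bit order of an unsigned word so the MSB and LSB trade places."""
    swapped = 0
    for _ in range(width):
        swapped = (swapped << 1) | (word & 1)
        word >>= 1
    return swapped


def leading_zeros(word: int, width: int = 16) -> int:
    """Count zero bits above the most significant set bit."""
    return width - word.bit_length()


def trailing_zeros(word: int, width: int = 16) -> int:
    """Count trailing zeros by reusing the leading-zero counter on swapped data."""
    return leading_zeros(bus_swap(word, width), width)


count = trailing_zeros(0x0008)  # bit 3 is the lowest set bit, so count is 3
```

A two's-complement value would first be masked to the sub-word width (for example `value & 0xFFFF`) before being passed to these helpers.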
A similar process is followed for different sub-words and the least value amongst them and any previous least value is determined using the least value finder 414B and stored in the running minimum count indicator 416. The least value finder 414B determines a least value of the count of trailing zeros over the block of data. The running minimum count indicator 416 maintains the smallest possible value of ‘L’ over a block of data of the input signal. In one embodiment, the running minimum count indicator 416 stores a least value of the count of trailing zeros over a block of data of the input signal. The value ‘L’ is stored as trexp and L+K is stored as a maximum exponent value (e.g., maxexp). The computed value of maxexp is latched into the CPU Control register 202 when the CPU control register dynamic range control bit is cleared to zero, using a high to low transition. In one embodiment, the maximum exponent value is the dynamic range that is detected by adding the least value of the count of the leading zeros and leading ones over the block of data and the least value of a count of trailing zeros over the block of data.
At the start of the operation, before turning on the dynamic range monitoring for a given load-store unit, the registered minimum count indicator 412, the registered maximum count 404, the running maximum count indicator 406, and the running minimum count indicator 416 are preset to a highest possible value so that the previous values are not used. In one embodiment, the count of leading zeros and leading ones, the smallest value of the count of the leading zeros and leading ones, the count of trailing zeros, and the smallest value of the count of trailing zeros are preset to a highest possible value before detecting the dynamic range.
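The preset, snoop and latch behavior described above can be modeled in software roughly as follows. This is an illustrative sketch, not the hardware design: the class `DynamicRangeMonitor`, its method names, the 16-bit sub-word width, and the choice to include the sign bit in the leading-bit count are all assumptions made for the example.

```python
class DynamicRangeMonitor:
    """Software model of a snooping dynamic range monitor (illustrative only)."""

    def __init__(self, width: int = 16):
        self.width = width
        self.mask = (1 << width) - 1
        self.enabled = False
        self.maxexp = None  # latched K + L
        self.trexp = None   # latched L
        self._preset()

    def _preset(self):
        # Preset both running counts to the highest possible value.
        self.running_lead = self.width   # K: least leading zero/one count
        self.running_trail = self.width  # L: least trailing zero count

    def set_control(self, enable: bool):
        """Model the control bit: preset on turn-on, latch on high-to-low."""
        if enable and not self.enabled:
            self._preset()
        elif self.enabled and not enable:
            self.trexp = self.running_trail
            self.maxexp = self.running_lead + self.running_trail
        self.enabled = enable

    def snoop(self, sub_words):
        """Update the running minima from sub-words seen on a data bus."""
        if not self.enabled:
            return
        for value in sub_words:
            word = value & self.mask
            if word >> (self.width - 1):
                # Sign bit set: count leading ones.
                lead = self.width - (~word & self.mask).bit_length()
            else:
                lead = self.width - word.bit_length()
            # Lowest set bit index equals the trailing zero count.
            trail = (word & -word).bit_length() - 1 if word else self.width
            self.running_lead = min(self.running_lead, lead)
            self.running_trail = min(self.running_trail, trail)


monitor = DynamicRangeMonitor()
monitor.set_control(True)
monitor.snoop([100, -100])  # e.g. store data
monitor.snoop([8])          # e.g. load data
monitor.set_control(False)  # latches maxexp = 11 and trexp = 2
```

Because the monitor only observes data that the load and store operations already move, the model needs no pass over the block beyond the `snoop` calls themselves.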
Using the contents of maxexp register, an optimum scaling factor is calculated for required different types of operation programmatically. The optimum scaling factor (e.g., a block floating point (BFP) scaling factor) is computed based on the dynamic range. In one embodiment, the block floating point (BFP) scaling factor is computed using a scaling factor computation module that may reside in the dynamic range monitor block 316 of
In one embodiment, a best dynamic range is programmatically selected based on different classes of the input signal and corresponding scaling factors. Some of these scaling factors which are usually used in signal processing are as follows:
$G < \dfrac{1}{X_{\max} \sum_{k=0}^{N-1} |H_k|}$
where $H_k$ is the impulse response of a filter of length N. The summation term $\sum_{k=0}^{N-1} |H_k|$ is called the L1 norm.
$G < \dfrac{1}{X_{\max} \sqrt{\sum_{k=0}^{N-1} H_k^2}}$
The norm $\sqrt{\sum_{k=0}^{N-1} H_k^2}$ is called the L2 norm and is never greater than the L1 norm.
$G < \dfrac{1}{X_{\max} \max_k |H(\omega_k)|}$
The term $\max_k |H(\omega_k)|$ is known as the Chebyshev norm of the frequency response $H(\omega)$. This guarantees that the steady state response of the system to a sine-wave input will never overflow.
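The three bounds on the gain G can be compared numerically with a short sketch. It assumes real FIR coefficients and approximates the Chebyshev norm by sampling the frequency response on a uniform grid from 0 to π; the function name `scaling_bounds` and the `num_freqs` parameter are hypothetical.

```python
import cmath
import math


def scaling_bounds(h, x_max, num_freqs=512):
    """Return the upper bounds on G implied by the L1, L2 and Chebyshev norms."""
    l1 = sum(abs(c) for c in h)
    l2 = math.sqrt(sum(abs(c) ** 2 for c in h))
    # Sample |H(w)| on [0, pi]; a finer grid gives a tighter estimate.
    grid = (math.pi * i / (num_freqs - 1) for i in range(num_freqs))
    chebyshev = max(
        abs(sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(h)))
        for w in grid
    )
    return {
        "l1": 1.0 / (x_max * l1),
        "l2": 1.0 / (x_max * l2),
        "chebyshev": 1.0 / (x_max * chebyshev),
    }


bounds = scaling_bounds([0.25, 0.5, 0.25], x_max=1.0)
```

For this low-pass example the L1 and Chebyshev norms are both 1.0, while the L2 norm is about 0.612, so the L2 bound permits a larger gain than the other two.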
Since the Xmax value is known, the maxexp contents (e.g., maximum exponent) for a block of data can be used to derive a scaling factor (which could be scale-up or scale-down) that ensures that the output is stable and does not exceed the required precision range for a given class of signal processing function. The value of trexp (e.g., number of trailing zeros) is also maintained in a separate control register for further processing at the end of all stage-wise signal processing functions. It is assumed that the frequency response is known in all such scenarios. Similarly, scaling factors can be derived for spectral decomposition operations like the Fast Fourier transform on a stage-by-stage basis.
In step 506, it is checked whether the maximum datum lies between 0.25 (8192 in Q.15) to 0.5 (16384 in Q.15). If the maximum datum lies between 0.25 (8192 in Q.15) to 0.5 (16384 in Q.15), then the input array is normalized by some power of two that gives the maximum datum room for one bit of growth. In step 508, if the maximum datum does not lie between 0.25 (8192 in Q.15) to 0.5 (16384 in Q.15), then the input data is shifted to occupy MSBs. In step 510, the first stage data operation is performed. The data in the subsequent radix-2 stage increases by either zero or one bit. If there is no increase or only fractional increase occurs, then scaling operation is not performed.
In step 512, a maximum value of real or imaginary part is identified. In step 514, it is checked whether the maximum datum lies between 0.25 (8192 in Q.15) to 0.5 (16384 in Q.15). If the maximum datum does not lie between 0.25 (8192 in Q.15) to 0.5 (16384 in Q.15), then the input data is shifted to occupy all but 2 MSB's in step 516. If any real or imaginary data increases by one bit, then all values are scaled down by one bit to prepare for bit growth in a second stage 518. The data in the subsequent radix-2 stage then increases by either zero or one bit. If no increase or only fractional increase occurs, then scaling is not performed.
In step 520, it is checked whether any real or imaginary data increases by one bit from the maximum value of the real or imaginary outputs from the previous step. In step 522, it is checked whether the maximum datum lies between ⅛ (4096 in Q.15) and 0.25 (8192 in Q.15). If the maximum datum does not lie between ⅛ (4096 in Q.15) and 0.25 (8192 in Q.15), then the input data is shifted to occupy all but 2 MSBs in step 524. If the maximum datum lies between ⅛ (4096 in Q.15) and 0.25 (8192 in Q.15), then the log2N stages (one stage per loop) are performed in step 526.
The input data is scaled by some factor of two that allows for two bits of growth. In one embodiment, the maximum datum must lie between ⅛ (4096 in Q.15) and 0.25 (8192 in Q.15) to prevent overflow yet maximize the dynamic range and the block exponent of the output magnitude can be recovered. In step 528, scaling factors are recorded from each stage and it is checked whether it is the last stage. If it is the last stage, the total number of shifts (e.g., the block common exponent) is returned to allow the proper output magnitude to be recovered in step 530. Else, the step 520 is repeated.
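The range checks and shifts in the steps above can be sketched as a single helper that places the largest magnitude in a target Q.15 window. This is a simplified illustration assuming 16-bit integer data; the names `headroom_shift` and `apply_shift` and the `guard_bits` parameter are hypothetical, and the sketch covers only the normalization step, not the FFT stages.

```python
def headroom_shift(block, guard_bits: int = 2, width: int = 16) -> int:
    """Return the shift that leaves `guard_bits` bits of growth room.

    A positive result means shift left, a negative result means shift right.
    With guard_bits=2 and width=16 the peak lands in [4096, 8192), i.e.
    between 1/8 and 1/4 in Q.15; guard_bits=1 targets [8192, 16384).
    """
    peak = max(abs(v) for v in block)
    if peak == 0:
        return 0
    target_bits = width - 1 - guard_bits  # peak must stay below 2**target_bits
    return target_bits - peak.bit_length()


def apply_shift(block, shift: int):
    """Apply a signed power-of-two shift to every element of the block."""
    if shift >= 0:
        return [v << shift for v in block]
    return [v >> -shift for v in block]


data = [20000, -3]
shift = headroom_shift(data)      # -2: scale down by two bits
scaled = apply_shift(data, shift)  # [5000, -1]
```

Summing the shifts recorded at every stage gives a total that plays the role of the block common exponent returned at the last stage.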
With reference to
In step 604, the phase 2 is carried out using the arithmetic slot which is capable of embedding such scaling operations along with arithmetic functions. Finally, the phase when data is written out as the operation of Stage 1 and, the max value of this block of data that has to be determined are combined as part of phase 3 in step 606. The load-store unit performs the dynamic range monitoring as a part of the store operations embedded as part of stage 1. In step 608, the phase 4 the stage 2 is carried out using the arithmetic slot which is capable of embedding such scaling operations along with arithmetic functions. Finally, the phase when data is written out as the operation of stage 2 and the max value of this block of data that has to be determined are combined as part of Phase 5 in step 610.
In step 612, the phase 6 is carried out using the arithmetic slot which is capable of embedding such scaling operations along with arithmetic functions. Finally, the phase when data is written out as the operation of stage 3 and the max value of this block of data that has to be determined are combined as part of phase 7 in step 614. In step 616, it is checked whether it is the last stage. If it is the last stage, the total number of shifts (e.g., the block common exponent) is returned to allow the proper output magnitude to be recovered in step 618. Else, if it is not the last stage the step 614 is repeated. Note that, in all intermediate stages where the task of finding out, if the data is within a range like ¼<|Max|<½ or ⅛<|Max|<¼ is required, this is done in the arithmetic unit using the scaling factors found in the previous stages. Thus the block floating point FFT can be efficiently done on the CPU 200 without any overhead cycles.
The instruction set support for handling block floating point is implemented in an arithmetic execution slot. The arithmetic execution slot performs operations on both real and complex blocks of data or signals. It has 3 dedicated scaling registers (SCALEREG1, SCALEREG2 and SCALEREG3), which are selectable for any arithmetic operation. Each of the scaling registers has the following 3 fields, which can be used to pre-scale the sources or post-scale the final result. The most frequently used operation is post-scaling the result.
a) Dest_po (Bits 4-0) is used for post-scaling the result before writing it to the destination register.
b) Src2_pre (Bits 9-5) is used for pre-scaling the second source of an arithmetic operation.
c) Src1_pre (Bits 14-10) is used for pre-scaling the first source of an arithmetic operation.
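The field layout listed above can be sketched as a pair of pack and unpack helpers. The names `encode_scale_reg` and `decode_scale_reg` are hypothetical, and the fields are treated as raw unsigned 5-bit values; whether the hardware interprets them as signed shift counts is not stated here.

```python
FIELD_MASK = 0x1F          # each field is 5 bits wide

DEST_PO_LSB = 0            # Dest_po  occupies bits 4-0
SRC2_PRE_LSB = 5           # Src2_pre occupies bits 9-5
SRC1_PRE_LSB = 10          # Src1_pre occupies bits 14-10


def encode_scale_reg(src1_pre, src2_pre, dest_po):
    """Pack the three 5-bit fields into one scaling-register value."""
    return ((src1_pre & FIELD_MASK) << SRC1_PRE_LSB
            | (src2_pre & FIELD_MASK) << SRC2_PRE_LSB
            | (dest_po & FIELD_MASK) << DEST_PO_LSB)


def decode_scale_reg(value):
    """Unpack a scaling-register value into its three raw fields."""
    return {
        "src1_pre": (value >> SRC1_PRE_LSB) & FIELD_MASK,
        "src2_pre": (value >> SRC2_PRE_LSB) & FIELD_MASK,
        "dest_po": (value >> DEST_PO_LSB) & FIELD_MASK,
    }


reg = encode_scale_reg(src1_pre=3, src2_pre=1, dest_po=2)
fields = decode_scale_reg(reg)
```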
The following arithmetic operations are supported with 2 source operands and 1 destination operand. Each source operand can be pre-scaled using the fields Src1_pre and Src2_pre, respectively. In addition, the destination output can be post-scaled using the field Dest_po.
A) Complex Multiply and Complex Conjugate Multiply operations and their SIMD versions (2-way SIMD).
a) CMUL src1,src2,dest,#sc_offset
b) CNMUL src1,src2,dest,#sc_offset
c) CMUL2 [src1_o:src1_e],[src2_o:src2_e],[dest1_o:dest1_e], #sc_offset
a) RMUL src1,src2,dest,#sc_offset
b) RMUL2 src1,src2,dest,#sc_offset
c) RMUL4 [src1o:sr1e],[src2o:src2e],[dest_o:dest_e],#sc_offset
a) BTRT [src_o:src_e], TwiddleReg, dest_hi, dest_lo,#sc_offset
a) BTRF src1, src2, TwiddleReg, dest_lo_o:dest_hi_e,#sc_offset
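To make the pre-scaling and post-scaling of operands concrete, the sketch below models a complex multiply on integer (real, imaginary) pairs. It is only a behavioural illustration: the function `cmul_scaled` is hypothetical, all three scale fields are assumed to be arithmetic right-shift counts, and the meaning of the `#sc_offset` operand is not modelled because it is not described here.

```python
def cmul_scaled(a, b, src1_pre=0, src2_pre=0, dest_po=0):
    """Complex multiply with assumed right-shift pre- and post-scaling.

    a, b are (real, imag) integer pairs. Python's >> on a negative int
    is an arithmetic shift that rounds toward minus infinity.
    """
    ar, ai = a[0] >> src1_pre, a[1] >> src1_pre   # pre-scale first source
    br, bi = b[0] >> src2_pre, b[1] >> src2_pre   # pre-scale second source
    re = ar * br - ai * bi
    im = ar * bi + ai * br
    return re >> dest_po, im >> dest_po           # post-scale the result


unscaled = cmul_scaled((8, 4), (2, -6))
scaled = cmul_scaled((8, 4), (2, -6), src1_pre=1, src2_pre=0, dest_po=2)
```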
The output of the stage 0 1002A signal processing stage is stored in the last stage, and during this process the dynamic range is monitored. Since the dynamic range of the coefficients is known, the required scaling factor of the result can be easily computed based on the type of signal processing operation in stage 1 1002B. The outgoing write data of the input signal is snooped to detect the dynamic range. In one embodiment, a count of leading zeros and leading ones for each sub-word of the outgoing write data of the input signal is used to detect the dynamic range. The outgoing write data is latched in an operand fetch phase by writing to a memory of the CPU 200. The computed scaling factor 1006 (BFPScalingFactor) is then used for stage 1 1002B operations to maximally utilize the available arithmetic bit width. In one embodiment, it is determined whether stage 1 is a first stage.
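The leading-zero and leading-one count per sub-word can be sketched in software as follows. This is a model of the idea rather than of the hardware: the 32-bit store word holding two 16-bit two's-complement sub-words, and the names `redundant_sign_bits`, `snoop_store` and `block_headroom`, are assumptions for this example.

```python
def redundant_sign_bits(value, width=16):
    """Count leading bits that repeat the sign bit (the sign bit itself
    is not counted). Leading ones of a negative value are counted by
    inverting it and counting leading zeros."""
    mask = (1 << width) - 1
    value &= mask
    if value >> (width - 1):               # negative sub-word
        value = ~value & mask
    return width - 1 - value.bit_length()


def snoop_store(word, subword_width=16, word_width=32):
    """Smallest headroom among the sub-words of one outgoing store word."""
    mask = (1 << subword_width) - 1
    count = word_width // subword_width
    return min(
        redundant_sign_bits((word >> (i * subword_width)) & mask, subword_width)
        for i in range(count)
    )


def block_headroom(words):
    """Running minimum over every store word of a block."""
    return min(snoop_store(w) for w in words)


headroom = block_headroom([0x0001FFFF, 0x00100020])
```

The resulting headroom is the number of bits by which the whole block could be shifted left without overflow, which is one way a scaling factor for the next stage could be derived.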
A block floating point (BFP) scaling factor is obtained from a previous stage when the stage is not the first stage (e.g., stage 1 1002B). A new BFP scaling factor is computed for a second stage based on the dynamic range. An input data for the second stage is shifted using the new BFP scaling factor along with a load operation. Arithmetic scaling and residue scaling components of the new BFP scaling factor may be determined, and the BFP scaling factor is set to zero when the stage is the first stage. The new BFP scaling factor is computed based on the dynamic range that is detected by snooping an outgoing write data of the input signal and latching the outgoing write data in an operand fetch phase by writing on a memory of the DSP (e.g., the CPU 200 of
It may be determined whether the second stage is a last stage. If the second stage is the last stage, then arithmetic scaling and residue scaling components of the new BFP scaling factor are determined for the last stage. A residue exponent 1008 may be computed by scaling the residue scaling components from the first stage until the last stage. While storing the final outputs of stage 1 1002B, the store unit performs the dynamic range monitoring of the processed outputs. This process is then used iteratively across different stages up to stage N of subsequent processing. The different scale factors corresponding to maxexp values used at each stage (s1, s2, s3, . . . , sN) are used in the final stage to scale up the result with the exponent value 2^(s1+s2+s3+ . . . +sN) 1008.
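The accumulation of per-stage scale factors into a final exponent can be sketched as below. This is a simplified model under stated assumptions: the helper `bfp_pipeline` is hypothetical, each scale factor is a right-shift count chosen so the block fits a signed 16-bit word, and the shift is applied right after the stage that produced the data rather than on the next stage's load.

```python
def bfp_pipeline(block, stages, width=16):
    """Run integer stages, rescaling after each one, and return the
    mantissas together with the summed shift count s1 + s2 + ... + sN."""
    total_shift = 0
    for stage in stages:
        block = [stage(x) for x in block]
        peak = max(abs(x) for x in block)
        # Shifts needed so the peak fits in (width - 1) magnitude bits.
        s = max(0, peak.bit_length() - (width - 1))
        block = [x >> s for x in block]    # arithmetic shift, rounds down
        total_shift += s
    return block, total_shift


mantissas, exponent = bfp_pipeline(
    [20000, -15000],
    [lambda x: 3 * x, lambda x: 5 * x],    # two illustrative stages with growth
)
# Scale up once at the end by 2**(s1 + s2) to recover the output magnitude.
recovered = [m * 2 ** exponent for m in mantissas]
```

Here the exact results would be 300000 and -225000; the second recovered value differs slightly because low-order bits were discarded by the shifts.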
Using such dummy loads 1102, the dynamic range of the data sets can be identified and used subsequently to compute the required scaling factor using a scaling factor computation module 1110. Once the required scaling factor (BFPScalingFactor) is ascertained, it is used for subsequent stages of correlation processing 1104. For all stages, the dynamic range monitoring can be done with load or store operations. It is determined whether a stage is a first stage. In one embodiment, the BFP scaling factor is set to zero when the stage is the first stage. A block floating point (BFP) scaling factor is obtained from a previous stage when the stage is not the first stage.
A new BFP scaling factor is computed for a second stage based on the dynamic range. An output of the first stage is shifted and written on a memory addressed by a DSP (e.g., the CPU 200 of
Else, it is checked whether it is the last stage in step 1208. In one embodiment, it is checked whether the second stage is a last stage. If it is the last stage, the process is terminated. Else (if No), a specific signal processing operation for that stage is performed in step 1210. In one embodiment, arithmetic scaling and residue scaling components of the new BFP scaling factor for the second stage are determined. In step 1212, for the subsequent signal processing stages, dynamic range monitoring is performed in a hidden or interleaved manner with store or load processes to find a BFPScalingFactor for each stage. Such scaling factors are subsequently used either completely or partially with the arithmetic operations of the next stage, and the residue scaling factor can be used to scale up the final computed value in the chain of signal processing steps. In one embodiment, a residue exponent is computed by scaling the residue scaling components from the first stage until the last stage. These residue scaling components are scaled to get back the required native precision of the input signal. Dummy load operations on two data sets of the input signal may be performed, and a dynamic range of the two data sets is detected. A scaling factor may be computed based on the dynamic range of the two data sets of the input signal. The dynamic range is detected based on the least value of the count of the leading zeros and leading ones and the least value of the count of trailing zeros over the block of data.
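The block-level detection described in the last sentence above, combining the least leading count with the least trailing-zero count, can be sketched as follows. The 16-bit two's-complement word width and the names `leading_count`, `trailing_zeros` and `block_range` are assumptions made for illustration.

```python
def leading_count(value, width=16):
    """Leading zeros (positive) or leading ones (negative) after the
    sign bit of a two's-complement word."""
    mask = (1 << width) - 1
    value &= mask
    if value >> (width - 1):
        value = ~value & mask              # turn leading ones into zeros
    return width - 1 - value.bit_length()


def trailing_zeros(value, width=16):
    """Number of zero bits below the lowest set bit (width if value is 0)."""
    value &= (1 << width) - 1
    if value == 0:
        return width
    return (value & -value).bit_length() - 1   # isolate the lowest set bit


def block_range(block, width=16):
    """Least leading count and least trailing-zero count over a block."""
    lead = min(leading_count(x, width) for x in block)
    trail = min(trailing_zeros(x, width) for x in block)
    return lead, trail


lead, trail = block_range([0x0100, 0x0300, -0x0200])
```

Under these assumptions, `lead` bounds how far the block can be shifted up without overflow and `trail` bounds how far it can be shifted down without losing set bits.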
A user of the receiver 1300 may view this stored information on display 1306 and select an item for viewing, listening, or other uses via input, which may take the form of keypad, scroll, or other input device(s) or combinations thereof. When digital content is selected, the processor 1310 may pass information. The content and PSI/SI may be passed among functions within the receiver 1300 using bus 1304. In one embodiment, the CPU 200 is the same as the processor 1310.
The CPU 200 includes the dynamic range monitor block 316 that detects a dynamic range of the input signal while performing load and store operations in the CPU 200 of
The CPU 200 allows programmatically selecting the best dynamic range for different classes of input signals and corresponding scaling factors (e.g., L1 norm, L2 norm, Chebyshev norm, and Euclidean norm). The CPU 200 allows detecting the dynamic range in other slots such as the arithmetic slot and the logical function slot of
Further, in a multi-processor system with multiple DSPs (e.g., using more than one CPU 100) that have such Load-Store Units (e.g., the load store unit1 (LSU1) decode and operand fetch 206A, the load-store slot with dynamic range detection for unit1 206B, the load store unit2 (LSU2) decode and operand fetch 208A, and the load-store slot with dynamic range detection for unit2 208B for a memory store or load operation), it is possible to efficiently utilize the arithmetic data-path to maximize a Signal to Quantization Noise Ratio (SQNR). Alternatively, in such a scenario, for a given SQNR it is possible to use an arithmetic data-path with reduced precision. Thus, for a given target SQNR, it is possible to turn off bit-slices based on the required precision and save dynamic power dissipation. Further, the CPU 200 enables communicating the scale-up or scale-down factors required in a signal processing chain for optimally using the arithmetic resources.
The CPU 200 requires minimal interference from software for block floating point (BFP) DSP operations. The method of detecting dynamic range as discussed above can be used for all classes of signal processing operations, such as correlation and filtering operations including Finite Impulse Response (FIR) filtering, Infinite Impulse Response (IIR) filtering, interpolation, and sample rate conversion filtering, and is not limited to the fast Fourier transform (FFT). Further, for a fixed-width arithmetic data-path, this method of detecting the dynamic range makes it possible to maximize the Signal to Quantization Noise Ratio (SQNR). For a fixed Signal to Quantization Noise Ratio, this method allows using the minimum arithmetic data-path width and thus reduces the power dissipation.
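The SQNR benefit for a fixed data-path width can be illustrated numerically. The sketch below quantizes a low-amplitude sine wave with 7 fractional bits, once directly and once after scaling the block by a common power of two. The test signal, the bit width, and the helpers `quantize` and `sqnr_db` are arbitrary choices for this example and are not taken from the embodiments.

```python
import math


def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (no saturation)."""
    step = 2.0 ** -frac_bits
    return round(x / step) * step


def sqnr_db(signal, approx):
    """Signal to quantization noise ratio in decibels."""
    sig = sum(x * x for x in signal)
    err = sum((x - y) ** 2 for x, y in zip(signal, approx))
    return 10.0 * math.log10(sig / err)


FRAC_BITS = 7                              # fixed arithmetic precision
signal = [0.01 * math.sin(2 * math.pi * 0.0371 * n) for n in range(256)]

# Fixed point without block scaling: most of the word is unused.
fixed = [quantize(x, FRAC_BITS) for x in signal]

# Block floating point: scale the block so its peak lies in [0.5, 1).
peak = max(abs(x) for x in signal)
block_exp = -math.frexp(peak)[1]           # common exponent of the block
bfp = [quantize(x * 2.0 ** block_exp, FRAC_BITS) * 2.0 ** -block_exp
       for x in signal]

gain_db = sqnr_db(signal, bfp) - sqnr_db(signal, fixed)
```

Each bit of recovered headroom is worth roughly 6 dB of SQNR under the usual uniform quantization noise model, so the same comparison can be read the other way: for a target SQNR, a block-scaled data-path can be narrower.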
The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
1510/CHE/2011 | May 2011 | IN | national |