A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
1. Field of the Invention
This invention relates generally to data processing, and more particularly to the processing of algorithms in software that benefit from efficient implementation of “butterfly” operations such as, for example, those used in Fast Fourier Transform (FFT) calculations.
2. Description of Related Technology
The fast Fourier transform (FFT) is a commonly used algorithm that efficiently converts from a time domain to a frequency domain representation of a signal. Uses include spectral analysis, signal compression and filtering. The number of cycles taken to perform an FFT on a particular processor is commonly quoted as a measure of that processor's efficiency.
At the heart of the aforementioned FFT calculation is a sequence of operations commonly known as a “butterfly”. The calculation has three inputs (A0, B0, and the so-called “twiddle factor”), and two outputs (A1, B1), as shown diagrammatically in
Accordingly, an FFT algorithm can be coded using a triple nested loop of the type well known in the programming arts. The outer loop of the triple nested loop repeats log2(N) times. In the case of the 8-point example of
The multiply (MUL) and multiply accumulate (MAC, XMAC) operations of the foregoing code example take longer than normal instructions to complete. This means that the processor will stall (i.e. new instructions will not enter the pipeline) for up to 3 cycles (24-bit case) or 2 cycles (16-bit case) under certain conditions. Hence, a single butterfly calculation may take 8 or more cycles to complete. This is a less than optimal situation from a performance standpoint.
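The butterfly and the triple nested loop described above can be illustrated in software. The following is a minimal Python sketch of a textbook radix-2 decimation-in-time FFT, not the assembly listing this description refers to. The function names (`butterfly`, `bit_reverse_permute`, `fft_dit`) and the use of floating-point complex arithmetic are assumptions made purely for illustration.

```python
import cmath


def butterfly(a0, b0, w):
    """Radix-2 DIT butterfly: inputs A0, B0 and twiddle factor W; outputs A1, B1."""
    t = w * b0
    return a0 + t, a0 - t


def bit_reverse_permute(x):
    """Reorder the input so that the in-place DIT passes yield natural-order output."""
    n = len(x)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i, v in enumerate(x):
        r = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[r] = v
    return out


def fft_dit(x):
    """Illustrative N-point FFT (N a power of two) built from a triple nested loop."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = bit_reverse_permute([complex(v) for v in x])
    span = 1
    while span < n:                              # outer loop: log2(N) passes
        for start in range(0, n, 2 * span):      # middle loop: groups in this pass
            for k in range(span):                # inner loop: butterflies in a group
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))
                i, j = start + k, start + k + span
                data[i], data[j] = butterfly(data[i], data[j], w)
        span *= 2
    return data
```

Each pass executes N/2 butterflies, so the innermost statement runs (N/2)*log2(N) times in total. This is why the cycle count of a single butterfly dominates the cost of the transform.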
Based on the foregoing, there is a need to provide an improved configuration adapted to reduce the computation time (and particularly, the number of cycles used) for executing the butterfly operations in software. Such reduced computation time would be provided without reducing the maximum clock speed or otherwise utilizing multi-operand instruction slot(s). This improved configuration would also be readily implemented in existing processor instruction set architectures (ISAs) so as to minimize the changes necessary thereto. Furthermore, this improved configuration would ideally be adapted to utilize silicon-efficient hardware (including memory), thereby keeping the size of the processor to a minimum.
The present invention satisfies the aforementioned needs by providing an automated data processor with enhanced instruction execution, and methods associated therewith.
In a first aspect of the invention, an improved configurable, extensible data processor is provided which is adapted to perform processing of algorithms or other operations with a reduced cycle count. In one exemplary embodiment, the processor comprises an extensible reduced instruction set (RISC) processor core which incorporates an extension arithmetic logic unit (ALU) and a 32-bit instruction word and registers. The ALU may be adapted to multiply two 16-bit words, two 24-bit words, or two lots of two 16-bit words (thereby generating two results), depending on the configuration desired by the designer. The instruction set of the processor core further includes a specialized extension instruction adapted for performing “butterfly” calculations associated with fast Fourier transforms (FFTs), such that the cycle count for performing these calculations is greatly reduced over prior art approaches without reducing the maximum clock speed or otherwise utilizing multi-operand instruction slot(s). The specialized instruction is advantageously linked with existing multiply-accumulate instruction circuitry, and has the same latency as the other such instructions. The “dual” 16-bit embodiment of the ALU, coupled with certain hardware modifications, allows further improvement in the overall cycle count by reducing the FFT butterfly operation to three cycles.
In a second aspect of the invention, an improved method of performing a loop calculation within a multi-cycle iterative calculation (such as the aforementioned FFT butterfly) on an extensible processor is disclosed. The method generally comprises providing at least one multiply-accumulate stage having at least one accumulator associated therewith; providing at least one extension instruction (e.g., “FBF”) within the instruction set of the processor, the at least one extension instruction being adapted to (i) subtract a value present in the at least one accumulator from a multiple of a first input value, and (ii) preload the at least one accumulator with a second input value; and writing back the result of the aforementioned subtraction operation to a designated register location using existing pipeline multiply/multiply-accumulate logic. In one exemplary embodiment of the method, the extension instruction comprises a two-operand instruction.
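The arithmetic behind such an extension instruction can be sketched in software. The model below is a hypothetical Python illustration only: the class `Accumulator`, its methods `mac` and `fbf`, the operand ordering, and the use of floating-point values are all assumptions, and it does not represent the hardware or the instruction encoding. It assumes that the "multiple" of the first input is two, and relies on the identity A0 − W·B0 = 2·A0 − (A0 + W·B0).

```python
class Accumulator:
    """Hypothetical software model of a multiply-accumulate accumulator."""

    def __init__(self, preload=0.0):
        self.value = preload

    def mac(self, x, y):
        """Multiply-accumulate: add x*y to the accumulator and return it."""
        self.value += x * y
        return self.value

    def fbf(self, first, second):
        """Assumed FBF behaviour: return 2*first - accumulator, then
        preload the accumulator with `second`."""
        result = 2 * first - self.value
        self.value = second
        return result


def butterfly_with_fbf(a0, b0, w):
    """Compute A0 + W*B0 and A0 - W*B0 one real/imaginary part at a time."""
    acc = Accumulator(preload=a0.real)        # in a loop, a prior FBF would do this
    acc.mac(w.real, b0.real)
    sum_re = acc.mac(-w.imag, b0.imag)        # acc = Re(A0 + W*B0)
    diff_re = acc.fbf(a0.real, a0.imag)       # Re(A0 - W*B0); preload Im(A0)
    acc.mac(w.real, b0.imag)
    sum_im = acc.mac(w.imag, b0.real)         # acc = Im(A0 + W*B0)
    diff_im = acc.fbf(a0.imag, 0.0)           # a loop would preload the next A0 here
    return complex(sum_re, sum_im), complex(diff_re, diff_im)
```

In this sketch the second butterfly output costs no additional multiply, and the accumulator is already primed for the next partial result when the subtraction completes.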
In a third aspect of the invention, an improved method of performing a multi-cycle calculation, such as the aforementioned FFT butterfly, within an extensible processor is disclosed. The method generally comprises providing at least one multiply-accumulate stage having at least one accumulator associated therewith; providing at least one extension instruction within the instruction set of the processor, the at least one extension instruction being adapted to perform loop calculations, and pipelined to the same number of stages as the extension multiply-accumulate (XMAC) instructions of the instruction set so as to avoid stalling of the pipeline during execution; providing a plurality of inputs to said processor; and executing the extension instruction repeatedly to produce the desired result of the multi-cycle calculation in a minimum number of cycles using the at least one accumulator.
In a fourth aspect of the invention, an improved method of manufacturing a processor adapted for performing multi-cycle calculations is disclosed. In one exemplary embodiment, the multi-cycle calculation comprises an FFT butterfly, and the method comprises writing an extension instruction (e.g., FBF instruction) in a hardware description language (HDL); adding the extension instruction to the design of an extended data processor; synthesizing the extended processor design; and generating a software function incorporating the aforementioned instruction.
In a fifth aspect of the invention, an improved accumulator data path configuration used in an extended data processor for performing reduced cycle count operations is disclosed. In one exemplary embodiment associated with “single” 16-bit or 24-bit words, the accumulator configuration comprises an accumulator register; an input multiplexer having a pre-load input; accumulator saturation detection logic; additional subtraction logic, and an output multiplexer. In a second embodiment adapted for use with the aforementioned “dual” 16-bit words, the accumulator configuration further includes a second register adapted to store a previous input value for input during a subsequent multiply operation, and logic which facilitates swapping of the two multiply results during execution of a multiply-accumulate instruction following the “FBF” extension instruction.
In a sixth aspect of the invention, an improved method of synthesizing the design of an integrated circuit incorporating the aforementioned enhanced processing capability is disclosed. In one exemplary embodiment, the method comprises obtaining user input regarding the design configuration; creating a customized HDL functional block description based on the user input and existing libraries of functions; determining a design hierarchy based on the user input and existing libraries; running a makefile to create the structural HDL and script; running the script to create a makefile for the simulator and a synthesis script; and synthesizing and/or simulating the design from the simulation makefile or synthesis script, respectively.
In a seventh aspect of the invention, an improved computer program useful for synthesizing processor designs and embodying the aforementioned enhanced processing capability is disclosed. In one exemplary embodiment, the computer program comprises an object code representation stored on the magnetic storage device of a microcomputer, and adapted to run on the central processing unit thereof. The computer program further comprises an interactive, menu-driven graphical user interface (GUI), thereby facilitating ease of use.
In an eighth aspect of the invention, an improved apparatus for running the aforementioned computer program used for synthesizing gate logic associated with the aforementioned algorithm processing functionality is disclosed. In one exemplary embodiment, the system comprises a stand-alone microcomputer system having a display, central processing unit, data storage device(s), and input device.
a is a logical diagram illustrating the calculation of the decimation-in-time (DIT) fast Fourier transform (FFT) “butterfly.”
b is a logical diagram illustrating the grouping of FFT butterfly operations in an exemplary eight-point radix 2 DIT FFT calculation.
Reference is now made to the drawings wherein like numerals refer to like parts throughout.
As used herein, the term “processor” is meant to include any integrated circuit or other electronic device capable of performing an operation on at least one instruction word including, without limitation, reduced instruction set core (RISC) processors such as the ARCtangent™ (“Tangent”) and ARCompact™ (“Compact”) user-configurable cores manufactured by the Assignee hereof, central processing units (CPUs), and digital signal processors (DSPs). The hardware of such devices may be integrated onto a single piece of silicon (“die”), or distributed among two or more die. Furthermore, various functional aspects of the processor may be implemented solely as software or firmware associated with the processor.
Also as used herein, the terms “extension” and “extensible” refer generally to processor configurations having an additional or extended instruction (or instruction set) adapted to perform specific operations within the processor, such as FFT or Viterbi decode metric calculations. Such an extended instruction (or set) may be user-configurable, or alternatively may be incorporated into the base instruction set of the processor. Hence, these terms are in no way meant to be limiting of the configurability (or lack thereof) of a particular instruction or the instruction set as a whole, but rather merely connote instructions adapted for one or more particular purposes.
Additionally, it will be recognized by those of ordinary skill in the art that the term “stage” as used herein refers to various successive stages within a pipelined processor; i.e., stage 1 refers to the first pipelined stage, stage 2 to the second pipelined stage, and so forth. Such pipeline stages may include, for example, instruction fetch, decode, execute, and writeback stages as is well known in the art.
It is also noted that while the following description is cast in terms of VHSIC hardware description language (VHDL), other hardware description languages (HDL) such as Verilog® may be used to describe various embodiments of the invention with equal success. Furthermore, while an exemplary Synopsys® synthesis engine such as the Design Compiler 2000.05 (DC00) is used to synthesize the various embodiments set forth herein, other synthesis engines such as Buildgates® available from, inter alia, Cadence Design Systems, Inc., may be used. IEEE std. 1076.3-1997, IEEE Standard VHDL Synthesis Packages, describes an industry-accepted language for specifying a Hardware Description Language-based design and the synthesis capabilities that may be expected to be available to one of ordinary skill in the art.
Overview
The ARCtangent processor is a user-customizable 32-bit RISC core for ASIC, system-on-chip (SoC), and FPGA integration. It is synthesizable, configurable, and extendable, thus allowing developers to modify and extend the architecture to better suit specific applications. The ARCtangent microprocessor comprises a 32-bit RISC architecture with a four-stage execution pipeline. The instruction set, register file, condition codes, caches, buses, and other architectural features are user-configurable and extensible. It has a 32×32-bit core register file, which can be doubled if required by the application. Additionally, it is possible to use a large number of auxiliary registers (up to 2^32). The functional elements of the core of this processor include the arithmetic logic unit (ALU), register file (e.g., 32×32), program counter (PC), instruction fetch (i-fetch) interface logic, as well as various stage latches.
ARCompact™ is an innovative instruction set architecture (ISA) that allows designers to mix 16 and 32-bit instructions on its 32-bit user-configurable processor. The key benefit of the ISA is the ability to cut memory requirements on a SoC (system-on-chip) by significant percentages, resulting in lower power consumption and lower cost devices in deeply embedded applications such as wireless communications and high volume consumer electronics products.
The main features of the ARCompact ISA include new 32-bit instructions aimed at providing better code density, a new set of 16-bit instructions for the most commonly used operations, and freeform mixing of 16- and 32-bit instructions without a mode switch—significant because it reduces the complexity of compiler usage compared to competing mode-switching architectures. The ARCompact instruction set expands the number of custom extension instructions that users can add to the base-case ARCtangent™ processor instruction set. The existing processor architecture already allows users to add as many as 69 new instructions to speed up critical routines and algorithms. With the ARCompact ISA, users can add as many as 256 new instructions. Users can also add new core registers, auxiliary registers, and condition codes. The ARCompact ISA thus maintains and expands the user-customizable features of ARC's configurable processor technology.
As 32-bit architectures become more widely used in deeply embedded systems, code density can have a direct impact on system cost. Typically, a very high percentage of the silicon area of a system-on-chip (SoC) is taken up by memory.
The ARCompact ISA delivers high density code helping to significantly reduce the memory required for the embedded application, a vital factor for high-volume consumer applications, such as flash memory cards. In addition, by fitting code into a smaller memory area, the processor potentially has to make fewer memory accesses. This can cut power consumption and extend battery life for portable devices such as MP3 players, digital cameras and wireless handsets. Additionally, the new, shorter instructions can improve system throughput by executing in a single clock cycle some operations previously requiring two or more instructions. This can boost application performance without having to run the processor at higher clock frequencies.
The support for freeform use of 16 and 32-bit instructions allows compilers and programmers to use the most suitable instructions for a given task, without any need for specific code partitioning or system mode management. Direct replacement of 32-bit instructions with new 16-bit instructions provides an immediate code density benefit, which can be realized at an individual instruction level throughout the application. As the compiler is not required to restructure the code, greater scope for optimizations is provided, over a larger range of instructions. Application debugging is more intuitive because the newly generated code follows the structure of the original source code.
Detailed description of the improved ISA used in, e.g., the ARCompact core is provided in co-pending U.S. Provisional Patent Application Serial No. 60/353,647 entitled “Configurable Data Processor With Multi-Length Instruction Set Architecture” filed Jan. 31, 2002, commonly owned by the Assignee hereof, and incorporated herein by reference in its entirety.
The extensibility of the aforementioned Tangent and Compact processor cores manufactured by the Assignee hereof allows them to be customized for particular applications. Applications involving FFTs can greatly benefit from such customization. When employing a user-configurable extensible data processor, the end application is known at the time of design/synthesis, and the user configuring the processor can apply the methods of the present invention to produce enhanced calculation capability and processor efficiency. The user can also advantageously configure the processor appropriately so that only the hardware resources required to perform the function are included, resulting in an architecture that is significantly more silicon efficient than fixed architecture digital signal processors (DSPs). For example, users of the present invention can produce a small gate count (<50K gates) data processor capable of performing rapid FFT butterfly calculations. The methodology disclosed herein extends to the compiler, instruction set simulator, and verification strategy.
However, it will be recognized that while the following description is cast in terms of the extensible Tangent/Compact processor cores, other processor types and configurations (extensible or otherwise) may be adapted to incorporate the various aspects of the invention as described herein.
Apparatus and Methods For Iterative Calculation
The enhanced performance provided by the present invention enables the device to calculate the FFT butterfly significantly faster than the same device not so equipped. Accordingly, devices fitted with the enhanced functionality of the invention can perform FFT butterfly calculations using fewer clock cycles, thereby allowing either (i) higher data rates with a fixed clock frequency; or (ii) lower power consumption via reduction of the clock frequency required to perform the coding operation. This type of improvement can have a profound beneficial impact on, for instance, the power dissipation of an integrated device such as an audio encoder, decoder, or test chip (such as a device using MPEG or Dolby™ processes), or a DSL modem chip.
The butterfly calculation is performed in the innermost loop of the FFT computation. The iteration therefore has to execute (N/2)*log2(N) times, where N is the number of points on which the FFT calculation operates. Accordingly, every extra cycle taken in performing the butterfly operation has a significant impact on the overall performance of the FFT. For instance, a 256-point FFT (i.e., N=256) that would require 15498 cycles in the 16-bit case would require 16522 cycles in the 24-bit case. The present invention, by reducing the number of cycles required on the inner loop, reduces the total cycle count. With the 16-bit multiplier, the application of the invention saves two cycles (or 2×1024=2048 total). With the 24-bit multiplier, the invention saves three cycles (or 3×1024=3072 total). With the dual 16-bit variant of the invention, three cycles are saved; however, if the requirement that the processor not stall when a mul or mac instruction writes back to a window register is imposed, a total of seven (7) cycles are saved.
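The arithmetic in the preceding paragraph can be checked with a few lines of Python. This is an illustrative sketch only; the helper names `butterfly_count` and `cycles_saved` are hypothetical and are not part of the disclosure.

```python
def butterfly_count(n_points):
    """Number of radix-2 butterflies in an N-point FFT: (N/2) * log2(N)."""
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two >= 2")
    stages = n_points.bit_length() - 1      # log2(N) for a power of two
    return (n_points // 2) * stages


def cycles_saved(n_points, cycles_per_butterfly):
    """Total cycles saved when each butterfly becomes cheaper by a fixed amount."""
    return butterfly_count(n_points) * cycles_per_butterfly


print(butterfly_count(256))    # 1024 butterflies for N = 256
print(cycles_saved(256, 2))    # 2048 cycles for a two-cycle saving
print(16522 - 15498)           # 1024: the quoted totals differ by one cycle per butterfly
```

For N=256 there are 128×8=1024 butterflies, so each cycle removed from the inner loop removes 1024 cycles from the transform.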
In the present invention, iterative calculations such as the FFT butterfly are significantly accelerated by: (i) removing the need for the aforementioned pipeline stall (described with respect to
The 16-bit embodiment advantageously comprises a two-operand instruction, which obviates the use of one of the limited number of remaining three-operand instruction slots. At the appropriate point in the program, the FBF instruction simultaneously (i) subtracts the accumulator from twice the part of the A0 value to yield the A1 value; and (ii) preloads the accumulator with the other part of the next A0 value. See
Additionally, the dual 16-bit embodiment of the invention (
It is also noted that the present invention also contemplates the storage of one or more “key” algorithms, such as the FFT algorithms described above, in the instruction cache (not shown) or other local memory for ready access by the core. Additionally, the input data associated with these algorithms is readily contained in local static RAM (SRAM) or other storage device, thereby further reducing latency associated with read/write operations of such data.
Method of Generating a Processor Design Adapted for Iterative Calculation
Referring now to
As shown in
Next, in step 906, the extended data processor is synthesized using, e.g., an HDL compiler, to a target silicon technology such as a RISC processor, application specific integrated circuit (ASIC), or a field programmable gate array (FPGA).
A software function is then written (step 908) that uses the aforementioned FBF instruction. Appendices I–III hereto provide exemplary pseudo-code implementing the 16-bit, 24-bit, and dual 16-bit embodiments of the FBF instruction, respectively.
Exemplary Method of Synthesizing
Referring now to
While the following description is presented in terms of an algorithm or computer program running on a microcomputer or other similar processing device, it can be appreciated that other hardware environments (including for example minicomputers, workstations, networked computers, “supercomputers”, and mainframes) may be used to practice the method. Additionally, one or more portions of the computer program may be embodied in hardware or firmware as opposed to software if desired, such alternate embodiments being well within the skill of the computer artisan.
Initially, user input is obtained regarding the design configuration in the first step 1002. Specifically, desired modules or functions for the design are selected by the user, and instructions relating to the design are added, subtracted, or generated as necessary. For example, in signal processing applications, it is often advantageous for CPUs to include a single or even multiple “multiply and accumulate” (MAC) instructions. In the present invention, the instruction set of the synthesized design is modified so as to incorporate the foregoing FBF instruction and associated accumulator logic (or other comparable calculation acceleration functionality) therein.
The technology library location for each VHDL file is also defined by the user in step 1002. The technology library files in the present invention store all of the information related to cells necessary for the synthesis process, including for example logical function, input/output timing, and any associated constraints. In the present invention, each user can define his/her own library name and location(s), thereby adding further flexibility.
Next, in step 1003, the user creates customized HDL functional blocks based on the user's input and the existing library of functions specified in step 1002.
In step 1004, the design hierarchy is determined based on user input and the aforementioned library files. A hierarchy file, new library file, and makefile are subsequently generated based on the design hierarchy. The term “makefile” as used herein refers to the commonly used UNIX makefile function or similar function of a computer system well known to those of skill in the computer programming arts. The makefile function causes other programs or algorithms resident in the computer system to be executed in the specified order. In addition, it further specifies the names or locations of data files and other information necessary to the successful operation of the specified programs. It is noted, however, that the invention disclosed herein may utilize file structures other than the “makefile” type to produce the desired functionality.
In one embodiment of the makefile generation process of the present invention, the user is interactively asked via display prompts to input information relating to the desired design such as the type of “build” (e.g., overall device or system configuration), width of the external memory system data bus, different types of extensions such as the FBF extensions described previously herein, cache type/size, etc. Many other configurations and sources of input information may be used, however, consistent with the invention.
In step 1006, the user runs the makefile generated in step 1004 to create the structural HDL. This structural HDL ties the discrete functional blocks in the design together so as to make a complete design.
Next, in step 1008, the script generated in step 1006 is run to create a makefile for the simulator. The user also runs the script to generate a synthesis script in step 1008.
At this point in the program, a decision is made whether to synthesize or simulate the design (step 1010). If simulation is chosen, the user runs the simulation using the generated design and simulation makefile (and user program) in step 1012. Alternatively, if synthesis is chosen, the user runs the synthesis using the synthesis script(s) and generated design in step 1014. After completion of the synthesis/simulation scripts, the adequacy of the design is evaluated in step 1016. For example, a synthesis engine may create a specific physical layout of the design that meets the performance criteria of the overall design process yet does not meet the die size requirements. In this case, the designer will make changes to the control files, libraries, or other elements that can affect the die size. The resulting set of design information is then used to re-run the synthesis script.
If the generated design is acceptable, the design process is completed. If the design is not acceptable, the process steps beginning with step 1002 are re-performed until an acceptable design is achieved. In this fashion, the method 1000 is iterative.
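The iterative character of the method 1000 can be summarized as a simple control loop. The Python sketch below is a hypothetical illustration only: the function `run_design_flow`, its callable parameters, the `max_passes` limit, and the gate-count figures in the usage example are all assumptions introduced here, and the sketch does not represent the actual design tool.

```python
def run_design_flow(get_config, generate, simulate, synthesize, is_acceptable,
                    max_passes=10):
    """Repeat configure -> generate -> simulate/synthesize -> evaluate until accepted."""
    for pass_number in range(1, max_passes + 1):
        config = get_config(pass_number)        # steps 1002-1003: user input, blocks
        design = generate(config)               # steps 1004-1008: hierarchy, makefile, scripts
        if config["mode"] == "simulate":        # step 1010: choose a path
            report = simulate(design)           # step 1012
        else:
            report = synthesize(design)         # step 1014
        if is_acceptable(report):               # step 1016: evaluate adequacy
            return design, pass_number
    raise RuntimeError("no acceptable design within max_passes")


# Usage with made-up stand-ins: shrink a cache until a gate budget is met.
design, passes = run_design_flow(
    get_config=lambda p: {"mode": "synthesize", "cache_kb": 16 // p},
    generate=lambda cfg: {"gates": 40000 + 1000 * cfg["cache_kb"]},
    simulate=lambda d: d,
    synthesize=lambda d: d,
    is_acceptable=lambda report: report["gates"] < 50000,
)
print(passes, design["gates"])
```

In the usage example the first pass exceeds the assumed gate budget, so the configuration is revised and the flow is run again, mirroring the return to step 1002 described above.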
It will be appreciated by one skilled in the art that the processor of
It is also noted that many IC designs currently use a microprocessor core and a DSP core. The DSP however, might only be required for a limited number of DSP functions, or for the IC's fast DMA architecture. The invention disclosed herein can support many DSP instruction functions, and its fast local memory system advantageously provides immediate access to data. Appreciable cost savings may be realized by using the methods disclosed herein for both the CPU & DSP functions of the IC.
Additionally, it will be noted that the methodology (and associated computer program) as previously described herein can readily be adapted to newer manufacturing technologies, such as 0.18 or 0.1 micron processes, with a comparatively simple re-synthesis instead of the lengthy and expensive process typically required to adapt such technologies using “hard” macro prior art systems.
System for Synthesizing
Referring now to
It will be recognized that while certain aspects of the invention have been described in terms of a specific sequence of steps of a method, these descriptions are only illustrative of the broader methods of the invention, and may be modified as required by the particular application. Certain steps may be rendered unnecessary or optional under certain circumstances. Additionally, certain steps or functionality may be added to the disclosed embodiments, or the order of performance of two or more steps permuted. All such variations are considered to be encompassed within the invention disclosed and claimed herein.
While the above detailed description has shown, described, and pointed out novel features of the invention as applied to various embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the device or process illustrated may be made by those skilled in the art without departing from the invention. The foregoing description is of the best mode presently contemplated of carrying out the invention. This description is in no way meant to be limiting, but rather should be taken as illustrative of the general principles of the invention. The scope of the invention should be determined with reference to the claims.
The present application claims priority to U.S. Provisional Patent Application Ser. No. 60/285,456, entitled “Data Processor With Enhanced Instruction Execution and Method” filed Apr. 19, 2001.
Number | Date | Country | |
---|---|---|---|
20020194236 A1 | Dec 2002 | US |
Number | Date | Country | |
---|---|---|---|
60285456 | Apr 2001 | US |