The present invention relates generally to the field of digital data processors, and more particularly to multithreading and pipelining techniques for use in a digital signal processor (DSP) or other type of digital data processor.
Pipelining is a well-known processor implementation technique whereby multiple instructions are overlapped in execution. Conventional pipelining techniques are described in, for example, John L. Hennessy and David A. Patterson, “Computer Architecture: A Quantitative Approach,” Third Edition, Morgan Kaufmann Publishers, Inc., San Francisco, Calif., 2003.
In the first stage (IF) instructions are fetched from memory and decoded. In the second stage (RD) the operands are read from the register file. In the third stage (EX) the addition is performed. Finally, in the fourth stage (WB) the results are written back into the register file at location r0. When the addi instruction has completed, the next instruction muli is started. The muli instruction performs a multiplication of the contents of register r0 and an immediate value 4, and stores the result in register r8.
In single-threaded processors, a common method for reducing pipeline bubbles is known as bypassing, whereby instead of writing the computed value back to the register file in the WB stage, the result is forwarded directly to the processor execution unit that requires it. This reduces but does not eliminate bubbles in deeply pipelined machines. Also, it generally requires dependency checking and bypassing hardware, which unduly increases processor cost and complexity.
It is also possible to reduce pipeline stalls through the use of multithreading. Multithreaded processors are processors that support simultaneous execution of multiple distinct instruction sequences or “threads.” Conventional threading techniques are described in, for example, M. J. Flynn, “Computer Architecture: Pipelined and Parallel Processor Design,” Jones and Bartlett Publishers, Boston, Mass., 1995, and G. A. Blaauw and Frederick P. Brooks, “Computer Architecture: Concepts and Evolution,” Addison-Wesley, Reading, Mass., 1997, both of which are incorporated by reference herein.
However, these and other conventional approaches generally do not allow multiple concurrent pipelines per thread, nor do they support pipeline shifting.
Accordingly, techniques are needed which can provide improved pipelining in a multithreaded digital data processor.
The present invention in an illustrative embodiment provides a multithreaded processor which advantageously allows multiple concurrent pipelines per thread, and also supports pipeline shifting.
In accordance with one aspect of the invention, a multithreaded processor comprises a plurality of hardware thread units, an instruction decoder coupled to the thread units for decoding instructions received therefrom, and a plurality of execution units for executing the decoded instructions. The multithreaded processor is configured for controlling an instruction issuance sequence for threads associated with respective ones of the hardware thread units. On a given processor clock cycle, only a designated one of the threads is permitted to issue one or more instructions, but the designated thread that is permitted to issue instructions varies over a plurality of clock cycles in accordance with the instruction issuance sequence. The instructions are pipelined in a manner which permits at least a given one of the threads to support multiple concurrent instruction pipelines.
In the illustrative embodiment, the instruction issuance sequence is determined using a token triggered threading approach. More specifically, in an arrangement in which the processor supports N threads, over a sequence of N consecutive processor clock cycles each of the N threads is permitted to issue instructions on only a corresponding one of the N consecutive processor clock cycles.
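By way of illustrative example only, the issue rule just described may be sketched in Python as follows. The sketch assumes the simplest possible sequence, a fixed round-robin order with zero-based cycle and thread numbering; the function names `issuing_thread` and `issue_schedule` are hypothetical and are not part of the described processor.

```python
def issuing_thread(cycle: int, num_threads: int) -> int:
    """Return the thread permitted to issue on a given clock cycle.

    Assumes a fixed round-robin sequence with zero-based cycle and
    thread numbering (an illustrative simplification).
    """
    if num_threads <= 0:
        raise ValueError("num_threads must be positive")
    return cycle % num_threads


def issue_schedule(num_cycles: int, num_threads: int) -> list:
    """List the issuing thread for each of the first num_cycles cycles."""
    return [issuing_thread(c, num_threads) for c in range(num_cycles)]
```

Under these assumptions, any window of N consecutive cycles contains each of the N threads exactly once, which is the property stated above.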
The illustrative embodiment allows each of the threads to issue up to three instructions on its corresponding one of the processor clock cycles. The instructions are pipelined such that at least five separate instruction pipelines can be concurrently executing for different ones of the threads.
The pipelined instructions in the illustrative embodiment include a load/store instruction, an arithmetic logic unit instruction, an integer multiplication instruction, a vector multiplication instruction, and a vector multiplication and reduction instruction.
In accordance with another aspect of the invention, the vector multiplication and reduction instruction is pipelined using a number of stages which is greater than a total number of threads of the processor. For example, the vector multiplication and reduction instruction may comprise a pipeline with at least eleven stages, including an instruction decode stage, a vector register file read stage, at least two multiply stages, at least two add stages, an accumulator read stage, a plurality of reduction stages, and an accumulator writeback stage. The accumulator read stage may be combined with another of the stages, such as an add stage. Pipelines for respective vector multiplication and reduction instructions may be shifted relative to one another by a plurality of pipeline stages.
The present invention in the illustrative embodiment provides a number of significant advantages over conventional techniques. For example, a higher degree of concurrency is provided than that achievable using conventional approaches. Also, the need for dependency checking and bypassing hardware is eliminated, since computation results are guaranteed to be written back to the appropriate register file before they are needed by the next instruction from the same thread. Furthermore, the techniques help to limit processor power consumption.
The teachings of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
The present invention will be described in the context of an exemplary multithreaded processor. It should be understood, however, that the invention does not require the particular arrangements shown, and can be implemented using other types of digital data processors and associated processing circuitry.
A given processor as described herein may be implemented in the form of one or more integrated circuits.
The present invention in an illustrative embodiment provides a pipelining technique suitable for use in a multithreaded processor. With this technique, multiple instructions from multiple threads can be concurrently executed in an efficient manner. As will be described in greater detail below, the illustrative embodiment uses variable length execution pipelines, staggered execution, and rotated start execution, to provide concurrent execution while maintaining low power operation. The illustrative embodiment provides a higher degree of concurrency than that achievable using conventional approaches.
In this example, an integer add instruction addi r0, r2, 8 is initially issued by a first one of the contexts on a first clock cycle. The other two contexts issue instructions on respective subsequent clock cycles. It takes a total of three clock cycles for each of the contexts to issue an instruction. On a fourth clock cycle, the first context issues another instruction, namely an integer multiplication instruction muli r8, r0, 4.
More specifically, in cycle 1, the IF stage of thread 1 is executed for the addi instruction. In cycle 2, the IF stage of thread 2 executes while at the same time the RD stage of thread 1 executes. In cycle 3, the IF stage of thread 3 executes, the RD stage of thread 2 executes, and the EX stage of thread 1 executes. In cycle 4, the IF stage of thread 1 of the muli instruction executes concurrently with the WB stage of the addi instruction. Simultaneously, the EX stage of thread 2 executes and the RD stage of thread 3 executes.
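The cycle-by-cycle overlap described above may be reproduced with a short Python sketch. It assumes, for illustration only, that every instruction advances exactly one stage per clock cycle through the four stages IF, RD, EX and WB; the labels `T1 addi`, `T2`, `T3` and `T1 muli` and the function name `stage_occupancy` are hypothetical.

```python
STAGES = ["IF", "RD", "EX", "WB"]


def stage_occupancy(issues, num_cycles):
    """Map each cycle (1-based) to {label: stage} for in-flight instructions.

    `issues` is a list of (label, issue_cycle) pairs. An instruction is in
    IF on its issue cycle and advances one stage per cycle (an assumption).
    """
    table = {}
    for cycle in range(1, num_cycles + 1):
        active = {}
        for label, issue_cycle in issues:
            idx = cycle - issue_cycle
            if 0 <= idx < len(STAGES):
                active[label] = STAGES[idx]
        table[cycle] = active
    return table


# Three threads issue on cycles 1-3; thread 1 issues again on cycle 4.
ISSUES = [("T1 addi", 1), ("T2", 2), ("T3", 3), ("T1 muli", 4)]

for cycle, active in stage_occupancy(ISSUES, 5).items():
    print(cycle, active)
```

In this sketch the addi instruction reaches WB in cycle 4, while the muli instruction of the same thread does not reach RD until cycle 5, so the value of r0 is available when it is read.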
It can be seen from this example that multiple instructions from the same and different threads are overlapped and concurrently executing. It can also be seen that there are no bubbles in the pipeline even though the results of the addi instruction are required by the muli instruction. The
As indicated previously, the present invention can be advantageously implemented in a multithreaded processor. A more particular example of a multithreaded processor in which the invention may be implemented is described in U.S. patent application Ser. No. 10/269,372, filed Oct. 11, 2002 and entitled “Multithreaded Processor With Efficient Processing For Convergence Device Applications,” which is commonly assigned herewith and incorporated by reference herein. This multithreaded processor may be configured to execute RISC-based control code, DSP code, Java code and network processing code. It includes a single instruction multiple data (SIMD) vector processing unit, a reduction unit, and long instruction word (LIW) compounded instruction execution. Examples of threading and pipelining techniques suitable for use with this exemplary multithreaded processor are described in U.S. patent application Ser. No. 10/269,245, filed Oct. 11, 2002 and entitled “Method and Apparatus for Token Triggered Multithreading,” now issued as U.S. Pat. No. 6,842,848, which is commonly assigned herewith and incorporated by reference herein.
The invention can be implemented in other multithreaded processors, or more generally other types of digital data processors. Another such processor will now be described with reference to
The multithreaded processor 400 includes, among other elements, a multithreaded cache memory 410, a multithreaded data memory 412, an instruction buffer 414, an instruction decoder 416, a register file 418, and a memory management unit (MMU) 420. The multithreaded cache 410 includes a plurality of thread caches 410-1, 410-2, . . . 410-N, where N generally denotes the number of threads supported by the multithreaded processor 400, and in this particular example is given by N=4. Of course, other values of N may be used, as will be readily apparent to those skilled in the art.
Each thread thus has a corresponding thread cache associated therewith in the multithreaded cache 410. Similarly, the data memory 412 includes N distinct data memory instances, denoted data memories 412-1, 412-2, . . . 412-N as shown.
The multithreaded cache 410 interfaces with a main memory (not shown) external to the processor 400 via the MMU 420. The MMU 420, like the cache 410, includes a separate instance for each of the N threads supported by the processor. The MMU 420 ensures that the appropriate instructions from main memory are loaded into the multithreaded cache 410.
The data memory 412 is also typically directly connected to the above-noted external main memory, although this connection is also not explicitly shown in the figure. Also associated with the data memory 412 is a data buffer 430.
In general, the multithreaded cache 410 is used to store instructions to be executed by the multithreaded processor 400, while the data memory 412 stores data that is operated on by the instructions. Instructions are fetched from the multithreaded cache 410 by the instruction decoder 416 and decoded. Depending upon the instruction type, the instruction decoder 416 may forward a given instruction or associated information to various other units within the processor, as will be described below.
The processor 400 includes a branch instruction queue (IQ) 440 and program counter (PC) registers 442. The program counter registers 442 include one instance for each of the threads. The branch instruction queue 440 receives instructions from the instruction decoder 416, and in conjunction with the program counter registers 442 provides input to an adder block 444, which illustratively comprises a carry-propagate adder (CPA). Elements 440, 442 and 444 collectively comprise a branch unit of the processor 400. Although not shown in the figure, auxiliary registers may also be included in the processor 400.
The register file 418 provides temporary storage of integer results. Instructions forwarded from the instruction decoder 416 to an integer instruction queue (IQ) 450 are decoded and the proper hardware thread unit is selected through the use of an offset unit 452 which is shown as including a separate instance for each of the threads. The offset unit 452 inserts explicit bits into register file addresses so that independent thread data is not corrupted. For a given thread, these explicit bits may comprise, e.g., a corresponding thread identifier.
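One possible way in which an offset unit could insert thread identifier bits into a register file address is sketched below in Python. The field widths (thirty-two registers per thread, four threads), the placement of the thread identifier in the high-order address bits, and the function name `physical_register_index` are all assumptions made for illustration; the description above does not specify them.

```python
def physical_register_index(thread_id: int, reg: int,
                            regs_per_thread: int = 32,
                            num_threads: int = 4) -> int:
    """Form a register file address by prepending thread identifier bits.

    The default sizes and the high-order placement of the thread
    identifier are illustrative assumptions only.
    """
    if not 0 <= thread_id < num_threads:
        raise ValueError("thread_id out of range")
    if not 0 <= reg < regs_per_thread:
        raise ValueError("register number out of range")
    # Number of address bits needed to select a register within one thread.
    reg_bits = (regs_per_thread - 1).bit_length()
    return (thread_id << reg_bits) | reg
```

Because the thread identifier occupies bits that the architectural register number never uses, two threads naming the same register always map to different physical locations, so independent thread data is not corrupted.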
As shown in the figure, the register file 418 is coupled to input registers RA and RB, the outputs of which are coupled to an arithmetic logic unit (ALU) block 454, which may comprise an adder. The input registers RA and RB are used in implementing instruction pipelining. The output of the ALU block 454 is coupled to the data memory 412.
The register file 418, integer instruction queue 450, offset unit 452, elements RA and RB, and ALU block 454 collectively comprise an exemplary integer unit.
Instruction types executable in the processor 400 include Branch, Load, Store, Integer and Vector/SIMD instruction types. If a given instruction does not specify a Branch, Load, Store or Integer operation, it is a Vector/SIMD instruction. Other instruction types can also or alternatively be used. The Integer and Vector/SIMD instruction types are examples of what are more generally referred to herein as integer and vector instruction types, respectively.
A vector IQ 456 receives Vector/SIMD instructions forwarded from the instruction decoder 416. A corresponding offset unit 458, shown as including a separate instance for each of the threads, serves to insert the appropriate bits to ensure that independent thread data is not corrupted.
A vector unit 460 of the processor 400 is separated into N distinct parallel portions, and includes a vector file 462 which is similarly divided. The vector file 462 includes thirty-two registers, denoted VR00 through VR31. The vector file 462 serves substantially the same purpose as the register file 418 except that the former operates on Vector/SIMD instruction types.
The vector unit 460 illustratively comprises the vector instruction queue 456, the offset unit 458, the vector file 462, and the arithmetic and storage elements associated therewith.
The operation of the vector unit 460 is as follows. A Vector/SIMD block encoded either as a fractional or integer data type is read from the vector file 462 and is stored into architecturally visible registers VRA, VRB, VRC. From there, the flow proceeds through multipliers (MPY) that perform parallel concurrent multiplication of the Vector/SIMD data. Adder units comprising carry-skip adders (CSAs) and CPAs may perform additional arithmetic operations. For example, one or more of the CSAs may be used to add in an accumulator value from a vector register file, and one or more of the CPAs may be used to perform a final addition for completion of a multiplication operation, as will be appreciated by those skilled in the art. Computation results are stored in Result registers 464, and are provided as input operands to the reduction unit 402. The reduction unit 402 sums the input operands in such a way that the summation result produced is the same as that which would be obtained if each operation were executed in series. The reduced sum is stored in the accumulator register file 406 for further processing.
When performing vector dot products, the MPY blocks perform four multiplies in parallel, the CSA and CPA units perform additional operations or simply pass along the multiplication results for storage in the Result registers 464, and the reduction unit 402 sums the multiplication results, along with an accumulator value stored in the accumulator register file 406. The result generated by the reduction unit is then stored in the accumulator register file for use in the next iteration, in the manner previously described.
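The arithmetic effect of the multiply and reduce operation may be sketched in Python as follows. This is a behavioral model only, not a description of the hardware: the products are added to the accumulator one at a time, which is the serial result that the reduction unit is described as matching. The optional saturation width `bits`, the lane count default, and the function names are assumptions introduced for illustration.

```python
def saturate(value: int, bits: int) -> int:
    """Clamp value to the signed two's complement range of the given width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def multiply_reduce(acc, a, b, bits=None):
    """Add the element-wise products of a and b to acc, in order.

    If `bits` is given, the running sum is saturated after every
    addition (an assumed option, not specified above).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    total = acc
    for x, y in zip(a, b):
        total += x * y
        if bits is not None:
            total = saturate(total, bits)
    return total


def dot_product(a, b, lanes=4, bits=None):
    """Dot product computed in blocks of `lanes` elements per iteration."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    acc = 0
    for i in range(0, len(a), lanes):
        # The result of each iteration is the accumulator for the next one.
        acc = multiply_reduce(acc, a[i:i + lanes], b[i:i + lanes], bits)
    return acc
```

When saturation is enabled in this model, the order of the additions affects the result, which is why agreement with serial execution is a meaningful requirement rather than a formality.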
The accumulator register file 406 in this example includes a total of sixteen accumulator registers denoted ACC00 through ACC15.
The multithreaded processor 400 may make use of techniques for thread-based access to register files, as described in U.S. patent application Ser. No. 10/269,373, filed Oct. 11, 2002 and entitled “Method and Apparatus for Register File Port Reduction in a Multithreaded Processor,” which is commonly assigned herewith and incorporated by reference herein.
The multithreaded processor 400 is well suited for use in performing vector dot products and other types of parallel vector multiply and reduce operations, as described in the above-cited U.S. patent application Ser. No. 10/841,261.
The illustrative embodiment of the present invention utilizes an approach known as token triggered threading. Token triggered threading is described in the above-cited U.S. patent application Ser. No. 10/269,245, now issued as U.S. Pat. No. 6,842,848. The token triggered threading typically assigns different tokens to each of a plurality of threads of a multithreaded processor. For example, the token triggered threading may utilize a token to identify in association with a current processor clock cycle a particular one of the threads of the processor that will be permitted to issue an instruction for a subsequent clock cycle.
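A toy Python model of a token that names the next issuing thread is given below. It is a sketch under stated assumptions rather than the mechanism of the above-cited patent: each thread is assumed to hold the identifier of the thread that issues after it, and the class name `TokenTriggeredScheduler` and the `next_thread` table are hypothetical.

```python
class TokenTriggeredScheduler:
    """Toy model in which each thread names the thread that issues next."""

    def __init__(self, next_thread, start=0):
        n = len(next_thread)
        if not 0 <= start < n or any(not 0 <= t < n for t in next_thread):
            raise ValueError("thread identifiers must lie in 0..N-1")
        # Require that the token visits every thread once per N cycles.
        seen, t = set(), start
        for _ in range(n):
            seen.add(t)
            t = next_thread[t]
        if len(seen) != n or t != start:
            raise ValueError("token sequence must visit every thread")
        self.next_thread = list(next_thread)
        self.current = start

    def step(self):
        """Return the thread issuing this cycle and pass the token on."""
        issuing = self.current
        self.current = self.next_thread[issuing]
        return issuing
```

For example, `TokenTriggeredScheduler([2, 3, 1, 0])` yields the repeating issue order 0, 2, 1, 3. This non-sequential order is chosen only to show that the sequence is determined by the tokens; a sequential table such as `[1, 2, 3, 0]` gives ordinary round-robin issue.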
In accordance with the token triggered threading illustrated in
In the
Although token triggered threading is used in the illustrative embodiment, the invention does not require this particular type of multithreading, and other types of multithreading techniques can be used.
The figure depicts example pipelines for Load/Store (Ld/St), Arithmetic Logic Unit (ALU), Integer Multiplication (I_Mul), Vector Multiplication (V_Mul), and Vector Multiplication and Reduction (V_Mul Reduce) instructions. In this implementation, up to three pipelines may be simultaneously started and all five may be in various phases of execution concurrently.
The Ld/St pipeline has nine stages, denoted stage 0 through stage 8. In the first stage, stage 0 (Inst Dec), an instruction is fetched and decoded. This stage is common to all five pipelines and determines which queue the instructions should be routed to. In stage 1 (RF Read), the register file operands are read. This will form the base address for the load or store operation. In the case of a store instruction, the data to be stored is also read. In stage 2 (Agen), any immediate values are added to the address and the full address is generated. In stage 3 (Xfer), the computed address is transferred to the memory subsystem. In stage 4 (Int/Ext), a determination is made as to whether the memory access is to internal or external memory. In stages 5-7 (Mem0, Mem1, Mem2), the value is read from or written to memory. In stage 8 (WB), the value read from memory on a Load instruction is written into the register file.
The ALU pipeline has seven stages, denoted stage 0 through stage 6. As in the Ld/St pipeline, the first stage, stage 0 (Inst Dec), fetches and decodes all instructions. In stage 1 (Wait), a wait cycle is inserted. This allows the Ld/St and ALU hardware to share the same register file read ports. In the following stage, stage 2 (RF Read), the operands for the arithmetic function are read from the register file. Stages 3 and 4 (Exec1, Exec2) then compute the arithmetic result (e.g., an add, compare, shift, etc.). In stage 5 (Xfer), the result is transferred to the register file. In stage 6 (WB), the result is written back into the register file.
The I_Mul pipeline is similar to the ALU pipeline, as they share common architected resources. The figure indicates that the pipeline stages are identical except for an additional execution stage (Exec 3) in the I_Mul pipeline. Thus, an additional cycle is available for computing the result of a multiply.
The V_Mul pipeline uses different architected resources than the previously-described ALU and I_Mul pipelines. It may therefore execute concurrently with those instructions without resource conflicts. Stage 0 (Inst Dec) is as in all instructions and allows for routing of the decoded instruction to the correct pipeline. In stage 1 (VRF Read) the vector register file operands are read. Stages 2-5 (MPY1, MPY2, Add1, Add2) perform the multi-element vector arithmetic. The two add stages are present to convert the multiplication results from carry-save format back into two's complement format. Additionally, if the vectors only require simple arithmetic, this can be performed in the add stages. In stage 6 (Xfer), the results are transferred back to the vector register file, and in stage 7 (WB), the results are written back.
The V_Mul Reduce pipeline is similar to the V_Mul pipeline except that an additional reduction operation is performed. The reduction takes the 4 vector element products, along with an accumulator operand, and reduces them to a single scalar element. Typically this involves adding all of the products to the accumulator or subtracting all of the products from the accumulator, although other combinations are possible. The V_Mul and V_Mul Reduce pipelines are the same until stage 5. In stage 5 (Add2, ACC Read), an additional architected accumulator register file is read. This value is arithmetically combined with the vector elements and reduced to a single scalar. Four stages (Reduce1, Reduce2, Reduce3, Reduce4) are devoted to this reduction and then the scalar value is written back to the accumulator register file (i.e., a different architected space from the vector register file) in stage 10 (ACC WB).
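For reference, the five pipelines described above can be tabulated as a simple Python data structure. The stage names follow the description; placing the additional I_Mul stage (written here as `Exec3`) immediately after `Exec2` is an assumption, and the helper name `stage_name` is hypothetical.

```python
# Stage lists indexed by stage number, following the description above.
PIPELINES = {
    "Ld/St": ["Inst Dec", "RF Read", "Agen", "Xfer", "Int/Ext",
              "Mem0", "Mem1", "Mem2", "WB"],
    "ALU": ["Inst Dec", "Wait", "RF Read", "Exec1", "Exec2", "Xfer", "WB"],
    # Position of Exec3 directly after Exec2 is assumed.
    "I_Mul": ["Inst Dec", "Wait", "RF Read", "Exec1", "Exec2", "Exec3",
              "Xfer", "WB"],
    "V_Mul": ["Inst Dec", "VRF Read", "MPY1", "MPY2", "Add1", "Add2",
              "Xfer", "WB"],
    "V_Mul Reduce": ["Inst Dec", "VRF Read", "MPY1", "MPY2", "Add1",
                     "Add2, ACC Read", "Reduce1", "Reduce2", "Reduce3",
                     "Reduce4", "ACC WB"],
}


def stage_name(pipeline: str, stage: int) -> str:
    """Return the name of the given stage number of a pipeline."""
    return PIPELINES[pipeline][stage]


for name, stages in PIPELINES.items():
    print(f"{name}: {len(stages)} stages")
```

The stage counts produced by this table (nine, seven, eight, eight and eleven) agree with the stage numbering given in the text.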
If a single thread issued instructions each cycle as in
As mentioned previously, in this implementation, all five processor pipelines may be simultaneously active with instructions from multiple hardware thread units. This fills potential bubbles in the pipeline with work from other thread units.
It should be noted that a given V_Mul Reduce pipeline may be shifted in locality from a V_Mul pipeline in that the back-to-back reduction operations of the V_Mul Reduce pipeline do not cause bubbles. It appears that such a shift might lead to pipeline bubbles because the V_Mul Reduce pipeline is longer in duration than the number of hardware thread units (eight in this implementation). In other words, the computational cycle of the pipeline (eleven clock cycles for V_Mul Reduce) is longer than the issue cycle (each thread gets to issue once every eight clock cycles). In fact, this does not happen because the accumulator register file read phase is shifted from the V_Mul pipeline computations.
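The timing argument above can be checked with a small Python sketch. It assumes that an instruction occupies stage s exactly s cycles after it issues, that consecutive instructions of one thread issue eight cycles apart, and that a value written in one cycle can be read in any strictly later cycle; these simplifications and the function name `result_ready_in_time` are illustrative only.

```python
def result_ready_in_time(write_stage: int, read_stage: int,
                         issue_interval: int) -> bool:
    """Check that a producer's writeback precedes the consumer's read.

    The producer issues at cycle 0 and writes at cycle `write_stage`.
    The next instruction of the same thread issues at cycle
    `issue_interval` and reads at cycle `issue_interval + read_stage`.
    """
    return write_stage < issue_interval + read_stage


THREADS = 8  # issue interval for one thread in this implementation

# V_Mul Reduce: accumulator written in stage 10, read in stage 5.
print(result_ready_in_time(10, 5, THREADS))   # accumulator read shifted

# Hypothetical alternative: accumulator read together with VRF Read.
print(result_ready_in_time(10, 1, THREADS))   # read would be too early
```

With the accumulator read in stage 5, the second instruction reads in cycle 13, after the first has written in cycle 10. Had the read been placed in stage 1, it would occur in cycle 9, before the writeback, which is the bubble that the shifted read phase avoids.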
The illustrative embodiment described above advantageously allows multiple concurrent pipelines per thread and provides for pipeline shifting in deeply multithreaded pipelines. It also eliminates the need for dependency checking and bypassing hardware, since results are guaranteed to be written back to the register file before they are needed by the next instruction from the same thread.
It should be noted that the particular processor, multithreading, pipelining and shifting arrangements shown in the figures are presented by way of illustrative example only, and additional or alternative elements not explicitly shown may be included, as will be apparent to those skilled in the art.
It should also be emphasized that the present invention does not require the particular multithreaded processor configuration shown in
Thus, the above-described embodiments of the invention are intended to be illustrative only, and numerous alternative embodiments within the scope of the appended claims will be apparent to those skilled in the art. For example, the particular arrangement of hardware thread units, instruction decoder and execution units shown in
The present application is a continuation application of U.S. patent application Ser. No. 11/096,917, filed Apr. 1, 2005, which claims the priority of U.S. Provisional Application Ser. No. 60/560,199, filed Apr. 7, 2004 and entitled “Processor Pipeline With Multithreaded Support,” which is incorporated by reference herein. The present application is also related to U.S. patent application Ser. No. 10/841,261, filed May 7, 2004 and entitled “Processor Reduction Unit for Accumulation of Multiple Operands With or Without Saturation,” which is incorporated by reference herein. The present application is further related to U.S. patent application Ser. No. 12/579,893, filed Oct. 15, 2009 and entitled “Multithreaded Processor with Multiple Concurrent Pipelines per Thread.” The present application is also related to U.S. patent application Ser. No. 12/579,867, filed Oct. 15, 2009 and entitled “Multithreaded Processor with Multiple Concurrent Pipelines per Thread.”
Number | Name | Date | Kind |
---|---|---|---|
4001692 | Fenwick et al. | Jan 1977 | A |
4706211 | Yamazaki et al. | Nov 1987 | A |
4769779 | Chang et al. | Sep 1988 | A |
5181184 | Shim et al. | Jan 1993 | A |
5404469 | Chung et al. | Apr 1995 | A |
5613114 | Anderson et al. | Mar 1997 | A |
5864703 | Van Hook et al. | Jan 1999 | A |
5889689 | Alidina et al. | Mar 1999 | A |
5907702 | Flynn et al. | May 1999 | A |
5949996 | Atsushi | Sep 1999 | A |
5958041 | Petolino et al. | Sep 1999 | A |
5983256 | Peleg et al. | Nov 1999 | A |
5991785 | Alidina et al. | Nov 1999 | A |
6078941 | Jiang et al. | Jun 2000 | A |
6092175 | Levy et al. | Jul 2000 | A |
6161166 | Doing et al. | Dec 2000 | A |
6212544 | Borkenhagen et al. | Apr 2001 | B1 |
6295600 | Parady | Sep 2001 | B1 |
6377619 | Denk et al. | Apr 2002 | B1 |
6470443 | Emer et al. | Oct 2002 | B1 |
6530010 | Hung et al. | Mar 2003 | B1 |
6530014 | Alidina et al. | Mar 2003 | B2 |
6557022 | Sih et al. | Apr 2003 | B1 |
6606704 | Adiletta et al. | Aug 2003 | B1 |
6687724 | Mogi et al. | Feb 2004 | B1 |
6694425 | Eickemeyer | Feb 2004 | B1 |
6697935 | Borkenhagen et al. | Feb 2004 | B1 |
6842848 | Hokenek et al. | Jan 2005 | B2 |
6898694 | Kottapalli et al. | May 2005 | B2 |
6904511 | Hokenek et al. | Jun 2005 | B2 |
6912623 | Hokenek et al. | Jun 2005 | B2 |
6925643 | Hokenek et al. | Aug 2005 | B2 |
6968445 | Hokenek et al. | Nov 2005 | B2 |
6971103 | Hokenek et al. | Nov 2005 | B2 |
6973471 | Nguyen | Dec 2005 | B2 |
6990557 | Hokenek et al. | Jan 2006 | B2 |
7251737 | Weinberger et al. | Jul 2007 | B2 |
7360064 | Steiss et al. | Apr 2008 | B1 |
7428567 | Schulte et al. | Sep 2008 | B2 |
7475222 | Glossner et al. | Jan 2009 | B2 |
7593978 | Schulte et al. | Sep 2009 | B2 |
7797363 | Hokenek et al. | Sep 2010 | B2 |
7873815 | Sih et al. | Jan 2011 | B2 |
20010047468 | Parady | Nov 2001 | A1 |
20020038416 | Fotland et al. | Mar 2002 | A1 |
20030041228 | Rosenbluth et al. | Feb 2003 | A1 |
20060095729 | Hokenek et al. | May 2006 | A1 |
20090193279 | Moudgill et al. | Jul 2009 | A1 |
20090235032 | Hoane | Sep 2009 | A1 |
20090276432 | Hokenek et al. | Nov 2009 | A1 |
20100031007 | Moudgill | Feb 2010 | A1 |
20100115527 | Kotlyar et al. | May 2010 | A1 |
20100122068 | Hokenek et al. | May 2010 | A1 |
20100199073 | Hokenek et al. | Aug 2010 | A1 |
20100241834 | Moudgill | Sep 2010 | A1 |
20100299319 | Parson et al. | Nov 2010 | A1 |
Number | Date | Country |
---|---|---|
0444088 | Sep 1991 | EP |
0 725 334 | Aug 1996 | EP |
0793168 | Sep 1997 | EP |
2 389 433 | Dec 2003 | GB |
19970012141 | Mar 1997 | KR |
305310 | Sep 2001 | KR |
WO 0161860 | Aug 2001 | WO |
WO 03019358 | Mar 2003 | WO |
Entry |
---|
Brunett, Sharon, Thornley, John, Ellenbecker, Marrq. “An initial evaluation of the Tera multithreaded architecture and programming system using the C3I parallel benchmark suite” Proceedings of the 1998 ACM/IEEE SC98 Conference. |
Microsoft Computing Dictionary, 5th edition, 2002, p. 458. |
Hennessy, John L., Patterson, David A. “Computer Architecture: A Quantitative Approach” 3rd Edition, May 17, 2002, pp. 608-609. |
Mat Loikkanen, Nader Bagherzadeh. “A Fine-Grain Multithreading Superscalar Architecture” in Proc. Int'l Conf. Parallel Architectures and Compilation Techniques' 96. |
Sato et al., Thread-based Programming for the EM-4 Hybrid Dataflow Machines, Proc. Int'l Symp. Computer Architecture 19, pp. 146-155 (May 19, 1992). |
Diep et al., Performance Evaluation of the PowerPC 620 Microarchitecture, Proceedings of the 22nd. Annual Symposium on Computer Architecture, ACM, vol. 22, Feb. 22, 1995, pp. 163-174. |
International Search Report for PCT Patent Application No. PCT/US05/11614. |
European Patent Office Search Report for EPC Patent Application No. 05732166.3. |
Ackland et al., Mar. 2000, A single-chip, 1.6-billion, 16-b MAC/s multiprocessor DSP, IEEE, 35(3):412-424. |
Alverson et al., 1990, The Tera computer system, ACM, pp. 1-6. |
Alverson et al., Jan. 1, 1992, Exploiting heterogeneous parallelism on a multithreaded multiprocessor, International Conference on Supercomputing, ACM, pp. 188-193. |
Glossner et al., Mar. 4, 2004, 6. The Sandbridge sandblaster communications processor, Software Defined Radio: Baseband Technologies for 3G Handsets and Basestations, John Wiley & Sons, Ltd., pp. 129-159. |
Shen et al., 2003, Modern Processor Design, McGraw Hill, pp. 106, 232-233. |
Snavely et al, 1998, Multi-processor performance on the Tera MTA, Proceedings of the 1998 ACM/IEEE SC98 Conference, pp. 11. |
Official Communication dated Feb. 24, 2009 in European App. No. 05732166.3. |
Office Action dated Aug. 20, 2007 in U.S. Appl. No. 11/096,917. |
Office Action dated May 21, 2008 in U.S. Appl. No. 11/096,917. |
Office Action dated Jan. 29, 2009 in U.S. Appl. No. 11/096,917. |
Office Action dated Jun. 15, 2009 in U.S. Appl. No. 11/096,917. |
Office Action dated Mar. 22, 2010 in U.S. Appl. No. 11/096,917. |
Extended European Search Report dated Jul. 26, 2011 in App. No. 11001888.4. |
Extended European Search Report dated Jul. 28, 2011 in App. No. 11001890.0. |
Extended European Search Report dated Jul. 26, 2011 in App. No. 11001889.2. |
Balzola et al., Sep. 26, 2001, Design alternatives for parallel saturating multioperand adders, Proceedings 2001 International Conference on Computer Design, pp. 172-177. |
Balzola, Apr. 2003, Saturating arithmetic for digital signal processors, PhD Thesis, Lehigh University. |
Glossner et al, 2000, Trends in compilable DSP architecture, IEEE Workshop in Signal Processing Systems, pp. 1-19. |
Glossner et al., Apr. 2001, Towards a very high bandwidth wireless battery powered device, IEEE Computer Society Workshop in VLSI, pp. 3-9. |
Glossner et al., Nov. 2002, A multithreaded processor architecture for SDR, The Proceedings of the Korean Institute of Communication Sciences, 19(11):70-84. |
Glossner et al., Nov. 11-12, 2002, Multi-threaded processor for software-defined radio, Proceedings of the 2002 Software Defined Radio Technical Conference, vol. 1, 6 pp. |
Glossner et al., Jan. 2003, A software defined communications baseband design, IEEE Communications Magazine, 41(1):120-128. |
Glossner et al., September 22, 23, 2003, Multiple communication protocols for software defined radio, IEEE Colloquium on DSP Enable Radio, ISIL, Livingston, Scotland, pp. 227-236. |
Hennessy et al., 2003, Computer architecture: a quantitative approach, 3rd ed., Morgan Kaufmann Publishers, Appendix A, section A.5, “Extending the MIPS Pipeline to Handle Multicycle Operations,” pp. A-47-A-57. |
Jinturkar et al., Mar. 31-Apr. 3, 2003, Programming the Sandbridge multithreaded processor, Proceedings of the 2003 Global Signal Processing Expo (GSPx) and International Signal Processing Conference (ISPC), Dallas, Tx. |
Loikkanen et al., 1996, A fine-grain multithreading superscalar architecture, IEEE Proceedings of PACT '96, pp. 163-168. |
Moreno et al., Sep. 15, 2002, IBM Research Report RC22568—an innovative low-power high-performance programmable signal processor for digital communications, 31 pp. |
Peleg et al., Aug. 1, 1995, MMX technology extension to the intel architecture, IEEE Micro, 16(4):42-50. |
Schulte et al., Nov. 19, 2000, Parallel saturating multioperand adders, Cases '00, pp. 172-179. |
Schulte et al., Nov. 2004, A low-power multithreaded processor for baseband communication systems, Lecture Notes in Computer Science, 3133:393-402. |
Ungerer, Mar. 2003, A survey of processors with explicit multithreading, ACM Computing Surveys, 35(1):29-63. |
Office Action dated Sep. 7, 2010 in U.S. Appl. No. 12/579,867. |
Office Action dated Feb. 9, 2011 in U.S. Appl. No. 12/579,867. |
Office Action dated Oct. 29, 2010 in U.S. Appl. No. 12/579,893. |
Office Action dated Mar. 9, 2011 in U.S. Appl. No. 12/579,893. |
IPRP dated Oct. 19, 2006 in PCT/US05/11614. |
Official Communication dated Dec. 10, 2009 in European App. No. 05732166.3. |
Summons to attend oral proceedings dated Oct. 27, 2010 in European App. No. 05732166.3. |
Decision to refuse dated Apr. 14, 2011 in European App. No. 05732166.3. |
Blaauw et al., 1997, Computer Architecture: Concepts and Evolution, Addison-Wesley, Reading, Mass., 7 pp. |
Glossner et al., Sep. 2004, Sandblaster Low-Power Multithreaded SDR Baseband Processor, Proceedings of the 3rd Workshop on Applications Specific Processors (WASP'04), Stockholm, Sweden, pp. 53-58. |
Office Action dated Nov. 18, 2011 in U.S. Appl. No. 12/579,867. |
Office Action dated Nov. 17, 2011 in U.S. Appl. No. 12/579,893. |
Notice to File a Response dated Jun. 23, 2011 in Korean App. No. 10-2006-7022996. |
Office Action dated Oct. 18, 2012, in U.S. Appl. No. 13/282,800. |
Notice to File a Response dated Nov. 27, 2012 in Korean App. No. 10-2012-7022421. |
Notice to File a Response dated Nov. 27, 2012 in Korean App. No. 10-2012-7022422. |
Notice to File a Response dated Feb. 27, 2012 in Korean App. No. 10-2006-7022996. |
Office Action dated Apr. 12, 2012 in U.S. Appl. No. 12/579,867. |
Office Action dated Apr. 18, 2012 in U.S. Appl. No. 12/579,893. |
Official Communication dated Jul. 16, 2012 in App. No. 11001890.0. |
Official Communication dated Jul. 3, 2012 in App. No. 11001888.4. |
Official Communication dated Jul. 3, 2012 in App. No. 11001889.2. |
Diefendorff K., et al., “AltiVec Extension to PowerPC Accelerates Media Processing”, IEEE Micro, vol. 20, No. 2, pp. 85-95, Mar. 2000. |
Kim Y., et al., “A Low Power Carry Select Adder with Reduced Area”, Proceedings of IEEE International Symposium on Circuits and Systems, pp. IV-218-IV-221, 2001. |
Lee R.B., “Subword Permutation Instructions for Two-Dimensional Multimedia Processing in MicroSIMD Architectures”, Proceedings of the IEEE 11th International Conference on Application-Specific Systems, Architectures and Processor, pp. 3-14, Jul. 2000. |
Tullsen, D.M. et al., Simultaneous Multithreading: Maximizing On-Chip Parallelism, 1995, ACM, pp. 392-403. |
Tyagi A., “A Reduced-Area Scheme for Carry-Select Adders”, IEEE Transactions on Computers, vol. 42, No. 10, pp. 1163-1170, Oct. 1993. |
Yadav N., et al., “Parallel Saturating Fractional Arithmetic Units”, Proceedings of the Ninth Great Lakes Symposium on VLSI, pp. 214-217, Mar. 1999. |
SGS-Thomson Microelectronics, Jul. 1995, 16-48K ROM HCMOS MCU with on screen display and voltage tuning output, 22 pp. |
Basic Features of the HEP Supercomputer, http://www-ee.eng.hawaii.edu/˜nava/HEP/introduction.html, Mar. 6, 2001, 1 page. |
Number | Date | Country | |
---|---|---|---|
20100199075 A1 | Aug 2010 | US |
Number | Date | Country | |
---|---|---|---|
60560199 | Apr 2004 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 11096917 | Apr 2005 | US |
Child | 12579912 | US |