The present invention relates to computer processor architecture in general, and more particularly to multithreading computer processor architectures and pipelined computer processor architectures.
Pipelined computer processors are well known in the art. A typical pipelined computer processor increases overall execution speed by separating the instruction processing function into four pipeline phases. This phase division allows an instruction to be fetched (IF) during the same clock cycle in which a previously fetched instruction is decoded (D), a previously decoded instruction is executed (E), and the result of a previously executed instruction is written back into its destination (WB). Thus, the total elapsed time to process a single instruction (i.e., fetch, decode, execute, and write-back) is four clock cycles. However, the average throughput is one instruction per machine cycle, because of the overlapped operation of the four pipeline phases.
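The overlap described above can be illustrated with a short scheduling sketch (for illustration only; the function name and trace format are not part of the invention, and the phase names follow the IF/D/E/WB phases named above):

```python
def pipeline_schedule(num_instructions, phases=("IF", "D", "E", "WB")):
    """Return {cycle: {phase: instruction_index}} for an ideal 4-phase pipeline."""
    schedule = {}
    for i in range(num_instructions):
        for p, phase in enumerate(phases):
            cycle = i + p  # instruction i enters phase p at cycle i + p
            schedule.setdefault(cycle, {})[phase] = i
    return schedule

sched = pipeline_schedule(5)
total_cycles = len(sched)  # 5 instructions complete in 5 + 4 - 1 = 8 cycles
# Each instruction still takes 4 cycles from fetch to write-back, but in the
# steady state one instruction completes per cycle, i.e. throughput -> 1/cycle.
```

Note that the per-instruction latency (four cycles) is unchanged; only the overlap of the four phases produces the one-instruction-per-cycle average throughput.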
In many computing applications that are executed by pipelined computer processors, a large percentage of instruction processing time is wasted on pipeline stalling and idling. This is often due to cache misses and the latency of accessing external caches or external memory following a cache miss, or to interdependency between successively executed instructions, which necessitates a delay of one or more clock cycles so that the results of a prior instruction can stabilize before they are used by a subsequent instruction.
Increasing the number of pipeline phases in a given processor yields a processor that may operate at a higher clock frequency. For example, doubling the number of pipeline phases by splitting each phase into two sub-phases, where each sub-phase's execution time is half of the original clock cycle, results in a pipeline that is twice as deep as the original and enables the processor to operate at up to twice the original clock frequency. However, the processor's performance with respect to an application is not doubled, since the increased overlap of successively executed instructions increases pipeline stalling and idling. Furthermore, increasing the number of pipeline phases yields a new processor that is not compatible with the original processor, as the cycle-by-cycle execution pattern differs once new idling cycles are inserted. Thus, applications written for the original processor would likewise be incompatible with the new processor and would need to be recompiled and optimized for it.
One technique for reducing stalling and idling in pipelined computer processors is hardware multithreading, where instructions are processed during otherwise idle cycles. Applying hardware multithreading to a given processor may result in improved performance, due to reduced stalling and idling. However, as is the case with increased pipeline phases, the new multithreaded processor is not compatible with the original processor, as the cycle-by-cycle execution pattern is different from that of the original processor, since idling cycles are eliminated. An application that is compiled and optimized for execution by the original processor will generally include idling operations to adjust for pipeline limitations and interdependency between subsequent instructions. Thus, applications written for the original processor would need to be recompiled and optimized for use with the new multithreading processor in order to take advantage of the reduced need for idling operations and of other benefits of multithreading.
An embodiment of the present invention provides a method of converting a computer processor into a virtual multiprocessor that overcomes disadvantages of the prior art. This embodiment improves throughput efficiency and exploits increased parallelism by introducing a combination of multithreading and pipeline splitting to an existing and mature processor core. The resulting processor is a single physical processor that operates as multiple virtual processors, where each of the virtual processors is equivalent to the original processor.
In one aspect of the present invention a method is provided for converting a computer processor configuration having a k-phased pipeline into a virtual multithreaded processor, including dividing each pipeline phase of the processor configuration into a plurality n of sub-phases, and creating at least one virtual pipeline within the pipeline, the virtual pipeline including k sub-phases.
In another aspect of the present invention the method further includes executing a different thread within each one of the virtual pipelines.
In another aspect of the present invention the executing step includes executing any of the threads at an effective clock rate equal to the clock rate of the k-phased pipeline.
In another aspect of the present invention the dividing step includes determining a minimum cycle time T=1/f for the computer processor configuration and dividing each pipeline phase of the processor configuration into the plurality n of sub-phases, where each sub-phase has a propagation delay of less than T/n.
In another aspect of the present invention the method further includes replicating the register set of the processor configuration, and adapting the replicated register sets to simultaneously store the machine states of the threads.
In another aspect of the present invention the method further includes selecting any of the threads at a clock cycle, and activating at the clock cycle the register set that is associated with the selected thread.
In another aspect of the present invention any of the steps are applied to a single-threaded processor configuration.
In another aspect of the present invention any of the steps are applied to a multithreaded processor configuration.
In another aspect of the present invention any of the steps are applied to a given processor configuration a plurality of times for a plurality of different values of n, thereby creating a plurality of different processor configurations.
In another aspect of the present invention any of the steps are applied to a given processor configuration a plurality of times for a plurality of different values of n until a target processor performance level is achieved.
In another aspect of the present invention the dividing step includes selecting a predefined target processor performance level, and selecting a value of n that is in predefined association with that target performance level.
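The arithmetic underlying the dividing step can be sketched numerically (a sketch only; the concrete figures f = 200 MHz and n = 2 are illustrative assumptions, not values from the specification):

```python
# Illustrative numbers (assumed): a k = 4 phase pipeline with an original
# clock frequency f of 200 MHz, i.e. a minimum cycle time T = 1/f of 5 ns.
k = 4
f = 200e6            # original clock frequency (assumed)
T = 1.0 / f          # minimum cycle time: 5 ns

n = 2                # split each pipeline phase into n sub-phases
sub_phase_budget = T / n   # each sub-phase must settle in under T/n = 2.5 ns

f_new = n * f        # the physical pipeline may then be clocked at up to n*f
depth = n * k        # the physical pipeline is now n*k = 8 sub-phases deep
per_thread_rate = f_new / n  # each thread is activated every n'th cycle,
                             # so its effective clock rate remains f
```

This shows why each virtual pipeline of k sub-phases can present the original effective clock rate: the n-fold faster physical clock is divided among n threads.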
It is appreciated throughout the specification and claims that the term “processor” may refer to any combination of logic gates that is driven by one or more clock signals and that processes one or more streams of input data or any stored data elements.
The disclosures of all patents, patent applications and other publications mentioned in this specification and of the patents, patent applications and other publications cited therein are hereby incorporated by reference in their entirety.
The present invention will be understood and appreciated more fully from the following detailed description taken in conjunction with the appended drawings.
The VMP acts as n virtual processors served by n virtual pipelines, where each virtual processor time-shares one physical pipeline. Each of the n virtual processors is compatible with the original processor and runs at an n-fold faster clock frequency, but is activated every n'th clock cycle. Thus, it is as if each virtual processor operates at the same frequency as the original processor. Each of the n virtual pipelines is a k-phased pipeline, equivalent to the original processor's single k-phased pipeline, and is activated every n phases of the n*k phased physical pipeline. Each application that is capable of being executed by the original processor is executed as one of the n threads by one of the n virtual processors in the same manner. No change to the application software is required, as each virtual pipeline behaves exactly as the original processor pipeline with respect to instruction processing and pipeline phases.
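The time-sharing pattern described above can be sketched as a round-robin activation schedule (an illustrative sketch; the function name and trace format are assumptions):

```python
def vmp_trace(n, cycles):
    """Round-robin time sharing: virtual processor (cycle mod n) owns each
    physical clock cycle, so each one is activated every n'th cycle."""
    return [cycle % n for cycle in range(cycles)]

trace = vmp_trace(n=4, cycles=12)   # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]

# Each virtual processor receives exactly 1/n of the physical cycles; since
# the physical pipeline is clocked n-fold faster, each virtual processor
# effectively runs at the same frequency as the original processor.
share = trace.count(0) / len(trace)  # 3 of 12 cycles = 1/4
```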
The set of registers that stores the processor state information, referred to herein as the register set, is then adapted to simultaneously store the multiple machine states of the n threads. This may be achieved using any register set extension technique. In one such technique the register set is replaced by n identical register sets, where each of the n register sets is dedicated to one of the threads. Selection logic is then used to activate one of the n register sets at each clock cycle. An alternative technique replaces the register set with a “public” register pool, whose individual registers are dynamically allocated to the n threads according to their required resources, such that each thread owns a part of the public register file that is sufficient to store its machine state. Selection logic is then used to activate, at each cycle, the appropriate registers, as determined by the portion of the register file assigned to the active thread and by that thread's register access request. Yet another alternative combines the two techniques above: the extended register set is composed of n partial register sets, each dedicated to one of the n threads, together with one register file whose individual registers are dynamically allocated to the n threads according to the resources each thread requires, such that each thread has its own partial register set in addition to a share of the register file, the combination of which is sufficient to store that thread's state.
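The first of the extension techniques above (n identical register sets plus selection logic) can be sketched as follows; the class and method names are illustrative assumptions, not part of the specification:

```python
class ReplicatedRegisterSet:
    """Sketch of the first technique above: n identical register sets,
    one per thread, with selection logic that activates the set
    belonging to the thread scheduled on the current cycle."""

    def __init__(self, n_threads, reg_names):
        # one full private register set (bank) per thread
        self.banks = [dict.fromkeys(reg_names, 0) for _ in range(n_threads)]
        self.active = 0  # index of the currently activated register set

    def select(self, thread_id):
        # selection logic: activate one register set at each clock cycle
        self.active = thread_id

    def read(self, reg):
        return self.banks[self.active][reg]

    def write(self, reg, value):
        self.banks[self.active][reg] = value

rs = ReplicatedRegisterSet(n_threads=2, reg_names=["R1", "R2"])
rs.select(0); rs.write("R1", 7)   # thread 0 updates its private R1
rs.select(1)                      # context switch on the next cycle:
                                  # thread 1 still sees its own R1 == 0
```

The point of the sketch is isolation: each thread's machine state is held simultaneously, and only the selection logic decides which set the pipeline sees on a given cycle.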
While the present invention has been described with reference to a thread scheduling scheme where the threads are interleaved on a cycle-by-cycle basis and the thread's real-time execution pattern is compatible with the original processor's cycle-by-cycle real-time behavior, the present invention may utilize any thread-scheduling scheme. Thus, the thread scheduler may select the thread to be activated at each clock cycle based on a combination of criteria, such as thread priority, expected behavior of the selected thread, and the effect of selecting a specific thread on the overall utilization of the processor resources and on the overall performance.
Microprocessor 620 comprises a processing core 622, which comprises a processing pipeline 624 and a register set 626. The core elements communicate with a memory 628 and a clock circuit 630, as well as with other elements not shown in the figure. Pipeline 624 comprises a sequence of stages including an instruction fetcher (IF) 632, a decoder 634, an execution engine 636, and a writeback (WB) stage 638.
In order to configure pipeline 624 for multithreading while maintaining the original design frequency of the microprocessor (i.e., with each thread running at the original design frequency), each stage of the pipeline is split into first and second sub-stages (or phases) 640 and 642. Typically, a logic storage element (not shown) is inserted in the design between the two sub-stages. During a given clock cycle, sub-stage 640 can then process an instruction belonging to a first thread, while sub-stage 642 processes an instruction belonging to another thread. During the next clock cycle, sub-stage 642 completes the processing of the instruction belonging to the first thread, while sub-stage 640 begins processing the next instruction of the other thread. Clock circuit 630 may thus drive pipeline 624 so that both threads are processed at the nominal, single-thread throughput of the original processing core.
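The cycle-by-cycle occupancy of one split stage can be traced as follows (a sketch under the two-thread, two-sub-stage scenario described above; thread labels "A" and "B" are assumptions):

```python
def split_stage_trace(cycles):
    """Occupancy of the two sub-stages (640, 642) of one split stage.
    Thread A enters sub-stage 640 on even cycles and thread B on odd
    cycles; whatever sub-stage 640 held moves into sub-stage 642 on
    the next clock, modeling the storage element between them."""
    trace = []
    sub640, sub642 = None, None
    for cycle in range(cycles):
        sub642 = sub640                          # instruction advances
        sub640 = "A" if cycle % 2 == 0 else "B"  # next thread enters
        trace.append((sub640, sub642))
    return trace

trace = split_stage_trace(4)
# [('A', None), ('B', 'A'), ('A', 'B'), ('B', 'A')]: each thread occupies
# the stage every other cycle, so each retains the original throughput.
```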
Each of the threads that is processed by pipeline 624 has its own set of machine states (context), which is held in register set 626 and accessed by the pipeline stages during processing. To enable the interleaving of the threads in the pipeline, the register set comprises register replication circuits 644, corresponding to the original registers (R1, R2, …, Rn) of the original microprocessor design. Each circuit 644 holds the contexts of both of the executing threads and switches the context that is made available to the pipeline stages at the (accelerated) clock rate of the pipeline. For proper multithread operation, the context switching performed by the register replication circuits must be carefully synchronized with the pipeline.
In one embodiment, each register replication circuit 644 has a single clock input, as described in PCT patent application PCT/IL2006/000280, filed Mar. 1, 2006, which is assigned to the assignee of the present patent application, and whose disclosure is incorporated herein by reference. Each circuit 644 comprises a main storage element for holding and outputting the context data of one thread and a shadow storage element for holding the context data of the other thread (not shown in the figures). The main and shadow storage elements are connected in cascade so as to exchange the context data held in the main and shadow storage elements in response to the clock signal received via the single clock input. This approach has been found to simplify the timing of the microprocessor and reduce chip size and power consumption.
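The main/shadow exchange can be sketched in a few lines (a behavioral sketch only; the class name and initial values are assumptions, and the real circuit is a cascaded pair of storage elements driven by a single clock input):

```python
class MainShadowRegister:
    """Sketch of a register replication circuit: a main storage element
    (whose contents are visible to the pipeline) and a shadow storage
    element; each pulse on the single clock input exchanges the two
    contexts, switching which thread's state the pipeline sees."""

    def __init__(self, main=0, shadow=0):
        self.main, self.shadow = main, shadow

    def clock(self):
        # cascade exchange: the shadow context becomes visible, and the
        # previously visible context is saved into the shadow element
        self.main, self.shadow = self.shadow, self.main

    @property
    def output(self):
        return self.main  # only the main element drives the pipeline

r = MainShadowRegister(main=11, shadow=22)  # thread 0 holds 11, thread 1 holds 22
r.clock()  # context switch: thread 1's value is now visible
r.clock()  # next cycle: thread 0's value is visible again
```

Because a single clock pulse performs the whole exchange, no per-register select lines are needed, which is consistent with the simplified timing noted above.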
An input multiplexer 650 accepts inputs to both of the threads that are to be processed by pipeline 624 (referred to herein as input 0 and input 1, respectively). The multiplexer places the input data in alternation at the same input address, so that the pipeline finds the input data for both threads at the address at which it was programmed to find the data in the original, single-threaded design. Similarly, a demultiplexing circuit 651 accepts the outputs from both threads at the same output address as in the original pipeline. This multiplexing and demultiplexing scheme (together with the other features described above) maintains binary compatibility with the original design.
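The alternation performed by multiplexer 650 and demultiplexing circuit 651 can be sketched as stream interleaving (function names and the sample data are illustrative assumptions):

```python
def mux_inputs(input0, input1):
    """Interleave the two thread input streams so that the pipeline always
    reads from the single input address it was programmed to use in the
    original single-threaded design (thread 0 on even cycles, thread 1
    on odd cycles)."""
    merged = []
    for a, b in zip(input0, input1):
        merged.append(a)  # even cycle: thread 0's data at the input address
        merged.append(b)  # odd cycle: thread 1's data at the same address
    return merged

def demux_outputs(merged):
    """Split the alternating output stream back into per-thread outputs."""
    return merged[0::2], merged[1::2]

stream = mux_inputs(["a0", "a1"], ["b0", "b1"])  # ['a0', 'b0', 'a1', 'b1']
out0, out1 = demux_outputs(stream)
```

Because each thread sees only its own data at the original addresses, neither application binary needs to know that the other thread exists.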
As yet another alternative, input and/or output multiplexing may be achieved by duplicating the logic in the first stage and/or the last stage in the pipeline.
It is appreciated that one or more of the steps of any of the methods described herein may be omitted or carried out in a different order than that shown, without departing from the true spirit and scope of the invention.
While the methods and apparatus disclosed herein may or may not have been described with reference to specific hardware or software, it is appreciated that the methods and apparatus described herein may be readily implemented in hardware or software using conventional techniques.
While the present invention has been described with reference to one or more specific embodiments, the description is intended to be illustrative of the invention as a whole and is not to be construed as limiting the invention to the embodiments shown. It is appreciated that various modifications may occur to those skilled in the art that, while not specifically shown herein, are nevertheless within the true spirit and scope of the invention.
This application is a continuation-in-part of U.S. patent application Ser. No. 10/043,223, filed Jan. 14, 2002, which is incorporated herein by reference.
Relation | Number | Date | Country
Parent | 10043223 | Jan 2002 | US
Child | 11454423 | Jun 2006 | US