1. Technical Field
The present disclosure relates generally to information processing systems and, more specifically, to a mechanism that maintains the register values for inactive software threads in a storage area separate from the primary physical register file.
2. Background Art
In order to increase performance of information processing systems, such as those that include microprocessors, both hardware and software techniques have been employed. On the hardware side, microprocessor design approaches to improve microprocessor performance have included increased clock speeds, pipelining, branch prediction, super-scalar execution, out-of-order execution, and caches. Many such approaches have led to increased transistor count, and have even, in some instances, resulted in transistor count increasing at a rate greater than the rate of improved performance.
Rather than seek to increase performance through additional transistors, other performance enhancements involve software techniques. One software approach that has been employed to improve processor performance is known as “multithreading.” In software multithreading, an instruction stream may be split into multiple instruction streams that can be executed in parallel. Alternatively, independent software threads may be executed concurrently.
In one approach, known as time-slice multithreading or time-multiplex (“TMUX”) multithreading, a single processor switches between threads after a fixed period of time. In still another approach, a single processor switches between threads upon occurrence of a trigger event, such as a long latency cache miss. In this latter approach, known as switch-on-event multithreading (“SoEMT”), only one thread, at most, is active at a given time.
Increasingly, multithreading is supported in hardware. For instance, in one approach, processors in a multi-processor system, such as a chip multiprocessor (“CMP”) system, may each act on one of the multiple threads concurrently. In another approach, referred to as simultaneous multithreading (“SMT”), a single physical processor is made to appear as multiple logical processors to operating systems and user programs. For SMT, multiple threads can be active and execute concurrently on a single processor without switching. That is, each logical processor maintains a complete set of the architecture state, but many other resources of the physical processor, such as caches, execution units, branch predictors, control logic, and buses, are shared. For SMT, the instructions from multiple software threads may thus execute concurrently on each logical processor.
The present invention may be understood with reference to the following drawings in which like elements are indicated by like numbers. These drawings are not intended to be limiting but are instead provided to illustrate selected embodiments of an apparatus, system and method for a mechanism that maintains register values for inactive SoEMT software threads in a secondary register file.
In the following description, numerous specific details such as processor types, multithreading approaches, microarchitectural structures, architectural register names, and thread switching methodology have been set forth to provide a more thorough understanding of embodiments of the present invention. It will be appreciated, however, by one skilled in the art that embodiments of the invention may be practiced without such specific details. Additionally, some well-known structures, circuits, and the like have not been shown in detail to avoid unnecessarily obscuring the embodiments.
A particular hybrid of multithreading approaches is disclosed herein. Particularly, a combination of SoEMT and SMT multithreading approaches is referred to herein as a “Virtual Multithreading” approach. For SMT, two or more software threads may run concurrently in separate logical contexts. For SoEMT, only one of multiple software threads is active in a logical context at any given time. These two approaches are combined in Virtual Multithreading. In Virtual Multithreading, each of two or more logical contexts supports two or more SoEMT software threads, referred to as “virtual threads.”
For example, three virtual software threads may run on an SMT processor that supports two separate logical thread contexts. Only two of the three virtual software threads are active at any given time, one on each logical processor. Any of the three software threads may begin running, and then go into an inactive state upon occurrence of an SoEMT trigger event. The inactive state may be referred to herein as a “sleep” state, although the term “sleep state” is not intended to be limiting as used herein. “Sleep state” thus is intended to encompass, generally, the inactive state for an SoEMT thread. An inactive virtual thread may sometimes be referred to herein as a “sleeping” thread.
Because expiration of a TMUX multithreading timer may be considered a type of SoEMT trigger event, the use of the term “SoEMT” with respect to the embodiments described herein is intended to encompass multithreading wherein thread switches are performed upon the expiration of a TMUX timer, as well as upon other types of trigger events, such as a long latency cache miss, execution of a particular instruction type, and the like.
When resumed, a sleeping software thread need not resume in the same logical context in which it originally began execution—it may resume either in the same logical context or in another logical context. In other words, a virtual software thread may switch back and forth among logical contexts over time. Disclosed herein is a mechanism to efficiently maintain register values for multiple active and inactive software threads in order to support the hybrid Virtual Multithreading (VMT) environment.
The processor 104 thus may include a front end 120 that prefetches instructions that are likely to be executed. For at least one embodiment, the front end 120 includes a fetch/decode unit 222 that includes a logically independent sequencer 420A-420M for each of two or more physical thread contexts. The physical thread contexts may also be interchangeably referred to herein as “logical processors” and/or “physical threads.” The single physical fetch/decode unit 222 thus includes a plurality of logically independent sequencers 420A-420M, each corresponding to one of M physical threads. The front end 120 delivers the fetched instructions 145 to later stages of an execution pipeline.
For at least one embodiment, the processor 104 supports virtual multithreading in that the M physical threads may support N virtual software threads, wherein N>M. For at least one such embodiment, only one of the N virtual software threads is active on a physical thread at any given time. In other words, only M of the N software threads may be running at any given time, while the other N−M software threads are inactive.
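For purposes of illustration only, the relationship between the N virtual software threads and the M physical threads may be sketched in Python as follows. The class name, the first-in-first-out wake policy, and the initial assignment of virtual threads to contexts are assumptions made for this sketch and are not part of the disclosed embodiments.

```python
from collections import deque


class VirtualMultithreadingModel:
    """Toy model: M physical thread contexts host N > M virtual threads."""

    def __init__(self, num_physical, num_virtual):
        if num_virtual <= num_physical:
            raise ValueError("this sketch assumes N > M")
        # Assumption: the first M virtual threads start out active,
        # one per physical thread context.
        self.active = list(range(num_physical))  # context index -> virtual thread id
        self.sleeping = deque(range(num_physical, num_virtual))

    def switch(self, context):
        """On a trigger event in `context`, sleep its thread and wake another."""
        outgoing = self.active[context]
        incoming = self.sleeping.popleft()  # FIFO wake order is an assumption
        self.sleeping.append(outgoing)
        self.active[context] = incoming
        return outgoing, incoming
```

With M=2 and N=3, two successive switches (first on context 0, then on context 1) leave virtual thread 0 running on context 1, which illustrates that a virtual thread need not resume in the logical context in which it began.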
For at least one embodiment, the front end 120 is to provide special register swap instructions that it has either generated or has obtained from memory or software. For at least one embodiment, these register swap instructions are micro-operations. In other words, the register swap instructions may be understood and executed by an execution unit 190 but are not architecturally visible instructions. For other embodiments, of course, the register swap instructions may be architecturally visible instructions.
Regarding renaming, compiled or assembled software instructions reference the relatively small set of logical registers defined in the instruction set for a target processor. Superscalar processors attempt to exploit instruction level parallelism by issuing multiple instructions in parallel, thus improving performance. The instruction set for a processor commonly includes a limited number of available logical registers. As a result, the same logical register is often used in compiled code to represent many different variables, although a logical register represents only one variable at any given time.
However, the processor may provide a larger number of actual registers to store register values. This storage area is commonly a set of physical registers referred to as a physical register file 160. For example, a particular processor architecture might specify only eight (8) general-use registers while the processor 104 may provide 128 physical general-use registers in the physical register file 160.
The register rename logic 140 is to map each occurrence of the general use logical registers in an instruction stream to one of the physical registers 160. The renaming logic 140 may utilize a rename table 150 to keep track of the latest version of each architectural (logical) register to tell the next instruction(s) where (that is, from which physical register 160) to get its input operands. For at least one embodiment, the rename table 150 is referred to as a register alias table (RAT). For at least one embodiment, each logical processor 420A-420M may maintain and track its own architecture state and therefore may maintain its own RAT 150, or may be allocated a partitioned portion of a global RAT 150.
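By way of a non-limiting sketch, the mapping maintained by a register alias table may be modeled as shown below. The class and method names are hypothetical, the sizes follow the eight-logical, 128-physical example given above, and reclamation of physical registers at retirement is deliberately omitted.

```python
class RenameTable:
    """Minimal register alias table: logical register name -> physical index."""

    def __init__(self, logical_names, num_physical):
        # Free list of physical register indices (assumed allocation order).
        self.free = list(range(num_physical))
        # Give every logical register an initial physical home.
        self.rat = {name: self.free.pop(0) for name in logical_names}

    def rename(self, dest, sources):
        """Rename one instruction; returns (physical dest, physical sources)."""
        # Sources read the latest version of each logical register.
        phys_sources = [self.rat[s] for s in sources]
        # The destination receives a fresh physical register, and the table
        # is updated so later instructions read from the new location.
        phys_dest = self.free.pop(0)
        self.rat[dest] = phys_dest
        return phys_dest, phys_sources
```

In this sketch, an instruction that writes r1 is assigned a new physical register, and a subsequent instruction that reads r1 is directed to that new register rather than to the original one.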
Commonly, the general-purpose register file 160 is shared among logical processors within a processor 104. This scheme may result in inefficient utilization of the register file 160 by sleeping virtual threads. If all logical registers for each of the virtual threads are renamed to registers in the general-purpose register file 160, then the various virtual threads, even the inactive virtual threads, may utilize a relatively large number of the available physical registers 160. In addition to being inefficient, such an approach may, for at least some embodiments, lower the overall performance of the processor 104. Therefore, one of the challenges for a processor 104 that supports virtual multithreading and utilizes renaming is the storing and tracking of general-purpose register values for inactive virtual threads.
Due to the dynamic nature of virtual multithreading, a particular secondary register file 130 is not allocated to any particular virtual thread, but may be utilized to hold register values for any virtual thread that happens to be inactive at a given time.
The number of entries in each secondary register file 130 may be equivalent to the number of architectural registers defined for the processor 104. For the above example of an eight-register architecture, for instance, each secondary register file 130 may include eight entries, one for each general-purpose logical register. In some embodiments, therefore, the secondary register file 130 is quite a bit smaller than the general-purpose register file 160. Also, the secondary register files 130 may each be implemented, for example, as arrays having a single read port and a single write port. This implementation requires less overhead than a register file 160 implemented with multiple read and write ports. One should note that the example of an array data structure for the secondary register files 130 is given for purposes of illustration only, and should not be taken to be limiting. The secondary register files 130 may be implemented as any appropriate storage structure, including, for instance, an array (including a memory array or register array), a latch or group of latches, a register, or a buffer.
The read and write ports of each secondary register file 130 may be accessed by an execution unit 190, responsive to a register swap micro-operation. When execution unit 190 executes the micro-operation, the execution unit 190 is directed to place a register value from one of the secondary register files 130, rather than from the general register file 160, into the destination register. Such direction may be facilitated, at least in part, by action of the rename logic 140, as is discussed below.
The register swap micro-operation may be generated by control logic (not shown). For at least one other embodiment, the register swap micro-operation may be retrieved from a memory location, such as a microcode read only memory (ROM). For at least one other embodiment, the register swap micro-operation may be generated by software.
The register swap micro-operation may, for at least one embodiment, include a value that indicates which entry of the secondary register file 130 is to be accessed in order for the execution unit 190 to obtain the desired register value. For at least one embodiment, this value may be implicit. That is, the logical register identifier (provided as a source operand) may be utilized as the index into the secondary register file 130.
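The implicit indexing described above may be sketched, for illustration only, as a small array with one entry per architectural register. The names `SecondaryRegisterFile` and `entry_index` are hypothetical, and the sketch assumes logical registers named r0 through r7 for the eight-register example.

```python
NUM_ARCH_REGS = 8  # matches the eight-register example architecture


class SecondaryRegisterFile:
    """Small array with one entry per architectural (logical) register."""

    def __init__(self, num_entries=NUM_ARCH_REGS):
        self.entries = [0] * num_entries

    def read(self, index):
        # Models the single read port: one entry per access.
        return self.entries[index]

    def write(self, index, value):
        # Models the single write port: one entry per access.
        self.entries[index] = value


def entry_index(logical_register):
    """Implicit index: the logical register identifier selects the entry.

    Assumes identifiers of the form 'r<number>', so 'r3' selects entry 3.
    """
    return int(logical_register[1:])
```

Because the index is derived from the logical register identifier itself, no separate index field is needed in the micro-operation for this variant.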
For an embodiment having more than one secondary register file 130, such as the embodiment illustrated in
Reference is now made to
The point in the t0 instruction stream where thread 0 202 will stop executing instructions (until re-activated) is referred to herein as the “swap point.”
In response to detection of the thread switch trigger event 210, the front end 120 (
The example illustrated in Table 1 assumes that logical registers r1 through rx are subject to renaming. The term “switch_spool_op” indicates an opcode that is understood and executed by an execution unit 190 to result in the actions described below in connection with
The front end 120 (
The register swap micro-operations discussed above are thus provided by the front end 120. Each may constitute an instruction 145 that is renamed by rename logic 140. The register swap micro-operations 212 are thus renamed just like any other instruction. Accordingly,
Although
At block 304, it is determined whether a thread switch operation has been triggered by a trigger event. If so, then processing proceeds to block 306. Otherwise, processing ends at block 316.
At block 306, a register swap micro-operation is provided by the front end (such as, for example, front end 120 illustrated in
At block 308, each register swap micro-operation that was generated at block 306 is renamed. In particular, for each of the register swap micro-operations, blocks 310, 312 and 314 are performed.
At block 310, the source operand registers are renamed to reflect the physical register (such as, for example, one of physical registers 160 in
From block 310, processing proceeds to block 312. At block 312, the micro-operation is renamed such that a physical register is designated for the destination operand. Again, the illustrative embodiment shown in
From block 312, processing proceeds to block 314. At block 314, the micro-operation is modified to append a logical register index to the micro-operation. This action 314 is performed because, when the source register is renamed 310, the renamed micro-operation becomes disassociated from the original logical register designation. The execution unit may utilize the appended register index in order to locate the secondary register file 130 entry to be “swapped.” The appending 314 of a logical register index is optional. For at least one other embodiment, for example, the execution unit may consult a storage device, similar to a register alias table, that maps logical registers to the entries of the secondary register file 130 (
From block 314, processing ends at block 316. A processor, such as, for example, processor 104 illustrated in
Generally, when the micro-operation 402 is renamed 308, logical source and destination register identifiers are replaced with physical source and destination register identifiers in the renamed micro-operation 404.
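The renaming of a register swap micro-operation at blocks 310, 312 and 314 may be sketched as follows. This is a minimal model under stated assumptions: the field layout of `SwapMicroOp`, the dictionary used as a rename table, and the free list are all hypothetical, and only the opcode name “switch_spool_op” is taken from the description.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class SwapMicroOp:
    opcode: str
    dest: Union[str, int]    # logical name before renaming, physical index after
    source: Union[str, int]  # logical name before renaming, physical index after
    srf_id: int              # secondary register file identifier
    logical_index: Optional[str] = None  # appended at block 314


def rename_swap(uop, rat, free_list):
    """Return the renamed form of a register swap micro-operation."""
    # Block 310: the source is renamed to the physical register that
    # currently holds the logical register's value.
    phys_src = rat[uop.source]
    # Block 312: a physical register is designated for the destination,
    # and the rename table is updated to point at it.
    phys_dest = free_list.pop(0)
    rat[uop.dest] = phys_dest
    # Block 314: the original logical register is appended so the execution
    # unit can still locate the secondary register file entry.
    return SwapMicroOp(uop.opcode, phys_dest, phys_src, uop.srf_id,
                       logical_index=uop.source)
```

After renaming, the micro-operation carries only physical identifiers in its source and destination fields, which is why the appended logical index (or an equivalent lookup structure) is needed to find the matching secondary register file entry.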
Generally,
One of skill in the art will recognize that the format illustrated in Table 1, as well as the example micro-operation 402 illustrated in
Also,
At block 606, the appropriate entry (indicated by the logical register index) of the appropriate secondary register file 130 (indicated by the secondary register file identifier) is read. For at least one embodiment, this read operation provides the indicated secondary register file 130 entry value to the execution unit 190. Processing then proceeds to block 608.
At block 608, the source operand is read and retrieved from the primary register file (see, for example, 160 in
At block 610, the source operand value retrieved from the primary register file 160 (which is the value of the indicated logical register for the dozing thread) is written to the appropriate entry of the secondary register file 130. In this manner, the logical register value for the dozing thread is “swapped out” of the primary register file 160 to be stored as the secondary register file 130 value for that logical register. Processing then proceeds to block 612.
At block 612, the value that was retrieved from the secondary register file 130 at block 606 is placed on the result bus to be written to the primary register file 160. In this manner, the logical register value for the waking thread, which was read from the secondary register file 130 at block 606, is “swapped in” to the primary register file 160 to be stored as the current value for the indicated logical register. The register file 160 now holds, at the destination register, the current value of the logical register of interest for the waking thread. After such swap of the logical register values between the primary and secondary register files is completed at block 612, processing ends at block 614.
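For illustration only, the processing of blocks 606 through 612 may be modeled as the short Python function below. The function name, the use of plain lists for the primary and secondary register files, and the integer form of the logical register index are assumptions of the sketch rather than features of any disclosed embodiment.

```python
def execute_swap(primary, secondary_files, phys_src, phys_dest,
                 logical_index, srf_id):
    """Swap one logical register between the primary and a secondary file.

    primary          -- list modeling the primary register file
    secondary_files  -- list of lists, one per secondary register file
    phys_src         -- renamed source (dozing thread's current value)
    phys_dest        -- renamed destination (waking thread's value goes here)
    logical_index    -- entry of the secondary register file to swap
    srf_id           -- which secondary register file holds the waking thread
    """
    srf = secondary_files[srf_id]
    # Block 606: read the waking thread's value from the secondary file.
    waking_value = srf[logical_index]
    # Block 608: read the dozing thread's value from the primary file.
    dozing_value = primary[phys_src]
    # Block 610: swap the dozing thread's value out to the secondary file.
    srf[logical_index] = dozing_value
    # Block 612: swap the waking thread's value in to the primary file.
    primary[phys_dest] = waking_value
```

Both reads are performed before either write in this sketch, so the entry of the secondary register file can be reused for the dozing thread's value without losing the waking thread's value.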
The execution unit 190 also utilizes the secondary register file identifier (see “secondary register file identifier” field of Table 1, above) of the register swap micro-operation 402 to determine the appropriate secondary register file 130 for the waking thread. For our example, the execution unit 190 determines that the secondary register file identifier 706 (“const0”) of the renamed micro-operation 404 indicates that a value from secondary register file 0 130(0) is to be swapped in. For at least one other embodiment, the secondary register file identifier 706 is not appended to the micro-operation. Instead, a global signal is utilized to indicate to the functional unit which thread is the waking thread. The functional unit utilizes this global signal to determine the appropriate secondary register file 130.
For our example, the appended register index, “r1” 702, indicates that the r1 entry 710 of the secondary register file 130 is to be read 606. The value of the secondary register file indicator 706 is a constant value of zero (“const0”), indicating that secondary register file 0, 130(0), contains the logical register values of the waking thread. Accordingly, the execution unit 190 reads 606 the indicated entry 710 of the specified secondary register file 130(0). For our example, the indicated entry 710 contains the most current value of logical register r1 for the waking thread, t1 (see 204,
Similarly,
In summary, the discussion above discloses embodiments of a processor and methods for utilizing secondary register files to maintain register values for inactive virtual threads. According to at least some of the disclosed embodiments, register values for each of a plurality of active virtual threads are maintained in a primary register file 160, while register values for inactive threads are maintained in separate secondary register files. All registers of the primary register file 160 are available to rename logic 140. By maintaining register values for inactive threads in a secondary register file, more entries of the primary register file 160 are available for renaming of logical registers for active threads.
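The effect on the primary register file may be illustrated with simple arithmetic. The sketch below assumes, as a simplification, that each thread whose state resides in the primary register file occupies one physical register per architectural register. The register counts follow the eight-logical, 128-physical example above, while the thread counts (N=6 virtual threads, M=2 active) are hypothetical values chosen only for the example.

```python
def registers_left_for_renaming(physical_regs, arch_regs, resident_threads):
    """Physical registers not pinned by per-thread architectural state.

    Simplifying assumption: every resident thread pins exactly one
    physical register for each architectural register.
    """
    return physical_regs - arch_regs * resident_threads


# All N=6 virtual threads kept in the primary register file (hypothetical N).
all_threads_resident = registers_left_for_renaming(128, 8, 6)

# Only the M=2 active threads kept there; the rest live in secondary files.
active_threads_only = registers_left_for_renaming(128, 8, 2)
```

Under these assumptions, keeping only the active threads in the primary register file leaves 112 registers rather than 80 for renaming.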
While the secondary register file 130 embodiments disclosed herein may be practiced to maintain and swap active and inactive state element values for a plurality (N) of SoEMT software threads on a single physical thread, for at least one embodiment the number of physical threads is greater than one (M≧2).
One of skill in the art will also recognize that blocks 606, 608, 610 and 612 need not necessarily be performed in the order illustrated. Indeed, any alternative ordering of the illustrated processing may be utilized, as long as it achieves the functionality illustrated in
Memory 802 is intended as a generalized representation of memory and may include a variety of forms of memory, such as a hard drive, CD-ROM, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory and related circuitry. Memory 802 may store instructions 810 and/or data 812 represented by data signals that may be executed by processor 804. The instructions 810 and/or data 812 may include code for performing any or all of the techniques discussed herein.
The processor 804 may include a front end 870 along the lines of front end 120 described above in connection with
Front end 870 also supplies other instruction information to the execution core 830 and may include a fetch/decode unit 222 that includes M logically independent sequencers 420. For at least one embodiment, the front end 870 prefetches instructions that are likely to be executed. For at least one embodiment, the front end 870 may supply the instruction information to the execution core 830 in program order.
For at least one embodiment, the execution core 830 prepares instructions for execution, executes the instructions, and retires the executed instructions. The execution core 830 may include out-of-order logic (not shown) to schedule the instructions for out-of-order execution. The execution core 830 may also include one or more execution units 190 to perform the execution of instructions (as used herein, the term “instructions” includes micro-operations). The execution core 830 may also include a primary register file 160, secondary register files 130, rename logic 140 and one or more register alias tables 150, all of which are discussed above in connection with
The execution core 830 may include retirement logic (not shown) that reorders the instructions, executed in an out-of-order manner, back to the original program order. This retirement logic receives the completion status of the executed instructions from the execution unit(s) 190 and processes the results so that the proper architectural state is committed (or retired) according to the program order.
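The in-order retirement of instructions that complete out of order may be sketched, for illustration only, as follows. The function name and the use of integer sequence numbers (starting at zero) to represent program order are assumptions of the sketch, and the retirement logic itself is not shown in the drawings.

```python
def retire_in_program_order(completion_order):
    """Return, for each completion event, the batch of instructions retired.

    completion_order -- sequence numbers in the order the execution unit(s)
                        report completion (program order is 0, 1, 2, ...).
    """
    completed = set()
    next_to_retire = 0
    batches = []
    for seq in completion_order:
        completed.add(seq)
        batch = []
        # Retire only while the oldest unretired instruction has completed,
        # so architectural state is committed in program order.
        while next_to_retire in completed:
            completed.remove(next_to_retire)
            batch.append(next_to_retire)
            next_to_retire += 1
        batches.append(batch)
    return batches
```

In this model an instruction that completes early simply waits until every older instruction has completed, at which point it retires together with them.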
As used herein, the term “instruction information” is meant to refer to basic units of work that can be understood and executed by the execution core 830. Instruction information may be stored in a cache 825. The cache 825 may be implemented as an execution instruction cache or an execution trace cache. For embodiments that utilize an execution instruction cache, “instruction information” includes instructions that have been fetched from an instruction cache and decoded. For embodiments that utilize a trace cache, the term “instruction information” includes traces of decoded micro-operations. For embodiments that utilize neither an execution instruction cache nor trace cache, “instruction information” also includes raw bytes for instructions that may be stored in an instruction cache (such as I-cache 844).
The processing system 800 includes a memory subsystem 840 that may include one or more caches 842, 844 along with the memory 802. Although not pictured as such in
The foregoing discussion describes selected embodiments of methods, systems and apparatuses to maintain architectural register values for a plurality of virtual software threads within a processor. For purposes of explanation, specific numbers, examples, systems and configurations were set forth in order to provide a more thorough understanding. However, it is apparent to one skilled in the art that the described method and apparatus may be practiced without the specific details. In other instances, well-known features were omitted or simplified in order not to obscure the method and apparatus.
Embodiments of the method may be implemented in hardware, hardware emulation software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented for a programmable system comprising at least one processor, a data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.
A program may be stored on a storage medium or device (e.g., hard disk drive, floppy disk drive, read only memory (ROM), CD-ROM device, flash memory device, digital versatile disk (DVD), or other storage device) readable by a general or special purpose programmable processing system. The instructions, accessible to a processor in a processing system, provide for configuring and operating the processing system when the storage medium or device is read by the processing system to perform the procedures described herein. Embodiments of the invention may also be considered to be implemented as a machine-readable storage medium, configured for use with a processing system, where the storage medium so configured causes the processing system to operate in a specific and predefined manner to perform the functions described herein.
At least one embodiment of an example of such a processing system is shown in
While particular embodiments of the present invention have been shown and described, it will be obvious to those skilled in the art that changes and modifications can be made without departing from the present invention in its broader aspects.
For example, although the foregoing discussion focuses, for purposes of illustration, on embodiments for which only general purpose architectural register values are maintained in secondary register files 130, one of skill in the art will recognize that other embodiments may be fashioned to maintain the values of other types of registers, such as control registers, predicate registers, and the like.
Accordingly, one of skill in the art will recognize that changes and modifications can be made without departing from the present invention in its broader aspects. The appended claims are to encompass within their scope all such changes and modifications that fall within the true scope of the present invention.