Embodiments of the invention relate to microprocessor architecture. More particularly, embodiments of the invention relate to a technique for virtualizing register resources within a microprocessor.
High performance microprocessors typically use multi-stage (“deep”) pipeline architectures to facilitate running at high frequencies. In order to maintain high instruction parallelism with these deep pipelines, large buffering resources are typically used to minimize stalling of instructions within the pipeline.
For the example of load operations, a deeply pipelined processor typically has enough load buffers to ensure that, at least most of the time, issuing of new load instructions will not be stalled because all of the available load buffers are currently allocated to un-retired load instructions. This may be true of other operations, such as store operations, as well.
However, increasing the number of buffering resources may not always be the optimal solution. One reason is that a large buffer structure is more difficult to design than a smaller one. Furthermore, processor performance may be lost if accesses to a large buffer structure are pipelined in order to meet the operating frequency targets.
Typical high-performance processors are designed with sufficient buffering resources to cover their pipeline depth, at least for the majority of circumstances. Conversely, the pipeline depth can be balanced against the size of the buffers that can be successfully implemented at the target frequency. Furthermore, processors with deeper pipelines typically need more buffers than those with shorter pipelines. Adding more buffers to accommodate deeper pipelines in microprocessors can add cost, increase power consumption, and be difficult to implement.
In prior art microprocessor architectures, buffers are typically allocated early in the processor pipeline. Therefore, when all of the physical buffers are allocated, the processor typically stalls the next load instruction (and all subsequent instructions) at the allocate stage of the pipeline until a physical buffer becomes available. In deeply pipelined processors, the allocation stage of the pipeline is typically before the scheduling stage of the pipeline. Consequently, buffer allocation must occur prior to the operations being scheduled, which can degrade processor performance if the pipeline stalls.
Embodiments of the invention are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:
Embodiments of the invention pertain to microprocessor architecture. More particularly, embodiments of the invention pertain to virtualizing physical buffers within a microprocessor.
The term “buffer” shall be used as a generic term for any computer memory structure, including registers, static random access memory (SRAM), and dynamic RAM (DRAM). Furthermore, although numerous references are made to load buffers throughout, the concepts and principles described herein may readily be applied to other types of buffers, including store buffers.
Buffer virtualization techniques described herein involve increasing the number of allocatable buffers beyond the actual number of buffers within or used by a processor in order to facilitate higher processor performance without significantly increasing the cost or complexity of the processor design. For example, a relatively large number of load operations, such as 128 load operations, could be active in the processor at a given time, even though a relatively small number of physical buffers, such as 64 physical load buffers, are actually available.
In order to increase the effective buffer resources available to a processor architecture, embodiments of the invention involve techniques to map each virtual buffer to a physical buffer when necessary and to ensure that multiple operations, such as load and store operations, that share the same physical buffer entry do not interfere with each other when accessing that physical buffer entry.
In at least one embodiment of the invention, virtual buffers are mapped to physical buffers by indexing the lower n bits of the virtual buffer address into 2^n physical buffer entries. Advantageously, if the number of virtual load buffers is a power-of-2 multiple of the number of physical buffers (e.g., the number of virtual load buffers is 2, 4, 8, etc. times larger than the number of physical load buffers), then each physical buffer can be shared by the same number of virtual load buffers.
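The mapping described above can be sketched as follows. This is a minimal illustration, not part of the embodiments themselves; the function and constant names are assumptions, and the figures of 64 physical and 128 virtual load buffers are taken from the example given earlier.

```python
# Sketch of the virtual-to-physical buffer mapping: the lower n bits
# of the virtual buffer index select one of 2^n physical entries.
# Assumes 64 physical load buffers (n = 6) and 128 virtual load
# buffers (a power-of-2 multiple of the physical count).

NUM_PHYSICAL = 64   # 2**6 physical load buffers
NUM_VIRTUAL = 128   # power-of-2 multiple of NUM_PHYSICAL

def physical_index(virtual_index: int) -> int:
    """Keep only the lower n bits of the virtual index to pick
    the physical buffer entry it maps onto."""
    return virtual_index & (NUM_PHYSICAL - 1)

# Virtual buffers 0 and 64 share physical entry 0; virtual buffers
# 1 and 65 share physical entry 1, and so on.
print(physical_index(0), physical_index(64))   # 0 0
print(physical_index(1), physical_index(65))   # 1 1
```

Because the virtual count is a power-of-2 multiple of the physical count, exactly two virtual buffers share each physical entry in this configuration.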
In order to prevent two (or more) load operations that share the same physical buffer entry from interfering with each other when accessing the same buffer, a physical buffer check (PBC) algorithm may be used.
After a reset operation that places the processor in a known state, a head buffer pointer (HBP) is set to point to the last physical load buffer at operation 101. When a load buffer is de-allocated, the HBP is incremented by 1, wrapping back to 0 after pointing to the last virtual load buffer entry at operation 105. Whenever a load operation wants to check whether the correct physical load buffer is available for it to use, it can check whether the virtual load buffer index is less than or equal to the HBP (virtual LB index <= HBP) at operation 110. If the virtual load buffer index is less than or equal to HBP, then the physical load buffer is available at operation 115. Otherwise, the load operation can wait until the HBP is incremented, making the above equation true, at operation 120.
The HBP 210 would be initialized to 63 in this machine, such that the first load operation will successfully access the load buffer, since the virtual load buffer index is <= HBP, or 0 <= 63. However, the last load will fail this check, since the equation, virtual load buffer index <= HBP, will not be true. After the first load operation retires and de-allocates its load buffer, the HBP will increment to 64, enabling the last load (with virtual load buffer index = 64) to access the physical load buffer at index 0.
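The PBC algorithm and the worked example above can be modeled as follows. This is an illustrative sketch only: the class and method names are assumptions, and the configuration of 64 physical and 128 virtual load buffers mirrors the example in the text.

```python
# Minimal model of the physical buffer check (PBC) using a head
# buffer pointer (HBP). Assumes 64 physical and 128 virtual load
# buffers, as in the example above.

NUM_PHYSICAL = 64
NUM_VIRTUAL = 128

class PhysicalBufferCheck:
    def __init__(self):
        # After reset, the HBP points to the last physical load
        # buffer (operation 101).
        self.hbp = NUM_PHYSICAL - 1

    def buffer_available(self, virtual_index: int) -> bool:
        # A load may access its physical buffer entry only when
        # virtual LB index <= HBP (operations 110/115); otherwise
        # it waits (operation 120).
        return virtual_index <= self.hbp

    def deallocate(self):
        # Each de-allocation increments the HBP by 1, wrapping back
        # to 0 after the last virtual load buffer entry
        # (operation 105).
        self.hbp = (self.hbp + 1) % NUM_VIRTUAL

pbc = PhysicalBufferCheck()
print(pbc.buffer_available(0))    # True:  0 <= 63
print(pbc.buffer_available(64))   # False: 64 <= 63 fails, so the
                                  # load waits
pbc.deallocate()                  # first load retires; HBP -> 64
print(pbc.buffer_available(64))   # True: virtual buffer 64 may now
                                  # use physical entry 0
```

Note how the check guarantees that the two virtual buffers sharing a physical entry (here, indices 0 and 64) never access it at the same time: the later one is held back until the earlier one de-allocates.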
The PBC algorithm may be implemented at various stages in the processor pipeline. However, implementing the PBC algorithm at a stage earlier in the pipeline than the stage at which the physical buffer needs to be accessed by an operation, such as a load or store operation, can yield advantageous results.
The PBC algorithm may be implemented partially or completely in logic within any portion of the microprocessor. However, advantageous results can result if the PBC algorithm is implemented in logic within the scheduler unit 315. The exact or relative location of the execution unit and portions of embodiments of the invention are not intended to be limited to those illustrated within
By implementing the PBC algorithm within the scheduler of the processor in
The main memory may be implemented in various memory sources, such as dynamic random-access memory (DRAM), a hard disk drive (HDD) 420, or a memory source located remotely from the computer system via network interface 430 containing various storage devices and technologies. The cache memory may be located either within the processor or in close proximity to the processor, such as on the processor's local bus 407. Furthermore, the cache memory may contain relatively fast memory cells, such as a six-transistor (6T) cell, or other memory cell of approximately equal or faster access speed.
The computer system of
The
At least one embodiment of the invention may be located within the memory controller hub 572 or 582 of the processors. Other embodiments of the invention, however, may exist in other circuits, logic units, or devices within the system of
Various aspects of embodiments of the invention may be implemented using complementary metal-oxide-semiconductor (CMOS) circuits and logic devices (hardware), while other aspects may be implemented using instructions stored on a machine-readable medium (software), which, if executed by a processor, would cause the processor to perform a method to carry out embodiments of the invention. Furthermore, some embodiments of the invention may be performed solely in hardware, whereas other embodiments may be performed solely in software.
While the invention has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various modifications of the illustrative embodiments, as well as other embodiments, which are apparent to persons skilled in the art to which the invention pertains are deemed to lie within the spirit and scope of the invention.