The present invention relates to the field of information or data processing. More specifically, this invention relates to the field of implementing a processor achieving increased performance and energy saving via instruction pre-completion without having to schedule instruction execution in processor execution units.
In conventional processor architectures, instructions require an operation in an execution unit to be completed. For example, an instruction could be an arithmetic instruction (e.g., add and subtract), requiring an integer or floating-point computation unit to execute the instruction and return the result. Generally, processors decode instructions to determine what needs to be done. Next, the instruction is scheduled for execution and any necessary operands and source or destination registers are identified. At execution time, data and/or operands are read from source registers, the instruction is processed and the result returned to a destination register. By processing all instructions in the same manner, conventional processors have the potential to waste operational cycles and power by scheduling and executing instructions that could be performed without use of an execution unit. Moreover, latency increases since scheduling an instruction that could be completed without use of an execution unit prevents other instructions from being processed.
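The conventional flow described above can be illustrated with a short sketch in which every instruction, even a trivial one, consumes a scheduling slot and an execution cycle. All names, structures, and cycle counts below are purely illustrative assumptions, not a description of any particular processor.

```python
# Illustrative sketch of the conventional decode -> schedule -> execute ->
# write-back flow: every instruction occupies the scheduler and an
# execution unit, even when the result could be produced another way.

def run_conventional(instructions, registers):
    """Process each instruction through decode, schedule, execute, write-back."""
    cycles = 0
    for inst in instructions:
        op, dst, src1, src2 = inst               # decode: identify operands
        cycles += 1                              # schedule: occupy an issue slot
        a, b = registers[src1], registers[src2]  # read source registers
        if op == "add":                          # execute in an ALU
            result = a + b
        elif op == "sub":
            result = a - b
        cycles += 1                              # execution cycle
        registers[dst] = result                  # write back to destination

    return cycles

regs = {"r0": 0, "r1": 5, "r2": 3}
used = run_conventional([("add", "r0", "r1", "r2")], regs)
```

Under this model even an instruction whose result is already known must pay the scheduling and execution cycles, which is the waste the disclosure is directed at avoiding.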
An apparatus is provided for achieving increased performance and energy saving via instruction pre-completion without having to schedule instruction execution in all the processor execution units. The apparatus comprises an operational unit for determining whether an instruction can be completed without scheduling use of an execution unit of the processor, and units within the operational unit capable of completing the instruction outside the conventional schedule and execute paths. In this way, the instruction is completed without use of one or more execution units of the processor.
A method is provided for achieving increased performance and energy saving via instruction pre-completion without having to schedule instruction execution in processor execution units. The method comprises determining that an instruction can be completed without use of an execution unit of a processor and then pre-completing the instruction without the execution unit such as by employing alternate or equivalent processes or techniques to complete the instruction.
The present invention will hereinafter be described in conjunction with the following drawing figures, wherein like numerals denote like elements, and
The following detailed description is merely exemplary in nature and is not intended to limit the invention or the application and uses of the invention. As used herein, the word “exemplary” means “serving as an example, instance, or illustration.” Thus, any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments. Moreover, as used herein, the word “processor” encompasses any type of information or data processor, including, without limitation, Internet access processors, Intranet access processors, personal data processors, military data processors, financial data processors, navigational processors, voice processors, music processors, video processors or any multimedia processors. All of the embodiments described herein are exemplary embodiments provided to enable persons skilled in the art to make or use the invention and not to limit the scope of the invention which is defined by the claims. Furthermore, there is no intention to be bound by any expressed or implied theory presented in the preceding technical field, background, or brief summary, or in the following detailed description, or with respect to any particular processor microarchitecture.
Referring now to
Referring now to
In operation, the decode unit 24 decodes the incoming operation-codes (opcodes) to be dispatched for the computations or processing. The decode unit 24 is responsible for the general decoding of instructions (e.g., x86 instructions and extensions thereof) and for determining how the delivered opcodes may differ from the original instruction. The decode unit 24 will also pass on physical register numbers (PRNs) from an available list of PRNs (often referred to as the Free List (FL)) to the rename unit 28.
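A free list of the kind handed from the decode unit 24 to the rename unit 28 can be sketched as a simple pool of available PRNs. The structure and names below are illustrative assumptions only; a hardware free list would be implemented quite differently.

```python
# Hypothetical sketch of a Free List (FL) of physical register numbers
# (PRNs): decode hands available PRNs to rename, and PRNs return to the
# pool once their values are no longer needed.

from collections import deque

class FreeList:
    def __init__(self, num_prns):
        self.free = deque(range(num_prns))  # all PRNs initially available

    def allocate(self):
        """Hand the next available PRN to the rename stage."""
        return self.free.popleft()

    def release(self, prn):
        """Return a PRN to the pool when its value is dead."""
        self.free.append(prn)

fl = FreeList(4)
p = fl.allocate()
```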
The rename unit 28 maps logical register numbers (LRNs) to the physical register numbers (PRNs) prior to scheduling and execution. According to various embodiments of the present disclosure, the rename unit 28 can be utilized to rename or remap logical registers in a manner that eliminates the need to store known data values in a physical register. In one embodiment, this is implemented with a register mapping table stored in the rename unit 28. According to the present disclosure, renaming or remapping registers saves operational cycles and power, as well as decreases latency.
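One possible realization of the register mapping table described above is sketched below: a reserved PRN that always reads as a known value (here, zero) lets the rename unit satisfy certain instructions without storing the value in an ordinary physical register. The reserved-PRN scheme, the table layout, and all names are illustrative assumptions, not the only way to eliminate the stored value.

```python
# Illustrative rename map: logical register numbers (LRNs) map to
# physical register numbers (PRNs). A hypothetical reserved PRN whose
# value is always zero allows a known-zero result to be "stored" by
# remapping alone, with no physical register write.

ZERO_PRN = 0  # reserved PRN that always reads as 0 (an assumption)

class RenameMap:
    def __init__(self, num_lrns, first_free_prn=1):
        self.table = {lrn: None for lrn in range(num_lrns)}
        self.next_prn = first_free_prn

    def rename(self, lrn):
        """Map an LRN to a fresh PRN for an ordinary destination."""
        prn = self.next_prn
        self.next_prn += 1
        self.table[lrn] = prn
        return prn

    def map_known_zero(self, lrn):
        """Point the LRN at the reserved zero PRN: no register write needed."""
        self.table[lrn] = ZERO_PRN

rm = RenameMap(num_lrns=8)
rm.map_known_zero(3)  # e.g., a zeroing idiom completed at rename
```

Because the remap itself records the known value, no scheduler slot, execution cycle, or register-file write is spent producing it.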
The scheduler 30 contains a scheduler queue and associated issue logic. As its name implies, the scheduler 30 is responsible for determining which opcodes are passed to execution units and in what order. In one embodiment, the scheduler 30 accepts renamed opcodes from rename unit 28 and stores them in the scheduler 30 until they are eligible to be selected by the scheduler to issue to one of the execution pipes.
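The queue-and-issue behavior of the scheduler 30 can be sketched as follows. The readiness model, the oldest-first selection policy, and all names are illustrative assumptions; real issue logic selects among execution pipes in parallel hardware.

```python
# Minimal sketch of a scheduler queue: renamed opcodes wait until their
# source PRNs are ready, then the oldest eligible opcode issues.

class Scheduler:
    def __init__(self):
        self.queue = []          # opcodes waiting to issue, oldest first
        self.ready_prns = set()  # PRNs whose values have been produced

    def accept(self, opcode, src_prns, dst_prn):
        """Store a renamed opcode until it is eligible to issue."""
        self.queue.append((opcode, tuple(src_prns), dst_prn))

    def issue(self):
        """Issue the oldest opcode whose sources are all ready, if any."""
        for i, (opcode, srcs, dst) in enumerate(self.queue):
            if all(s in self.ready_prns for s in srcs):
                self.queue.pop(i)
                self.ready_prns.add(dst)  # result becomes available
                return opcode
        return None

s = Scheduler()
s.ready_prns.update({1, 2})
s.accept("add", [1, 2], 3)
s.accept("sub", [3, 9], 4)  # must wait: PRN 9 is not yet ready
```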
The register file control 32 holds the physical registers. The physical register numbers and their associated valid bits arrive from the scheduler 30. Source operands are read out of the physical registers and results written back into the physical registers. In one embodiment, the register file control 32 also checks for parity errors on all operands before the opcodes are delivered to the execution units. In a multi-pipelined (super-scalar) architecture, an opcode (with any data) would be issued for each execution pipe.
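The operand parity check mentioned above can be sketched as storing a parity bit alongside each value and verifying it on every read. This is purely illustrative; a real register file protects operands in hardware and would typically signal, rather than raise, an error.

```python
# Sketch of a register file with per-entry parity: each stored word
# carries a parity bit that is re-checked when operands are read out,
# before opcodes are delivered to the execution units.

def parity(word):
    """Even/odd parity of the set bits in a word."""
    return bin(word).count("1") & 1

class RegisterFile:
    def __init__(self, size):
        self.entries = [(0, 0)] * size  # (value, parity) pairs

    def write(self, prn, value):
        self.entries[prn] = (value, parity(value))

    def read(self, prn):
        """Read a source operand, checking its parity first."""
        value, p = self.entries[prn]
        if parity(value) != p:
            raise RuntimeError("parity error on operand read")
        return value

rf = RegisterFile(8)
rf.write(2, 0b1011)
```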
The execute unit(s) 34 may be embodied as any general-purpose or specialized execution architecture as desired for a particular processor. In one embodiment, the execution unit may be realized as a single instruction multiple data (SIMD) arithmetic logic unit (ALU). In another embodiment, dual or multiple SIMD ALUs could be employed for super-scalar and/or multi-threaded embodiments, which operate to produce results and any exception bits generated during execution.
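The single-instruction, multiple-data behavior of such an ALU can be sketched as one opcode applied across every lane of its operands. The lane count and function name are assumptions for illustration.

```python
# Illustrative SIMD ALU operation: a single opcode performs the same
# arithmetic on every lane of its vector operands.

def simd_add(a_lanes, b_lanes):
    """Element-wise add across lanes; a real ALU would also report exceptions."""
    return [a + b for a, b in zip(a_lanes, b_lanes)]

result = simd_add([1, 2, 3, 4], [10, 20, 30, 40])
```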
In one embodiment, after an opcode has been executed, the instruction can be retired so that the state of the floating-point unit 16 or integer unit 18 can be updated with a self-consistent, non-speculative architected state consistent with the serial execution of the program. The retire unit 36 maintains an in-order list of all opcodes in process in the floating-point unit 16 (or integer unit 18 as the case may be) that have passed the rename unit 28 and have not yet been committed to the architectural state. The retire unit 36 is responsible for committing all the floating-point unit 16 or integer unit 18 architectural states upon retirement of an opcode.
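The in-order commitment described above can be sketched as follows: opcodes may complete out of order, but the architectural state is committed strictly from the head of an in-order list. The structure and names are illustrative assumptions.

```python
# Sketch of in-order retirement: completion can happen out of order, but
# commitment to the architectural state proceeds strictly in program
# order, stopping at the first opcode that has not yet completed.

from collections import deque

class RetireUnit:
    def __init__(self):
        self.in_order = deque()  # opcodes past rename, oldest first

    def track(self, opcode):
        self.in_order.append({"op": opcode, "done": False})

    def complete(self, opcode):
        for entry in self.in_order:
            if entry["op"] == opcode:
                entry["done"] = True
                break

    def retire(self):
        """Commit completed opcodes from the head; stop at the first unfinished."""
        committed = []
        while self.in_order and self.in_order[0]["done"]:
            committed.append(self.in_order.popleft()["op"])
        return committed

r = RetireUnit()
for op in ["i0", "i1", "i2"]:
    r.track(op)
r.complete("i2")  # a younger opcode finishes first
r.complete("i0")
```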
According to embodiments of the present disclosure, instructions are identified that can be pre-completed without scheduling that instruction for execution in an execution unit. Pre-completed (or pre-completing), in this sense, means using processes or processor architectural improvements to complete certain instructions without using one or more execution unit(s). That is, instructions are pre-completed from the perspective of one or more execution units since those execution units are not utilized for processing instructions as in conventional processor architectures. By using alternate or equivalent techniques, processes or processor architectural improvements to pre-complete instructions, operational cycles and power are saved and latency is reduced by bypassing or avoiding the scheduling and certain execution stages. Certain examples of such instructions are presented below; however, these examples do not limit the scope of the present disclosure, and numerous other instructions from various processor architectures and/or instruction sets can benefit from the advantages of the present disclosure.
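The identification step described above can be sketched as a simple predicate over opcodes. The membership set below is a hypothetical illustration only (FINCSTP is the example discussed in this disclosure; the other entries are assumed candidates, not claims about any instruction set).

```python
# Hypothetical predicate for the pre-completion determination: a small,
# illustrative set of opcodes can be completed by bookkeeping alone,
# without issuing to an execution unit.

PRE_COMPLETABLE = {"FINCSTP", "FDECSTP", "NOP"}  # assumed example set

def can_pre_complete(opcode):
    """Return True if the opcode can bypass scheduling and execution."""
    return opcode in PRE_COMPLETABLE
```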
Referring now to
While conventional processor architectures would schedule and execute an FINCSTP (increment stack pointer) instruction in an execution unit (such as by executing a write instruction to write a new address into the stack pointer), the present disclosure achieves an advantage by completing the FINCSTP instruction without scheduling the use of an execution unit or using that execution unit in the completion of the instruction. That is, in one embodiment, the processor and method of the present disclosure pre-completes the FINCSTP instruction without use of the scheduling unit (30 in
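One way the FINCSTP pre-completion described above could be realized is sketched below. The x87 top-of-stack is a three-bit field, so incrementing it is a small bookkeeping update that front-end logic can apply directly rather than scheduling a write through an execution pipe. The class and method names are hypothetical, and this sketch is an assumed realization, not the disclosed implementation.

```python
# Illustrative pre-completion of FINCSTP: the 3-bit x87 top-of-stack
# field is incremented (with wrap-around from 7 to 0) as a bookkeeping
# update, so no scheduler slot or execution unit is consumed.

class FpStackState:
    def __init__(self):
        self.top = 0  # 3-bit x87 top-of-stack field (0-7)

    def pre_complete_fincstp(self):
        """Complete FINCSTP by adjusting the pointer; no execution unit used."""
        self.top = (self.top + 1) & 0b111  # wraps 7 -> 0

st = FpStackState()
st.top = 7
st.pre_complete_fincstp()
```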
Referring now to
Referring now to
Various processor-based devices may advantageously use the processor (or computational unit) of the present disclosure, including laptop computers, digital books, printers, scanners, standard or high-definition televisions or monitors and standard or high-definition set-top boxes for satellite or cable programming reception. In each example, any other circuitry necessary for the implementation of the processor-based device would be added by the respective manufacturer. The above listing of processor-based devices is merely exemplary and not intended to be a limitation on the number or types of processor-based devices that may advantageously use the processor (or computational unit) of the present disclosure.
While at least one exemplary embodiment has been presented in the foregoing detailed description of the invention, it should be appreciated that a vast number of variations exist. It should also be appreciated that the exemplary embodiment or exemplary embodiments are only examples, and are not intended to limit the scope, applicability, or configuration of the invention in any way. Rather, the foregoing detailed description will provide those skilled in the art with a convenient road map for implementing an exemplary embodiment of the invention, it being understood that various changes may be made in the function and arrangement of elements described in an exemplary embodiment without departing from the scope of the invention as set forth in the appended claims and their legal equivalents.