The subject matter presented here relates to the field of information or data processor architecture. More specifically, this invention relates to scalar code optimization.
Information or data processors are found in many contemporary electronic devices such as, for example, personal computers, personal digital assistants, game playing devices, video equipment and cellular phones. Processors are implemented in hardware as one or more integrated circuits, and they execute software to implement various functions in any processor-based device. Generally, software is written in a form known as source code, which is compiled (by a compiler) into object code. Object code comprises assembly language instructions that are executed by the processor using the processor's instruction set. An instruction set defines the instructions that a processor can execute. Instructions include arithmetic instructions (e.g., add and subtract), logic instructions (e.g., AND, OR, and NOT instructions), and data instructions (e.g., move, input, output, load, and store instructions). As is known, computers with different architectures can share a common instruction set. For example, processors from different manufacturers may implement nearly identical versions of an instruction set (e.g., an x86 instruction set) yet have substantially different architectural designs.
SSE (Streaming SIMD (Single-Instruction-Multiple-Data) Extensions) and AVX (Advanced Vector Extensions) are extensions of the x86 instruction set. Scalar code (instructions) executed in SSE or AVX requires merging of a portion of the XMM register, which can produce dependencies between operations that can greatly reduce performance. Generally, the upper portion of the XMM register in scalar code (instructions) will be logical zero. Thus, for double-precision scalar data, bits 64 through 127 are often zero, while for single-precision scalar data, bits 32 through 127 are often zero. Since the scalar operations are defined such that the value in the upper portion remains unchanged and is simply passed through, dependencies that cause scalar instructions to wait for processing of the upper portion of scalar data waste power and operational cycles of the processor. Such dependencies are known as false dependencies and should be eliminated whenever possible.
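The merge semantics described above can be illustrated with a small model. The following sketch is hypothetical (it is not an emulator of any particular processor): an XMM register is modeled as a 128-bit integer, and a scalar double-precision operation such as ADDSD writes only bits 0 through 63 while passing bits 64 through 127 of the destination through unchanged — which is why the result depends on the prior destination value even when those upper bits are already zero.

```python
import struct

LOWER64 = (1 << 64) - 1  # mask for bits 0-63 of a 128-bit XMM value

def addsd(dest: int, src: int) -> int:
    """Model of ADDSD merge semantics: add the low 64-bit lanes and
    pass the destination's upper 64 bits through unchanged.

    Because the result's upper bits come from `dest`, the operation
    depends on the previous value of the destination register even
    though, in typical scalar code, those bits are already zero --
    the 'false dependency' discussed above.
    """
    a = struct.unpack('<d', (dest & LOWER64).to_bytes(8, 'little'))[0]
    b = struct.unpack('<d', (src & LOWER64).to_bytes(8, 'little'))[0]
    low = int.from_bytes(struct.pack('<d', a + b), 'little')
    return (dest & ~LOWER64) | low  # upper bits merged from prior dest
```

In this model, eliminating the false dependency corresponds to computing `low` without ever reading the prior `dest & ~LOWER64` term when it is known to be zero.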
An apparatus is provided for increased efficiency and enhanced power saving in a processor via scalar code optimization. The apparatus comprises an operational unit capable of determining whether an instruction comprises a scalar instruction, and execution units responsive to that determination for processing the scalar instruction using only a lower portion of an XMM register of the processor. By not processing the upper portion of the XMM register, efficiency is increased and power saving is enhanced.
A method is provided for increased efficiency and enhanced power saving in a processor via scalar code optimization. The method comprises determining that an instruction comprises a scalar instruction and then processing the instruction using only a lower portion of an XMM register. By not processing the upper portion of the XMM register, efficiency is increased and power saving is enhanced.
Embodiments of the present invention will hereinafter be described in conjunction with the following drawing figures, wherein like numerals denote like elements, and
The following detailed description is merely exemplary in nature and is not intended to limit the invention or the application and uses of the invention. As used herein, the word “exemplary” means “serving as an example, instance, or illustration.” Thus, any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments. Moreover, as used herein, the word “processor” encompasses any type of information or data processor, including, without limitation, Internet access processors, Intranet access processors, personal data processors, military data processors, financial data processors, navigational processors, voice processors, music processors, video processors or any multimedia processors. All of the embodiments described herein are exemplary embodiments provided to enable persons skilled in the art to make or use the invention and not to limit the scope of the invention which is defined by the claims. Furthermore, there is no intention to be bound by any expressed or implied theory presented in the preceding technical field, background, brief summary, the following detailed description or for any particular processor microarchitecture.
Referring now to
Referring now to
In operation, the decode unit 24 decodes the incoming operation codes (opcodes) dispatched (or fetched) by a computational unit. The decode unit 24 is responsible for the general decoding of instructions (e.g., x86 instructions and extensions thereof) and for determining how the delivered opcodes may differ from the original instruction. The decode unit 24 also passes physical register numbers (PRNs) from an available list of PRNs (often referred to as the Free List (FL)) to the rename unit 26.
The rename unit 26 maps logical register numbers (LRNs) to the physical register numbers (PRNs) prior to scheduling and execution. According to various embodiments of the present disclosure, the rename unit 26 can be utilized to rename or remap logical registers in a manner that eliminates the need to actually store known data values in a physical register. This saves operational cycles and power, and also decreases latency.
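One way such a rename stage might work is sketched below. This is illustrative only, not the patented design: LRNs are mapped to PRNs drawn from a free list, and any destination whose value is known to be zero is aliased to a single shared "zero" PRN (a hypothetical reserved register number) so that no physical register, and no write, is spent storing the known value.

```python
from collections import deque

ZERO_PRN = 0  # hypothetical reserved PRN that always reads as zero

class RenameUnit:
    """Illustrative LRN-to-PRN renamer with a free list of PRNs."""

    def __init__(self, num_prns: int):
        self.free_list = deque(range(1, num_prns))  # PRN 0 is reserved
        self.map = {}  # LRN -> PRN

    def rename_dest(self, lrn: int, known_zero: bool = False) -> int:
        # A known-zero result is remapped to the shared zero PRN,
        # so no physical register needs to hold (or be written with)
        # the known data value.
        prn = ZERO_PRN if known_zero else self.free_list.popleft()
        self.map[lrn] = prn
        return prn

    def lookup_src(self, lrn: int) -> int:
        # Unwritten registers read as the zero PRN in this sketch.
        return self.map.get(lrn, ZERO_PRN)
```

A consumer that sees `ZERO_PRN` as a source need not wait on any producer, which is how the remapping removes the dependency along with the storage.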
The scheduler 28 contains a scheduler queue and associated issue logic. As its name implies, the scheduler 28 is responsible for determining which opcodes are passed to execution units and in what order. In one embodiment, the scheduler 28 accepts renamed opcodes from rename unit 26 and stores them in the scheduler 28 until they are eligible to be selected by the scheduler to issue to one of the execution pipes.
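The queue-and-issue behavior described above can be sketched in a few lines. The structure and method names below are illustrative assumptions, not the disclosed design: an opcode is held until all of its source PRNs are ready, and among the eligible opcodes the oldest issues first.

```python
class Scheduler:
    """Illustrative scheduler queue: hold renamed opcodes until
    their sources are ready, then issue oldest-first."""

    def __init__(self):
        self.queue = []  # entries: (sequence number, opcode, unready source PRNs)

    def accept(self, seq: int, opcode: str, waiting_on) -> None:
        # Renamed opcode enters the scheduler queue.
        self.queue.append((seq, opcode, set(waiting_on)))

    def wake(self, prn: int) -> None:
        # A completing result wakes any opcode waiting on its PRN.
        for entry in self.queue:
            entry[2].discard(prn)

    def issue(self):
        # Select the oldest opcode whose sources are all ready.
        ready = [e for e in self.queue if not e[2]]
        if not ready:
            return None
        entry = min(ready, key=lambda e: e[0])
        self.queue.remove(entry)
        return entry[1]
</test>```

In this model, breaking a false dependency amounts to never placing the merged upper-portion source PRN in an opcode's `waiting_on` set, so the opcode becomes eligible to issue without waiting for it.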
The execute unit(s) 30 may be embodied as any general purpose or specialized execution architecture as desired for a particular processor. In one embodiment the execution unit may be realized as a single instruction multiple data (SIMD) arithmetic logic unit (ALU). In other embodiments, dual or multiple SIMD ALUs could be employed for super-scalar and/or multi-threaded embodiments, which operate to produce results and any exception bits generated during execution.
In one embodiment, after an opcode has been executed, the instruction can be retired so that the state of the floating-point unit 16 or integer unit 18 can be updated with a self-consistent, non-speculative architected state consistent with the serial execution of the program. The retire unit 32 maintains an in-order list of all opcodes in process in the floating-point unit 16 (or integer unit 18 as the case may be) that have passed the rename 26 stage and have not yet been committed to the architectural state. The retire unit 32 is responsible for committing all the floating-point unit 16 or integer unit 18 architectural states upon retirement of an opcode.
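The in-order commit behavior of such a retire unit can be modeled as follows. This is a generic sketch of in-order retirement, with illustrative names, not the specific retire unit 32: opcodes may finish executing out of order, but architectural state is committed strictly in program order, from the head of the in-order list.

```python
from collections import deque

class RetireUnit:
    """Illustrative in-order retirement: commit strictly in program
    order, and only once the oldest opcode has finished executing."""

    def __init__(self):
        self.rob = deque()  # in-order list of in-flight opcodes

    def allocate(self, tag: str) -> None:
        # Opcode enters the in-order list after the rename stage.
        self.rob.append({'tag': tag, 'done': False})

    def mark_done(self, tag: str) -> None:
        # Execution may complete out of order.
        for entry in self.rob:
            if entry['tag'] == tag:
                entry['done'] = True

    def retire(self):
        # Commit from the head only while the oldest opcode is done,
        # keeping the architected state consistent with serial
        # execution of the program.
        committed = []
        while self.rob and self.rob[0]['done']:
            committed.append(self.rob.popleft()['tag'])
        return committed
```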
Referring now to
In a similar manner,
As an example, and not as a limitation, the following table lists some of the scalar instructions in the x86 SSE and AVX128 instruction sets that may cause false dependencies due to the merged 'upper' data and can benefit from breaking those false dependencies.
Many other scalar instructions can benefit from the power-saving enhancement of the present disclosure, even if those scalar instructions do not cause false dependencies. That is, while a scalar instruction may not cause a dependency, it may still produce a logic-zero value for the upper portion of the XMM register. Accordingly, not processing the known-zero upper-portion value for those instructions provides a power-saving enhancement even when there are no false dependencies to eliminate. As an example, and not as a limitation, the x86 instruction set contains approximately 100 scalar instructions that could benefit from the power-saving enhancement of the present disclosure.
Referring now to
Referring now to
Beginning in step 50, an instruction is decoded. Based on the decoded instruction, the Z-bit value is determined as a function of the instruction and the Z-bit values of its source and destination operands. Using this information, the Z-bit information is set in step 51. Next, decision 52 determines whether the instruction is a scalar instruction (single-precision or double-precision). If so, decision 54 determines whether the particular scalar instruction has false dependencies that can be removed. If so, step 55 breaks (eliminates) those false dependencies by not waiting for the merged data source for the upper (zero) data portion of the XMM register. Following this, and also in the event that decision 54 determines that there are no dependencies to be eliminated, the instruction is processed using only the lower portion of the XMM register (step 56), which affords the power-saving enhancement of the present disclosure. When the scalar instruction is completed, it can be retired 58 (see 32 of
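The steps above can be sketched as a single control-flow function. The instruction is modeled as a simple dict, and the field names (`op_zeroes_upper`, `src_z_bits`, `has_false_dependency`) are illustrative assumptions rather than fields of any real microarchitecture; the step numbers from the flow are tracked in comments.

```python
def handle(inst: dict) -> dict:
    """Illustrative walk through steps 50-58 of the described flow."""
    trace = []
    # Steps 50/51: decode, then set the Z-bit as a function of the
    # instruction and the Z-bits of its operands (assumed fields).
    inst['z_bit'] = inst.get('op_zeroes_upper', False) or all(
        inst.get('src_z_bits', [False]))
    trace.append('z-bit set' if inst['z_bit'] else 'z-bit clear')
    # Decision 52: scalar instruction?
    if inst.get('scalar'):
        # Decision 54 / step 55: break any false dependency by not
        # waiting on the merged upper (zero) data source.
        if inst.get('has_false_dependency'):
            trace.append('false dependency broken')
        # Step 56: execute using only the lower XMM portion.
        trace.append('executed lower portion only')
    else:
        trace.append('executed full width')
    # Step 58: retire the completed instruction.
    trace.append('retired')
    return {'trace': trace, 'z_bit': inst['z_bit']}
```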
Various processor-based devices that may advantageously use the processor (or any computational unit) of the present disclosure include, but are not limited to, laptop computers, digital books or readers, printers, scanners, standard or high-definition televisions or monitors and standard or high-definition set-top boxes for satellite or cable programming reception. In each example, any other circuitry necessary for the implementation of the processor-based device would be added by the respective manufacturer. The above listing of processor-based devices is merely exemplary and not intended to be a limitation on the number or types of processor-based devices that may advantageously use the processor (or any computational unit) of the present disclosure.
While at least one exemplary embodiment has been presented in the foregoing detailed description of the invention, it should be appreciated that a vast number of variations exist. It should also be appreciated that the exemplary embodiment or exemplary embodiments are only examples, and are not intended to limit the scope, applicability, or configuration of the invention in any way. Rather, the foregoing detailed description will provide those skilled in the art with a convenient road map for implementing an exemplary embodiment of the invention, it being understood that various changes may be made in the function and arrangement of elements described in an exemplary embodiment without departing from the scope of the invention as set forth in the appended claims and their legal equivalents.