This application is related to processors and methods of instruction processing.
Dedicated pipeline queues have been used in multi-pipeline execution units of processors in order to achieve faster processing speeds. In particular, dedicated queues have been used for execution units having multiple pipelines that are configured to execute different subsets of supported instructions. But dedicated queuing has generated various bottlenecks and problems for scheduling instructions that require both numeric manipulation and retrieval/storage of data.
Additionally, processors are conventionally designed to process instructions that are typically identified by operation codes (OpCodes). In the design of new processors, it is important to process all of a standard set of instructions so that existing computer programs based on standardized codes will operate without the need for translating instructions into an entirely new code base. Processor designs may further incorporate the ability to process new instructions, but backwards compatibility to older instruction sets is often desirable.
Execution of instructions is typically performed in an execution unit of a processor core. To increase processing speed, multi-core processors have been developed. Also to facilitate faster execution throughput, “pipeline” execution of instructions within an execution unit of a processor core is used. Cores having multiple execution units for multi-thread processing are also being developed. However, there is a continuing demand for faster throughput for processors.
One type of standardized set of instructions is the instruction set compatible with the x86 chips, e.g. 8086, 286, 386, etc. that have enjoyed widespread use in many personal computers. Instruction sets, such as the “x86” instruction set, include operations requiring numeric manipulation, operations requiring retrieval and/or storage of data, and operations that require both numeric manipulation and retrieval/storage of data. To execute such instructions, execution units within processor cores have included two types of pipelines: arithmetic logic pipelines (“EX pipelines”) to execute numeric manipulations, and address generation pipelines (“AG pipelines”) to facilitate load and store operations.
To quickly and efficiently process instructions as required by a particular computer program, the program commands are decoded into operations within the supported set of instructions and dispatched to the execution unit for processing. Conventionally, an OpCode is dispatched that specifies what operation is to be performed along with associated information that may include items such as an address of data to be used for the operation and operand designations.
Dispatched operations are conventionally queued for a multi-pipeline scheduler of an execution unit. Queuing is conventionally performed with some type of decoding of an instruction's OpCode in order for the scheduler to appropriately direct the operations for execution by the pipelines with which it is associated within the execution unit.
Some instructions require more than one cycle to complete. Examples include multiply (MUL) and divide (DIV) operations. Each multi-cycle instruction has a known latency, i.e. each operation requires a predetermined number of cycles to complete.
Two aspects of accounting for multi-cycle instructions are discussed. First, unlike single cycle instructions, which do not consume any resources outside of the cycle in which they are executed, multi-cycle instructions consume resources in successive clock cycles, making those resources unavailable until the multi-cycle operation is complete. Thus, each multi-cycle operation has a resource contention associated with it which prevents additional multi-cycle operations from being issued using the same resource until that resource is free again. The number of cycles that must pass before another such operation can be issued is called the repeat rate of the instruction.
Second, the results of multi-cycle instructions can only be distributed after the instruction is complete. Each multi-cycle instruction is given priority to distribute its result over a single cycle instruction. To avoid resource conflicts, counters have been used to track the latency of multi-cycle instructions. Since a multi-cycle instruction could be in any of the plurality of scheduler entries, each entry is required to have dedicated counter logic. When a multi-cycle instruction in a particular entry is picked, the counter associated with that entry will count the cycles as the instruction is being processed. When a predetermined threshold is reached, the counter will issue a flag to prevent an instruction from being picked which requires use of a result bus in the next cycle. The instruction distributes the result free of any resource conflicts.
One skilled in the art will recognize that the above noted issues increase chip area and power requirements for a scheduler block and decrease the processing efficiency of the execution unit, since most instructions are single cycle.
A method of processing multi-cycle instructions includes picking a multi-cycle instruction and directing the picked multi-cycle instruction to a pipeline. The method further includes detecting the repeat rate and latency of the picked multi-cycle instruction, and counting clock cycles based on the detected repeat rate and the latency of the picked multi-cycle instruction.
An apparatus for processing multi-cycle instructions includes a pipeline configured to process multi-cycle instructions. The pipeline includes a pipeline control configured to detect a latency and a repeat rate for each multi-cycle instruction and to count clock cycles based on the detected latency and the detected repeat rate. The apparatus further includes a scheduler queue configured to queue a plurality of instructions for pipeline processing, and a picker configured to pick a multi-cycle instruction from the scheduler queue and to direct the picked multi-cycle instruction to the pipeline for processing.
A computer-readable storage medium stores a set of instructions for execution by one or more processors to process multi-cycle instructions. The set of instructions includes a picking code segment for picking a multi-cycle instruction and a directing code segment for directing the picked multi-cycle instruction to a pipeline. The instructions further include a detecting code segment for detecting a repeat rate of the picked multi-cycle instruction and a counting code segment for counting clock cycles based on the detected repeat rate of the picked multi-cycle instruction. A detecting code segment for detecting a latency of the picked multi-cycle instruction and a counting code segment for counting clock cycles based on the detected latency of the picked multi-cycle instruction are also included.
A floating point unit 25 may be provided for execution of floating point instructions. The decoder unit 15 may dispatch instruction information packets over a common bus 18 to both the fixed point execution unit 20 and the floating point unit 25.
The execution unit 20 includes a mapper 30 associated with a scheduler queue 35 and a picker 40. These components control the selective distribution of operations among a plurality of arithmetic logic (EX) pipelines 451, 452 and address generation (AG) pipelines 501, 502 for pipeline execution. The pipelines execute operations queued in the scheduling queue 35 by the mapper 30 that are picked therefrom by the picker 40 and directed to an appropriate pipeline. In executing an instruction, the pipelines identify the specific kind of operation to be performed by a respective operation code (OpCode) assigned to that kind of instruction.
In the example illustrated in
In the example execution unit shown in
DIV and MUL operations generally require multiple clock cycles to execute. The complexity of both arithmetic pipelines EX0 451 and EX1 452 may be reduced by not requiring either to perform all possible arithmetic operations and dedicating multi-cycle arithmetic operations for execution by only one of the two arithmetic pipelines. This saves chip real estate while still permitting a substantial overlap in the sets of operations that can be executed by the respective arithmetic pipelines EX0 451, EX1 452.
The processing speed of the execution unit 20 may be affected by the operation of any of its components. Since all the instructions that are processed must be mapped by the mapper 30 into the scheduling queue 35, any delay in the mapping/queuing process may adversely affect the overall speed of the execution unit.
The scheduler queue 35 may be configured as a unified queue for queuing instructions for all execution pipelines within the execution unit 20.
In the example illustrated in
Depending upon the kind of operation, an instruction executed in one of the pipelines may require a single clock cycle to complete or multiple clock cycles to complete. For example, a simple add instruction can be performed by either arithmetic pipeline EX0 451 or EX1 452 in a single clock cycle. However, arithmetic pipeline EX0 451 requires multiple clock cycles to perform a division operation, and arithmetic pipeline EX1 452 requires multiple clock cycles to perform a multiplication operation.
As an example, any given type of multi-cycle arithmetic operation may be dedicated to only one of the arithmetic pipelines EX0 451, EX1 452, and most single cycle arithmetic operations are within the execution domains of both arithmetic pipelines EX0 451, EX1 452. In the x86 based instruction set, there are various multi-cycle arithmetic operations, namely multi-cycle Division (DIV) operations 60 that fall within the execution domain of the arithmetic pipeline EX0 451, and multi-cycle Multiplication (MUL) operations 70 and multi-cycle Branch (BRN) operations 75 that fall within the execution domain of the arithmetic pipeline EX1 452. Accordingly, in the example, the execution domains of the arithmetic pipelines EX0 451, EX1 452 substantially overlap with respect to single cycle arithmetic operations, but they are mutually exclusive with respect to multi-cycle arithmetic operations.
There are three kinds of operations requiring retrieval and/or storage of data, namely, load (LD), store (ST) and load/store (LD-ST). These operations are performed by the address generation pipelines AG0 501, AG1 502 in connection with a Load-Store (LS) unit 80 of the execution unit 20 in
Both LD and LD-ST operations generally are multi-cycle operations that typically require a minimum of 4 cycles to be completed by the address generation pipelines AG0, AG1. LD and LD-ST operations identify an address of data that may be loaded into one of the PRNs of the PRN sets 551, 552 associated with the pipelines. Time may be required for the LS unit 80 to retrieve the data at the identified address, before that data can be loaded in one of the PRNs. For LD-ST operations, the data that may be retrieved from an identified address may be processed and subsequently stored in the address from where it was retrieved.
ST operations typically require a single cycle to be completed by the address generation pipelines AG0 501, AG1 502. This may be because ST operations will identify where data from one of the PRNs of the PRN sets 551, 552 may be stored. Once that address is communicated to the LS unit 80, LS unit 80 may perform the actual storage so that the activity of the address generation pipeline AG0 501, AG1 502 may be complete after a single clock cycle.
In the example illustrated in
The mapper 30 may be configured to make a top to bottom scan and a bottom to top scan in parallel of the queue positions QP1 2301 to QPn 230n to identify a top-most open queue position and a bottom-most open queue position, one for each of the two instructions corresponding to two instruction information packets received in a given clock cycle.
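By way of illustration only, the two scans can be modeled in software as two independent searches over an occupancy vector. The following is a minimal sketch and not the hardware implementation; the function name `find_open_positions`, the boolean-list representation of the queue, and the handling of a queue with only one open position are assumptions introduced for this example.

```python
from typing import List, Optional, Tuple


def find_open_positions(occupied: List[bool]) -> Tuple[Optional[int], Optional[int]]:
    """Return (top-most open index, bottom-most open index) of a queue.

    Index 0 models the top queue position. In hardware the two scans may run
    in parallel; here they are simply two independent searches.
    """
    # Top to bottom scan: first position that is not occupied.
    top = next((i for i, used in enumerate(occupied) if not used), None)
    # Bottom to top scan: last position that is not occupied.
    bottom = next(
        (i for i in range(len(occupied) - 1, -1, -1) if not occupied[i]), None
    )
    if top is not None and top == bottom:
        # Assumption: with a single open position only one instruction is queued.
        bottom = None
    return top, bottom
```

For instance, with positions 1 and 3 open in a five-entry queue, the sketch returns position 1 for the first instruction and position 3 for the second.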
Where the OpType field data of a dispatched instruction information packet indicates a floating point (FP) OpType, the instruction corresponding to that instruction information packet may not be queued because it may only require execution by the floating point unit 25. Accordingly, even when two instruction information packets are received from the decoder 15 in one clock cycle, one or both instructions may not be queued in the scheduler queue 35 for this reason.
Each queue position QP1 2301 . . . QPn 230n may be associated with memory fields including an Address Generation instruction (AG Payload) 235; an Arithmetic/Logic instruction (ALU Payload) 240; four Wake Up Content Addressable Memories (CAMs) Source A 205, Source B 210, Source C 215, and Source D 220 which identify addresses of PRNs that contain source data for the instruction; and a destination RAM 225 which identifies a PRN where the data resulting from the execution of the instruction may be stored.
A separate data field (Immediate/Displacement data field 230) may be provided for accompanying data that an instruction may use. Such data may be sent by the decoder in the dispatched instruction information packet for that instruction. For example, a load operation LD may be indicated in queue position QP1 2301 that seeks to load data stored at an address 6F3D (indicated in the Immediate/Displacement data field 230) into the PRN identified as P5. In this example, the address 6F3D was data contained in the instruction information packet dispatched from the decoder 15, which information was transferred to the Immediate/Displacement data field 230 for queue position QP1 2301 in connection with queuing that instruction to queue position QP1 2301.
The address generation (AG) Payload field 235 and the arithmetic logic (ALU) payload field 240 are configured to contain the specific identity of an instruction as indicated by the instruction's OpCode along with relative address indications of the instruction's required sources and destinations that are derived from the corresponding dispatched instruction information packet. In connection with queuing, the mapper 30 translates relative source and destination addresses received in the instruction information packet into addresses of PRNs associated with the pipelines.
The mapper 30 tracks relative source and destination address data received in the instruction information packets so that it can assign the same PRN address to a respective source or destination where two instructions reference the same relative address. For example, P5 may be indicated as one of the source operands in an ADD instruction queued in queue position QP2 2302, and P5 may also be identified as the destination address of the result of the LD operation queued in queue position QP1 2301. This indicates that the dispatched instruction information packet for the LD instruction indicated the same relative address for the destination of the LD operation as the dispatched instruction information packet for the ADD instruction had indicated for one of the ADD source operands.
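The tracking described above may be pictured as a lookup table keyed by relative address. The sketch below is a deliberately simplified model: the class name `RelativeToPrnMap`, the relative-address labels `rel_a` and `rel_b`, and the free-list ordering are hypothetical, and the release or reallocation of PRNs is not modeled.

```python
class RelativeToPrnMap:
    """Toy model: the first reference to a relative address allocates a PRN,
    and later references to the same relative address reuse that PRN."""

    def __init__(self, free_prns):
        self._free = list(free_prns)  # PRNs still available (assumed order)
        self._table = {}              # relative address -> assigned PRN

    def lookup(self, relative_addr):
        if relative_addr not in self._table:
            self._table[relative_addr] = self._free.pop(0)
        return self._table[relative_addr]


mapper = RelativeToPrnMap(free_prns=["P5", "P1", "P7"])
ld_dest = mapper.lookup("rel_a")    # destination of the LD operation
add_src_a = mapper.lookup("rel_b")  # an unrelated source of the ADD
add_src_b = mapper.lookup("rel_a")  # ADD source naming the LD destination
```

In this sketch both `ld_dest` and `add_src_b` resolve to P5, mirroring the LD and ADD example in the text.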
In the scheduler queue 35, flags are provided to indicate eligibility for picking the instruction for execution in the respective pipelines as indicated in the columns respectively labeled EX0 2451, EX1 2452, AG0 2501, and AG1 2502. The execution unit picker 40 may include an individual picker for each of the four pipelines: EX0 picker 2451, EX1 picker 2452, AG0 picker 2501, and AG1 picker 2502. Each respective pipeline's picker scans the respective pipeline picker flags of the queue positions to find queued operations that are eligible for picking, i.e. are capable of being processed on at least one of the respective pipelines. Upon finding an eligible queued operation, the picker checks if the instruction is ready to be picked, i.e. the instruction does not have any other conflicts with other instructions in execution. If it is not ready, the picker resumes its scan for an eligible instruction that is ready to be picked. For example, EX0 picker 2451 and AG0 picker 2501 scan the flags from the top queue position QP1 2301 to the bottom queue position QPn 230n and the EX1 picker 2452 and AG1 picker 2502 scan the flags from the bottom queue position QPn 230n to the top queue position QP1 2301 during each cycle. A picker will stop its scan when it finds an eligible instruction that is ready and then direct that instruction to its respective pipeline for execution. This may occur in a single clock cycle.
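The scan performed by a single picker can be sketched as a search that stops at the first entry that is both eligible and ready. This is an illustrative software model only; `QueueEntry`, its two boolean fields, and the `pick` function are names assumed for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueueEntry:
    eligible: bool  # picker flag for this pipeline is set
    ready: bool     # sources are awake and no conflict blocks the pick


def pick(entries: List[QueueEntry], top_down: bool = True) -> Optional[int]:
    """Return the index of the first eligible and ready entry, or None.

    top_down=True models a picker scanning from the top queue position;
    top_down=False models a picker scanning from the bottom.
    """
    if top_down:
        order = range(len(entries))
    else:
        order = range(len(entries) - 1, -1, -1)
    for i in order:
        if entries[i].eligible and entries[i].ready:
            return i  # stop the scan at the first match
    return None
```

Scanning the same queue in opposite directions lets two pickers of the same kind tend toward different entries, which is why the example assigns top-down and bottom-up scans to different pickers.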
Readiness for picking may be indicated by the source wake up CAMs for the particular operation component being awake and indicating a ready state, meaning all source operands are present in the CAMs. Where there is no wake up CAM being utilized for a particular instruction component, the instruction is automatically ready for picking. For example, the LD operation queued in queue position QP1 2301 does not utilize any source CAMs so that it is automatically ready for picking by either of AG0 picker 2501 or AG1 picker 2502 upon queuing. In contrast, the ADD instruction queued in queue position QP2 2302 uses the queue position's wake up CAMs Source A 205 and Source B 210. In other words, the ADD instruction queued in queue position QP2 2302 may not be ready to be picked until the PRNs P1 and P5 have been indicated as ready by wake up CAMs Source A 205 and Source B 210.
If one of the arithmetic pipelines is performing a multi-cycle operation, the pipeline may provide its associated picker with an instruction to suspend picking operations until the arithmetic pipeline completes execution of that multi-cycle operation. In contrast, the address generation pipelines may be configured to commence execution of a new address generation instruction without awaiting the retrieval of load data for a prior instruction. Accordingly, the pickers will generally attempt to pick an address generation instruction for each of the address generation pipelines AG0 2501, AG1 2502 for each clock cycle when there are available address generation instructions that are indicated as ready to pick.
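The readiness rule can be expressed compactly: an entry is ready when every source PRN it uses has been indicated as ready, and an entry that uses no source CAM is ready at once. The sketch below assumes a hypothetical helper `is_ready` and models the wake up CAM state as a plain set of ready PRN tags.

```python
def is_ready(source_prns, ready_prns):
    """Return True when all source PRNs in use have been indicated as ready.

    An empty source list models an instruction that uses no wake up CAM,
    which is therefore ready for picking immediately.
    """
    return all(prn in ready_prns for prn in source_prns)


ld_ready = is_ready([], set())                      # LD uses no source CAMs
add_waiting = is_ready(["P1", "P5"], {"P1"})        # P5 not yet ready
add_ready = is_ready(["P1", "P5"], {"P1", "P5"})    # both sources ready
```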
The queue position's picker flags may be set in accordance with the pipeline indications in
In the single component instruction case, the pipeline designations indicate that the instruction may be either an arithmetic operation or an address generation operation through the eligible pipe indication. Where an arithmetic operation is indicated, the ALU payload field 240 of the queue position may be filled with the OpCode data to indicate the specific kind of operation and appropriately mapped PRN address information indicating sources and a destination. Where an address generation operation is indicated, the AG payload field 235 of the queue position may be filled with the OpCode data to indicate the specific kind of operation and appropriately mapped PRN address information indicating sources and a destination. In both cases, the wake up CAMs Source A 205, Source B 210, Source C 215, and Source D 220 can be supplied with the sources indicated in the payload data, and the destination RAM 225 can be supplied with the destination address indicated in the payload data.
As noted above, in conventional execution units, decoding of an instruction's OpCode may typically be performed in order to queue operations for execution on an appropriate pipeline. This OpCode decoding correspondingly consumes processing time and power. Unlike conventional execution units, the example mapper 30 does not perform OpCode decoding in connection with queuing operations into the scheduling queue 35.
To avoid the need for OpCode decoding by the mapper 30, the decoder 15 may be configured to provide a relatively small additional field in the instruction information packets that it dispatches. This additional field reflects a defined partitioning of the set of instructions into categories that directly relate to execution pipeline assignments. Through this partitioning, the OpCodes are categorized into groups of operation types (OpTypes).
The partitioning may be such that there are at most half as many OpTypes as there are OpCodes. As a result, an OpType can be uniquely defined through the use of at least one fewer binary bit than may be required to uniquely define the OpCodes.
Configuring the mapper 30 to conduct mapping/queuing based on OpType data instead of OpCode data enables the mapper 30 to perform at a higher speed, since there may be at least one less bit to decode in the mapping/queuing process. Accordingly, the decoder 15 may be configured to dispatch instruction information packets that include a low overhead, i.e. relatively small, OpType field in addition to a larger OpCode field. The mapper 30 may then be able to utilize the data in the OpType field, instead of the OpCode data, for queuing the dispatched operations. The OpCode data may be passed via the scheduler to the pipelines for use in connection with executing the respective instruction, but the mapper does not need to do any decoding of the OpCode data for the mapping/queuing process.
In the example discussed below where support may be provided for an “x86” based instruction set, the mapper 30 only needs to process a 4-bit OpType, instead of an 8-bit OpCode in the mapping/queuing process. This translates into an increase in the speed of the mapping/queuing process. The mapping/queuing process may be part of a timing path of the execution unit 20 since all instructions to be executed must be queued. Thus an increase in the speed of the mapping/queuing process in turn permits the execution unit 20 as a whole to operate at an increased speed.
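The field widths follow from simple counting: a field of b bits distinguishes up to 2^b values, so halving the number of categories saves one bit. The following sketch only illustrates that arithmetic; `bits_needed` is a helper name assumed for this example.

```python
def bits_needed(count: int) -> int:
    """Minimum number of binary bits that uniquely encode `count` values."""
    if count < 1:
        raise ValueError("count must be positive")
    # (count - 1).bit_length() equals ceil(log2(count)) for count >= 2.
    return max(1, (count - 1).bit_length())


opcode_bits = bits_needed(256)  # up to 256 OpCodes fit an 8-bit field
optype_bits = bits_needed(16)   # up to 16 OpTypes fit a 4-bit field
half_bits = bits_needed(128)    # half as many values needs one fewer bit
```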
As noted above, in an example embodiment, the processing core 10 may be configured to support an instruction set compatible with the “x86” chips. This requires support for about 190 standardized “x86” instructions. As illustrated in
The x86 based instruction set may be partitioned into a plurality of OpTypes. These OpTypes are uniquely identified by a four digit binary number as shown. As graphically illustrated in
The use of an OpType field 320 also provides flexibility for future expansion of the set of instructions that are supported without impairing the mapping/queuing process. Where more than 256 instructions are to be supported, the size of the OpCode field 310 would necessarily increase beyond 8 bits. However, as long as the OpCodes can all be categorized into 16 or fewer OpTypes, a 4-bit OpType field 320 can be used.
A two bit load/store type (LD/ST Type) 330 may be provided in the instruction information packets dispatched from the decoder 15 to indicate whether the instruction has a LD, ST or LD-ST component or no component requiring retrieval/storage of data. A 2-bit identification of these characteristics may be reflected in the LD/ST Type column in
In an example embodiment there may be three OpTypes executable on only one of two arithmetic pipelines EX0 451 or EX1 452. DIV operations execute exclusively on EX0 451, while MUL and BRN operations execute exclusively on EX1 452. Such arithmetic multi-cycle operations have known latency, meaning the number of cycles each operation requires may be known in advance of execution by the EX0 451 or EX1 452.
Referring to
Referring to
Latency and Repeat Rate are determined from the instruction packet, including the OpCode, OpType and Operand data size. Exemplary data sizes are 8, 16, 32, and 64 bits. For example, for 8-bit and 16-bit MULs the latency may be 4 cycles and repeat rate may be 2 cycles. For 32-bit data size, latency may be 5 cycles and repeat rate may be 2 cycles. For 64-bit data size, latency may be 7 cycles and repeat rate may be 4 cycles. For DIVs, latency may be determined by one of the Operand Values (e.g., iteration count) and repeat rate may be fixed as latency minus 2 cycles. Variable latency for DIVs is due to the variable nature of the iteration count. This iteration count may be predetermined by software.
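The example timings above can be collected into a small lookup. This is a sketch using only the example values given in the text; the names `MUL_TIMING`, `mul_timing` and `div_timing` are assumptions, and the way DIV latency is derived from the iteration count is not specified here, so the sketch takes the latency as an input.

```python
# Example MUL timings by operand size in bits: (latency, repeat rate) in cycles.
MUL_TIMING = {8: (4, 2), 16: (4, 2), 32: (5, 2), 64: (7, 4)}


def mul_timing(operand_bits):
    """Look up (latency, repeat_rate) for a MUL of the given operand size."""
    return MUL_TIMING[operand_bits]


def div_timing(latency):
    """Return (latency, repeat_rate) for a DIV whose latency is already known.

    Only the stated relation repeat rate = latency - 2 is applied; deriving
    the latency from the iteration count is outside this sketch.
    """
    return latency, latency - 2
```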
If the instruction contains a multi-cycle operation, for example a MUL operation, pipeline control 528 determines the length of the clock cycle latency for the multi-cycle operation, i.e. the number of cycles the operation requires to execute. Pipeline control 528 also determines the repeat rate for the operation, i.e., the number of cycles that must pass before the same type of multi-cycle operation may be picked again. Pipeline control 528 then sends an indication back to picker 522 along data path 544 that prevents any additional multi-cycle instructions from being picked for the duration of the repeat rate count.
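The repeat rate count may be modeled as a down-counter that is loaded when a multi-cycle operation is picked and that blocks further multi-cycle picks while it is non-zero. The class below is a minimal sketch under that assumption; `RepeatRateCounter` and its members are hypothetical names, and the exact cycle alignment in hardware may differ.

```python
class RepeatRateCounter:
    """Blocks multi-cycle picks for `repeat_rate` cycles after a pick."""

    def __init__(self):
        self._remaining = 0

    def start(self, repeat_rate):
        """Call in the cycle a multi-cycle operation is picked."""
        self._remaining = repeat_rate

    def tick(self):
        """Advance one clock cycle."""
        if self._remaining > 0:
            self._remaining -= 1

    @property
    def block_multi_cycle(self):
        """True while another multi-cycle operation must not be picked."""
        return self._remaining > 0


counter = RepeatRateCounter()
counter.start(repeat_rate=2)           # pick cycle: further picks blocked
blocked = [counter.block_multi_cycle]
for _ in range(2):
    counter.tick()
    blocked.append(counter.block_multi_cycle)
```

With a repeat rate of 2, the sketch blocks multi-cycle picks in the pick cycle and the following cycle, and allows one again two cycles after the pick.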
When a multi-cycle operation is executed, scheduler queue 520 and picker 522 no longer track the instruction. Thus, in another aspect of the embodiment, an alternate method of waking up an operation dependent on the multi-cycle operation may be provided. As illustrated in
Pipeline control 528 also regulates the use of result bus 529 by sending an instruction to picker 522 that prevents single cycle operations from being picked in the same cycle when the multi-cycle operation is completed.
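The regulation of the result bus can likewise be pictured as a latency countdown that raises a suppress flag in the cycle the multi-cycle operation completes. The sketch below is illustrative only: `ResultBusGuard` is a hypothetical name, a single in-flight multi-cycle operation producing one result is assumed, and the threshold at which the flag is raised is an assumption rather than a statement of the hardware timing.

```python
class ResultBusGuard:
    """Counts down the latency of one in-flight multi-cycle operation."""

    def __init__(self):
        self._cycles_left = None  # None means no multi-cycle operation in flight

    def start(self, latency):
        """Call in the cycle the multi-cycle operation is picked."""
        self._cycles_left = latency

    def tick(self):
        """Advance one clock cycle."""
        if self._cycles_left is not None:
            self._cycles_left -= 1
            if self._cycles_left < 0:
                self._cycles_left = None  # operation has left the pipeline

    @property
    def suppress_single_cycle(self):
        """True in the cycle the multi-cycle operation completes (assumed)."""
        return self._cycles_left == 0


guard = ResultBusGuard()
guard.start(latency=3)
flags = [guard.suppress_single_cycle]
for _ in range(4):
    guard.tick()
    flags.append(guard.suppress_single_cycle)
```

In this model the flag is raised only in the third cycle after the pick, so single cycle operations remain pickable in every other cycle.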
An example embodiment of a pipeline control regulation process is illustrated in
MUL P15:P11, P2, P33—Multiply register contents in Operands P2 and P33 and write the results in P15 and P11. Multiply operations can result in double width results requiring two register entries to store.
ADD P34, P15, Imm32—Add Operand P15 with Immediate data (32 bits) and write result in P34.
MOV P22, P11—Move contents of P11 into P22.
In accordance with the example embodiment (see
Pipeline control 528 also issues a “Suppress 1 cycle EX1” indication 606 that causes single cycle operations to be ineligible for picking and prevents the picker from scheduling a single cycle operation on pipeline 525 in cycles 4 and 5, which is shown as 1-Cycle Suppr. EX1 608 in
One skilled in the art will notice that the use of the alternate broadcast data path 546 (
When the MUL operation 601 is complete, there are two results that get distributed. In cycle 6, when there is no dependent operation picked for execution in the next cycle, the first result P15 may be written back into the physical register for use by a future picked instruction. When a dependent instruction is picked as a result of the alternate tag broadcast discussed above, the result P15 may be bypassed directly to the pipeline executing the dependent instruction. In this example, it is sent to EX0 451, which executes the ADD instruction 602, through bypass path 610 in cycle 7. Note that the alternate broadcast of P15 in cycle 5 awakens the dependent ADD instruction such that the result of the MUL operation, issued over the result bus RES1 and written into P15 in cycle 6, is bypassed directly to execute the ADD instruction in cycle 7. In the same way, since the MOV instruction 603 may be picked by the alternate tag broadcast, the result P11 may be bypassed to AG0 501 for the execution of the MOV instruction 603 through bypass path 620. One skilled in the art will notice this same process may be used regardless of whether the dependent operations are awakened by the alternate tag broadcast during multi-cycle instruction processing or by the normal wake up during single cycle instruction processing. Bypass logic may receive Destination and source PRN Tags from ALU Payload 526 and pipeline control 528. However, the bypass logic that attempts to match Destinations to sources of Operations picked in any particular cycle may be separate from pipeline control 528.
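The tag matching performed by the bypass logic can be sketched as a comparison of completing destination PRN tags against the source PRN tags of picked operations. The function name `bypass_matches`, the dictionary representation, and the SUB operation are hypothetical and appear only to illustrate the matching; for simplicity the ADD and MOV are shown as picked together, which need not reflect the cycle-by-cycle timing of the example.

```python
def bypass_matches(completing_dests, picked_ops):
    """Map each picked operation to the sources that can be bypassed.

    completing_dests: destination PRN tags of results being distributed.
    picked_ops: {operation name: list of source PRN tags}.
    Operations with no matching source are omitted from the result.
    """
    dests = set(completing_dests)
    matches = {}
    for name, sources in picked_ops.items():
        hit = [src for src in sources if src in dests]
        if hit:
            matches[name] = hit
    return matches


result = bypass_matches(
    completing_dests=["P15", "P11"],  # the two MUL results
    picked_ops={"ADD": ["P15"], "MOV": ["P11"], "SUB": ["P3"]},
)
```

Here the ADD and MOV each find a matching destination and would receive their operand over a bypass path, while the hypothetical SUB reads its source from the physical register as usual.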
The processor 702 may include a central processing unit (CPU), a graphics processing unit (GPU), a CPU and GPU located on the same die, or one or more processor cores, wherein each processor core may be a CPU or a GPU. The memory 704 may be located on the same die as the processor 702, or may be located separately from the processor 702. The memory 704 may include a volatile or non-volatile memory, for example, random access memory (RAM), dynamic RAM, or a cache.
The storage 706 may include a fixed or removable storage, for example, hard disk drive, solid state drive, optical disk, or flash drive. The input devices 708 may include a keyboard, a keypad, a touch screen, a touch pad, a detector, a microphone, an accelerometer, a gyroscope, a biometric scanner, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals). The output devices 710 may include a display, a speaker, a printer, a haptic feedback device, one or more lights, an antenna, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).
The input driver 752 communicates with the processor 702 and the input devices 708, and permits the processor 702 to receive input from the input devices 708. The output driver 754 communicates with the processor 702 and the output devices 710, and permits the processor 702 to send output to the output devices 710.
Although features and elements are described above in particular combinations, each feature or element can be used alone without the other features and elements or in various combinations with or without other features and elements. The methods or flow charts provided herein may be implemented in a computer program, software, or firmware incorporated in a computer-readable storage medium for execution by a general purpose computer or a processor. Examples of computer-readable storage mediums include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs).
Suitable processors may be any one of a variety of processors such as a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU). For instance, they may be x86 processors that implement an x86 64-bit instruction set architecture and are used in desktops, laptops, servers, and superscalar computers, or they may be Advanced RISC (Reduced Instruction Set Computer) Machines (ARM) processors that are used in mobile phones or digital media players. Other embodiments of the processors are contemplated, such as Digital Signal Processors (DSP) that are particularly useful in the processing and implementation of algorithms related to digital signals, such as voice data and communication signals, and microcontrollers that are useful in consumer applications, such as printers and copy machines. Other embodiments may include, by way of example, a plurality of processors, one or more processors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine.
Typically, a processor receives instructions and data from a read-only memory (ROM), a random access memory (RAM), and/or a storage device. Storage devices suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks and DVDs. In addition, while the illustrative embodiments may be implemented in computer software, the functions within the illustrative embodiments may alternatively be embodied in part or in whole using hardware components such as ASICs, FPGAs, or other hardware, or in some combination of hardware components and software components.
Embodiments of the invention may be represented as instructions and data stored on a computer readable memory. For example, aspects of the invention may be included in a hardware description language (HDL) code stored on such computer readable media. Such instructions, when processed may generate other intermediary data (e.g., netlists, GDS data, or the like) that can be used to create mask works that are adapted to configure a manufacturing process (e.g., a semiconductor fabrication facility). Once configured, such a manufacturing process is thereby adapted to manufacture processors or other semiconductor devices that embody aspects of the present invention.
While specific embodiments of the present invention have been shown and described, many modifications and variations could be made by one skilled in the art without departing from the scope of the invention. The above description serves to illustrate and not limit the particular invention in any way.