The disclosure generally relates to a data processing system, and more specifically, to configuring the data processing system to handle data dependencies in an out-of-order environment.
In an instruction pipeline of a data processing system, an instruction is decoded and issued in order to a functional unit to perform an operation designated by the op-code of the instruction. In some cases, source operand data designated by the instruction is not ready, where the source operand data may be result data of the same or another functional unit, or data to be loaded from a cache or memory. Instructions with data dependency go to an execution queue or a reservation station, to be sent to a functional unit at a later time for execution. The mechanisms to issue instructions from the queue or the reservation station are either complex, large, and power hungry, or not optimal for performance and limited by the queue size.
Aspects of the present disclosure are best understood from the following detailed description when read with the accompanying figures. It is noted that, in accordance with the standard practice in the industry, various features are not drawn to scale. In fact, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion.
The following disclosure provides many different embodiments, or examples, for implementing different features of the present disclosure. Specific examples of components and arrangements are described below to simplify the present disclosure. These are, of course, merely examples and are not intended to be limiting. For example, the formation of a first feature over or on a second feature in the description that follows may include embodiments in which the first and second features are formed in direct contact, and may also include embodiments in which additional features may be formed between the first and second features, such that the first and second features may not be in direct contact. In addition, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
To avoid stalling of the instruction pipeline due to data dependency, an execution queue with priority scheduler logic is placed between an instruction decode/issue unit and a functional unit. The execution queue with priority scheduler logic prioritizes the instruction issued by the instruction decode/issue unit without data dependency and places only issued instructions with data dependency into the execution queue. In some embodiments of the disclosure, the execution queue selects between the issued instruction and instructions from the entries of the execution queue (e.g., the first 2 entries in the execution queue), with the highest priority given to the issued instruction if it has no data dependency. This priority scheme can achieve performance similar to that of, for example, a reservation station, which is far more complex and power hungry. In the reservation station, all instructions in all entries actively check for data dependency, and priority is given to the oldest instruction. With the execution queue and the priority scheduler logic in this disclosure, the data processing system is much simpler, smaller, and consumes less power, yet achieves the same performance as a fully out-of-order method. The reason for the performance advantage of the disclosed priority scheme is that putting instructions without data dependency into the queue creates another data-dependency chain, especially when the instruction is part of the loop branch instructions. For example, a loop count instruction that counts down the iterations of a loop is often without data dependency, and putting the loop count instruction into the execution queue will cause the next loop iteration to stall.
The processor 100 includes a central processing unit (CPU) 110, a memory management unit (MMU) 150, and a cache 170. The CPU 110 is a processor for implementing data processing operations. Each of the CPU 110, the MMU 150, and the cache 170 is coupled to the internal bus 101 via a respective input/output (I/O) port or terminal, and they communicate therebetween. The processor 100 functions to implement a variety of data processing functions by executing a plurality of data processing instructions. The cache 170 is a temporary data store for frequently used information that is needed by the CPU 110. For example, the cache 170 may be an instruction cache, a data cache, a level-two cache, etc. Information needed by the CPU 110 that is not within the cache 170 is stored in the memory 13. The processor 100 may include a branch prediction unit (not shown), a co-processor (not shown), and other enhancements that are not relevant to the disclosure.
The MMU 150 controls the interaction of information between the CPU 110 and the cache 170 and the memory 13. The MMU 150 also includes an instruction translation lookaside buffer (iTLB), a data translation lookaside buffer, a level-2 translation lookaside buffer, etc. A TLB may store recent translations of virtual addresses to physical addresses, which may be used for quick virtual address lookup. A virtual address is an address that is used by the CPU 110 and by code that is executed by the CPU 110. The physical address is used to access the cache 170 and various higher-level memory such as the memory 13 (e.g., RAM).
The bus interface unit 190 is only one of several interface units between the processor 100 and the system bus 11. The bus interface unit 190 functions to coordinate the flow of information related to instruction execution by the CPU 110.
The instruction fetch unit 111 is configured to identify and implement the fetching of instructions, including the fetching of groups of instructions. Instructions are fetched by the instruction fetch unit 111 (either individually or in groups of two or more at a time) from the cache 170 or the memory 13, and each fetched instruction may be placed in an instruction buffer. The instruction decode unit 113 is configured to perform instruction decoding to determine the type of the operation (op-code), the source register(s), and the destination register(s). For example, a sample instruction may be “add C, A, B”, which designates an integer add operation that adds the content of source register A (source operand data in register A) to the content of source register B (source operand data in register B), and then places the result data in the destination register C. Depending on the type of the operation designated by the instruction (op-code), the instruction decode unit 113 issues the instruction to the appropriate functional unit 116 via the execution queue 115, or, bypassing the execution queue 115, directly to the priority scheduler 118.
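Purely as an illustration of the decode step (the disclosure describes hardware, not software), the fields extracted by the instruction decode unit 113 can be modeled with a short Python sketch; the function name and the textual instruction format below are hypothetical conveniences and are not part of the disclosed embodiments.

    def decode(text):
        # Split a textual instruction such as "add C, A, B" into the
        # op-code, the destination register, and the source registers,
        # mirroring the fields the instruction decode unit extracts.
        opcode, operands = text.split(None, 1)
        regs = [r.strip() for r in operands.split(",")]
        return {"opcode": opcode, "dest": regs[0], "srcs": regs[1:]}

    # Example: decode("add C, A, B")
    # -> {"opcode": "add", "dest": "C", "srcs": ["A", "B"]}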
As described above, the performance of the data processing system is reduced by long-latency instructions such as load instructions, where subsequent dependent instructions may be stalled in the execution queue 115 due to data dependency. Data dependency refers to a situation where a source register is the same as the destination register of a previous instruction that has not yet completed. For example, a previously issued instruction has not written back the result data to the register which is to be accessed by the instruction that is currently being decoded and to be issued. Such a situation may be referred to as a read-after-write (RAW) dependency. In some cases, data dependency may arise from a write-after-write (WAW) or write-after-read (WAR) dependency, where the previous instruction must write back to or read from the register file, respectively, before the subsequent instruction can write to the register file. The description focuses on the RAW dependency, but the issued instruction can be stalled in the execution queue 115 due to the other types of data dependency. In the embodiments, the instruction issue unit 114 further includes the execution queue 115 and the priority scheduler 118. The execution queue 115 may be a buffer configured to have a plurality of entries for storing a plurality of instructions to be issued. The priority scheduler 118 may include a combination of logic circuits. The priority scheduler 118 is configured to determine whether the source operand data designated by the issue instruction is ready or not, and then send the issue instruction with the highest priority to the functional unit 116. In the embodiments, an issued instruction having all of its source operand data ready (also referred to as “operand data ready”) has the highest priority in the priority scheduler 118. Operand data ready refers to, for example, a situation in which the operand data of the instruction is in the source register designated by the instruction, or the operand data can be forwarded from the functional unit designated by the instruction or from other functional units.
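The three dependency types can be illustrated with another small Python sketch, again purely as a behavioral model; the dictionary layout follows the hypothetical decode sketch above and is not part of the disclosed hardware.

    def hazards(prev, cur):
        # prev and cur are decoded instructions, e.g. {"dest": "C",
        # "srcs": ["A", "B"]} for "add C, A, B"; prev is still in flight.
        found = set()
        if prev["dest"] in cur["srcs"]:
            found.add("RAW")  # a source must wait for the previous write-back
        if prev["dest"] == cur["dest"]:
            found.add("WAW")  # writes to the same register must stay ordered
        if cur["dest"] in prev["srcs"]:
            found.add("WAR")  # the previous read must precede the new write
        return found

    # Example: "add C, A, B" followed by "sub D, C, E" yields {"RAW"},
    # because register C is written by the first instruction and read by
    # the second before the result data is written back.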
If there is a data dependency, the instruction decode unit 113 puts the issued instruction in the execution queue 115, where the instruction waits until all of its source operand data is ready. If there is no data dependency, the instruction decode unit 113 issues the instruction to the priority scheduler 118, which sends the instruction to the functional unit 116. In the embodiments, the execution queue 115 can select and schedule one valid instruction in the queue with operand data ready for issuing to the functional unit 116. The priority scheduler 118 selects between the instruction from the execution queue 115 and an issued instruction from the instruction decode unit 113.
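A minimal behavioral sketch of this routing and selection is given below, assuming a simple FIFO execution queue and an operand-readiness predicate supplied by the dependency-checking logic; all names are hypothetical, and the sketch selects at most one instruction per clock cycle.

    from collections import deque

    def schedule(issued, operands_ready, queue):
        # Highest priority: a newly issued instruction with no data
        # dependency goes straight to the functional unit.
        if issued is not None:
            if operands_ready(issued):
                return issued
            queue.append(issued)  # dependent instruction waits in the queue
        # Otherwise the execution queue supplies one valid instruction
        # whose source operand data is ready (head of the FIFO here).
        if queue and operands_ready(queue[0]):
            return queue.popleft()
        return None  # nothing can be sent to the functional unit this cycle

In this sketch only the head of the queue is examined; the enhancement in which either of the first two entries may issue is sketched further below.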
The functional unit 116 may include a number of functional units including, but not limited to, an arithmetic logic unit (ALU), a shifter, an address generation unit (AGU), a floating-point unit (FPU), a load-store unit (LSU), and a branch execution unit (BEU). In some embodiments, a reservation station (not shown) may be coupled to the functional unit 116 to receive any ready instruction for out-of-order execution. The reservation station may receive information from a scoreboard or a register that indicates that the operand data is ready.
Although
In some embodiments, the execution queue 115 can be a first-in-first-out (FIFO) queue where only the first instruction can be issued to the functional unit 116. In other embodiments, the execution queue 115 can be a reservation station. The reservation station is designed to issue any instruction in the execution queue 115 as long as its source operand data is ready. The reservation station has higher performance than the FIFO queue, but at a cost in complexity, area, and power. For example, if the execution queue has 8 entries and each entry has 3 source operands, then the reservation station must actively monitor 24 source operands for readiness. In addition, the reservation station must keep the source operand data, which requires 24 sets of registers. In yet other embodiments, the FIFO execution queue 115 can be enhanced by allowing either of the first two entries to be issued from the execution queue, as sketched below. Coupled with the priority scheduler 118 giving the highest priority to the issued instruction, the performance of the FIFO execution queue can match that of the reservation station.
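To make the contrast concrete, the two selection policies can be sketched as follows (hypothetical names, behavioral model only): the reservation station scans every entry for readiness, while the enhanced FIFO examines only its first two entries.

    def pick_from_reservation_station(entries, operands_ready):
        # Fully out-of-order: any ready entry may issue, oldest first.
        # Every entry is checked every cycle, which is what makes the
        # hardware complex, large, and power hungry.
        for i, inst in enumerate(entries):
            if operands_ready(inst):
                return entries.pop(i)
        return None

    def pick_from_enhanced_fifo(entries, operands_ready):
        # Enhanced FIFO: only the first two entries are candidates, so
        # only two readiness checks are needed per cycle.
        for i in range(min(2, len(entries))):
            if operands_ready(entries[i]):
                return entries.pop(i)
        return None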
With reference to
In
The priority-select logic 1180 selects the instruction from the instruction decode unit 113 or the execution queue 115 before accessing the register file 117. In other embodiments, due to timing paths, the priority scheduler 118 may be in a different clock cycle than the cycle of accessing the register file 117 and the result data bus 1164 for source operand data. The multiplexers 1162A and 1162B may select between source operand data from the register file 117 and the result data bus 1164.
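The selection performed by such multiplexers can be modeled with a short sketch (hypothetical names; the result data bus is modeled as a mapping from destination register to the value currently being forwarded).

    def select_operand(reg, register_file, result_bus):
        # Forwarding: if a functional unit has just produced the value
        # for this register, take it from the result data bus instead
        # of the register file, avoiding a wait for the write-back.
        if reg in result_bus:
            return result_bus[reg]
        return register_file[reg]

    # Example: with register_file = {"A": 5} and result_bus = {"A": 7}
    # (an in-flight result for A), select_operand("A", ...) returns 7.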
In the disclosure, the priority-select logic 1180 gives the instruction from the instruction decode unit 113 the highest priority if the operand check logic 1182 indicates that the source operands are ready. The “source-operand ready” instruction from the instruction decode unit 113 may be part of a new stream of instructions and should be executed immediately, so that subsequent instructions are not blocked. In the disclosure, the execution queue 115 may be a FIFO queue, which is much simpler to implement, with a smaller area and less power dissipation, in comparison to a fully out-of-order queue such as the reservation station, where any entry in the execution queue 115 can be selected for issuing, with priority given to the oldest entry by the priority-select logic.
With reference to
The operand check logics 1182, 1184, 1186 may use any method of handling data dependency, such as a register scoreboard, register renaming, a re-order buffer, etc. The data dependency checking logic includes fetching source operand data from the register file 117, the result data bus 1164, or temporary data storage such as a future file (not shown), a re-order buffer (not shown), or a large physical register file (not shown), which is a combination of architectural and renamed registers.
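Of the listed methods, the register scoreboard is the simplest to sketch. The minimal model below (hypothetical names, following the decoded-instruction layout used in the earlier sketches) sets a busy bit on issue and clears it on write-back; an instruction is operand-data ready when none of its source registers are busy.

    class Scoreboard:
        # One busy bit per architectural register.
        def __init__(self):
            self.busy = set()

        def issue(self, inst):
            # Mark the destination busy until its result is written back.
            self.busy.add(inst["dest"])

        def write_back(self, inst):
            self.busy.discard(inst["dest"])

        def operands_ready(self, inst):
            # Ready only if no source register awaits an in-flight write.
            return not any(r in self.busy for r in inst["srcs"])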
As described in the embodiments of
In accordance with one of the embodiments of the disclosure, a microprocessor is provided. The microprocessor includes a register file having a plurality of registers, an instruction decode unit, a functional unit, an execution queue having a plurality of entries and coupled between the functional unit and the instruction decode unit, and a priority scheduler coupled between the functional unit, the instruction decode unit, and the execution queue. The instruction decode unit decodes an instruction for at least one source operand and issues the instruction to the priority scheduler or the execution queue. The functional unit receives the issue instruction and performs an operation designated by the issue instruction. In the execution queue, each entry stores a queued instruction originated from the instruction decode unit in which at least one source operand of the queued instruction has a data dependency at the clock cycle when the queued instruction was to be issued. In addition, the priority scheduler prioritizes one of the issued instruction and the queued instruction based on the availability of the operand data corresponding to the issued instruction and the queued instruction, and then issues one of the issued instruction and the queued instruction to the functional unit as the issue instruction based on the respective priority assigned to the issued instruction and the queued instruction.
In accordance with one of the embodiments of the disclosure, a method for issuing an issue instruction to a functional unit for execution with priority scheduling is provided. The method comprises the following steps. An issued instruction is received from an instruction decode unit, and a queued instruction is received from an execution queue. One of the issued instruction and the queued instruction is prioritized based on availability of operand data corresponding to the issued instruction and the queued instruction. Then, one of the issued instruction or the queued instruction is issued to the functional unit as the issue instruction based on the respective priority assigned to the issued instruction and the queued instruction.
In accordance with one of the embodiments of the disclosure, a data processing system is provided. The data processing system includes a microprocessor, a main memory coupled to the microprocessor, a bus bridge coupled to the microprocessor, and an input/output device coupled to the bus bridge. The microprocessor includes a register file having a plurality of registers, an instruction decode unit, a functional unit, an execution queue having a plurality of entries and coupled between the functional unit and the instruction decode unit, and a priority scheduler coupled between the functional unit, the instruction decode unit, and the execution queue. The instruction decode unit decodes an instruction for at least one source operand and dispatches the instruction to the priority scheduler or the execution queue. The functional unit receives the issue instruction and performs an operation designated by the issue instruction. In the execution queue, each entry stores a queued instruction originated from the instruction decode unit in which at least one source operand of the queued instruction has a data dependency at the clock cycle when the queued instruction was to be issued. In addition, the priority scheduler includes a first operand check logic coupled to the instruction decode unit, a second operand check logic coupled to the execution queue, and a priority select logic coupled to the first and second operand check logics, respectively. The priority select logic is configured to prioritize the instruction directly received from the instruction decode unit through the first operand check logic or a queued instruction received from the execution queue through the second operand check logic, where the instruction sent directly from the instruction decode unit with the corresponding operand data available has priority over the queued instruction. The priority select logic issues one of the instruction directly from the instruction decode unit or the queued instruction to the functional unit as the issue instruction based on the respective priorities of the instruction directly from the instruction decode unit and the queued instruction.
The foregoing has outlined features of several embodiments so that those skilled in the art may better understand the detailed description that follows. Those skilled in the art should appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions and alterations herein without departing from the spirit and scope of the present disclosure.